File Formats
casanovoutils reads and writes three file formats: MGF, mzML (read only), and mzTab. This page describes the relevant fields and conventions expected by each tool.
MGF (Mascot Generic Format)
MGF is a plain-text format for tandem mass spectrometry data. Each spectrum is
delimited by BEGIN IONS / END IONS blocks.
Example
BEGIN IONS
TITLE=spectrum_001
PEPMASS=612.3456
CHARGE=2+
SEQ=PEPTIDEK
100.0 1234.5
200.0 5678.9
...
END IONS
Fields used by casanovoutils
| Field | Description |
|---|---|
| `SEQ` | Ground-truth peptide sequence. Required for ground-truth evaluation. |
| `PEPMASS` | Precursor m/z (and optionally intensity). |
| `CHARGE` | Precursor charge state. |
| `TITLE` | Spectrum identifier (optional; used by some tools for logging). |
The m/z and intensity peak list follows the header fields, one peak per line, space-separated.
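The block structure described above can be sketched in plain Python. This is a minimal illustration, not the casanovoutils implementation: the function name `iter_mgf_spectra` is hypothetical, and the sketch assumes every non-header line inside a block is a well-formed `m/z intensity` pair.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Spectrum = Tuple[Dict[str, str], List[Tuple[float, float]]]


def iter_mgf_spectra(lines: Iterable[str]) -> Iterator[Spectrum]:
    """Yield (headers, peaks) for each BEGIN IONS / END IONS block."""
    headers: Dict[str, str] = {}
    peaks: List[Tuple[float, float]] = []
    in_block = False
    for raw in lines:
        line = raw.strip()
        if line == "BEGIN IONS":
            headers, peaks, in_block = {}, [], True
        elif line == "END IONS":
            if in_block:
                yield headers, peaks
            in_block = False
        elif in_block and line:
            if "=" in line:
                # Header field such as TITLE=..., PEPMASS=..., CHARGE=..., SEQ=...
                key, _, value = line.partition("=")
                headers[key] = value
            else:
                # Peak line: m/z and intensity separated by whitespace
                mz, intensity = line.split()[:2]
                peaks.append((float(mz), float(intensity)))


example = """BEGIN IONS
TITLE=spectrum_001
PEPMASS=612.3456
CHARGE=2+
SEQ=PEPTIDEK
100.0 1234.5
200.0 5678.9
END IONS
"""

spectra = list(iter_mgf_spectra(example.splitlines()))
```

Header values are kept as strings here; converting `PEPMASS` or `CHARGE` to numbers is left to the caller.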
Notes
- For ground-truth evaluation, casanovoutils reads `SEQ=` entries from MGF files directly, in order of appearance. Other MGF operations use Pyteomics.
- Spectrum indices used in the mzTab `spectra_ref` column (see below) are zero-based positions within the MGF file.
- The `SEQ=` field is required for ground-truth evaluation. Casanovo writes this field when it generates annotated MGF output.
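To illustrate how order of appearance maps to zero-based spectrum indices, here is a hedged sketch. The function name `read_ground_truth` is hypothetical, and recording `None` for a block without `SEQ=` is an assumption made for this example, not documented casanovoutils behavior.

```python
def read_ground_truth(lines):
    """Return SEQ= values in order of appearance, one entry per spectrum.

    The list position of each entry is the zero-based spectrum index.
    """
    sequences = []
    current = None
    inside = False
    for raw in lines:
        line = raw.strip()
        if line == "BEGIN IONS":
            inside, current = True, None
        elif line == "END IONS" and inside:
            sequences.append(current)  # None when the block has no SEQ= (assumption)
            inside = False
        elif inside and line.startswith("SEQ="):
            current = line[len("SEQ="):]
    return sequences


mgf_text = """BEGIN IONS
TITLE=spectrum_001
SEQ=PEPTIDEK
100.0 1234.5
END IONS
BEGIN IONS
TITLE=spectrum_002
200.0 5678.9
END IONS
"""

ground_truth = read_ground_truth(mgf_text.splitlines())
```

With this layout, `ground_truth[0]` is the sequence for the spectrum that an mzTab row would reference as `index=0`.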
mzML
mzML is the PSI standard XML format for raw and processed mass spectrometry data. casanovoutils reads mzML files via Pyteomics and writes output as MGF.
Fields read from mzML
| Field | MGF output key | Notes |
|---|---|---|
| | | Always present |
| | | Always present |
| | | Spectrum identifier, e.g. |
| precursor | | Written when present |
| precursor | | Written as |
| scan | | Written when present |
mzML reader notes
- casanovoutils reads mzML using `pyteomics.mzml.MzML`, which supports both indexed and non-indexed mzML files.
- mzML output is not supported directly. To convert the sampled MGF back to mzML, use msConvert.
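As a rough sketch of the read-mzML, write-MGF flow, the following uses `pyteomics.mzml.MzML` and `pyteomics.mgf.write`. The helper names `to_mgf_spectrum` and `convert_mzml_to_mgf` are hypothetical, and the MGF keys chosen here (`title`, `pepmass`, `charge`) are assumptions for illustration, not necessarily the keys casanovoutils writes.

```python
def to_mgf_spectrum(spec):
    """Map one Pyteomics mzML spectrum dict to the dict layout mgf.write accepts."""
    params = {"title": spec["id"]}  # assumed mapping: spectrum id -> TITLE
    precursors = spec.get("precursorList", {}).get("precursor", [])
    if precursors:
        ions = precursors[0].get("selectedIonList", {}).get("selectedIon", [])
        if ions:
            ion = ions[0]
            # Precursor values are copied only when the mzML provides them
            if "selected ion m/z" in ion:
                params["pepmass"] = ion["selected ion m/z"]
            if "charge state" in ion:
                params["charge"] = int(ion["charge state"])
    return {
        "m/z array": spec["m/z array"],
        "intensity array": spec["intensity array"],
        "params": params,
    }


def convert_mzml_to_mgf(mzml_path, mgf_path):
    """Stream spectra from an mzML file into an MGF file."""
    # Third-party dependency, imported here so the mapping above stays importable without it
    from pyteomics import mgf, mzml

    with mzml.MzML(mzml_path) as reader:
        mgf.write((to_mgf_spectrum(s) for s in reader), output=mgf_path)
```

Keeping the per-spectrum mapping separate from the file handling makes it possible to check the mapping on a hand-built dictionary without an mzML file.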
mzTab
mzTab is a tab-delimited PSI standard format for reporting peptide-spectrum matches (PSMs). Casanovo produces mzTab as its primary output format, and casanovoutils reads the PSM section of these files.
Required columns
| Column | Description |
|---|---|
| `sequence` | Predicted peptide sequence. |
| `search_engine_score[1]` | Per-PSM confidence score used to rank predictions. |
| `spectra_ref` | Reference to the originating spectrum. |
spectra_ref format
casanovoutils expects spectra_ref values of the form:
ms_run[1]:index=<INT>
where <INT> is the zero-based index of the spectrum in the corresponding
MGF file. This is the format written by Casanovo.
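A value of this form can be parsed with a small regular expression. The sketch below is illustrative only: `parse_spectra_ref` is a hypothetical helper, and accepting any run number (not just `ms_run[1]`) is an assumption.

```python
import re

# Matches e.g. "ms_run[1]:index=0"; other spectra_ref styles are rejected
_SPECTRA_REF = re.compile(r"^ms_run\[(\d+)\]:index=(\d+)$")


def parse_spectra_ref(value):
    """Return (run number, zero-based spectrum index) for a spectra_ref value."""
    match = _SPECTRA_REF.match(value.strip())
    if match is None:
        raise ValueError(f"unsupported spectra_ref: {value!r}")
    return int(match.group(1)), int(match.group(2))


run, index = parse_spectra_ref("ms_run[1]:index=0")
```

Raising on any other form makes a mismatched mzTab fail early instead of silently pairing predictions with the wrong spectra.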
Example PSM section
PSH sequence PSM_ID accession unique database ... spectra_ref search_engine_score[1] ...
PSM PEPTIDEK 1 null null null ... ms_run[1]:index=0 0.9982 ...
PSM ACDEFGHIK 2 null null null ... ms_run[1]:index=1 0.8741 ...
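To show how the `PSH` header line and `PSM` rows relate, here is a plain-Python sketch. It is not the Pyteomics reader that casanovoutils uses; `read_psm_rows` is a hypothetical helper, and the sample lines keep only three columns for brevity.

```python
def read_psm_rows(lines):
    """Return one dict per PSM row, keyed by the column names from the PSH line."""
    header = None
    rows = []
    for raw in lines:
        fields = raw.rstrip("\n").split("\t")
        if fields[0] == "PSH":
            header = fields[1:]
        elif fields[0] == "PSM" and header is not None:
            rows.append(dict(zip(header, fields[1:])))
    return rows


# Reduced example: a real mzTab PSM section carries many more columns
sample = [
    "PSH\tsequence\tspectra_ref\tsearch_engine_score[1]",
    "PSM\tPEPTIDEK\tms_run[1]:index=0\t0.9982",
    "PSM\tACDEFGHIK\tms_run[1]:index=1\t0.8741",
]

psms = read_psm_rows(sample)
```

All values are returned as strings; the score would need `float(...)` before ranking.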
Notes
- casanovoutils reads mzTab files using Pyteomics.
- All MGF spectra are considered. Spectra absent from the mzTab are assigned a score of `-1.0` and marked as incorrect, so they appear at the bottom of the ranked list and count against coverage.
- Additional columns present in the mzTab (e.g., per-amino-acid score columns from Casanovo) are preserved in the internal DataFrame but are not used unless explicitly referenced.
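The handling of spectra missing from the mzTab can be sketched as follows. This is a simplified illustration under stated assumptions: `rank_predictions` is a hypothetical function, and correctness is judged by exact string equality, which may differ from the matching rule casanovoutils actually applies.

```python
def rank_predictions(ground_truth, predictions):
    """Rank every MGF spectrum by score, highest first.

    ground_truth: list of sequences, position = zero-based spectrum index
    predictions:  dict mapping spectrum index -> (predicted sequence, score)
    Returns a list of (score, is_correct, index) tuples.
    """
    rows = []
    for index, truth in enumerate(ground_truth):
        if index in predictions:
            sequence, score = predictions[index]
            # Exact string match is an assumption made for this sketch
            rows.append((score, sequence == truth, index))
        else:
            # Spectrum absent from the mzTab: score -1.0, counted as incorrect
            rows.append((-1.0, False, index))
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows


ground_truth = ["PEPTIDEK", "ACDEFGHIK", "LLK"]
predictions = {0: ("PEPTIDEK", 0.9982), 1: ("ACDEFGHIR", 0.8741)}
ranked = rank_predictions(ground_truth, predictions)
```

Because every MGF spectrum contributes a row, the unpredicted spectrum at index 2 stays in the ranked list and lowers coverage rather than being dropped.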