File Formats#

casanovoutils reads and writes three file formats: MGF, mzML (read only), and mzTab. This page describes the relevant fields and conventions expected by each tool.


MGF (Mascot Generic Format)#

MGF is a plain-text format for tandem mass spectrometry data. Each spectrum is delimited by BEGIN IONS / END IONS blocks.

Example#

BEGIN IONS
TITLE=spectrum_001
PEPMASS=612.3456
CHARGE=2+
SEQ=PEPTIDEK
100.0 1234.5
200.0 5678.9
...
END IONS

Fields used by casanovoutils#

Field

Description

SEQ=

Ground-truth peptide sequence. Required for evaluation with graph-prec-cov.

PEPMASS=

Precursor m/z (and optionally intensity).

CHARGE=

Precursor charge state.

TITLE=

Spectrum identifier (optional; used by some tools for logging).

The m/z and intensity peak list follows the header fields, one peak per line, space-separated.

Notes#

  • For ground-truth evaluation, casanovoutils reads SEQ= entries from MGF files directly in order of appearance. Other MGF operations use Pyteomics.

  • Spectrum indices used in the mzTab spectra_ref column (see below) are zero-based positions within the MGF file.

  • The SEQ= field is required for ground-truth evaluation. Casanovo writes this field when it generates annotated MGF output.


mzML#

mzML is the PSI standard XML format for raw and processed mass spectrometry data. casanovoutils reads mzML files via Pyteomics and writes output as MGF.

Fields read from mzML#

Field

MGF output key

Notes

m/z array

m/z array

Always present

intensity array

intensity array

Always present

id

TITLE param

Spectrum identifier, e.g. scan=1

precursor selected ion m/z

PEPMASS param

Written when present

precursor charge state

CHARGE param

Written as N+ when present

scan scan start time

RTINSECONDS param

Written when present

mzML reader notes#

  • casanovoutils reads mzML using pyteomics.mzml.MzML, which supports both indexed and non-indexed mzML files.

  • mzML output is not supported directly. To convert the sampled MGF back to mzML, use msConvert.


mzTab#

mzTab is a tab-delimited PSI standard format for reporting peptide-spectrum matches. casanovoutils reads the PSM section of mzTab files, which is produced by Casanovo as its primary output format.

Required columns#

Column

Description

sequence

Predicted peptide sequence.

search_engine_score[1]

Per-PSM confidence score used to rank predictions.

spectra_ref

Reference to the originating spectrum.

spectra_ref format#

casanovoutils expects spectra_ref values of the form:

ms_run[1]:index=<INT>

where <INT> is the zero-based index of the spectrum in the corresponding MGF file. This is the format written by Casanovo.

Example PSM section#

PSH	sequence	PSM_ID	accession	unique	database	...	spectra_ref	search_engine_score[1]	...
PSM	PEPTIDEK	1	null	null	null	...	ms_run[1]:index=0	0.9982	...
PSM	ACDEFGHIK	2	null	null	null	...	ms_run[1]:index=1	0.8741	...

Notes#

  • casanovoutils reads mzTab files using Pyteomics.

  • All MGF spectra are considered. Spectra absent from the mzTab are assigned a score of -1.0 and marked as incorrect, so they appear at the bottom of the ranked list and count against coverage.

  • Additional columns present in the mzTab (e.g., per-amino-acid score columns from Casanovo) are preserved in the internal DataFrame but are not used unless explicitly referenced.