FAQ#
General#
What is the AUPC?
The AUPC (Area Under the Precision–Coverage curve) is a single-number summary of a model’s precision–coverage trade-off. It is computed as the area under the curve obtained by sorting predictions from highest to lowest confidence and computing the running precision and coverage at each threshold. A perfect model has an AUPC of 1.0; a random model will have an AUPC approximately equal to the fraction of correct predictions.
What is a precision–coverage curve?
A precision–coverage curve plots precision (fraction of accepted predictions that are correct) on the y-axis against coverage (fraction of all spectra included) on the x-axis. As the confidence threshold is lowered, more spectra are included (coverage increases) but precision may decrease. The curve summarizes the accuracy–completeness trade-off across all possible thresholds.
Installation#
What Python version is required?
casanovoutils requires Python 3.13 or later.
How do I install casanovoutils in a virtual environment?
python -m venv .venv
source .venv/bin/activate
pip install casanovoutils
Or with uv:
uv venv
uv pip install casanovoutils
Evaluation#
Are isoleucine (I) and leucine (L) treated as the same amino acid?
By default, yes. Pass --noreplace_i_l to require an exact match.
What happens to spectra that are missing from the mzTab output?
Spectra present in the MGF but absent from the mzTab are assigned a prediction
score of -1.0 and marked as incorrect. This means they appear at the bottom
of the ranked list and count against coverage, which accurately reflects that
the model did not return a prediction for those spectra.
My mzTab and MGF files come from different tools. Will casanovoutils work?
casanovoutils expects spectra_ref values of the form
ms_run[1]:index=<INT> where <INT> is the zero-based position of the
spectrum in the MGF file. This is the format written by Casanovo. If your
mzTab uses a different spectra_ref convention, you may need to reformat it
before using casanovoutils.
MGF operations#
What does pipeline do compared to running stages individually?
casanovoutils mgf pipeline chains shuffle → downsample → purge-redundant
in a single pass, writing one output file. Each stage is skipped when its
parameter is omitted. Running the stages individually via separate commands
produces identical results but requires intermediate files.
What does downsampling do to peptides with fewer than k spectra?
All spectra for that peptide are kept. Downsampling only removes spectra when a
peptide has more than k spectra; otherwise the full set is retained.
Is the downsampling reproducible?
Yes, set --random_seed to a fixed integer. The default seed is 42.
What does purge-redundant do exactly?
Peaks within each spectrum are sorted by m/z. Any peak whose m/z differs from
the preceding peak by less than epsilon (default 0.001 Da) is discarded.
This removes near-duplicate peaks that can arise from instrument noise or
rounding.
Residue mass tables#
What is the default residue mass table?
The bundled residues.yaml file contains standard monoisotopic masses for the
20 canonical amino acids plus common modifications used by Casanovo. Export it
with:
casanovoutils dump-residues dump residues.yaml
How do I add a custom modification?
Export the default table, add your modification as a new key–value pair
(residue name: mass in daltons), and pass the edited file back via
--residues_path:
casanovoutils dump-residues dump my_residues.yaml
# edit my_residues.yaml ...