FAQ

General

What is the AUPC?

The AUPC (Area Under the Precision–Coverage curve) is a single-number summary of a model’s precision–coverage trade-off. It is the area under the curve obtained by sorting predictions from highest to lowest confidence and computing the running precision and coverage at each threshold. A model whose predictions are all correct has an AUPC of 1.0; a model whose confidence scores carry no information about correctness (a random ranking) has an AUPC approximately equal to its overall fraction of correct predictions.
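As an illustration of the definition above, the following is a minimal sketch and not the casanovoutils implementation. The function name aupc is hypothetical, and the integration rule is an assumption: each prediction contributes a rectangle of width 1/n (its share of coverage) and height equal to the running precision at that rank. The library may integrate differently, for example with the trapezoidal rule.

```python
from typing import Sequence


def aupc(scores: Sequence[float], correct: Sequence[bool]) -> float:
    """Area under the precision-coverage curve (rectangle-rule sketch)."""
    if len(scores) != len(correct):
        raise ValueError("scores and correct must have the same length")
    n = len(scores)
    if n == 0:
        return 0.0
    # Rank predictions from highest to lowest confidence.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    n_correct = 0
    area = 0.0
    for rank, i in enumerate(order, start=1):
        n_correct += bool(correct[i])
        precision = n_correct / rank   # running precision at this rank
        area += precision * (1.0 / n)  # each prediction adds 1/n coverage
    return area


# Running precisions are 1, 1, 2/3, 3/4, so the area is their mean.
print(round(aupc([0.9, 0.8, 0.3, 0.1], [True, True, False, True]), 4))  # 0.8542
```

With this rule the AUPC is simply the mean of the running precision over all ranks, which makes it easy to check by hand on small inputs.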

What is a precision–coverage curve?

A precision–coverage curve plots precision (fraction of accepted predictions that are correct) on the y-axis against coverage (fraction of all spectra included) on the x-axis. As the confidence threshold is lowered, more spectra are included (coverage increases) but precision may decrease. The curve summarizes the accuracy–completeness trade-off across all possible thresholds.
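To make the two axes concrete, here is a small sketch that computes a single point on the curve for a given confidence threshold. The function name precision_coverage_at is hypothetical, and returning NaN for precision when no prediction is accepted is an assumption made for this example.

```python
from typing import Sequence, Tuple


def precision_coverage_at(
    scores: Sequence[float], correct: Sequence[bool], threshold: float
) -> Tuple[float, float]:
    """Return (precision, coverage) when accepting scores >= threshold."""
    if not scores:
        raise ValueError("at least one prediction is required")
    # Keep the correctness flags of the predictions that pass the threshold.
    accepted = [bool(c) for s, c in zip(scores, correct) if s >= threshold]
    coverage = len(accepted) / len(scores)
    # Precision is undefined when nothing is accepted; NaN marks that case.
    precision = sum(accepted) / len(accepted) if accepted else float("nan")
    return precision, coverage


scores = [0.9, 0.8, 0.3, 0.1]
correct = [True, True, False, True]
print(precision_coverage_at(scores, correct, 0.5))  # (1.0, 0.5)
print(precision_coverage_at(scores, correct, 0.0))  # (0.75, 1.0)
```

Lowering the threshold from 0.5 to 0.0 doubles coverage but admits one incorrect prediction, so precision drops from 1.0 to 0.75.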


Installation

What Python version is required?

casanovoutils requires Python 3.13 or later.

How do I install casanovoutils in a virtual environment?

python -m venv .venv
source .venv/bin/activate
pip install casanovoutils

Or with uv:

uv venv
uv pip install casanovoutils

Evaluation

Are isoleucine (I) and leucine (L) treated as the same amino acid?

By default, yes. Pass --noreplace_i_l to require an exact match.
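The effect of treating I and L as equivalent can be sketched as follows. This is an illustrative assumption about how such a comparison might work, not the library's code: the function name peptides_match and its replace_i_l parameter are hypothetical, and the sketch handles plain one-letter sequences only (no modification notation).

```python
def peptides_match(predicted: str, truth: str, replace_i_l: bool = True) -> bool:
    """Compare two plain peptide sequences, optionally treating I as L."""
    if replace_i_l:
        # Isoleucine and leucine have the same mass, so map both to L.
        predicted = predicted.replace("I", "L")
        truth = truth.replace("I", "L")
    return predicted == truth


print(peptides_match("PEPTIDE", "PEPTLDE"))                     # True
print(peptides_match("PEPTIDE", "PEPTLDE", replace_i_l=False))  # False
```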

What happens to spectra that are missing from the mzTab output?

Spectra present in the MGF but absent from the mzTab are assigned a prediction score of -1.0 and marked as incorrect. This means they appear at the bottom of the ranked list and count against coverage, which accurately reflects that the model did not return a prediction for those spectra.
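The handling of missing spectra described above can be sketched like this. The function name fill_missing and the dictionary layout (spectrum index mapped to a score and a correctness flag) are assumptions made for the example, not the library's data model.

```python
from typing import Dict, List, Tuple


def fill_missing(
    predictions: Dict[int, Tuple[float, bool]], n_spectra: int
) -> List[Tuple[float, bool]]:
    """Return (score, is_correct) for every spectrum index in the MGF.

    Spectra without a prediction get score -1.0 and are marked incorrect.
    """
    return [predictions.get(i, (-1.0, False)) for i in range(n_spectra)]


# Spectra 1 and 3 have no mzTab entry in this toy example.
rows = fill_missing({0: (0.9, True), 2: (0.4, False)}, n_spectra=4)
print(rows)  # [(0.9, True), (-1.0, False), (0.4, False), (-1.0, False)]
```

Because -1.0 is lower than the scores of the returned predictions in this example, the filled-in rows sort to the end of the ranked list.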

My mzTab and MGF files come from different tools. Will casanovoutils work?

casanovoutils expects spectra_ref values of the form ms_run[1]:index=<INT> where <INT> is the zero-based position of the spectrum in the MGF file. This is the format written by Casanovo. If your mzTab uses a different spectra_ref convention, you may need to reformat it before using casanovoutils.
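To check whether an mzTab file follows this convention before running an evaluation, a small validator along these lines may help. It is a sketch based only on the format stated above; the function name parse_spectra_ref is hypothetical and not part of casanovoutils.

```python
import re

# Matches exactly "ms_run[1]:index=<INT>" with a non-negative integer.
_SPECTRA_REF = re.compile(r"ms_run\[1\]:index=(\d+)")


def parse_spectra_ref(value: str) -> int:
    """Return the zero-based MGF index encoded in a spectra_ref value."""
    match = _SPECTRA_REF.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"unsupported spectra_ref format: {value!r}")
    return int(match.group(1))


print(parse_spectra_ref("ms_run[1]:index=42"))  # 42
```

A value such as ms_run[1]:scan=42 raises ValueError here, which signals that the file needs reformatting first.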


MGF operations

What does pipeline do compared to running stages individually?

casanovoutils mgf pipeline chains shuffle → downsample → purge-redundant in a single pass, writing one output file. Each stage is skipped when its parameter is omitted. Running the stages individually via separate commands produces identical results but requires intermediate files.

What does downsampling do to peptides with fewer than k spectra?

All spectra for that peptide are kept. Downsampling only removes spectra when a peptide has more than k spectra; otherwise the full set is retained.

Is the downsampling reproducible?

Yes. Set --random_seed to a fixed integer; the default seed is 42.
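The downsampling behaviour and its dependence on the seed can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: the function name downsample, the (peptide, spectrum) pair layout, and the use of Python's random.Random are all choices made for the example.

```python
import random
from collections import defaultdict
from typing import Any, List, Tuple


def downsample(
    spectra: List[Tuple[str, Any]], k: int, seed: int = 42
) -> List[Tuple[str, Any]]:
    """Keep at most k spectra per peptide, preserving the input order."""
    rng = random.Random(seed)  # fixed seed gives a reproducible selection
    by_peptide = defaultdict(list)
    for idx, (peptide, _) in enumerate(spectra):
        by_peptide[peptide].append(idx)
    keep = set()
    for indices in by_peptide.values():
        # Only peptides with more than k spectra are reduced.
        if len(indices) > k:
            indices = rng.sample(indices, k)
        keep.update(indices)
    return [item for idx, item in enumerate(spectra) if idx in keep]


data = [("PEPTIDE", i) for i in range(5)] + [("ACDK", i) for i in range(2)]
subset = downsample(data, k=3)
print(len(subset))  # 5: three PEPTIDE spectra plus both ACDK spectra
```

Calling downsample twice with the same seed returns the same subset, while a different seed may select different spectra for peptides that exceed k.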

What does purge-redundant do exactly?

Peaks within each spectrum are sorted by m/z. Any peak whose m/z differs from the preceding peak by less than epsilon (default 0.001 Da) is discarded. This removes near-duplicate peaks that can arise from instrument noise or rounding.
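A minimal sketch of this rule is shown below. The function name purge_redundant and the (m/z, intensity) tuple layout are hypothetical, and the sketch assumes that "preceding peak" means the most recently kept peak; the library might instead compare against the immediately preceding peak in the sorted list, which can differ for chains of closely spaced peaks.

```python
from typing import List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def purge_redundant(peaks: List[Peak], epsilon: float = 0.001) -> List[Peak]:
    """Drop peaks closer than epsilon (in Da) to the previously kept peak."""
    kept: List[Peak] = []
    for mz, intensity in sorted(peaks, key=lambda p: p[0]):
        # Skip a peak that is a near-duplicate of the last kept peak.
        if kept and mz - kept[-1][0] < epsilon:
            continue
        kept.append((mz, intensity))
    return kept


peaks = [(100.5, 3.0), (100.0, 1.0), (100.0005, 2.0)]
print(purge_redundant(peaks))  # [(100.0, 1.0), (100.5, 3.0)]
```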


Residue mass tables

What is the default residue mass table?

The bundled residues.yaml file contains standard monoisotopic masses for the 20 canonical amino acids plus common modifications used by Casanovo. Export it with:

casanovoutils dump-residues dump residues.yaml

How do I add a custom modification?

Export the default table, add your modification as a new key–value pair (residue name: mass in daltons), and pass the edited file back via --residues_path:

casanovoutils dump-residues dump my_residues.yaml
# edit my_residues.yaml ...