CLI Reference#

casanovoutils installs a single casanovoutils command with nested subcommands. All subcommands are built with Python Fire, which means:

Boolean flags can be passed as --flag (True) or --noflag (False).
Positional arguments can also be passed as keyword arguments.

The top-level key for each group is the bare module name. The full structure is:

casanovoutils
├── mgfutils        — MGF file processing
│   ├── pipeline
│   ├── shuffle
│   ├── downsample
│   ├── spectra-per-peptide
│   ├── downsample-spectra
│   └── purge-redundant
├── mzmlutils       — mzML file sampling (writes MGF)
├── denovoutils     — Load PSM data into DataFrames
│   ├── get_mgf_psms
│   ├── get_mztab
│   └── get_groundtruth
├── preccov         — Precision-coverage evaluation
│   ├── get_pc_df
│   └── graph_prec_cov
├── summarize_mgf   — MGF file statistics and HTML reports
│   ├── summarize
│   ├── charge-distribution
│   ├── fragment-coverage
│   ├── peak-counts
│   └── peptide-lengths
├── datasets        — Create train/val/test splits from MGF files
├── graphloss       — Plot Casanovo training/validation loss curves
└── residues        — Residue mass table utilities

`casanovoutils mgfutils`#

Process MGF spectrum files.

`pipeline`#

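Run spectra through a chain of optional processing stages, in this order: shuffle → downsample → purge redundant peaks. Shuffling is enabled by default and can be turned off with --nodo_shuffle; the downsample and purge stages run only when --downsample_k and --purge_epsilon are supplied.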

Argument	Type	Default	Description
`spectra`	path	required	Input MGF file path
`--outfile`	path	`None`	Output MGF file path
`--do_shuffle`	bool	`True`	Shuffle spectra
`--downsample_k`	int	`None`	Max spectra per peptide (skip if omitted)
`--purge_epsilon`	float	`None`	Min m/z gap to keep a peak in Da (skip if omitted)
`--random_seed`	int	`42`	Random seed for shuffle and downsample

Examples:

# Shuffle only
casanovoutils mgfutils pipeline input.mgf --outfile out.mgf

# Downsample to 2 spectra per peptide, no shuffle
casanovoutils mgfutils pipeline input.mgf --outfile out.mgf --nodo_shuffle --downsample_k 2

# Full pipeline
casanovoutils mgfutils pipeline input.mgf --outfile out.mgf \
  --downsample_k 3 --purge_epsilon 0.001
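
The stage ordering and skip logic described above can be sketched in Python. This is an illustration only, not the casanovoutils implementation: spectra are modelled as (peptide, peaks) tuples, `run_pipeline` is a hypothetical name, and the way each stage is carried out here is an assumption. Only the parameter names mirror the CLI flags.

```python
import random
from collections import defaultdict


def run_pipeline(spectra, do_shuffle=True, downsample_k=None,
                 purge_epsilon=None, random_seed=42):
    """Apply shuffle -> downsample -> purge, skipping disabled stages.

    `spectra` is a list of (peptide, peaks) tuples, where `peaks` is a list
    of (mz, intensity) pairs. This data model is assumed for illustration.
    """
    rng = random.Random(random_seed)
    spectra = list(spectra)

    if do_shuffle:  # on by default; turned off with --nodo_shuffle
        rng.shuffle(spectra)

    if downsample_k is not None:  # skipped when --downsample_k is omitted
        kept, seen = [], defaultdict(int)
        for peptide, peaks in spectra:
            if seen[peptide] < downsample_k:
                seen[peptide] += 1
                kept.append((peptide, peaks))
        spectra = kept

    if purge_epsilon is not None:  # skipped when --purge_epsilon is omitted
        purged = []
        for peptide, peaks in spectra:
            out = []
            for mz, intensity in sorted(peaks):
                # Keep a peak only if it lies at least epsilon above the
                # last peak that was kept.
                if not out or mz - out[-1][0] >= purge_epsilon:
                    out.append((mz, intensity))
            purged.append((peptide, out))
        spectra = purged

    return spectra
```

Calling `run_pipeline(spectra, do_shuffle=False)` with no other arguments returns the input unchanged, which corresponds to passing --nodo_shuffle and omitting both optional parameters.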

`shuffle`#

Read all spectra and return them in a shuffled order.

Argument	Type	Default	Description
`spectra`	path	required	Input MGF file path
`--outfile`	path	`None`	Output MGF file path
`--random_seed`	int	`42`	Random seed for reproducibility

Example:

casanovoutils mgfutils shuffle input.mgf --outfile shuffled.mgf

`downsample`#

Limit the number of spectra retained per peptide sequence.

Argument	Type	Default	Description
`spectra`	path	required	Input MGF file path
`--k`	int	`1`	Maximum spectra per peptide
`--outfile`	path	`None`	Output MGF file path
`--random_seed`	int	`42`	Random seed for reproducibility

Example:

casanovoutils mgfutils downsample input.mgf --outfile sampled.mgf --k 5

`spectra-per-peptide`#

Reservoir-sample up to k spectra per peptide in a single streaming pass.

Argument	Type	Default	Description
`spectra`	path	required	Input MGF file path
`--k`	int	`1`	Maximum spectra per peptide
`--outfile`	path	`None`	Output MGF file path
`--random_seed`	int	`42`	Random seed for reproducibility

Example:

casanovoutils mgfutils spectra-per-peptide input.mgf --outfile sampled.mgf --k 3
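
To make the single-pass behaviour concrete, here is a minimal sketch of per-peptide reservoir sampling using the classic Algorithm R. The function name and the (peptide, spectrum) input shape are assumptions for illustration; the actual subcommand may differ in its details.

```python
import random
from collections import defaultdict


def reservoir_per_peptide(spectra, k=1, random_seed=42):
    """Keep up to k spectra per peptide in one pass over `spectra`.

    `spectra` is any iterable of (peptide, spectrum) pairs; this input
    shape is assumed for illustration.
    """
    rng = random.Random(random_seed)
    reservoirs = defaultdict(list)   # peptide -> sampled spectra
    seen = defaultdict(int)          # peptide -> spectra seen so far

    for peptide, spectrum in spectra:
        seen[peptide] += 1
        bucket = reservoirs[peptide]
        if len(bucket) < k:
            bucket.append(spectrum)  # fill the reservoir first
        else:
            # Algorithm R: the n-th spectrum replaces a random slot
            # with probability k / n.
            j = rng.randrange(seen[peptide])
            if j < k:
                bucket[j] = spectrum
    return dict(reservoirs)
```

Memory use grows with the number of distinct peptides times k rather than with the total number of spectra, which is what makes a streaming pass practical for large files.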

`downsample-spectra`#

Downsample an MGF file to a target number or proportion of spectra using an adaptive two-pass streaming approach that guarantees exactly the requested number of spectra in the output.

Argument	Type	Default	Description
`input_file`	path	required	Input MGF file
`output_file`	path	required	Output MGF file (must differ from input)
`--downsample_type`	str	`"number"`	`"number"` (exact count) or `"proportion"`
`--downsample_rate`	float	`100`	Target count (integer) or proportion in `(0, 1]`
`--random_seed`	int	`42`	Random seed for reproducibility

Examples:

# Keep exactly 1000 spectra
casanovoutils mgfutils downsample-spectra input.mgf out.mgf \
  --downsample_type number --downsample_rate 1000

# Keep 20 % of spectra
casanovoutils mgfutils downsample-spectra input.mgf out.mgf \
  --downsample_type proportion --downsample_rate 0.2
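
One way to obtain an exact count from two streaming passes is selection sampling (Knuth's Algorithm S): count the spectra first, then keep each one with probability needed / remaining. The sketch below uses that technique as a stand-in; it is not necessarily the approach casanovoutils takes. `sample_exactly` and `read_spectra` are hypothetical names, and rounding the proportion with `round` is an assumption.

```python
import random


def sample_exactly(read_spectra, downsample_type="number",
                   downsample_rate=100, random_seed=42):
    """Yield exactly the target number of items using two streaming passes.

    `read_spectra` is a zero-argument callable returning a fresh iterator
    over the spectra (hypothetical; it stands in for re-opening the file).
    """
    total = sum(1 for _ in read_spectra())        # pass 1: count only

    if downsample_type == "proportion":
        target = round(downsample_rate * total)
    else:
        target = min(int(downsample_rate), total)

    rng = random.Random(random_seed)
    remaining, needed = total, target
    for spectrum in read_spectra():               # pass 2: select
        # Keep with probability needed / remaining. The probability is
        # updated after every item, so exactly `target` items are emitted.
        if needed > 0 and rng.random() * remaining < needed:
            needed -= 1
            yield spectrum
        remaining -= 1
```

Because items are emitted as they are read, the output keeps the input order and only one spectrum needs to be held in memory at a time.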

`purge-redundant`#

Sort peaks by m/z and remove any peak whose m/z differs from the previous peak by less than epsilon.

Argument	Type	Default	Description
`spectra`	path	required	Input MGF file path
`--epsilon`	float	`~1.19e-7`	Minimum m/z separation in Da to keep a peak
`--outfile`	path	`None`	Output MGF file path

Example:

casanovoutils mgfutils purge-redundant input.mgf --outfile purged.mgf --epsilon 0.005
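
The sort-then-filter rule can be sketched as follows. This assumes that "previous peak" means the most recently kept peak and that peaks are (mz, intensity) pairs; `purge_redundant` here is an illustrative function, not the library's own.

```python
def purge_redundant(peaks, epsilon=1.19e-7):
    """Sort peaks by m/z and drop any peak closer than epsilon (in Da)
    to the previously kept peak. `peaks` is a list of (mz, intensity)."""
    kept = []
    for mz, intensity in sorted(peaks, key=lambda p: p[0]):
        if not kept or mz - kept[-1][0] >= epsilon:
            kept.append((mz, intensity))
    return kept


peaks = [(200.0, 5.0), (100.0, 1.0), (100.002, 2.0), (100.01, 3.0)]
# With epsilon = 0.005 the peak at 100.002 is dropped, since it sits only
# 0.002 Da above the kept peak at 100.0.
print(purge_redundant(peaks, epsilon=0.005))
```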

`casanovoutils mzmlutils`#

Sample a proportion of spectra from an mzML file and write the result as MGF.

Reads the file in chunks of buffer_size spectra and draws round(k × chunk_size) spectra from each chunk at random, without replacement, in a single streaming pass. Precursor m/z, charge state, and retention time are carried through to the output MGF when present in the source file.

Argument	Type	Default	Description
`input_file`	path	required	Input mzML file
`k`	float	required	Proportion of spectra to sample; must be in (0, 1)
`outfile`	path	required	Output MGF file path (must have `.mgf` extension)
`--buffer_size`	int	`1000`	Spectra read per I/O chunk
`--random_seed`	int	`42`	Random seed for reproducibility

Note on count accuracy: the final sample count equals sum(round(k × b) for b in buffers), which can differ slightly from round(k × total) due to per-buffer rounding. Use a buffer_size large relative to 1 / k to minimise this effect.
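
The effect of per-buffer rounding is easy to check with a few lines of arithmetic. The helper below is hypothetical and assumes rounding behaves like Python's built-in `round`; it only reproduces the sum given in the note above.

```python
def per_buffer_count(total, k, buffer_size):
    """Number of spectra sampled when round(k * b) is applied per buffer."""
    full, rest = divmod(total, buffer_size)
    sizes = [buffer_size] * full + ([rest] if rest else [])
    return sum(round(k * b) for b in sizes)


total, k = 1000, 0.1
print(round(k * total))                  # 100, the nominal target
print(per_buffer_count(total, k, 1000))  # 100, one large buffer
print(per_buffer_count(total, k, 6))     # 166, each full buffer rounds 0.6 up to 1
print(per_buffer_count(total, k, 4))     # 0, every buffer rounds 0.4 down to 0
```

With k = 0.1, a buffer of 1000 spectra reproduces the nominal count, while buffers smaller than 1 / k = 10 over- or under-sample badly.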

mzML output: not supported directly. If you need mzML output, convert the MGF result with msConvert.

Examples:

# Sample 10 % of spectra
casanovoutils mzmlutils input.mzML 0.1 sampled.mgf

# Sample 25 % with a 5 000-spectrum buffer
casanovoutils mzmlutils input.mzML 0.25 sampled.mgf --buffer_size 5000

# Reproducible run with a fixed seed
casanovoutils mzmlutils input.mzML 0.5 sampled.mgf --random_seed 123

`casanovoutils denovoutils`#

Load and join PSM data from MGF and mzTab files into Polars DataFrames.

`get_mgf_psms`#

Load spectrum metadata from an MGF file.

Argument	Type	Default	Description
`mgf_path`	path	required	Input MGF file
`--out_path`	path	`None`	Output file path (`.parquet`, `.csv`, or `.tsv`)
`--meta_data_only`	bool	`True`	Exclude m/z and intensity arrays from output

Example:

casanovoutils denovoutils get_mgf_psms input.mgf --out_path psms.parquet

`get_mztab`#

Load the spectrum match table from an mzTab file.

Argument	Type	Default	Description
`mztab_path`	path	required	Input mzTab file
`--out_path`	path	`None`	Output file path (`.parquet`, `.csv`, or `.tsv`)

Example:

casanovoutils denovoutils get_mztab results.mztab --out_path matches.parquet

`get_groundtruth`#

Join MGF PSM metadata with mzTab predictions into a single DataFrame.

Argument	Type	Default	Description
`mgf_path`	path	required	Input MGF file
`mztab_path`	path	required	Input mzTab file
`--out_path`	path	`None`	Output file path (`.parquet`, `.csv`, or `.tsv`)

Example:

casanovoutils denovoutils get_groundtruth input.mgf results.mztab \
  --out_path groundtruth.parquet

`casanovoutils preccov`#

Compute and plot precision-coverage curves from PSM predictions.

`get_pc_df`#

Build a precision-coverage DataFrame from predicted and ground-truth PSMs. Accepts either a pre-built ground-truth DataFrame or the MGF PSM and mzTab DataFrames from which to build one.

Argument	Type	Default	Description
`--ground_truth_df`	path	`None`	Pre-built ground-truth DataFrame
`--mgf_df`	path	`None`	MGF PSM DataFrame (required if `ground_truth_df` is omitted)
`--mztab_df`	path	`None`	mzTab DataFrame (required if `ground_truth_df` is omitted)
`--residues_path`	path	`None`	Custom residue mass YAML; uses bundled file if omitted
`--replace_isoleucine_with_leucine`	bool	`True`	Treat I and L as equivalent
`--aa_level`	bool	`False`	Compute per-amino-acid rather than per-peptide metrics
`--out_path`	path	`None`	Output file path for the resulting DataFrame

Example:

casanovoutils preccov get_pc_df \
  --mgf_df psms.parquet --mztab_df matches.parquet \
  --out_path pc.parquet
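
For readers unfamiliar with precision-coverage curves, the sketch below shows the usual construction: rank predictions by confidence, then report the fraction correct (precision) at each fraction of spectra retained (coverage). The function names and the (score, is_correct) layout are assumptions; the matching helper handles plain, unmodified sequences only and is not the library's own comparison.

```python
def peptides_match(predicted, truth, replace_isoleucine_with_leucine=True):
    """Compare two plain peptide strings, optionally treating I and L as equal."""
    if replace_isoleucine_with_leucine:
        predicted = predicted.replace("I", "L")
        truth = truth.replace("I", "L")
    return predicted == truth


def precision_coverage(psms):
    """Return (coverage, precision) points for PSMs ranked by confidence.

    `psms` is a list of (score, is_correct) pairs, one per spectrum; the
    pair layout is assumed for illustration.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    total = len(ranked)
    points, correct = [], 0
    for rank, (_, is_correct) in enumerate(ranked, start=1):
        correct += bool(is_correct)
        points.append((rank / total, correct / rank))
    return points
```

Each point answers the question "if only the top fraction of predictions by score is kept, how many of them are right?".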

`graph_prec_cov`#

Plot precision-coverage curves from one or more pre-computed DataFrames. Each file is plotted as a separate series labelled by its file stem.

Argument	Type	Default	Description
`*pc_df_paths`	path(s)	required	One or more precision-coverage DataFrames
`--out_path`	path	`None`	Save the figure to this path (e.g. `.png`, `.pdf`)

Example:

casanovoutils preccov graph_prec_cov run1.parquet run2.parquet \
  --out_path comparison.png

`casanovoutils summarize_mgf`#

Generate per-file statistics and visualisations for MGF files.

`summarize`#

Produce a self-contained HTML report for an MGF file covering charge distribution, peak counts, peptide lengths, and fragment ion coverage.

Argument	Type	Default	Description
`mgf_file`	path	required	Input MGF file
`--output_root`	path	`"mgf_summary"`	Output directory; HTML file shares this basename
`--tolerance`	float	`0.05`	Fragment mass tolerance
`--tolerance_unit`	str	`"Da"`	Tolerance unit: `"ppm"` or `"Da"`
`--workers`	int	`1`	Parallel worker processes for coverage annotation
`--max_charge`	str	`"1less"`	Max fragment charge: `"max"` or `"1less"`
`--neutral_losses`	bool	`True`	Include neutral losses in annotation

Example:

casanovoutils summarize_mgf summarize input.mgf --output_root my_report \
  --tolerance 10 --tolerance_unit ppm --workers 4

`charge-distribution`#

Count and plot the charge state distribution across all spectra.

Argument	Type	Default	Description
`mgf_file`	path	required	Input MGF file
`--output_tsv`	path	`"charge_distribution.tsv"`	Output counts TSV
`--output_plot`	path	`"charge_distribution.png"`	Output bar chart

Example:

casanovoutils summarize_mgf charge-distribution input.mgf \
  --output_tsv charges.tsv --output_plot charges.png

`fragment-coverage`#

Annotate spectra with b/y ions and report the fraction of total intensity covered by matched fragments.

Argument	Type	Default	Description
`mgf_file`	path	required	Annotated MGF file (requires `SEQ=` in ProForma notation)
`--tolerance`	float	`0.05`	Mass tolerance
`--tolerance_unit`	str	`"Da"`	Tolerance unit: `"ppm"` or `"Da"`
`--output_tsv`	path	`"fragment_coverage.tsv"`	Summary TSV
`--output_full_tsv`	path	`"fragment_coverage.full.tsv"`	Per-spectrum TSV
`--output_plot`	path	`"fragment_coverage.png"`	Coverage histogram
`--workers`	int	`1`	Parallel worker processes
`--max_charge`	str	`"1less"`	Max fragment charge: `"max"` or `"1less"`
`--neutral_losses`	bool	`True`	Include neutral losses

Example:

casanovoutils summarize_mgf fragment-coverage input.mgf \
  --tolerance 10 --tolerance_unit ppm --workers 4
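
The reported metric, the fraction of total intensity explained by matched fragments, can be sketched as below. The sketch assumes the theoretical b/y ion m/z values have already been computed elsewhere and that ppm tolerance is measured relative to the theoretical m/z; `intensity_coverage` is a hypothetical helper, not the tool's annotation code.

```python
def intensity_coverage(peaks, fragment_mzs, tolerance=0.05, tolerance_unit="Da"):
    """Fraction of total intensity in peaks matching any theoretical fragment.

    `peaks` is a list of (mz, intensity) pairs and `fragment_mzs` a list of
    theoretical fragment m/z values (assumed to be precomputed).
    """
    def matches(observed, theoretical):
        delta = abs(observed - theoretical)
        if tolerance_unit == "ppm":
            return delta / theoretical * 1e6 <= tolerance
        return delta <= tolerance

    total = sum(intensity for _, intensity in peaks)
    if total == 0:
        return 0.0
    matched = sum(
        intensity
        for mz, intensity in peaks
        if any(matches(mz, frag) for frag in fragment_mzs)
    )
    return matched / total
```

The same spectrum can score very differently under the two units: a peak 0.04 Da away from a fragment at m/z 200 is inside a 0.05 Da window but 200 ppm off, well outside a 10 ppm window.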

`peak-counts`#

Histogram of the number of peaks per spectrum.

Argument	Type	Default	Description
`mgf_file`	path	required	Input MGF file
`--output_tsv`	path	`"peak_counts.tsv"`	Output counts TSV
`--output_plot`	path	`"peak_counts.png"`	Output histogram

Example:

casanovoutils summarize_mgf peak-counts input.mgf

`peptide-lengths`#

Histogram of peptide sequence lengths for annotated spectra (requires SEQ=).

Argument	Type	Default	Description
`mgf_file`	path	required	Input MGF file
`--output_tsv`	path	`"peptide_lengths.tsv"`	Output counts TSV
`--output_plot`	path	`"peptide_lengths.png"`	Output histogram

Example:

casanovoutils summarize_mgf peptide-lengths input.mgf

`casanovoutils datasets`#

Create peptide-level train/validation/test splits from annotated MGF files. Peptides are split 80 / 10 / 10 by unique sequence to prevent leakage between splits. Outputs three MGF files: <output_root>.train.mgf, .val.mgf, and .test.mgf.

Argument	Type	Default	Description
`*mgf_files`	path(s)	required	One or more annotated MGF files
`--output_root`	str	required	Base path for output files
`--spectra_per_peptide`	int	`None`	Cap spectra per peptide from new input files
`--random_seed`	int	`42`	Random seed for reproducibility
`--overwrite`	bool	`False`	Overwrite existing output files
`--existing_splits`	paths	`None`	Tuple of existing (train, val, test) MGF paths to extend
`--combine_with_existing`	bool	`False`	Include existing spectra in output alongside new ones

Examples:

# Basic split
casanovoutils datasets input.mgf --output_root splits/run1

# Multiple input files, cap at 3 spectra per peptide
casanovoutils datasets a.mgf b.mgf --output_root splits/combined \
  --spectra_per_peptide 3
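
Splitting by unique sequence means every spectrum of a given peptide lands in the same split. A minimal sketch of that idea, under the assumption that peptides are shuffled with a seeded generator and cut at 80 % and 90 % (the function name and the exact rounding are illustrative):

```python
import random


def split_peptides(peptides, random_seed=42):
    """Assign each unique peptide to train, val, or test (80 / 10 / 10)."""
    unique = sorted(set(peptides))           # sort for a deterministic order
    random.Random(random_seed).shuffle(unique)
    n_train = len(unique) * 8 // 10
    n_val = len(unique) // 10
    assignment = {}
    for i, peptide in enumerate(unique):
        if i < n_train:
            assignment[peptide] = "train"
        elif i < n_train + n_val:
            assignment[peptide] = "val"
        else:
            assignment[peptide] = "test"
    return assignment
```

Each spectrum is then written to the file for `assignment[peptide]`, so no peptide sequence appears in more than one split.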

`casanovoutils graphloss`#

Read Casanovo log files and/or metrics.csv files and plot training and validation loss curves.

Argument	Type	Default	Description
`root`	str	required	Output file root; plot saved as `<root>.png`
`inputs`	path(s)	required	One or more Casanovo log or `metrics.csv` files
`--max_y`	float	`None`	Optional y-axis maximum

Example:

casanovoutils graphloss run1_plot run1.log run2_metrics.csv --max_y 2.0

`casanovoutils residues`#

Copy the bundled residue mass YAML file to a specified path. The file can then be edited to add custom modifications or non-standard residues and passed back to other tools via --residues_path.

Argument	Type	Default	Description
`destination_path`	path	required	Destination path for the YAML file

Example:

casanovoutils residues my_residues.yaml
