CLI Reference
casanovoutils installs a single casanovoutils command with nested subcommands. All subcommands are built with Python Fire, which means:
- Boolean flags can be passed as --flag (True) or --noflag (False).
- Positional arguments can also be passed as keyword arguments.
The top-level key for each group is the bare module name. The full structure is:
casanovoutils
├── mgfutils — MGF file processing
│ ├── pipeline
│ ├── shuffle
│ ├── downsample
│ ├── spectra-per-peptide
│ ├── downsample-spectra
│ └── purge-redundant
├── mzmlutils — mzML file sampling (writes MGF)
├── denovoutils — Load PSM data into DataFrames
│ ├── get_mgf_psms
│ ├── get_mztab
│ └── get_groundtruth
├── preccov — Precision-coverage evaluation
│ ├── get_pc_df
│ └── graph_prec_cov
├── summarize_mgf — MGF file statistics and HTML reports
│ ├── summarize
│ ├── charge-distribution
│ ├── fragment-coverage
│ ├── peak-counts
│ └── peptide-lengths
├── datasets — Create train/val/test splits from MGF files
├── graphloss — Plot Casanovo training/validation loss curves
└── residues — Residue mass table utilities
casanovoutils mgfutils
Process MGF spectrum files.
pipeline
Run spectra through an optional chain of processing stages in order: shuffle → downsample → purge redundant peaks. Each stage is skipped when its enabling parameter is omitted.
| Argument | Type | Default | Description |
|---|---|---|---|
|  | path | required | Input MGF file path |
| --outfile | path |  | Output MGF file path |
| --do_shuffle | bool |  | Shuffle spectra |
| --downsample_k | int |  | Max spectra per peptide (skip if omitted) |
| --purge_epsilon | float |  | Min m/z gap in Da to keep a peak (skip if omitted) |
|  | int |  | Random seed for shuffle and downsample |
Examples:
# Shuffle only
casanovoutils mgfutils pipeline input.mgf --outfile out.mgf --do_shuffle
# Downsample to 2 spectra per peptide, no shuffle
casanovoutils mgfutils pipeline input.mgf --outfile out.mgf --nodo_shuffle --downsample_k 2
# Full pipeline
casanovoutils mgfutils pipeline input.mgf --outfile out.mgf \
--downsample_k 3 --purge_epsilon 0.001
shuffle
Read all spectra and write them out in shuffled order.
| Argument | Type | Default | Description |
|---|---|---|---|
|  | path | required | Input MGF file path |
| --outfile | path |  | Output MGF file path |
|  | int |  | Random seed for reproducibility |
Example:
casanovoutils mgfutils shuffle input.mgf --outfile shuffled.mgf
downsample
Limit the number of spectra retained per peptide sequence.
| Argument | Type | Default | Description |
|---|---|---|---|
|  | path | required | Input MGF file path |
| --k | int |  | Maximum spectra per peptide |
| --outfile | path |  | Output MGF file path |
|  | int |  | Random seed for reproducibility |
Example:
casanovoutils mgfutils downsample input.mgf --outfile sampled.mgf --k 5
spectra-per-peptide
Reservoir-sample up to k spectra per peptide in a single streaming pass.
| Argument | Type | Default | Description |
|---|---|---|---|
|  | path | required | Input MGF file path |
| --k | int |  | Maximum spectra per peptide |
| --outfile | path |  | Output MGF file path |
|  | int |  | Random seed for reproducibility |
Example:
casanovoutils mgfutils spectra-per-peptide input.mgf --outfile sampled.mgf --k 3
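As an illustration of the reservoir idea described for spectra-per-peptide, here is a minimal Python sketch. It is not the tool's own code: the function name and the (peptide, spectrum) pair format are assumptions made for the example, and it uses the classic Algorithm R with one reservoir per peptide.

```python
import random
from collections import defaultdict


def reservoir_per_peptide(spectra, k, seed=None):
    """Keep a uniform random sample of at most k spectra per peptide in one pass.

    `spectra` is any iterable of (peptide, spectrum) pairs. Both the name and
    the pair format are illustrative assumptions, not the tool's internals.
    """
    rng = random.Random(seed)
    reservoirs = defaultdict(list)  # peptide -> sampled spectra
    seen = defaultdict(int)         # peptide -> number of spectra seen so far
    for peptide, spectrum in spectra:
        seen[peptide] += 1
        bucket = reservoirs[peptide]
        if len(bucket) < k:
            # The reservoir is not full yet, so keep everything.
            bucket.append(spectrum)
        else:
            # Replace a random slot with probability k / seen (Algorithm R).
            j = rng.randrange(seen[peptide])
            if j < k:
                bucket[j] = spectrum
    return dict(reservoirs)
```

Memory use grows with the number of distinct peptides times k rather than with the file size, which is what makes a single streaming pass feasible.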
downsample-spectra
Downsample an MGF file to a target number or proportion of spectra using an
adaptive two-pass streaming approach that guarantees exactly the target number of spectra.
| Argument | Type | Default | Description |
|---|---|---|---|
|  | path | required | Input MGF file |
|  | path | required | Output MGF file (must differ from input) |
| --downsample_type | str |  |  |
| --downsample_rate | float |  | Target count (integer) or proportion in |
|  | int |  | Random seed for reproducibility |
Examples:
# Keep exactly 1000 spectra
casanovoutils mgfutils downsample-spectra input.mgf out.mgf \
--downsample_type number --downsample_rate 1000
# Keep 20 % of spectra
casanovoutils mgfutils downsample-spectra input.mgf out.mgf \
--downsample_type proportion --downsample_rate 0.2
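One well-known way to draw exactly a target number of items in two streaming passes is selection sampling (Knuth's Algorithm S). The sketch below shows that technique only; whether downsample-spectra works this way is not stated here, and the function name and the `read_items` callable are assumptions for the example.

```python
import random


def sample_exactly_k(read_items, k, seed=None):
    """Two-pass selection sampling (Knuth's Algorithm S).

    `read_items` is a zero-argument callable that returns a fresh iterator.
    It stands in for re-opening the input file on each pass (an assumption).
    """
    rng = random.Random(seed)
    total = sum(1 for _ in read_items())  # pass 1: count the items
    k = min(k, total)                     # cannot keep more than exist
    chosen = []
    remaining = total
    for item in read_items():             # pass 2: select
        needed = k - len(chosen)
        if needed == 0:
            break
        # Select with probability needed / remaining. When the two are equal
        # the test always succeeds, so the result has exactly k items.
        if rng.random() * remaining < needed:
            chosen.append(item)
        remaining -= 1
    return chosen
```

For a proportion target, the count could be derived after the first pass as round(rate × total) and then passed through the same selection loop.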
purge-redundant
Sort peaks by m/z and remove any peak whose m/z differs from the previous
peak by less than epsilon.
| Argument | Type | Default | Description |
|---|---|---|---|
|  | path | required | Input MGF file path |
| --epsilon | float |  | Minimum m/z separation in Da to keep a peak |
| --outfile | path |  | Output MGF file path |
Example:
casanovoutils mgfutils purge-redundant input.mgf --outfile purged.mgf --epsilon 0.005
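The purge rule can be sketched in a few lines of Python. This is an illustration, not the tool's code: the function name and the (m/z, intensity) tuple format are assumptions, and "previous peak" is read here as the most recently kept peak.

```python
def purge_redundant(peaks, epsilon):
    """Drop peaks closer than `epsilon` (Da) to the last kept peak.

    `peaks` is a list of (mz, intensity) tuples (an illustrative format).
    Assumption: each peak is compared against the most recently KEPT peak.
    """
    kept = []
    for mz, intensity in sorted(peaks, key=lambda p: p[0]):
        if not kept or mz - kept[-1][0] >= epsilon:
            kept.append((mz, intensity))
    return kept
```

For example, with epsilon 0.01 the peaks at m/z 100.000 and 100.004 collapse to the first one, while a peak at 100.5 is kept.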
casanovoutils mzmlutils
Sample a proportion of spectra from an mzML file and write the result as MGF.
Reads the file in chunks of buffer_size spectra and draws round(k × chunk_size) spectra from each chunk at random, without replacement, in a
single streaming pass. Precursor m/z, charge state, and retention time are
carried through to the output MGF when present in the source file.
| Argument | Type | Default | Description |
|---|---|---|---|
|  | path | required | Input mzML file |
|  | float | required | Proportion of spectra to sample; must be in (0, 1) |
|  | path | required | Output MGF file path (must have |
| --buffer_size | int |  | Spectra read per I/O chunk |
| --random_seed | int |  | Random seed for reproducibility |
Note on count accuracy: the final sample count equals sum(round(k × b) for b in buffers), which can differ slightly from round(k × total) due to per-buffer rounding. Use a buffer_size large relative to 1 / k to minimise this effect.
mzML output: not supported directly. If you need mzML output, convert the MGF result with msConvert.
Examples:
# Sample 10 % of spectra
casanovoutils mzmlutils input.mzML 0.1 sampled.mgf
# Sample 25 % with a 5 000-spectrum buffer
casanovoutils mzmlutils input.mzML 0.25 sampled.mgf --buffer_size 5000
# Reproducible run with a fixed seed
casanovoutils mzmlutils input.mzML 0.5 sampled.mgf --random_seed 123
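The per-buffer rounding effect can be checked with a little arithmetic. The helper below is a hypothetical illustration that only reproduces the sum(round(k × b) for b in buffers) formula. It uses Python's built-in round, and whether the tool rounds ties the same way is an assumption.

```python
def per_buffer_sample_count(total, k, buffer_size):
    """Return sum(round(k * b)) over consecutive buffers of a file.

    `total` is the number of spectra in the file. The last buffer may be
    shorter than `buffer_size`.
    """
    count = 0
    for start in range(0, total, buffer_size):
        chunk = min(buffer_size, total - start)
        count += round(k * chunk)
    return count


# 1050 spectra sampled at k = 0.1:
large = per_buffer_sample_count(1050, 0.1, 1000)  # buffers of 1000 and 50
small = per_buffer_sample_count(1050, 0.1, 4)     # every buffer rounds to 0
```

Here `large` matches round(0.1 × 1050), while `small` is zero because each four-spectrum buffer contributes round(0.4) = 0. This is why buffer_size should be large relative to 1 / k.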
casanovoutils denovoutils
Load and join PSM data from MGF and mzTab files into Polars DataFrames.
get_mgf_psms
Load spectrum metadata from an MGF file.
| Argument | Type | Default | Description |
|---|---|---|---|
|  | path | required | Input MGF file |
| --out_path | path |  | Output file path ( |
|  | bool |  | Exclude m/z and intensity arrays from output |
Example:
casanovoutils denovoutils get_mgf_psms input.mgf --out_path psms.parquet
get_mztab
Load the spectrum match table from an mzTab file.
| Argument | Type | Default | Description |
|---|---|---|---|
|  | path | required | Input mzTab file |
| --out_path | path |  | Output file path ( |
Example:
casanovoutils denovoutils get_mztab results.mztab --out_path matches.parquet
get_groundtruth
Join MGF PSM metadata with mzTab predictions into a single DataFrame.
| Argument | Type | Default | Description |
|---|---|---|---|
|  | path | required | Input MGF file |
|  | path | required | Input mzTab file |
| --out_path | path |  | Output file path ( |
Example:
casanovoutils denovoutils get_groundtruth input.mgf results.mztab \
--out_path groundtruth.parquet
casanovoutils preccov
Compute and plot precision-coverage curves from PSM predictions.
get_pc_df
Build a precision-coverage DataFrame from predicted and ground-truth PSMs. Accepts either a pre-built ground-truth DataFrame or the MGF and mzTab DataFrames it would be built from.
| Argument | Type | Default | Description |
|---|---|---|---|
|  | path |  | Pre-built ground-truth DataFrame |
| --mgf_df | path |  | MGF PSM DataFrame (required if |
| --mztab_df | path |  | mzTab DataFrame (required if |
|  | path |  | Custom residue mass YAML; uses bundled file if omitted |
|  | bool |  | Treat I and L as equivalent |
|  | bool |  | Compute per-amino-acid rather than per-peptide metrics |
| --out_path | path |  | Output file path for the resulting DataFrame |
Example:
casanovoutils preccov get_pc_df \
--mgf_df psms.parquet --mztab_df matches.parquet \
--out_path pc.parquet
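For readers unfamiliar with precision-coverage curves, the following Python sketch shows the usual peptide-level construction: rank predictions by confidence, then report precision at each coverage level. It is not the get_pc_df implementation; the function names, inputs, and the simple string replacement for I/L equivalence are assumptions for illustration.

```python
def same_peptide(predicted, truth, replace_i_l=True):
    """Compare two peptide strings, optionally treating I and L as equal."""
    if replace_i_l:
        predicted = predicted.replace("I", "L")
        truth = truth.replace("I", "L")
    return predicted == truth


def precision_coverage(scores, correct):
    """Return a list of (coverage, precision) points.

    `scores` are prediction confidences and `correct` are matching booleans.
    Predictions are ranked from most to least confident.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    curve = []
    n_correct = 0
    for rank, i in enumerate(order, start=1):
        n_correct += bool(correct[i])
        coverage = rank / len(scores)  # fraction of spectra predicted so far
        precision = n_correct / rank   # fraction of those that are correct
        curve.append((coverage, precision))
    return curve
```

Each point answers the question: if only the top fraction of predictions by score were accepted, how many of them would be right?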
graph_prec_cov
Plot precision-coverage curves from one or more pre-computed DataFrames. Each file is plotted as a separate series labelled by its file stem.
| Argument | Type | Default | Description |
|---|---|---|---|
|  | path(s) | required | One or more precision-coverage DataFrames |
| --out_path | path |  | Save the figure to this path (e.g. |
Example:
casanovoutils preccov graph_prec_cov run1.parquet run2.parquet \
--out_path comparison.png
casanovoutils summarize_mgf
Generate per-file statistics and visualisations for MGF files.
summarize
Produce a self-contained HTML report for an MGF file covering charge distribution, peak counts, peptide lengths, and fragment ion coverage.
| Argument | Type | Default | Description |
|---|---|---|---|
|  | path | required | Input MGF file |
| --output_root | path |  | Output directory; HTML file shares this basename |
| --tolerance | float |  | Fragment mass tolerance |
| --tolerance_unit | str |  | Tolerance unit: |
| --workers | int |  | Parallel worker processes for coverage annotation |
|  | str |  | Max fragment charge: |
|  | bool |  | Include neutral losses in annotation |
Example:
casanovoutils summarize_mgf summarize input.mgf --output_root my_report \
--tolerance 10 --tolerance_unit ppm --workers 4
charge-distribution
Count and plot the charge state distribution across all spectra.
| Argument | Type | Default | Description |
|---|---|---|---|
|  | path | required | Input MGF file |
| --output_tsv | path |  | Output counts TSV |
| --output_plot | path |  | Output bar chart |
Example:
casanovoutils summarize_mgf charge-distribution input.mgf \
--output_tsv charges.tsv --output_plot charges.png
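Counting charge states amounts to tallying the CHARGE= field of each spectrum block. The sketch below is a simplified illustration, not the tool's parser: the function name is hypothetical, it only reads CHARGE= lines between BEGIN IONS and END IONS, and it takes the first state when several are listed.

```python
from collections import Counter


def charge_distribution(mgf_lines):
    """Count precursor charge states from the lines of an MGF file."""
    counts = Counter()
    in_ions = False
    for raw in mgf_lines:
        line = raw.strip()
        if line == "BEGIN IONS":
            in_ions = True
        elif line == "END IONS":
            in_ions = False
        elif in_ions and line.upper().startswith("CHARGE="):
            tokens = line.split("=", 1)[1].split()
            if not tokens:
                continue  # empty CHARGE= field
            token = tokens[0]  # e.g. "2+" from "2+ and 3+"
            sign = -1 if token.endswith("-") else 1
            counts[sign * int(token.rstrip("+-"))] += 1
    return counts
```

The resulting Counter maps each charge state to its spectrum count, which corresponds to the rows of a counts table.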
fragment-coverage
Annotate spectra with b/y ions and report the fraction of total intensity covered by matched fragments.
| Argument | Type | Default | Description |
|---|---|---|---|
|  | path | required | Annotated MGF file (requires |
| --tolerance | float |  | Mass tolerance |
| --tolerance_unit | str |  | Tolerance unit: |
|  | path |  | Summary TSV |
|  | path |  | Per-spectrum TSV |
|  | path |  | Coverage histogram |
| --workers | int |  | Parallel worker processes |
|  | str |  | Max fragment charge: |
|  | bool |  | Include neutral losses |
Example:
casanovoutils summarize_mgf fragment-coverage input.mgf \
--tolerance 10 --tolerance_unit ppm --workers 4
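To make "fraction of total intensity covered" concrete, here is a small Python sketch that matches peaks against singly charged b and y ions. It is an illustration under stated assumptions: the function names are hypothetical, only a handful of unmodified residues are included, neutral losses and higher fragment charges are ignored, and any unit other than "ppm" is treated as an absolute tolerance in Da.

```python
# Monoisotopic residue masses (Da) for a few amino acids; extend as needed.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "I": 113.08406,
    "K": 128.09496, "E": 129.04259, "R": 156.10111,
}
PROTON, WATER = 1.007276, 18.010565


def by_ions(peptide):
    """Return m/z values of singly charged b and y ions for `peptide`."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    ions = []
    for i in range(1, len(masses)):
        ions.append(sum(masses[:i]) + PROTON)          # b ion (prefix)
        ions.append(sum(masses[i:]) + WATER + PROTON)  # y ion (suffix)
    return ions


def intensity_coverage(peptide, peaks, tolerance, unit="ppm"):
    """Fraction of summed intensity in peaks that match a b or y ion."""
    ions = by_ions(peptide)
    total = sum(intensity for _, intensity in peaks)
    matched = 0.0
    for mz, intensity in peaks:
        # Relative window for ppm; otherwise an absolute window in Da.
        window = tolerance * mz / 1e6 if unit == "ppm" else tolerance
        if any(abs(mz - ion) <= window for ion in ions):
            matched += intensity
    return matched / total if total else 0.0
```

For the dipeptide GA, peaks near m/z 58.0287 (b1) and 90.0550 (y1) count as matched, so a spectrum with intensities 10, 30, and an unmatched 60 has a coverage of 0.4.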
peak-counts
Histogram of the number of peaks per spectrum.
| Argument | Type | Default | Description |
|---|---|---|---|
|  | path | required | Input MGF file |
|  | path |  | Output counts TSV |
|  | path |  | Output histogram |
Example:
casanovoutils summarize_mgf peak-counts input.mgf
peptide-lengths
Histogram of peptide sequence lengths for annotated spectra (requires SEQ=).
| Argument | Type | Default | Description |
|---|---|---|---|
|  | path | required | Input MGF file |
|  | path |  | Output counts TSV |
|  | path |  | Output histogram |
Example:
casanovoutils summarize_mgf peptide-lengths input.mgf
casanovoutils datasets
Create peptide-level train/validation/test splits from annotated MGF files.
Peptides are split 80 / 10 / 10 by unique sequence to prevent leakage between
splits. Outputs three MGF files: <output_root>.train.mgf, <output_root>.val.mgf, and
<output_root>.test.mgf.
| Argument | Type | Default | Description |
|---|---|---|---|
|  | path(s) | required | One or more annotated MGF files |
| --output_root | str | required | Base path for output files |
| --spectra_per_peptide | int |  | Cap spectra per peptide from new input files |
|  | int |  | Random seed for reproducibility |
|  | bool |  | Overwrite existing output files |
|  | paths |  | Tuple of existing (train, val, test) MGF paths to extend |
|  | bool |  | Include existing spectra in output alongside new ones |
Examples:
# Basic split
casanovoutils datasets input.mgf --output_root splits/run1
# Multiple input files, cap at 3 spectra per peptide
casanovoutils datasets a.mgf b.mgf --output_root splits/combined \
--spectra_per_peptide 3
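Splitting by unique peptide rather than by spectrum is what prevents leakage: every spectrum of a given sequence lands in the same split. A minimal Python sketch of that idea follows. The function name and the truncating arithmetic are assumptions, and the tool's exact assignment procedure may differ.

```python
import random


def split_peptides(peptides, seed=None, fractions=(0.8, 0.1, 0.1)):
    """Assign each unique peptide to "train", "val", or "test".

    Returns a dict mapping peptide -> split name. Spectra are then routed
    to the split of their peptide, so no sequence spans two splits.
    """
    unique = sorted(set(peptides))  # sort first so the seed is reproducible
    random.Random(seed).shuffle(unique)
    n_train = int(fractions[0] * len(unique))
    n_val = int(fractions[1] * len(unique))
    groups = (
        unique[:n_train],
        unique[n_train:n_train + n_val],
        unique[n_train + n_val:],   # the remainder goes to test
    )
    return {
        peptide: name
        for name, group in zip(("train", "val", "test"), groups)
        for peptide in group
    }
```

Because the ratios apply to unique sequences, the spectrum counts per split can deviate from 80 / 10 / 10 when some peptides have many more spectra than others.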
casanovoutils graphloss
Read Casanovo log files and/or metrics.csv files and plot training and
validation loss curves.
| Argument | Type | Default | Description |
|---|---|---|---|
|  | str | required | Output file root; plot saved as |
|  | path(s) | required | One or more Casanovo log or |
| --max_y | float |  | Optional y-axis maximum |
Example:
casanovoutils graphloss run1_plot run1.log run2_metrics.csv --max_y 2.0
casanovoutils residues
Copy the bundled residue mass YAML file to a specified path. The file can
then be edited to add custom modifications or non-standard residues and passed
back to other tools via --residues_path.
| Argument | Type | Default | Description |
|---|---|---|---|
|  | path | required | Destination path for the YAML file |
Example:
casanovoutils residues my_residues.yaml