casanovoutils.datasets

Create train/validation/test dataset splits from annotated MGF files.

Attributes

Functions

create_datasets(→ None)

Create peptide-level train/validation/test splits from annotated MGF files.

main(→ None)

CLI entry point for create-datasets.

Module Contents

casanovoutils.datasets.create_datasets(*mgf_files: os.PathLike, output_root: str, spectra_per_peptide: int | None = None, random_seed: int = 42, overwrite: bool = False, existing_splits: tuple[os.PathLike, os.PathLike, os.PathLike] | None = None, combine_with_existing: bool = False) → None

Create peptide-level train/validation/test splits from annotated MGF files.

All spectra from the input MGF files are combined and grouped by peptide sequence. The unique peptides are randomly split into training (80%), validation (10%), and test (10%) sets. Spectra are then assigned to splits based on their associated peptide, ensuring no peptide-level leakage between splits.
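The grouping and splitting described above can be sketched in plain Python. This is an illustration only, not the library's implementation: `split_by_peptide` is a hypothetical helper, spectra are represented as minimal dictionaries with the sequence in `spectrum["params"]["seq"]`, and the way fractional split sizes are rounded is an assumption.

```python
import random
from collections import defaultdict


def split_by_peptide(spectra, random_seed=42):
    """Assign spectra to train/val/test so each peptide lands in one split.

    Hypothetical helper for illustration; not part of casanovoutils.
    """
    # Group spectra by their annotated peptide sequence.
    by_peptide = defaultdict(list)
    for spectrum in spectra:
        by_peptide[spectrum["params"]["seq"]].append(spectrum)

    # Shuffle the unique peptides reproducibly. Sorting first makes the
    # result independent of the order in which spectra were read.
    peptides = sorted(by_peptide)
    random.Random(random_seed).shuffle(peptides)

    # 80/10/10 split over peptides, not over spectra.
    # Assumption: sizes are rounded down and the remainder goes to test.
    n_train = len(peptides) * 8 // 10
    n_val = len(peptides) // 10
    groups = {
        "train": peptides[:n_train],
        "val": peptides[n_train:n_train + n_val],
        "test": peptides[n_train + n_val:],
    }

    # Every spectrum follows its peptide into that peptide's split.
    return {
        name: [s for pep in peps for s in by_peptide[pep]]
        for name, peps in groups.items()
    }


# Toy input: 10 made-up peptides with 3 spectra each.
spectra = [
    {"params": {"seq": f"PEPTIDE{i}"}, "scan": j}
    for i in range(10)
    for j in range(3)
]
splits = split_by_peptide(spectra)
```

Because the split is made over peptides, the share of spectra in each split can deviate from 80/10/10 when peptides have unequal numbers of spectra.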

Parameters:
  • *mgf_files (PathLike) – One or more paths to annotated MGF files. Each spectrum must contain the peptide sequence in spectrum["params"]["seq"].

  • output_root (str) – Root path for the output files. Three MGF files will be created: <output_root>.train.mgf, <output_root>.val.mgf, and <output_root>.test.mgf. A log file <output_root>.log will also be created.

  • spectra_per_peptide (int, optional) – If provided, randomly select at most this many spectra for each peptide from the new input files. When combine_with_existing is True, existing spectra are not subject to this cap. By default all spectra are retained.

  • random_seed (int, default=42) – Random seed for reproducible splitting and sampling.

  • overwrite (bool, default=False) – If False, raise an error when any output file already exists. If True, overwrite existing output files.

  • existing_splits (tuple of PathLike, optional) – A tuple of three MGF file paths (train, validation, test) containing pre-existing splits. Peptides from new input files that already appear in an existing split are routed to that same split.

  • combine_with_existing (bool, default=False) – If True, output MGF files include both existing and new spectra. If False, only new spectra are written.
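To make the interaction between spectra_per_peptide and existing_splits concrete, the following is a minimal sketch under stated assumptions. `cap_spectra` and `route_known_peptides` are hypothetical helpers, not functions of casanovoutils, and the existing splits are represented here as sets of peptide sequences rather than MGF files. How peptides that appear in no existing split are then distributed is not covered by this sketch.

```python
import random
from collections import defaultdict


def cap_spectra(spectra, spectra_per_peptide, random_seed=42):
    """Keep at most `spectra_per_peptide` randomly chosen spectra per peptide.

    Hypothetical helper; passing None keeps every spectrum.
    """
    rng = random.Random(random_seed)
    by_peptide = defaultdict(list)
    for spectrum in spectra:
        by_peptide[spectrum["params"]["seq"]].append(spectrum)

    kept = []
    for peptide in sorted(by_peptide):
        group = by_peptide[peptide]
        if spectra_per_peptide is not None and len(group) > spectra_per_peptide:
            group = rng.sample(group, spectra_per_peptide)
        kept.extend(group)
    return kept


def route_known_peptides(new_spectra, existing):
    """Send spectra whose peptide is already in a split to that same split.

    `existing` maps a split name to the set of peptide sequences it holds.
    Returns (routed, unassigned); unassigned spectra belong to new peptides.
    """
    peptide_to_split = {
        pep: name for name, peps in existing.items() for pep in peps
    }
    routed = {name: [] for name in existing}
    unassigned = []
    for spectrum in new_spectra:
        split = peptide_to_split.get(spectrum["params"]["seq"])
        if split is None:
            unassigned.append(spectrum)
        else:
            routed[split].append(spectrum)
    return routed, unassigned


# Toy input: 4 spectra for AAK, 2 for CCR, 1 for DDK.
sequences = ["AAK"] * 4 + ["CCR"] * 2 + ["DDK"]
new_spectra = [
    {"params": {"seq": seq}, "scan": i} for i, seq in enumerate(sequences)
]

# The cap applies to the new spectra only.
capped = cap_spectra(new_spectra, spectra_per_peptide=2)

# AAK is already in train and DDK in test; CCR is a new peptide.
existing = {"train": {"AAK"}, "val": set(), "test": {"DDK"}}
routed, unassigned = route_known_peptides(capped, existing)
```

With the real function, the same options are passed as keyword arguments, for example `create_datasets("a.mgf", "b.mgf", output_root="out/run1", spectra_per_peptide=2)`, where the file names and output root are placeholders.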

casanovoutils.datasets.COMMANDS: casanovoutils.types.Commands

casanovoutils.datasets.main() → None

CLI entry point for create-datasets.