casanovoutils.datasets
Create train/validation/test dataset splits from annotated MGF files.
Attributes
| COMMANDS |
Functions
| create_datasets | Create peptide-level train/validation/test splits from annotated MGF files. |
| main | CLI entry point for create-datasets. |
Module Contents
- casanovoutils.datasets.create_datasets(*mgf_files: os.PathLike, output_root: str, spectra_per_peptide: int | None = None, random_seed: int = 42, overwrite: bool = False, existing_splits: tuple[os.PathLike, os.PathLike, os.PathLike] | None = None, combine_with_existing: bool = False) -> None
Create peptide-level train/validation/test splits from annotated MGF files.
All spectra from the input MGF files are combined and grouped by peptide sequence. The unique peptides are randomly split into training (80%), validation (10%), and test (10%) sets. Spectra are then assigned to splits based on their associated peptide, ensuring no peptide-level leakage between splits.
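The grouping and splitting described above can be illustrated with a minimal sketch. It is not the library's implementation: the helper name `split_by_peptide`, the rounding of split sizes, and the toy spectra are assumptions made for illustration. Spectra are represented as plain dictionaries that carry the sequence in `spectrum["params"]["seq"]`.

```python
import random
from collections import defaultdict


def split_by_peptide(spectra, seed=42, fractions=(0.8, 0.1, 0.1)):
    """Hypothetical helper: split spectra so each peptide lands in one split."""
    # Group spectra by their annotated peptide sequence.
    by_peptide = defaultdict(list)
    for spectrum in spectra:
        by_peptide[spectrum["params"]["seq"]].append(spectrum)

    # Sort first so the shuffle depends only on the seed, not on input order.
    peptides = sorted(by_peptide)
    random.Random(seed).shuffle(peptides)

    # Split the unique peptides (not the spectra); the rounding is an assumption.
    n_train = int(fractions[0] * len(peptides))
    n_val = int(fractions[1] * len(peptides))
    groups = {
        "train": peptides[:n_train],
        "val": peptides[n_train:n_train + n_val],
        "test": peptides[n_train + n_val:],
    }

    # Every spectrum follows its peptide into that peptide's split.
    return {
        name: [s for pep in peps for s in by_peptide[pep]]
        for name, peps in groups.items()
    }


# Toy input: 10 made-up peptides with two spectra each.
toy_spectra = [
    {"params": {"seq": f"PEPTIDE{i}", "title": f"scan_{i}_{j}"}}
    for i in range(10)
    for j in range(2)
]
splits = split_by_peptide(toy_spectra)
```

Because the split is made over unique peptides, the 80/10/10 proportions hold for peptides; the spectrum counts per split can deviate when peptides have unequal numbers of spectra.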
- Parameters:
*mgf_files (PathLike) – One or more paths to annotated MGF files. Each spectrum must contain the peptide sequence in spectrum["params"]["seq"].
output_root (str) – Root path for the output files. Three MGF files will be created: <output_root>.train.mgf, <output_root>.val.mgf, and <output_root>.test.mgf. A log file <output_root>.log will also be created.
spectra_per_peptide (int, optional) – If provided, randomly select at most this many spectra for each peptide from the new input files. When combine_with_existing is True, existing spectra are not subject to this cap. By default all spectra are retained.
random_seed (int, default=42) – Random seed for reproducible splitting and sampling.
overwrite (bool, default=False) – If False, raise an error when any output file already exists. If True, overwrite existing output files.
existing_splits (tuple of PathLike, optional) – A tuple of three MGF file paths (train, validation, test) containing pre-existing splits. Peptides from new input files that already appear in an existing split are routed to that same split.
combine_with_existing (bool, default=False) – If True, output MGF files include both existing and new spectra. If False, only new spectra are written.
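A call using the parameters documented above might look like the following sketch. The wrapper `build_splits`, the cap of 5 spectra per peptide, and the choice to enable `combine_with_existing` whenever existing splits are supplied are illustrative assumptions, not recommendations from the library. Only the keyword names and the output file naming come from the signature and parameter descriptions.

```python
from pathlib import Path


def build_splits(mgf_paths, output_root, existing=None):
    """Hypothetical wrapper around create_datasets; returns the expected outputs."""
    # Imported lazily so this sketch can be loaded without the package installed.
    from casanovoutils.datasets import create_datasets

    create_datasets(
        *mgf_paths,                      # one or more annotated MGF files
        output_root=output_root,         # prefix for the .train/.val/.test.mgf files
        spectra_per_peptide=5,           # illustrative cap on new spectra per peptide
        random_seed=42,                  # documented default, stated explicitly
        overwrite=False,                 # raise an error if outputs already exist
        existing_splits=existing,        # (train, val, test) paths, or None
        combine_with_existing=existing is not None,  # assumption made for this sketch
    )

    # Output names follow the pattern given for output_root.
    return [Path(f"{output_root}.{name}.mgf") for name in ("train", "val", "test")]
```

For example, `build_splits(["run1.mgf", "run2.mgf"], "out/dataset")` would be expected to produce `out/dataset.train.mgf`, `out/dataset.val.mgf`, and `out/dataset.test.mgf`, plus the log file `out/dataset.log`; the input file names here are placeholders.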
- casanovoutils.datasets.COMMANDS: casanovoutils.types.Commands
- casanovoutils.datasets.main() -> None
CLI entry point for create-datasets.