casanovoutils.datasets
======================

.. py:module:: casanovoutils.datasets

.. autoapi-nested-parse::

   Create train/validation/test dataset splits from annotated MGF files.



Attributes
----------

.. autoapisummary::

   casanovoutils.datasets.COMMANDS


Functions
---------

.. autoapisummary::

   casanovoutils.datasets.create_datasets
   casanovoutils.datasets.main


Module Contents
---------------

.. py:function:: create_datasets(*mgf_files: os.PathLike, output_root: str, spectra_per_peptide: Optional[int] = None, random_seed: int = 42, overwrite: bool = False, existing_splits: Optional[tuple[os.PathLike, os.PathLike, os.PathLike]] = None, combine_with_existing: bool = False) -> None

   Create peptide-level train/validation/test splits from annotated MGF files.

   All spectra from the input MGF files are combined and grouped by peptide
   sequence. The unique peptides are randomly split into training (80%),
   validation (10%), and test (10%) sets. Spectra are then assigned to splits
   based on their associated peptide, ensuring no peptide-level leakage
   between splits.
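
   The grouping and splitting described above can be sketched as follows. This
   is a minimal illustration and not the module's actual implementation: the
   helper name ``split_by_peptide``, the integer rounding of the 80/10/10
   boundaries, and the toy spectra are all assumptions made for the example.

   .. code-block:: python

      import random
      from collections import defaultdict


      def split_by_peptide(spectra, random_seed=42):
          """Hypothetical helper: put all spectra of a peptide in one split."""
          # Group spectra by the peptide sequence stored in params["seq"].
          by_peptide = defaultdict(list)
          for spectrum in spectra:
              by_peptide[spectrum["params"]["seq"]].append(spectrum)

          # Shuffle the unique peptides with a seeded generator; sorting first
          # makes the result independent of input order.
          peptides = sorted(by_peptide)
          random.Random(random_seed).shuffle(peptides)

          # Assumed rounding: floor the 80% and 10% boundaries, rest goes to test.
          n_train = len(peptides) * 8 // 10
          n_val = len(peptides) // 10
          groups = {
              "train": peptides[:n_train],
              "val": peptides[n_train:n_train + n_val],
              "test": peptides[n_train + n_val:],
          }

          # Spectra follow their peptide, so no peptide appears in two splits.
          return {
              name: [s for pep in peps for s in by_peptide[pep]]
              for name, peps in groups.items()
          }


      # Toy input: 10 made-up peptides with 2 spectra each.
      spectra = [
          {"params": {"seq": f"PEPTIDE{i}"}, "id": f"{i}-{j}"}
          for i in range(10)
          for j in range(2)
      ]
      splits = split_by_peptide(spectra)

   With this toy input the peptides divide 8/1/1, so the splits hold 16, 2,
   and 2 spectra. Because the split is made over peptides and not spectra,
   the spectrum-level proportions only approximate 80/10/10 when peptides
   have unequal numbers of spectra.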

   :param \*mgf_files: One or more paths to annotated MGF files. Each spectrum must contain
                       the peptide sequence in ``spectrum["params"]["seq"]``.
   :type \*mgf_files: PathLike
   :param output_root: Root path for the output files. Three MGF files will be created:
                       ``<output_root>.train.mgf``, ``<output_root>.val.mgf``, and
                       ``<output_root>.test.mgf``. A log file ``<output_root>.log`` will
                       also be created.
   :type output_root: str
   :param spectra_per_peptide: If provided, randomly select at most this many spectra for each
                               peptide from the new input files. When ``combine_with_existing``
                               is True, existing spectra are not subject to this cap. If ``None``
                               (the default), all spectra are retained.
   :type spectra_per_peptide: int, optional
   :param random_seed: Random seed for reproducible splitting and sampling.
   :type random_seed: int, default=42
   :param overwrite: If False, raise an error when any output file already exists.
                     If True, overwrite existing output files.
   :type overwrite: bool, default=False
   :param existing_splits: A tuple of three MGF file paths, in the order (train, validation, test),
                           containing pre-existing splits. Peptides from new input files that
                           already appear in an existing split are routed to that same split.
   :type existing_splits: tuple of PathLike, optional
   :param combine_with_existing: If True, output MGF files include both existing and new spectra.
                                 If False, only new spectra are written.
   :type combine_with_existing: bool, default=False
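
   The interaction between ``spectra_per_peptide`` and ``existing_splits`` can
   be illustrated with the sketch below. It is an assumption-laden example and
   not the module's code: ``cap_spectra`` and ``route_known_peptides`` are
   hypothetical helpers, the existing splits are represented as plain sets of
   peptide sequences, and the order in which capping and routing happen is
   assumed.

   .. code-block:: python

      import random
      from collections import defaultdict


      def cap_spectra(spectra, spectra_per_peptide, random_seed=42):
          """Hypothetical helper: keep at most N random spectra per peptide."""
          rng = random.Random(random_seed)
          by_peptide = defaultdict(list)
          for spectrum in spectra:
              by_peptide[spectrum["params"]["seq"]].append(spectrum)

          kept = []
          for peptide in sorted(by_peptide):
              group = by_peptide[peptide]
              # Only sample when the peptide exceeds the cap.
              if len(group) > spectra_per_peptide:
                  group = rng.sample(group, spectra_per_peptide)
              kept.extend(group)
          return kept


      def route_known_peptides(spectra, existing):
          """Hypothetical helper: send known peptides to their existing split."""
          routed = {name: [] for name in existing}
          unassigned = []
          for spectrum in spectra:
              peptide = spectrum["params"]["seq"]
              for name, peptides in existing.items():
                  if peptide in peptides:
                      routed[name].append(spectrum)
                      break
              else:
                  # Peptide not seen before; it still needs a split.
                  unassigned.append(spectrum)
          return routed, unassigned


      # Toy data: "AAK" and "GGR" already belong to existing splits.
      new_spectra = [
          {"params": {"seq": "AAK"}},
          {"params": {"seq": "AAK"}},
          {"params": {"seq": "AAK"}},
          {"params": {"seq": "GGR"}},
          {"params": {"seq": "LLK"}},
      ]
      existing = {"train": {"AAK"}, "val": {"GGR"}, "test": set()}

      capped = cap_spectra(new_spectra, spectra_per_peptide=2)
      routed, unassigned = route_known_peptides(capped, existing)

   Here the cap drops one of the three ``AAK`` spectra, the remaining two go
   to the training split, ``GGR`` goes to validation, and ``LLK`` is left
   unassigned. How previously unseen peptides are divided among the splits is
   not stated in the parameter descriptions, so the sketch leaves them
   unassigned.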


.. py:data:: COMMANDS
   :type: casanovoutils.types.Commands

.. py:function:: main() -> None

   Command-line entry point for the ``create-datasets`` command.


