casanovoutils.mzmlutils
=======================

.. py:module:: casanovoutils.mzmlutils

.. autoapi-nested-parse::

   Utilities for reading and sampling spectra from mzML files.

   Provides a single ``sample_spectra`` command that makes one streaming pass
   through an mzML file, sampling a proportion ``k`` of spectra from each
   read buffer.  Output is written in MGF format.  A ``main`` entry point
   exposes commands as CLI subcommands via ``fire``.



Attributes
----------

.. autoapisummary::

   casanovoutils.mzmlutils.COMMANDS


Functions
---------

.. autoapisummary::

   casanovoutils.mzmlutils.sample_mzml
   casanovoutils.mzmlutils.sample_spectra
   casanovoutils.mzmlutils.main


Module Contents
---------------

.. py:function:: sample_mzml(input_file: os.PathLike, k: float, buffer_size: int = 1000, random_seed: int = 42) -> list[casanovoutils.types.PyteomicsSpectrum]

   Sample a proportion of spectra from an mzML file in a single streaming pass.

   Reads the file in chunks of ``buffer_size`` spectra.  From each chunk,
   ``round(k * chunk_size)`` spectra are drawn without replacement using
   ``random.sample``.  Neither a second pass nor a total spectrum count is
   required.

   Note: the final sample count equals the sum of ``round(k * chunk_size)``
   over all chunks, which may differ slightly from ``round(k * total)`` due
   to per-buffer rounding.  Use a ``buffer_size`` that is large relative to
   ``1 / k`` to minimise this effect.
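
   The per-buffer rounding described above can be illustrated with a minimal
   sketch.  It is not the module's implementation: ``sample_per_buffer`` is a
   hypothetical helper that works on any iterable rather than an mzML reader,
   and seeding a local ``random.Random`` instance is an assumption made here
   for reproducibility.

   .. code-block:: python

      import itertools
      import random
      from typing import Iterable, List, TypeVar

      T = TypeVar("T")


      def sample_per_buffer(
          items: Iterable[T],
          k: float,
          buffer_size: int = 1000,
          random_seed: int = 42,
      ) -> List[T]:
          """Hypothetical stand-in: sample round(k * chunk_size) items per chunk."""
          rng = random.Random(random_seed)  # assumption: local, seeded RNG
          iterator = iter(items)
          sampled: List[T] = []
          while True:
              # Read the next chunk of at most buffer_size items.
              chunk = list(itertools.islice(iterator, buffer_size))
              if not chunk:
                  break
              # Draw without replacement; the count is rounded per chunk.
              n_draw = round(k * len(chunk))
              sampled.extend(rng.sample(chunk, n_draw))
          return sampled


      # With k = 0.1, a buffer of 4 items rounds 0.4 down to 0 for every chunk,
      # while a single buffer of 20 items yields round(2.0) = 2 samples.
      small = sample_per_buffer(range(20), k=0.1, buffer_size=4)
      large = sample_per_buffer(range(20), k=0.1, buffer_size=20)
      print(len(small), len(large))  # prints "0 2"

   In this toy case a buffer much smaller than ``1 / k`` produces no samples
   at all, which is the extreme form of the discrepancy noted above.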

   :param input_file: Path to the input mzML file.
   :type input_file: PathLike
   :param k: Proportion of spectra to sample; must be in (0, 1).
   :type k: float
   :param buffer_size: Number of spectra read per I/O chunk.
   :type buffer_size: int, default=1000
   :param random_seed: Seed for reproducible sampling.
   :type random_seed: int, default=42

   :returns: Sampled spectra in file order within each buffer.
   :rtype: list[PyteomicsSpectrum]


.. py:function:: sample_spectra(input_file: os.PathLike, k: float, outfile: os.PathLike, buffer_size: int = 1000, random_seed: int = 42) -> None

   Sample spectra from an mzML file and write them to an MGF file.

   Note: mzML output is not implemented because writing mzML is cumbersome.
   If you need mzML output, run this command and then convert the resulting
   MGF file to mzML with msconvert.

   :param input_file: Path to the input mzML file.
   :type input_file: PathLike
   :param k: Proportion of spectra to sample; must be in (0, 1).
   :type k: float
   :param outfile: Output path; must have a ``.mgf`` extension.
   :type outfile: PathLike
   :param buffer_size: Number of spectra read per I/O chunk.
   :type buffer_size: int, default=1000
   :param random_seed: Seed for reproducible sampling.
   :type random_seed: int, default=42


.. py:data:: COMMANDS
   :type: casanovoutils.types.Commands

.. py:function:: main() -> None

