casanovoutils.mgfutils
======================

.. py:module:: casanovoutils.mgfutils

.. autoapi-nested-parse::

   Utilities for reading, writing, and processing MGF spectrum files.

   Provides functions to iterate over spectra from MGF files or in-memory
   dicts, downsample by peptide, shuffle, and purge near-duplicate peaks.
   A ``pipeline`` function chains these stages, and a ``main`` entry point
   exposes them all as CLI subcommands via ``fire``.


Attributes
----------

.. autoapisummary::

   casanovoutils.mgfutils.SpectraInput
   casanovoutils.mgfutils.COMMANDS


Functions
---------

.. autoapisummary::

   casanovoutils.mgfutils.iter_spectra
   casanovoutils.mgfutils.get_pep_dict_mgf
   casanovoutils.mgfutils.write_spectra
   casanovoutils.mgfutils.downsample
   casanovoutils.mgfutils.remove_redundant_peaks
   casanovoutils.mgfutils.purge_redundant
   casanovoutils.mgfutils.shuffle
   casanovoutils.mgfutils.pipeline
   casanovoutils.mgfutils.spectra_per_peptide
   casanovoutils.mgfutils.downsample_spectra
   casanovoutils.mgfutils.main


Module Contents
---------------

.. py:data:: SpectraInput

.. py:function:: iter_spectra(spectra: SpectraInput, desc: Optional[str] = None, miniters: int = 1) -> Iterable[casanovoutils.types.PyteomicsSpectrum]

   Normalize various spectrum input types to an iterable of PyteomicsSpectrum.

   :param spectra: One of:

                   - A single path to an MGF file.
                   - An iterable of paths to MGF files.
                   - An iterable of Pyteomics spectrum dictionaries.
   :type spectra: PathLike, Iterable[PathLike], or Iterable[PyteomicsSpectrum]
   :param desc: Description for the tqdm progress bar. If ``None``, no progress bar
                is shown.
   :type desc: str, optional
   :param miniters: Minimum number of iterations between progress bar updates.
   :type miniters: int, default=1

   :Yields: *PyteomicsSpectrum* -- Spectrum dictionaries, one at a time.


.. py:function:: get_pep_dict_mgf(spectra: SpectraInput) -> dict[str, list[casanovoutils.types.PyteomicsSpectrum]]

   Read spectra and group them by peptide sequence.

   :param spectra: Spectrum source — see :func:`iter_spectra` for accepted types.
   :type spectra: PathLike, Iterable[PathLike], or Iterable[PyteomicsSpectrum]

   :returns: A dictionary mapping peptide sequence strings (taken from
             ``spectrum["params"]["seq"]``) to a list of Pyteomics spectrum
             dictionaries corresponding to that sequence.
   :rtype: dict[str, list[PyteomicsSpectrum]]


.. py:function:: write_spectra(spectra: Iterable[casanovoutils.types.PyteomicsSpectrum], outfile: Optional[os.PathLike]) -> None

   Write spectra to an MGF file, if an output path is provided.

   :param spectra: Spectra to write.
   :type spectra: Iterable[PyteomicsSpectrum]
   :param outfile: Destination MGF file path. If ``None``, this function is a no-op.
   :type outfile: PathLike, optional


.. py:function:: downsample(spectra: SpectraInput, k: int = 1, outfile: Optional[os.PathLike] = None, random_seed: int = 42) -> list[casanovoutils.types.PyteomicsSpectrum]

   Downsample spectra by limiting the number of PSMs per peptide sequence.

   Spectra are grouped by peptide sequence, then up to ``k`` spectra are
   randomly sampled for each unique peptide. If ``outfile`` is provided, the
   result is also written to disk in MGF format.

   :param spectra: Spectrum source — see :func:`iter_spectra` for accepted types.
   :type spectra: PathLike, Iterable[PathLike], or Iterable[PyteomicsSpectrum]
   :param k: Maximum number of spectra (PSMs) to retain per unique peptide sequence.
   :type k: int, default=1
   :param outfile: If provided, write the downsampled spectra to this MGF file path.
   :type outfile: PathLike, optional
   :param random_seed: Random seed for reproducible sampling.
   :type random_seed: int, default=42

   :returns: Downsampled spectra; each peptide sequence appears at most ``k`` times.
   :rtype: list[PyteomicsSpectrum]
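
   The group-then-sample behaviour described above can be sketched in plain Python. This is an illustration under stated assumptions, not the module's implementation: the helper name ``downsample_sketch`` is hypothetical, and seeding a local ``random.Random`` is assumed rather than confirmed by the source.

   .. code-block:: python

      import random
      from collections import defaultdict


      def downsample_sketch(spectra, k=1, random_seed=42):
          """Keep at most k spectra per peptide sequence (hypothetical helper)."""
          rng = random.Random(random_seed)  # local RNG; an assumption
          groups = defaultdict(list)
          for spectrum in spectra:
              # The grouping key follows the documented ``params["seq"]`` field.
              groups[spectrum["params"]["seq"]].append(spectrum)
          sampled = []
          for members in groups.values():
              # sample() needs a size no larger than the group itself.
              sampled.extend(rng.sample(members, min(k, len(members))))
          return sampled


      spectra = [{"params": {"seq": "AAA"}, "index": i} for i in range(5)]
      spectra.append({"params": {"seq": "BBB"}, "index": 5})
      result = downsample_sketch(spectra, k=2)

   With five spectra for one peptide and one for another, ``k=2`` leaves three spectra in total, and rerunning with the same seed yields the same selection.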


.. py:function:: remove_redundant_peaks(spectrum: casanovoutils.types.PyteomicsSpectrum, eps: float) -> casanovoutils.types.PyteomicsSpectrum

   Remove redundant peaks that are too close together along the m/z axis.

   Peaks are sorted by m/z. Any peak within ``eps`` of the preceding peak
   is discarded, keeping the first peak in each run of close peaks.

   :param spectrum: A spectrum dict as returned by ``pyteomics.mgf.read``, containing
                    ``"m/z array"`` and ``"intensity array"`` keys.
   :type spectrum: PyteomicsSpectrum
   :param eps: Maximum m/z distance between two peaks to be considered redundant.
               Required: the signature declares no default. For comparison,
               :func:`purge_redundant` defaults its ``epsilon`` to the 32-bit
               float machine epsilon (``numpy.finfo(numpy.float32).eps`` ≈ 1.19e-7).
   :type eps: float

   :returns: A new spectrum dict with ``"m/z array"`` and ``"intensity array"``
             replaced by the deduplicated arrays. All other keys are unchanged.
   :rtype: PyteomicsSpectrum
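
   A minimal sketch of the sort-and-filter step, assuming plain Python lists in place of the NumPy arrays that Pyteomics normally returns. The helper name ``remove_close_peaks`` is hypothetical. The sketch also assumes that "preceding peak" means the immediately preceding peak in sorted order, whether or not that peak was kept; the module may instead compare against the last kept peak.

   .. code-block:: python

      def remove_close_peaks(spectrum, eps):
          """Return a copy of spectrum without peaks within eps of the one before."""
          mz = spectrum["m/z array"]
          intensity = spectrum["intensity array"]
          # Indices of the peaks in ascending m/z order.
          order = sorted(range(len(mz)), key=lambda i: mz[i])
          kept = []
          previous_mz = None  # m/z of the immediately preceding sorted peak
          for i in order:
              if previous_mz is None or mz[i] - previous_mz > eps:
                  kept.append(i)
              previous_mz = mz[i]
          # Copy every other key unchanged and replace only the two arrays.
          return {
              **spectrum,
              "m/z array": [mz[i] for i in kept],
              "intensity array": [intensity[i] for i in kept],
          }


      spectrum = {
          "params": {"seq": "PEPTIDE"},
          "m/z array": [200.0, 100.0, 100.25, 100.5, 300.0],
          "intensity array": [5.0, 1.0, 2.0, 3.0, 4.0],
      }
      cleaned = remove_close_peaks(spectrum, eps=0.5)

   Here the run 100.0, 100.25, 100.5 collapses to its first peak, so ``cleaned`` holds the m/z values 100.0, 200.0, and 300.0 with their matching intensities, and the input dict is left untouched.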


.. py:function:: purge_redundant(spectra: SpectraInput, epsilon: float = np.finfo(np.float32).eps, outfile: Optional[os.PathLike] = None) -> list[casanovoutils.types.PyteomicsSpectrum]

   Remove peaks with near-duplicate m/z values from each spectrum.

   For each spectrum, peaks are sorted by m/z and any peak whose m/z differs
   from the previous peak by less than ``epsilon`` is discarded. If
   ``outfile`` is provided, the result is also written to disk in MGF format.

   :param spectra: Spectrum source — see :func:`iter_spectra` for accepted types.
   :type spectra: PathLike, Iterable[PathLike], or Iterable[PyteomicsSpectrum]
   :param epsilon: Minimum m/z separation (in daltons) required to keep a peak.
   :type epsilon: float
   :param outfile: If provided, write the purged spectra to this MGF file path.
   :type outfile: PathLike, optional

   :returns: Spectra with redundant peaks removed and peaks sorted by m/z.
   :rtype: list[PyteomicsSpectrum]


.. py:function:: shuffle(spectra: SpectraInput, outfile: Optional[os.PathLike] = None, random_seed: int = 42) -> list[casanovoutils.types.PyteomicsSpectrum]

   Read all spectra and return them in a shuffled order.

   :param spectra: Spectrum source — see :func:`iter_spectra` for accepted types.
   :type spectra: PathLike, Iterable[PathLike], or Iterable[PyteomicsSpectrum]
   :param outfile: If provided, write the shuffled spectra to this MGF file path.
   :type outfile: PathLike, optional
   :param random_seed: Random seed for reproducible shuffling.
   :type random_seed: int, default=42

   :returns: All spectra in shuffled order.
   :rtype: list[PyteomicsSpectrum]


.. py:function:: pipeline(spectra: SpectraInput, outfile: Optional[os.PathLike] = None, do_shuffle: bool = True, downsample_k: Optional[int] = None, purge_epsilon: Optional[float] = None, random_seed: int = 42) -> list[casanovoutils.types.PyteomicsSpectrum]

   Run spectra through an optional chain of processing stages.

   Stages are applied in order: shuffle → downsample → purge redundant peaks.
   Each stage is skipped when its enabling parameter is ``None`` (or
   ``False`` for ``do_shuffle``).

   :param spectra: Spectrum source — see :func:`iter_spectra` for accepted types.
   :type spectra: PathLike, Iterable[PathLike], or Iterable[PyteomicsSpectrum]
   :param outfile: If provided, write the final spectra to this MGF file path.
   :type outfile: PathLike, optional
   :param do_shuffle: Whether to shuffle the spectra.
   :type do_shuffle: bool, default=True
   :param downsample_k: If provided, downsample to at most this many PSMs per peptide sequence.
   :type downsample_k: int, optional
   :param purge_epsilon: If provided, remove peaks whose m/z differs from the previous peak by
                         less than this value (in daltons).
   :type purge_epsilon: float, optional
   :param random_seed: Random seed passed to shuffle and downsample.
   :type random_seed: int, default=42

   :returns: Processed spectra.
   :rtype: list[PyteomicsSpectrum]


.. py:function:: spectra_per_peptide(spectra: SpectraInput, outfile: Optional[os.PathLike] = None, k: int = 1, precursor: bool = False, ignore_mods: bool = False, random_seed: int = 42) -> list[casanovoutils.types.PyteomicsSpectrum]

   Sample up to k spectra per peptide using reservoir sampling.

   Makes a single streaming pass through *spectra*, maintaining a reservoir
   of size k per unique group.  For the j-th occurrence of a group: if
   j <= k, add unconditionally; if j > k, replace a uniformly random
   reservoir slot with probability k/j.  Memory usage is
   O(unique groups × k) rather than O(total spectra).

   :param spectra: Spectrum source — see :func:`iter_spectra` for accepted types.
   :type spectra: PathLike, Iterable[PathLike], or Iterable[PyteomicsSpectrum]
   :param outfile: If provided, write the sampled spectra to this MGF file path.
   :type outfile: PathLike, optional
   :param k: Maximum number of spectra to retain per group.
   :type k: int, default=1
   :param precursor: If True, group by peptide sequence *and* charge state, so that the
                     same peptide observed in different charge states is treated as
                     separate groups.
   :type precursor: bool, default=False
   :param ignore_mods: If True, strip ProForma bracketed modification annotations (e.g.
                       ``[Acetyl]``, ``[Carbamidomethyl]``) from the sequence before
                       grouping, so modified and unmodified forms of the same peptide are
                       counted together.
   :type ignore_mods: bool, default=False
   :param random_seed: Seed for the local random number generator.
   :type random_seed: int, default=42

   :returns: Sampled spectra, grouped by key in first-seen order.
   :rtype: list[PyteomicsSpectrum]
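
   The per-group reservoir update described above can be sketched as follows. This is an illustration, not the module's code: ``reservoir_per_group`` and its ``key`` argument are hypothetical, and the example groups by ``params["seq"]`` only, leaving out the ``precursor`` and ``ignore_mods`` options.

   .. code-block:: python

      import random


      def reservoir_per_group(items, key, k=1, random_seed=42):
          """Keep up to k items per group in a single streaming pass."""
          rng = random.Random(random_seed)
          reservoirs = {}  # group -> sampled items; dicts keep first-seen order
          seen = {}        # group -> number of occurrences so far
          for item in items:
              group = key(item)
              j = seen.get(group, 0) + 1
              seen[group] = j
              reservoir = reservoirs.setdefault(group, [])
              if j <= k:
                  # The first k occurrences always enter the reservoir.
                  reservoir.append(item)
              elif rng.random() < k / j:
                  # Later occurrences replace a uniformly chosen slot
                  # with probability k/j.
                  reservoir[rng.randrange(k)] = item
          return [item for reservoir in reservoirs.values() for item in reservoir]


      sequences = ["AAA", "BBB", "AAA", "AAA", "CCC", "BBB", "AAA"]
      spectra = [{"params": {"seq": seq}, "index": i}
                 for i, seq in enumerate(sequences)]
      sampled = reservoir_per_group(spectra, key=lambda s: s["params"]["seq"], k=2)

   Only the reservoirs and per-group counters are held in memory, which is where the O(unique groups × k) bound comes from. Which spectra survive within a group depends on the seed, but the group sizes and the first-seen group order do not.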


.. py:function:: downsample_spectra(input_file: os.PathLike, output_file: os.PathLike, downsample_type: str = 'number', downsample_rate: float = 100, random_seed: int = 42) -> None

   Downsample an MGF file to a target number or proportion of spectra.

   Makes two streaming passes: the first counts the total number of spectra,
   and the second accepts each spectrum with an adaptive probability
   (needed/remaining), which guarantees that exactly the target number of
   spectra is written.

   :param input_file: Path to the input MGF file.
   :type input_file: PathLike
   :param output_file: Path for the downsampled output MGF file.  Must differ from
                       *input_file*.
   :type output_file: PathLike
   :param downsample_type: One of ``"number"`` (retain exactly *downsample_rate* spectra) or
                           ``"proportion"`` (retain exactly ``round(total × downsample_rate)``).
   :type downsample_type: str, default ``"number"``
   :param downsample_rate: Target rate.  Positive integer for ``"number"``; in ``(0, 1]`` for
                           ``"proportion"``.
   :type downsample_rate: float, default 100
   :param random_seed: Seed for the random number generator.
   :type random_seed: int, default 42
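
   The needed/remaining acceptance rule is a form of selection sampling and can be sketched on any re-iterable sequence. The sketch below is illustrative only: ``select_exactly`` is a hypothetical helper, a list of integers stands in for the spectra of an MGF file, and clamping the target to the total is an assumption about how oversized requests might be handled.

   .. code-block:: python

      import random


      def select_exactly(items, total, target, random_seed=42):
          """Yield exactly min(target, total) items, preserving input order."""
          rng = random.Random(random_seed)
          needed = min(target, total)  # clamping is an assumption
          for seen, item in enumerate(items):
              if needed == 0:
                  break
              remaining = total - seen  # items not yet examined, including this one
              # When needed == remaining the ratio is 1.0, so every remaining
              # item is accepted and the target count is always reached.
              if rng.random() < needed / remaining:
                  yield item
                  needed -= 1


      records = list(range(1000))
      total = sum(1 for _ in records)  # first pass: count the records
      target = round(total * 0.1)      # "proportion" mode with a rate of 0.1
      kept = list(select_exactly(records, total, target))  # second pass

   Because the acceptance probability rises as the remaining input shrinks, the second pass never has to buffer spectra or backtrack, and the output keeps the order of the input.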


.. py:data:: COMMANDS
   :type:  casanovoutils.types.Commands

.. py:function:: main() -> None