mgfutils

Read, write, and process MGF spectrum files. The module provides a progression of composable functions, from low-level iteration up to a full shuffle → downsample → purge pipeline, and exposes them all as CLI subcommands via casanovoutils mgf.

Provides functions to iterate over spectra from MGF files or in-memory dicts, downsample by peptide, shuffle, and purge near-duplicate peaks. A pipeline function chains these stages, and a main entry point exposes them all as CLI subcommands via fire.

casanovoutils.mgfutils.downsample(spectra: PathLike | Iterable[PathLike] | Iterable[list[dict[str, Any]]], k: int = 1, outfile: PathLike | None = None, random_seed: int = 42) → list[list[dict[str, Any]]]

Downsample spectra by limiting the number of PSMs per peptide sequence.

Spectra are grouped by peptide sequence, then up to k spectra are randomly sampled for each unique peptide. If outfile is provided, the result is also written to disk in MGF format.

Parameters:
  • spectra (PathLike, Iterable[PathLike], or Iterable[PyteomicsSpectrum]) – Spectrum source — see iter_spectra() for accepted types.

  • k (int, default=1) – Maximum number of spectra (PSMs) to retain per unique peptide sequence.

  • outfile (PathLike, optional) – If provided, write the downsampled spectra to this MGF file path.

  • random_seed (int, default=42) – Random seed for reproducible sampling.

Returns:

Downsampled spectra; each peptide sequence appears at most k times.

Return type:

list[PyteomicsSpectrum]
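
The grouping and sampling described above can be sketched with the standard library alone. This is a minimal illustration, not the module's implementation: the function name `downsample_sketch` is hypothetical, the use of `random.Random` is an assumption, and the grouping key `spectrum["params"]["seq"]` is borrowed from the description of get_pep_dict_mgf().

```python
import random
from collections import defaultdict


def downsample_sketch(spectra, k=1, random_seed=42):
    """Keep at most k spectra per peptide sequence (illustrative only)."""
    rng = random.Random(random_seed)  # local RNG, so the global state is untouched

    # Group spectra by peptide sequence, preserving first-seen group order.
    groups = defaultdict(list)
    for spectrum in spectra:
        groups[spectrum["params"]["seq"]].append(spectrum)

    # Sample without replacement; small groups are kept whole.
    sampled = []
    for members in groups.values():
        sampled.extend(rng.sample(members, min(k, len(members))))
    return sampled


# Toy in-memory spectra: three PSMs for one peptide, one for another.
spectra = [
    {"params": {"seq": "PEPTIDE", "title": "a"}},
    {"params": {"seq": "PEPTIDE", "title": "b"}},
    {"params": {"seq": "PEPTIDE", "title": "c"}},
    {"params": {"seq": "LESLIEK", "title": "d"}},
]
kept = downsample_sketch(spectra, k=2)
```

With k=2 the sketch keeps two of the three PEPTIDE spectra and the single LESLIEK spectrum. Reusing the same seed on the same input reproduces the same selection.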

casanovoutils.mgfutils.downsample_spectra(input_file: PathLike, output_file: PathLike, downsample_type: str = 'number', downsample_rate: float = 100, random_seed: int = 42) → None

Downsample an MGF file to a target number or proportion of spectra.

Makes two streaming passes: the first counts the total number of spectra; the second streams them with an adaptive acceptance probability (needed / remaining), which guarantees that exactly the target number of spectra is written.

Parameters:
  • input_file (PathLike) – Path to the input MGF file.

  • output_file (PathLike) – Path for the downsampled output MGF file. Must differ from input_file.

  • downsample_type (str, default "number") – One of "number" (retain exactly downsample_rate spectra) or "proportion" (retain exactly round(total × downsample_rate) spectra).

  • downsample_rate (float, default 100) – Target rate. Positive integer for "number"; in (0, 1] for "proportion".

  • random_seed (int, default 42) – Seed for the random number generator.
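
The needed / remaining acceptance rule is the classic selection-sampling technique. The sketch below shows why it yields an exact count: once the number still needed equals the number still remaining, the probability reaches 1 and every later item is accepted. The names `target_count` and `select_exactly` are hypothetical, and capping the target at the total is an assumption of this sketch, not documented behavior.

```python
import random


def target_count(downsample_type, downsample_rate, total):
    """Translate the documented rate options into a number of items to keep."""
    if downsample_type == "number":
        target = int(downsample_rate)
    elif downsample_type == "proportion":
        target = round(total * downsample_rate)
    else:
        raise ValueError(f"unknown downsample_type: {downsample_type!r}")
    return min(target, total)  # assumption: never ask for more than exists


def select_exactly(items, k, total, random_seed=42):
    """Yield exactly k of the `total` items in one streaming pass."""
    rng = random.Random(random_seed)
    needed, remaining = k, total
    for item in items:
        if needed == 0:
            break
        # Accept with probability needed / remaining.
        if rng.random() < needed / remaining:
            yield item
            needed -= 1
        remaining -= 1


items = list(range(1000))          # stand-in for spectra streamed from a file
total = len(items)                 # "first pass": count the items
k = target_count("proportion", 0.1, total)
kept = list(select_exactly(items, k, total))  # "second pass": sample
```

Because items are accepted or rejected in stream order, the output keeps the original ordering of the input.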

casanovoutils.mgfutils.get_pep_dict_mgf(spectra: PathLike | Iterable[PathLike] | Iterable[list[dict[str, Any]]]) → dict[str, list[list[dict[str, Any]]]]

Read spectra and group them by peptide sequence.

Parameters:

spectra (PathLike, Iterable[PathLike], or Iterable[PyteomicsSpectrum]) – Spectrum source — see iter_spectra() for accepted types.

Returns:

A dictionary mapping peptide sequence strings (taken from spectrum["params"]["seq"]) to a list of Pyteomics spectrum dictionaries corresponding to that sequence.

Return type:

dict[str, list[PyteomicsSpectrum]]

casanovoutils.mgfutils.iter_spectra(spectra: PathLike | Iterable[PathLike] | Iterable[list[dict[str, Any]]], desc: str | None = None, miniters: int = 1) → Iterable[list[dict[str, Any]]]

Normalize various spectrum input types to an iterable of PyteomicsSpectrum.

Parameters:
  • spectra (PathLike, Iterable[PathLike], or Iterable[PyteomicsSpectrum]) – One of: a single path to an MGF file; an iterable of paths to MGF files; or an iterable of Pyteomics spectrum dictionaries.

  • desc (str, optional) – Description for the tqdm progress bar. If None, no progress bar is shown.

  • miniters (int, default=1) – Minimum number of iterations between progress bar updates.

Yields:

PyteomicsSpectrum – Spectrum dictionaries, one at a time.
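
One way to normalize the three accepted input types is to dispatch on whether each item looks like a path. The sketch below is an assumption about how such dispatch could work, not the module's code: `iter_spectra_sketch` is a hypothetical name, and `read_file` is a hypothetical callable standing in for an MGF reader so that the example runs without any files. Progress-bar handling is left out.

```python
import os


def iter_spectra_sketch(spectra, read_file):
    """Yield spectrum dicts from a path, paths, or in-memory dicts."""
    # Case 1: a single path.
    if isinstance(spectra, (str, os.PathLike)):
        yield from read_file(spectra)
        return
    # Cases 2 and 3: an iterable of paths or of spectrum dicts.
    for item in spectra:
        if isinstance(item, (str, os.PathLike)):
            yield from read_file(item)
        else:
            yield item


def fake_reader(path):
    """Stand-in reader: pretends each file holds one spectrum."""
    return [{"params": {"title": os.fspath(path)}}]


from_path = list(iter_spectra_sketch("a.mgf", fake_reader))
from_paths = list(iter_spectra_sketch(["a.mgf", "b.mgf"], fake_reader))
from_dicts = list(iter_spectra_sketch([{"params": {"title": "x"}}], fake_reader))
```

Checking for `str` before iterating matters, because a string is itself iterable and would otherwise be walked character by character.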

casanovoutils.mgfutils.main() → None

Command-line entry point. Exposes the functions in this module as CLI subcommands via fire.

casanovoutils.mgfutils.pipeline(spectra: PathLike | Iterable[PathLike] | Iterable[list[dict[str, Any]]], outfile: PathLike | None = None, do_shuffle: bool = True, downsample_k: int | None = None, purge_epsilon: float | None = None, random_seed: int = 42) → list[list[dict[str, Any]]]

Run spectra through an optional chain of processing stages.

Stages are applied in order: shuffle → downsample → purge redundant peaks. Each stage is skipped when its enabling parameter is None (or False for do_shuffle).

Parameters:
  • spectra (PathLike, Iterable[PathLike], or Iterable[PyteomicsSpectrum]) – Spectrum source — see iter_spectra() for accepted types.

  • outfile (PathLike, optional) – If provided, write the final spectra to this MGF file path.

  • do_shuffle (bool, default=True) – Whether to shuffle the spectra.

  • downsample_k (int, optional) – If provided, downsample to at most this many PSMs per peptide sequence.

  • purge_epsilon (float, optional) – If provided, remove peaks whose m/z differs from the previous peak by less than this value (in daltons).

  • random_seed (int, default=42) – Random seed passed to shuffle and downsample.

Returns:

Processed spectra.

Return type:

list[PyteomicsSpectrum]

casanovoutils.mgfutils.purge_redundant(spectra: PathLike | Iterable[PathLike] | Iterable[list[dict[str, Any]]], epsilon: float = np.float32(1.1920929e-07), outfile: PathLike | None = None) → list[list[dict[str, Any]]]

Remove peaks with near-duplicate m/z values from each spectrum.

For each spectrum, peaks are sorted by m/z and any peak whose m/z differs from the previous peak by less than epsilon is discarded. If outfile is provided, the result is also written to disk in MGF format.

Parameters:
  • spectra (PathLike, Iterable[PathLike], or Iterable[PyteomicsSpectrum]) – Spectrum source — see iter_spectra() for accepted types.

  • epsilon (float, default=numpy.finfo(numpy.float32).eps ≈ 1.19e-7) – Minimum m/z separation (in daltons) required to keep a peak.

  • outfile (PathLike, optional) – If provided, write the purged spectra to this MGF file path.

Returns:

Spectra with redundant peaks removed and peaks sorted by m/z.

Return type:

list[PyteomicsSpectrum]

casanovoutils.mgfutils.remove_redundant_peaks(spectrum: list[dict[str, Any]], eps: float) → list[dict[str, Any]]

Remove redundant peaks that are too close together along the m/z axis.

Peaks are sorted by m/z. Any peak within eps of the preceding peak is discarded, keeping the first peak in each run of close peaks.

Parameters:
  • spectrum (PyteomicsSpectrum) – A spectrum dict as returned by pyteomics.mgf.read, containing "m/z array" and "intensity array" keys.

  • eps (float) – Maximum m/z distance between two peaks for them to be considered redundant. The signature above declares no default; the 32-bit float machine epsilon (numpy.finfo(numpy.float32).eps ≈ 1.19e-7) is the default of the epsilon parameter of purge_redundant().

Returns:

A new spectrum dict with "m/z array" and "intensity array" replaced by the deduplicated arrays. All other keys are unchanged.

Return type:

PyteomicsSpectrum
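
The peak-filtering rule can be illustrated with plain Python lists. This is a sketch under stated assumptions, not the library's code: `remove_close_peaks` is a hypothetical name, lists are used in place of NumPy arrays, and each peak is compared with the immediately preceding peak in sorted order, so a run of closely spaced peaks collapses to its first member.

```python
def remove_close_peaks(spectrum, eps):
    """Return a copy of `spectrum` without peaks closer than eps to their predecessor."""
    # Sort peaks by m/z, carrying each intensity along with its m/z value.
    peaks = sorted(
        zip(spectrum["m/z array"], spectrum["intensity array"]),
        key=lambda peak: peak[0],
    )

    kept = []
    previous_mz = None
    for mz, intensity in peaks:
        # Keep the first peak, and any peak at least eps away from the one before it.
        if previous_mz is None or mz - previous_mz >= eps:
            kept.append((mz, intensity))
        previous_mz = mz  # compare against the preceding peak, kept or not

    result = dict(spectrum)  # shallow copy; other keys are carried over unchanged
    result["m/z array"] = [mz for mz, _ in kept]
    result["intensity array"] = [intensity for _, intensity in kept]
    return result


spectrum = {
    "params": {"title": "demo"},
    "m/z array": [250.0, 100.0, 100.005, 100.009],
    "intensity array": [5.0, 10.0, 20.0, 30.0],
}
cleaned = remove_close_peaks(spectrum, eps=0.01)
```

In this toy input the peaks at 100.005 and 100.009 each lie within 0.01 of the peak before them, so only 100.0 and 250.0 survive, and the input dict is left unmodified.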

casanovoutils.mgfutils.shuffle(spectra: PathLike | Iterable[PathLike] | Iterable[list[dict[str, Any]]], outfile: PathLike | None = None, random_seed: int = 42) → list[list[dict[str, Any]]]

Read all spectra and return them in a shuffled order.

Parameters:
  • spectra (PathLike, Iterable[PathLike], or Iterable[PyteomicsSpectrum]) – Spectrum source — see iter_spectra() for accepted types.

  • outfile (PathLike, optional) – If provided, write the shuffled spectra to this MGF file path.

  • random_seed (int, default=42) – Random seed for reproducible shuffling.

Returns:

All spectra in shuffled order.

Return type:

list[PyteomicsSpectrum]

casanovoutils.mgfutils.spectra_per_peptide(spectra: PathLike | Iterable[PathLike] | Iterable[list[dict[str, Any]]], outfile: PathLike | None = None, k: int = 1, precursor: bool = False, ignore_mods: bool = False, random_seed: int = 42) → list[list[dict[str, Any]]]

Sample up to k spectra per peptide using reservoir sampling.

Makes a single streaming pass through spectra, maintaining a reservoir of size k per unique group. For the j-th occurrence of a group: if j <= k, add unconditionally; if j > k, replace a uniformly random reservoir slot with probability k/j. Memory usage is O(unique groups × k) rather than O(total spectra).

Parameters:
  • spectra (PathLike, Iterable[PathLike], or Iterable[PyteomicsSpectrum]) – Spectrum source — see iter_spectra() for accepted types.

  • outfile (PathLike, optional) – If provided, write the sampled spectra to this MGF file path.

  • k (int, default=1) – Maximum number of spectra to retain per group.

  • precursor (bool, default=False) – If True, group by peptide sequence and charge state, so that the same peptide observed in different charge states is treated as separate groups.

  • ignore_mods (bool, default=False) – If True, strip ProForma bracketed modification annotations (e.g. [Acetyl], [Carbamidomethyl]) from the sequence before grouping, so modified and unmodified forms of the same peptide are counted together.

  • random_seed (int, default=42) – Seed for the local random number generator.

Returns:

Sampled spectra, grouped by key in first-seen order.

Return type:

list[PyteomicsSpectrum]
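
The per-group reservoir scheme described above can be sketched as follows. All names here (`group_key`, `reservoir_per_group`, `MOD_PATTERN`) are hypothetical, and several details are assumptions: the sequence is read from `spectrum["params"]["seq"]`, the charge from `spectrum["params"]["charge"]`, and modifications are stripped with a simple regular expression that only removes bracketed text.

```python
import random
import re

# Assumption: ProForma modifications appear as bracketed annotations, e.g. C[Carbamidomethyl].
MOD_PATTERN = re.compile(r"\[[^\]]*\]")


def group_key(spectrum, precursor=False, ignore_mods=False):
    """Build the grouping key: the sequence, optionally with charge, optionally without mods."""
    seq = spectrum["params"]["seq"]
    if ignore_mods:
        seq = MOD_PATTERN.sub("", seq)
    charge = str(spectrum["params"].get("charge")) if precursor else None
    return (seq, charge)


def reservoir_per_group(spectra, k=1, precursor=False, ignore_mods=False, random_seed=42):
    """Single-pass sampling of up to k spectra per group."""
    rng = random.Random(random_seed)
    reservoirs = {}  # dicts keep insertion order, giving first-seen group order
    seen = {}        # number of occurrences observed so far per group
    for spectrum in spectra:
        key = group_key(spectrum, precursor, ignore_mods)
        j = seen[key] = seen.get(key, 0) + 1
        reservoir = reservoirs.setdefault(key, [])
        if j <= k:
            reservoir.append(spectrum)                 # fill the reservoir first
        elif rng.random() < k / j:
            reservoir[rng.randrange(k)] = spectrum     # replace a random slot
    return [s for reservoir in reservoirs.values() for s in reservoir]


spectra = [{"params": {"seq": "PEPTIDE", "title": f"p{i}"}} for i in range(10)]
spectra.append({"params": {"seq": "LESLIEK", "title": "q0"}})
sampled = reservoir_per_group(spectra, k=3)

modified = [
    {"params": {"seq": "PEPTC[Carbamidomethyl]IDE"}},
    {"params": {"seq": "PEPTCIDE"}},
]
merged = reservoir_per_group(modified, k=1, ignore_mods=True)
```

Only the reservoirs and the per-group counters are held in memory, which is what bounds memory use by the number of groups times k rather than by the total number of spectra.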

casanovoutils.mgfutils.write_spectra(spectra: Iterable[list[dict[str, Any]]], outfile: PathLike | None) → None

Write spectra to an MGF file, if an output path is provided.

Parameters:
  • spectra (Iterable[PyteomicsSpectrum]) – Spectra to write.

  • outfile (PathLike, optional) – Destination MGF file path. If None, this function is a no-op.