mgfutils
Utilities for reading, writing, and processing MGF spectrum files.
The module provides a progression of composable functions, from low-level
iteration over spectra (read from MGF files or held as in-memory dicts) up
to downsampling by peptide, shuffling, and purging near-duplicate peaks.
A pipeline function chains these stages as shuffle → downsample → purge,
and a main entry point exposes them all as CLI subcommands of
casanovoutils mgf, built with fire.
- casanovoutils.mgfutils.downsample(spectra: PathLike | Iterable[PathLike] | Iterable[list[dict[str, Any]]], k: int = 1, outfile: PathLike | None = None, random_seed: int = 42) → list[list[dict[str, Any]]]
Downsample spectra by limiting the number of PSMs per peptide sequence.
Spectra are grouped by peptide sequence, then up to k spectra are randomly sampled for each unique peptide. If outfile is provided, the result is also written to disk in MGF format.
- Parameters:
spectra (PathLike, Iterable[PathLike], or Iterable[PyteomicsSpectrum]) – Spectrum source; see iter_spectra() for accepted types.
k (int, default=1) – Maximum number of spectra (PSMs) to retain per unique peptide sequence.
outfile (PathLike, optional) – If provided, write the downsampled spectra to this MGF file path.
random_seed (int, default=42) – Random seed for reproducible sampling.
- Returns:
Downsampled spectra; each peptide sequence appears at most k times.
- Return type:
list[PyteomicsSpectrum]
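The group-then-sample behavior described for downsample() can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the helper name downsample_sketch is hypothetical, and it assumes each spectrum is a dict whose peptide sequence is stored at spectrum["params"]["seq"], as stated for get_pep_dict_mgf().

```python
import random
from collections import defaultdict


def downsample_sketch(spectra, k=1, random_seed=42):
    """Keep at most k spectra per peptide sequence (hypothetical helper)."""
    rng = random.Random(random_seed)

    # Group spectra by the peptide sequence stored under params/seq
    # (assumed layout of a Pyteomics spectrum dict).
    by_peptide = defaultdict(list)
    for spectrum in spectra:
        by_peptide[spectrum["params"]["seq"]].append(spectrum)

    # Sample without replacement, capped at the size of each group.
    sampled = []
    for group in by_peptide.values():
        sampled.extend(rng.sample(group, min(k, len(group))))
    return sampled


demo = [
    {"params": {"seq": "PEPTIDE", "title": "a"}},
    {"params": {"seq": "PEPTIDE", "title": "b"}},
    {"params": {"seq": "PEPTIDE", "title": "c"}},
    {"params": {"seq": "LESLIEK", "title": "d"}},
]
result = downsample_sketch(demo, k=2)
```

Because the sketch uses a local random.Random seeded with random_seed, repeated calls on the same input return the same selection. Unlike a streaming approach, it holds every spectrum in memory while grouping.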
- casanovoutils.mgfutils.downsample_spectra(input_file: PathLike, output_file: PathLike, downsample_type: str = 'number', downsample_rate: float = 100, random_seed: int = 42) → None
Downsample an MGF file to a target number or proportion of spectra.
Makes two streaming passes. The first counts the total number of spectra. The second streams through them again, accepting each spectrum with an adaptive probability needed/remaining, where needed is the number of spectra still to be selected and remaining is the number not yet examined. This guarantees that exactly the target number of spectra is written.
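The needed/remaining acceptance rule can be sketched as follows. This is an illustrative stand-in rather than the function's actual code: select_exactly is a hypothetical helper, a range object stands in for spectra streamed from an MGF file, and the sketch assumes the target k does not exceed the total.

```python
import random


def select_exactly(items, total, k, random_seed=42):
    """Yield exactly k of `total` streamed items (hypothetical helper)."""
    rng = random.Random(random_seed)
    needed = k
    for seen, item in enumerate(items):
        if needed == 0:
            break
        remaining = total - seen  # items not yet examined, including this one
        # Accept with probability needed / remaining. Once needed equals
        # remaining the probability is 1, so the target is always reached.
        if rng.random() < needed / remaining:
            needed -= 1
            yield item


stream = range(1000)                 # stand-in for spectra in an MGF file
total = sum(1 for _ in stream)       # first pass: count
kept = list(select_exactly(stream, total, k=100))  # second pass: select
```

Only one item is held at a time, so memory use does not grow with file size. For the "proportion" mode, the target would be computed from the first pass as round(total × downsample_rate) before the second pass starts.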
- Parameters:
input_file (PathLike) – Path to the input MGF file.
output_file (PathLike) – Path for the downsampled output MGF file. Must differ from input_file.
downsample_type (str, default "number") – One of "number" (retain exactly downsample_rate spectra) or "proportion" (retain exactly round(total × downsample_rate) spectra).
downsample_rate (float, default 100) – Target rate. Positive integer for "number"; in (0, 1] for "proportion".
random_seed (int, default 42) – Seed for the random number generator.
- casanovoutils.mgfutils.get_pep_dict_mgf(spectra: PathLike | Iterable[PathLike] | Iterable[list[dict[str, Any]]]) → dict[str, list[list[dict[str, Any]]]]
Read spectra and group them by peptide sequence.
- Parameters:
spectra (PathLike, Iterable[PathLike], or Iterable[PyteomicsSpectrum]) – Spectrum source; see iter_spectra() for accepted types.
- Returns:
A dictionary mapping peptide sequence strings (taken from spectrum["params"]["seq"]) to a list of Pyteomics spectrum dictionaries corresponding to that sequence.
- Return type:
dict[str, list[PyteomicsSpectrum]]
- casanovoutils.mgfutils.iter_spectra(spectra: PathLike | Iterable[PathLike] | Iterable[list[dict[str, Any]]], desc: str | None = None, miniters: int = 1) → Iterable[list[dict[str, Any]]]
Normalize various spectrum input types to an iterable of PyteomicsSpectrum.
- Parameters:
spectra (PathLike, Iterable[PathLike], or Iterable[PyteomicsSpectrum]) – One of: a single path to an MGF file, an iterable of paths to MGF files, or an iterable of Pyteomics spectrum dictionaries.
desc (str, optional) – Description for the tqdm progress bar. If None, no progress bar is shown.
miniters (int, default=1) – Minimum number of iterations between progress bar updates.
- Yields:
PyteomicsSpectrum – Spectrum dictionaries, one at a time.
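One way to normalize the three accepted input types is to dispatch on whether a value is path-like. The sketch below is an assumption about how such dispatch could look, not the library's code: iter_spectra_sketch and fake_reader are hypothetical names, the file reader is injected so the example runs without any MGF file, and the progress bar is omitted. In practice the reader might be something like pyteomics.mgf.read.

```python
import os


def iter_spectra_sketch(spectra, read_file):
    """Yield spectrum dicts from a path, paths, or dicts (hypothetical helper).

    `read_file` is any callable that takes a path and returns an iterable
    of spectrum dicts.
    """
    # A single path: read that one file.
    if isinstance(spectra, (str, os.PathLike)):
        yield from read_file(spectra)
        return
    # Otherwise treat the input as an iterable of paths or of spectra.
    for item in spectra:
        if isinstance(item, (str, os.PathLike)):
            yield from read_file(item)
        else:
            yield item  # already an in-memory spectrum dict


def fake_reader(path):
    # Stand-in for an MGF parser: two dummy spectra per file.
    return [{"params": {"title": f"{path}:{i}"}} for i in range(2)]


single = list(iter_spectra_sketch("a.mgf", fake_reader))
many = list(iter_spectra_sketch(["a.mgf", "b.mgf"], fake_reader))
in_memory = list(iter_spectra_sketch(single, fake_reader))
```

Writing the function as a generator keeps it lazy, so downstream stages can stream through large files without loading them whole.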
- casanovoutils.mgfutils.main() → None
Command-line entry point. Exposes the functions in this module as CLI subcommands via fire.
- casanovoutils.mgfutils.pipeline(spectra: PathLike | Iterable[PathLike] | Iterable[list[dict[str, Any]]], outfile: PathLike | None = None, do_shuffle: bool = True, downsample_k: int | None = None, purge_epsilon: float | None = None, random_seed: int = 42) → list[list[dict[str, Any]]]
Run spectra through an optional chain of processing stages.
Stages are applied in order: shuffle → downsample → purge redundant peaks. Each stage is skipped when its enabling parameter is None (or False for do_shuffle).
- Parameters:
spectra (PathLike, Iterable[PathLike], or Iterable[PyteomicsSpectrum]) – Spectrum source; see iter_spectra() for accepted types.
outfile (PathLike, optional) – If provided, write the final spectra to this MGF file path.
do_shuffle (bool, default=True) – Whether to shuffle the spectra.
downsample_k (int, optional) – If provided, downsample to at most this many PSMs per peptide sequence.
purge_epsilon (float, optional) – If provided, remove peaks whose m/z differs from the previous peak by less than this value (in daltons).
random_seed (int, default=42) – Random seed passed to shuffle and downsample.
- Returns:
Processed spectra.
- Return type:
list[PyteomicsSpectrum]
- casanovoutils.mgfutils.purge_redundant(spectra: PathLike | Iterable[PathLike] | Iterable[list[dict[str, Any]]], epsilon: float = np.float32(1.1920929e-07), outfile: PathLike | None = None) → list[list[dict[str, Any]]]
Remove peaks with near-duplicate m/z values from each spectrum.
For each spectrum, peaks are sorted by m/z and any peak whose m/z differs from the previous peak by less than epsilon is discarded. If outfile is provided, the result is also written to disk in MGF format.
- Parameters:
spectra (PathLike, Iterable[PathLike], or Iterable[PyteomicsSpectrum]) – Spectrum source; see iter_spectra() for accepted types.
epsilon (float, default=np.float32(1.1920929e-07)) – Minimum m/z separation (in daltons) required to keep a peak. The default is the 32-bit float machine epsilon.
outfile (PathLike, optional) – If provided, write the purged spectra to this MGF file path.
- Returns:
Spectra with redundant peaks removed and peaks sorted by m/z.
- Return type:
list[PyteomicsSpectrum]
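The per-spectrum rule (sort by m/z, then drop any peak closer than epsilon to the one before it) can be sketched without NumPy. This is an illustration under stated assumptions, not the library's code: purge_peaks_sketch is a hypothetical name, each peak is compared with the immediately preceding peak in sorted order (kept or not), and the sketch returns plain lists where the real spectra hold arrays.

```python
def purge_peaks_sketch(spectrum, eps):
    """Drop peaks within eps of the preceding peak (hypothetical helper)."""
    # Pair each m/z with its intensity and sort by m/z only.
    peaks = sorted(
        zip(spectrum["m/z array"], spectrum["intensity array"]),
        key=lambda peak: peak[0],
    )

    kept = []
    previous_mz = None
    for mz, intensity in peaks:
        # Keep the first peak, and any peak at least eps past its predecessor.
        if previous_mz is None or mz - previous_mz >= eps:
            kept.append((mz, intensity))
        previous_mz = mz  # assumed: compare against the preceding sorted peak

    # Copy the dict so the input spectrum is left untouched.
    cleaned = dict(spectrum)
    cleaned["m/z array"] = [mz for mz, _ in kept]
    cleaned["intensity array"] = [intensity for _, intensity in kept]
    return cleaned


spectrum = {
    "params": {"title": "demo"},
    "m/z array": [300.0, 100.0, 100.005, 200.0],
    "intensity array": [3.0, 1.0, 9.0, 2.0],
}
cleaned = purge_peaks_sketch(spectrum, eps=0.01)
# cleaned["m/z array"] is [100.0, 200.0, 300.0]
```

In the demo the peak at 100.005 is dropped even though it is the more intense of the pair, because the rule keeps the first peak of each close run regardless of intensity.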
- casanovoutils.mgfutils.remove_redundant_peaks(spectrum: list[dict[str, Any]], eps: float) → list[dict[str, Any]]
Remove redundant peaks that are too close together along the m/z axis.
Peaks are sorted by m/z. Any peak within eps of the preceding peak is discarded, keeping the first peak in each run of close peaks.
- Parameters:
spectrum (PyteomicsSpectrum) – A spectrum dict as returned by pyteomics.mgf.read, containing "m/z array" and "intensity array" keys.
eps (float) – Maximum m/z distance between two peaks to be considered redundant. The signature shows no default value; the 32-bit float machine epsilon (numpy.finfo(numpy.float32).eps ≈ 1.19e-7) is the default of the epsilon parameter of purge_redundant().
- Returns:
A new spectrum dict with "m/z array" and "intensity array" replaced by the deduplicated arrays. All other keys are unchanged.
- Return type:
PyteomicsSpectrum
- casanovoutils.mgfutils.shuffle(spectra: PathLike | Iterable[PathLike] | Iterable[list[dict[str, Any]]], outfile: PathLike | None = None, random_seed: int = 42) → list[list[dict[str, Any]]]
Read all spectra and return them in a shuffled order.
- Parameters:
spectra (PathLike, Iterable[PathLike], or Iterable[PyteomicsSpectrum]) – Spectrum source; see iter_spectra() for accepted types.
outfile (PathLike, optional) – If provided, write the shuffled spectra to this MGF file path.
random_seed (int, default=42) – Random seed for reproducible shuffling.
- Returns:
All spectra in shuffled order.
- Return type:
list[PyteomicsSpectrum]
- casanovoutils.mgfutils.spectra_per_peptide(spectra: PathLike | Iterable[PathLike] | Iterable[list[dict[str, Any]]], outfile: PathLike | None = None, k: int = 1, precursor: bool = False, ignore_mods: bool = False, random_seed: int = 42) → list[list[dict[str, Any]]]
Sample up to k spectra per peptide using reservoir sampling.
Makes a single streaming pass through spectra, maintaining a reservoir of size k per unique group. For the j-th occurrence of a group: if j <= k, add unconditionally; if j > k, replace a uniformly random reservoir slot with probability k/j. Memory usage is O(unique groups × k) rather than O(total spectra).
- Parameters:
spectra (PathLike, Iterable[PathLike], or Iterable[PyteomicsSpectrum]) – Spectrum source; see iter_spectra() for accepted types.
outfile (PathLike, optional) – If provided, write the sampled spectra to this MGF file path.
k (int, default=1) – Maximum number of spectra to retain per group.
precursor (bool, default=False) – If True, group by peptide sequence and charge state, so that the same peptide observed in different charge states is treated as separate groups.
ignore_mods (bool, default=False) – If True, strip ProForma bracketed modification annotations (e.g. [Acetyl], [Carbamidomethyl]) from the sequence before grouping, so modified and unmodified forms of the same peptide are counted together.
random_seed (int, default=42) – Seed for the local random number generator.
- Returns:
Sampled spectra, grouped by key in first-seen order.
- Return type:
list[PyteomicsSpectrum]
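The per-group reservoir rule can be sketched as below. This is a minimal illustration under stated assumptions rather than the library's implementation: reservoir_per_group is a hypothetical name, the sequence and charge are assumed to live at spectrum["params"]["seq"] and spectrum["params"]["charge"], and the modification stripping is a simple regular expression that removes bracketed text and any leftover terminal hyphen.

```python
import random
import re


def reservoir_per_group(spectra, k=1, precursor=False, ignore_mods=False,
                        random_seed=42):
    """Keep up to k spectra per group in one pass (hypothetical helper)."""
    rng = random.Random(random_seed)
    reservoirs = {}  # group key -> up to k spectra; dicts keep first-seen order
    counts = {}      # group key -> occurrences seen so far

    for spectrum in spectra:
        params = spectrum["params"]
        seq = params["seq"]
        if ignore_mods:
            # Assumed stripping rule: drop [..] annotations and stray hyphens.
            seq = re.sub(r"\[[^\]]*\]", "", seq).strip("-")
        key = (seq, params.get("charge")) if precursor else seq

        j = counts[key] = counts.get(key, 0) + 1
        reservoir = reservoirs.setdefault(key, [])
        if j <= k:
            reservoir.append(spectrum)            # fill the reservoir first
        elif rng.random() < k / j:
            reservoir[rng.randrange(k)] = spectrum  # replace a random slot

    return [s for reservoir in reservoirs.values() for s in reservoir]


demo = [
    {"params": {"seq": "PEPC[Carbamidomethyl]K", "charge": 2, "title": "a"}},
    {"params": {"seq": "PEPCK", "charge": 2, "title": "b"}},
    {"params": {"seq": "PEPCK", "charge": 3, "title": "c"}},
    {"params": {"seq": "[Acetyl]-AAK", "charge": 2, "title": "d"}},
]
by_sequence = reservoir_per_group(demo, k=1)
merged_mods = reservoir_per_group(demo, k=1, ignore_mods=True)
by_precursor = reservoir_per_group(demo, k=1, precursor=True)
```

With the demo input, default grouping yields three groups, ignore_mods=True merges the modified and unmodified PEPCK forms into one group, and precursor=True splits PEPCK by charge state into separate groups.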
- casanovoutils.mgfutils.write_spectra(spectra: Iterable[list[dict[str, Any]]], outfile: PathLike | None) → None
Write spectra to an MGF file, if an output path is provided.
- Parameters:
spectra (Iterable[PyteomicsSpectrum]) – Spectra to write.
outfile (PathLike, optional) – Destination MGF file path. If None, this function is a no-op.