casanovoutils.mzmlutils

Utilities for reading and sampling spectra from mzML files.

Provides a single sample_spectra command that makes one streaming pass through an mzML file, sampling a proportion k of the spectra in each read buffer. Output is written in MGF format. A main entry point exposes the commands as CLI subcommands via fire.

Attributes

Functions

sample_mzml(→ list[casanovoutils.types.PyteomicsSpectrum])

Sample a proportion of spectra from an mzML file in a single streaming pass.

sample_spectra(→ None)

Sample spectra from an mzML file and write them to an MGF file.

main(→ None)

Module Contents

casanovoutils.mzmlutils.sample_mzml(input_file: os.PathLike, k: float, buffer_size: int = 1000, random_seed: int = 42) → list[casanovoutils.types.PyteomicsSpectrum]

Sample a proportion of spectra from an mzML file in a single streaming pass.

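Reads the file in chunks of up to buffer_size spectra; the last chunk may be shorter. From each chunk, round(k * chunk_size) spectra are drawn without replacement using random.sample, where chunk_size is the number of spectra actually read into that chunk. Neither a second pass nor a total spectrum count is required.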
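The per-chunk sampling described above can be sketched on a plain Python iterable. This is a minimal illustration, not the module's implementation: sample_in_buffers is a hypothetical name, integers stand in for spectra, and sorting the drawn indices is an assumption made here so that the picks keep their original order within each chunk.

```python
import random
from itertools import islice
from typing import Iterable, TypeVar

T = TypeVar("T")


def sample_in_buffers(
    items: Iterable[T],
    k: float,
    buffer_size: int = 1000,
    random_seed: int = 42,
) -> list[T]:
    """Hypothetical stand-in: sample a proportion k from each chunk."""
    rng = random.Random(random_seed)  # seeded for reproducible draws
    iterator = iter(items)
    sampled: list[T] = []
    while True:
        # Read at most buffer_size items; the last chunk may be shorter.
        buffer = list(islice(iterator, buffer_size))
        if not buffer:
            break
        n_draw = round(k * len(buffer))
        # Draw indices without replacement, then sort them (assumption)
        # so the picks stay in their original order within the chunk.
        picked = sorted(rng.sample(range(len(buffer)), n_draw))
        sampled.extend(buffer[i] for i in picked)
    return sampled


# Integers stand in for spectra: chunks of 1000, 1000 and 500 items.
subset = sample_in_buffers(range(2500), k=0.1, buffer_size=1000)
print(len(subset))  # 100 + 100 + 50 = 250
```

Only one chunk is held in memory at a time, which is why the total number of items never needs to be known in advance.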

Note: the final sample count equals sum(round(k * b) for b in buffers), where b is the number of spectra in each buffer. Because of per-buffer rounding, this may differ from round(k * total). Use a buffer_size that is large relative to 1 / k to minimise this effect.
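The rounding effect can be checked with plain arithmetic. The sketch below uses a hypothetical helper, per_buffer_count, that only computes the count formula from the note; it does not read any file, and the totals are made-up numbers.

```python
def per_buffer_count(total: int, k: float, buffer_size: int) -> int:
    """Hypothetical helper: sum(round(k * b)) over the buffer lengths."""
    full, remainder = divmod(total, buffer_size)
    # All buffers are full except possibly a shorter final one.
    sizes = [buffer_size] * full + ([remainder] if remainder else [])
    return sum(round(k * b) for b in sizes)


total, k = 2500, 0.1
print(round(k * total))                  # 250
print(per_buffer_count(total, k, 1000))  # 250: 100 + 100 + 50
print(per_buffer_count(total, k, 4))     # 0: round(0.1 * 4) is 0 in every buffer
```

With buffer_size=4, which is smaller than 1 / k = 10, every buffer rounds down to zero and nothing is sampled at all, so the guidance on buffer size matters in practice.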

Parameters:
  • input_file (PathLike) – Path to the input mzML file.

  • k (float) – Proportion of spectra to sample; must be in (0, 1).

  • buffer_size (int, default=1000) – Number of spectra read per I/O chunk.

  • random_seed (int, default=42) – Seed for reproducible sampling.

Returns:

Sampled spectra in file order within each buffer.

Return type:

list[PyteomicsSpectrum]

casanovoutils.mzmlutils.sample_spectra(input_file: os.PathLike, k: float, outfile: os.PathLike, buffer_size: int = 1000, random_seed: int = 42) → None

Sample spectra from an mzML file and write them to an MGF file.

Note: writing mzML is a pain, so it is not implemented here. If you need mzML output, I would recommend running this command and then converting the resulting MGF file to mzML with msconvert.

Parameters:
  • input_file (PathLike) – Path to the input mzML file.

  • k (float) – Proportion of spectra to sample; must be in (0, 1).

  • outfile (PathLike) – Output path; must have a .mgf extension.

  • buffer_size (int, default=1000) – Number of spectra read per I/O chunk.

  • random_seed (int, default=42) – Seed for reproducible sampling.
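The .mgf requirement on outfile can be checked before any spectra are read. The following is only a sketch of such a check: check_mgf_path is a hypothetical helper, and both the case-insensitive comparison and the ValueError are assumptions, since the behaviour of sample_spectra on an invalid extension is not documented here.

```python
import os
from pathlib import Path
from typing import Union


def check_mgf_path(outfile: Union[str, os.PathLike]) -> Path:
    """Hypothetical helper: return outfile as a Path if it ends in .mgf."""
    path = Path(outfile)
    # Assumption: compare case-insensitively, so ".MGF" is also accepted.
    if path.suffix.lower() != ".mgf":
        raise ValueError(f"outfile must have a .mgf extension, got {path.name!r}")
    return path


print(check_mgf_path("sampled.mgf").name)  # sampled.mgf
```

Failing early on the path avoids streaming through a large mzML file only to discover at write time that the output name is unusable.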

casanovoutils.mzmlutils.COMMANDS: casanovoutils.types.Commands
casanovoutils.mzmlutils.main() → None