casanovoutils.mzmlutils
Utilities for reading and sampling spectra from mzML files.
Provides a single sample_spectra command that makes one streaming pass
through an mzML file, sampling a proportion k of spectra from each
read buffer. Output is written in MGF format. The main entry point
exposes the commands as CLI subcommands via fire.
Attributes
COMMANDS |
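Functions
sample_mzml(input_file, k[, buffer_size, random_seed]) | Sample a proportion of spectra from an mzML file in a single streaming pass.
sample_spectra(input_file, k, outfile[, buffer_size, random_seed]) | Sample spectra from an mzML file and write them to an MGF file.
main() |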
Module Contents
- casanovoutils.mzmlutils.sample_mzml(input_file: os.PathLike, k: float, buffer_size: int = 1000, random_seed: int = 42) → list[casanovoutils.types.PyteomicsSpectrum]
Sample a proportion of spectra from an mzML file in a single streaming pass.
Reads the file in chunks of buffer_size spectra. From each chunk, round(k * chunk_size) spectra are drawn without replacement using random.sample. No second pass or total count is required.
Note: the final sample count equals sum(round(k * b) for b in buffers), which may differ slightly from round(k * total) due to per-buffer rounding. Use a buffer_size that is large relative to 1 / k to minimise this effect.
- Parameters:
input_file (PathLike) – Path to the input mzML file.
k (float) – Proportion of spectra to sample; must be in (0, 1).
buffer_size (int, default=1000) – Number of spectra read per I/O chunk.
random_seed (int, default=42) – Seed for reproducible sampling.
- Returns:
Sampled spectra in file order within each buffer.
- Return type:
list[PyteomicsSpectrum]
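The per-buffer sampling described above can be sketched in plain Python. This is a minimal illustration and not the module's actual implementation: `iter_buffers` and `sample_stream` are hypothetical names, integers stand in for spectra, and the use of a private `random.Random` instance and of sorted index draws (to keep picks in stream order) are assumptions.

```python
import itertools
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def iter_buffers(items: Iterable[T], buffer_size: int) -> Iterator[list[T]]:
    """Yield consecutive lists of at most buffer_size items (hypothetical helper)."""
    iterator = iter(items)
    while True:
        buffer = list(itertools.islice(iterator, buffer_size))
        if not buffer:
            return
        yield buffer


def sample_stream(
    items: Iterable[T], k: float, buffer_size: int = 1000, random_seed: int = 42
) -> list[T]:
    """Draw round(k * len(buffer)) items from each buffer in one pass."""
    rng = random.Random(random_seed)  # assumption: a seeded, private generator
    sampled: list[T] = []
    for buffer in iter_buffers(items, buffer_size):
        n_draw = round(k * len(buffer))
        # Sample indices without replacement, then sort them so the picks
        # keep their original order within the buffer.
        picked = sorted(rng.sample(range(len(buffer)), n_draw))
        sampled.extend(buffer[i] for i in picked)
    return sampled


# 2500 stand-in "spectra": buffers of 1000, 1000 and 500 give 100 + 100 + 50 picks.
large = sample_stream(range(2500), k=0.1, buffer_size=1000)
print(len(large))  # 250

# buffer_size is small relative to 1 / k: round(0.1 * 4) == 0 for every buffer,
# so nothing is sampled even though round(0.1 * 100) == 10.
small = sample_stream(range(100), k=0.1, buffer_size=4)
print(len(small))  # 0
```

The second call shows why the note recommends a buffer_size that is large relative to 1 / k: with tiny buffers the per-buffer rounding can dominate the result.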
- casanovoutils.mzmlutils.sample_spectra(input_file: os.PathLike, k: float, outfile: os.PathLike, buffer_size: int = 1000, random_seed: int = 42) → None
Sample spectra from an mzML file and write them to an MGF file.
Note: writing mzML is cumbersome, so it is not implemented here. If you need mzML output, run this command and then convert the resulting MGF file to mzML with msconvert.
- Parameters:
input_file (PathLike) – Path to the input mzML file.
k (float) – Proportion of spectra to sample; must be in (0, 1).
outfile (PathLike) – Output path; must have a .mgf extension.
buffer_size (int, default=1000) – Number of spectra read per I/O chunk.
random_seed (int, default=42) – Seed for reproducible sampling.
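The .mgf requirement on outfile could be enforced with a check along these lines. This is only a sketch: `check_mgf_path` is a hypothetical helper, and the exact (case-sensitive) comparison and the choice of `ValueError` are assumptions, since the source does not say how the module validates the path.

```python
from pathlib import Path


def check_mgf_path(outfile) -> Path:
    """Return outfile as a Path, rejecting anything without a .mgf suffix."""
    path = Path(outfile)
    # Assumption: the suffix must match ".mgf" exactly (case-sensitive).
    if path.suffix != ".mgf":
        raise ValueError(
            f"outfile must have a .mgf extension, got {path.name!r}"
        )
    return path


print(check_mgf_path("sampled.mgf").name)  # sampled.mgf
```

Validating the path before reading any spectra means a mistyped output name fails immediately rather than after a full pass over the input file.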
- casanovoutils.mzmlutils.COMMANDS: casanovoutils.types.Commands
- casanovoutils.mzmlutils.main() → None