casanovoutils.denovoutils#

Data loading and preprocessing utilities for MGF and mzTab PSM files.

Provides functions to parse MGF and mzTab files into Polars DataFrames, join predicted and ground-truth annotations, and tokenize peptide sequences for downstream evaluation. Every loader accepts either a file path or an already-loaded DataFrame, so the functions can be composed freely without redundant I/O.

The module is also executable as a CLI via python -m casanovoutils.denovoutils (or the installed casanovoutils entry point), exposing get_mgf_psms, get_mztab, and get_groundtruth as subcommands.

Attributes#

`DfPath`
`COMMANDS`

Functions#

`process_spectrum`(→ dict[str, dict[str, Any]])	Extract and augment parameter metadata from a single spectrum.
`write_dataframe`(→ None)	Write a DataFrame to a file, inferring the format from the extension.
`get_mgf_psms_df`(→ polars.DataFrame)	Load PSM metadata from an MGF file into a Polars DataFrame.
`tokenize_helper`(→ list[str])	Split a peptide sequence into tokens.
`tokenize_sequences`(→ polars.DataFrame)	Tokenize a peptide sequence column and append token and length columns.
`read_dataframe`(→ polars.DataFrame)	Read a DataFrame from a file path, inferring the format from the extension.
`get_mztab_df`(→ polars.DataFrame)	Load the spectrum match table from an mzTab file into a Polars DataFrame.
`get_ground_truth_df`(→ polars.DataFrame)	Join MGF PSM metadata with mzTab spectrum match annotations.
`main`(→ None)	Configure logging and expose data loading functions as a CLI.

Module Contents#

casanovoutils.denovoutils.DfPath#

casanovoutils.denovoutils.process_spectrum(spectrum: casanovoutils.types.PyteomicsSpectrum, meta_data_only: bool = True) → dict[str, dict[str, Any]]#

Extract and augment parameter metadata from a single spectrum.

Retrieves the params dict from a Pyteomics spectrum object and annotates it with the number of peaks in the spectrum. If meta_data_only is False, the intensity and m/z arrays are also included in the output.

Parameters:

spectrum (PyteomicsSpectrum) – A spectrum dict as returned by pyteomics.mgf.read, containing at least a "params" key, an "m/z array" key, and an "intensity array" key.
meta_data_only (bool, optional) – If True, only scalar metadata is returned (no spectral arrays). If False, "intensity_array" and "m_z_array" are added to the output dict.

Returns:

The spectrum’s parameter dict with an added "n_peaks" entry, and optionally "intensity_array" and "m_z_array" entries.

Return type:

dict[str, Any]
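
The behaviour described above can be illustrated with a minimal sketch. process_spectrum_sketch is a hypothetical stand-in written for this page, not the library's code; it assumes that n_peaks is the length of the m/z array and that the params dict is copied rather than modified in place. The spectrum values are made up.

```python
from typing import Any


def process_spectrum_sketch(
    spectrum: dict[str, Any], meta_data_only: bool = True
) -> dict[str, Any]:
    """Hypothetical stand-in for process_spectrum (assumed behaviour)."""
    # Copy the params so the caller's spectrum dict is left untouched
    # (an assumption; the real function may annotate in place).
    out = dict(spectrum["params"])
    # Assumption: the peak count is the length of the m/z array.
    out["n_peaks"] = len(spectrum["m/z array"])
    if not meta_data_only:
        out["intensity_array"] = list(spectrum["intensity array"])
        out["m_z_array"] = list(spectrum["m/z array"])
    return out


# Hand-built spectrum in the shape pyteomics.mgf.read yields (made-up values).
spectrum = {
    "params": {"title": "scan=1", "charge": [2]},
    "m/z array": [100.1, 200.2, 300.3],
    "intensity array": [10.0, 20.0, 5.0],
}

meta = process_spectrum_sketch(spectrum)
full = process_spectrum_sketch(spectrum, meta_data_only=False)
```

With meta_data_only left at its default, only the scalar entries and n_peaks are present; passing False adds the two array entries under the renamed keys.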

casanovoutils.denovoutils.write_dataframe(data_df: polars.DataFrame, out_path: os.PathLike) → None#

Write a DataFrame to a file, inferring the format from the extension.

Parameters:

data_df (pl.DataFrame) – The DataFrame to write.
out_path (PathLike) – Destination path. The file format is inferred from the extension: .parquet / .pq for Parquet, .csv for comma-separated, and .tsv for tab-separated.

Raises:

ValueError – If the file extension is not one of the supported types.
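
Extension-based dispatch of this kind could look roughly like the sketch below. infer_format and FORMATS are hypothetical names introduced for illustration, and lowercasing the suffix is an assumption; the real function may be case-sensitive.

```python
from pathlib import Path

# Extension-to-format table taken from the parameter description above.
FORMATS = {".parquet": "parquet", ".pq": "parquet", ".csv": "csv", ".tsv": "tsv"}


def infer_format(out_path) -> str:
    """Hypothetical helper: map a path's extension to a format name."""
    # Assumption: matching is case-insensitive.
    suffix = Path(out_path).suffix.lower()
    if suffix not in FORMATS:
        raise ValueError(f"Unsupported file extension: {suffix!r}")
    return FORMATS[suffix]


parquet_format = infer_format("results/out.pq")
tsv_format = infer_format("results/out.tsv")
```

In Polars, the resulting format name would plausibly select between DataFrame.write_parquet and DataFrame.write_csv, the latter with separator="\t" for .tsv files.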

casanovoutils.denovoutils.get_mgf_psms_df(mgf_path: DfPath, out_path: os.PathLike | None = None, meta_data_only: bool = True) → polars.DataFrame#

Load PSM metadata from an MGF file into a Polars DataFrame.

If mgf_path is already a polars.DataFrame, it is returned as-is (and optionally written to out_path). Otherwise, the MGF file is parsed with Pyteomics, per-spectrum parameters are extracted via process_spectrum(), and all columns are prefixed with mgf_.

Parameters:

mgf_path (DfPath) – Path to an MGF file, or an already-loaded polars.DataFrame.
out_path (PathLike, optional) – If provided, the resulting DataFrame is written to this path before being returned. The format is inferred from the file extension via write_dataframe().
meta_data_only (bool, optional) – Passed through to process_spectrum(). If True (default), only scalar spectrum metadata is loaded (no m/z or intensity arrays). If False, mgf_intensity_array and mgf_m_z_array columns are included in the returned DataFrame.

Returns:

A DataFrame with one row per spectrum and columns prefixed with mgf_, including an mgf_n_peaks column.

Return type:

pl.DataFrame

Raises:

ValueError – Propagated from write_dataframe() if out_path has an unsupported file extension.
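
To make the mgf_ prefixing concrete, here is a dependency-free sketch that operates on plain row dicts instead of a Polars DataFrame. prefix_columns is a hypothetical helper and the rows are made up.

```python
def prefix_columns(rows: list[dict], prefix: str) -> list[dict]:
    """Hypothetical helper: prepend a prefix to every column name."""
    return [{f"{prefix}{key}": value for key, value in row.items()} for row in rows]


# Rows as they might look after per-spectrum metadata extraction (made-up values).
rows = [
    {"title": "scan=1", "n_peaks": 3},
    {"title": "scan=2", "n_peaks": 5},
]
mgf_rows = prefix_columns(rows, "mgf_")
```

On an actual Polars DataFrame, the same renaming can be expressed as df.select(pl.all().name.prefix("mgf_")); whether the module does it this way is not stated here.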

casanovoutils.denovoutils.tokenize_helper(seq: str, tokenizer: depthcharge.tokenizers.PeptideTokenizer, combine_n_term: bool = True) → list[str]#

Split a peptide sequence into tokens.

Delegates to tokenizer.split and, when combine_n_term is True, fuses a leading modification token (e.g. "[UNIMOD:x]") onto the first residue token so that the modification is not a stand-alone element.

Parameters:

seq (str) – A peptide sequence string, optionally containing modification annotations recognised by tokenizer.
tokenizer (depthcharge.tokenizers.PeptideTokenizer) – A tokenizer instance used to split the sequence.
combine_n_term (bool, optional) – If True (default), merge a leading modification token with the first residue token.

Returns:

Ordered list of token strings representing the peptide.

Return type:

list[str]
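
The effect of combine_n_term can be sketched without depthcharge. In the code below, split_sketch is a regex stand-in for tokenizer.split and is an assumption about the token shapes (a leading bracketed modification followed by "-", then residues with optional bracketed modifications); the real tokenizer may split differently.

```python
import re


def split_sketch(seq: str) -> list[str]:
    """Stand-in for tokenizer.split (assumed token shapes, not depthcharge)."""
    # Either a leading "[...]-" modification, or one residue letter
    # optionally followed by a "[...]" modification.
    return re.findall(r"^\[[^\]]+\]-?|[A-Z](?:\[[^\]]+\])?", seq)


def tokenize_helper_sketch(seq: str, combine_n_term: bool = True) -> list[str]:
    """Hypothetical re-creation of the fusing step described above."""
    tokens = split_sketch(seq)
    # Fuse a leading modification token onto the first residue token.
    if combine_n_term and len(tokens) > 1 and tokens[0].startswith("["):
        tokens = [tokens[0] + tokens[1]] + tokens[2:]
    return tokens


seq = "[UNIMOD:1]-PEPM[UNIMOD:35]K"
fused = tokenize_helper_sketch(seq)
separate = tokenize_helper_sketch(seq, combine_n_term=False)
```

With fusing enabled, the token count equals the number of residues, which keeps the later sequence-length column aligned with peptide length.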

casanovoutils.denovoutils.tokenize_sequences(data_df: polars.DataFrame, seq_column: str, out_prefix: str | None = None, combine_n_term: bool = True, residues_path: os.PathLike | None = None, replace_isoleucine_with_leucine: bool = True) → polars.DataFrame#

Tokenize a peptide sequence column and append token and length columns.

Loads residue masses via get_residues(), constructs an MskbPeptideTokenizer, and applies tokenize_helper() to each value in seq_column. Two new columns are added to the DataFrame: {out_prefix}_tokens (a list of token strings) and {out_prefix}_sequence_len (the number of tokens).

Parameters:

data_df (pl.DataFrame) – Input DataFrame containing the sequence column to tokenize.
seq_column (str) – Name of the column holding peptide sequence strings.
out_prefix (str, optional) – Prefix for the output columns. If None (default), the portion of seq_column before the first underscore is used.
combine_n_term (bool, optional) – Passed through to tokenize_helper(). If True (default), N-terminal modification tokens are merged with the first residue.
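residues_path (PathLike, optional) – Path to a residue mass YAML file. If None (default), the bundled residues.yaml is used.
replace_isoleucine_with_leucine (bool, optional) – If True (default), isoleucine (I) is replaced with leucine (L) when tokenizing, so the two isobaric residues are not distinguished.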

Returns:

The input DataFrame with two additional columns: {out_prefix}_tokens and {out_prefix}_sequence_len.

Return type:

pl.DataFrame

casanovoutils.denovoutils.read_dataframe(df_path: DfPath) → polars.DataFrame#

Read a DataFrame from a file path, inferring the format from the extension.

Parameters:

df_path (DfPath) – Path to a .parquet / .pq, .csv, or .tsv file, or an already-loaded polars.DataFrame, which is returned as-is.

Returns:

The loaded DataFrame.

Return type:

pl.DataFrame

Raises:

ValueError – If the file extension is not one of the supported types.

casanovoutils.denovoutils.get_mztab_df(mztab_path: DfPath, out_path: os.PathLike | None = None) → polars.DataFrame#

Load the spectrum match table from an mzTab file into a Polars DataFrame.

If mztab_path is already a DataFrame, it is returned as-is. Otherwise, the file is parsed with Pyteomics, converted from pandas, and given a row index. All columns are prefixed with mztab_.

Parameters:

mztab_path (DfPath) – Path to an mzTab file, or an already-loaded polars.DataFrame.
out_path (PathLike, optional) – If provided, the resulting DataFrame is written to this path before being returned. The format is inferred from the file extension.

Returns:

A DataFrame with one row per spectrum match and columns prefixed with mztab_.

Return type:

pl.DataFrame

casanovoutils.denovoutils.get_ground_truth_df(mgf_path: DfPath | list[DfPath], mztab_path: DfPath, out_path: os.PathLike | None = None) → polars.DataFrame#

Join MGF PSM metadata with mzTab spectrum match annotations.

Loads both sources, aligns them on the MGF spectrum index encoded in the mzTab spectra_ref field, performs a left join, and drops all temporary tmp_ columns from the result.

Parameters:

mgf_path (DfPath or list of DfPath) – Path to an MGF file, an already-loaded polars.DataFrame, or a list of either.
mztab_path (DfPath) – Path to the mzTab file, or an already-loaded polars.DataFrame.
out_path (PathLike, optional) – If provided, the resulting DataFrame is written to this path before being returned. The format is inferred from the file extension.

Returns:

A DataFrame containing all MGF parameter columns (prefixed mgf_) left-joined with mzTab annotation columns (prefixed mztab_).

Return type:

pl.DataFrame
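
The alignment step can be sketched on plain row dicts. Several things here are assumptions made for illustration: that spectra_ref looks like "ms_run[1]:index=N", that N is the zero-based position of the spectrum in the MGF file, that the MGF rows form the left side of the join, and that each spectrum has at most one match. spectrum_index and left_join_on_index are hypothetical names.

```python
import re


def spectrum_index(spectra_ref: str) -> int:
    """Pull the integer after 'index=' out of a spectra_ref string."""
    match = re.search(r"index=(\d+)", spectra_ref)
    if match is None:
        raise ValueError(f"No spectrum index in {spectra_ref!r}")
    return int(match.group(1))


def left_join_on_index(mgf_rows: list[dict], mztab_rows: list[dict]) -> list[dict]:
    """Hypothetical left join: every MGF row is kept, matched or not."""
    mztab_columns = sorted({key for row in mztab_rows for key in row})
    # Assumption: at most one mzTab row per spectrum index.
    by_index = {spectrum_index(row["mztab_spectra_ref"]): row for row in mztab_rows}
    # Unmatched MGF rows get None in every mzTab column.
    empty = dict.fromkeys(mztab_columns)
    return [
        {**mgf_row, **by_index.get(index, empty)}
        for index, mgf_row in enumerate(mgf_rows)
    ]


mgf_rows = [{"mgf_title": "scan=1"}, {"mgf_title": "scan=2"}, {"mgf_title": "scan=3"}]
mztab_rows = [
    {"mztab_spectra_ref": "ms_run[1]:index=0", "mztab_sequence": "PEPK"},
    {"mztab_spectra_ref": "ms_run[1]:index=2", "mztab_sequence": "LSEK"},
]
joined = left_join_on_index(mgf_rows, mztab_rows)
```

The second spectrum has no mzTab entry, so its mztab_ columns are None in the joined result, which mirrors the nulls a left join produces in Polars.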

casanovoutils.denovoutils.COMMANDS: casanovoutils.types.Commands#

casanovoutils.denovoutils.main() → None#

Configure logging and expose data loading functions as a CLI.

Sets up a stdout logger at INFO level, then delegates to fire.Fire() which maps subcommands to their corresponding functions:

get_mgf_psms → get_mgf_psms_df()
get_mztab → get_mztab_df()
get_groundtruth → get_ground_truth_df()

Examples

python module.py get_mgf_psms path/to/file.mgf --out_path out.parquet
python module.py get_mztab path/to/file.mztab --out_path out.parquet
python module.py get_groundtruth path/to/file.mgf path/to/file.mztab --out_path out.parquet