denovoutils
Load and join PSM data from MGF and mzTab files into Polars DataFrames.
Includes helpers for tokenizing peptide sequences and writing DataFrames
to Parquet, CSV, or TSV. Exposed as CLI subcommands via
casanovoutils denovo.
Data loading and preprocessing utilities for MGF and mzTab PSM files.
Provides functions to parse MGF and mzTab files into Polars DataFrames, join predicted and ground-truth annotations, and tokenize peptide sequences for downstream evaluation. Every loader accepts either a file path or an already-loaded DataFrame, so the functions can be composed freely without redundant I/O.
The module is also executable as a CLI via python -m casanovoutils.utils
(or the installed casanovoutils entry point), exposing get_mgf_psms,
get_mztab, and get_groundtruth as subcommands.
- casanovoutils.denovoutils.get_ground_truth_df(mgf_path: PathLike | DataFrame | list[PathLike | DataFrame], mztab_path: PathLike | DataFrame, out_path: PathLike | None = None) → DataFrame
Join MGF PSM metadata with mzTab spectrum match annotations.
Loads both sources, aligns them on the MGF spectrum index encoded in the mzTab spectra_ref field, performs a left join, and drops all temporary tmp_ columns from the result.
- Parameters:
mgf_path (PathLike or list of PathLike) – Path to the MGF file, or an already-loaded polars.DataFrame.
mztab_path (PathLike) – Path to the mzTab file, or an already-loaded polars.DataFrame.
out_path (PathLike, optional) – If provided, the resulting DataFrame is written to this path before being returned. The format is inferred from the file extension.
- Returns:
A DataFrame containing all MGF parameter columns (prefixed mgf_) left-joined with mzTab annotation columns (prefixed mztab_).
- Return type:
pl.DataFrame
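The alignment step can be pictured with a small pure-Python sketch. It assumes spectra_ref values of the form ms_run[1]:index=N, the usual mzTab reference style for MGF files, and every name in it (spectrum_index, left_join_on_index, the sample columns) is hypothetical. The module itself works on Polars DataFrames, and its actual key extraction may differ.

```python
import re

# Assumed spectra_ref layout: "ms_run[1]:index=3" -> MGF spectrum index 3.
_INDEX_RE = re.compile(r"index=(\d+)$")


def spectrum_index(spectra_ref: str) -> int:
    """Pull the zero-based MGF spectrum index out of a spectra_ref string."""
    match = _INDEX_RE.search(spectra_ref)
    if match is None:
        raise ValueError(f"no index found in spectra_ref: {spectra_ref!r}")
    return int(match.group(1))


def left_join_on_index(mgf_rows, mztab_rows):
    """Left-join mzTab rows onto MGF rows by spectrum position.

    Unmatched MGF rows keep None in every mzTab column. For brevity this
    keeps only one mzTab row per index, unlike a true left join.
    """
    mztab_columns = sorted({key for row in mztab_rows for key in row})
    by_index = {spectrum_index(row["mztab_spectra_ref"]): row for row in mztab_rows}
    joined = []
    for position, mgf_row in enumerate(mgf_rows):
        match = by_index.get(position, {})
        merged = dict(mgf_row)
        merged.update({col: match.get(col) for col in mztab_columns})
        joined.append(merged)
    return joined


# Hypothetical sample data: two spectra, only the second has a match.
mgf_rows = [{"mgf_title": "scan_a"}, {"mgf_title": "scan_b"}]
mztab_rows = [{"mztab_spectra_ref": "ms_run[1]:index=1", "mztab_sequence": "PEPTIDE"}]
joined = left_join_on_index(mgf_rows, mztab_rows)
```

Because the join is a left join, every MGF spectrum survives in the output even when no prediction was made for it.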
- casanovoutils.denovoutils.get_mgf_psms_df(mgf_path: PathLike | DataFrame, out_path: PathLike | None = None, meta_data_only: bool = True) → DataFrame
Load PSM metadata from an MGF file into a Polars DataFrame.
If mgf_path is already a polars.DataFrame, it is returned as-is (and optionally written to out_path). Otherwise, the MGF file is parsed with Pyteomics, per-spectrum parameters are extracted via process_spectrum(), and all columns are prefixed with mgf_.
- Parameters:
mgf_path (DfPath) – Path to an MGF file, or an already-loaded polars.DataFrame.
out_path (PathLike, optional) – If provided, the resulting DataFrame is written to this path before being returned. The format is inferred from the file extension via write_dataframe().
meta_data_only (bool, optional) – Passed through to process_spectrum(). If True (default), only scalar spectrum metadata is loaded (no m/z or intensity arrays). If False, mgf_intensity_array and mgf_m_z_array columns are included in the returned DataFrame.
- Returns:
A DataFrame with one row per spectrum and columns prefixed with mgf_, including an mgf_n_peaks column.
- Return type:
pl.DataFrame
- Raises:
ValueError – Propagated from write_dataframe() if out_path has an unsupported file extension.
- casanovoutils.denovoutils.get_mztab_df(mztab_path: PathLike | DataFrame, out_path: PathLike | None = None) → DataFrame
Load the spectrum match table from an mzTab file into a Polars DataFrame.
If mztab_path is already a DataFrame, it is returned as-is. Otherwise, the file is parsed with Pyteomics, converted from pandas, and given a row index. All columns are prefixed with mztab_.
- Parameters:
mztab_path (PathLike) – Path to an mzTab file, or an already-loaded polars.DataFrame.
out_path (PathLike, optional) – If provided, the resulting DataFrame is written to this path before being returned. The format is inferred from the file extension.
- Returns:
A DataFrame with one row per spectrum match and columns prefixed with mztab_.
- Return type:
pl.DataFrame
- casanovoutils.denovoutils.main() → None
Configure logging and expose data loading functions as a CLI.
Sets up a stdout logger at INFO level, then delegates to fire.Fire(), which maps subcommands to their corresponding functions:
get_mgf_psms → get_mgf_psms_df()
get_mztab → get_mztab_df()
get_groundtruth → get_merged_groundtruth_df()
Examples
python module.py get_mgf_psms path/to/file.mgf --out_path out.parquet
python module.py get_mztab path/to/file.mztab --out_path out.parquet
python module.py get_groundtruth path/to/file.mgf path/to/file.mztab --out_path out.parquet
- casanovoutils.denovoutils.process_spectrum(spectrum: list[dict[str, Any]], meta_data_only: bool = True) → dict[str, dict[str, Any]]
Extract and augment parameter metadata from a single spectrum.
Retrieves the params dict from a Pyteomics spectrum object and annotates it with the number of peaks in the spectrum. If meta_data_only is False, the intensity and m/z arrays are also included in the output.
- Parameters:
spectrum (PyteomicsSpectrum) – A spectrum dict as returned by pyteomics.mgf.read, containing at least a "params" key, an "m/z array" key, and an "intensity array" key.
meta_data_only (bool, optional) – If True, only scalar metadata is returned (no spectral arrays). If False, "intensity_array" and "m_z_array" are added to the output dict.
- Returns:
The spectrum’s parameter dict with an added "n_peaks" entry, and optionally "intensity_array" and "m_z_array" entries.
- Return type:
dict[str, Any]
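As a rough illustration of the behaviour described above, the following sketch reimplements the idea on a hand-built spectrum dict. The name process_spectrum_sketch is hypothetical, and two details are assumptions rather than documented facts: that the params dict is copied, and that n_peaks is counted from the m/z array.

```python
from typing import Any


def process_spectrum_sketch(spectrum: dict[str, Any], meta_data_only: bool = True) -> dict[str, Any]:
    """Return the params dict annotated with the peak count."""
    out = dict(spectrum["params"])  # assumption: copy, so the input is not mutated
    out["n_peaks"] = len(spectrum["m/z array"])  # assumption: counted from m/z values
    if not meta_data_only:
        out["intensity_array"] = list(spectrum["intensity array"])
        out["m_z_array"] = list(spectrum["m/z array"])
    return out


# Hand-built stand-in for one spectrum as Pyteomics would yield it
# (real spectra carry NumPy arrays rather than plain lists).
spectrum = {
    "params": {"title": "scan_a", "charge": [2]},
    "m/z array": [100.1, 200.2, 300.3],
    "intensity array": [10.0, 20.0, 5.0],
}
meta = process_spectrum_sketch(spectrum)
full = process_spectrum_sketch(spectrum, meta_data_only=False)
```

Keeping the arrays out by default keeps the resulting table small when only per-spectrum metadata is needed.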
- casanovoutils.denovoutils.read_dataframe(df_path: PathLike | DataFrame) → DataFrame
Read a DataFrame from a file path, inferring the format from the extension.
- Parameters:
df_path (DfPath) – Path to a .parquet/.pq, .csv, or .tsv file, or an already-loaded polars.DataFrame which is returned as-is.
- Returns:
The loaded DataFrame.
- Return type:
pl.DataFrame
- Raises:
ValueError – If the file extension is not one of the supported types.
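Both read_dataframe() and write_dataframe() choose a format from the file extension and raise ValueError otherwise. A minimal sketch of that dispatch, using a hypothetical infer_format helper and assuming case-insensitive matching (which the documentation does not state), might look like this:

```python
from pathlib import Path

# Extension -> format name, following the extensions listed in this module's docs.
_FORMATS = {".parquet": "parquet", ".pq": "parquet", ".csv": "csv", ".tsv": "tsv"}


def infer_format(path) -> str:
    """Return the table format implied by the file extension."""
    suffix = Path(path).suffix.lower()  # assumption: extensions match case-insensitively
    try:
        return _FORMATS[suffix]
    except KeyError:
        raise ValueError(f"unsupported file extension: {suffix!r}") from None
```

A reader or writer would then branch on the returned name, for example using a tab separator for the "tsv" case.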
- casanovoutils.denovoutils.tokenize_helper(seq: str, tokenizer: PeptideTokenizer, combine_n_term: bool = True) → list[str]
Split a peptide sequence into tokens.
Delegates to tokenizer.split and, when combine_n_term is True, fuses a leading modification token (e.g. "[UNIMOD:x]") onto the first residue token so that the modification is not a stand-alone element.
- Parameters:
seq (str) – A peptide sequence string, optionally containing modification annotations recognised by tokenizer.
tokenizer (depthcharge.tokenizers.PeptideTokenizer) – A tokenizer instance used to split the sequence.
combine_n_term (bool, optional) – If True (default), merge a leading modification token with the first residue token.
- Returns:
Ordered list of token strings representing the peptide.
- Return type:
list[str]
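The N-terminal fusing rule is easiest to see on a concrete sequence. The sketch below swaps in a small regex splitter for tokenizer.split, so split_sketch and tokenize_sketch are hypothetical stand-ins, and the exact token shapes produced by the real depthcharge tokenizer may differ.

```python
import re

# Hypothetical stand-in for tokenizer.split: an N-terminal modification
# ("[...]-") or a residue letter with an optional bracketed modification.
_TOKEN_RE = re.compile(r"\[[^\]]+\]-|[A-Z](?:\[[^\]]+\])?")


def split_sketch(seq: str) -> list[str]:
    """Split a peptide string into modification and residue tokens."""
    return _TOKEN_RE.findall(seq)


def tokenize_sketch(seq: str, combine_n_term: bool = True) -> list[str]:
    """Split a sequence and optionally fuse a leading modification token."""
    tokens = split_sketch(seq)
    # Fuse a leading modification token onto the first residue token.
    if combine_n_term and len(tokens) > 1 and tokens[0].startswith("["):
        tokens = [tokens[0] + tokens[1]] + tokens[2:]
    return tokens


fused = tokenize_sketch("[UNIMOD:1]-PEM[UNIMOD:35]K")
separate = tokenize_sketch("[UNIMOD:1]-PEM[UNIMOD:35]K", combine_n_term=False)
```

With fusing enabled the token count equals the number of residues, which keeps sequence lengths comparable between modified and unmodified peptides.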
- casanovoutils.denovoutils.tokenize_sequences(data_df: DataFrame, seq_column: str, out_prefix: str | None = None, combine_n_term: bool = True, residues_path: PathLike | None = None, replace_isoleucine_with_leucine: bool = True) → DataFrame
Tokenize a peptide sequence column and append token and length columns.
Loads residue masses via get_residues(), constructs an MskbPeptideTokenizer, and applies tokenize_helper() to each value in seq_column. Two new columns are added to the DataFrame: {out_prefix}_tokens (a list of token strings) and {out_prefix}_sequence_len (the number of tokens).
- Parameters:
data_df (pl.DataFrame) – Input DataFrame containing the sequence column to tokenize.
seq_column (str) – Name of the column holding peptide sequence strings.
out_prefix (str, optional) – Prefix for the output columns. If None (default), the portion of seq_column before the first underscore is used.
combine_n_term (bool, optional) – Passed through to tokenize_helper(). If True (default), N-terminal modification tokens are merged with the first residue.
residues_path (PathLike, optional) – Path to a residue mass YAML file. If None (default), the bundled residues.yaml is used.
- Returns:
The input DataFrame with two additional columns: {out_prefix}_tokens and {out_prefix}_sequence_len.
- Return type:
pl.DataFrame
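The column-naming rule can be sketched on plain dict rows. Here default_prefix and add_token_columns are hypothetical helpers, and list is used as a trivial one-token-per-character tokenizer; the module itself operates on a Polars DataFrame with tokenize_helper().

```python
def default_prefix(seq_column: str) -> str:
    """Portion of the column name before the first underscore."""
    return seq_column.split("_", 1)[0]


def add_token_columns(rows, seq_column, tokenize, out_prefix=None):
    """Return copies of rows with {prefix}_tokens and {prefix}_sequence_len added."""
    prefix = default_prefix(seq_column) if out_prefix is None else out_prefix
    out = []
    for row in rows:
        tokens = tokenize(row[seq_column])
        out.append({**row, f"{prefix}_tokens": tokens, f"{prefix}_sequence_len": len(tokens)})
    return out


# Hypothetical single-row table; `list` splits the string into characters.
rows = [{"mztab_sequence": "PEK"}]
result = add_token_columns(rows, "mztab_sequence", tokenize=list)
```

For a column named mztab_sequence the derived prefix is mztab, so the new columns are mztab_tokens and mztab_sequence_len.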
- casanovoutils.denovoutils.write_dataframe(data_df: DataFrame, out_path: PathLike) → None
Write a DataFrame to a file, inferring the format from the extension.
- Parameters:
data_df (pl.DataFrame) – The DataFrame to write.
out_path (PathLike) – Destination path. The file format is inferred from the extension: .parquet/.pq for Parquet, .csv for comma-separated, and .tsv for tab-separated.
- Raises:
ValueError – If the file extension is not one of the supported types.