casanovoutils.preccov

Precision-coverage computation for peptide and amino acid evaluation.

Provides an end-to-end pipeline that takes predicted and ground truth PSM DataFrames, aligns token sequences with gap insertion, and computes cumulative precision-coverage (Prec-Cov) curves. Results can be exported as DataFrames or rendered as matplotlib figures.

The main entry points are:

get_prec_cov_df() — builds a precision-coverage DataFrame at either peptide or amino acid level.
graph_prec_cov() — plots pre-computed precision-coverage DataFrames.
GraphPrecCov — stateful plot builder, intended for programmatic or CLI use via Fire.

The module is also executable as a CLI via python -m casanovoutils.preccov (or the installed casanovoutils entry point), exposing get_pc_df and graph_prec_cov as subcommands.

Attributes

COMMANDS

Classes

GraphPrecCov

Plot and compare peptide-level precision-coverage (Prec-Cov) curves.

Functions

`mutate_row_as_dict`(→ dict[str, Any])	Align predicted and ground truth token sequences within a single row dict.
`calc_precision_coverage`(→ polars.DataFrame)	Compute cumulative precision and coverage curves sorted by score.
`load_ground_truth_df`(→ polars.DataFrame)	Load or construct a ground truth PSM DataFrame.
`fill_null_columns`(→ polars.DataFrame)	Replace null values in score and sequence columns with safe defaults.
`tokenize_and_parse_scores`(→ polars.DataFrame)	Tokenize predicted and ground truth sequences and parse per-AA score strings.
`align_and_explode`(→ polars.DataFrame)	Align predicted and ground truth token sequences and explode to per-AA rows.
`get_prec_cov_df`(→ polars.DataFrame)	Build a precision-coverage DataFrame from predicted and ground truth PSMs.
`graph_prec_cov`(→ None)	Plot precision-coverage curves from one or more pre-computed DataFrames.
`main`(→ None)	CLI entry point.

Module Contents

class casanovoutils.preccov.GraphPrecCov

Plot and compare peptide-level precision-coverage (Prec-Cov) curves.

This class accumulates multiple datasets onto a single precision-coverage plot. For each dataset, predicted peptide correctness and scores are extracted via get_ground_truth(), and a precision-coverage curve is computed using prec_cov(). The area under the precision-coverage curve (AUPC) is displayed in the legend.

The class is designed for command-line use with Fire, so multiple datasets can be added in a single process before saving or showing the figure.

Parameters:

fig_width (float, default=4.0) – Width of the matplotlib figure in inches.
fig_height (float, default=4.0) – Height of the matplotlib figure in inches.
fig_dpi (int, default=150) – Figure resolution in dots per inch.
legend_border (bool, default=False) – Whether to draw a border around the legend frame.
legend_location (str, default="lower left") – Legend location string passed to matplotlib.axes.Axes.legend.
ax_x_label (str, default="Coverage") – Label for the x-axis.
ax_y_label (str, default="Precision") – Label for the y-axis.
ax_title (str, default="") – Base title for the plot. “(Amino Acid)” is appended automatically.

Notes

Each call to add_series() adds a new curve to the same axes. Use clear() to reset the figure.

All commands operate on the same instance, so state (the accumulated curves) is preserved.

fig_width: float = 4.0

fig_height: float = 4.0

fig_dpi: int = 150

legend_border: bool = False

legend_location: str = 'lower left'

ax_x_label: str = 'Coverage'

ax_y_label: str = 'Precision'

ax_title: str = ''

add_series(pc_df: polars.DataFrame, series_name: str, color: str | None = None, linestyle: str | None = None) → None

Add a precision-coverage curve for a single dataset to the plot.

Extracts precision and coverage columns from pc_df, computes the area under the precision-coverage curve (AUPC) via the trapezoidal rule, and plots the curve with series_name and the AUPC value in the legend label.

Parameters:

pc_df (pl.DataFrame) – A DataFrame containing Constants.precision_column and Constants.coverage_column columns, as produced by calc_precision_coverage().
series_name (str) – Display name for this dataset in the plot legend.
color (str, optional) – Line color passed to matplotlib.axes.Axes.plot. If None (default), matplotlib’s automatic color cycling is used.
linestyle (str, optional) – Line style passed to matplotlib.axes.Axes.plot (e.g. "-", "--", ":"). If None (default), matplotlib’s default solid line style is used.

Return type:

None
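
As a rough illustration of the trapezoidal AUPC computation described for add_series(), the sketch below works on plain Python lists instead of a polars DataFrame. The function name trapezoid_area, the sample values, and the legend label format are assumptions made for this example; they are not taken from the library.

```python
def trapezoid_area(coverage, precision):
    """Area under a precision-coverage curve via the trapezoidal rule."""
    if len(coverage) != len(precision):
        raise ValueError("coverage and precision must have equal length")
    area = 0.0
    for i in range(1, len(coverage)):
        width = coverage[i] - coverage[i - 1]                   # step along the x-axis
        mean_height = (precision[i] + precision[i - 1]) / 2.0   # average of the two y-values
        area += width * mean_height
    return area


# Hypothetical curve: four ranked predictions, the third one incorrect.
coverage = [0.25, 0.5, 0.75, 1.0]
precision = [1.0, 1.0, 2 / 3, 0.75]

aupc = trapezoid_area(coverage, precision)

# Assumed label format; the real legend text may be laid out differently.
label = f"example (AUPC = {aupc:.3f})"
```

Note that the area only covers the coverage range actually present in the data, so a curve that starts at coverage 0.25 contributes nothing for the interval below that point.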

clear() → None

Reset the figures and axes to blank precision-coverage plots.

Creates two matplotlib figures, each with its own axes:

1) amino-acid-level precision-coverage plot
2) peptide-level precision-coverage plot

Return type:

None

save(save_path: os.PathLike) → None

Save the current plot to a file.

Parameters:

save_path (PathLike) – Output file path. The file extension (e.g., .png, .pdf, .svg) determines the format written by matplotlib.

Return type:

None

show() → None

Display the current precision-coverage plot.

Return type:

None

casanovoutils.preccov.mutate_row_as_dict(tie_break_suffix: bool, row: dict[str, Any]) → dict[str, Any]

Align predicted and ground truth token sequences within a single row dict.

Calls align_tokens_with_gaps() on the predicted tokens, ground truth tokens, and per-amino-acid scores from the row, then mutates the row in place with the aligned sequences, aligned scores, and a positional index list.

Parameters:

tie_break_suffix (bool) – Passed through to align_tokens_with_gaps(). Controls tie-breaking behavior when the gap and no-gap paths score equally during traceback.
row (dict[str, Any]) – A single row represented as a dict, as produced by DataFrame.iter_rows(named=True).

Returns:

The same row dict with Constants.predicted_tokens, Constants.ground_truth_tokens, Constants.aa_scores_column, and Constants.aa_idx_column replaced by their gap-aligned counterparts.

Return type:

dict[str, Any]
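
The internals of align_tokens_with_gaps() are not shown on this page. To make the idea of gap-aligned tokens, scores, and positional indices concrete, here is a minimal sketch based on a generic longest-common-subsequence alignment. The gap symbol "-", the score assigned to a gap in the prediction (gap_score), and the fixed tie-breaking rule (prefer the no-gap move) are all assumptions; the tie_break_suffix option is not modeled.

```python
GAP = "-"  # hypothetical gap marker; the library may use a different symbol


def align_with_gaps(pred, truth, scores, gap_score=0.0):
    """Align two token lists, inserting GAP where a token has no partner."""
    n, m = len(pred), len(truth)
    # dp[i][j] = maximum number of matching tokens when aligning pred[:i] with truth[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = dp[i - 1][j - 1] + (pred[i - 1] == truth[j - 1])
            dp[i][j] = max(diagonal, dp[i - 1][j], dp[i][j - 1])

    # Traceback from the bottom-right corner; on ties the no-gap (diagonal) move wins.
    a_pred, a_truth, a_scores = [], [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (pred[i - 1] == truth[j - 1]):
            a_pred.append(pred[i - 1])
            a_truth.append(truth[j - 1])
            a_scores.append(scores[i - 1])
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j]:
            # Predicted token without a partner: gap on the ground truth side.
            a_pred.append(pred[i - 1])
            a_truth.append(GAP)
            a_scores.append(scores[i - 1])
            i -= 1
        else:
            # Ground truth token without a partner: gap on the predicted side.
            a_pred.append(GAP)
            a_truth.append(truth[j - 1])
            a_scores.append(gap_score)
            j -= 1

    a_pred.reverse()
    a_truth.reverse()
    a_scores.reverse()
    return a_pred, a_truth, a_scores, list(range(len(a_pred)))


aligned_pred, aligned_truth, aligned_scores, aa_idx = align_with_gaps(
    list("PEPTIDE"), list("PETIDE"), [0.9, 0.9, 0.3, 0.8, 0.8, 0.9, 0.9]
)
```

All four returned lists have the same length, which is the property a later explode step would rely on.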

casanovoutils.preccov.calc_precision_coverage(pc_df: polars.DataFrame, score_col: str) → polars.DataFrame

Compute cumulative precision and coverage curves sorted by score.

Sorts the DataFrame by score_col in descending order, computes a boolean correctness column indicating where the predicted sequence matches the ground truth, then calculates cumulative precision and coverage at each rank threshold.

Parameters:

pc_df (pl.DataFrame) – Input DataFrame containing predicted and ground truth sequence columns and a score column.
score_col (str) – Name of the column to sort by. Typically either the peptide-level score column or the per-amino-acid score column depending on whether evaluation is at peptide or amino acid level.

Returns:

The input DataFrame sorted by score_col with three additional columns: "pc_is_correct" (bool), "pc_precision" (float), and "pc_coverage" (float).

Return type:

pl.DataFrame
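
The sort-then-accumulate procedure can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the library's code: it assumes precision at rank k is the fraction of correct rows among the top k, and coverage at rank k is k divided by the total number of rows. The input keys "pred", "truth", and "score" are hypothetical; only the three output column names come from the description above.

```python
def precision_coverage(rows, score_key="score", pred_key="pred", truth_key="truth"):
    """Return rows sorted by descending score with cumulative precision/coverage added."""
    ranked = sorted(rows, key=lambda r: r[score_key], reverse=True)
    total = len(ranked)
    result = []
    n_correct = 0
    for rank, row in enumerate(ranked, start=1):
        is_correct = row[pred_key] == row[truth_key]
        n_correct += is_correct  # bool counts as 0 or 1
        result.append(
            {
                **row,
                "pc_is_correct": is_correct,
                "pc_precision": n_correct / rank,  # assumed definition: correct so far / rank
                "pc_coverage": rank / total,       # assumed definition: rank / number of rows
            }
        )
    return result


# Hypothetical PSMs, deliberately out of score order.
rows = [
    {"pred": "PEPTLDE", "truth": "PEPTLDE", "score": 0.8},
    {"pred": "AAK", "truth": "AGK", "score": 0.4},
    {"pred": "LLR", "truth": "LLR", "score": 0.9},
    {"pred": "GGK", "truth": "GGK", "score": 0.2},
]
curve = precision_coverage(rows)
```

Under these definitions coverage always ends at 1.0, while precision can rise again after a dip, as it does here once the last correct row is included.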

casanovoutils.preccov.load_ground_truth_df(ground_truth_df: casanovoutils.denovoutils.DfPath | None, mgf_df: casanovoutils.denovoutils.DfPath | None, mztab_df: casanovoutils.denovoutils.DfPath | None) → polars.DataFrame

Load or construct a ground truth PSM DataFrame.

If ground_truth_df is provided, it is loaded via read_dataframe(). Otherwise, the ground truth is constructed from the provided MGF and mzTab files via get_ground_truth_df().

Parameters:

ground_truth_df (DfPath, optional) – Path to or an already-loaded ground truth DataFrame.
mgf_df (DfPath, optional) – Path to or an already-loaded MGF PSM DataFrame. Required when ground_truth_df is None.
mztab_df (DfPath, optional) – Path to or an already-loaded mzTab DataFrame. Required when ground_truth_df is None.

Returns:

The loaded or constructed ground truth DataFrame.

Return type:

pl.DataFrame

Raises:

ValueError – If ground_truth_df is None and either mgf_df or mztab_df is also None.
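
The precedence between the three inputs can be summarized with a short sketch. The load and build callables are hypothetical stand-ins for read_dataframe() and get_ground_truth_df(), injected so the example runs without the library; the argument order passed to build and the error message are assumptions.

```python
def resolve_ground_truth(ground_truth, mgf, mztab, *, load, build):
    """Pick the ground truth source, mirroring the documented precedence."""
    if ground_truth is not None:
        # An explicit ground truth wins; mgf and mztab are not consulted.
        return load(ground_truth)
    if mgf is None or mztab is None:
        raise ValueError(
            "Either ground_truth_df or both mgf_df and mztab_df must be provided."
        )
    return build(mgf, mztab)


# Stand-in loaders that just record what they were asked to do.
def fake_load(path):
    return ("loaded", path)


def fake_build(mgf, mztab):
    return ("built", mgf, mztab)


from_file = resolve_ground_truth("gt.parquet", None, None, load=fake_load, build=fake_build)
from_parts = resolve_ground_truth(None, "run.mgf", "run.mztab", load=fake_load, build=fake_build)
```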

casanovoutils.preccov.fill_null_columns(df: polars.DataFrame, pred_col: str) → polars.DataFrame

Replace null values in score and sequence columns with safe defaults.

Fills nulls in the predicted sequence column and the per-amino-acid scores column with empty strings, and nulls in the peptide score column with -1.0.

Parameters:

df (pl.DataFrame) – Input DataFrame containing the columns to fill.
pred_col (str) – Name of the predicted sequence column.

Returns:

The DataFrame with null values replaced.

Return type:

pl.DataFrame
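
To show the effect of these defaults without depending on polars, the sketch below applies the same substitutions to a list of row dicts. The column names "aa_scores" and "score" are hypothetical placeholders for the library's constants, and the function name is invented for this example.

```python
def fill_null_columns_sketch(rows, pred_col, aa_scores_col="aa_scores", score_col="score"):
    """Replace None with '' for sequence/AA-score columns and -1.0 for the peptide score."""
    defaults = {pred_col: "", aa_scores_col: "", score_col: -1.0}
    filled = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        for column, default in defaults.items():
            if column in new_row and new_row[column] is None:
                new_row[column] = default
        filled.append(new_row)
    return filled


# Hypothetical rows: the second spectrum has no prediction at all.
rows = [
    {"sequence": "PEPTLDE", "aa_scores": "0.9,0.8", "score": 0.85},
    {"sequence": None, "aa_scores": None, "score": None},
]
filled = fill_null_columns_sketch(rows, pred_col="sequence")
```

A row filled this way carries an empty predicted sequence, which cannot equal a non-empty ground truth sequence, so it would count as incorrect in a later correctness comparison.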

casanovoutils.preccov.tokenize_and_parse_scores(df: polars.DataFrame, pred_col: str, residues_path: os.PathLike | None, replace_isoleucine_with_leucine: bool) → polars.DataFrame

Tokenize predicted and ground truth sequences and parse per-AA score strings.

Applies tokenize_sequences() to both the ground truth and predicted sequence columns, then parses the comma-separated per-amino-acid score strings in the aa scores column into lists of floats.

Parameters:

df (pl.DataFrame) – Input DataFrame containing sequence and score columns.
pred_col (str) – Name of the predicted sequence column.
residues_path (PathLike, optional) – Path to a residue mass YAML file. If None, the bundled residues.yaml is used.
replace_isoleucine_with_leucine (bool) – If True, isoleucine (I) is replaced with leucine (L) during tokenization, treating them as equivalent.

Returns:

The DataFrame with added token columns and the aa scores column converted from comma-separated strings to lists of floats.

Return type:

pl.DataFrame
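
The two transformations can be pictured with a small standalone sketch. The real tokenize_sequences() is driven by a residue YAML file, so the regular expression below is only a hypothetical stand-in that recognizes an uppercase residue letter optionally followed by signed mass shifts such as "+15.995". The function names are invented for this example.

```python
import re

# Hypothetical token pattern; the library's tokenizer may accept other notations.
TOKEN_RE = re.compile(r"[A-Z](?:[+-]\d+(?:\.\d+)?)*")


def tokenize(sequence, replace_isoleucine_with_leucine=True):
    """Split a peptide string into residue tokens, optionally mapping I to L."""
    tokens = TOKEN_RE.findall(sequence)
    if replace_isoleucine_with_leucine:
        # Only the residue letter changes; any modification suffix is kept.
        tokens = ["L" + tok[1:] if tok.startswith("I") else tok for tok in tokens]
    return tokens


def parse_aa_scores(score_string):
    """Turn a comma-separated score string into a list of floats ('' gives [])."""
    if not score_string:
        return []
    return [float(part) for part in score_string.split(",")]


tokens = tokenize("PEPTM+15.995IDE")
raw_tokens = tokenize("PEPTIDE", replace_isoleucine_with_leucine=False)
scores = parse_aa_scores("0.9,0.8,0.95")
```

Handling the empty string explicitly matters because nulls in the score column are filled with empty strings earlier in the pipeline, and float("") would raise a ValueError.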

casanovoutils.preccov.align_and_explode(df: polars.DataFrame, tie_break_suffix: bool) → polars.DataFrame

Align predicted and ground truth token sequences and explode to per-AA rows.

Iterates over each row, aligns the predicted and ground truth token sequences with gap insertion via mutate_row_as_dict(), then explodes the resulting list columns so that each row corresponds to a single amino acid position.

Parameters:

df (pl.DataFrame) – Input DataFrame with tokenized predicted and ground truth sequence columns and parsed per-amino-acid scores.
tie_break_suffix (bool) – Passed through to mutate_row_as_dict(). Controls tie-breaking behavior when the gap and no-gap paths score equally during traceback.

Returns:

A DataFrame exploded to one row per aligned amino acid position, with gap characters inserted where sequences do not align.

Return type:

pl.DataFrame
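
The explode step, which turns one row per peptide into one row per aligned position, can be sketched in plain Python as below. The column names and the helper explode_rows are hypothetical, and the input row is assumed to be already gap-aligned so that all list columns have equal length.

```python
def explode_rows(rows, list_cols):
    """Expand each row into one row per position of its list columns."""
    exploded = []
    for row in rows:
        lengths = {len(row[col]) for col in list_cols}
        if len(lengths) != 1:
            raise ValueError("list columns must have equal length within a row")
        for pos in range(lengths.pop()):
            new_row = dict(row)  # scalar columns are repeated on every output row
            for col in list_cols:
                new_row[col] = row[col][pos]
            exploded.append(new_row)
    return exploded


# Hypothetical aligned row: the prediction is missing the residue "B".
aligned = [
    {
        "spectrum_id": "s1",
        "pred_tokens": ["A", "-", "C"],
        "truth_tokens": ["A", "B", "C"],
        "aa_scores": [0.9, 0.0, 0.8],
        "aa_idx": [0, 1, 2],
    }
]
per_aa = explode_rows(aligned, ["pred_tokens", "truth_tokens", "aa_scores", "aa_idx"])
```

After this step a per-position correctness check reduces to comparing two scalar columns, and a gap on either side never matches a residue on the other.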

casanovoutils.preccov.get_prec_cov_df(ground_truth_df: casanovoutils.denovoutils.DfPath | None = None, mgf_df: casanovoutils.denovoutils.DfPath | None = None, mztab_df: casanovoutils.denovoutils.DfPath | None = None, residues_path: casanovoutils.denovoutils.DfPath | None = None, replace_isoleucine_with_leucine: bool = True, aa_level: bool = False, align_tie_beak_suffix: bool = True, out_path: os.PathLike | None = None) → polars.DataFrame

Build a precision-coverage DataFrame from predicted and ground truth PSMs.

Loads or constructs a ground truth DataFrame, tokenizes both predicted and ground truth sequences, parses per-amino-acid scores, and computes precision-coverage metrics. When aa_level is True, sequences are first aligned with gap insertion and then exploded so that each row represents a single amino acid position rather than a full peptide.

Parameters:

ground_truth_df (DfPath, optional) – Path to or an already-loaded ground truth DataFrame. If None, both mgf_df and mztab_df must be provided and the ground truth DataFrame will be constructed via get_ground_truth_df().
mgf_df (DfPath, optional) – Path to or an already-loaded MGF PSM DataFrame. Required when ground_truth_df is None.
mztab_df (DfPath, optional) – Path to or an already-loaded mzTab DataFrame. Required when ground_truth_df is None.
residues_path (DfPath, optional) – Path to a residue mass YAML file passed through to tokenize_sequences(). If None, the bundled residues.yaml is used.
replace_isoleucine_with_leucine (bool, optional) – If True (default), isoleucine (I) is replaced with leucine (L) during tokenization, treating them as equivalent.
aa_level (bool, optional) – If True, perform per-amino-acid alignment via gap insertion and explode the DataFrame so each row corresponds to a single amino acid position. If False (default), metrics are computed at the peptide level using the peptide-level score column.
align_tie_beak_suffix (bool, optional) – Passed through to the alignment step when aa_level is True. Controls tie-breaking behavior when the gap and no-gap paths score equally during traceback. Defaults to True.
out_path (PathLike, optional) – If provided, the resulting DataFrame is written to this path before being returned. The format is inferred from the file extension.

Returns:

A DataFrame with precision and coverage metrics. At peptide level, each row is one PSM; at amino acid level (aa_level=True), each row is one aligned amino acid position.

Return type:

pl.DataFrame

Raises:

ValueError – If ground_truth_df is None and either mgf_df or mztab_df is also None.

casanovoutils.preccov.graph_prec_cov(*pc_df_paths: os.PathLike, out_path: os.PathLike | None = None) → None

Plot precision-coverage curves from one or more pre-computed DataFrames.

Loads each DataFrame from pc_df_paths, adds it as a series to a GraphPrecCov plot using the file stem as the series name, and then saves the figure, displays it, or both.

Parameters:

*pc_df_paths (PathLike) – One or more paths to DataFrames containing Constants.precision_column and Constants.coverage_column columns, as produced by get_prec_cov_df(). The file stem of each path is used as the series label in the legend.
out_path (PathLike, optional) – If provided, the figure is saved to this path. The file extension determines the format (e.g. .png, .pdf, .svg).

Return type:

None

Warns:

Logs a warning if the plot cannot be displayed, which typically occurs when no graphical backend is available (e.g. in a headless environment). In that case, saving via out_path still works normally.
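
Two details of this function lend themselves to a short sketch: deriving series names from file stems, and falling back to a logged warning when display fails. The helper names, the logger name, and the choice to catch a broad Exception are assumptions for illustration; the library may catch a narrower error.

```python
import logging
from pathlib import Path

logger = logging.getLogger("preccov_sketch")  # hypothetical logger name


def series_names(paths):
    """Use each file's stem (name without the final extension) as its legend label."""
    return [Path(p).stem for p in paths]


def show_or_warn(show):
    """Call a display function, logging a warning instead of failing."""
    try:
        show()
    except Exception as exc:  # assumed: which error is caught is not documented here
        logger.warning("Could not display plot: %s", exc)
        return False
    return True


def headless_show():
    # Stand-in for a display call that fails without a graphical backend.
    raise RuntimeError("no display available")


names = series_names(["runs/casanovo_v4.parquet", "other/baseline.csv"])
displayed = show_or_warn(headless_show)
```

Because labels come from file stems, two input files with the same stem in different directories would produce identical legend entries, so distinct file names are advisable.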

casanovoutils.preccov.COMMANDS: casanovoutils.types.Commands

casanovoutils.preccov.main() → None

CLI entry point.