casanovoutils.preccov#

Precision-coverage computation for peptide and amino acid evaluation.

Provides an end-to-end pipeline that takes predicted and ground truth PSM DataFrames, aligns token sequences with gap insertion, and computes cumulative precision-coverage (Prec-Cov) curves. Results can be exported as DataFrames or rendered as matplotlib figures.

The main entry points are:

  • get_prec_cov_df() — builds a precision-coverage DataFrame at either peptide or amino acid level.

  • graph_prec_cov() — plots pre-computed precision-coverage DataFrames.

  • GraphPrecCov — stateful plot builder, intended for programmatic or CLI use via Fire.

The module is also executable as a CLI via python -m casanovoutils.preccov (or the installed casanovoutils entry point), exposing get_pc_df and graph_prec_cov as subcommands.

Attributes#

Classes#

GraphPrecCov

Plot and compare peptide-level precision-coverage (Prec-Cov) curves.

Functions#

mutate_row_as_dict(→ dict[str, Any])

Align predicted and ground truth token sequences within a single row dict.

calc_precision_coverage(→ polars.DataFrame)

Compute cumulative precision and coverage curves sorted by score.

load_ground_truth_df(→ polars.DataFrame)

Load or construct a ground truth PSM DataFrame.

fill_null_columns(→ polars.DataFrame)

Replace null values in score and sequence columns with safe defaults.

tokenize_and_parse_scores(→ polars.DataFrame)

Tokenize predicted and ground truth sequences and parse per-AA score strings.

align_and_explode(→ polars.DataFrame)

Align predicted and ground truth token sequences and explode to per-AA rows.

get_prec_cov_df(→ polars.DataFrame)

Build a precision-coverage DataFrame from predicted and ground truth PSMs.

graph_prec_cov(→ None)

Plot precision-coverage curves from one or more pre-computed DataFrames.

main(→ None)

CLI entry point.

Module Contents#

class casanovoutils.preccov.GraphPrecCov#

Plot and compare peptide-level precision-coverage (Prec-Cov) curves.

This class accumulates multiple datasets onto a single precision-coverage plot. Each dataset is supplied to add_series() as a pre-computed precision-coverage DataFrame, and the area under the precision-coverage curve (AUPC) is displayed in the legend.

The class is designed for command-line use with Fire, so multiple datasets can be added in a single process before saving or showing the figure.

Parameters:
  • fig_width (float, default=4.0) – Width of the matplotlib figure in inches.

  • fig_height (float, default=4.0) – Height of the matplotlib figure in inches.

  • fig_dpi (int, default=150) – Figure resolution in dots per inch.

  • legend_border (bool, default=False) – Whether to draw a border around the legend frame.

  • legend_location (str, default="lower left") – Legend location string passed to matplotlib.axes.Axes.legend.

  • ax_x_label (str, default="Coverage") – Label for the x-axis.

  • ax_y_label (str, default="Precision") – Label for the y-axis.

  • ax_title (str, default="") – Base title for the plot. “(Amino Acid)” is appended automatically.

Notes

Each call to add_series() adds a new curve to the same axes. Use clear() to reset the figure.

All commands operate on the same instance, so state (the accumulated curves) is preserved.

fig_width: float = 4.0#
fig_height: float = 4.0#
fig_dpi: int = 150#
legend_border: bool = False#
legend_location: str = 'lower left'#
ax_x_label: str = 'Coverage'#
ax_y_label: str = 'Precision'#
ax_title: str = ''#
add_series(pc_df: polars.DataFrame, series_name: str, color: str | None = None, linestyle: str | None = None) None#

Add a precision-coverage curve for a single dataset to the plot.

Extracts precision and coverage columns from pc_df, computes the area under the precision-coverage curve (AUPC) via the trapezoidal rule, and plots the curve with series_name and the AUPC value in the legend label.

Parameters:
  • pc_df (pl.DataFrame) – A DataFrame containing Constants.precision_column and Constants.coverage_column columns, as produced by calc_precision_coverage().

  • series_name (str) – Display name for this dataset in the plot legend.

  • color (str, optional) – Line color passed to matplotlib.axes.Axes.plot. If None (default), matplotlib’s automatic color cycling is used.

  • linestyle (str, optional) – Line style passed to matplotlib.axes.Axes.plot (e.g. "-", "--", ":"). If None (default), matplotlib’s default solid line style is used.

Return type:

None
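
The AUPC computation described for add_series() can be illustrated with a plain-Python trapezoidal rule. This is a minimal sketch rather than the library's implementation: the helper name trapezoid_area, the sample values, and the legend label format are assumptions made for illustration.

```python
def trapezoid_area(xs, ys):
    """Area under the piecewise-linear curve through the points (xs[i], ys[i])."""
    area = 0.0
    for i in range(1, len(xs)):
        # Width of the interval times the mean height of its two endpoints.
        area += (xs[i] - xs[i - 1]) * (ys[i] + ys[i - 1]) / 2.0
    return area


# Hypothetical curve: four ranked predictions, the third one incorrect.
coverage = [0.25, 0.5, 0.75, 1.0]
precision = [1.0, 1.0, 2 / 3, 0.75]

aupc = trapezoid_area(coverage, precision)

# Hypothetical legend label; the exact format used by add_series() is not documented here.
label = f"model_a (AUPC = {aupc:.3f})"
print(label)
```

Because the integration only spans the coverage values present in the data, a curve whose first point sits at a coverage above zero yields an area smaller than one even when every prediction is correct.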

clear() None#

Reset the figures and axes to blank precision-coverage plots.

Creates two matplotlib figures, each with its own axes: (1) an amino-acid-level precision-coverage plot and (2) a peptide-level precision-coverage plot.

Return type:

None

save(save_path: os.PathLike) None#

Save the current plot to a file.

Parameters:

save_path (PathLike) – Output file path. The file extension (e.g., .png, .pdf, .svg) determines the format written by matplotlib.

Return type:

None

show() None#

Display the current precision-coverage plot.

Return type:

None

casanovoutils.preccov.mutate_row_as_dict(tie_break_suffix: bool, row: dict[str, Any]) dict[str, Any]#

Align predicted and ground truth token sequences within a single row dict.

Calls align_tokens_with_gaps() on the predicted tokens, ground truth tokens, and per-amino-acid scores from the row, then mutates the row in place with the aligned sequences, aligned scores, and a positional index list.

Parameters:
  • tie_break_suffix (bool) – Passed through to align_tokens_with_gaps(). Controls tie-breaking behavior when the gap and no-gap paths score equally during traceback.

  • row (dict[str, Any]) – A single row represented as a dict, as produced by DataFrame.iter_rows(named=True).

Returns:

The same row dict with Constants.predicted_tokens, Constants.ground_truth_tokens, Constants.aa_scores_column, and Constants.aa_idx_column replaced by their gap-aligned counterparts.

Return type:

dict[str, Any]
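
To make the idea of gap-aligned sequences concrete, the sketch below runs a small global alignment over two token lists and carries the per-token scores along. It is not align_tokens_with_gaps(): the scoring scheme (+1 match, -1 mismatch, -1 gap), the gap symbol "-", the score of 0.0 assigned to a gap in the prediction, and the fixed tie-breaking order (diagonal first) are all assumptions chosen for illustration, and the tie_break_suffix option is not modeled.

```python
GAP = "-"  # hypothetical gap symbol


def align_with_gaps(pred, truth, scores, gap_score=0.0):
    """Globally align two token lists, inserting GAP where they do not line up."""
    n, m = len(pred), len(truth)

    def sub(i, j):
        # +1 for a match, -1 for a mismatch (assumed scoring).
        return 1 if pred[i - 1] == truth[j - 1] else -1

    # dp[i][j] holds the best score for aligning pred[:i] with truth[:j].
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = -i
    for j in range(1, m + 1):
        dp[0][j] = -j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(dp[i - 1][j - 1] + sub(i, j),
                           dp[i - 1][j] - 1,
                           dp[i][j - 1] - 1)

    # Trace back from the bottom-right corner, preferring the diagonal on ties.
    a_pred, a_truth, a_scores = [], [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + sub(i, j):
            a_pred.append(pred[i - 1])
            a_truth.append(truth[j - 1])
            a_scores.append(scores[i - 1])
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] - 1:
            # Predicted token with no ground truth counterpart.
            a_pred.append(pred[i - 1])
            a_truth.append(GAP)
            a_scores.append(scores[i - 1])
            i -= 1
        else:
            # Ground truth token with no predicted counterpart.
            a_pred.append(GAP)
            a_truth.append(truth[j - 1])
            a_scores.append(gap_score)
            j -= 1

    a_pred.reverse()
    a_truth.reverse()
    a_scores.reverse()
    return a_pred, a_truth, a_scores, list(range(len(a_pred)))


aligned_pred, aligned_truth, aligned_scores, idx = align_with_gaps(
    list("PETIDE"), list("PEPTIDE"), [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
)
print(aligned_pred)   # ['P', 'E', '-', 'T', 'I', 'D', 'E']
print(aligned_truth)  # ['P', 'E', 'P', 'T', 'I', 'D', 'E']
```

All four returned lists have the same length, which is the property that lets a later step explode them into one row per aligned position.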

casanovoutils.preccov.calc_precision_coverage(pc_df: polars.DataFrame, score_col: str) polars.DataFrame#

Compute cumulative precision and coverage curves sorted by score.

Sorts the DataFrame by score_col in descending order, computes a boolean correctness column indicating where the predicted sequence matches the ground truth, then calculates cumulative precision and coverage at each rank threshold.

Parameters:
  • pc_df (pl.DataFrame) – Input DataFrame containing predicted and ground truth sequence columns and a score column.

  • score_col (str) – Name of the column to sort by. Typically either the peptide-level score column or the per-amino-acid score column depending on whether evaluation is at peptide or amino acid level.

Returns:

The input DataFrame sorted by score_col with three additional columns: "pc_is_correct" (bool), "pc_precision" (float), and "pc_coverage" (float).

Return type:

pl.DataFrame
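
The cumulative computation can be sketched in plain Python over a list of row dicts. This is an illustration under stated assumptions, not the library's polars code: precision at rank k is taken to be the number of correct predictions among the top k divided by k, coverage is taken to be k divided by the total number of rows, and the keys "pred", "truth", and "score" are hypothetical.

```python
def precision_coverage(rows, score_key="score"):
    """Return rows sorted by descending score with cumulative precision and coverage."""
    ranked = sorted(rows, key=lambda r: r[score_key], reverse=True)
    total = len(ranked)
    n_correct = 0
    out = []
    for rank, row in enumerate(ranked, start=1):
        is_correct = row["pred"] == row["truth"]
        n_correct += is_correct  # True counts as 1, False as 0
        out.append({
            **row,
            "pc_is_correct": is_correct,
            "pc_precision": n_correct / rank,  # correct among the top `rank`
            "pc_coverage": rank / total,       # fraction of rows at or above this rank
        })
    return out


# Hypothetical PSMs; the third-highest score belongs to a wrong prediction.
rows = [
    {"pred": "PEPTIDE", "truth": "PEPTIDE", "score": 0.9},
    {"pred": "LESLIE", "truth": "LESLIE", "score": 0.4},
    {"pred": "ACDK", "truth": "ACDR", "score": 0.7},
    {"pred": "GASP", "truth": "GASP", "score": 0.8},
]

curve = precision_coverage(rows)
for r in curve:
    print(r["score"], r["pc_is_correct"], round(r["pc_precision"], 3), r["pc_coverage"])
```

The same routine serves both evaluation levels: only the meaning of a row (one PSM or one aligned amino acid position) and the score column change.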

casanovoutils.preccov.load_ground_truth_df(ground_truth_df: casanovoutils.denovoutils.DfPath | None, mgf_df: casanovoutils.denovoutils.DfPath | None, mztab_df: casanovoutils.denovoutils.DfPath | None) polars.DataFrame#

Load or construct a ground truth PSM DataFrame.

If ground_truth_df is provided, it is loaded via read_dataframe(). Otherwise, the ground truth is constructed from the provided MGF and mzTab files via get_ground_truth_df().

Parameters:
  • ground_truth_df (DfPath, optional) – Path to a ground truth DataFrame, or an already-loaded one.

  • mgf_df (DfPath, optional) – Path to an MGF PSM DataFrame, or an already-loaded one. Required when ground_truth_df is None.

  • mztab_df (DfPath, optional) – Path to an mzTab DataFrame, or an already-loaded one. Required when ground_truth_df is None.

Returns:

The loaded or constructed ground truth DataFrame.

Return type:

pl.DataFrame

Raises:

ValueError – If ground_truth_df is None and either mgf_df or mztab_df is also None.
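
The argument rules above amount to a small decision procedure, sketched here without any file loading. The function name resolve_ground_truth_source and its string return values are hypothetical; only the precedence of ground_truth_df and the ValueError condition come from the description.

```python
def resolve_ground_truth_source(ground_truth_df=None, mgf_df=None, mztab_df=None):
    """Report which inputs would be used to obtain the ground truth."""
    if ground_truth_df is not None:
        # An explicit ground truth takes precedence over MGF and mzTab inputs.
        return "ground_truth_df"
    if mgf_df is None or mztab_df is None:
        raise ValueError(
            "mgf_df and mztab_df are both required when ground_truth_df is None"
        )
    return "mgf_df + mztab_df"


print(resolve_ground_truth_source(ground_truth_df="truth.parquet"))
print(resolve_ground_truth_source(mgf_df="spectra.parquet", mztab_df="psms.parquet"))
```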

casanovoutils.preccov.fill_null_columns(df: polars.DataFrame, pred_col: str) polars.DataFrame#

Replace null values in score and sequence columns with safe defaults.

Fills nulls in the predicted sequence column and the per-amino-acid scores column with empty strings, and nulls in the peptide score column with -1.0.

Parameters:
  • df (pl.DataFrame) – Input DataFrame containing the columns to fill.

  • pred_col (str) – Name of the predicted sequence column.

Returns:

The DataFrame with null values replaced.

Return type:

pl.DataFrame
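
The null-filling rule can be shown on plain row dicts. This is a sketch only: the real function operates on a polars DataFrame, and the column names "aa_scores" and "score" used as defaults here are assumptions standing in for the library's constants.

```python
def fill_nulls(rows, pred_col, aa_scores_col="aa_scores", score_col="score"):
    """Return copies of rows with None replaced by the documented defaults."""
    filled = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        if new_row.get(pred_col) is None:
            new_row[pred_col] = ""        # no predicted sequence
        if new_row.get(aa_scores_col) is None:
            new_row[aa_scores_col] = ""   # no per-amino-acid scores
        if new_row.get(score_col) is None:
            new_row[score_col] = -1.0     # sentinel peptide score
        filled.append(new_row)
    return filled


rows = [
    {"sequence": "PEPTIDE", "aa_scores": "0.9,0.8", "score": 0.85},
    {"sequence": None, "aa_scores": None, "score": None},  # spectrum without a prediction
]
filled = fill_nulls(rows, pred_col="sequence")
print(filled[1])
```

With a sentinel of -1.0, a missing prediction sorts after every non-negative score in a descending sort, and its empty sequence cannot match a non-empty ground truth.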

casanovoutils.preccov.tokenize_and_parse_scores(df: polars.DataFrame, pred_col: str, residues_path: os.PathLike | None, replace_isoleucine_with_leucine: bool) polars.DataFrame#

Tokenize predicted and ground truth sequences and parse per-AA score strings.

Applies tokenize_sequences() to both the ground truth and predicted sequence columns, then parses the comma-separated per-amino-acid score strings in the aa scores column into lists of floats.

Parameters:
  • df (pl.DataFrame) – Input DataFrame containing sequence and score columns.

  • pred_col (str) – Name of the predicted sequence column.

  • residues_path (PathLike, optional) – Path to a residue mass YAML file. If None, the bundled residues.yaml is used.

  • replace_isoleucine_with_leucine (bool) – If True, isoleucine (I) is replaced with leucine (L) during tokenization, treating them as equivalent.

Returns:

The DataFrame with added token columns and the aa scores column converted from comma-separated strings to lists of floats.

Return type:

pl.DataFrame
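
Two of the steps described above, score parsing and the isoleucine/leucine substitution, are simple enough to sketch directly. The helper names are hypothetical, and the token-level substitution assumes unmodified single-letter tokens; the actual tokenization performed by tokenize_sequences() (including how modified residues are handled) is not reproduced here.

```python
def parse_aa_scores(raw):
    """Turn a comma-separated score string into a list of floats."""
    # An empty string (for example, a filled null) yields an empty list.
    return [float(part) for part in raw.split(",") if part.strip()]


def merge_isoleucine(tokens):
    """Replace the token 'I' with 'L' so the two residues compare as equal."""
    return ["L" if tok == "I" else tok for tok in tokens]


print(parse_aa_scores("0.9,0.8,0.75"))       # [0.9, 0.8, 0.75]
print(parse_aa_scores(""))                   # []
print(merge_isoleucine(list("PEPTIDE")))     # ['P', 'E', 'P', 'T', 'L', 'D', 'E']
```

Isoleucine and leucine have identical masses, which is why evaluations commonly treat them as interchangeable.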

casanovoutils.preccov.align_and_explode(df: polars.DataFrame, tie_break_suffix: bool) polars.DataFrame#

Align predicted and ground truth token sequences and explode to per-AA rows.

Iterates over each row, aligns the predicted and ground truth token sequences with gap insertion via mutate_row_as_dict(), then explodes the resulting list columns so that each row corresponds to a single amino acid position.

Parameters:
  • df (pl.DataFrame) – Input DataFrame with tokenized predicted and ground truth sequence columns and parsed per-amino-acid scores.

  • tie_break_suffix (bool) – Passed through to mutate_row_as_dict(). Controls tie-breaking behavior when the gap and no-gap paths score equally during traceback.

Returns:

A DataFrame exploded to one row per aligned amino acid position, with gap characters inserted where sequences do not align.

Return type:

pl.DataFrame
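
The explode step can be pictured with a plain-Python equivalent that turns one row holding equal-length lists into one row per position. The helper name explode_row and the column names are hypothetical; the library performs this on a polars DataFrame.

```python
def explode_row(row, list_cols):
    """Expand a row with equal-length list columns into one row per list element."""
    lengths = {len(row[col]) for col in list_cols}
    if len(lengths) != 1:
        raise ValueError("list columns must all have the same length")
    n = lengths.pop()
    # Scalar columns are repeated on every output row.
    scalars = {k: v for k, v in row.items() if k not in list_cols}
    return [
        {**scalars, **{col: row[col][i] for col in list_cols}}
        for i in range(n)
    ]


# Hypothetical aligned row: the prediction is missing the final residue.
row = {
    "spectrum_id": 7,
    "pred_tokens": ["P", "E", "-"],
    "truth_tokens": ["P", "E", "K"],
    "aa_scores": [0.9, 0.8, 0.0],
    "aa_idx": [0, 1, 2],
}
per_aa = explode_row(row, ["pred_tokens", "truth_tokens", "aa_scores", "aa_idx"])
for r in per_aa:
    print(r)
```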

casanovoutils.preccov.get_prec_cov_df(ground_truth_df: casanovoutils.denovoutils.DfPath | None = None, mgf_df: casanovoutils.denovoutils.DfPath | None = None, mztab_df: casanovoutils.denovoutils.DfPath | None = None, residues_path: casanovoutils.denovoutils.DfPath | None = None, replace_isoleucine_with_leucine: bool = True, aa_level: bool = False, align_tie_beak_suffix: bool = True, out_path: os.PathLike | None = None) polars.DataFrame#

Build a precision-coverage DataFrame from predicted and ground truth PSMs.

Loads or constructs a ground truth DataFrame, tokenizes both predicted and ground truth sequences, parses per-amino-acid scores, and computes precision-coverage metrics. When aa_level is True, sequences are first aligned with gap insertion and then exploded so that each row represents a single amino acid position rather than a full peptide.

Parameters:
  • ground_truth_df (DfPath, optional) – Path to a ground truth DataFrame, or an already-loaded one. If None, both mgf_df and mztab_df must be provided, and the ground truth DataFrame is constructed via get_ground_truth_df().

  • mgf_df (DfPath, optional) – Path to an MGF PSM DataFrame, or an already-loaded one. Required when ground_truth_df is None.

  • mztab_df (DfPath, optional) – Path to an mzTab DataFrame, or an already-loaded one. Required when ground_truth_df is None.

  • residues_path (DfPath, optional) – Path to a residue mass YAML file passed through to tokenize_sequences(). If None, the bundled residues.yaml is used.

  • replace_isoleucine_with_leucine (bool, optional) – If True (default), isoleucine (I) is replaced with leucine (L) during tokenization, treating them as equivalent.

  • aa_level (bool, optional) – If True, perform per-amino-acid alignment via gap insertion and explode the DataFrame so each row corresponds to a single amino acid position. If False (default), metrics are computed at the peptide level using the peptide-level score column.

  • align_tie_beak_suffix (bool, optional) – Passed through to the alignment step when aa_level is True. Controls tie-breaking behavior when the gap and no-gap paths score equally during traceback. Defaults to True.

  • out_path (PathLike, optional) – If provided, the resulting DataFrame is written to this path before being returned. The format is inferred from the file extension.

Returns:

A DataFrame with precision and coverage metrics. At peptide level, each row is one PSM; at amino acid level (aa_level=True), each row is one aligned amino acid position.

Return type:

pl.DataFrame

Raises:

ValueError – If ground_truth_df is None and either mgf_df or mztab_df is also None.

casanovoutils.preccov.graph_prec_cov(*pc_df_paths: os.PathLike, out_path: os.PathLike | None = None) None#

Plot precision-coverage curves from one or more pre-computed DataFrames.

Loads each DataFrame from pc_df_paths, adds it as a series to a GraphPrecCov plot using the file stem as the series name, and then saves the figure, displays it, or both.

Parameters:
  • *pc_df_paths (PathLike) – One or more paths to DataFrames containing Constants.precision_column and Constants.coverage_column columns, as produced by get_prec_cov_df(). The file stem of each path is used as the series label in the legend.

  • out_path (PathLike, optional) – If provided, the figure is saved to this path. The file extension determines the format (e.g. .png, .pdf, .svg).

Return type:

None

Warns:

Logs a warning if the plot cannot be displayed, which typically occurs when no graphical backend is available (e.g. in a headless environment). In that case, saving via out_path still works normally.
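
Since legend labels are derived from file stems, it can help to check in advance what those stems will be. The sketch below uses pathlib for that; the helper name series_names and the example paths are hypothetical.

```python
from pathlib import Path


def series_names(*paths):
    """Return the file stem of each path, i.e. the name without its final suffix."""
    return [Path(p).stem for p in paths]


names = series_names("results/model_a.parquet", "results/model_b.csv")
print(names)  # ['model_a', 'model_b']

# Only the last suffix is removed, so multi-suffix names keep the earlier ones.
print(series_names("results/model_c.aa.parquet"))  # ['model_c.aa']
```

Two inputs with the same stem in different directories would produce identical legend labels, so distinct file names make the resulting plot easier to read.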

casanovoutils.preccov.COMMANDS: casanovoutils.types.Commands#
casanovoutils.preccov.main() None#

CLI entry point.