casanovoutils.preccov
Precision-coverage computation for peptide and amino acid evaluation.
Provides an end-to-end pipeline that takes predicted and ground truth PSM DataFrames, aligns token sequences with gap insertion, and computes cumulative precision-coverage (Prec-Cov) curves. Results can be exported as DataFrames or rendered as matplotlib figures.
The main entry points are:
- get_prec_cov_df(): builds a precision-coverage DataFrame at either the peptide or the amino acid level.
- graph_prec_cov(): plots pre-computed precision-coverage DataFrames.
- GraphPrecCov: stateful plot builder, intended for programmatic or CLI use via Fire.
The module is also executable as a CLI via python -m casanovoutils.preccov
(or the installed casanovoutils entry point), exposing get_pc_df and
graph_prec_cov as subcommands.
Attributes
COMMANDS
Classes
GraphPrecCov | Plot and compare peptide-level precision-coverage (Prec-Cov) curves.
Functions
mutate_row_as_dict() | Align predicted and ground truth token sequences within a single row dict.
calc_precision_coverage() | Compute cumulative precision and coverage curves sorted by score.
load_ground_truth_df() | Load or construct a ground truth PSM DataFrame.
fill_null_columns() | Replace null values in score and sequence columns with safe defaults.
tokenize_and_parse_scores() | Tokenize predicted and ground truth sequences and parse per-AA score strings.
align_and_explode() | Align predicted and ground truth token sequences and explode to per-AA rows.
get_prec_cov_df() | Build a precision-coverage DataFrame from predicted and ground truth PSMs.
graph_prec_cov() | Plot precision-coverage curves from one or more pre-computed DataFrames.
main() | CLI entry
Module Contents
- class casanovoutils.preccov.GraphPrecCov
Plot and compare peptide-level precision-coverage (Prec-Cov) curves.
This class accumulates multiple datasets onto a single precision-coverage plot. For each dataset, predicted peptide correctness and scores are extracted via get_ground_truth(), and a precision-coverage curve is computed using prec_cov(). The area under the precision-coverage curve (AUPC) is displayed in the legend.
The class is designed for command-line use with Fire: multiple datasets can be added in a single process before saving or showing the figure.
- Parameters:
fig_width (float, default=4.0) – Width of the matplotlib figure in inches.
fig_height (float, default=4.0) – Height of the matplotlib figure in inches.
fig_dpi (int, default=150) – Figure resolution in dots per inch.
legend_border (bool, default=False) – Whether to draw a border around the legend frame.
legend_location (str, default="lower left") – Legend location string passed to matplotlib.axes.Axes.legend.
ax_x_label (str, default="Coverage") – Label for the x-axis.
ax_y_label (str, default="Precision") – Label for the y-axis.
ax_title (str, default="") – Base title for the plot. “(Amino Acid)” is appended automatically.
Notes
Each call to add_peptides() adds a new curve to the same axes. Use clear() to reset the figure.
All commands operate on the same instance, so state (the accumulated curves) is preserved.
- fig_width: float = 4.0
- fig_height: float = 4.0
- fig_dpi: int = 150
- legend_border: bool = False
- legend_location: str = 'lower left'
- ax_x_label: str = 'Coverage'
- ax_y_label: str = 'Precision'
- ax_title: str = ''
- add_series(pc_df: polars.DataFrame, series_name: str, color: str | None = None, linestyle: str | None = None) -> None
Add a precision-coverage curve for a single dataset to the plot.
Extracts precision and coverage columns from pc_df, computes the area under the precision-coverage curve (AUPC) via the trapezoidal rule, and plots the curve with series_name and the AUPC value in the legend label.
- Parameters:
pc_df (pl.DataFrame) – A DataFrame containing Constants.precision_column and Constants.coverage_column columns, as produced by calc_precision_coverage().
series_name (str) – Display name for this dataset in the plot legend.
color (str, optional) – Line color passed to matplotlib.axes.Axes.plot. If None (default), matplotlib’s automatic color cycling is used.
linestyle (str, optional) – Line style passed to matplotlib.axes.Axes.plot (e.g. "-", "--", ":"). If None (default), matplotlib’s default solid line style is used.
- Return type:
None
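The trapezoidal-rule AUPC mentioned above can be illustrated with a minimal standard-library sketch. The helper name trapezoid_area, the sample values, and the legend label format are assumptions made for illustration; they are not taken from the library.

```python
def trapezoid_area(coverage, precision):
    """Area under a curve by the trapezoidal rule (x = coverage, y = precision)."""
    area = 0.0
    for i in range(1, len(coverage)):
        width = coverage[i] - coverage[i - 1]
        mean_height = (precision[i] + precision[i - 1]) / 2.0
        area += width * mean_height
    return area


# Hypothetical curve: four PSMs, the first two correct.
coverage = [0.25, 0.5, 0.75, 1.0]
precision = [1.0, 1.0, 0.5, 0.5]

aupc = trapezoid_area(coverage, precision)

# Assumed label format; the real legend text may differ.
label = f"my_run (AUPC = {aupc:.3f})"
print(label)
```

Note that the integral only spans the coverage values present in the data, so a curve that starts at a coverage above zero contributes no area to the left of its first point.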
- clear() -> None
Reset the figures and axes to blank precision-coverage plots.
Creates two matplotlib figures with axes: (1) an amino-acid-level precision/coverage plot and (2) a peptide-level precision/coverage plot.
- Return type:
None
- save(save_path: os.PathLike) -> None
Save the current plot to a file.
- Parameters:
save_path (PathLike) – Output file path. The file extension (e.g., .png, .pdf, .svg) determines the format written by matplotlib.
- Return type:
None
- show() -> None
Display the current precision-coverage plot.
- Return type:
None
- casanovoutils.preccov.mutate_row_as_dict(tie_break_suffix: bool, row: dict[str, Any]) -> dict[str, Any]
Align predicted and ground truth token sequences within a single row dict.
Calls align_tokens_with_gaps() on the predicted tokens, ground truth tokens, and per-amino-acid scores from the row, then mutates the row in place with the aligned sequences, aligned scores, and a positional index list.
- Parameters:
tie_break_suffix (bool) – Passed through to align_tokens_with_gaps(). Controls tie-breaking behavior when the gap and no-gap paths score equally during traceback.
row (dict[str, Any]) – A single row represented as a dict, as produced by DataFrame.iter_rows(named=True).
- Returns:
The same row dict with Constants.predicted_tokens, Constants.ground_truth_tokens, Constants.aa_scores_column, and Constants.aa_idx_column replaced by their gap-aligned counterparts.
- Return type:
dict[str, Any]
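To make the idea of gap-aligned sequences concrete, here is a small sketch of a global alignment with gap insertion (a Needleman-Wunsch style dynamic program). It is not the implementation of align_tokens_with_gaps(): the scoring values, the "-" gap symbol, the 0.0 score assigned to inserted gaps, and the fixed preference for the no-gap path on ties are all assumptions.

```python
GAP = "-"  # assumed gap symbol, not necessarily the library's


def align_with_gaps(pred, truth, scores, match=1, mismatch=-1, gap=-1):
    """Globally align two token lists; carry per-token scores along."""
    n, m = len(pred), len(truth)
    # dp[i][j] = best score for aligning pred[:i] with truth[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = match if pred[i - 1] == truth[j - 1] else mismatch
            dp[i][j] = max(
                dp[i - 1][j - 1] + step,  # no gap
                dp[i - 1][j] + gap,       # gap in truth
                dp[i][j - 1] + gap,       # gap in pred
            )

    a_pred, a_truth, a_scores = [], [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if pred[i - 1] == truth[j - 1] else mismatch
            if dp[i][j] == dp[i - 1][j - 1] + step:
                # On ties this sketch always prefers the no-gap path.
                a_pred.append(pred[i - 1])
                a_truth.append(truth[j - 1])
                a_scores.append(scores[i - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and dp[i][j] == dp[i - 1][j] + gap:
            a_pred.append(pred[i - 1])
            a_truth.append(GAP)
            a_scores.append(scores[i - 1])
            i -= 1
        else:
            a_pred.append(GAP)
            a_truth.append(truth[j - 1])
            a_scores.append(0.0)  # assumed score for an inserted gap
            j -= 1

    a_pred.reverse()
    a_truth.reverse()
    a_scores.reverse()
    return {
        "pred": a_pred,
        "truth": a_truth,
        "aa_scores": a_scores,
        "aa_idx": list(range(len(a_pred))),
    }


aligned = align_with_gaps(["P", "E", "T"], ["P", "E", "P", "T"], [0.9, 0.8, 0.7])
print(aligned)
```

After alignment all four lists have the same length, which is what allows them to be exploded position by position later on.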
- casanovoutils.preccov.calc_precision_coverage(pc_df: polars.DataFrame, score_col: str) -> polars.DataFrame
Compute cumulative precision and coverage curves sorted by score.
Sorts the DataFrame by score_col in descending order, computes a boolean correctness column indicating where the predicted sequence matches the ground truth, then calculates cumulative precision and coverage at each rank threshold.
- Parameters:
pc_df (pl.DataFrame) – Input DataFrame containing predicted and ground truth sequence columns and a score column.
score_col (str) – Name of the column to sort by. Typically either the peptide-level score column or the per-amino-acid score column depending on whether evaluation is at peptide or amino acid level.
- Returns:
The input DataFrame sorted by score_col with three additional columns: "pc_is_correct" (bool), "pc_precision" (float), and "pc_coverage" (float).
- Return type:
pl.DataFrame
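The cumulative computation can be sketched without polars, using plain dicts as rows. This assumes the common definitions: precision at rank k is the number of correct predictions among the top k divided by k, and coverage at rank k is k divided by the total number of rows. The library's exact formulas are not shown in this page, and the "pred", "truth", and "score" keys are hypothetical.

```python
def cumulative_precision_coverage(rows, score_key="score"):
    """Sort rows by score (descending) and attach cumulative precision/coverage."""
    ranked = sorted(rows, key=lambda r: r[score_key], reverse=True)
    total = len(ranked)
    n_correct = 0
    out = []
    for rank, row in enumerate(ranked, start=1):
        is_correct = row["pred"] == row["truth"]
        n_correct += int(is_correct)
        out.append(
            {
                **row,
                "pc_is_correct": is_correct,
                "pc_precision": n_correct / rank,  # correct among top `rank`
                "pc_coverage": rank / total,       # fraction of rows kept
            }
        )
    return out


rows = [
    {"pred": "PEPTIDE", "truth": "PEPTIDE", "score": 0.9},
    {"pred": "PEPTLDE", "truth": "PEPTIDE", "score": 0.5},
    {"pred": "ACDK", "truth": "ACDK", "score": 0.7},
]
curve = cumulative_precision_coverage(rows)
for point in curve:
    print(point["score"], point["pc_precision"], point["pc_coverage"])
```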
- casanovoutils.preccov.load_ground_truth_df(ground_truth_df: casanovoutils.denovoutils.DfPath | None, mgf_df: casanovoutils.denovoutils.DfPath | None, mztab_df: casanovoutils.denovoutils.DfPath | None) -> polars.DataFrame
Load or construct a ground truth PSM DataFrame.
If ground_truth_df is provided, it is loaded via read_dataframe(). Otherwise, the ground truth is constructed from the provided MGF and mzTab files via get_ground_truth_df().
- Parameters:
ground_truth_df (DfPath, optional) – Path to, or an already-loaded, ground truth DataFrame.
mgf_df (DfPath, optional) – Path to, or an already-loaded, MGF PSM DataFrame. Required when ground_truth_df is None.
mztab_df (DfPath, optional) – Path to, or an already-loaded, mzTab DataFrame. Required when ground_truth_df is None.
- Returns:
The loaded or constructed ground truth DataFrame.
- Return type:
pl.DataFrame
- Raises:
ValueError – If ground_truth_df is None and either mgf_df or mztab_df is also None.
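The argument contract described above (either a ground truth input, or both an MGF and an mzTab input) can be sketched as follows. The function resolve_ground_truth and its return values are hypothetical stand-ins; the real function loads or builds a DataFrame rather than returning a tag.

```python
def resolve_ground_truth(ground_truth=None, mgf=None, mztab=None):
    """Decide which inputs to use, mirroring the documented ValueError rule."""
    if ground_truth is not None:
        # A ground truth input wins; MGF/mzTab are not needed.
        return ("loaded", ground_truth)
    if mgf is None or mztab is None:
        raise ValueError(
            "Provide ground_truth, or both mgf and mztab to construct it."
        )
    return ("constructed", (mgf, mztab))


# Hypothetical file names, used only to show the three cases.
print(resolve_ground_truth(ground_truth="truth.parquet"))
print(resolve_ground_truth(mgf="spectra.mgf", mztab="results.mztab"))
try:
    resolve_ground_truth(mgf="spectra.mgf")
except ValueError as err:
    print("rejected:", err)
```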
- casanovoutils.preccov.fill_null_columns(df: polars.DataFrame, pred_col: str) -> polars.DataFrame
Replace null values in score and sequence columns with safe defaults.
Fills nulls in the predicted sequence column and the per-amino-acid scores column with empty strings, and nulls in the peptide score column with -1.0.
- Parameters:
df (pl.DataFrame) – Input DataFrame containing the columns to fill.
pred_col (str) – Name of the predicted sequence column.
- Returns:
The DataFrame with null values replaced.
- Return type:
pl.DataFrame
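As a rough illustration of these defaults, the sketch below applies the same fill rules to rows held as plain dicts. The key names "pred", "aa_scores", and "score" are hypothetical; the library resolves its real column names through Constants.

```python
def fill_nulls(rows, pred_key="pred", aa_scores_key="aa_scores", score_key="score"):
    """Return copies of rows with None replaced by the documented defaults."""
    defaults = {pred_key: "", aa_scores_key: "", score_key: -1.0}
    filled = []
    for row in rows:
        new_row = dict(row)  # leave the caller's rows untouched
        for key, default in defaults.items():
            if new_row.get(key) is None:
                new_row[key] = default
        filled.append(new_row)
    return filled


rows = [
    {"pred": "PEPTIDE", "aa_scores": "0.9,0.8", "score": 0.85},
    {"pred": None, "aa_scores": None, "score": None},  # spectrum with no prediction
]
print(fill_nulls(rows))
```

A score of -1.0 places unpredicted spectra at the bottom of a descending sort, assuming real scores are never lower than that.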
- casanovoutils.preccov.tokenize_and_parse_scores(df: polars.DataFrame, pred_col: str, residues_path: os.PathLike | None, replace_isoleucine_with_leucine: bool) -> polars.DataFrame
Tokenize predicted and ground truth sequences and parse per-AA score strings.
Applies tokenize_sequences() to both the ground truth and predicted sequence columns, then parses the comma-separated per-amino-acid score strings in the aa scores column into lists of floats.
- Parameters:
df (pl.DataFrame) – Input DataFrame containing sequence and score columns.
pred_col (str) – Name of the predicted sequence column.
residues_path (PathLike, optional) – Path to a residue mass YAML file. If None, the bundled residues.yaml is used.
replace_isoleucine_with_leucine (bool) – If True, isoleucine (I) is replaced with leucine (L) during tokenization, treating them as equivalent.
- Returns:
The DataFrame with added token columns and the aa scores column converted from comma-separated strings to lists of floats.
- Return type:
pl.DataFrame
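Two of the steps above are simple enough to sketch directly: parsing a comma-separated score string and treating I and L as equivalent. Both helpers are hypothetical. Real tokenization depends on the residue vocabulary and is not reproduced here, so the example tokenizes an unmodified peptide one character at a time.

```python
def parse_aa_scores(score_string):
    """Turn "0.9,0.8" into [0.9, 0.8]; an empty string gives an empty list."""
    return [float(part) for part in score_string.split(",") if part.strip()]


def merge_isoleucine(tokens):
    """Replace bare isoleucine tokens with leucine; other tokens pass through."""
    return ["L" if token == "I" else token for token in tokens]


# Naive per-character tokenization, valid only for unmodified sequences.
tokens = merge_isoleucine(list("PEPTIDE"))
scores = parse_aa_scores("0.9,0.8,0.75")
print(tokens, scores)
```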
- casanovoutils.preccov.align_and_explode(df: polars.DataFrame, tie_break_suffix: bool) -> polars.DataFrame
Align predicted and ground truth token sequences and explode to per-AA rows.
Iterates over each row, aligns the predicted and ground truth token sequences with gap insertion via mutate_row_as_dict(), then explodes the resulting list columns so that each row corresponds to a single amino acid position.
- Parameters:
df (pl.DataFrame) – Input DataFrame with tokenized predicted and ground truth sequence columns and parsed per-amino-acid scores.
tie_break_suffix (bool) – Passed through to mutate_row_as_dict(). Controls tie-breaking behavior when the gap and no-gap paths score equally during traceback.
- Returns:
A DataFrame exploded to one row per aligned amino acid position, with gap characters inserted where sequences do not align.
- Return type:
pl.DataFrame
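The explode step can be pictured with plain dicts: every list-valued field of an aligned row is split so that each output row carries one position. The helper explode_row and the key names are assumptions for illustration; the library performs this on a polars DataFrame.

```python
def explode_row(row, list_keys):
    """Split one row with equal-length list fields into one row per position."""
    lengths = {len(row[key]) for key in list_keys}
    if len(lengths) != 1:
        raise ValueError("all exploded fields must have the same length")
    n_positions = lengths.pop()
    return [
        {**row, **{key: row[key][i] for key in list_keys}}
        for i in range(n_positions)
    ]


# An already gap-aligned row (hypothetical keys and values).
aligned_row = {
    "spectrum_id": 7,
    "pred": ["P", "E", "-", "T"],
    "truth": ["P", "E", "P", "T"],
    "aa_scores": [0.9, 0.8, 0.0, 0.7],
    "aa_idx": [0, 1, 2, 3],
}
per_aa = explode_row(aligned_row, ["pred", "truth", "aa_scores", "aa_idx"])
for item in per_aa:
    print(item)
```

Scalar fields such as the spectrum identifier are repeated on every exploded row, which keeps each amino acid traceable to its PSM.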
- casanovoutils.preccov.get_prec_cov_df(ground_truth_df: casanovoutils.denovoutils.DfPath | None = None, mgf_df: casanovoutils.denovoutils.DfPath | None = None, mztab_df: casanovoutils.denovoutils.DfPath | None = None, residues_path: casanovoutils.denovoutils.DfPath | None = None, replace_isoleucine_with_leucine: bool = True, aa_level: bool = False, align_tie_beak_suffix: bool = True, out_path: os.PathLike | None = None) -> polars.DataFrame
Build a precision-coverage DataFrame from predicted and ground truth PSMs.
Loads or constructs a ground truth DataFrame, tokenizes both predicted and ground truth sequences, parses per-amino-acid scores, and computes precision-coverage metrics. When aa_level is True, sequences are first aligned with gap insertion and then exploded so that each row represents a single amino acid position rather than a full peptide.
- Parameters:
ground_truth_df (DfPath, optional) – Path to, or an already-loaded, ground truth DataFrame. If None, both mgf_df and mztab_df must be provided and the ground truth DataFrame will be constructed via get_ground_truth_df().
mgf_df (DfPath, optional) – Path to, or an already-loaded, MGF PSM DataFrame. Required when ground_truth_df is None.
mztab_df (DfPath, optional) – Path to, or an already-loaded, mzTab DataFrame. Required when ground_truth_df is None.
residues_path (DfPath, optional) – Path to a residue mass YAML file passed through to tokenize_sequences(). If None, the bundled residues.yaml is used.
replace_isoleucine_with_leucine (bool, optional) – If True (default), isoleucine (I) is replaced with leucine (L) during tokenization, treating them as equivalent.
aa_level (bool, optional) – If True, perform per-amino-acid alignment via gap insertion and explode the DataFrame so each row corresponds to a single amino acid position. If False (default), metrics are computed at the peptide level using the peptide-level score column.
align_tie_beak_suffix (bool, optional) – Passed through to the alignment step when aa_level is True. Controls tie-breaking behavior when the gap and no-gap paths score equally during traceback. Defaults to True.
out_path (PathLike, optional) – If provided, the resulting DataFrame is written to this path before being returned. The format is inferred from the file extension.
- Returns:
A DataFrame with precision and coverage metrics. At peptide level, each row is one PSM; at amino acid level (aa_level=True), each row is one aligned amino acid position.
- Return type:
pl.DataFrame
- Raises:
ValueError – If ground_truth_df is None and either mgf_df or mztab_df is also None.
- casanovoutils.preccov.graph_prec_cov(*pc_df_paths: os.PathLike, out_path: os.PathLike | None = None) -> None
Plot precision-coverage curves from one or more pre-computed DataFrames.
Loads each DataFrame from pc_df_paths, adds it as a series to a GraphPrecCov plot using the file stem as the series name, and then either saves the figure, displays it, or both.
- Parameters:
*pc_df_paths (PathLike) – One or more paths to DataFrames containing Constants.precision_column and Constants.coverage_column columns, as produced by get_prec_cov_df(). The file stem of each path is used as the series label in the legend.
out_path (PathLike, optional) – If provided, the figure is saved to this path. The file extension determines the format (e.g. .png, .pdf, .svg).
- Return type:
None
- Warns:
Logs a warning if the plot cannot be displayed, which typically occurs when no graphical backend is available (e.g. in a headless environment). In that case, saving via out_path still works normally.
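Because the legend label comes from each path's file stem, the file name chosen when saving a precision-coverage DataFrame determines how the curve is labeled. A short sketch with pathlib shows what the stem looks like; the paths and the helper name series_labels are hypothetical.

```python
from pathlib import Path


def series_labels(*paths):
    """Return the file stem of each path (name without its final extension)."""
    return [Path(path).stem for path in paths]


labels = series_labels("results/casanovo_v4.parquet", "results/baseline.csv")
print(labels)
```

Two files with the same stem in different directories would yield identical labels under this scheme, so distinct file names are worth choosing up front.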
- casanovoutils.preccov.COMMANDS: casanovoutils.types.Commands
- casanovoutils.preccov.main() -> None
CLI entry point.