casanovoutils.preccov
=====================

.. py:module:: casanovoutils.preccov

.. autoapi-nested-parse::

   Precision-coverage computation for peptide and amino acid evaluation.

   Provides an end-to-end pipeline that takes predicted and ground truth PSM
   DataFrames, aligns token sequences with gap insertion, and computes cumulative
   precision-coverage (Prec-Cov) curves. Results can be exported as DataFrames
   or rendered as matplotlib figures.

   The main entry points are:

   - :func:`get_prec_cov_df` — builds a precision-coverage DataFrame at either
     peptide or amino acid level.
   - :func:`graph_prec_cov` — plots pre-computed precision-coverage DataFrames.
   - :class:`GraphPrecCov` — stateful plot builder, intended for programmatic or
     CLI use via Fire.

   The module is also executable as a CLI via ``python -m casanovoutils.preccov``
   (or the installed ``casanovoutils`` entry point), exposing ``get_pc_df`` and
   ``graph_prec_cov`` as subcommands.


Attributes
----------

.. autoapisummary::

   casanovoutils.preccov.COMMANDS


Classes
-------

.. autoapisummary::

   casanovoutils.preccov.GraphPrecCov


Functions
---------

.. autoapisummary::

   casanovoutils.preccov.mutate_row_as_dict
   casanovoutils.preccov.calc_precision_coverage
   casanovoutils.preccov.load_ground_truth_df
   casanovoutils.preccov.fill_null_columns
   casanovoutils.preccov.tokenize_and_parse_scores
   casanovoutils.preccov.align_and_explode
   casanovoutils.preccov.get_prec_cov_df
   casanovoutils.preccov.graph_prec_cov
   casanovoutils.preccov.main


Module Contents
---------------

.. py:class:: GraphPrecCov

   Plot and compare peptide-level precision-coverage (Prec-Cov) curves.

   This class accumulates multiple datasets onto a single precision-coverage
   plot. Each dataset is supplied to ``add_series()`` as a pre-computed
   precision-coverage DataFrame (see :func:`calc_precision_coverage`), and the
   area under its precision-coverage curve (AUPC) is displayed in the legend.

   Designed for command-line use with Fire: multiple datasets can be added in
   a single process before saving or showing the figure.

   :param fig_width: Width of the matplotlib figure in inches.
   :type fig_width: float, default=4.0
   :param fig_height: Height of the matplotlib figure in inches.
   :type fig_height: float, default=4.0
   :param fig_dpi: Figure resolution in dots per inch.
   :type fig_dpi: int, default=150
   :param legend_border: Whether to draw a border around the legend frame.
   :type legend_border: bool, default=False
   :param legend_location: Legend location string passed to ``matplotlib.axes.Axes.legend``.
   :type legend_location: str, default="lower left"
   :param ax_x_label: Label for the x-axis.
   :type ax_x_label: str, default="Coverage"
   :param ax_y_label: Label for the y-axis.
   :type ax_y_label: str, default="Precision"
   :param ax_title: Base title for the plot. "(Amino Acid)" is appended automatically.
   :type ax_title: str, default=""

   .. rubric:: Notes

   Each call to ``add_series()`` adds a new curve to the same axes. Use
   ``clear()`` to reset the figure.

   All commands operate on the same instance, so state (the accumulated
   curves) is preserved.


   .. py:attribute:: fig_width
      :type:  float
      :value: 4.0


   .. py:attribute:: fig_height
      :type:  float
      :value: 4.0


   .. py:attribute:: fig_dpi
      :type:  int
      :value: 150


   .. py:attribute:: legend_border
      :type:  bool
      :value: False


   .. py:attribute:: legend_location
      :type:  str
      :value: 'lower left'


   .. py:attribute:: ax_x_label
      :type:  str
      :value: 'Coverage'


   .. py:attribute:: ax_y_label
      :type:  str
      :value: 'Precision'


   .. py:attribute:: ax_title
      :type:  str
      :value: ''


   .. py:method:: add_series(pc_df: polars.DataFrame, series_name: str, color: Optional[str] = None, linestyle: Optional[str] = None) -> None

      Add a precision-coverage curve for a single dataset to the plot.

      Extracts precision and coverage columns from ``pc_df``, computes the
      area under the precision-coverage curve (AUPC) via the trapezoidal
      rule, and plots the curve with ``series_name`` and the AUPC value
      in the legend label.

      :param pc_df: A DataFrame containing ``Constants.precision_column`` and
                    ``Constants.coverage_column`` columns, as produced by
                    :func:`calc_precision_coverage`.
      :type pc_df: pl.DataFrame
      :param series_name: Display name for this dataset in the plot legend.
      :type series_name: str
      :param color: Line color passed to ``matplotlib.axes.Axes.plot``. If ``None``
                    (default), matplotlib's automatic color cycling is used.
      :type color: str, optional
      :param linestyle: Line style passed to ``matplotlib.axes.Axes.plot`` (e.g. ``"-"``,
                        ``"--"``, ``":"``). If ``None`` (default), matplotlib's default
                        solid line style is used.
      :type linestyle: str, optional

      :rtype: None
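
      The AUPC computation described above can be sketched in plain Python.
      This is a minimal illustration of the trapezoidal rule, not the method's
      actual implementation; the function name ``trapezoid_aupc`` and the sample
      values are hypothetical.

      .. code-block:: python

         def trapezoid_aupc(coverage, precision):
             """Integrate precision over coverage with the trapezoidal rule."""
             area = 0.0
             for i in range(1, len(coverage)):
                 # Width of the interval on the coverage (x) axis.
                 width = coverage[i] - coverage[i - 1]
                 # Mean of the two precision (y) values bounding the interval.
                 mean_height = (precision[i] + precision[i - 1]) / 2.0
                 area += width * mean_height
             return area

         # Hypothetical curve with four rank thresholds.
         coverage = [0.25, 0.5, 0.75, 1.0]
         precision = [1.0, 1.0, 0.5, 0.5]
         aupc = trapezoid_aupc(coverage, precision)

      Note that the area is only accumulated between the first and last
      coverage values, so a curve that starts above zero coverage contributes
      nothing for the interval before its first point.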


   .. py:method:: clear() -> None

      Reset the figures and axes to blank precision-coverage plots.

      Creates two matplotlib figures, each with its own axes:

      1. an amino-acid-level precision-coverage plot
      2. a peptide-level precision-coverage plot

      :rtype: None


   .. py:method:: save(save_path: os.PathLike) -> None

      Save the current plot to a file.

      :param save_path: Output file path. The file extension (e.g. ``.png``, ``.pdf``,
                        ``.svg``) determines the format written by matplotlib.
      :type save_path: PathLike

      :rtype: None


   .. py:method:: show() -> None

      Display the current precision-coverage plot.

      :rtype: None


.. py:function:: mutate_row_as_dict(tie_break_suffix: bool, row: dict[str, Any]) -> dict[str, Any]

   Align predicted and ground truth token sequences within a single row dict.

   Calls :func:`align_tokens_with_gaps` on the predicted tokens, ground truth
   tokens, and per-amino-acid scores from the row, then mutates the row in
   place with the aligned sequences, aligned scores, and a positional index
   list.

   :param tie_break_suffix: Passed through to :func:`align_tokens_with_gaps`. Controls tie-breaking
                            behavior when the gap and no-gap paths score equally during traceback.
   :type tie_break_suffix: bool
   :param row: A single row represented as a dict, as produced by
               ``DataFrame.iter_rows(named=True)``.
   :type row: dict[str, Any]

   :returns: The same row dict with ``Constants.predicted_tokens``,
             ``Constants.ground_truth_tokens``,
             ``Constants.aa_scores_column``, and ``Constants.aa_idx_column``
             replaced by their gap-aligned counterparts.
   :rtype: dict[str, Any]


.. py:function:: calc_precision_coverage(pc_df: polars.DataFrame, score_col: str) -> polars.DataFrame

   Compute cumulative precision and coverage curves sorted by score.

   Sorts the DataFrame by ``score_col`` in descending order, computes a
   boolean correctness column indicating where the predicted sequence matches
   the ground truth, then calculates cumulative precision and coverage at
   each rank threshold.

   :param pc_df: Input DataFrame containing predicted and ground truth sequence columns
                 and a score column.
   :type pc_df: pl.DataFrame
   :param score_col: Name of the column to sort by. Typically either the peptide-level
                     score column or the per-amino-acid score column depending on whether
                     evaluation is at peptide or amino acid level.
   :type score_col: str

   :returns: The input DataFrame sorted by ``score_col`` with three additional
             columns: ``"pc_is_correct"`` (bool), ``"pc_precision"`` (float),
             and ``"pc_coverage"`` (float).
   :rtype: pl.DataFrame
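
   The ranking procedure above can be illustrated without polars. The sketch
   below assumes that precision at rank *k* is the fraction of the top *k*
   rows that are correct and that coverage is *k* divided by the total number
   of rows; the source does not state these formulas explicitly. The helper
   name ``precision_coverage`` and the input keys ``predicted``,
   ``ground_truth``, and ``score`` are hypothetical.

   .. code-block:: python

      def precision_coverage(rows, score_key="score"):
          """Return rows sorted by score with cumulative precision and coverage."""
          # Highest-scoring rows come first.
          ranked = sorted(rows, key=lambda r: r[score_key], reverse=True)
          total = len(ranked)
          n_correct = 0
          out = []
          for rank, row in enumerate(ranked, start=1):
              is_correct = row["predicted"] == row["ground_truth"]
              n_correct += is_correct
              out.append({
                  **row,
                  "pc_is_correct": is_correct,
                  "pc_precision": n_correct / rank,  # assumed definition
                  "pc_coverage": rank / total,       # assumed definition
              })
          return out

      # Hypothetical peptide-level rows.
      rows = [
          {"predicted": "PEPTIDE", "ground_truth": "PEPTIDE", "score": 0.9},
          {"predicted": "PEPTLDE", "ground_truth": "PEPTIDE", "score": 0.4},
          {"predicted": "ACDK", "ground_truth": "ACDK", "score": 0.7},
      ]
      curve = precision_coverage(rows)

   Under these assumptions, the final row of ``curve`` always has a coverage
   of 1.0, and its precision equals the overall fraction of correct rows.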


.. py:function:: load_ground_truth_df(ground_truth_df: Optional[casanovoutils.denovoutils.DfPath], mgf_df: Optional[casanovoutils.denovoutils.DfPath], mztab_df: Optional[casanovoutils.denovoutils.DfPath]) -> polars.DataFrame

   Load or construct a ground truth PSM DataFrame.

   If ``ground_truth_df`` is provided, it is loaded via :func:`read_dataframe`.
   Otherwise, the ground truth is constructed from the provided MGF and mzTab
   files via :func:`get_ground_truth_df`.

   :param ground_truth_df: Path to or an already-loaded ground truth DataFrame.
   :type ground_truth_df: DfPath, optional
   :param mgf_df: Path to or an already-loaded MGF PSM DataFrame. Required when
                  ``ground_truth_df`` is ``None``.
   :type mgf_df: DfPath, optional
   :param mztab_df: Path to or an already-loaded mzTab DataFrame. Required when
                    ``ground_truth_df`` is ``None``.
   :type mztab_df: DfPath, optional

   :returns: The loaded or constructed ground truth DataFrame.
   :rtype: pl.DataFrame

   :raises ValueError: If ``ground_truth_df`` is ``None`` and either ``mgf_df`` or
       ``mztab_df`` is also ``None``.
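
   The argument rules above amount to a small decision procedure. The sketch
   below only mirrors the documented precedence and error condition; the
   helper ``check_ground_truth_args`` and its return strings are hypothetical
   and no files are read.

   .. code-block:: python

      def check_ground_truth_args(ground_truth_df=None, mgf_df=None, mztab_df=None):
          """Return which source would be used, or raise if inputs are insufficient."""
          # A supplied ground truth DataFrame takes precedence.
          if ground_truth_df is not None:
              return "ground_truth_df"
          # Otherwise both the MGF and mzTab inputs are required.
          if mgf_df is None or mztab_df is None:
              raise ValueError("Provide ground_truth_df, or both mgf_df and mztab_df.")
          return "mgf_df + mztab_df"

      # Hypothetical file names, used only as placeholders.
      source_a = check_ground_truth_args(ground_truth_df="truth.parquet")
      source_b = check_ground_truth_args(mgf_df="spectra.mgf", mztab_df="results.mztab")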


.. py:function:: fill_null_columns(df: polars.DataFrame, pred_col: str) -> polars.DataFrame

   Replace null values in score and sequence columns with safe defaults.

   Fills nulls in the predicted sequence column and the per-amino-acid scores
   column with empty strings, and nulls in the peptide score column with
   ``-1.0``.

   :param df: Input DataFrame containing the columns to fill.
   :type df: pl.DataFrame
   :param pred_col: Name of the predicted sequence column.
   :type pred_col: str

   :returns: The DataFrame with null values replaced.
   :rtype: pl.DataFrame
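
   The defaults described above can be shown on plain Python rows. This is an
   illustrative sketch rather than the function's polars implementation; the
   helper ``fill_nulls`` and the column names ``pred``, ``aa_scores``, and
   ``score`` are hypothetical stand-ins for the module's constants.

   .. code-block:: python

      def fill_nulls(rows, pred_col, aa_scores_col="aa_scores", score_col="score"):
          """Return copies of the rows with None replaced by safe defaults."""
          defaults = {pred_col: "", aa_scores_col: "", score_col: -1.0}
          filled = []
          for row in rows:
              new_row = dict(row)  # copy so the caller's rows are not mutated
              for column, default in defaults.items():
                  if new_row.get(column) is None:
                      new_row[column] = default
              filled.append(new_row)
          return filled

      rows = [
          {"pred": "PK", "aa_scores": "0.9,0.8", "score": 0.85},
          {"pred": None, "aa_scores": None, "score": None},  # no prediction made
      ]
      filled = fill_nulls(rows, pred_col="pred")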


.. py:function:: tokenize_and_parse_scores(df: polars.DataFrame, pred_col: str, residues_path: Optional[os.PathLike], replace_isoleucine_with_leucine: bool) -> polars.DataFrame

   Tokenize predicted and ground truth sequences and parse per-AA score strings.

   Applies :func:`tokenize_sequences` to both the ground truth and predicted
   sequence columns, then parses the comma-separated per-amino-acid score
   strings in the aa scores column into lists of floats.

   :param df: Input DataFrame containing sequence and score columns.
   :type df: pl.DataFrame
   :param pred_col: Name of the predicted sequence column.
   :type pred_col: str
   :param residues_path: Path to a residue mass YAML file. If ``None``, the bundled
                         ``residues.yaml`` is used.
   :type residues_path: PathLike, optional
   :param replace_isoleucine_with_leucine: If ``True``, isoleucine (I) is replaced with leucine (L) during
                                           tokenization, treating them as equivalent.
   :type replace_isoleucine_with_leucine: bool

   :returns: The DataFrame with added token columns and the aa scores column
             converted from comma-separated strings to lists of floats.
   :rtype: pl.DataFrame
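
   The score parsing and the isoleucine substitution can be sketched as
   follows. Both helpers are hypothetical and operate on plain Python values;
   treating an empty score string as an empty list is an assumption, and real
   tokenization is handled by :func:`tokenize_sequences`, which is not
   reproduced here.

   .. code-block:: python

      def parse_aa_scores(score_string):
          """Parse a comma-separated score string into a list of floats."""
          # Assumption: an empty string (e.g. a filled null) yields no scores.
          if not score_string:
              return []
          return [float(part) for part in score_string.split(",")]

      def merge_isoleucine(tokens, replace_isoleucine_with_leucine=True):
          """Optionally map the token "I" to "L" so the two compare as equal."""
          if not replace_isoleucine_with_leucine:
              return list(tokens)
          return ["L" if token == "I" else token for token in tokens]

      scores = parse_aa_scores("0.9,0.8,0.75")
      tokens = merge_isoleucine(["P", "I", "K"])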


.. py:function:: align_and_explode(df: polars.DataFrame, tie_break_suffix: bool) -> polars.DataFrame

   Align predicted and ground truth token sequences and explode to per-AA rows.

   Iterates over each row, aligns the predicted and ground truth token
   sequences with gap insertion via :func:`mutate_row_as_dict`, then explodes
   the resulting list columns so that each row corresponds to a single amino
   acid position.

   :param df: Input DataFrame with tokenized predicted and ground truth sequence
              columns and parsed per-amino-acid scores.
   :type df: pl.DataFrame
   :param tie_break_suffix: Passed through to :func:`mutate_row_as_dict`. Controls tie-breaking
                            behavior when the gap and no-gap paths score equally during traceback.
   :type tie_break_suffix: bool

   :returns: A DataFrame exploded to one row per aligned amino acid position,
             with gap characters inserted where sequences do not align.
   :rtype: pl.DataFrame
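
   The explode step can be pictured on a single, already aligned row. The
   sketch below is not the polars implementation; the helper ``explode_row``,
   the column names, the gap token ``"-"``, and the score of ``0.0`` at the
   gap position are all assumptions made for illustration.

   .. code-block:: python

      def explode_row(row, list_cols):
          """Turn one row with equal-length list columns into one row per position."""
          lengths = {len(row[col]) for col in list_cols}
          if len(lengths) != 1:
              raise ValueError("list columns must have equal length")
          n_positions = lengths.pop()
          return [
              # Scalar columns are repeated; list columns contribute one element.
              {**row, **{col: row[col][i] for col in list_cols}}
              for i in range(n_positions)
          ]

      # Hypothetical aligned row: the prediction is missing the third residue.
      aligned = {
          "spectrum_id": 7,
          "pred_tokens": ["P", "E", "-", "K"],
          "true_tokens": ["P", "E", "T", "K"],
          "aa_scores": [0.9, 0.8, 0.0, 0.7],
          "aa_idx": [0, 1, 2, 3],
      }
      per_aa = explode_row(
          aligned, ["pred_tokens", "true_tokens", "aa_scores", "aa_idx"]
      )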


.. py:function:: get_prec_cov_df(ground_truth_df: Optional[casanovoutils.denovoutils.DfPath] = None, mgf_df: Optional[casanovoutils.denovoutils.DfPath] = None, mztab_df: Optional[casanovoutils.denovoutils.DfPath] = None, residues_path: Optional[casanovoutils.denovoutils.DfPath] = None, replace_isoleucine_with_leucine: bool = True, aa_level: bool = False, align_tie_beak_suffix: bool = True, out_path: Optional[os.PathLike] = None) -> polars.DataFrame

   Build a precision-coverage DataFrame from predicted and ground truth PSMs.

   Loads or constructs a ground truth DataFrame, tokenizes both predicted and
   ground truth sequences, parses per-amino-acid scores, and computes
   precision-coverage metrics. When ``aa_level`` is ``True``, sequences are
   first aligned with gap insertion and then exploded so that each row
   represents a single amino acid position rather than a full peptide.

   :param ground_truth_df: Path to or an already-loaded ground truth DataFrame. If ``None``,
                           both ``mgf_df`` and ``mztab_df`` must be provided and the ground
                           truth DataFrame will be constructed via :func:`get_ground_truth_df`.
   :type ground_truth_df: DfPath, optional
   :param mgf_df: Path to or an already-loaded MGF PSM DataFrame. Required when
                  ``ground_truth_df`` is ``None``.
   :type mgf_df: DfPath, optional
   :param mztab_df: Path to or an already-loaded mzTab DataFrame. Required when
                    ``ground_truth_df`` is ``None``.
   :type mztab_df: DfPath, optional
   :param residues_path: Path to a residue mass YAML file passed through to
                         :func:`tokenize_sequences`. If ``None``, the bundled
                         ``residues.yaml`` is used.
   :type residues_path: DfPath, optional
   :param replace_isoleucine_with_leucine: If ``True`` (default), isoleucine (I) is replaced with leucine (L)
                                           during tokenization, treating them as equivalent.
   :type replace_isoleucine_with_leucine: bool, optional
   :param aa_level: If ``True``, perform per-amino-acid alignment via gap insertion and
                    explode the DataFrame so each row corresponds to a single amino acid
                    position. If ``False`` (default), metrics are computed at the peptide
                    level using the peptide-level score column.
   :type aa_level: bool, optional
   :param align_tie_beak_suffix: Passed through to the alignment step when ``aa_level`` is ``True``.
                                 Controls tie-breaking behavior when the gap and no-gap paths score
                                 equally during traceback. Defaults to ``True``.
   :type align_tie_beak_suffix: bool, optional
   :param out_path: If provided, the resulting DataFrame is written to this path before
                    being returned. The format is inferred from the file extension.
   :type out_path: PathLike, optional

   :returns: A DataFrame with precision and coverage metrics. At peptide level,
             each row is one PSM; at amino acid level (``aa_level=True``), each
             row is one aligned amino acid position.
   :rtype: pl.DataFrame

   :raises ValueError: If ``ground_truth_df`` is ``None`` and either ``mgf_df`` or
       ``mztab_df`` is also ``None``.


.. py:function:: graph_prec_cov(*pc_df_paths: os.PathLike, out_path: Optional[os.PathLike] = None) -> None

   Plot precision-coverage curves from one or more pre-computed DataFrames.

   Loads each DataFrame from ``pc_df_paths``, adds it as a series to a
   :class:`GraphPrecCov` plot using the file stem as the series name, and
   then either saves the figure, displays it, or both.

   :param \*pc_df_paths: One or more paths to DataFrames containing
                         ``Constants.precision_column`` and ``Constants.coverage_column``
                         columns, as produced by :func:`get_prec_cov_df`. The file stem of each
                         path is used as the series label in the legend.
   :type \*pc_df_paths: PathLike
   :param out_path: If provided, the figure is saved to this path. The file extension
                    determines the format (e.g. ``.png``, ``.pdf``, ``.svg``).
   :type out_path: PathLike, optional

   :rtype: None

   :Warns: Logs a warning if the plot cannot be displayed, which typically occurs
           when no graphical backend is available (e.g. in a headless environment).
           In that case, saving via ``out_path`` still works normally.
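
   The series label derived from each path is its file stem, that is, the
   file name without its directory or final extension. A minimal sketch using
   the standard library follows; the helper ``series_names`` and the example
   paths are hypothetical.

   .. code-block:: python

      from pathlib import Path

      def series_names(*pc_df_paths):
          """Return the file stem of each path, as used for legend labels."""
          return [Path(path).stem for path in pc_df_paths]

      names = series_names("results/casanovo_aa.parquet", "results/baseline.csv")

   Two files that share a stem but live in different directories would
   receive the same label under this scheme, so distinct file names are
   advisable when comparing several runs.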


.. py:data:: COMMANDS
   :type:  casanovoutils.types.Commands

.. py:function:: main() -> None

   CLI entry