casanovoutils.align
===================

.. py:module:: casanovoutils.align

.. autoapi-nested-parse::

   Sequence alignment utilities for inserting gaps into token lists.

   Provides a lightweight alignment algorithm that inserts gap markers into the
   shorter of two token sequences to maximize the number of exact position-wise
   matches with the longer sequence. Intended for aligning predicted and ground
   truth peptide token sequences prior to per-position scoring.

   The alignment is implemented as a DP table filled backwards from the ends of
   both sequences (:func:`get_aligned_dp_array`), followed by a greedy traceback
   (:func:`recover_solution`). A helper (:func:`align_scores`) keeps per-token
   score arrays in sync after gaps are inserted. The top-level entry point
   (:func:`align_tokens_with_gaps`) short-circuits when the two sequences already
   have equal length; otherwise it determines which sequence is shorter and
   passes the pair to the alignment routines in ``(short, long)`` order.


Functions
---------

.. autoapisummary::

   casanovoutils.align.get_aligned_dp_array
   casanovoutils.align.recover_solution
   casanovoutils.align.align_scores
   casanovoutils.align.align_tokens_with_gaps


Module Contents
---------------

.. py:function:: get_aligned_dp_array(short: list[str], long: list[str]) -> numpy.ndarray

   Build a DP scoring table for aligning two token sequences.

   Fills the table backwards from the end of both sequences, scoring
   +1 for a match and 0 for a mismatch or gap. The table is padded to
   shape ``(len(short) + 1, len(long) + 1)`` with sentinel values along
   the borders; the meaningful scores occupy ``dp[:len(short), :len(long)]``.

   :param short: The shorter token sequence, into which gaps will be inserted.
   :type short: list[str]
   :param long: The longer token sequence, which is never modified.
   :type long: list[str]

   :returns: An integer array of shape ``(len(short) + 1, len(long) + 1)`` where
             entry ``[i, j]`` is the best achievable alignment score from position
             ``i`` in ``short`` and ``j`` in ``long`` to the end of both sequences.
   :rtype: numpy.ndarray
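
   The recurrence described above can be illustrated with a minimal sketch. It
   is not the library's implementation: the name ``sketch_dp_table``, the
   ``NEG`` sentinel value, and the use of nested lists in place of a
   ``numpy.ndarray`` are all assumptions made for illustration.

   .. code-block:: python

      NEG = -10**9  # assumed sentinel marking infeasible states


      def sketch_dp_table(short, long):
          """Backwards-filled table: dp[i][j] scores short[i:] against long[j:]."""
          m, n = len(short), len(long)
          # Bottom border row stays 0: once short is exhausted, the rest of
          # long is paired with gaps, which score nothing.
          dp = [[0] * (n + 1) for _ in range(m + 1)]
          # Right border column: tokens of short remain but long is exhausted,
          # so no valid alignment exists from these states.
          for i in range(m):
              dp[i][n] = NEG
          for i in range(m - 1, -1, -1):
              for j in range(n - 1, -1, -1):
                  # Pair short[i] with long[j]: +1 on a match, 0 on a mismatch.
                  no_gap = int(short[i] == long[j]) + dp[i + 1][j + 1]
                  # Pair long[j] with a gap: scores 0 and keeps short[i] pending.
                  gap = dp[i][j + 1]
                  dp[i][j] = max(no_gap, gap)
          return dp


      dp = sketch_dp_table(["A", "C"], ["A", "B", "C"])
      print(dp[0][0])  # 2: both "A" and "C" can be matched

   Under these assumptions, ``dp[0][0]`` holds the best score for the full
   alignment, and the sentinel keeps the maximum from ever choosing a path that
   leaves tokens of ``short`` unplaced.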


.. py:function:: recover_solution(dp: numpy.ndarray, short: list[str], long: list[str], gap: str, tie_break_suffix: bool) -> list[str]

   Use the DP table to reconstruct the gap-inserted version of ``short``.

   At each step, decides whether to emit a gap or the next token from ``short``
   by comparing the score of consuming that token (the no-gap score) against
   the score of inserting a gap at the current position (the gap score). Ties
   are broken according to ``tie_break_suffix``.

   :param dp: A DP table as returned by :func:`get_aligned_dp_array`.
   :type dp: numpy.ndarray
   :param short: The shorter token sequence being aligned.
   :type short: list[str]
   :param long: The longer token sequence that ``short`` is being aligned to.
   :type long: list[str]
   :param gap: The gap marker to insert, e.g. ``"-"``.
   :type gap: str
   :param tie_break_suffix: If ``True``, prefer inserting a gap on a tie (suffix-biased);
                            if ``False``, prefer consuming a real token (prefix-biased).
   :type tie_break_suffix: bool

   :returns: A copy of ``short`` with gap markers inserted, of the same length as
             ``long``.
   :rtype: list[str]
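
   The traceback and its tie-breaking rule can be sketched as follows. This is
   an illustration under stated assumptions, not the library's code:
   ``sketch_table`` and ``sketch_traceback`` are hypothetical names, the table
   uses an assumed ``NEG`` sentinel and nested lists, and the table builder is
   included only so that the block runs by itself.

   .. code-block:: python

      NEG = -10**9  # assumed sentinel marking infeasible states


      def sketch_table(short, long):
          """Compact backwards-filled table (same assumptions as the lead-in)."""
          m, n = len(short), len(long)
          dp = [[0] * (n + 1) for _ in range(m + 1)]
          for i in range(m):
              dp[i][n] = NEG
          for i in range(m - 1, -1, -1):
              for j in range(n - 1, -1, -1):
                  no_gap = int(short[i] == long[j]) + dp[i + 1][j + 1]
                  dp[i][j] = max(no_gap, dp[i][j + 1])
          return dp


      def sketch_traceback(dp, short, long, gap, tie_break_suffix):
          """Walk forward through dp, emitting one output token per token of long."""
          out, i = [], 0
          for j in range(len(long)):
              if i == len(short):
                  out.append(gap)  # short is exhausted: only gaps remain
                  continue
              no_gap = int(short[i] == long[j]) + dp[i + 1][j + 1]
              gap_score = dp[i][j + 1]
              take_gap = gap_score > no_gap or (
                  gap_score == no_gap and tie_break_suffix
              )
              if take_gap:
                  out.append(gap)
              else:
                  out.append(short[i])
                  i += 1
          return out


      short, long = ["A"], ["A", "A"]
      dp = sketch_table(short, long)
      print(sketch_traceback(dp, short, long, "-", True))   # ['-', 'A']
      print(sketch_traceback(dp, short, long, "-", False))  # ['A', '-']

   In this example both placements of ``"A"`` score 1, so the flag alone decides
   the outcome: ``True`` pushes the token toward the suffix, ``False`` keeps it
   at the prefix.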


.. py:function:: align_scores(predicted: list[str], scores: list[float], gap: str) -> list[float]

   Realign a score array to match a gap-inserted token sequence.

   After gaps are inserted into ``predicted``, the original ``scores`` array
   no longer corresponds index-for-index. This function inserts
   ``Constants.min_score`` placeholders at every gap position to restore that
   correspondence.

   :param predicted: The gap-inserted predicted token sequence.
   :type predicted: list[str]
   :param scores: The original scores, parallel to ``predicted`` before gap insertion.
   :type scores: list[float]
   :param gap: The gap marker used in ``predicted``, e.g. ``"-"``.
   :type gap: str

   :returns: A score array of the same length as ``predicted``, with
             ``Constants.min_score`` at every gap position.
   :rtype: list[float]
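
   A minimal sketch of the realignment step, for illustration only. The name
   ``sketch_align_scores`` is hypothetical, and the placeholder is passed in as
   an explicit ``min_score`` argument because the value of
   ``Constants.min_score`` is not stated here; the real function takes no such
   argument.

   .. code-block:: python

      def sketch_align_scores(predicted, scores, gap, min_score):
          """Insert min_score at each gap position; keep the other scores in order."""
          remaining = iter(scores)
          # Gap positions receive the placeholder; every other position
          # consumes the next original score.
          return [min_score if tok == gap else next(remaining) for tok in predicted]


      aligned = sketch_align_scores(["A", "-", "C"], [0.9, 0.8], "-", -1.0)
      print(aligned)  # [0.9, -1.0, 0.8]

   The sketch assumes that no genuine token in ``predicted`` is equal to the gap
   marker; otherwise that token's score would be replaced as well.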


.. py:function:: align_tokens_with_gaps(predicted: list[str], ground_truth: list[str], scores: list[float], gap: str = '-', tie_break_suffix: bool = True) -> tuple[list[str], list[str], list[float]]

   Align two token sequences by inserting gaps to maximize exact matches.

   Gaps are always inserted into the shorter sequence, leaving the longer
   one untouched. Scoring awards +1 for a match and 0 for a mismatch or gap.

   :param predicted: The predicted token sequence.
   :type predicted: list[str]
   :param ground_truth: The ground truth token sequence.
   :type ground_truth: list[str]
   :param scores: Per-token scores parallel to ``predicted``.
   :type scores: list[float]
   :param gap: The gap marker to insert. Defaults to ``"-"``.
   :type gap: str, optional
   :param tie_break_suffix: Passed to :func:`recover_solution` to control tie-breaking behavior.
                            Defaults to ``True``.
   :type tie_break_suffix: bool, optional

   :returns: A three-tuple of ``(aligned_predicted, aligned_ground_truth, aligned_scores)``,
             all of equal length.
   :rtype: tuple[list[str], list[str], list[float]]
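
   The dispatch on sequence length can be sketched as below. Several details are
   assumptions rather than documented behavior: that the equal-length case
   returns the inputs unchanged, that ``scores`` pass through unchanged when
   gaps go into ``ground_truth``, and the ``min_score`` default of ``0.0``. The
   names ``sketch_dispatch`` and ``pad_right`` are hypothetical, and
   ``pad_right`` is a deliberately trivial stand-in for the DP-based gap
   insertion.

   .. code-block:: python

      def sketch_dispatch(predicted, ground_truth, scores, insert_gaps,
                          gap="-", min_score=0.0):
          """Route the shorter sequence to insert_gaps(short, long, gap)."""
          if len(predicted) == len(ground_truth):
              # Assumed short-circuit: nothing to align.
              return list(predicted), list(ground_truth), list(scores)
          if len(predicted) < len(ground_truth):
              aligned_pred = insert_gaps(predicted, ground_truth, gap)
              remaining = iter(scores)
              # Keep scores parallel to the gap-inserted prediction.
              aligned_scores = [
                  min_score if tok == gap else next(remaining)
                  for tok in aligned_pred
              ]
              return aligned_pred, list(ground_truth), aligned_scores
          # ground_truth is shorter: predicted and its scores are left as they are.
          aligned_truth = insert_gaps(ground_truth, predicted, gap)
          return list(predicted), aligned_truth, list(scores)


      def pad_right(short, long, gap):
          """Stand-in aligner for illustration: append gaps at the end."""
          return list(short) + [gap] * (len(long) - len(short))


      print(sketch_dispatch(["A"], ["A", "B"], [0.9], pad_right))
      # (['A', '-'], ['A', 'B'], [0.9, 0.0])
      print(sketch_dispatch(["A", "B"], ["A"], [0.9, 0.8], pad_right))
      # (['A', 'B'], ['A', '-'], [0.9, 0.8])

   In both branches the three returned lists have equal length, which is the
   property that per-position scoring relies on.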