align

Sequence alignment utilities for inserting gaps into token lists.

Provides a lightweight alignment algorithm that inserts gap markers into the shorter of two token sequences to maximize the number of exact position-wise matches with the longer sequence. Intended for aligning predicted and ground truth peptide token sequences prior to per-position scoring.

The alignment is implemented as a DP table filled backwards from the ends of both sequences (get_aligned_dp_array()), followed by a greedy traceback (recover_solution()). A helper, align_scores(), keeps per-token score arrays in sync after gaps are inserted. The top-level entry point, align_tokens_with_gaps(), short-circuits when the two sequences already have equal length and otherwise dispatches with the argument order determined by which sequence is shorter.

casanovoutils.align.align_scores(predicted: list[str], scores: list[float], gap: str) → list[float]

Realign a score array to match a gap-inserted token sequence.

After gaps are inserted into predicted, the original scores array no longer corresponds to it index-for-index. This function inserts a Constants.min_score placeholder at every gap position to restore that correspondence.

Parameters:
  • predicted (list[str]) – The gap-inserted predicted token sequence.

  • scores (list[float]) – The original scores, parallel to predicted before gap insertion.

  • gap (str) – The gap marker used in predicted, e.g. "-".

Returns:

A score array of the same length as predicted, with Constants.min_score at every gap position.

Return type:

list[float]
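The realignment described above can be sketched as follows. This is a minimal illustration, not the library's code: the function name realign_scores_sketch and the min_score parameter are hypothetical, and the default of -1.0 merely stands in for Constants.min_score, whose actual value is not stated here.

```python
def realign_scores_sketch(
    predicted: list[str],
    scores: list[float],
    gap: str,
    min_score: float = -1.0,  # assumed stand-in for Constants.min_score
) -> list[float]:
    """Return one score per position of the gap-inserted sequence."""
    remaining = iter(scores)  # original scores, consumed in order
    # A gap position gets the placeholder; any other token takes the next score.
    return [min_score if token == gap else next(remaining) for token in predicted]


aligned = realign_scores_sketch(["P", "-", "E"], [0.9, 0.8], gap="-")
print(aligned)  # [0.9, -1.0, 0.8]
```

The sketch assumes the gap marker never occurs as a genuine token in predicted; if it did, the scores would drift out of step.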

casanovoutils.align.align_tokens_with_gaps(predicted: list[str], ground_truth: list[str], scores: list[float], gap: str = '-', tie_break_suffix: bool = True) → tuple[list[str], list[str], list[float]]

Align two token sequences by inserting gaps to maximise exact matches. Gaps are always inserted into the shorter sequence, leaving the longer one untouched. Scoring awards +1 for a match, 0 for a mismatch or gap.

Parameters:
  • predicted (list[str]) – The predicted token sequence.

  • ground_truth (list[str]) – The ground truth token sequence.

  • scores (list[float]) – Per-token scores parallel to predicted.

  • gap (str, optional) – The gap marker to insert. Defaults to "-".

  • tie_break_suffix (bool, optional) – Passed to recover_solution() to control tie-breaking behaviour. Defaults to True.

Returns:

A three-tuple of (aligned_predicted, aligned_ground_truth, aligned_scores), all of equal length.

Return type:

tuple[list[str], list[str], list[float]]
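The dispatch on sequence length can be illustrated with a short sketch. It is written under assumptions rather than taken from the library: align_with_gaps_sketch, pad_right, insert_gaps and min_score are hypothetical names, pad_right is a trivial stand-in for the DP-based gap insertion, and leaving scores unchanged when ground_truth is the shorter sequence is inferred from the fact that predicted then receives no gaps.

```python
Tokens = list[str]


def pad_right(short: Tokens, long: Tokens, gap: str) -> Tokens:
    """Stand-in aligner: append gaps until short is as long as long."""
    return short + [gap] * (len(long) - len(short))


def align_with_gaps_sketch(predicted, ground_truth, scores, gap="-",
                           min_score=-1.0, insert_gaps=pad_right):
    # Equal lengths: nothing to insert, return copies unchanged.
    if len(predicted) == len(ground_truth):
        return list(predicted), list(ground_truth), list(scores)
    # predicted is shorter: it receives the gaps, so its scores need placeholders.
    if len(predicted) < len(ground_truth):
        aligned_pred = insert_gaps(predicted, ground_truth, gap)
        remaining = iter(scores)
        aligned_scores = [min_score if tok == gap else next(remaining)
                          for tok in aligned_pred]
        return aligned_pred, list(ground_truth), aligned_scores
    # ground_truth is shorter: predicted and its scores stay as they are.
    return list(predicted), insert_gaps(ground_truth, predicted, gap), list(scores)


print(align_with_gaps_sketch(["P", "E"], ["P", "E", "K"], [0.9, 0.8]))
# (['P', 'E', '-'], ['P', 'E', 'K'], [0.9, 0.8, -1.0])
```

Passing the aligner in as a callable keeps the sketch focused on the three length cases; a real aligner would place the gaps to maximise matches instead of padding at the end.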

casanovoutils.align.get_aligned_dp_array(short: list[str], long: list[str]) → ndarray

Build a DP scoring table for aligning two token sequences.

Fills the table backwards from the end of both sequences, scoring +1 for a match and 0 for a mismatch or gap. The table is padded to shape (len(short) + 1, len(long) + 1) with sentinel values along the borders; the meaningful scores occupy dp[:len(short), :len(long)].

Parameters:
  • short (list[str]) – The shorter token sequence, into which gaps will be inserted.

  • long (list[str]) – The longer token sequence, which is never modified.

Returns:

An integer array of shape (len(short) + 1, len(long) + 1) where entry [i, j] is the best achievable alignment score from position i in short and position j in long to the end of both sequences.

Return type:

np.ndarray
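A backwards-filled table of this kind might look like the sketch below. It is an illustration under stated assumptions, not the library's implementation: dp_table_sketch is a hypothetical name, plain nested lists replace the NumPy array to keep the example dependency-free, and the border values (0 in the last row, -1 for states where the rest of short no longer fits into the rest of long) are choices made for this sketch.

```python
def dp_table_sketch(short: list[str], long: list[str]) -> list[list[int]]:
    """dp[i][j] = best match count when aligning short[i:] against long[j:]."""
    m, n = len(short), len(long)
    if m > n:
        raise ValueError("short must not be longer than long")
    infeasible = -1  # assumed sentinel: short[i:] cannot fit into long[j:]
    dp = [[infeasible] * (n + 1) for _ in range(m + 1)]
    dp[m] = [0] * (n + 1)  # short exhausted: the rest of long faces only gaps
    for i in range(m - 1, -1, -1):
        # j may be at most n - (m - i), otherwise short[i:] would not fit.
        for j in range(n - (m - i), -1, -1):
            take = int(short[i] == long[j]) + dp[i + 1][j + 1]  # pair short[i] with long[j]
            skip = dp[i][j + 1]                                 # put a gap against long[j]
            dp[i][j] = max(take, skip)
    return dp


table = dp_table_sketch(["A", "C"], ["A", "B", "C"])
print(table[0][0])  # 2
```

Because every feasible state can always consume a token for a score of at least 0, the -1 sentinel never wins the max, so infeasible gap moves are excluded without an explicit bounds check.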

casanovoutils.align.recover_solution(dp: ndarray, short: list[str], long: list[str], gap: str, tie_break_suffix: bool) → list[str]

Reconstruct the gap-inserted version of short from the DP table.

At each step, the traceback decides whether to emit a gap or the next token from short by comparing the score of consuming the token (no gap) against the score of inserting a gap. Ties are broken according to tie_break_suffix.

Parameters:
  • dp (np.ndarray) – A DP table as returned by get_aligned_dp_array().

  • short (list[str]) – The shorter token sequence being aligned.

  • long (list[str]) – The longer token sequence that short is being aligned to.

  • gap (str) – The gap marker to insert, e.g. "-".

  • tie_break_suffix (bool) – If True, prefer inserting a gap on a tie (suffix-biased); if False, prefer consuming a real token (prefix-biased).

Returns:

A copy of short with gap markers inserted, of the same length as long.

Return type:

list[str]
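The traceback and its tie-breaking rule can be sketched as follows. The sketch makes its own assumptions and is not the library's code: traceback_sketch is a hypothetical name, and the table is a hand-written nested list in which dp[i][j] is the best score for short[i:] against long[j:], the last row is all zeros, and -1 marks states where the rest of short no longer fits.

```python
def traceback_sketch(dp, short, long, gap="-", tie_break_suffix=True):
    """Walk long left to right, emitting a gap or the next token of short."""
    out = []
    i = 0  # index of the next unconsumed token in short
    for j in range(len(long)):
        if i == len(short):  # short is used up: only gaps remain
            out.append(gap)
            continue
        take = int(short[i] == long[j]) + dp[i + 1][j + 1]  # no-gap score
        skip = dp[i][j + 1]                                 # gap score
        if skip > take or (skip == take and tie_break_suffix):
            out.append(gap)
        else:
            out.append(short[i])
            i += 1
    return out


# Hand-written table for short=["A"], long=["A", "A"] (layout assumed as above).
dp = [[1, 1, -1],
      [0, 0, 0]]
print(traceback_sketch(dp, ["A"], ["A", "A"], tie_break_suffix=True))   # ['-', 'A']
print(traceback_sketch(dp, ["A"], ["A", "A"], tie_break_suffix=False))  # ['A', '-']
```

Both results score one match; the flag only decides whether the token ends up at the suffix (gap first) or at the prefix (token first) of the aligned sequence.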