align
Sequence alignment utilities for inserting gaps into token lists.
Provides a lightweight alignment algorithm that inserts gap markers into the shorter of two token sequences to maximize the number of exact position-wise matches with the longer sequence. Intended for aligning predicted and ground-truth peptide token sequences prior to per-position scoring.
The alignment is implemented as a backwards-filled DP table
(get_aligned_dp_array()) with a greedy traceback
(recover_solution()). A helper (align_scores()) keeps per-token
score arrays in sync after gaps are inserted. The top-level entry point
(align_tokens_with_gaps()) handles length-equality short-circuits and
dispatches to the correct argument order depending on which sequence is shorter.
- casanovoutils.align.align_scores(predicted: list[str], scores: list[float], gap: str) → list[float]
Realign a score array to match a gap-inserted token sequence.
After gaps are inserted into predicted, the original scores array no longer corresponds index-for-index. This function inserts min_score placeholders at every gap position to restore that correspondence.
- Parameters:
predicted (list[str]) – The gap-inserted predicted token sequence.
scores (list[float]) – The original scores, parallel to predicted before gap insertion.
gap (str) – The gap marker used in predicted, e.g. "-".
- Returns:
A score array of the same length as predicted, with Constants.min_score at every gap position.
- Return type:
list[float]
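The realignment described above can be sketched in a few lines. This is an illustration only, not the library's implementation: the function name realign_scores_sketch and the constant MIN_SCORE are hypothetical, and the value -1.0 is an assumed stand-in because this page does not state the value of Constants.min_score.

```python
# Assumed stand-in for Constants.min_score; the real value is not given here.
MIN_SCORE = -1.0


def realign_scores_sketch(predicted, scores, gap, min_score=MIN_SCORE):
    """Return scores realigned to a gap-inserted token list (hypothetical helper)."""
    remaining = iter(scores)  # original scores, consumed left to right
    # Emit a placeholder at each gap, otherwise the next original score.
    return [min_score if token == gap else next(remaining) for token in predicted]


aligned = realign_scores_sketch(["P", "-", "E", "K"], [0.9, 0.8, 0.7], "-")
# aligned now has one entry per token in the gap-inserted sequence.
```

Consuming the scores through an iterator keeps their original order while skipping over gap positions, which is the index-for-index correspondence the function is documented to restore.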
- casanovoutils.align.align_tokens_with_gaps(predicted: list[str], ground_truth: list[str], scores: list[float], gap: str = '-', tie_break_suffix: bool = True) → tuple[list[str], list[str], list[float]]
Align two token sequences by inserting gaps to maximise exact matches. Gaps are always inserted into the shorter sequence, leaving the longer one untouched. Scoring awards +1 for a match, 0 for a mismatch or gap.
- Parameters:
predicted (list[str]) – The predicted token sequence.
ground_truth (list[str]) – The ground truth token sequence.
scores (list[float]) – Per-token scores parallel to predicted.
gap (str, optional) – The gap marker to insert. Defaults to "-".
tie_break_suffix (bool, optional) – Passed to recover_solution() to control tie-breaking behaviour. Defaults to True.
- Returns:
A three-tuple of (aligned_predicted, aligned_ground_truth, aligned_scores), all of equal length.
- Return type:
tuple[list[str], list[str], list[float]]
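The dispatch logic can be sketched as follows. This is a hedged illustration, not the library's code: dispatch_sketch, pad_right and MIN_SCORE are hypothetical names, pad_right is a trivial placeholder that appends gaps at the end instead of running the DP alignment, and -1.0 is an assumed stand-in for Constants.min_score. The sketch also assumes that scores are returned unchanged when the gaps go into the ground-truth sequence, since scores are parallel to predicted.

```python
MIN_SCORE = -1.0  # assumed stand-in for Constants.min_score


def pad_right(short, long, gap):
    """Placeholder aligner: appends gaps at the end (NOT the DP alignment)."""
    return list(short) + [gap] * (len(long) - len(short))


def dispatch_sketch(predicted, ground_truth, scores, gap="-", insert_gaps=pad_right):
    """Hypothetical outline of the length checks and argument ordering."""
    if len(predicted) == len(ground_truth):
        # Equal lengths: nothing to insert, return copies unchanged.
        return list(predicted), list(ground_truth), list(scores)
    if len(predicted) < len(ground_truth):
        # Gaps go into predicted, so the scores must be realigned too.
        aligned_pred = insert_gaps(predicted, ground_truth, gap)
        remaining = iter(scores)
        aligned_scores = [
            MIN_SCORE if token == gap else next(remaining) for token in aligned_pred
        ]
        return aligned_pred, list(ground_truth), aligned_scores
    # Gaps go into ground_truth; predicted and its scores stay untouched.
    aligned_truth = insert_gaps(ground_truth, predicted, gap)
    return list(predicted), aligned_truth, list(scores)


shorter_pred = dispatch_sketch(["P", "E"], ["P", "E", "K"], [0.9, 0.8])
longer_pred = dispatch_sketch(["P", "E", "K"], ["P", "K"], [0.9, 0.8, 0.7])
```

In both branches the three returned lists have the same length, matching the documented return value.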
- casanovoutils.align.get_aligned_dp_array(short: list[str], long: list[str]) → ndarray
Build a DP scoring table for aligning two token sequences.
Fills the table backwards from the end of both sequences, scoring +1 for a match and 0 for a mismatch or gap. The table is padded to shape (len(short) + 1, len(long) + 1) with sentinel values along the borders; the meaningful scores occupy dp[:len(short), :len(long)].
- Parameters:
short (list[str]) – The shorter token sequence, into which gaps will be inserted.
long (list[str]) – The longer token sequence, which is never modified.
- Returns:
An integer array of shape (len(short) + 1, len(long) + 1) where entry [i, j] is the best achievable alignment score from position i in short and position j in long to the end of both sequences.
- Return type:
np.ndarray
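A minimal sketch of a backwards-filled table with this shape and scoring is shown below. It is an assumption-laden illustration rather than the library's implementation: dp_table_sketch is a hypothetical name, and the sentinel values chosen here (0 on the bottom border, a large negative number on the right border) are one plausible choice, since this page does not say which sentinels the real function uses.

```python
import numpy as np


def dp_table_sketch(short, long):
    """Hypothetical backwards-filled DP table: +1 match, 0 mismatch or gap."""
    m, n = len(short), len(long)
    # Assumed sentinel: negative enough that an infeasible state never wins.
    infeasible = -(m + n + 1)
    dp = np.zeros((m + 1, n + 1), dtype=int)
    # Tokens of `short` remain but `long` is exhausted: no valid alignment.
    dp[:m, n] = infeasible
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            # Option 1: place short[i] opposite long[j].
            use_token = int(short[i] == long[j]) + dp[i + 1, j + 1]
            # Option 2: place a gap opposite long[j].
            use_gap = dp[i, j + 1]
            dp[i, j] = max(use_token, use_gap)
    return dp


table = dp_table_sketch(["A", "C"], ["A", "B", "C"])
best_score = int(table[0, 0])  # best number of exact matches overall
```

Under these assumptions, table[0, 0] holds the maximum number of exact matches achievable over the full sequences, and the negative right border keeps the fill from ever choosing a path that leaves tokens of short unplaced.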
- casanovoutils.align.recover_solution(dp: ndarray, short: list[str], long: list[str], gap: str, tie_break_suffix: bool) → list[str]
Uses the DP table to reconstruct the gap-inserted version of short.
At each step, decides whether to emit a gap or the next token from short by comparing the no-gap score against the gap score. Ties are broken according to tie_break_suffix.
- Parameters:
dp (np.ndarray) – A DP table as returned by get_aligned_dp_array().
short (list[str]) – The shorter token sequence being aligned.
long (list[str]) – The longer token sequence that short is being aligned to.
gap (str) – The gap marker to insert, e.g. "-".
tie_break_suffix (bool) – If True, prefer inserting a gap on a tie (suffix-biased); if False, prefer consuming a real token (prefix-biased).
- Returns:
A copy of short with gap markers inserted, of the same length as long.
- Return type:
list[str]
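The greedy traceback and its tie-breaking can be sketched as below. This is an illustration under stated assumptions, not the library's code: traceback_sketch and build_table are hypothetical names, and build_table uses assumed sentinel values (0 on the bottom border, a large negative number on the right border) so that the example is self-contained.

```python
import numpy as np


def build_table(short, long):
    """Hypothetical DP table (+1 match, 0 otherwise), filled backwards."""
    m, n = len(short), len(long)
    dp = np.zeros((m + 1, n + 1), dtype=int)
    dp[:m, n] = -(m + n + 1)  # assumed sentinel for infeasible states
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            dp[i, j] = max(int(short[i] == long[j]) + dp[i + 1, j + 1], dp[i, j + 1])
    return dp


def traceback_sketch(dp, short, long, gap, tie_break_suffix):
    """Walk the table from (0, 0), emitting one output token per position in long."""
    out, i = [], 0
    for j in range(len(long)):
        if i == len(short):
            out.append(gap)  # short is used up: only gaps remain
            continue
        no_gap = int(short[i] == long[j]) + dp[i + 1, j + 1]
        with_gap = dp[i, j + 1]
        # On a tie, tie_break_suffix=True emits a gap, False consumes a token.
        if with_gap > no_gap or (with_gap == no_gap and tie_break_suffix):
            out.append(gap)
        else:
            out.append(short[i])
            i += 1
    return out


short, long = ["A"], ["A", "A"]
dp = build_table(short, long)
suffix_biased = traceback_sketch(dp, short, long, "-", True)
prefix_biased = traceback_sketch(dp, short, long, "-", False)
```

With a single "A" aligned against two, both placements score one match, so the flag alone decides the result: the suffix-biased run pushes the token to the end, while the prefix-biased run keeps it at the start.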