casanovoutils.denovoutils
=========================

.. py:module:: casanovoutils.denovoutils

.. autoapi-nested-parse::

   Data loading and preprocessing utilities for MGF and mzTab PSM files.

   Provides functions to parse raw instrument files into Polars DataFrames,
   join predicted and ground truth annotations, and tokenize peptide sequences
   for downstream evaluation. Every loader accepts either a file path or an
   already-loaded DataFrame, so the functions can be composed freely without
   redundant I/O.

   The module is also executable as a CLI via ``python -m casanovoutils.denovoutils``
   (or the installed ``casanovoutils`` entry point), exposing ``get_mgf_psms``,
   ``get_mztab``, and ``get_groundtruth`` as subcommands.



Attributes
----------

.. autoapisummary::

   casanovoutils.denovoutils.DfPath
   casanovoutils.denovoutils.COMMANDS


Functions
---------

.. autoapisummary::

   casanovoutils.denovoutils.process_spectrum
   casanovoutils.denovoutils.write_dataframe
   casanovoutils.denovoutils.get_mgf_psms_df
   casanovoutils.denovoutils.tokenize_helper
   casanovoutils.denovoutils.tokenize_sequences
   casanovoutils.denovoutils.read_dataframe
   casanovoutils.denovoutils.get_mztab_df
   casanovoutils.denovoutils.get_ground_truth_df
   casanovoutils.denovoutils.main


Module Contents
---------------

.. py:data:: DfPath

.. py:function:: process_spectrum(spectrum: casanovoutils.types.PyteomicsSpectrum, meta_data_only: bool = True) -> dict[str, dict[str, Any]]

   Extract and augment parameter metadata from a single spectrum.

   Retrieves the ``params`` dict from a Pyteomics spectrum object and
   annotates it with the number of peaks in the spectrum. If ``meta_data_only``
   is ``False``, the intensity and m/z arrays are also included in the output.

   :param spectrum: A spectrum dict as returned by ``pyteomics.mgf.read``, containing
                    at least a ``"params"`` key, an ``"m/z array"`` key, and an
                    ``"intensity array"`` key.
   :type spectrum: PyteomicsSpectrum
   :param meta_data_only: If ``True``, only scalar metadata is returned (no spectral arrays).
                          If ``False``, ``"intensity_array"`` and ``"m_z_array"`` are added
                          to the output dict.
   :type meta_data_only: bool, optional

   :returns: The spectrum's parameter dict with an added ``"n_peaks"`` entry, and
             optionally ``"intensity_array"`` and ``"m_z_array"`` entries.
   :rtype: dict[str, Any]
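
   The behavior described above can be sketched in plain Python. This is an
   illustration only, not the module's implementation: the name
   ``process_spectrum_sketch`` is hypothetical, the arrays are plain lists
   rather than the NumPy arrays Pyteomics returns, and copying ``params``
   instead of annotating it in place is an assumption.

   .. code-block:: python

       from typing import Any


       def process_spectrum_sketch(
           spectrum: dict[str, Any], meta_data_only: bool = True
       ) -> dict[str, Any]:
           """Hypothetical stand-in for the documented behavior."""
           # Copy the params so the caller's spectrum is left untouched (assumption).
           out = dict(spectrum["params"])
           # The peak count is the length of the m/z array.
           out["n_peaks"] = len(spectrum["m/z array"])
           if not meta_data_only:
               # Output keys use underscores, unlike the Pyteomics input keys.
               out["intensity_array"] = list(spectrum["intensity array"])
               out["m_z_array"] = list(spectrum["m/z array"])
           return out


       spectrum = {
           "params": {"title": "scan=1", "charge": [2]},
           "m/z array": [100.1, 200.2, 300.3],
           "intensity array": [10.0, 20.0, 5.0],
       }
       meta = process_spectrum_sketch(spectrum)
       full = process_spectrum_sketch(spectrum, meta_data_only=False)

   Here ``meta`` holds only the scalar parameters plus ``"n_peaks"``, while
   ``full`` additionally carries the two array entries.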


.. py:function:: write_dataframe(data_df: polars.DataFrame, out_path: os.PathLike) -> None

   Write a DataFrame to a file, inferring the format from the extension.

   :param data_df: The DataFrame to write.
   :type data_df: pl.DataFrame
   :param out_path: Destination path. The file format is inferred from the extension:
                    ``.parquet`` / ``.pq`` for Parquet, ``.csv`` for comma-separated,
                    and ``.tsv`` for tab-separated.
   :type out_path: PathLike

   :raises ValueError: If the file extension is not one of the supported types.
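
   The extension-to-format mapping described above can be sketched as a small
   lookup. The helper ``infer_format`` is hypothetical and does no writing;
   treating extensions case-insensitively is an assumption, not something the
   documentation states.

   .. code-block:: python

       from pathlib import Path

       # Supported extensions, as listed for ``out_path``.
       _FORMATS = {
           ".parquet": "parquet",
           ".pq": "parquet",
           ".csv": "csv",
           ".tsv": "tsv",
       }


       def infer_format(out_path) -> str:
           """Return the format name for a path, or raise ValueError."""
           # Lower-casing the suffix is an assumption of this sketch.
           suffix = Path(out_path).suffix.lower()
           try:
               return _FORMATS[suffix]
           except KeyError:
               raise ValueError(f"Unsupported file extension: {suffix!r}") from None

   A dispatcher built on such a lookup would then call the matching Polars
   writer for the returned format name.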


.. py:function:: get_mgf_psms_df(mgf_path: DfPath, out_path: Optional[os.PathLike] = None, meta_data_only: bool = True) -> polars.DataFrame

   Load PSM metadata from an MGF file into a Polars DataFrame.

   If ``mgf_path`` is already a :class:`polars.DataFrame`, it is returned
   as-is (and optionally written to ``out_path``). Otherwise, the MGF file
   is parsed with Pyteomics, per-spectrum parameters are extracted via
   :func:`process_spectrum`, and all columns are prefixed with ``mgf_``.

   :param mgf_path: Path to an MGF file, or an already-loaded :class:`polars.DataFrame`.
   :type mgf_path: DfPath
   :param out_path: If provided, the resulting DataFrame is written to this path before
                    being returned. The format is inferred from the file extension via
                    :func:`write_dataframe`.
   :type out_path: PathLike, optional
   :param meta_data_only: Passed through to :func:`process_spectrum`. If ``True`` (default),
                          only scalar spectrum metadata is loaded (no m/z or intensity arrays).
                          If ``False``, ``mgf_intensity_array`` and ``mgf_m_z_array`` columns
                          are included in the returned DataFrame.
   :type meta_data_only: bool, optional

   :returns: A DataFrame with one row per spectrum and columns prefixed with
             ``mgf_``, including an ``mgf_n_peaks`` column.
   :rtype: pl.DataFrame

   :raises ValueError: Propagated from :func:`write_dataframe` if ``out_path`` has an
       unsupported file extension.


.. py:function:: tokenize_helper(seq: str, tokenizer: depthcharge.tokenizers.PeptideTokenizer, combine_n_term: bool = True) -> list[str]

   Split a peptide sequence into tokens.

   Delegates to ``tokenizer.split`` and, when ``combine_n_term`` is ``True``,
   fuses a leading modification token (e.g. ``"[UNIMOD:x]"``) onto the first
   residue token so that the modification is not a stand-alone element.

   :param seq: A peptide sequence string, optionally containing modification
               annotations recognised by ``tokenizer``.
   :type seq: str
   :param tokenizer: A tokenizer instance used to split the sequence.
   :type tokenizer: depthcharge.tokenizers.PeptideTokenizer
   :param combine_n_term: If ``True`` (default), merge a leading modification token with the
                          first residue token.
   :type combine_n_term: bool, optional

   :returns: Ordered list of token strings representing the peptide.
   :rtype: list[str]
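
   The N-terminal merge can be illustrated on an already-split token list.
   This is a minimal sketch under stated assumptions: ``combine_n_term_sketch``
   is a hypothetical name, the token list is written by hand rather than
   produced by ``tokenizer.split``, and recognising a leading modification by
   its opening ``[`` is a guess at the detection rule.

   .. code-block:: python

       def combine_n_term_sketch(tokens: list[str]) -> list[str]:
           """Fuse a leading modification token onto the first residue token."""
           # Assumption: a first token starting with "[" is an N-terminal modification.
           if len(tokens) > 1 and tokens[0].startswith("["):
               return [tokens[0] + tokens[1], *tokens[2:]]
           # Nothing to merge: return a copy of the tokens unchanged.
           return list(tokens)


       merged = combine_n_term_sketch(["[UNIMOD:1]", "M", "K"])
       unchanged = combine_n_term_sketch(["M", "K"])

   After merging, the token list has one element per residue, so its length
   matches the peptide length.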


.. py:function:: tokenize_sequences(data_df: polars.DataFrame, seq_column: str, out_prefix: Optional[str] = None, combine_n_term: bool = True, residues_path: Optional[os.PathLike] = None, replace_isoleucine_with_leucine: bool = True) -> polars.DataFrame

   Tokenize a peptide sequence column and append token and length columns.

   Loads residue masses via :func:`get_residues`, constructs an
   ``MskbPeptideTokenizer``, and applies :func:`tokenize_helper` to each
   value in ``seq_column``. Two new columns are added to the DataFrame:
   ``{out_prefix}_tokens`` (a list of token strings) and
   ``{out_prefix}_sequence_len`` (the number of tokens).

   :param data_df: Input DataFrame containing the sequence column to tokenize.
   :type data_df: pl.DataFrame
   :param seq_column: Name of the column holding peptide sequence strings.
   :type seq_column: str
   :param out_prefix: Prefix for the output columns. If ``None`` (default), the portion
                      of ``seq_column`` before the first underscore is used.
   :type out_prefix: str, optional
   :param combine_n_term: Passed through to :func:`tokenize_helper`. If ``True`` (default),
                          N-terminal modification tokens are merged with the first residue.
   :type combine_n_term: bool, optional
   :param residues_path: Path to a residue mass YAML file. If ``None`` (default), the
                         bundled ``residues.yaml`` is used.
   :type residues_path: PathLike, optional
   :param replace_isoleucine_with_leucine: If ``True`` (default), isoleucine (``I``) is treated as
                                           leucine (``L``) during tokenization.
   :type replace_isoleucine_with_leucine: bool, optional

   :returns: The input DataFrame with two additional columns:
             ``{out_prefix}_tokens`` and ``{out_prefix}_sequence_len``.
   :rtype: pl.DataFrame
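
   The naming rule for the two output columns can be sketched as follows. The
   helper ``output_columns`` is hypothetical and only derives the names; it
   does not tokenize anything. For a column name without an underscore, this
   sketch falls back to the whole name as the prefix, which is an assumption.

   .. code-block:: python

       from typing import Optional


       def output_columns(
           seq_column: str, out_prefix: Optional[str] = None
       ) -> tuple[str, str]:
           """Return the (tokens, length) column names for a sequence column."""
           if out_prefix is None:
               # Default prefix: the part of the name before the first underscore.
               out_prefix = seq_column.split("_", 1)[0]
           return f"{out_prefix}_tokens", f"{out_prefix}_sequence_len"


       default_names = output_columns("mztab_sequence")
       custom_names = output_columns("mztab_sequence", out_prefix="pred")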


.. py:function:: read_dataframe(df_path: DfPath) -> polars.DataFrame

   Read a DataFrame from a file path, inferring the format from the extension.

   :param df_path: Path to a ``.parquet`` / ``.pq``, ``.csv``, or ``.tsv`` file, or
                   an already-loaded :class:`polars.DataFrame` which is returned as-is.
   :type df_path: DfPath

   :returns: The loaded DataFrame.
   :rtype: pl.DataFrame

   :raises ValueError: If the file extension is not one of the supported types.


.. py:function:: get_mztab_df(mztab_path: DfPath, out_path: Optional[os.PathLike] = None) -> polars.DataFrame

   Load the spectrum match table from an mzTab file into a Polars DataFrame.

   If ``mztab_path`` is already a DataFrame, it is returned as-is.
   Otherwise, the file is parsed with Pyteomics, converted from pandas,
   and given a row index. All columns are prefixed with ``mztab_``.

   :param mztab_path: Path to an mzTab file, or an already-loaded :class:`polars.DataFrame`.
   :type mztab_path: DfPath
   :param out_path: If provided, the resulting DataFrame is written to this path before
                    being returned. The format is inferred from the file extension.
   :type out_path: PathLike, optional

   :returns: A DataFrame with one row per spectrum match and columns prefixed
             with ``mztab_``.
   :rtype: pl.DataFrame


.. py:function:: get_ground_truth_df(mgf_path: DfPath | list[DfPath], mztab_path: DfPath, out_path: Optional[os.PathLike] = None) -> polars.DataFrame

   Join MGF PSM metadata with mzTab spectrum match annotations.

   Loads both sources, aligns them on the MGF spectrum index encoded in
   the mzTab ``spectra_ref`` field, performs a left join, and drops all
   temporary ``tmp_`` columns from the result.

   :param mgf_path: Path to the MGF file, a list of such paths, or an already-loaded
                    :class:`polars.DataFrame`.
   :type mgf_path: DfPath or list of DfPath
   :param mztab_path: Path to the mzTab file, or an already-loaded :class:`polars.DataFrame`.
   :type mztab_path: DfPath
   :param out_path: If provided, the resulting DataFrame is written to this path before
                    being returned. The format is inferred from the file extension.
   :type out_path: PathLike, optional

   :returns: A DataFrame containing all MGF parameter columns (prefixed ``mgf_``)
             left-joined with mzTab annotation columns (prefixed ``mztab_``).
   :rtype: pl.DataFrame
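
   The alignment step can be sketched with plain dictionaries in place of
   DataFrames. Several things here are assumptions rather than documented
   behavior: the helper names are hypothetical, ``spectra_ref`` values are
   taken to look like ``ms_run[1]:index=3``, the MGF row position serves as
   the spectrum index, and only a single MGF file is handled (the ``ms_run``
   number is ignored).

   .. code-block:: python

       import re

       _INDEX_RE = re.compile(r"index=(\d+)")


       def spectrum_index(spectra_ref: str) -> int:
           """Extract the spectrum index from a spectra_ref string."""
           match = _INDEX_RE.search(spectra_ref)
           if match is None:
               raise ValueError(f"No index found in {spectra_ref!r}")
           return int(match.group(1))


       def left_join(mgf_rows: list[dict], mztab_rows: list[dict]) -> list[dict]:
           """Left-join mzTab rows onto MGF rows by spectrum index."""
           by_index = {
               spectrum_index(row["mztab_spectra_ref"]): row for row in mztab_rows
           }
           # Unmatched MGF rows get None for every mzTab column.
           empty = dict.fromkeys(mztab_rows[0], None) if mztab_rows else {}
           return [
               {**mgf_row, **by_index.get(i, empty)}
               for i, mgf_row in enumerate(mgf_rows)
           ]


       mgf_rows = [{"mgf_title": "a"}, {"mgf_title": "b"}]
       mztab_rows = [
           {"mztab_spectra_ref": "ms_run[1]:index=1", "mztab_sequence": "PEPK"}
       ]
       joined = left_join(mgf_rows, mztab_rows)

   Because the MGF rows form the left side, every spectrum is kept even when
   it has no matching mzTab annotation.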


.. py:data:: COMMANDS
   :type: casanovoutils.types.Commands

.. py:function:: main() -> None

   Configure logging and expose data loading functions as a CLI.

   Sets up a stdout logger at INFO level, then delegates to
   :func:`fire.Fire` which maps subcommands to their corresponding
   functions:

   - ``get_mgf_psms``   → :func:`get_mgf_psms_df`
   - ``get_mztab``      → :func:`get_mztab_df`
   - ``get_groundtruth`` → :func:`get_ground_truth_df`

   .. rubric:: Examples

   .. code-block:: bash

       python -m casanovoutils.denovoutils get_mgf_psms path/to/file.mgf --out_path out.parquet
       python -m casanovoutils.denovoutils get_mztab path/to/file.mztab --out_path out.parquet
       python -m casanovoutils.denovoutils get_groundtruth path/to/file.mgf path/to/file.mztab --out_path out.parquet


