Skip to content

174 feat add mgf file support for mztab data loader#178

Closed
JemmaLDaniel wants to merge 5 commits into
refactor-data-loadingfrom
174-feat-add-mgf-file-support-for-mztab-data-loader
Closed

174 feat add mgf file support for mztab data loader#178
JemmaLDaniel wants to merge 5 commits into
refactor-data-loadingfrom
174-feat-add-mgf-file-support-for-mztab-data-loader

Conversation

@JemmaLDaniel

Copy link
Copy Markdown
Collaborator

Summary

  • Extends MZTabDatasetLoader to accept .mgf spectrum files (via matchms) alongside the existing .parquet and .ipc formats
  • Adds an add_index_cols parameter (defaulting to true in mztab.yaml) that generates experiment_name and spectrum_id columns, enabling downstream features that require spectrum_id
  • Extracts shared MGF parsing helpers (_df_from_matchms, _add_index_cols) from InstaNovoDatasetLoader into module-level functions reused by both loaders

The generated spectrum_id ({file_stem}:{row_index}) aligns with the 0-based index=N extracted from MZTab spectra_ref, so Casanovo and other MZTab-producing engines are correctly mapped to their input spectra without changing the existing join semantics.

@JemmaLDaniel JemmaLDaniel linked an issue Apr 4, 2026 that may be closed by this pull request
@JemmaLDaniel JemmaLDaniel self-assigned this Apr 4, 2026
@JemmaLDaniel JemmaLDaniel added the enhancement New feature or request label Apr 4, 2026
@github-actions

github-actions Bot commented Apr 4, 2026

Copy link
Copy Markdown

Coverage

Coverage Report
FileStmtsMissCoverMissing
__init__.py00100% 
data_types.py40100% 
calibration
   __init__.py00100% 
   calibration_features.py316797%247–248, 443, 728, 916, 920, 1215
   calibrator.py911583%69–70, 72, 106–109, 134–135, 137, 162–163, 167, 194–195
compat
   __init__.py00100% 
   instanovo.py10640%12, 14–15, 17, 24–25
datasets
   __init__.py00100% 
   calibration_dataset.py1091784%155, 169, 171, 173, 183, 196, 249, 251–252, 258–261, 263–266
   data_loaders.py2761494%23, 190, 221–222, 415, 456, 869, 873, 922, 933, 1045–1046, 1074–1075
   interfaces.py30100% 
   psm_dataset.py250100% 
fdr
   __init__.py00100% 
   base.py581574%81, 85–86, 91, 98–99, 105, 126, 129–130, 135, 137–138, 144, 186
   database_grounded.py28196%52
   nonparametric.py25484%62, 68–69, 72
scripts
   __init__.py00100% 
   main.py1851850%8, 10–13, 16–20, 23–24, 26–28, 32, 39, 44, 47, 53, 55–56, 59, 68, 76, 79, 86, 88–90, 92, 94–99, 102, 104–105, 110, 125, 128, 135–141, 144–145, 148, 161–163, 166, 169, 174, 176–178, 180, 182–183, 186–187, 190, 192–193, 195, 197, 199–200, 202, 205–206, 209–210, 213–214, 217–219, 221, 224, 238–240, 242, 244, 249, 251–253, 255–256, 258–260, 265–266, 268–270, 272, 274, 276–277, 281–284, 286–287, 289–290, 292–293, 295, 298, 312–314, 317, 320, 325, 327–329, 331–333, 335–336, 339–340, 343, 345–346, 348, 350, 352–353, 355, 358–359, 365–366, 369–370, 373–374, 377–378, 386–388, 392, 395, 399, 402, 425, 438–439, 442, 464, 476–477, 480, 505, 518–519, 522, 537, 549–550, 553, 565, 577–578, 581, 596, 608–609
utils
   __init__.py40100% 
   config_formatter.py534024%29, 37–38, 40–42, 44, 55, 58–60, 62–63, 66–69, 72–74, 77–78, 80, 91, 102, 113, 127–128, 130–132, 145–147, 150, 153–154, 157–158, 160
   config_path.py76593%24–26, 117–118
   peptide.py160100% 
TOTAL127930975% 

Tests Skipped Failures Errors Time
301 0 💤 0 ❌ 0 🔥 35.571s ⏱️

Comment thread winnow/configs/data_loader/mztab.yaml Outdated
# to search engine output predictions (e.g. Casanovo, InstaNovo). MGF files always receive
# these columns regardless of this setting. Set to false if your parquet/ipc files already
# contain a spectrum_id column.
add_index_cols: true

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would preserve existing spectrum IDs by default. When users load parquet/ipc spectra through the default data_loader=mztab config and the file already contains a spectrum_id, this new default causes _add_index_cols to overwrite those IDs with {file_stem}:{row}. Downstream prediction/FDR exports key results by spectrum_id, so existing stable identifiers are silently lost unless users discover and override this setting. So keep the default false for parquet/ipc or only fill missing IDs.

@JemmaLDaniel JemmaLDaniel Jun 15, 2026

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to False and made the docs clearer

@JemmaLDaniel JemmaLDaniel requested a review from BioGeek June 15, 2026 15:30
@JemmaLDaniel JemmaLDaniel changed the base branch from main to refactor-data-loading June 16, 2026 16:10
@JemmaLDaniel

JemmaLDaniel commented Jun 16, 2026

Copy link
Copy Markdown
Collaborator Author

I am closing this PR in favour of #211, which now encompasses these changes and MZTab reviewer comments across existing open PRs, and aligns with a similar file structure to #182

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: Add mgf file support for MZTab data loader

2 participants