174 feat add mgf file support for mztab data loader by JemmaLDaniel · Pull Request #178 · instadeepai/winnow

JemmaLDaniel · 2026-04-04T16:55:10Z

Summary

Extends MZTabDatasetLoader to accept .mgf spectrum files (via matchms) alongside the existing .parquet and .ipc formats
Adds an add_index_cols parameter (defaulting to true in mztab.yaml) that generates experiment_name and spectrum_id columns, enabling downstream features that require spectrum_id
Extracts shared MGF parsing helpers (_df_from_matchms, _add_index_cols) from InstaNovoDatasetLoader into module-level functions reused by both loaders

The generated spectrum_id ({file_stem}:{row_index}) aligns with the 0-based index=N extracted from MZTab spectra_ref, so Casanovo and other MZTab-producing engines are correctly mapped to their input spectra without changing the existing join semantics.

…ZTabDatasetLoader

… for MZTabDatasetLoader

github-actions · 2026-04-04T16:56:43Z

Coverage Report

File	Stmts	Miss	Cover	Missing
__init__.py	0	0	100%
data_types.py	4	0	100%
calibration
__init__.py	0	0	100%
calibration_features.py	316	7	97%	247–248, 443, 728, 916, 920, 1215
calibrator.py	91	15	83%	69–70, 72, 106–109, 134–135, 137, 162–163, 167, 194–195
compat
__init__.py	0	0	100%
instanovo.py	10	6	40%	12, 14–15, 17, 24–25
datasets
__init__.py	0	0	100%
calibration_dataset.py	109	17	84%	155, 169, 171, 173, 183, 196, 249, 251–252, 258–261, 263–266
data_loaders.py	276	14	94%	23, 190, 221–222, 415, 456, 869, 873, 922, 933, 1045–1046, 1074–1075
interfaces.py	3	0	100%
psm_dataset.py	25	0	100%
fdr
__init__.py	0	0	100%
base.py	58	15	74%	81, 85–86, 91, 98–99, 105, 126, 129–130, 135, 137–138, 144, 186
database_grounded.py	28	1	96%	52
nonparametric.py	25	4	84%	62, 68–69, 72
scripts
__init__.py	0	0	100%
main.py	185	185	0%	8, 10–13, 16–20, 23–24, 26–28, 32, 39, 44, 47, 53, 55–56, 59, 68, 76, 79, 86, 88–90, 92, 94–99, 102, 104–105, 110, 125, 128, 135–141, 144–145, 148, 161–163, 166, 169, 174, 176–178, 180, 182–183, 186–187, 190, 192–193, 195, 197, 199–200, 202, 205–206, 209–210, 213–214, 217–219, 221, 224, 238–240, 242, 244, 249, 251–253, 255–256, 258–260, 265–266, 268–270, 272, 274, 276–277, 281–284, 286–287, 289–290, 292–293, 295, 298, 312–314, 317, 320, 325, 327–329, 331–333, 335–336, 339–340, 343, 345–346, 348, 350, 352–353, 355, 358–359, 365–366, 369–370, 373–374, 377–378, 386–388, 392, 395, 399, 402, 425, 438–439, 442, 464, 476–477, 480, 505, 518–519, 522, 537, 549–550, 553, 565, 577–578, 581, 596, 608–609
utils
__init__.py	4	0	100%
config_formatter.py	53	40	24%	29, 37–38, 40–42, 44, 55, 58–60, 62–63, 66–69, 72–74, 77–78, 80, 91, 102, 113, 127–128, 130–132, 145–147, 150, 153–154, 157–158, 160
config_path.py	76	5	93%	24–26, 117–118
peptide.py	16	0	100%
TOTAL	1279	309	75%

Tests	Skipped	Failures	Errors	Time
301	0 💤	0 ❌	0 🔥	35.571s ⏱️

BioGeek · 2026-06-08T20:24:10Z

+# to search engine output predictions (e.g. Casanovo, InstaNovo). MGF files always receive
+# these columns regardless of this setting. Set to false if your parquet/ipc files already
+# contain a spectrum_id column.
+add_index_cols: true


I would preserve existing spectrum IDs by default. When users load parquet/ipc spectra through the default data_loader=mztab config and the file already contains a spectrum_id, this new default causes _add_index_cols to overwrite those IDs with {file_stem}:{row}. Downstream prediction/FDR exports key results by spectrum_id, so existing stable identifiers are silently lost unless users discover and override this setting. So keep the default false for parquet/ipc or only fill missing IDs.

Changed to False and made the docs clearer

…-loader

JemmaLDaniel · 2026-06-16T16:36:18Z

I am closing this PR in favour of #211, which now encompasses these changes and MZTab reviewer comments across existing open PRs, and aligns with a similar file structure to #182

JemmaLDaniel added 2 commits April 4, 2026 17:46

feat: support mgf file loading and auto-addition of spectrum ID for M…

fb27d99

…ZTabDatasetLoader

test: add tests for auto-addition of spectrum ID and mgf file loading…

abcdd5b

… for MZTabDatasetLoader

JemmaLDaniel linked an issue Apr 4, 2026 that may be closed by this pull request

feat: Add mgf file support for MZTab data loader #174

Open

JemmaLDaniel self-assigned this Apr 4, 2026

JemmaLDaniel added the enhancement New feature or request label Apr 4, 2026

BioGeek requested changes Jun 8, 2026

View reviewed changes

JemmaLDaniel added 3 commits June 15, 2026 16:19

chore: default to not adding spectrum_id columns

72f017e

docs: update data loader documentation to clarify matching protocol

cb3b7b5

Merge branch 'main' into 174-feat-add-mgf-file-support-for-mztab-data…

ce03259

…-loader

JemmaLDaniel requested a review from BioGeek June 15, 2026 15:30

JemmaLDaniel changed the base branch from main to refactor-data-loading June 16, 2026 16:10

JemmaLDaniel closed this Jun 16, 2026

JemmaLDaniel mentioned this pull request Jun 16, 2026

Refactor data loading #211

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

174 feat add mgf file support for mztab data loader#178

174 feat add mgf file support for mztab data loader#178
JemmaLDaniel wants to merge 5 commits into
refactor-data-loadingfrom
174-feat-add-mgf-file-support-for-mztab-data-loader

JemmaLDaniel commented Apr 4, 2026

Uh oh!

github-actions Bot commented Apr 4, 2026

Uh oh!

BioGeek Jun 8, 2026

Uh oh!

JemmaLDaniel Jun 15, 2026 •

edited

Loading

Uh oh!

JemmaLDaniel commented Jun 16, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

JemmaLDaniel commented Apr 4, 2026

Summary

Uh oh!

github-actions Bot commented Apr 4, 2026

Uh oh!

BioGeek Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

JemmaLDaniel Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

JemmaLDaniel commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

JemmaLDaniel Jun 15, 2026 •

edited

Loading

JemmaLDaniel commented Jun 16, 2026 •

edited

Loading