Fix fragment match double counting by JemmaLDaniel · Pull Request #183 · instadeepai/winnow

JemmaLDaniel · 2026-04-10T14:02:46Z

Fix fragment ion double-counting in peak matching

Summary

Fixes a bug in find_matching_ions where an observed peak that was already matched as part of one theoretical ion's isotopic envelope could be re-matched as the M0 peak of a different theoretical ion. This inflated both ion_matches and ion_match_intensity, producing overly optimistic match rates.

The fix tracks matched observed peak indices in a set and skips any peak that has already been assigned, ensuring each observed peak contributes to at most one theoretical ion match.

Changes

winnow/calibration/features/utils.py — find_matching_ions now maintains a matched_indices set of already-assigned observed peak indices. Before matching an M0 peak, the function checks whether that index has already been claimed (either as a prior M0 or as part of a prior isotopic envelope). Isotopic envelope peaks are also added to the set when matched.
tests/calibration/features/test_utils.py — adds targeted test cases that construct spectra where the same observed peak could be matched by multiple theoretical ions, verifying that only the first match is counted.
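
For readers without the diff open, here is a minimal sketch of the idea described above. It is not the code in winnow/calibration/features/utils.py: the function names, the singly charged isotope spacing, and the rule that isotope peaks are only searched after an M0 match are all assumptions made for illustration.

```python
# Approximate 13C - 12C mass difference in Da (assumes singly charged ions).
ISOTOPE_SPACING = 1.00335


def nearest_unclaimed(target_mz, observed_mz, tolerance, claimed):
    """Return the index of the closest unclaimed peak within tolerance, or None."""
    best_idx, best_err = None, None
    for idx, mz in enumerate(observed_mz):
        if idx in claimed:
            continue  # this peak already belongs to another theoretical ion
        err = abs(mz - target_mz)
        if err <= tolerance and (best_err is None or err < best_err):
            best_idx, best_err = idx, err
    return best_idx


def match_ions(theoretical_mz, observed_mz, tolerance, n_isotopes=1):
    """Hypothetical stand-in for find_matching_ions: one claim per observed peak."""
    matched_indices = set()
    matches = {}
    for ion_idx, ion_mz in enumerate(theoretical_mz):
        m0 = nearest_unclaimed(ion_mz, observed_mz, tolerance, matched_indices)
        if m0 is None:
            continue
        matched_indices.add(m0)
        matches[ion_idx] = [m0]
        # Isotopic envelope peaks are claimed as well, so a later ion
        # cannot reuse them as its own M0 peak.
        for k in range(1, n_isotopes + 1):
            iso = nearest_unclaimed(
                ion_mz + k * ISOTOPE_SPACING, observed_mz, tolerance, matched_indices
            )
            if iso is None:
                break
            matched_indices.add(iso)
            matches[ion_idx].append(iso)
    return matches


# The second theoretical ion sits on the first ion's M+1 peak.
result = match_ions([200.0, 201.003], [200.001, 201.004], tolerance=0.02)
```

In this toy spectrum the peak at 201.004 is claimed as the M+1 peak of the first ion, so the second ion gets no match and `result` contains a single entry.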

github-actions · 2026-04-10T14:04:21Z

Coverage Report

File	Stmts	Miss	Cover	Missing
__init__.py	0	0	100%
data_types.py	4	0	100%
calibration
__init__.py	0	0	100%
calibration_features.py	5	0	100%
calibrator.py	91	15	83%	69–70, 72, 106–109, 134–135, 137, 162–163, 167, 194–195
calibration/features
__init__.py	10	0	100%
base.py	8	0	100%
beam.py	47	0	100%
chimeric.py	77	1	98%	198
constants.py	1	0	100%
fragment_match.py	73	1	98%	190
mass_error.py	28	2	92%	71, 75
retention_time.py	77	1	98%	160
sequence.py	19	0	100%
token_score.py	37	1	97%	82
utils.py	130	1	99%	215
compat
__init__.py	0	0	100%
instanovo.py	10	6	40%	12, 14–15, 17, 24–25
datasets
__init__.py	0	0	100%
calibration_dataset.py	109	17	84%	155, 169, 171, 173, 183, 196, 249, 251–252, 258–261, 263–266
data_loaders.py	270	14	94%	23, 189, 220–221, 414, 455, 847, 851, 900, 911, 1023–1024, 1052–1053
interfaces.py	3	0	100%
psm_dataset.py	25	0	100%
fdr
__init__.py	0	0	100%
base.py	58	15	74%	81, 85–86, 91, 98–99, 105, 126, 129–130, 135, 137–138, 144, 186
database_grounded.py	28	1	96%	52
nonparametric.py	25	4	84%	62, 68–69, 72
scripts
__init__.py	0	0	100%
main.py	185	185	0%	8, 10–13, 16–20, 23–24, 26–28, 32, 39, 44, 47, 53, 55–56, 59, 68, 76, 79, 86, 88–90, 92, 94–99, 102, 104–105, 110, 125, 128, 135–141, 144–145, 148, 161–163, 166, 169, 174, 176–178, 180, 182–183, 186–187, 190, 192–193, 195, 197, 199–200, 202, 205–206, 209–210, 213–214, 217–219, 221, 224, 238–240, 242, 244, 249, 251–253, 255–256, 258–260, 265–266, 268–270, 272, 274, 276–277, 281–284, 286–287, 289–290, 292–293, 295, 298, 312–314, 317, 320, 325, 327–329, 331–333, 335–336, 339–340, 343, 345–346, 348, 350, 352–353, 355, 358–359, 365–366, 369–370, 373–374, 377–378, 386–388, 392, 395, 399, 402, 425, 438–439, 442, 464, 476–477, 480, 505, 518–519, 522, 537, 549–550, 553, 565, 577–578, 581, 596, 608–609
utils
__init__.py	4	0	100%
config_formatter.py	53	40	24%	29, 37–38, 40–42, 44, 55, 58–60, 62–63, 66–69, 72–74, 77–78, 80, 91, 102, 113, 127–128, 130–132, 145–147, 150, 153–154, 157–158, 160
config_path.py	76	5	93%	24–26, 117–118
peptide.py	16	0	100%
TOTAL	1469	309	78%

Tests	Skipped	Failures	Errors	Time
320	0 💤	0 ❌	0 🔥	35.884s ⏱️

BioGeek

For mzTab files whose opt_ms_run[1]_aa_scores are already log-probabilities (the parser accepts negative values such as -0.1 unchanged), this extra np.log turns each token score into NaN. Those NaNs then propagate through TokenScoreFeatures and can make calibration fail or train on invalid feature values.

See new test test_create_beam_predictions_preserves_mztab_token_log_probabilities
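
A small illustration of the failure mode, using made-up token scores rather than values from a real mzTab file: applying np.log to values that are already negative log-probabilities yields NaN for every entry.

```python
import numpy as np

# Hypothetical token scores that are already log-probabilities (all negative).
token_log_probs = np.array([-0.1, -0.5, -2.3])

# A second log is undefined for negative inputs; NumPy returns NaN
# (the errstate context only silences the accompanying RuntimeWarning).
with np.errstate(invalid="ignore"):
    double_logged = np.log(token_log_probs)

# Exponentiating instead recovers ordinary probabilities in (0, 1).
probabilities = np.exp(token_log_probs)
```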

Also the issue reported in #182 is still present here:

When ChimericFeatures passes the list of runner-up sequences, len(predictions) is the number of spectra in the dataset, not the peptide length for the current row. This makes chimeric_complementary_ion_count wrong whenever the dataset size differs from the runner-up peptide length, silently corrupting that feature.

JemmaLDaniel · 2026-06-16T17:29:23Z

PR #211 should address the mzTab data loader concerns, and #182 addresses the chimeric feature comment.

BioGeek

The current algorithm processes theoretical ions one by one, in the order they appear in source_mz/source_annotations.

For each theoretical ion, it picks the nearest currently unused observed peak within tolerance. Once that observed peak is claimed, later theoretical ions cannot use it. That fixes the double-counting bug, but the assignment is still greedy: an early theoretical ion can claim a peak that would have been a better or only usable match for a later theoretical ion.

Example:

tolerance = 0.02

theoretical ions:
A = 100.000
B = 100.018

observed peaks:
p1 = 100.010
p2 = 100.030

Processing in order [A, B]:

A matches p1 because it is within tolerance.
B can match p2 because |100.030 - 100.018| = 0.012.
Result: 2 matches.

But a different arrangement can show the limitation:

theoretical ions:
A = 100.000
B = 100.012

observed peaks:
p1 = 100.010

Processing [A, B]:

A claims p1, distance 0.010.
B gets no peak, even though B was closer to p1, distance 0.002.
Result: 1 match assigned to A.
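
Both examples can be reproduced with a small stand-alone sketch of the greedy rule. This is a simplified stand-in, not the project's find_matching_ions: the greedy_match name, the dict-based inputs, and the omission of isotope handling are assumptions for illustration.

```python
def greedy_match(theoretical, observed, tolerance):
    """Assign each theoretical ion, in order, to its nearest unclaimed peak."""
    claimed = set()
    assignment = {}
    for ion, ion_mz in theoretical.items():
        best_peak, best_err = None, None
        for peak, peak_mz in observed.items():
            if peak in claimed:
                continue  # an earlier ion already took this peak
            err = abs(peak_mz - ion_mz)
            if err <= tolerance and (best_err is None or err < best_err):
                best_peak, best_err = peak, err
        if best_peak is not None:
            claimed.add(best_peak)
            assignment[ion] = best_peak
    return assignment


# First arrangement: both ions find a peak.
first = greedy_match(
    {"A": 100.000, "B": 100.018}, {"p1": 100.010, "p2": 100.030}, tolerance=0.02
)

# Second arrangement: A claims p1 first, so B is left unmatched.
second = greedy_match({"A": 100.000, "B": 100.012}, {"p1": 100.010}, tolerance=0.02)
```

Here `first` maps A to p1 and B to p2, while `second` contains only A, matching the two results described above.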

For a future PR we can think about a globally optimized matcher that considers all theoretical/observed candidate pairs within tolerance and chooses the assignment that optimizes some objective, for example:

maximize number of matched theoretical ions,
then minimize total mass error,
maybe prefer M0 over isotope matches,
maybe prefer higher-intensity peaks.

That could be solved as a bipartite matching problem. But it is more complex and would need an explicit biological/scoring policy for ties and isotope envelopes.
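
To make the idea concrete, here is a rough sketch of the first two objectives only (most matches, then smallest total mass error). It uses an exhaustive search over one-to-one assignments, which is only practical for a handful of candidates; a real implementation would more likely use a proper assignment solver. The function name and the tie-breaking behavior are assumptions, and isotope envelopes and intensities are ignored.

```python
def best_assignment(theoretical_mz, observed_mz, tolerance):
    """Exhaustively pick the one-to-one assignment with the most matches,
    breaking ties by the smallest total absolute mass error."""
    # Candidate (peak index, error) pairs within tolerance for each ion.
    candidates = [
        [(j, abs(p - t)) for j, p in enumerate(observed_mz) if abs(p - t) <= tolerance]
        for t in theoretical_mz
    ]
    best = {"key": (0, 0.0), "pairs": []}

    def search(i, used, pairs, total_err):
        if i == len(theoretical_mz):
            key = (len(pairs), -total_err)  # more matches first, then less error
            if key > best["key"]:
                best["key"] = key
                best["pairs"] = list(pairs)
            return
        # Option 1: leave ion i unmatched.
        search(i + 1, used, pairs, total_err)
        # Option 2: give ion i any candidate peak that is still free.
        for j, err in candidates[i]:
            if j not in used:
                used.add(j)
                pairs.append((i, j))
                search(i + 1, used, pairs, total_err + err)
                pairs.pop()
                used.remove(j)

    search(0, set(), [], 0.0)
    return best["pairs"]


# Same inputs as the second example: A = 100.000, B = 100.012, p1 = 100.010.
pairs = best_assignment([100.000, 100.012], [100.010], tolerance=0.02)
```

On those inputs the single peak goes to B (index 1) rather than A, because both options yield one match and B has the smaller error.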

For this PR, the greedy approach is acceptable because the existing algorithm was already greedy.

JemmaLDaniel self-assigned this Apr 10, 2026

JemmaLDaniel added the bug Something isn't working label Apr 10, 2026

JemmaLDaniel force-pushed the fix-fragment-match-double-counting branch from 8e9a314 to f49db77 Compare April 10, 2026 14:36

JemmaLDaniel force-pushed the feat-refactor-calibration-features branch from f0af123 to 3a65f83 Compare April 10, 2026 15:41

JemmaLDaniel added 2 commits April 10, 2026 17:41

fix: exclude already-matched peaks from being counted twice

6a8f356

test: add fragment ion matching tests for double-counting

7b2daac

JemmaLDaniel force-pushed the fix-fragment-match-double-counting branch from f49db77 to 7b2daac Compare April 10, 2026 16:41

BioGeek added 3 commits June 10, 2026 13:55

test: cover mzTab token log probability preservation

a795c97

test: cover chimeric complementary ion peptide lengths

ce3e951

Revert "test: cover chimeric complementary ion peptide lengths"

890f69f

This reverts commit ce3e951.

BioGeek requested changes Jun 10, 2026

Merge branch 'feat-refactor-calibration-features' into fix-fragment-m…

0f82058

…atch-double-counting

JemmaLDaniel mentioned this pull request Jun 16, 2026

Feat refactor calibration features #182

Open

Merge branch 'feat-refactor-calibration-features' into fix-fragment-m…

05840bd

…atch-double-counting

JemmaLDaniel force-pushed the fix-fragment-match-double-counting branch from 593bd2b to 05840bd Compare June 16, 2026 17:28

JemmaLDaniel requested a review from BioGeek June 16, 2026 17:42

BioGeek approved these changes Jun 17, 2026

JemmaLDaniel merged commit f066578 into feat-refactor-calibration-features Jun 19, 2026
2 checks passed

JemmaLDaniel deleted the fix-fragment-match-double-counting branch June 19, 2026 10:08

JemmaLDaniel merged 7 commits into feat-refactor-calibration-features from fix-fragment-match-double-counting
