Skip to content

Fix fragment match double counting#183

Merged
JemmaLDaniel merged 7 commits into
feat-refactor-calibration-featuresfrom
fix-fragment-match-double-counting
Jun 19, 2026
Merged

Fix fragment match double counting#183
JemmaLDaniel merged 7 commits into
feat-refactor-calibration-featuresfrom
fix-fragment-match-double-counting

Conversation

@JemmaLDaniel

Copy link
Copy Markdown
Collaborator

Fix fragment ion double-counting in peak matching

Summary

Fixes a bug in find_matching_ions where an observed peak that was already matched as part of one theoretical ion's isotopic envelope could be re-matched as the M0 peak of a different theoretical ion. This inflated both ion_matches and ion_match_intensity, producing overly optimistic match rates.

The fix tracks matched observed peak indices in a set and skips any peak that has already been assigned, ensuring each observed peak contributes to at most one theoretical ion match.

Changes

  • winnow/calibration/features/utils.pyfind_matching_ions now maintains a matched_indices set of already-assigned observed peak indices. Before matching an M0 peak, the function checks whether that index has already been claimed (either as a prior M0 or as part of a prior isotopic envelope). Isotopic envelope peaks are also added to the set when matched.
  • tests/calibration/features/test_utils.py — adds targeted test cases that construct spectra where the same observed peak could be matched by multiple theoretical ions, verifying that only the first match is counted.

@JemmaLDaniel JemmaLDaniel self-assigned this Apr 10, 2026
@JemmaLDaniel JemmaLDaniel added the bug Something isn't working label Apr 10, 2026
@github-actions

github-actions Bot commented Apr 10, 2026

Copy link
Copy Markdown

Coverage

Coverage Report
FileStmtsMissCoverMissing
__init__.py00100% 
data_types.py40100% 
calibration
   __init__.py00100% 
   calibration_features.py50100% 
   calibrator.py911583%69–70, 72, 106–109, 134–135, 137, 162–163, 167, 194–195
calibration/features
   __init__.py100100% 
   base.py80100% 
   beam.py470100% 
   chimeric.py77198%198
   constants.py10100% 
   fragment_match.py73198%190
   mass_error.py28292%71, 75
   retention_time.py77198%160
   sequence.py190100% 
   token_score.py37197%82
   utils.py130199%215
compat
   __init__.py00100% 
   instanovo.py10640%12, 14–15, 17, 24–25
datasets
   __init__.py00100% 
   calibration_dataset.py1091784%155, 169, 171, 173, 183, 196, 249, 251–252, 258–261, 263–266
   data_loaders.py2701494%23, 189, 220–221, 414, 455, 847, 851, 900, 911, 1023–1024, 1052–1053
   interfaces.py30100% 
   psm_dataset.py250100% 
fdr
   __init__.py00100% 
   base.py581574%81, 85–86, 91, 98–99, 105, 126, 129–130, 135, 137–138, 144, 186
   database_grounded.py28196%52
   nonparametric.py25484%62, 68–69, 72
scripts
   __init__.py00100% 
   main.py1851850%8, 10–13, 16–20, 23–24, 26–28, 32, 39, 44, 47, 53, 55–56, 59, 68, 76, 79, 86, 88–90, 92, 94–99, 102, 104–105, 110, 125, 128, 135–141, 144–145, 148, 161–163, 166, 169, 174, 176–178, 180, 182–183, 186–187, 190, 192–193, 195, 197, 199–200, 202, 205–206, 209–210, 213–214, 217–219, 221, 224, 238–240, 242, 244, 249, 251–253, 255–256, 258–260, 265–266, 268–270, 272, 274, 276–277, 281–284, 286–287, 289–290, 292–293, 295, 298, 312–314, 317, 320, 325, 327–329, 331–333, 335–336, 339–340, 343, 345–346, 348, 350, 352–353, 355, 358–359, 365–366, 369–370, 373–374, 377–378, 386–388, 392, 395, 399, 402, 425, 438–439, 442, 464, 476–477, 480, 505, 518–519, 522, 537, 549–550, 553, 565, 577–578, 581, 596, 608–609
utils
   __init__.py40100% 
   config_formatter.py534024%29, 37–38, 40–42, 44, 55, 58–60, 62–63, 66–69, 72–74, 77–78, 80, 91, 102, 113, 127–128, 130–132, 145–147, 150, 153–154, 157–158, 160
   config_path.py76593%24–26, 117–118
   peptide.py160100% 
TOTAL146930978% 

Tests Skipped Failures Errors Time
320 0 💤 0 ❌ 0 🔥 35.884s ⏱️

@JemmaLDaniel JemmaLDaniel force-pushed the fix-fragment-match-double-counting branch from 8e9a314 to f49db77 Compare April 10, 2026 14:36
@JemmaLDaniel JemmaLDaniel force-pushed the feat-refactor-calibration-features branch from f0af123 to 3a65f83 Compare April 10, 2026 15:41
@JemmaLDaniel JemmaLDaniel force-pushed the fix-fragment-match-double-counting branch from f49db77 to 7b2daac Compare April 10, 2026 16:41

@BioGeek BioGeek left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For mzTab files whose opt_ms_run[1]_aa_scores are already log-probabilities (the parser accepts negative values such as -0.1 unchanged), this extra np.log turns each token score into NaN. Those NaNs then propagate through TokenScoreFeatures and can make calibration fail or train on invalid feature values.

See new test test_create_beam_predictions_preserves_mztab_token_log_probabilities

Also the issue reported in #182 is still present here:

When ChimericFeatures passes the list of runner-up sequences, len(predictions) is the number of spectra in the dataset, not the peptide length for the current row. This makes chimeric_complementary_ion_count wrong whenever the dataset size differs from the runner-up peptide length, silently corrupting that feature.

@JemmaLDaniel JemmaLDaniel force-pushed the fix-fragment-match-double-counting branch from 593bd2b to 05840bd Compare June 16, 2026 17:28
@JemmaLDaniel

Copy link
Copy Markdown
Collaborator Author

PR #211 should address the MZTab data loader concerns and #182 addresses the chimeric feature comment

@JemmaLDaniel JemmaLDaniel requested a review from BioGeek June 16, 2026 17:42

@BioGeek BioGeek left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current algorithm processes theoretical ions one by one, in the order they appear in source_mz/source_annotations .

For each theoretical ion, it picks the nearest currently unused observed peak within tolerance. Once that observed peak is claimed, later theoretical ions cannot use it. That fixes the double-counting bug, but the assignment is still greedy: an early theoretical ion can claim a peak that would have been a better or only usable match for a later theoretical ion.

Example:

tolerance = 0.02

theoretical ions:
A = 100.000
B = 100.018

observed peaks:
p1 = 100.010
p2 = 100.030

Processing in order [A, B]:

  • A matches p1 because it is within tolerance.
  • B can match p2 because |100.030 - 100.018| = 0.012.
  • Result: 2 matches.

But a different arrangement can show the limitation:

theoretical ions:
A = 100.000
B = 100.012

observed peaks:
p1 = 100.010

Processing [A, B]:

  • A claims p1, distance 0.010.
  • B gets no peak, even though B was closer to p1, distance 0.002.
  • Result: 1 match assigned to A.

For a future PR we can think about a globally optimized matcher which would consider all theoretical/observed candidate pairs within tolerance and chooses assignments that optimize some objective, for example:

  • maximize number of matched theoretical ions,
  • then minimize total mass error,
  • maybe prefer M0 over isotope matches,
  • maybe prefer higher-intensity peaks.

That could be solved as a bipartite matching problem. But it is more complex and would need an explicit biological/scoring policy for ties and isotope envelopes.

For this PR, the greedy is acceptable because the existing algorithm was already greedy.

@JemmaLDaniel JemmaLDaniel merged commit f066578 into feat-refactor-calibration-features Jun 19, 2026
2 checks passed
@JemmaLDaniel JemmaLDaniel deleted the fix-fragment-match-double-counting branch June 19, 2026 10:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants