Fix fragment match double counting #183
BioGeek
left a comment
For mzTab files whose opt_ms_run[1]_aa_scores are already log-probabilities (the parser accepts negative values such as -0.1 unchanged), this extra np.log turns each token score into NaN. Those NaNs then propagate through TokenScoreFeatures and can make calibration fail or train on invalid feature values.
See the new test test_create_beam_predictions_preserves_mztab_token_log_probabilities.
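The failure mode can be reproduced in isolation. The sketch below assumes token scores arrive either as probabilities in (0, 1] or as log-probabilities (values <= 0). `to_log_probabilities` is a hypothetical helper written for illustration, not a function from this codebase, and its scale-detection heuristic is an assumption.

```python
import numpy as np

# Scores that are already log-probabilities, as an mzTab file might provide.
log_probs = np.array([-0.1, -0.5, -2.3])

# The log of a negative number is undefined, so NumPy returns NaN for each entry.
with np.errstate(invalid="ignore"):  # silence the RuntimeWarning for this demo
    double_logged = np.log(log_probs)


def to_log_probabilities(scores):
    """Hypothetical guard: return log-probabilities without logging twice.

    Assumption: values <= 0 are already log-probabilities and values in
    (0, 1] are probabilities. Anything else is rejected rather than guessed.
    """
    scores = np.asarray(scores, dtype=float)
    if np.all(scores <= 0.0):
        return scores  # already on the log scale, leave unchanged
    if np.all((scores > 0.0) & (scores <= 1.0)):
        return np.log(scores)  # plain probabilities, convert once
    raise ValueError("scores mix scales or fall outside the expected ranges")


safe = to_log_probabilities(log_probs)
```

Inferring the scale from the values is fragile (a score of exactly 0.0 is ambiguous), so carrying an explicit flag for the score scale from the parser would probably be more robust.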
Also, the issue reported in #182 is still present here:
When ChimericFeatures passes the list of runner-up sequences, len(predictions) is the number of spectra in the dataset, not the peptide length for the current row. This makes chimeric_complementary_ion_count wrong whenever the dataset size differs from the runner-up peptide length, silently corrupting that feature.
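To make the mismatch concrete, here is a toy illustration with made-up data. The variable names are hypothetical and do not mirror the actual ChimericFeatures code.

```python
# One runner-up peptide per spectrum, each a list of residue tokens (made-up data).
runner_up_sequences = [
    ["P", "E", "P", "T", "I", "D", "E"],
    ["A", "C", "K"],
    ["G", "L", "S", "D", "R"],
    ["M", "K"],
]

# Length of the outer list: the number of spectra in the dataset.
dataset_size = len(runner_up_sequences)

# Length of each inner list: the peptide length for that row, which is the
# quantity a per-row ion count should depend on.
peptide_lengths = [len(sequence) for sequence in runner_up_sequences]
```

Here the dataset size is 4 while the peptide lengths are 7, 3, 5 and 2, so using the outer length in place of the per-row length would be wrong for every row.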
BioGeek
left a comment
The current algorithm processes theoretical ions one by one, in the order they appear in source_mz/source_annotations.
For each theoretical ion, it picks the nearest currently unused observed peak within tolerance. Once that observed peak is claimed, later theoretical ions cannot use it. That fixes the double-counting bug, but the assignment is still greedy: an early theoretical ion can claim a peak that would have been a better or only usable match for a later theoretical ion.
Example:
tolerance = 0.02
theoretical ions:
A = 100.000
B = 100.018
observed peaks:
p1 = 100.010
p2 = 100.030
Processing in order [A, B]:
- A matches p1 because it is within tolerance.
- B can match p2 because |100.030 - 100.018| = 0.012.
- Result: 2 matches.
But a different arrangement can show the limitation:
theoretical ions:
A = 100.000
B = 100.012
observed peaks:
p1 = 100.010
Processing [A, B]:
- A claims p1, distance 0.010.
- B gets no peak, even though B was closer to p1, distance 0.002.
- Result: 1 match assigned to A.
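Both examples can be reproduced with a minimal sketch of the greedy strategy described above. `greedy_match` is a hypothetical stand-in for illustration; it ignores isotopes and intensities and is not the actual find_matching_ions implementation.

```python
def greedy_match(theoretical_mz, observed_mz, tolerance):
    """Assign each theoretical ion, in input order, to its nearest unclaimed peak."""
    claimed = set()  # observed peak indices that are already assigned
    assignment = {}  # theoretical index -> observed index
    for t_index, t_mz in enumerate(theoretical_mz):
        best_index, best_error = None, None
        for o_index, o_mz in enumerate(observed_mz):
            error = abs(o_mz - t_mz)
            if o_index in claimed or error > tolerance:
                continue  # peak already used, or too far away
            if best_error is None or error < best_error:
                best_index, best_error = o_index, error
        if best_index is not None:
            claimed.add(best_index)
            assignment[t_index] = best_index
    return assignment


# First example: A -> p1 and B -> p2, so both ions are matched.
first = greedy_match([100.000, 100.018], [100.010, 100.030], tolerance=0.02)

# Second example: A claims p1 first, so B is left without a peak.
second = greedy_match([100.000, 100.012], [100.010], tolerance=0.02)
```

In the second call the result depends on input order: swapping A and B would give the single peak to B instead.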
For a future PR, we could consider a globally optimized matcher that evaluates all theoretical/observed candidate pairs within tolerance and chooses the assignment that optimizes some objective, for example:
- maximize number of matched theoretical ions,
- then minimize total mass error,
- maybe prefer M0 over isotope matches,
- maybe prefer higher-intensity peaks.
That could be solved as a bipartite matching problem. But it is more complex and would need an explicit biological/scoring policy for ties and isotope envelopes.
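As a rough illustration of the first two objectives only (most matches, then lowest total mass error), the sketch below searches all assignments exhaustively. This brute-force search is a stand-in for a proper bipartite matching solver and is only practical for tiny inputs; `best_assignment` is a hypothetical name, and isotope and intensity preferences are deliberately left out.

```python
def best_assignment(theoretical_mz, observed_mz, tolerance):
    """Exhaustively find the assignment with the most matches, then least total error."""
    best = {"count": 0, "error": 0.0, "assignment": {}}

    def search(t_index, used, assignment, total_error):
        if t_index == len(theoretical_mz):
            candidate = (-len(assignment), total_error)
            if candidate < (-best["count"], best["error"]):
                best["count"] = len(assignment)
                best["error"] = total_error
                best["assignment"] = dict(assignment)
            return
        # Option 1: leave this theoretical ion unmatched.
        search(t_index + 1, used, assignment, total_error)
        # Option 2: match it to any unused observed peak within tolerance.
        for o_index, o_mz in enumerate(observed_mz):
            error = abs(o_mz - theoretical_mz[t_index])
            if o_index in used or error > tolerance:
                continue
            used.add(o_index)
            assignment[t_index] = o_index
            search(t_index + 1, used, assignment, total_error + error)
            used.remove(o_index)
            del assignment[t_index]

    search(0, set(), {}, 0.0)
    return best["assignment"]


# Second example from above: the single peak now goes to B (index 1),
# because B -> p1 has the smaller mass error.
optimal = best_assignment([100.000, 100.012], [100.010], tolerance=0.02)
```

The number of matches is the same as with the greedy approach in this case; only the choice of which ion receives the peak changes, which is why a tie-breaking policy would need to be agreed on first.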
For this PR, the greedy approach is acceptable because the existing algorithm was already greedy.
f066578
into
feat-refactor-calibration-features
Fix fragment ion double-counting in peak matching
Summary
Fixes a bug in find_matching_ions where an observed peak that was already matched as part of one theoretical ion's isotopic envelope could be re-matched as the M0 peak of a different theoretical ion. This inflated both ion_matches and ion_match_intensity, producing overly optimistic match rates.
The fix tracks matched observed peak indices in a set and skips any peak that has already been assigned, ensuring each observed peak contributes to at most one theoretical ion match.
Changes
- winnow/calibration/features/utils.py: find_matching_ions now maintains a matched_indices set of already-assigned observed peak indices. Before matching an M0 peak, the function checks whether that index has already been claimed (either as a prior M0 or as part of a prior isotopic envelope). Isotopic envelope peaks are also added to the set when matched.
- tests/calibration/features/test_utils.py: adds targeted test cases that construct spectra where the same observed peak could be matched by multiple theoretical ions, verifying that only the first match is counted.
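The kind of scenario those tests target can be sketched as follows. This is a simplified, hypothetical reconstruction, not the code in utils.py: `count_ion_matches` and `nearest_unclaimed` are made-up names, and the sketch assumes singly charged ions, a single isotope peak per ion, and an isotope spacing of about 1.00335 Da (the 13C/12C mass difference).

```python
ISOTOPE_SPACING = 1.00335  # approximate 13C - 12C mass difference in Da


def nearest_unclaimed(target_mz, observed_mz, tolerance, matched_indices):
    """Return the index of the nearest unclaimed peak within tolerance, or None."""
    best_index, best_error = None, None
    for index, mz in enumerate(observed_mz):
        error = abs(mz - target_mz)
        if index in matched_indices or error > tolerance:
            continue
        if best_error is None or error < best_error:
            best_index, best_error = index, error
    return best_index


def count_ion_matches(theoretical_mz, observed_mz, tolerance, charge=1):
    """Count matched theoretical ions while letting each observed peak be used once."""
    matched_indices = set()
    ion_matches = 0
    for m0_mz in theoretical_mz:
        m0_index = nearest_unclaimed(m0_mz, observed_mz, tolerance, matched_indices)
        if m0_index is None:
            continue  # no free peak for this ion's M0
        matched_indices.add(m0_index)
        ion_matches += 1
        # Claim the first isotope peak too, so a later ion cannot reuse it as its M0.
        isotope_mz = m0_mz + ISOTOPE_SPACING / charge
        isotope_index = nearest_unclaimed(isotope_mz, observed_mz, tolerance, matched_indices)
        if isotope_index is not None:
            matched_indices.add(isotope_index)
    return ion_matches, matched_indices


# The peak at 501.003 belongs to the isotopic envelope of the ion at 500.0,
# so the second theoretical ion (M0 at 501.0) must not count it again.
ion_matches, matched_indices = count_ion_matches(
    theoretical_mz=[500.0, 501.0],
    observed_mz=[500.001, 501.003],
    tolerance=0.02,
)
```

Without the matched_indices check, the peak at 501.003 would be counted once as an isotope of the first ion and again as the M0 of the second, giving two matches instead of one.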