feat: add MS1 scan extraction to parquet conversion (--include-ms1)#138
Open
BioGeek wants to merge 3 commits into
Extend mzML/mzXML readers and the `instanovo convert` CLI to extract MS1 scans alongside MS2 scans into the same Parquet files.

Changes:

1. msreader.py: read_mzml() and read_mzxml() accept `ms_levels` param
   - Default [2] preserves backward compatibility (MS2 only)
   - Use [1, 2] to extract both MS1 and MS2 scans
   - New `ms_level` column in output: 1 for MS1, 2 for MS2
   - MS1 rows have None for precursor_mz, precursor_charge, sequence
   - read_mgf() also adds ms_level=2 column for consistency
2. data_handler.py: load_mzml() and _df_from_mzml() accept ms_levels
3. convert_to_sdf.py: new --include-ms1 CLI flag
   - When set, extracts both MS1+MS2 from mzML/mzXML files
   - MS2 rows filtered by --max-charge, MS1 rows kept unconditionally
   - Only applies to mzML/mzXML sources (warning for other formats)

This is needed by the INFlow pipeline: Parquet is the sole data substrate after conversion, and quantification (INQuant) requires MS1 scans for MS1-level signal extraction.

Usage:

    instanovo convert data.mzML output/ --name sample --partition test --include-ms1

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
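The row-level behaviour described in this commit can be sketched in plain Python. This is an illustration only, not the code in `msreader.py` or `convert_to_sdf.py`: `select_ms_levels` and `filter_by_max_charge` are hypothetical helpers, the list-of-dicts row layout is an assumption, and MS2 rows are assumed to always carry a charge.

```python
from typing import Any

Row = dict[str, Any]


def select_ms_levels(scans: list[Row], ms_levels: tuple[int, ...] = (2,)) -> list[Row]:
    """Keep scans whose ms_level was requested; blank precursor fields on MS1 rows."""
    rows = []
    for scan in scans:
        if scan["ms_level"] not in ms_levels:
            continue
        row = dict(scan)  # copy so the input scans are left untouched
        if row["ms_level"] == 1:
            # MS1 spectra have no precursor or peptide annotation.
            row.update(precursor_mz=None, precursor_charge=None, sequence=None)
        rows.append(row)
    return rows


def filter_by_max_charge(rows: list[Row], max_charge: int) -> list[Row]:
    """Apply the charge cutoff to MS2 rows only; MS1 rows are kept unconditionally."""
    return [
        r for r in rows
        if r["ms_level"] == 1 or r["precursor_charge"] <= max_charge
    ]


# Made-up scans for illustration.
scans = [
    {"scan_number": 1, "ms_level": 1},
    {"scan_number": 2, "ms_level": 2, "precursor_mz": 500.25,
     "precursor_charge": 2, "sequence": None},
    {"scan_number": 3, "ms_level": 2, "precursor_mz": 300.10,
     "precursor_charge": 12, "sequence": None},
]

ms2_only = select_ms_levels(scans)  # default mirrors the MS2-only behaviour
both = filter_by_max_charge(select_ms_levels(scans, ms_levels=(1, 2)), max_charge=10)
```

Checking `ms_level == 1` before touching `precursor_charge` is what keeps the cutoff from comparing `None` against an integer on MS1 rows.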
When MS1 scans are included (--include-ms1), precursor_mz and precursor_charge are None. The previous code passed these through numpy which failed on None * None. Now uses Python list comprehensions to compute precursor_mass with null propagation.

Also adds ms_level column to _df_from_dict output when present in the input data dict, and updates mzml/mzxml test expected values to include the new ms_level column.

All 7 conversion tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
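A minimal sketch of null-propagating precursor mass via a list comprehension follows. The helper name `precursor_masses`, the proton-mass constant, and the neutral-mass relation `(m/z - proton) * charge` are assumptions for illustration; the exact formula used in `_df_from_dict` is not shown in this PR text.

```python
from typing import Optional

PROTON_MASS = 1.007276  # Da; approximate value, assumed for this sketch


def precursor_masses(
    mzs: list[Optional[float]], charges: list[Optional[int]]
) -> list[Optional[float]]:
    """Return neutral precursor masses, yielding None wherever an input is None."""
    return [
        (mz - PROTON_MASS) * z if mz is not None and z is not None else None
        for mz, z in zip(mzs, charges)
    ]


# First entry mimics an MS1 row (no precursor), second an MS2 row.
masses = precursor_masses([None, 500.0], [None, 2])
```

The explicit `is not None` checks avoid the arithmetic on `None` values that the commit message reports as the failure.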
…tput

All three readers (read_mzml, read_mzxml, read_mgf) now add experiment_name (derived from the input filename stem) to each spectrum row. _df_from_dict() auto-constructs spectrum_id as "experiment_name:scan_number" when experiment_name is present and spectrum_id is not.

This ensures Parquet output from instanovo convert always has the columns needed by downstream tools (Winnow requires spectrum_id for merging predictions with spectrum data).

Previously, experiment_name and spectrum_id were only added through the SpectrumDataFrame constructor path with add_spectrum_id=True, not through the direct load_mzml() path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
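The `spectrum_id` rule described above can be sketched as follows. `experiment_name_from_path` and `ensure_spectrum_id` are hypothetical names, and the column-oriented dict is an assumed layout; the real logic lives in the readers and `_df_from_dict()`.

```python
from pathlib import Path
from typing import Any


def experiment_name_from_path(path: str) -> str:
    """Derive the experiment name from the input filename stem."""
    return Path(path).stem


def ensure_spectrum_id(data: dict[str, list[Any]]) -> dict[str, list[Any]]:
    """Add spectrum_id as 'experiment_name:scan_number' unless it already exists."""
    if "experiment_name" in data and "spectrum_id" not in data:
        data = dict(data)  # shallow copy; do not mutate the caller's dict
        data["spectrum_id"] = [
            f"{name}:{scan}"
            for name, scan in zip(data["experiment_name"], data["scan_number"])
        ]
    return data


# Illustrative input path and scan numbers.
name = experiment_name_from_path("runs/sample_01.mzML")
out = ensure_spectrum_id({"experiment_name": [name, name], "scan_number": [1, 2]})
```

An existing `spectrum_id` column is left untouched, matching the "present and spectrum_id is not" condition in the commit message.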
Additional fix added to this PR

Always include
Summary

Extend mzML/mzXML readers and the `instanovo convert` CLI to extract MS1 scans alongside MS2 scans into the same Parquet files.

New `--include-ms1` flag

Schema changes

New `ms_level` column added to the data dict and resulting Parquet:

- `1` for MS1 scans
- `2` for MS2 scans

MS1 rows have `None` for `precursor_mz`, `precursor_charge`, and `sequence` (these fields are not applicable for MS1 spectra).

Files changed

- `instanovo/utils/msreader.py`: `read_mzml()` and `read_mzxml()` accept `ms_levels` parameter (default `[2]` for backward compat). `read_mgf()` adds `ms_level=2` for consistency.
- `instanovo/utils/data_handler.py`: `load_mzml()` and `_df_from_mzml()` pass through `ms_levels`.
- `instanovo/scripts/convert_to_sdf.py`: New `--include-ms1` CLI flag. When set, loads MS1+MS2 from mzML/mzXML; warns and ignores for other formats.

Motivation

The INFlow pipeline uses Parquet as the sole data substrate after conversion. The meeting decision (2026-04-10) settled that:

- `ms_level` column
- `ms_level` as needed (e.g., InstaNovo predict uses only `ms_level=2`)

Test plan

- `test_convert_command` passes
- `test_mzml_to_parquet` passes

🤖 Generated with Claude Code