feat: add MS1 scan extraction to parquet conversion (--include-ms1) by BioGeek · Pull Request #138 · instadeepai/InstaNovo

BioGeek · 2026-04-10T14:12:59Z

Summary

Extend mzML/mzXML readers and the instanovo convert CLI to extract MS1 scans alongside MS2 scans into the same Parquet files.

New `--include-ms1` flag

# MS2 only (default, backward compatible)
instanovo convert data.mzML output/ --name sample --partition test

# MS1 + MS2 in same Parquet files
instanovo convert data.mzML output/ --name sample --partition test --include-ms1

Schema changes

A new ms_level column is added to the data dict and the resulting Parquet files:

- 1 for MS1 scans
- 2 for MS2 scans

MS1 rows have None for precursor_mz, precursor_charge, and sequence (these fields are not applicable for MS1 spectra).
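As a rough illustration of the schema described above, the sketch below models a few rows as plain dictionaries and applies the kind of ms_level filter a downstream consumer might use. The row values and the `select_level` helper are hypothetical and are not part of InstaNovo; only the column names come from this PR.

```python
from typing import Any

# Hypothetical rows mirroring the columns named in this PR; all values are made up.
rows: list[dict[str, Any]] = [
    {"scan_number": 1, "ms_level": 1, "precursor_mz": None,
     "precursor_charge": None, "sequence": None},
    {"scan_number": 2, "ms_level": 2, "precursor_mz": 500.25,
     "precursor_charge": 2, "sequence": "PEPTIDE"},
    {"scan_number": 3, "ms_level": 2, "precursor_mz": 612.8,
     "precursor_charge": 3, "sequence": "PEPTIDE"},
]


def select_level(rows: list[dict[str, Any]], ms_level: int) -> list[dict[str, Any]]:
    """Keep only the rows recorded at the requested MS level."""
    return [row for row in rows if row["ms_level"] == ms_level]


ms1_rows = select_level(rows, 1)
ms2_rows = select_level(rows, 2)

# MS1 rows carry no precursor information, so consumers should expect None there.
ms1_has_no_precursor = all(
    row["precursor_mz"] is None and row["precursor_charge"] is None
    for row in ms1_rows
)
print(len(ms1_rows), len(ms2_rows), ms1_has_no_precursor)
```

A consumer that only understands MS2 spectra would keep `ms2_rows` and never touch the nullable precursor fields of the MS1 rows.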

Files changed

- instanovo/utils/msreader.py: read_mzml() and read_mzxml() accept an ms_levels parameter (default [2] for backward compatibility). read_mgf() adds ms_level=2 for consistency.
- instanovo/utils/data_handler.py: load_mzml() and _df_from_mzml() pass ms_levels through.
- instanovo/scripts/convert_to_sdf.py: New --include-ms1 CLI flag. When set, loads MS1 and MS2 scans from mzML/mzXML; for other formats it emits a warning and the flag is ignored.
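The behavior listed above can be sketched roughly as follows. This is not the code from the PR: `resolve_ms_levels`, `filter_scans`, and `MS1_CAPABLE_SUFFIXES` are hypothetical names, and the scans are plain dictionaries rather than parsed mzML spectra. It only shows how a boolean flag could map to an ms_levels list with an MS2-only default and a warning for unsupported formats.

```python
import warnings
from pathlib import Path
from typing import Optional

# Assumption: only these extensions support MS1 extraction, per the PR text.
MS1_CAPABLE_SUFFIXES = {".mzml", ".mzxml"}


def resolve_ms_levels(source: str, include_ms1: bool) -> list[int]:
    """Map the CLI flag to an ms_levels list, defaulting to MS2 only."""
    if not include_ms1:
        return [2]
    if Path(source).suffix.lower() not in MS1_CAPABLE_SUFFIXES:
        # Other formats: warn and fall back to the MS2-only default.
        warnings.warn(f"--include-ms1 ignored for {source}")
        return [2]
    return [1, 2]


def filter_scans(scans: list[dict], ms_levels: Optional[list[int]] = None) -> list[dict]:
    """Keep scans whose ms_level was requested; None means the [2] default."""
    wanted = set(ms_levels if ms_levels is not None else [2])
    return [scan for scan in scans if scan["ms_level"] in wanted]


# Hypothetical scans standing in for spectra read from a file.
scans = [
    {"scan_number": 1, "ms_level": 1},
    {"scan_number": 2, "ms_level": 2},
]

default_scans = filter_scans(scans)
both_levels = filter_scans(scans, resolve_ms_levels("data.mzML", include_ms1=True))
print(len(default_scans), len(both_levels))
```

Keeping [2] as the default means callers that never pass ms_levels see the same rows as before the change.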

Motivation

The INFlow pipeline uses Parquet as the sole data substrate after conversion. The meeting decision (2026-04-10) settled that:

- MS1 and MS2 scans go in the same Parquet files with an ms_level column
- INQuant needs MS1 data for quantification
- Winnow-related metrics may also use MS1 signal information
- Downstream processes filter by ms_level as needed (e.g., InstaNovo predict uses only ms_level=2)

Test plan

- test_convert_command passes
- test_mzml_to_parquet passes
- Pre-commit hooks pass (ruff, mypy, codespell)
- Backward compatible: default behavior (MS2 only) unchanged

🤖 Generated with Claude Code

Extend mzML/mzXML readers and the `instanovo convert` CLI to extract MS1 scans alongside MS2 scans into the same Parquet files.

Changes:

1. msreader.py: read_mzml() and read_mzxml() accept `ms_levels` param
   - Default [2] preserves backward compatibility (MS2 only)
   - Use [1, 2] to extract both MS1 and MS2 scans
   - New `ms_level` column in output: 1 for MS1, 2 for MS2
   - MS1 rows have None for precursor_mz, precursor_charge, sequence
   - read_mgf() also adds ms_level=2 column for consistency
2. data_handler.py: load_mzml() and _df_from_mzml() accept ms_levels
3. convert_to_sdf.py: new --include-ms1 CLI flag
   - When set, extracts both MS1+MS2 from mzML/mzXML files
   - MS2 rows filtered by --max-charge, MS1 rows kept unconditionally
   - Only applies to mzML/mzXML sources (warning for other formats)

This is needed by the INFlow pipeline: Parquet is the sole data substrate after conversion, and quantification (INQuant) requires MS1 scans for MS1-level signal extraction.

Usage:

    instanovo convert data.mzML output/ --name sample --partition test --include-ms1

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
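The charge-filter rule in this commit message (MS2 rows filtered by --max-charge, MS1 rows kept unconditionally) might look roughly like the sketch below. `keep_row` is a hypothetical helper, not the function used in convert_to_sdf.py, and dropping MS2 rows with a missing charge is an assumption made here for the illustration.

```python
from typing import Optional


def keep_row(ms_level: int, precursor_charge: Optional[int], max_charge: int) -> bool:
    """MS1 rows always pass; MS2 rows pass only within the charge limit."""
    if ms_level == 1:
        # MS1 rows have no precursor charge, so the limit cannot apply to them.
        return True
    # Assumption: an MS2 row without a charge is dropped.
    return precursor_charge is not None and precursor_charge <= max_charge


# Hypothetical rows; values are made up.
rows = [
    {"scan_number": 1, "ms_level": 1, "precursor_charge": None},
    {"scan_number": 2, "ms_level": 2, "precursor_charge": 2},
    {"scan_number": 3, "ms_level": 2, "precursor_charge": 12},
]

kept = [
    row for row in rows
    if keep_row(row["ms_level"], row["precursor_charge"], max_charge=10)
]
print([row["scan_number"] for row in kept])
```

Checking ms_level before comparing charges avoids evaluating `None <= max_charge`, which raises a TypeError in Python 3.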

When MS1 scans are included (--include-ms1), precursor_mz and precursor_charge are None. The previous code passed these values through NumPy, which failed on None * None. The code now uses Python list comprehensions to compute precursor_mass with null propagation.

Also adds the ms_level column to the _df_from_dict output when it is present in the input data dict, and updates the mzml/mzxml test expected values to include the new ms_level column.

All 7 conversion tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
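The null-propagating computation described in this commit can be illustrated with a small sketch. The formula below is the standard neutral-mass relation, (m/z minus the proton mass) times the charge; whether the PR computes precursor_mass in exactly this way is an assumption, and `PROTON_MASS` and `neutral_mass` are illustrative names.

```python
from typing import Optional

PROTON_MASS = 1.007276  # Da, approximate; illustrative constant


def neutral_mass(mz: Optional[float], charge: Optional[int]) -> Optional[float]:
    """Return the neutral precursor mass, or None if either input is missing."""
    if mz is None or charge is None:
        return None
    return (mz - PROTON_MASS) * charge


# Hypothetical columns: the first entry stands for an MS1 row.
precursor_mz = [None, 500.25, 612.8]
precursor_charge = [None, 2, 3]

# A list comprehension lets None flow through instead of raising on None * None.
precursor_mass = [
    neutral_mass(mz, charge) for mz, charge in zip(precursor_mz, precursor_charge)
]
print(precursor_mass[0] is None, len(precursor_mass))
```

The explicit None check is what separates this from a vectorized multiplication over an object array, where arithmetic on None raises a TypeError.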

…tput

All three readers (read_mzml, read_mzxml, read_mgf) now add experiment_name (derived from the input filename stem) to each spectrum row. _df_from_dict() auto-constructs spectrum_id as "experiment_name:scan_number" when experiment_name is present and spectrum_id is not.

This ensures Parquet output from instanovo convert always has the columns needed by downstream tools (Winnow requires spectrum_id for merging predictions with spectrum data). Previously, experiment_name and spectrum_id were only added through the SpectrumDataFrame constructor path with add_spectrum_id=True, not through the direct load_mzml() path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

BioGeek · 2026-04-10T22:46:33Z

Additional fix added to this PR

Always include `experiment_name` and `spectrum_id` in mzML/mzXML output (`988c1ca`)

All three readers (read_mzml, read_mzxml, read_mgf) now add experiment_name (from filename stem) to each spectrum row. _df_from_dict() auto-constructs spectrum_id as experiment_name:scan_number.

This ensures Parquet output from instanovo convert always has the columns needed by Winnow (which merges on spectrum_id). Previously these columns were only added through the SpectrumDataFrame constructor path with add_spectrum_id=True, not the direct load_mzml() path.
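A minimal sketch of the identifier scheme described above, assuming experiment_name is the filename stem and spectrum_id is only filled in when it is missing. The helper names and the example path are hypothetical and are not taken from the InstaNovo code.

```python
from pathlib import Path
from typing import Any


def experiment_name_from(path: str) -> str:
    """Use the filename stem (name without its final suffix) as experiment name."""
    return Path(path).stem


def with_spectrum_id(row: dict[str, Any]) -> dict[str, Any]:
    """Add spectrum_id as experiment_name:scan_number unless already present."""
    if "experiment_name" in row and "spectrum_id" not in row:
        spectrum_id = f"{row['experiment_name']}:{row['scan_number']}"
        return {**row, "spectrum_id": spectrum_id}
    return row


# Hypothetical input path and scan number.
name = experiment_name_from("runs/sample_01.mzML")
row = with_spectrum_id({"experiment_name": name, "scan_number": 17})
print(row["spectrum_id"])
```

Note that `Path.stem` strips only the last suffix, so a doubly suffixed name such as sample_01.mzML.gz would yield sample_01.mzML under this sketch.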

All tests pass.

BioGeek requested a review from KevinEloff April 10, 2026 14:14

BioGeek and others added 2 commits April 10, 2026 17:09

BioGeek wants to merge 3 commits into main from feat-extract-ms1-scans
