feat: add MS1 scan extraction to parquet conversion (--include-ms1) #138

Open: BioGeek wants to merge 3 commits into main from feat-extract-ms1-scans

BioGeek commented Apr 10, 2026

Collaborator

Summary

Extend mzML/mzXML readers and the instanovo convert CLI to extract MS1 scans alongside MS2 scans into the same Parquet files.

New --include-ms1 flag

# MS2 only (default, backward compatible)
instanovo convert data.mzML output/ --name sample --partition test

# MS1 + MS2 in same Parquet files
instanovo convert data.mzML output/ --name sample --partition test --include-ms1

Schema changes

New ms_level column added to the data dict and resulting Parquet:

  • 1 for MS1 scans
  • 2 for MS2 scans

MS1 rows have None for precursor_mz, precursor_charge, and sequence (these fields are not applicable for MS1 spectra).
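
As a rough illustration of the schema described above, the sketch below builds two rows by hand and filters them by `ms_level`. Only the column names come from this PR; the values and the `select_ms_level` helper are hypothetical, and the real converter writes these columns to Parquet rather than to a list of dicts.

```python
# Hypothetical rows; only the column names are taken from this PR.
rows = [
    # MS1 scan: precursor fields and sequence are not applicable, so None.
    {"scan_number": 1, "ms_level": 1, "precursor_mz": None,
     "precursor_charge": None, "sequence": None},
    # MS2 scan: precursor fields are populated (illustrative values).
    {"scan_number": 2, "ms_level": 2, "precursor_mz": 500.25,
     "precursor_charge": 2, "sequence": "PEPTIDE"},
]


def select_ms_level(rows, level):
    """Return only the rows recorded at the given MS level."""
    return [row for row in rows if row["ms_level"] == level]


ms1_rows = select_ms_level(rows, 1)
ms2_rows = select_ms_level(rows, 2)
```

A downstream consumer that only needs fragment spectra would keep `ms2_rows` and never see the `None` precursor fields.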

Files changed

  • instanovo/utils/msreader.py: read_mzml() and read_mzxml() accept an ms_levels parameter (default [2], which preserves the MS2-only behavior). read_mgf() sets ms_level=2 on every row for consistency.
  • instanovo/utils/data_handler.py: load_mzml() and _df_from_mzml() pass ms_levels through.
  • instanovo/scripts/convert_to_sdf.py: new --include-ms1 CLI flag. When set, MS1 and MS2 scans are loaded from mzML/mzXML sources; for other formats the flag is ignored with a warning.
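
The idea behind the `ms_levels` parameter can be sketched as follows. This is not the code in msreader.py: `filter_scans` and the scan dicts are hypothetical stand-ins, and the sketch uses a `None` sentinel for the MS2-only default purely as one way to avoid a mutable default argument.

```python
from typing import Any, Dict, Iterable, List, Optional, Sequence


def filter_scans(
    scans: Iterable[Dict[str, Any]],
    ms_levels: Optional[Sequence[int]] = None,
) -> List[Dict[str, Any]]:
    """Keep scans whose MS level is requested; default is MS2 only."""
    wanted = {2} if ms_levels is None else set(ms_levels)
    rows = []
    for scan in scans:
        level = scan["ms_level"]
        if level not in wanted:
            continue
        row = dict(scan)  # copy so the input scan is not mutated
        if level == 1:
            # Precursor fields and sequence do not apply to MS1 scans.
            row.update(precursor_mz=None, precursor_charge=None, sequence=None)
        rows.append(row)
    return rows


# Hypothetical parsed scans (values are illustrative only).
scans = [
    {"scan_number": 1, "ms_level": 1},
    {"scan_number": 2, "ms_level": 2, "precursor_mz": 445.12,
     "precursor_charge": 2, "sequence": None},
]

ms2_only = filter_scans(scans)           # default: backward compatible
ms1_and_ms2 = filter_scans(scans, [1, 2])
```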

Motivation

The INFlow pipeline uses Parquet as the sole data substrate after conversion. The meeting on 2026-04-10 settled that:

  • MS1 and MS2 scans go in the same Parquet files with an ms_level column
  • INQuant needs MS1 data for quantification
  • Winnow-related metrics may also use MS1 signal information
  • Downstream processes filter by ms_level as needed (e.g., InstaNovo predict uses only ms_level=2)

Test plan

  • test_convert_command passes
  • test_mzml_to_parquet passes
  • Pre-commit hooks pass (ruff, mypy, codespell)
  • Backward compatible: default behavior (MS2 only) unchanged

🤖 Generated with Claude Code

Extend mzML/mzXML readers and the `instanovo convert` CLI to extract
MS1 scans alongside MS2 scans into the same Parquet files.

Changes:

1. msreader.py: read_mzml() and read_mzxml() accept `ms_levels` param
   - Default [2] preserves backward compatibility (MS2 only)
   - Use [1, 2] to extract both MS1 and MS2 scans
   - New `ms_level` column in output: 1 for MS1, 2 for MS2
   - MS1 rows have None for precursor_mz, precursor_charge, sequence
   - read_mgf() also adds ms_level=2 column for consistency

2. data_handler.py: load_mzml() and _df_from_mzml() accept ms_levels

3. convert_to_sdf.py: new --include-ms1 CLI flag
   - When set, extracts both MS1+MS2 from mzML/mzXML files
   - MS2 rows filtered by --max-charge, MS1 rows kept unconditionally
   - Only applies to mzML/mzXML sources (warning for other formats)
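
A minimal sketch of the charge filter described in item 3, assuming rows are plain dicts: `apply_max_charge` is a hypothetical helper, not the function in convert_to_sdf.py, and dropping MS2 rows with a missing charge is an assumption of this sketch.

```python
def apply_max_charge(rows, max_charge):
    """Filter MS2 rows by precursor charge; keep every MS1 row."""
    kept = []
    for row in rows:
        if row["ms_level"] == 1:
            # MS1 rows have no precursor charge, so they bypass the filter.
            kept.append(row)
        elif (row["precursor_charge"] is not None
              and row["precursor_charge"] <= max_charge):
            kept.append(row)
    return kept


# Hypothetical rows (values are illustrative only).
rows = [
    {"scan_number": 1, "ms_level": 1, "precursor_charge": None},
    {"scan_number": 2, "ms_level": 2, "precursor_charge": 2},
    {"scan_number": 3, "ms_level": 2, "precursor_charge": 12},
]

filtered = apply_max_charge(rows, max_charge=10)
```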

This is needed by the INFlow pipeline: Parquet is the sole data substrate
after conversion, and quantification (INQuant) requires MS1 scans for
MS1-level signal extraction.

Usage:
  instanovo convert data.mzML output/ --name sample --partition test --include-ms1

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

BioGeek requested a review from KevinEloff April 10, 2026 14:14

BioGeek and others added 2 commits April 10, 2026 17:09

When MS1 scans are included (--include-ms1), precursor_mz and
precursor_charge are None. The previous code passed these through
numpy which failed on None * None. Now uses Python list comprehensions
to compute precursor_mass with null propagation.
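
The null propagation described above can be sketched like this. The mass formula used here (neutral mass from m/z and charge) is a common convention and an assumption of this sketch; the exact expression and constant in InstaNovo may differ, and `precursor_mass` is a hypothetical helper.

```python
PROTON_MASS = 1.007276  # approximate proton mass in Da (assumed constant)


def precursor_mass(mz, charge):
    """Return the neutral precursor mass, or None if either input is None."""
    if mz is None or charge is None:
        return None
    return mz * charge - charge * PROTON_MASS


# One MS1 row (None, None) followed by one MS2 row; values are illustrative.
precursor_mz = [None, 500.25]
precursor_charge = [None, 2]

# A list comprehension lets None pass through instead of raising TypeError,
# which is what plain `None * None` does.
masses = [
    precursor_mass(mz, z) for mz, z in zip(precursor_mz, precursor_charge)
]
```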

Also adds ms_level column to _df_from_dict output when present
in the input data dict, and updates mzml/mzxml test expected
values to include the new ms_level column.

All 7 conversion tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tput

All three readers (read_mzml, read_mzxml, read_mgf) now add
experiment_name (derived from the input filename stem) to each
spectrum row.

_df_from_dict() auto-constructs spectrum_id as
"experiment_name:scan_number" when experiment_name is present and
spectrum_id is not.
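
The two steps above, deriving experiment_name from the filename stem and building spectrum_id from it, might look roughly like this. Both helper names are hypothetical; only the "experiment_name:scan_number" format and the condition come from this commit message.

```python
from pathlib import Path


def experiment_name_from_path(path):
    """Use the filename stem (name without the final suffix)."""
    return Path(path).stem


def with_spectrum_id(row):
    """Add spectrum_id when experiment_name is present and spectrum_id is not."""
    if "experiment_name" in row and "spectrum_id" not in row:
        spectrum_id = f"{row['experiment_name']}:{row['scan_number']}"
        return {**row, "spectrum_id": spectrum_id}
    return row


# Hypothetical input path and scan number.
name = experiment_name_from_path("data/sample_01.mzML")
row = with_spectrum_id({"experiment_name": name, "scan_number": 17})
```

Rows that already carry a spectrum_id are returned unchanged, so an identifier set upstream is not overwritten.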

This ensures Parquet output from instanovo convert always has the
columns needed by downstream tools (Winnow requires spectrum_id for
merging predictions with spectrum data).

Previously, experiment_name and spectrum_id were only added through
the SpectrumDataFrame constructor path with add_spectrum_id=True,
not through the direct load_mzml() path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

BioGeek commented Apr 10, 2026

Collaborator Author

Additional fix added to this PR

Always include experiment_name and spectrum_id in mzML/mzXML output (988c1ca)

All three readers (read_mzml, read_mzxml, read_mgf) now add experiment_name (from filename stem) to each spectrum row. _df_from_dict() auto-constructs spectrum_id as experiment_name:scan_number.

This ensures Parquet output from instanovo convert always has the columns needed by Winnow (which merges on spectrum_id). Previously these columns were only added through the SpectrumDataFrame constructor path with add_spectrum_id=True, not the direct load_mzml() path.

All tests pass.
