Fix chaos_metric: eliminate full bounding box allocation on sparse/blob datasets#1718

Open: Bisho2122 wants to merge 1 commit into master from Fix/chaos_dense
@Bisho2122
Collaborator

Problem

Datasets whose spectra form isolated blobs (e.g. two regions at opposite corners of a large imzML bounding box) cause OOM crashes during annotation. The root cause is img.toarray() in formula_validator.py, which materialises the full h×w bounding box as a dense array for every isotope peak of every formula. For a 1000×1000 bounding box with 4 peaks and thousands of formulas, this repeatedly allocates ~4 MB arrays (10^6 pixels at 4 bytes each) that are immediately cropped down to the actual data size. chaos_metric was the first consumer of this dense array and is addressed in this PR. The spectral and spatial metrics are addressed in a follow-up PR.
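To make the cost concrete, here is a minimal sketch of the allocation described above. The data is synthetic and illustrative only (two 20×20 blobs of float32 ones); it is not taken from formula_validator.py.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Illustrative data: two 20x20 blobs at opposite corners of a 1000x1000 bounding box.
h = w = 1000
blob = np.arange(20)
rows = np.concatenate([np.repeat(blob, 20), np.repeat(blob + 980, 20)])
cols = np.concatenate([np.tile(blob, 20), np.tile(blob + 980, 20)])
data = np.ones(rows.size, dtype=np.float32)

img = coo_matrix((data, (rows, cols)), shape=(h, w))

dense = img.toarray()            # allocates the full h x w array
sparse_bytes = img.data.nbytes   # bytes held by the stored values only

print(dense.nbytes)   # 4000000
print(img.nnz)        # 800
print(sparse_bytes)   # 3200
```

The dense array is over a thousand times larger than the stored values, and that allocation is repeated for every image that goes through toarray().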

Change

chaos_metric now accepts the coo_matrix directly instead of a pre-densified np.ndarray. It builds the compact 2D array internally without ever allocating the full h×w bounding box:

  1. Row/col occupancy masks are built from coo.row / coo.col — O(nnz), no dense allocation
  2. The existing _chaos_dilate is applied to those masks — same dilation logic, unchanged
  3. A compact index remap translates original row/col indices to compact positions
  4. np.add.at scatters sparse values into a small dense array — correctly sums duplicate coordinates from split formulas (same behaviour as toarray())
  5. That compact array is passed to measure_of_chaos — identical to what the old code produced after toarray() + cropping
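The steps above can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: compact_from_coo is a hypothetical name, and dilate_stub is a hypothetical stand-in for _chaos_dilate (whose logic is not shown here) that simply keeps the direct neighbours of occupied positions. The final call to measure_of_chaos is omitted.

```python
import numpy as np
from scipy.sparse import coo_matrix


def dilate_stub(mask):
    """Hypothetical stand-in for _chaos_dilate: also keep direct neighbours."""
    out = mask.copy()
    out[1:] |= mask[:-1]
    out[:-1] |= mask[1:]
    return out


def compact_from_coo(coo, dilate=dilate_stub):
    """Build a compact dense array without allocating the full bounding box."""
    h, w = coo.shape

    # Step 1: occupancy masks from the sparse coordinates, O(nnz).
    row_mask = np.zeros(h, dtype=bool)
    col_mask = np.zeros(w, dtype=bool)
    row_mask[coo.row] = True
    col_mask[coo.col] = True

    # Step 2: dilate the 1-D masks (stand-in logic, see above).
    row_mask = dilate(row_mask)
    col_mask = dilate(col_mask)

    # Step 3: remap original indices to positions in the compact array.
    row_map = np.cumsum(row_mask) - 1
    col_map = np.cumsum(col_mask) - 1

    # Step 4: scatter values; np.add.at accumulates duplicate coordinates.
    shape = (int(row_mask.sum()), int(col_mask.sum()))
    compact = np.zeros(shape, dtype=coo.dtype)
    np.add.at(compact, (row_map[coo.row], col_map[coo.col]), coo.data)
    return compact, row_mask, col_mask


# Usage: (0, 0) appears twice, so its values are summed.
rows = np.array([0, 0, 999, 999])
cols = np.array([0, 0, 999, 998])
vals = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)
img = coo_matrix((vals, (rows, cols)), shape=(1000, 1000))

compact, row_mask, col_mask = compact_from_coo(img)
print(compact.shape)  # (4, 5)
```

np.add.at matters in step 4 because a plain fancy-indexed `compact[r, c] += v` applies only one update per repeated coordinate, whereas toarray() sums duplicates.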

For two 20×20 blobs in a 1000×1000 bounding box this reduces the dense allocation behind measure_of_chaos from ~4 MB to a few KB per formula (a reduction of more than 99%). For already-dense images (>90% non-empty in both dimensions), it falls back to toarray(), since compaction isn't worth the overhead there.
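One possible reading of the fallback condition is sketched below. It assumes that "non-empty in both dimensions" means the fraction of occupied rows and the fraction of occupied columns; is_dense_enough and its threshold parameter are hypothetical names, and the actual check in the PR may differ.

```python
import numpy as np
from scipy.sparse import coo_matrix


def is_dense_enough(coo, threshold=0.9):
    """Assumed check: True when more than `threshold` of rows AND columns are occupied."""
    h, w = coo.shape
    if h == 0 or w == 0:
        return False
    row_fill = np.unique(coo.row).size / h
    col_fill = np.unique(coo.col).size / w
    return bool(row_fill > threshold and col_fill > threshold)


# A fully populated 10x10 image would take the toarray() fallback.
full = coo_matrix(np.ones((10, 10), dtype=np.float32))

# Two pixels at opposite corners of a large box would take the compact path.
sparse = coo_matrix(
    (np.ones(2, dtype=np.float32), (np.array([0, 999]), np.array([0, 999]))),
    shape=(1000, 1000),
)

print(is_dense_enough(full))    # True
print(is_dense_enough(sparse))  # False
```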

Correctness guarantee

Results are bit-for-bit identical to the previous implementation. The compact array is constructed by selecting the exact same rows and columns that _chaos_dilate previously selected after toarray(), so measure_of_chaos sees the same spatial pattern.
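A small check along these lines illustrates the equivalence on synthetic data. It is a sketch, not the PR's test: dilation is left out for brevity, the old path is approximated as toarray() followed by cropping to non-empty rows and columns, and the index remap uses np.unique rather than whatever the PR uses.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Synthetic image with a duplicated coordinate at (2, 1).
rows = np.array([2, 2, 2, 7, 8])
cols = np.array([1, 1, 3, 9, 9])
vals = np.array([1.0, 2.0, 5.0, 4.0, 6.0], dtype=np.float32)
img = coo_matrix((vals, (rows, cols)), shape=(10, 12))

# Old path (approximated): densify the full bounding box, then crop.
dense = img.toarray()
old = dense[np.ix_(dense.any(axis=1), dense.any(axis=0))]

# New path: remap indices and scatter straight into the compact array.
row_ids, row_pos = np.unique(img.row, return_inverse=True)
col_ids, col_pos = np.unique(img.col, return_inverse=True)
new = np.zeros((row_ids.size, col_ids.size), dtype=img.dtype)
np.add.at(new, (row_pos, col_pos), img.data)

identical = np.array_equal(old, new) and old.dtype == new.dtype
print(identical)  # True
```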

Tests

  • test_formula_validator.py — existing tests updated for the new call signature (mock updated from dense array indexing to coo_matrix)
  • spheroid.py sci test — run with --database cm3 --analysis-version 3 to confirm chaos shows no diff against the reference CSV

Bisho2122 self-assigned this on Apr 17, 2026
Bisho2122 requested a review from lmacielvieira on Apr 17, 2026 15:33
Bisho2122 added the enhancement label on Apr 17, 2026