Fix chaos_metric: eliminate full bounding box allocation on sparse/blob datasets#1718
Open
Problem
Datasets where spectra form isolated blobs (e.g. two regions at opposite corners of a large imzML bounding box) cause OOM crashes during annotation. The root cause is `img.toarray()` in `formula_validator.py`, which materialises the full h×w bounding box as a dense array for every formula × isotope-peak combination. For a 1000×1000 bounding box with 4 peaks and thousands of formulas, this repeatedly allocates ~4 MB arrays that are immediately cropped down to the actual data size. `chaos_metric` was the first consumer of this dense array and is addressed in this PR; the spectral/spatial metrics are addressed in a follow-up PR.

Change
`chaos_metric` now accepts the `coo_matrix` directly instead of a pre-densified `np.ndarray`. It builds the compact 2D array internally without ever allocating the full h×w bounding box:

- Occupied row/column masks are derived from `coo.row` / `coo.col` — O(nnz), no dense allocation
- `_chaos_dilate` is applied to those masks — same dilation logic, unchanged
- `np.add.at` scatters the sparse values into a small dense array — correctly sums duplicate coordinates from split formulas (same behaviour as `toarray()`)

For two 20×20 blobs in a 1000×1000 bounding box this reduces the array passed to `measure_of_chaos` from 4 MB to ~4 KB per formula (~99% reduction). For already-dense images (>90% non-empty in both dimensions), it falls back to `toarray()` since compaction isn't worth the overhead.

Correctness guarantee
Results are bit-for-bit identical to the previous implementation. The compact array is constructed by selecting the exact same rows and columns that `_chaos_dilate` previously selected after `toarray()`, so `measure_of_chaos` sees the same spatial pattern.
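The construction described above can be sketched as follows. This is an illustrative sketch, not the actual implementation: the function name `compact_image` is hypothetical, the `_chaos_dilate` step is elided to a comment, and the placement of the 90% density fallback is an assumption.

```python
import numpy as np
from scipy.sparse import coo_matrix

def compact_image(coo: coo_matrix) -> np.ndarray:
    """Build a compact dense array from a sparse image without
    allocating the full h×w bounding box (illustrative sketch)."""
    h, w = coo.shape
    # Occupied row/column masks from coordinates only: O(nnz).
    row_mask = np.zeros(h, dtype=bool)
    col_mask = np.zeros(w, dtype=bool)
    row_mask[coo.row] = True
    col_mask[coo.col] = True
    # (in the real code, _chaos_dilate would widen these masks here)
    # Fall back to toarray() when the image is already dense.
    if row_mask.mean() > 0.9 and col_mask.mean() > 0.9:
        return coo.toarray()
    # Map original coordinates to compact indices via cumulative sums.
    row_map = np.cumsum(row_mask) - 1
    col_map = np.cumsum(col_mask) - 1
    out = np.zeros((int(row_mask.sum()), int(col_mask.sum())), dtype=coo.dtype)
    # np.add.at sums duplicate coordinates, matching toarray() semantics.
    np.add.at(out, (row_map[coo.row], col_map[coo.col]), coo.data)
    return out
```

For the two-blob example above (two 20×20 blobs, 40 occupied rows and 40 occupied columns), the result collapses to roughly 40×40 instead of 1000×1000.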
Tests
- `test_formula_validator.py` — existing tests updated for the new call signature (mock updated from dense array indexing to `coo_matrix`)
- `spheroid.py` sci test — run with `--database cm3 --analysis-version 3` to confirm chaos shows no diff against the reference CSV
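The duplicate-summing equivalence that the bit-for-bit claim relies on can also be checked in isolation. A standalone sketch with synthetic data (integer-valued so the comparison is exact regardless of summation order):

```python
import numpy as np
from scipy.sparse import coo_matrix

rng = np.random.default_rng(42)
n = 500
# Sparse image with deliberate duplicate coordinates, like the
# merged segments of a split formula.
rows = rng.integers(0, 50, n)
cols = rng.integers(0, 60, n)
data = rng.integers(1, 10, n).astype(float)
coo = coo_matrix((data, (rows, cols)), shape=(50, 60))

# np.add.at accumulates repeated (row, col) pairs ...
dense = np.zeros((50, 60))
np.add.at(dense, (rows, cols), data)

# ... exactly as toarray() does when densifying duplicates.
assert np.array_equal(dense, coo.toarray())
```

Note that plain fancy-indexed assignment (`dense[rows, cols] = data`) would *not* match: it keeps only the last value written per coordinate, which is why the implementation uses `np.add.at`.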