Fix chaos_metric: eliminate full bounding box allocation on sparse/blob datasets#1718
Open
Problem
Datasets where spectra form isolated blobs (e.g. two regions at opposite corners of a large imzML bounding box) cause OOM crashes during annotation. The root cause is `img.toarray()` in `formula_validator.py`, which materialises the full h×w bounding box as a dense array for every formula × isotope-peak combination. For a 1000×1000 bounding box with 4 peaks and thousands of formulas, this repeatedly allocates ~4 MB arrays that are immediately cropped down to the actual data size. `chaos_metric` was the first consumer of this dense array and is addressed in this PR; the spectral/spatial metrics are addressed in a follow-up PR.

Change
`chaos_metric` now accepts the `coo_matrix` directly instead of a pre-densified `np.ndarray`. It builds the compact 2D array internally without ever allocating the full h×w bounding box:

- Occupied row/column masks are derived from `coo.row` / `coo.col` — O(nnz), no dense allocation
- `_chaos_dilate` is applied to those masks — same dilation logic, unchanged
- `np.add.at` scatters the sparse values into a small dense array — correctly sums duplicate coordinates from split formulas (same behaviour as `toarray()`)

For two 20×20 blobs in a 1000×1000 bounding box this reduces the array passed to `measure_of_chaos` from 4 MB to ~4 KB per formula (~99% reduction). For already-dense images (>90% non-empty in both dimensions), it falls back to `toarray()` since compaction isn't worth the overhead.

Correctness guarantee
Results are bit-for-bit identical to the previous implementation. The compact array is constructed by selecting the exact same rows and columns that `_chaos_dilate` previously selected after `toarray()`, so `measure_of_chaos` sees the same spatial pattern.
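The construction described above can be sketched as follows. This is an illustrative sketch, not the actual implementation: the function name `compact_image` is hypothetical, the `_chaos_dilate` step is elided to a comment, and the placement of the 90% density fallback is an assumption.

```python
import numpy as np
from scipy.sparse import coo_matrix

def compact_image(coo: coo_matrix) -> np.ndarray:
    """Build a compact dense array from a sparse image without
    allocating the full h×w bounding box (illustrative sketch)."""
    h, w = coo.shape
    # Occupied row/column masks from coordinates only: O(nnz).
    row_mask = np.zeros(h, dtype=bool)
    col_mask = np.zeros(w, dtype=bool)
    row_mask[coo.row] = True
    col_mask[coo.col] = True
    # (in the real code, _chaos_dilate would widen these masks here)
    # Fall back to toarray() when the image is already dense.
    if row_mask.mean() > 0.9 and col_mask.mean() > 0.9:
        return coo.toarray()
    # Map original coordinates to compact indices via cumulative sums.
    row_map = np.cumsum(row_mask) - 1
    col_map = np.cumsum(col_mask) - 1
    out = np.zeros((int(row_mask.sum()), int(col_mask.sum())), dtype=coo.dtype)
    # np.add.at sums duplicate coordinates, matching toarray() semantics.
    np.add.at(out, (row_map[coo.row], col_map[coo.col]), coo.data)
    return out
```

For the two-blob example above (two 20×20 blobs, 40 occupied rows and 40 occupied columns), the result collapses to roughly 40×40 instead of 1000×1000.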
Tests
- `test_formula_validator.py` — existing tests updated for the new call signature (mock updated from dense array indexing to `coo_matrix`)
- `spheroid.py` sci test — run with `--database cm3 --analysis-version 3` to confirm chaos shows no diff against the reference CSV
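The duplicate-summing equivalence that the bit-for-bit claim relies on can also be checked in isolation. A standalone sketch with synthetic data (integer-valued so the comparison is exact regardless of summation order):

```python
import numpy as np
from scipy.sparse import coo_matrix

rng = np.random.default_rng(42)
n = 500
# Sparse image with deliberate duplicate coordinates, like the
# merged segments of a split formula.
rows = rng.integers(0, 50, n)
cols = rng.integers(0, 60, n)
data = rng.integers(1, 10, n).astype(float)
coo = coo_matrix((data, (rows, cols)), shape=(50, 60))

# np.add.at accumulates repeated (row, col) pairs ...
dense = np.zeros((50, 60))
np.add.at(dense, (rows, cols), data)

# ... exactly as toarray() does when densifying duplicates.
assert np.array_equal(dense, coo.toarray())
```

Note that plain fancy-indexed assignment (`dense[rows, cols] = data`) would *not* match: it keeps only the last value written per coordinate, which is why the implementation uses `np.add.at`.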