Preflight memory #2366

Open
mike-ferguson wants to merge 27 commits into master from preflight_memory

mike-ferguson (Member) commented Apr 16, 2026

Summary

Adds a pre-flight memory estimation system that catches OOM errors before committing to a multi-hour benchmark run.

Problem

Scoring large models on neural benchmarks can take 6+ hours. If the job runs out of RAM, you lose all that time with nothing to show for it. There was no way to know upfront whether a model would OOM on a given benchmark.

Solution: Metric-aware memory formula

Before scoring begins, the system estimates total RAM needed using a formula matched to each benchmark's regression type:

| Benchmark type | Formula | Notes |
|---|---|---|
| RDM/RSA | activation_gb × 3 | Pairwise distance computation passes through the full activation matrix, so overhead scales with features, not stimulus count. Validated: ≤4% error across alexnet/resnet50/ViT. |
| Ridge/RidgeCV (n_features ≤ n_stimuli, calibrated) | activation_gb + fixed_benchmark_cost_gb | sklearn primal solver; the gram matrix is n_stimuli×n_stimuli and therefore model-independent. Accurate when the feature count is smaller than the stimulus count (e.g. alexnet on most benchmarks). Validated: ≤1% error. |
| Ridge/RidgeCV (n_features > n_stimuli) | activation_gb × 6 | sklearn switches to SVD of X; overhead ≈ 5× activation, so the calibrated fixed cost (measured on small models) severely underestimates. Falls back to ×6 so pre-flight raises MemoryError cleanly before the OS kills the container. Validated: −1.5% error for resnet50 and ViT on Gifford2022. |
| Ridge/RidgeCV (n_features ≤ n_stimuli, no calibration entry) | activation_gb + n_stimuli²×4B | Formula fallback using gram matrix size. |
| PLS (-pls, -reverse_pls) | activation_gb × 7 + fixed_benchmark_cost_gb | PLS cross-covariance matrices scale with num_features, so overhead is not model-independent. Multiplier calibrated on a 3-model × 2-benchmark grid; worst miss: −12.7% (within 15%). Warning printed at runtime. |
| Fallback | activation_gb × 6 | Used when benchmark type is unrecognised. |

  • activation_gb — model-dependent. Measured by running a single forward pass (the "probe") on 1 stimulus and extrapolating: num_stimuli × num_features × num_timebins × 4 bytes
  • fixed_benchmark_cost_gb — benchmark-dependent, model-independent. Stored in benchmark_costs.json.
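
The formula table can be sketched as a single dispatch function. This is a minimal illustration and not the code in memory.py: the function names `activation_gb` and `estimate_total_gb`, the string labels for the benchmark types, the use of binary gigabytes, and treating a missing PLS calibration entry as zero are all assumptions made for the example.

```python
from typing import Optional

BYTES_PER_FLOAT32 = 4
GB = 1024 ** 3  # assumption: binary gigabytes; the PR does not state the unit


def activation_gb(num_stimuli: int, num_features: int, num_timebins: int = 1) -> float:
    """Extrapolate the full activation size from the single-stimulus probe shape."""
    return num_stimuli * num_features * num_timebins * BYTES_PER_FLOAT32 / GB


def estimate_total_gb(benchmark_type: str, act_gb: float, num_stimuli: int,
                      num_features: int, fixed_cost_gb: Optional[float] = None) -> float:
    """Pick the formula from the table for a given benchmark type (hypothetical labels)."""
    if benchmark_type == "rdm":
        return act_gb * 3
    if benchmark_type == "pls":
        # assumption: a missing calibration entry contributes 0 GB
        return act_gb * 7 + (fixed_cost_gb or 0.0)
    if benchmark_type == "ridge":
        if num_features > num_stimuli:
            return act_gb * 6  # calibrated cost would underestimate here
        if fixed_cost_gb is not None:
            return act_gb + fixed_cost_gb  # calibrated path
        return act_gb + num_stimuli ** 2 * BYTES_PER_FLOAT32 / GB  # gram-matrix fallback
    return act_gb * 6  # unrecognised benchmark type
```

Keeping the dispatch in one pure function makes each row of the table straightforward to unit-test without loading a model.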

Changes

New file: brainscore_vision/benchmark_helpers/benchmark_costs.json

Calibration table with fixed overhead costs for 49 neural benchmarks. Used for ridge/ridgecv and PLS benchmarks. RDM benchmarks do not require calibration entries.

brainscore_vision/benchmark_helpers/memory.py

  • preallocate_memory(model, benchmark) — probes the model with 1 stimulus, dispatches to the correct formula based on benchmark type, and raises MemoryError if estimate exceeds available RAM
  • Metric-type detection — three new helpers: _is_pls_minimum, _is_rdm_benchmark, _is_ridge_benchmark. Detection is by identifier suffix (-pls, -reverse_pls, -temporal-pls, -rdm, -ridge, -ridgecv) and/or isinstance check for RSABenchmark
  • RSABenchmark support — previously raised TypeError for RSA/RDM benchmarks; now fully handled with the model-independent n_stimuli² formula
  • MemoryEstimate dataclass — added formula_type field ('pls', 'rdm', 'ridge_formula', 'calibrated', 'fallback') and rdm_overhead_gb field; __str__ renders the correct formula per type
  • PLS overhead multiplier — reduced from ×10 to ×7; a warning is now printed when PLS formula is used: WARNING: PLS overhead multiplier (×7) is approximate. Actual usage can vary significantly depending on model feature count and convergence.
  • load_calibration() / save_calibration() — unchanged; automatically loads bundled benchmark_costs.json; falls back to ~/.brainscore/benchmark_costs.json
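
As a rough sketch of the suffix-based detection and the OOM check described in this list: `classify_benchmark` and `check_fits` are hypothetical names, the `isinstance` check for RSABenchmark is left out, and the available RAM is passed in as a plain number rather than queried from the system.

```python
# Suffixes taken from the PR description; str.endswith accepts a tuple of candidates.
PLS_SUFFIXES = ("-pls", "-reverse_pls", "-temporal-pls")
RDM_SUFFIXES = ("-rdm",)
RIDGE_SUFFIXES = ("-ridge", "-ridgecv")


def classify_benchmark(identifier: str) -> str:
    """Map a benchmark identifier to a formula label by its suffix (hypothetical helper)."""
    if identifier.endswith(PLS_SUFFIXES):
        return "pls"
    if identifier.endswith(RDM_SUFFIXES):
        return "rdm"
    if identifier.endswith(RIDGE_SUFFIXES):
        return "ridge"
    return "fallback"


def check_fits(estimated_gb: float, available_gb: float) -> None:
    """Raise MemoryError before scoring if the estimate exceeds available RAM."""
    if estimated_gb > available_gb:
        raise MemoryError(
            f"pre-flight estimate {estimated_gb:.1f} GB exceeds "
            f"available {available_gb:.1f} GB"
        )
```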

brainscore_vision/__init__.py

  • Pre-flight runs automatically on every score() call via score_benchmark
  • Improved AssertionError message for stale activations cache, with the exact rm command to fix it

scripts/preflight_check.py (new)

Integration test script for a single (model, benchmark) pair.

scripts/mem_profile_suite.py

  • Added --calibrate mode: runs alexnet on all neural benchmarks to produce the fixed-cost table
  • Saves incrementally after each benchmark (crash-safe)
  • Added --resume-from BENCHMARK_ID to pick up after a crash
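
One way the incremental save and resume behaviour could work is sketched below. It is an illustration only: the CSV column names, the function names, and the choice to re-run the benchmark named by `resume_from` (rather than start after it) are assumptions, and `measure` stands in for whatever actually profiles a benchmark.

```python
import csv
import os


def append_result(csv_path, benchmark_id, fixed_cost_gb):
    """Append one row and close the file, so a later crash cannot lose it."""
    write_header = not os.path.exists(csv_path)
    with open(csv_path, "a", newline="") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(["benchmark", "fixed_cost_gb"])  # hypothetical columns
        writer.writerow([benchmark_id, fixed_cost_gb])


def benchmarks_to_run(all_ids, resume_from=None):
    """Return the benchmarks still to profile; resume_from is treated as inclusive."""
    ids = list(all_ids)
    if resume_from is None:
        return ids
    return ids[ids.index(resume_from):]  # raises ValueError for an unknown id


def calibrate(all_ids, measure, csv_path, resume_from=None):
    """Profile each benchmark and persist the result before moving to the next."""
    for benchmark_id in benchmarks_to_run(all_ids, resume_from):
        append_result(csv_path, benchmark_id, measure(benchmark_id))
```

Writing after every benchmark trades a little I/O for the guarantee that a multi-hour calibration run never has to restart from the first benchmark.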

scripts/validation.py (new)

Runs a 3-model × 3-benchmark grid to validate how accurately the pre-flight estimator predicts actual peak RSS. Reports over/under estimates per pair and writes results to validation_results.jsonl.
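
The per-pair report might be computed along these lines. The sketch assumes the error percentages quoted in this description are signed relative errors, (estimate − actual) / actual, with negative values meaning an underestimate; the field names in the record are hypothetical and do not reflect the actual layout of validation_results.jsonl.

```python
import json


def error_pct(estimated_gb: float, actual_peak_gb: float) -> float:
    """Signed relative error in percent; negative means the estimate was too low."""
    return (estimated_gb - actual_peak_gb) / actual_peak_gb * 100.0


def make_record(model: str, benchmark: str, estimated_gb: float, actual_peak_gb: float) -> dict:
    """Build one result row for a (model, benchmark) pair (hypothetical field names)."""
    pct = error_pct(estimated_gb, actual_peak_gb)
    return {
        "model": model,
        "benchmark": benchmark,
        "estimated_gb": estimated_gb,
        "actual_peak_gb": actual_peak_gb,
        "error_pct": round(pct, 1),
        "verdict": "under" if pct < 0 else "over",
    }


def to_jsonl(records) -> str:
    """Serialise records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(record) + "\n" for record in records)
```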


How to extend the calibration table

```bash
# Run alexnet on all benchmarks and save results
python scripts/mem_profile_suite.py --calibrate --csv ~/calibration.csv

# Resume after a crash
python scripts/mem_profile_suite.py --calibrate --csv ~/calibration.csv \
    --resume-from MajajHong2015public.IT-pls
```

## Tests (`tests/test_plugin_management/test_memory_precheck.py`)

- **`TestCalibrationIO`** — load/save roundtrip, missing file, corrupt JSON, directory creation
- **`TestCalibratedFormula`** — calibrated, RDM formula, ridge formula fallback, PLS ×7, fallback ×6; OOM detection for each path
- **`TestMemoryEstimateStr`**: `__str__` renders correct formula label per type; `OK` / `OOM LIKELY` status
- **`TestCalibratedIntegration`** — full roundtrip with mock model and benchmark; confirms `score_benchmark` calls `preallocate_memory` before scoring

---

## Still ToDo

- [x] Complete remaining benchmark estimates via EC2 runs (in progress) to populate JSON

Note: This PR depends on changes in core, specifically PR #168.

mike-ferguson requested a review from KartikP on May 7, 2026, 19:35
KartikP (Collaborator) left a comment

Additional change required in an unmodified file

brainscore_vision.benchmarks.Benchmark is a separate ABC from brainscore_core.benchmarks.Benchmark, so the new no-op from Core doesn't propagate. Most if not all behavioral/engineering benchmarks will likely crash with an AttributeError at _run_score's score_benchmark(benchmark, model) call.

To resolve this, you probably just have to add the same def preallocate_memory(self, candidate): pass to brainscore_vision/benchmarks/__init__.py Benchmark.
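
A minimal sketch of that fix, using stand-in classes rather than the real Brain-Score ones: the `Benchmark` ABC, `BehavioralBenchmark`, and `score_benchmark` below are simplified illustrations, and only the `preallocate_memory(self, candidate)` signature comes from the suggestion above.

```python
from abc import ABC, abstractmethod


class Benchmark(ABC):
    """Stand-in for the vision-side Benchmark ABC (not the real class)."""

    @abstractmethod
    def __call__(self, candidate):
        ...

    def preallocate_memory(self, candidate):
        # No-op default: benchmarks without a memory estimator simply skip pre-flight.
        pass


class BehavioralBenchmark(Benchmark):
    """Hypothetical benchmark that never defines its own pre-flight check."""

    def __call__(self, candidate):
        return 1.0


def score_benchmark(benchmark, model):
    """Simplified stand-in: pre-flight first, then score."""
    benchmark.preallocate_memory(model)  # would raise AttributeError without the default
    return benchmark(model)
```

With the default in the base class, neural benchmarks can override the method while behavioral and engineering benchmarks inherit the no-op.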

Actual review

Outside of that, I think the rest of the comments will help resolve this PR. Depending on the follow-up from integrating with gated scoring, another recommendation would be to emit a structured failure signal that can be parsed directly, rather than relying on string parsing. Edit: Following up on this, how does an OOM from a behavioral benchmark trigger the retry policy in gated scoring? Since behavioral and engineering benchmarks do not run through this path, if they run out of memory on the smallest EC2 instances (as per gated scoring), will they just be moved up one level to the next queue?
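
To make the structured-signal idea concrete, here is one possible shape. It is purely a sketch: `PreflightMemoryError` and its fields are hypothetical names, not something that exists in this PR. Subclassing MemoryError keeps existing `except MemoryError` handlers working.

```python
import json


class PreflightMemoryError(MemoryError):
    """Hypothetical structured OOM signal carrying machine-readable fields."""

    def __init__(self, benchmark_id, formula_type, estimated_gb, available_gb):
        super().__init__(
            f"{benchmark_id}: estimated {estimated_gb:.1f} GB "
            f"exceeds available {available_gb:.1f} GB"
        )
        self.benchmark_id = benchmark_id
        self.formula_type = formula_type
        self.estimated_gb = estimated_gb
        self.available_gb = available_gb

    def to_json(self) -> str:
        """Serialise the fields so a scheduler can read them without string parsing."""
        return json.dumps({
            "error": "preflight_oom",
            "benchmark_id": self.benchmark_id,
            "formula_type": self.formula_type,
            "estimated_gb": self.estimated_gb,
            "available_gb": self.available_gb,
        })
```

A caller could then catch the exception, read `estimated_gb`, and pick an instance size directly instead of matching on the message text.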

Additionally, this PR changes a lot of lines, largely in code that the core Brain-Score developer team will have to maintain (and that regular users are unlikely to see). If we continue to use this approach, we should consider logging every run in a table so we can refine benchmark_costs.json over time.

Moving forward, we should aim to resolve common causes of OOM, e.g., by switching to dual ridge when n_features > n_stimuli, or by using incremental/chunked solvers when the feature count is large.
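
For reference, the dual-ridge idea rests on a standard identity. The notation is introduced here for illustration: $X \in \mathbb{R}^{n \times p}$ with $n$ = n_stimuli and $p$ = n_features, targets $y$, and regularisation strength $\alpha > 0$.

```latex
% Primal form: solves a p x p system (p = n_features)
\hat{w} = \left(X^\top X + \alpha I_p\right)^{-1} X^\top y

% Dual form: solves an n x n system (n = n_stimuli)
\hat{w} = X^\top \left(X X^\top + \alpha I_n\right)^{-1} y

% The two agree because
\left(X^\top X + \alpha I_p\right) X^\top = X^\top \left(X X^\top + \alpha I_n\right)
```

When $p \gg n$, the dual form only needs the $n \times n$ Gram matrix $X X^\top$ (about $4n^2$ bytes in float32) in addition to the activations, instead of a $p \times p$ matrix.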

This PR is an important step towards improving Brain-Score scoring success and allocating compute resources appropriately and cost-effectively.

Comment thread brainscore_vision/benchmark_helpers/neural_common.py
Comment thread brainscore_vision/benchmark_helpers/memory.py
Comment thread tests/test_plugin_management/test_memory_precheck.py Outdated
Comment thread brainscore_vision/__init__.py Outdated
Comment thread brainscore_vision/benchmark_helpers/memory.py Outdated
mike-ferguson reopened this on May 27, 2026