Preflight memory #2366
Additional change required in an unmodified file
brainscore_vision.benchmarks.Benchmark is a separate ABC from brainscore_core.benchmarks.Benchmark, so the new no-op from Core doesn't propagate. Most if not all behavioral/engineering benchmarks will likely crash with an AttributeError at _run_score's score_benchmark(benchmark, model) call.
To resolve this, you probably just have to add the same def preallocate_memory(self, candidate): pass to the Benchmark class in brainscore_vision/benchmarks/__init__.py.
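A minimal sketch of that fix, assuming the vision Benchmark ABC looks roughly like the stand-in below. The abstract `__call__` and the `DummyBehavioralBenchmark` subclass are illustrative only and are not the real class body:

```python
from abc import ABC, abstractmethod


class Benchmark(ABC):
    """Stand-in for brainscore_vision.benchmarks.Benchmark (simplified, assumed shape)."""

    @abstractmethod
    def __call__(self, candidate):
        """Score the candidate model on this benchmark."""

    def preallocate_memory(self, candidate):
        """No-op default so benchmarks without an estimator do not raise AttributeError."""
        pass


class DummyBehavioralBenchmark(Benchmark):
    """Hypothetical behavioral benchmark that never defines preallocate_memory itself."""

    def __call__(self, candidate):
        return 0.0


benchmark = DummyBehavioralBenchmark()
# The inherited no-op is found on the base class and simply returns None.
result = benchmark.preallocate_memory(candidate=None)
```

Because the method lives on the base class, existing behavioral and engineering benchmarks would not need to change.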
Actual review
Outside of that, I think the rest of the comments will help resolve this PR. Depending on the follow-up from integrating with gated scoring, another recommendation might be to structure the failure signal so it can be read directly rather than recovered by string parsing. Edit: Following up on this, how does an OOM from a behavioral benchmark trigger the retry policy in gated scoring? Since behavioral and engineering benchmarks do not run through this path, if they run out of memory on the smallest EC2 instances (as per gated scoring), will they simply be moved up one level to the next queue?
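One possible shape for a structured failure signal, as a rough sketch. The `PreflightFailure` fields, the `PreflightMemoryError` class, the `next_action` policy, and the benchmark identifier are all hypothetical names and are not part of this PR or of gated scoring:

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class PreflightFailure:
    """Hypothetical structured failure record; field names are assumptions."""
    reason: str
    benchmark: str
    estimated_gb: float
    available_gb: float


class PreflightMemoryError(MemoryError):
    """MemoryError subclass that carries the record alongside a JSON message."""

    def __init__(self, failure: PreflightFailure):
        super().__init__(json.dumps(asdict(failure)))
        self.failure = failure


def next_action(exc: MemoryError) -> str:
    """Hypothetical retry policy: escalate only on a recognized pre-flight OOM."""
    failure = getattr(exc, "failure", None)
    if failure is not None and failure.reason == "preflight_oom":
        return "escalate_to_larger_instance"
    return "fail"


try:
    raise PreflightMemoryError(
        PreflightFailure("preflight_oom", "example-benchmark-ridge", 48.0, 32.0)
    )
except MemoryError as exc:
    action = next_action(exc)        # decided from fields, not from message text
    payload = json.loads(str(exc))   # the message itself is also machine-readable
```

The caller can branch on typed fields, and a log consumer that only sees the message can still decode it as JSON.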
Additionally, this PR changes a lot of lines, mostly in code that the core Brain-Score developer team will have to maintain (and that regular users will likely never see). If we continue to use this approach, we should consider logging every run in a table so we can refine the benchmark_costs.json over time.
Moving forward, we should aim to resolve the common causes of OOM, e.g. by switching to dual ridge when n_features > n_stimuli, or by using incremental/chunked solvers when the feature count is large.
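To make the dual ridge suggestion concrete: primal ridge solves a system with an n_features × n_features matrix, while dual ridge solves one with an n_stimuli × n_stimuli matrix and yields the same weights. The sketch below only compares the size of those two dense float32 matrices. The sizes are illustrative and are not measurements from any benchmark, and the helper names are hypothetical:

```python
BYTES_PER_FLOAT32 = 4


def gram_matrix_gb(n: int) -> float:
    """Size in GB (1e9 bytes) of a dense n x n float32 matrix."""
    return n * n * BYTES_PER_FLOAT32 / 1e9


def cheaper_ridge_form(n_stimuli: int, n_features: int) -> str:
    """Pick the ridge formulation whose square system is smaller.

    Primal ridge works with an (n_features x n_features) matrix,
    dual ridge with an (n_stimuli x n_stimuli) one.
    """
    return "dual" if n_features > n_stimuli else "primal"


# Illustrative sizes only.
n_stimuli, n_features = 10_000, 200_000
primal_gb = gram_matrix_gb(n_features)  # system matrix for the primal form
dual_gb = gram_matrix_gb(n_stimuli)     # system matrix for the dual form
choice = cheaper_ridge_form(n_stimuli, n_features)
```

With these numbers the primal matrix alone is hundreds of times larger than the dual one, which is the motivation for switching forms when features outnumber stimuli.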
This PR is an important step towards improving Brain-Score scoring success and allocating compute resources appropriately and cost-effectively.
Summary
Adds a pre-flight memory estimation system that catches OOM errors before committing to a multi-hour benchmark run.
Problem
Scoring large models on neural benchmarks can take 6+ hours. If the job runs out of RAM, you lose all that time with nothing to show for it. There was no way to know upfront whether a model would OOM on a given benchmark.
Solution: Metric-aware memory formula
Before scoring begins, the system estimates total RAM needed using a formula matched to each benchmark's regression type:
[…]: activation_gb × 3
[…] (n_features ≤ n_stimuli, calibrated): activation_gb + fixed_benchmark_cost_gb. […] n_stimuli × n_stimuli, model-independent. Accurate when features are smaller than stimulus count (e.g. alexnet on most benchmarks). Validated: ≤1% error.
[…] (n_features > n_stimuli): activation_gb × 6. […] MemoryError cleanly before the OS kills the container. Validated: −1.5% error for resnet50 and ViT on Gifford2022.
[…] (n_features ≤ n_stimuli, no calibration entry): activation_gb + n_stimuli² × 4 B
[…] (-pls, -reverse_pls): activation_gb × 7 + fixed_benchmark_cost_gb. […] num_features: overhead is not model-independent. Multiplier calibrated on a 3-model × 2-benchmark grid; worst miss: −12.7% (within 15%). Warning printed at runtime.
[…]: activation_gb × 6
activation_gb: model-dependent. Measured by running a single forward pass (the "probe") on 1 stimulus and extrapolating: num_stimuli × num_features × num_timebins × 4 bytes
fixed_benchmark_cost_gb: benchmark-dependent, model-independent. Stored in benchmark_costs.json.
Changes
New file: brainscore_vision/benchmark_helpers/benchmark_costs.json
Calibration table with fixed overhead costs for 49 neural benchmarks. Used for ridge/ridgecv and PLS benchmarks. RDM benchmarks do not require calibration entries.
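As a rough illustration of how a calibration entry could feed the estimate, the sketch below combines the activation extrapolation (num_stimuli × num_features × num_timebins × 4 bytes) with three of the formulas quoted in this description. Several things here are assumptions and not taken from the PR: the JSON layout (identifier mapped to a fixed cost in GB), the example identifier and its value, the use of 1e9 bytes per GB, the branch order, and all function names:

```python
import json

BYTES_PER_FLOAT32 = 4
BYTES_PER_GB = 1e9  # assumption: decimal gigabytes

# Hypothetical calibration entry; the real benchmark_costs.json layout may differ.
CALIBRATION_JSON = '{"example-benchmark-ridge": 1.5}'


def activation_gb(num_stimuli, num_features, num_timebins=1):
    """Extrapolate activation size from a one-stimulus probe's feature count."""
    return num_stimuli * num_features * num_timebins * BYTES_PER_FLOAT32 / BYTES_PER_GB


def estimate_total_gb(num_stimuli, num_features, fixed_cost_gb=None, num_timebins=1):
    """Apply the quoted formulas; the branch order is an assumption."""
    act = activation_gb(num_stimuli, num_features, num_timebins)
    if num_features > num_stimuli:
        return act * 6                       # n_features > n_stimuli
    if fixed_cost_gb is not None:
        return act + fixed_cost_gb           # calibrated entry available
    # no calibration entry: n_stimuli^2 float32 values of overhead
    return act + num_stimuli ** 2 * BYTES_PER_FLOAT32 / BYTES_PER_GB


def check_fits(estimated_gb, available_gb):
    """Fail before scoring starts if the estimate exceeds available RAM."""
    if estimated_gb > available_gb:
        raise MemoryError(
            f"estimated {estimated_gb:.2f} GB exceeds available {available_gb:.2f} GB"
        )


costs = json.loads(CALIBRATION_JSON)
calibrated = estimate_total_gb(1000, 500, costs.get("example-benchmark-ridge"))
wide = estimate_total_gb(1000, 100_000)
uncalibrated = estimate_total_gb(1000, 500)
```

The PLS formula (activation_gb × 7 + fixed_benchmark_cost_gb) is left out to keep the sketch short; it would be one more branch keyed on the benchmark type.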
brainscore_vision/benchmark_helpers/memory.py
preallocate_memory(model, benchmark): probes the model with 1 stimulus, dispatches to the correct formula based on benchmark type, and raises MemoryError if the estimate exceeds available RAM
[…] _is_pls_minimum, _is_rdm_benchmark, _is_ridge_benchmark. Detection is by identifier suffix (-pls, -reverse_pls, -temporal-pls, -rdm, -ridge, -ridgecv) and/or an isinstance check for RSABenchmark
RSABenchmark support: previously raised TypeError for RSA/RDM benchmarks; now fully handled with the model-independent n_stimuli² formula
MemoryEstimate dataclass: added formula_type field ('pls', 'rdm', 'ridge_formula', 'calibrated', 'fallback') and rdm_overhead_gb field; __str__ renders the correct formula per type
[…] WARNING: PLS overhead multiplier (×7) is approximate. Actual usage can vary significantly depending on model feature count and convergence.
load_calibration() / save_calibration(): unchanged; automatically loads bundled benchmark_costs.json; falls back to ~/.brainscore/benchmark_costs.json
brainscore_vision/__init__.py
[…] score() call via score_benchmark
[…] AssertionError message for stale activations cache, with the exact rm command to fix it
scripts/preflight_check.py (new)
Integration test script for a single (model, benchmark) pair:
scripts/mem_profile_suite.py
--calibrate mode: runs alexnet on all neural benchmarks to produce the fixed-cost table
--resume-from BENCHMARK_ID to pick up after a crash
scripts/validation.py (new)
Runs a 3-model × 3-benchmark grid to validate how accurately the pre-flight estimator predicts actual peak RSS. Reports over/under estimates per pair and writes results to validation_results.jsonl.
How to extend the calibration table
Note: This PR depends on changes in core, specifically PR #168.