Speed up CI: fix NY tax-benefit-system cloning, run batches concurrently, rebalance to 14 runners #8886
Draft
hua7450 wants to merge 4 commits into
The NY EITC and pre-2024 CTC formulas deep-cloned the entire tax-benefit system (parameter tree + variable registry) on every call that hit the decoupled/pre-TCJA branch, driving the NY credits test memory to ~12 GB and requiring a dedicated quarantined CI runner. Build each pinned system (pre-ARPA 2020 EITC, pre-TCJA 2017 CTC) once per process in a module-level cache keyed on the baseline system's identity, so layered reforms still rebuild against their own base. Measured on the credits test files: ny_ctc_pre_2024.yaml 373s -> 57s, ny_eitc.yaml 106s -> 53s, peak RSS 4.8 -> 2.8 GB, outputs identical.

Fixes PolicyEngine#8114

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
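For reviewers, here is a minimal sketch of the caching idea in this commit, not the actual contents of pinned_tbs.py. The names `get_pinned_system`, `FakeSystem`, and `build_pre_arpa` are hypothetical stand-ins; the real module may differ in structure and in how it pins parameters. It assumes the baseline system object supports weak references.

```python
import copy
import weakref

# Hypothetical module-level cache: id(baseline) -> {pin name: pinned system}.
_pinned_cache = {}


def get_pinned_system(baseline, pin_name, build):
    """Return the pinned system for this baseline, building it at most once."""
    key = id(baseline)
    per_baseline = _pinned_cache.get(key)
    if per_baseline is None:
        per_baseline = _pinned_cache[key] = {}
        # Drop the entry when the baseline is garbage-collected, so a
        # recycled id() can never serve a stale pinned system.
        weakref.finalize(baseline, _pinned_cache.pop, key, None)
    if pin_name not in per_baseline:
        per_baseline[pin_name] = build(baseline)
    return per_baseline[pin_name]


class FakeSystem:
    """Stand-in for a tax-benefit system (illustration only)."""

    def __init__(self, parameters):
        self.parameters = parameters


build_calls = []


def build_pre_arpa(baseline):
    """Expensive step: deep-copy the baseline and pin one parameter."""
    build_calls.append(id(baseline))
    pinned = copy.deepcopy(baseline)
    pinned.parameters["eitc_year"] = 2020
    return pinned
```

Because the key is the baseline object itself rather than a global singleton, a reformed system passed in as `baseline` gets its own pinned copy, which matches the "layered reforms still rebuild against their own base" behavior described above.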
New --workers N flag (default 1 = unchanged sequential behavior) runs batch subprocesses concurrently, longest-first by YAML file count, with per-batch output buffered and printed atomically plus a 60s heartbeat. Each batch now reports its subprocess peak RSS (VmHWM from /proc on Linux) so CI logs show real per-batch memory. Grace sleep after pytest completion reduced from 5s to 1s.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
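The scheduling described in this commit can be sketched roughly as follows. This is an illustration under assumptions, not the code in test_batched.py: `run_batch`, `run_batches`, and the `(name, yaml_file_count, command)` tuple shape are hypothetical, and the heartbeat and peak-RSS reporting are left out for brevity.

```python
import subprocess
import sys
import threading
from concurrent.futures import ThreadPoolExecutor

# One lock so a batch's buffered output is printed as a single block.
_print_lock = threading.Lock()


def run_batch(name, command):
    """Run one batch in its own subprocess and print its output atomically."""
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    with _print_lock:
        print(f"=== {name} (exit {result.returncode}) ===")
        print(result.stdout, end="")
    return name, result.returncode, result.stdout


def run_batches(batches, workers=1):
    """Run (name, yaml_file_count, command) batches, largest file count first."""
    ordered = sorted(batches, key=lambda batch: batch[1], reverse=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(run_batch, name, command)
            for name, _count, command in ordered
        ]
        # Results come back in submission (longest-first) order.
        return [future.result() for future in futures]
```

With `workers=1` the pool processes one batch at a time, so the batches run sequentially. Starting the largest batches first keeps a long batch from being scheduled last and stretching the tail of the job.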
With the NY clone fix, states-ny folds back into the states matrix (8 batches, 2 shards). Reform and hhs merge into the baseline-contrib runner, usda moves to the ssa runner, contrib other-shards 1+3 merge, and the partners job fans analytics_coverage/edge_cases out per topic/state folder (invocation-only; partner files untouched). Heavy targets run batches 2-wide, the light rest job 3-wide. All test jobs set PYTEST_ADDOPTS to print the 25 slowest cases and skip pytest's unraisableexception gc sweep (~11s per subprocess).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
test_microsim.py now builds one Microsimulation per dataset via a module-scoped fixture shared across the parametrized years (2 full builds instead of 4; ~17 min of the Rest CI job). The RUN_HEAVY_TESTS LSR/CG interaction tests (still skipped by default) subsample to 10,000 households and reuse a single baseline, so opting in is cheaper.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
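A rough sketch of the fixture pattern this commit describes, assuming pytest is installed. The dataset names, years, and `build_simulation` are placeholders, not the real test_microsim.py contents; the dictionary stands in for an expensive Microsimulation build.

```python
import pytest

DATASETS = ["dataset_a", "dataset_b"]  # hypothetical dataset names
YEARS = [2024, 2025]                   # hypothetical parametrized years

BUILD_COUNT = {"n": 0}


def build_simulation(dataset):
    """Stand-in for the expensive per-dataset simulation build."""
    BUILD_COUNT["n"] += 1
    return {"dataset": dataset}


@pytest.fixture(scope="module", params=DATASETS)
def sim(request):
    # Module scope: the value is reused by every test in this module that
    # requests the same dataset param, instead of being rebuilt per test.
    return build_simulation(request.param)


@pytest.mark.parametrize("year", YEARS)
def test_simulation_runs(sim, year):
    assert sim["dataset"] in DATASETS
    assert year in YEARS
```

pytest groups tests that share a module-scoped fixture parameter, so the four dataset/year combinations here should need two builds rather than four. A function-scoped fixture would rebuild for every combination.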
Summary
CI currently takes ~34 min wall-clock across 17 heavy runners. Decomposing a recent run (28628076632) batch-by-batch showed the time goes to three things: a memory workaround tax (153 sequential batch subprocesses × ~33s fixed startup ≈ 83 runner-minutes per PR run), idle CPUs (batches run one at a time, single-threaded, on 4-vCPU runners), and a handful of pathologically slow test cases — led by the NY EITC/CTC formulas deep-cloning the entire tax-benefit system per formula call (#8114).
This PR targets all three. Expected result (simulated from the measured per-batch times): ~20-21 min wall on 14 runners (from ~34 min on 17).
Fixes #8114
Changes
1. NY #8114 fix (policyengine_us/tools/pinned_tbs.py + 3 NY credit variables)

Build each pinned system (pre-ARPA 2020 EITC, pre-TCJA 2017 CTC) once per process in a module-level cache keyed on the baseline system's identity (weakref, so reformed systems rebuild against their own base). Measured on the NY credits test files: ny_ctc_pre_2024.yaml 373s → 57s, ny_eitc.yaml 106s → 53s, peak RSS 4.8 → 2.8 GB, outputs identical. The whole credits folder now runs in one subprocess (~3.3 GB), so the quarantined states-ny runner is retired.

2. Concurrent batches in test_batched.py

New --workers N flag (default 1 = today's behavior). Batches run longest-first in a thread pool, each still an isolated subprocess; output is buffered per batch and printed atomically with a 60s heartbeat. Each batch footer now reports the subprocess's true peak RSS (VmHWM from /proc), so future memory tuning reads straight off CI logs.

3. Job re-layout: 17 → 14 heavy runners

- states-ny folded into the two states shards (--batches 8 --workers 2, no more --exclude ny)
- Reform + gov/hhs merged into the baseline-contrib runner; gov/usda moved to the ssa runner
- other-shard-1 + other-shard-3 merged
- analytics_coverage/edge_cases (91% of the job as one 26.7-min batch) fans out per topic/state folder, 2-wide; invocation-only, no partner contract files were edited or moved
- PYTEST_ADDOPTS="--durations=25 -p no:unraisableexception -p no:threadexception": every CI log now ends with the 25 slowest test cases, and each subprocess skips pytest 8.4's ~11s full-heap gc sweep at exit

4. Microsimulation test cost (Rest job)

test_microsim.py builds one Microsimulation per dataset (module-scoped fixture shared across year params): 2 full builds instead of 4. This was the real ~17-min hotspot of the Rest job (the heavy LSR/CG files were already skipped via RUN_HEAVY_TESTS). Those heavy tests now also subsample (10k households) and share a baseline, so opting in is much cheaper; they remain skipped by default.

Validation

- py_compile on all touched Python, PyYAML parse of both workflows, make -n on all 21 changed/new targets, and a batch-composition simulation confirming the new 14-job layout covers 4,210/4,210 YAML test files (no coverage lost)
- --durations tables are all now visible in the logs

Notes for maintainers

- Baseline (states-ny), Reform (per-file), Contrib (other-shard-1), Contrib (other-shard-3); renamed states-non-ny-shard-1/2 → states-shard-1/2, ssa → ssa-usda, contrib → contrib-reform-hhs; new Contrib (other-shards-1-3)
- --workers to 1 restores exactly the old behavior
- push.yaml Publish.needs updated to the new job set

Test plan
🤖 Generated with Claude Code