Speed up CI: fix NY tax-benefit-system cloning, run batches concurrently, rebalance to 14 runners by hua7450 · Pull Request #8886 · PolicyEngine/policyengine-us

hua7450 · 2026-07-04T06:59:44Z

Summary

CI currently takes ~34 min wall-clock across 17 heavy runners. Decomposing a recent run (28628076632) batch by batch showed that the time goes to three things: a memory-workaround tax (153 sequential batch subprocesses × ~33s fixed startup ≈ 83 runner-minutes per PR run), idle CPUs (batches run one at a time, single-threaded, on 4-vCPU runners), and a handful of pathologically slow test cases, led by the NY EITC/CTC formulas deep-cloning the entire tax-benefit system on every formula call (#8114).

This PR targets all three. Expected result (simulated from the measured per-batch times): ~20-21 min wall on 14 runners (from ~34 min on 17).
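The projection above is a schedule simulation over measured per-batch times. A minimal sketch of that kind of simulation follows, assuming a batch starts as soon as a worker slot frees up and that batches are taken longest-first. The function name and the toy durations are illustrative only; they are not the PR's script or its measurements, and the sketch ignores CPU and memory contention between co-scheduled batches.

```python
import heapq


def simulate_wall_time(durations, workers=1):
    """Return the wall time of running `durations` longest-first on `workers` slots."""
    if workers < 1:
        raise ValueError("workers must be >= 1")
    # Each heap entry is the time at which one worker slot becomes free.
    free_at = [0.0] * workers
    heapq.heapify(free_at)
    for duration in sorted(durations, reverse=True):
        start = heapq.heappop(free_at)  # the earliest-free slot takes the next batch
        heapq.heappush(free_at, start + duration)
    return max(free_at)


if __name__ == "__main__":
    toy = [5, 4, 3, 3, 3]  # illustrative batch minutes, not measured values
    print(simulate_wall_time(toy, workers=1))  # 18.0
    print(simulate_wall_time(toy, workers=2))  # 10.0
```

Running a job's measured batch times through a function like this for each candidate `--workers` value gives a per-job estimate; the slowest job then sets the pipeline's critical path.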

Fixes #8114

Changes

1. NY #8114 fix (`policyengine_us/tools/pinned_tbs.py` + 3 NY credit variables)

Build each pinned system (pre-ARPA 2020 EITC, pre-TCJA 2017 CTC) once per process in a module-level cache keyed on the baseline system's identity (weakref, so reformed systems rebuild against their own base). Measured on the NY credits test files: ny_ctc_pre_2024.yaml 373s → 57s, ny_eitc.yaml 106s → 53s, peak RSS 4.8 → 2.8 GB, outputs identical. The whole credits folder now runs in one subprocess (~3.3 GB), so the quarantined states-ny runner is retired.
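The caching idea can be sketched as below. This is not the contents of `pinned_tbs.py`: `ToySystem`, `get_pinned_system`, and the `pin` callback are hypothetical names, and a `weakref.WeakKeyDictionary` is one assumed way to key a cache on the baseline object's identity without keeping that baseline alive.

```python
import copy
import weakref


class ToySystem:
    """Hypothetical stand-in for a tax-benefit system (parameters only)."""

    def __init__(self, parameters):
        self.parameters = parameters

    def clone(self):
        return ToySystem(copy.deepcopy(self.parameters))


# baseline system -> {label: pinned system}; an entry disappears once its
# baseline is garbage-collected, because the key is held only weakly.
_PINNED = weakref.WeakKeyDictionary()
BUILD_LOG = []  # one entry per expensive build, kept here for illustration


def get_pinned_system(baseline, label, pin):
    """Return the pinned clone of `baseline` for `label`, building it at most once."""
    per_baseline = _PINNED.setdefault(baseline, {})
    if label not in per_baseline:
        pinned = baseline.clone()  # the expensive step, now once per baseline
        pin(pinned)                # overwrite parameters with the pinned-year values
        per_baseline[label] = pinned
        BUILD_LOG.append(label)
    return per_baseline[label]
```

Because the key is the baseline object itself, a reformed system is a different key and gets its own pinned clone, while repeated formula calls against the same baseline reuse the cached one. The pinned clone must not hold a strong reference back to its baseline, or the weak key would never be released.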

2. Concurrent batches in `test_batched.py`

New --workers N flag (default 1 = today's behavior). Batches run longest-first in a thread pool, and each is still an isolated subprocess. Output is buffered per batch and printed atomically, with a 60s heartbeat in between. Each batch footer now reports the subprocess's true peak RSS (VmHWM from /proc), so future memory tuning reads straight off CI logs.
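A minimal sketch of this scheduling pattern, using only the standard library. It is not the code in `test_batched.py`: `run_batch`, `run_batches`, and the `(name, command, size)` tuple shape are assumptions made for illustration.

```python
import subprocess
import sys
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_batch(name, command):
    """Run one batch in its own subprocess and capture all of its output."""
    proc = subprocess.run(command, capture_output=True, text=True)
    return name, proc.returncode, proc.stdout + proc.stderr


def run_batches(batches, workers=1, heartbeat=60.0):
    """Run (name, command, size) batches, largest first, `workers` at a time."""
    ordered = sorted(batches, key=lambda batch: batch[2], reverse=True)
    exit_codes = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Submission order is start order, so the largest batches start first.
        pending = {pool.submit(run_batch, name, cmd) for name, cmd, _ in ordered}
        while pending:
            done, pending = wait(pending, timeout=heartbeat, return_when=FIRST_COMPLETED)
            if not done:
                print(f"... {len(pending)} batch(es) still running", flush=True)
            for future in done:
                name, code, output = future.result()
                # Only this thread prints, so a batch's output is never interleaved.
                print(f"=== {name} (exit {code}) ===\n{output}", end="", flush=True)
                exit_codes[name] = code
    return exit_codes


if __name__ == "__main__":
    demo = [
        ("small", [sys.executable, "-c", "print('small batch')"], 1),
        ("large", [sys.executable, "-c", "print('large batch')"], 5),
    ]
    print(run_batches(demo, workers=2))
```

Threads are enough here because each worker only waits on a child process; the isolation between batches comes from the subprocess boundary, not from the pool.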

3. Job re-layout: 17 → 14 heavy runners

- states-ny folded into the two states shards (--batches 8 --workers 2, no more --exclude ny)
- Reform + gov/hhs merged into the baseline-contrib runner; gov/usda moved to the ssa runner
- Contrib other-shard-1 + other-shard-3 merged
- Partners job: analytics_coverage/edge_cases (91% of the job as one 26.7-min batch) fans out per topic/state folder, 2-wide. This is an invocation-only change; no partner contract files were edited or moved
- All test jobs set PYTEST_ADDOPTS="--durations=25 -p no:unraisableexception -p no:threadexception": every CI log now ends with the 25 slowest test cases, and each subprocess skips pytest 8.4's ~11s full-heap gc sweep at exit
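pytest reads `PYTEST_ADDOPTS` from the environment, splits it shell-style, and prepends the result to its command-line arguments, so one variable at the job level reaches every batch subprocess. A small sketch of how a runner script could pass it on explicitly; `batch_env` is a hypothetical helper, not something taken from the workflows.

```python
import os
import shlex

ADDOPTS = "--durations=25 -p no:unraisableexception -p no:threadexception"


def batch_env(base_env=None):
    """Return a copy of the environment with PYTEST_ADDOPTS set for child pytest runs."""
    env = dict(os.environ if base_env is None else base_env)
    env["PYTEST_ADDOPTS"] = ADDOPTS
    return env


if __name__ == "__main__":
    # The extra arguments every child pytest process effectively receives:
    print(shlex.split(ADDOPTS))
    # ['--durations=25', '-p', 'no:unraisableexception', '-p', 'no:threadexception']
```

Passing `env=batch_env()` to `subprocess.run` has the same effect as exporting the variable in the workflow; child processes also inherit it automatically when it is set at the job level.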

4. Microsimulation test cost (Rest job)

test_microsim.py builds one Microsimulation per dataset (module-scoped fixture shared across year params): 2 full builds instead of 4 — this was the real ~17-min hotspot of the Rest job (the heavy LSR/CG files were already skipped via RUN_HEAVY_TESTS). Those heavy tests now also subsample (10k households) and share a baseline, so opting in is much cheaper; they remain skipped by default.
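The fixture pattern looks roughly like the sketch below. The dataset names, years, and `build_simulation` are placeholders rather than the contents of `test_microsim.py`; the point is the `scope="module"` fixture parametrized over datasets, combined with a test parametrized over years.

```python
import pytest

DATASETS = ["dataset_a", "dataset_b"]  # hypothetical dataset names
YEARS = [2024, 2025]                   # hypothetical parametrized years
BUILD_LOG = []                         # records every expensive build


def build_simulation(dataset):
    """Stand-in for an expensive Microsimulation build."""
    BUILD_LOG.append(dataset)
    return {"dataset": dataset}


@pytest.fixture(scope="module", params=DATASETS)
def sim(request):
    # Module scope: built once per dataset and reused for every year below.
    return build_simulation(request.param)


@pytest.mark.parametrize("year", YEARS)
def test_runs_for_year(sim, year):
    assert sim["dataset"] in DATASETS
    assert year in YEARS
```

pytest groups tests that share a module-scoped fixture instance, so under these assumptions the four test cases should trigger two builds. With a function-scoped fixture, each of the four cases would build its own simulation.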

Validation

- NY fix behavior was validated empirically before porting: identical test outputs, timings above (Apple Silicon; ratios are what carry to CI)
- Static checks: py_compile on all touched Python, PyYAML parse of both workflows, make -n on all 21 changed/new targets, and a batch-composition simulation confirming the new 14-job layout covers 4,210/4,210 YAML test files (no coverage lost)
- The real verdict is this PR's own Actions run: job wall times, per-batch peak RSS, and --durations tables are all now visible in the logs
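A coverage check of the kind described in the static checks can be sketched as follows. The function name and the toy job layout are illustrative; the PR's actual simulation script is not shown, and flagging duplicated or unknown files goes beyond what the description says it verifies.

```python
from collections import Counter


def check_coverage(all_files, jobs):
    """Compare the files assigned to jobs against the full list of test files.

    `jobs` maps a job name to the files it runs. Returns three sorted lists:
    files no job runs, files more than one job runs, and files unknown to `all_files`.
    """
    counts = Counter(path for files in jobs.values() for path in files)
    expected = set(all_files)
    missing = sorted(expected - set(counts))
    duplicated = sorted(path for path, n in counts.items() if n > 1)
    unknown = sorted(set(counts) - expected)
    return missing, duplicated, unknown


if __name__ == "__main__":
    files = ["a.yaml", "b.yaml", "c.yaml"]  # toy file list
    layout = {"job-1": ["a.yaml"], "job-2": ["b.yaml", "c.yaml"]}
    print(check_coverage(files, layout))  # ([], [], [])
```

An empty `missing` list corresponds to the "no coverage lost" claim; a non-empty `duplicated` list would point at runner time spent running the same file twice.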

Notes for maintainers

- Required status checks need updating: deleted checks Baseline (states-ny), Reform (per-file), Contrib (other-shard-1), Contrib (other-shard-3); renamed states-non-ny-shard-1/2 → states-shard-1/2, ssa → ssa-usda, contrib → contrib-reform-hhs; new Contrib (other-shards-1-3)
- Memory co-scheduling headroom (two ~5 GB batches side-by-side on 16 GB runners) is based on measured per-batch peaks; the new per-batch RSS logging in this run is the confirmation. If any job runs hot, dropping its --workers to 1 restores exactly the old behavior
- push.yaml Publish.needs updated to the new job set

Test plan

- CI passes on this PR (full re-architected pipeline)
- Per-batch peak RSS in logs confirms co-scheduling headroom
- Job wall times ≈ 20-21 min critical path

🤖 Generated with Claude Code

The NY EITC and pre-2024 CTC formulas deep-cloned the entire tax-benefit system (parameter tree + variable registry) on every call that hit the decoupled/pre-TCJA branch, driving the NY credits test memory to ~12 GB and requiring a dedicated quarantined CI runner.

Build each pinned system (pre-ARPA 2020 EITC, pre-TCJA 2017 CTC) once per process in a module-level cache keyed on the baseline system's identity, so layered reforms still rebuild against their own base.

Measured on the credits test files: ny_ctc_pre_2024.yaml 373s -> 57s, ny_eitc.yaml 106s -> 53s, peak RSS 4.8 -> 2.8 GB, outputs identical.

Fixes PolicyEngine#8114

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

New --workers N flag (default 1 = unchanged sequential behavior) runs batch subprocesses concurrently, longest-first by YAML file count, with per-batch output buffered and printed atomically plus a 60s heartbeat. Each batch now reports its subprocess peak RSS (VmHWM from /proc on Linux) so CI logs show real per-batch memory. Grace sleep after pytest completion reduced from 5s to 1s.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
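On Linux, `/proc/<pid>/status` exposes the process's peak resident set size as a `VmHWM` line in kB. A minimal sketch of reading it; the helper names are hypothetical, and how `test_batched.py` times the read is not shown here. The file has to be read while the process is still alive.

```python
import re

# Matches a status line such as "VmHWM:\t  123456 kB".
_VM_HWM = re.compile(r"^VmHWM:\s+(\d+)\s+kB$", re.MULTILINE)


def parse_vm_hwm_kb(status_text):
    """Return the VmHWM value in kB from /proc/<pid>/status text, or None if absent."""
    match = _VM_HWM.search(status_text)
    return int(match.group(1)) if match else None


def peak_rss_kb(pid):
    """Return the peak RSS of `pid` in kB, or None when /proc is unavailable."""
    try:
        with open(f"/proc/{pid}/status") as handle:
            return parse_vm_hwm_kb(handle.read())
    except OSError:
        # Not Linux, or the process has already exited.
        return None
```

Returning `None` instead of raising keeps the batch footer optional on platforms without `/proc`, such as a macOS development machine.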

With the NY clone fix, states-ny folds back into the states matrix (8 batches, 2 shards). Reform and hhs merge into the baseline-contrib runner, usda moves to the ssa runner, contrib other-shards 1+3 merge, and the partners job fans analytics_coverage/edge_cases out per topic/state folder (invocation-only; partner files untouched). Heavy targets run batches 2-wide, the light rest job 3-wide. All test jobs set PYTEST_ADDOPTS to print the 25 slowest cases and skip pytest's unraisableexception gc sweep (~11s per subprocess).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

test_microsim.py now builds one Microsimulation per dataset via a module-scoped fixture shared across the parametrized years (2 full builds instead of 4; ~17 min of the Rest CI job). The RUN_HEAVY_TESTS LSR/CG interaction tests (still skipped by default) subsample to 10,000 households and reuse a single baseline, so opting in is cheaper.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

hua7450 and others added 4 commits July 4, 2026 02:59

hua7450 wants to merge 4 commits into PolicyEngine:main from hua7450:ci-speedup
