Speed up CI: fix NY tax-benefit-system cloning, run batches concurrently, rebalance to 14 runners#8886

Draft
hua7450 wants to merge 4 commits into PolicyEngine:main from hua7450:ci-speedup

hua7450 commented Jul 4, 2026

Collaborator

Summary

CI currently takes ~34 min wall-clock across 17 heavy runners. Decomposing a recent run (28628076632) batch-by-batch showed the time goes to three things: a memory workaround tax (153 sequential batch subprocesses × ~33s fixed startup ≈ 83 runner-minutes per PR run), idle CPUs (batches run one at a time, single-threaded, on 4-vCPU runners), and a handful of pathologically slow test cases — led by the NY EITC/CTC formulas deep-cloning the entire tax-benefit system per formula call (#8114).

This PR targets all three. Expected result (simulated from the measured per-batch times): ~20-21 min wall on 14 runners (from ~34 min on 17).
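The PR does not show how the wall-time estimate was simulated. One plausible way to reproduce that kind of estimate from measured per-batch times is greedy longest-first list scheduling, sketched below. The function name and the sample durations are hypothetical, not taken from the run.

```python
import heapq


def simulate_wall_time(batch_seconds, workers=1):
    """Estimate wall time when batches start longest-first on `workers` slots."""
    if workers < 1:
        raise ValueError("workers must be >= 1")
    # Each heap entry is the time at which one worker slot becomes free.
    free_at = [0.0] * workers
    heapq.heapify(free_at)
    for duration in sorted(batch_seconds, reverse=True):
        start = heapq.heappop(free_at)  # earliest-available slot takes the batch
        heapq.heappush(free_at, start + duration)
    # The job finishes when the last slot finishes.
    return max(free_at)


# Illustrative durations only (seconds); not measurements from the CI run.
example = [300, 180, 180, 120]
print(simulate_wall_time(example, workers=1))  # 780.0
print(simulate_wall_time(example, workers=2))  # 420.0
```

This ignores memory limits and the fixed per-subprocess startup cost, so it gives an optimistic bound rather than a prediction.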

Fixes #8114

Changes

1. NY #8114 fix (policyengine_us/tools/pinned_tbs.py + 3 NY credit variables)

Build each pinned system (pre-ARPA 2020 EITC, pre-TCJA 2017 CTC) once per process in a module-level cache keyed on the baseline system's identity (weakref, so reformed systems rebuild against their own base). Measured on the NY credits test files: ny_ctc_pre_2024.yaml 373s → 57s, ny_eitc.yaml 106s → 53s, peak RSS 4.8 → 2.8 GB, outputs identical. The whole credits folder now runs in one subprocess (~3.3 GB), so the quarantined states-ny runner is retired.

2. Concurrent batches in test_batched.py

New --workers N flag (default 1 = today's behavior). Batches are started longest-first (ordered by YAML file count) in a thread pool, and each batch is still an isolated subprocess. Output is buffered per batch and printed atomically, with a 60s heartbeat. Each batch footer now reports the subprocess's true peak RSS (VmHWM from /proc on Linux), so future memory tuning reads straight off CI logs.
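The scheduling pattern described here can be sketched as follows. This is an illustration under assumptions, not the code in test_batched.py: `run_batch` and `run_batches` are hypothetical names, and the heartbeat and peak-RSS footer are omitted.

```python
import subprocess
import sys
import threading
from concurrent.futures import ThreadPoolExecutor

_print_lock = threading.Lock()


def run_batch(name, cmd):
    """Run one batch in its own subprocess and print its output in one piece."""
    proc = subprocess.run(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    # Hold the lock while printing so output from concurrent batches
    # never interleaves.
    with _print_lock:
        print(f"--- {name}: exit code {proc.returncode} ---")
        print(proc.stdout, end="")
    return name, proc.returncode


def run_batches(batches, workers=1):
    """batches: list of (name, argv, yaml_file_count) tuples."""
    # Largest batches first, so a long batch does not start last and
    # stretch the tail of the job.
    ordered = sorted(batches, key=lambda b: b[2], reverse=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(run_batch, name, cmd) for name, cmd, _ in ordered]
        return [f.result() for f in futures]


if __name__ == "__main__":
    demo = [
        ("small", [sys.executable, "-c", "print('small batch')"], 3),
        ("large", [sys.executable, "-c", "print('large batch')"], 40),
    ]
    print(run_batches(demo, workers=2))
```

Threads are sufficient here because each one only waits on a child process; the real work happens in the subprocesses, so the GIL is not a bottleneck.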

3. Job re-layout: 17 → 14 heavy runners

  • states-ny folded into the two states shards (--batches 8 --workers 2, no more --exclude ny)
  • Reform + gov/hhs merged into the baseline-contrib runner; gov/usda moved to the ssa runner
  • Contrib other-shard-1 + other-shard-3 merged
  • Partners job: analytics_coverage/edge_cases (91% of the job as one 26.7-min batch) fans out per topic/state folder, 2-wide — invocation-only; no partner contract files were edited or moved
  • All test jobs set PYTEST_ADDOPTS="--durations=25 -p no:unraisableexception -p no:threadexception" — every CI log now ends with the 25 slowest test cases, and each subprocess skips pytest 8.4's ~11s full-heap gc sweep at exit

4. Microsimulation test cost (Rest job)

test_microsim.py builds one Microsimulation per dataset (module-scoped fixture shared across year params): 2 full builds instead of 4 — this was the real ~17-min hotspot of the Rest job (the heavy LSR/CG files were already skipped via RUN_HEAVY_TESTS). Those heavy tests now also subsample (10k households) and share a baseline, so opting in is much cheaper; they remain skipped by default.

Validation

  • NY fix behavior was validated empirically before porting: identical test outputs, timings above (Apple Silicon; ratios are what carry to CI)
  • Static checks: py_compile on all touched Python, PyYAML parse of both workflows, make -n on all 21 changed/new targets, and a batch-composition simulation confirming the new 14-job layout covers 4,210/4,210 YAML test files (no coverage lost)
  • The real verdict is this PR's own Actions run: job wall times, per-batch peak RSS, and --durations tables are all now visible in the logs

Notes for maintainers

  • Required status checks need updating: deleted checks Baseline (states-ny), Reform (per-file), Contrib (other-shard-1), Contrib (other-shard-3); renamed states-non-ny-shard-1/2 → states-shard-1/2, ssa → ssa-usda, contrib → contrib-reform-hhs; new Contrib (other-shards-1-3)
  • Memory co-scheduling headroom (two ~5 GB batches side-by-side on 16 GB runners) is based on measured per-batch peaks; the new per-batch RSS logging in this run is the confirmation — if any job runs hot, dropping its --workers to 1 restores exactly the old behavior
  • push.yaml Publish.needs updated to the new job set
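For the co-scheduling note above, a sketch of how a per-batch peak could be read and checked against runner memory. The function names and the 2 GB reserve are assumptions for illustration. On Linux, `VmHWM` in `/proc/<pid>/status` is the peak resident set size in kB; as far as I know it has to be read while the process is still alive, and how test_batched.py times that read is not shown here.

```python
import re
from pathlib import Path


def parse_vmhwm_kb(status_text):
    """Extract VmHWM (peak RSS, in kB) from the text of /proc/<pid>/status."""
    match = re.search(r"^VmHWM:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_peak_rss_kb(pid):
    """Return the peak RSS of a live process, or None if it is unavailable."""
    try:
        return parse_vmhwm_kb(Path(f"/proc/{pid}/status").read_text())
    except OSError:
        # Non-Linux platform, or the process has already gone away.
        return None


def fits_on_runner(peaks_kb, runner_gb=16, reserve_gb=2):
    """Check whether batches with these peaks can run side by side.

    reserve_gb is a hypothetical allowance for the OS and the parent process.
    """
    total_gb = sum(peaks_kb) / 1024**2
    return total_gb <= runner_gb - reserve_gb


two_batches = [5 * 1024**2, 5 * 1024**2]  # two ~5 GB peaks, in kB
print(fits_on_runner(two_batches))  # True
```

Summing peaks is conservative, since two batches rarely hit their peaks at the same moment.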

Test plan

  • CI passes on this PR (full re-architected pipeline)
  • Per-batch peak RSS in logs confirms co-scheduling headroom
  • Job wall times ≈ 20-21 min critical path

🤖 Generated with Claude Code

hua7450 and others added 4 commits July 4, 2026 02:59
The NY EITC and pre-2024 CTC formulas deep-cloned the entire
tax-benefit system (parameter tree + variable registry) on every call
that hit the decoupled/pre-TCJA branch, driving the NY credits test
memory to ~12 GB and requiring a dedicated quarantined CI runner.

Build each pinned system (pre-ARPA 2020 EITC, pre-TCJA 2017 CTC) once
per process in a module-level cache keyed on the baseline system's
identity, so layered reforms still rebuild against their own base.
Measured on the credits test files: ny_ctc_pre_2024.yaml 373s -> 57s,
ny_eitc.yaml 106s -> 53s, peak RSS 4.8 -> 2.8 GB, outputs identical.

Fixes PolicyEngine#8114

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

New --workers N flag (default 1 = unchanged sequential behavior) runs
batch subprocesses concurrently, longest-first by YAML file count, with
per-batch output buffered and printed atomically plus a 60s heartbeat.
Each batch now reports its subprocess peak RSS (VmHWM from /proc on
Linux) so CI logs show real per-batch memory. Grace sleep after pytest
completion reduced from 5s to 1s.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

With the NY clone fix, states-ny folds back into the states matrix
(8 batches, 2 shards). Reform and hhs merge into the baseline-contrib
runner, usda moves to the ssa runner, contrib other-shards 1+3 merge,
and the partners job fans analytics_coverage/edge_cases out per
topic/state folder (invocation-only; partner files untouched). Heavy
targets run batches 2-wide, the light rest job 3-wide. All test jobs
set PYTEST_ADDOPTS to print the 25 slowest cases and skip pytest's
unraisableexception gc sweep (~11s per subprocess).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

test_microsim.py now builds one Microsimulation per dataset via a
module-scoped fixture shared across the parametrized years (2 full
builds instead of 4; ~17 min of the Rest CI job). The RUN_HEAVY_TESTS
LSR/CG interaction tests (still skipped by default) subsample to
10,000 households and reuse a single baseline, so opting in is cheaper.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

NY EITC/CTC formulas clone the full tax-benefit system on every call — CI memory blowup
