disable cyclic gc on hot ranks + add rt tuning docs #14

Open
jl33-ai wants to merge 1 commit into LorenFrankLab:main from jl33-ai:gc-discipline-and-rt-tuning-docs

@jl33-ai jl33-ai commented May 29, 2026

this is the OS-level tail-latency follow-up to the per-tick allocation work in #9 and #12. python's GC and the linux scheduler/IRQ defaults are two separate sources of multi-ms hiccups, and for closed-loop the tail is what matters, not the mean.

code

  • new `realtime_decoder/rt_tuning.py` with a `gc_paused_for_main_loop(config)` context manager. runs one `gc.collect()` up front, calls `gc.disable()` for the duration of the `with` block, then re-enables and collects on exit. opt out via `performance.disable_gc_in_main_loop: false` (useful when hunting reference cycles in dev).
  • encoder, decoder, and ripple `main_loop` methods wrap their try/except body in that context manager. once #9 (preallocate decoder hot path scratch) and #12 (preallocate encoder and ripple hot path scratch) are in, steady-state allocations are low enough that turning off automatic collection, the full gen-2 passes in particular, cleanly removes one common source of 50-100ms spikes.
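
for reviewers who want the shape without opening the diff, here is a minimal sketch of what a context manager like this could look like. it is not the code in the PR: the dict-style `config` lookup and the default of `True` are assumptions, and restoring the previous GC state (rather than always re-enabling) is a choice made for this sketch.

```python
import gc
from contextlib import contextmanager


@contextmanager
def gc_paused_for_main_loop(config):
    """Pause automatic cyclic GC for the duration of the with block.

    Assumption: config is a nested dict and the flag lives at
    config["performance"]["disable_gc_in_main_loop"], defaulting to True.
    """
    perf = config.get("performance", {})
    if not perf.get("disable_gc_in_main_loop", True):
        # Opted out: leave the collector alone.
        yield
        return

    was_enabled = gc.isenabled()
    gc.collect()   # start from a clean heap so garbage does not pile up early
    gc.disable()   # stops automatic collection for all generations
    try:
        yield
    finally:
        if was_enabled:
            gc.enable()
        gc.collect()  # reclaim any cycles created while paused


# usage, with a stand-in for the real main loop body
if __name__ == "__main__":
    with gc_paused_for_main_loop({"performance": {}}):
        print("gc enabled inside loop:", gc.isenabled())
```

reference counting still frees acyclic objects while the collector is off, so only reference cycles accumulate until exit.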

docs

  • new `docs/realtime_tuning.md` covering the OS recipe in the order I'd apply it:
    1. don't fight the GC (already wired up here)
    2. `mpiexec -bind-to hwthread` + rankfiles
    3. `isolcpus` / `nohz_full` / `rcu_nocbs` on the kernel cmdline
    4. IRQ affinity off the hot cores
    5. SCHED_FIFO via `chrt` + `kernel.sched_rt_runtime_us=-1`
    6. mlockall (with notes since stock python doesn't have a clean API)
    7. PREEMPT_RT kernel
    8. `cpupower frequency-set -g performance` + intel `no_turbo`
    9. a launcher shell script that ties it together
    10. a verification checklist (p99/p99.9 from the per-rank timing .npz)
  • README points at it.
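
the verification step boils down to comparing tail percentiles before and after each change. a rough sketch of that comparison is below. the helper name `tail_latency_ms` is hypothetical, and so is the `tick_durations` key in the usage note; the actual array names in the per-rank timing .npz are not spelled out here.

```python
import numpy as np


def tail_latency_ms(tick_durations_s):
    """Summarize per-tick durations (seconds) as milliseconds.

    Hypothetical helper: returns the median, the two tail percentiles
    named in the checklist, and the worst case.
    """
    d_ms = np.asarray(tick_durations_s, dtype=float) * 1e3
    p50, p99, p999 = np.percentile(d_ms, [50, 99, 99.9])
    return {
        "p50": float(p50),
        "p99": float(p99),
        "p99.9": float(p999),
        "max": float(d_ms.max()),
    }


if __name__ == "__main__":
    # synthetic data: 1 ms ticks with a single 100 ms hiccup
    durations = np.full(10_000, 1e-3)
    durations[1234] = 0.1
    print(tail_latency_ms(durations))
```

against real output this would be something like `tail_latency_ms(np.load(path)["tick_durations"])`, with the key swapped for whatever the timing file actually uses. the point of the synthetic example is that a single hiccup barely moves the median but dominates the max, which is why the checklist looks at the tail.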

deliberately not in this PR

no benchmark numbers in the doc. real latency numbers are setup-dependent and quoting them as universal would mislead. the doc is structured so each step is independently verifiable against the existing timing output.

no python-side mlockall implementation. the cleanest path is a tiny LD_PRELOAD shim or a C extension; both feel out of scope for one PR. happy to do it as a follow-up if there's interest.
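
for context on what a python-side call would involve, one possible stopgap is going through `ctypes`. this is a sketch only and not part of this PR: `try_mlockall` is a hypothetical name, the flag values are the common Linux ones (x86 and arm; some architectures differ), and it only locks memory from the point of the call, after the interpreter has already started.

```python
import ctypes
import ctypes.util
import os

# Assumption: Linux values from <sys/mman.h> on x86 and arm.
# Some architectures (e.g. powerpc, sparc) use different constants.
MCL_CURRENT = 1
MCL_FUTURE = 2


def try_mlockall():
    """Attempt mlockall(MCL_CURRENT | MCL_FUTURE) via libc.

    Returns (ok, message). Fails without raising when libc or the
    symbol is missing, or when the kernel refuses (typically because
    RLIMIT_MEMLOCK is too low or CAP_IPC_LOCK is absent).
    """
    libc_name = ctypes.util.find_library("c")
    if libc_name is None:
        return False, "libc not found"
    libc = ctypes.CDLL(libc_name, use_errno=True)
    if not hasattr(libc, "mlockall"):
        return False, "mlockall not available"
    if libc.mlockall(MCL_CURRENT | MCL_FUTURE) != 0:
        return False, os.strerror(ctypes.get_errno())
    return True, "ok"


if __name__ == "__main__":
    ok, msg = try_mlockall()
    print("mlockall:", "locked" if ok else f"not locked ({msg})")
```

the hardcoded constants and the late call are the main reasons a preload shim or a C extension still looks like the cleaner route.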

happy to gate the gc disable behind an env var instead of a config flag if you prefer; let me know.
