disable cyclic gc on hot ranks + add rt tuning docs#14
Open
jl33-ai wants to merge 1 commit into
addresses the OS-level tail latency story loren named directly: the python GC and the linux scheduler/IRQ defaults are two separate sources of multi-ms hiccups, and the tail is what matters for closed loop, not the mean.

code:
- new realtime_decoder/rt_tuning.py with a gc_paused_for_main_loop(config) context manager. runs one gc.collect() up front, gc.disable() for the duration of the with, then re-enables and collects on exit. opt out via performance.disable_gc_in_main_loop: false (useful when hunting reference cycles in dev).
- encoder, decoder, and ripple main_loop methods wrap their try/except body in that context manager. with the per-tick pre-allocation pass from LorenFrankLab#9 and LorenFrankLab#12 already in place, steady-state allocations are low enough that disabling gen-2 collection cleanly removes one common source of 50-100ms spikes.

docs:
- new docs/realtime_tuning.md covering the OS recipe in the order it's worth applying: gc, CPU pinning + rankfiles, isolcpus + nohz_full + rcu_nocbs, IRQ affinity, SCHED_FIFO via chrt + kernel.sched_rt_runtime_us, mlockall, PREEMPT_RT, frequency governor + no_turbo, a launcher script that ties it together, and a verification checklist (p99/p99.9 from the timing .npz files).
- README points at it.

deliberately no benchmark numbers in the doc. real numbers are setup-dependent and quoting them as universal would be misleading. the doc is structured so each step is verifiable independently.
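for readers who don't have the diff open, here is a minimal sketch of what a context manager like gc_paused_for_main_loop(config) could look like. it is not the code from realtime_decoder/rt_tuning.py. it assumes config behaves like a nested dict and that the flag defaults to on when missing; both are guesses, and the real config object may differ.

```python
import gc
from contextlib import contextmanager


@contextmanager
def gc_paused_for_main_loop(config):
    """Pause the cyclic garbage collector for the duration of the with block.

    Assumption: `config` is a nested dict with an optional
    performance.disable_gc_in_main_loop key that defaults to True.
    """
    perf = config.get("performance", {})
    if not perf.get("disable_gc_in_main_loop", True):
        # Opt-out path: leave the collector exactly as it was.
        yield
        return

    was_enabled = gc.isenabled()
    gc.collect()   # one full collection up front, so the loop starts clean
    gc.disable()   # no automatic collections while the loop runs
    try:
        yield
    finally:
        # Sketch-level choice: only re-enable if the GC was on when we entered.
        if was_enabled:
            gc.enable()
        gc.collect()  # clean up any cycles that built up during the loop
```

a main_loop would then wrap its existing try/except body in `with gc_paused_for_main_loop(config):`. reference counting still frees non-cyclic objects while the collector is off, so only cyclic garbage is deferred until exit.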
this is the OS-level tail-latency follow-up to the per-tick allocation work in #9 and #12. python's GC and the linux scheduler/IRQ defaults are two separate sources of multi-ms hiccups, and for closed-loop the tail is what matters, not the mean.
code
docs
deliberately not in this PR
no benchmark numbers in the doc. real latency numbers are setup-dependent and quoting them as universal would mislead. the doc is structured so each step is independently verifiable against the existing timing output.
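to make "verifiable against the timing output" concrete, here is a small sketch of the kind of tail summary one could compute per step. the layout of the timing files isn't described in this PR, so the function takes a plain sequence of per-tick latencies in ms; the function names are hypothetical, and pulling the array out of the file (e.g. with numpy.load) is left to the reader. it uses the nearest-rank percentile definition, which is one of several common ones.

```python
import math


def tail_percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples <= it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # The small epsilon guards against float round-up (e.g. 990.0000000001 -> 991).
    rank = max(1, math.ceil(pct * len(ordered) / 100 - 1e-9))
    return ordered[rank - 1]


def summarize_latency(latencies_ms):
    """Report the mean next to the tail so the two can be compared directly."""
    return {
        "mean": sum(latencies_ms) / len(latencies_ms),
        "p99": tail_percentile(latencies_ms, 99),
        "p99.9": tail_percentile(latencies_ms, 99.9),
        "max": max(latencies_ms),
    }
```

as a synthetic illustration (not a measurement): 980 ticks at 1 ms plus 20 ticks at 60 ms gives a mean of 2.18 ms while p99 is 60 ms, which is why the checklist looks at the tail instead of the mean.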
no python-side mlockall implementation. the cleanest path is a tiny LD_PRELOAD shim or a C extension; both feel out of scope for one PR. happy to do it as a follow-up if there's interest.
happy to gate the gc disable behind an env var instead of a config flag if you prefer; let me know.
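if it helps the discussion, here is a rough sketch of a third option where an env var overrides the config flag instead of replacing it. the variable name REALTIME_DECODER_DISABLE_GC and the function name are hypothetical, and the config is again assumed to be a nested dict with the flag defaulting to on.

```python
import os

_TRUE = {"1", "true", "yes", "on"}
_FALSE = {"0", "false", "no", "off"}


def should_disable_gc(config, environ=None, var="REALTIME_DECODER_DISABLE_GC"):
    """Decide whether to pause the GC: env var wins if set, else the config flag.

    `var` is a hypothetical variable name; `config` is assumed to be a nested dict.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get(var)
    if raw is not None:
        value = raw.strip().lower()
        if value in _TRUE:
            return True
        if value in _FALSE:
            return False
        # Fail loudly on typos instead of silently picking a default.
        raise ValueError(f"{var} must be one of {sorted(_TRUE | _FALSE)}, got {raw!r}")
    # Fall back to the config flag (assumed to default to True).
    return bool(config.get("performance", {}).get("disable_gc_in_main_loop", True))
```

the upside of layering is that a one-off debugging run can flip the behavior without editing the config file; the downside is two places to look when the GC state is unexpected.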