preallocate decoder hot path scratch by jl33-ai · Pull Request #9 · LorenFrankLab/realtime_decoder

jl33-ai · 2026-05-29T16:28:32Z

_process_lfp_timestamp runs at ~167 Hz (180-sample bin at 30 kHz spike clock) and was allocating per tick:

- two np.zeros(cred_int_bufsize) scratch arrays
- a logical_and mask
- two no-op atleast_2d wrappers around already-2d slices
- a fresh bytes copy of the structured numpy buffer for every MPI.Send via msg.tobytes()

in aggregate that is hundreds of short-lived numpy/bytes objects per second, exactly the kind of churn that can let generational gc accumulate enough pressure to produce the 50-100ms tail latency spikes the lab has seen on the python side. the math here is not the bottleneck; the allocator/gc interaction is.
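for scale, a back-of-envelope sketch of the numbers above. the clock rate and bin size come from the description; ARRAYS_PER_TICK is an assumption that counts only the two np.zeros scratch arrays and the mask, so the bytes copies made per send come on top of it.

```python
# Back-of-envelope numbers for the decoder tick.
SPIKE_CLOCK_HZ = 30_000   # spike clock rate, from the description
SAMPLES_PER_BIN = 180     # samples per time bin, from the description

tick_rate_hz = SPIKE_CLOCK_HZ / SAMPLES_PER_BIN  # ticks per second
tick_period_ms = 1000.0 / tick_rate_hz           # time budget per tick

# Assumption: count only the two np.zeros scratch arrays and the mask.
ARRAYS_PER_TICK = 3
arrays_per_second = ARRAYS_PER_TICK * tick_rate_hz

print(f"{tick_rate_hz:.1f} Hz, {tick_period_ms:.1f} ms per tick, "
      f"~{arrays_per_second:.0f} temporary arrays per second")
# prints: 166.7 Hz, 6.0 ms per tick, ~500 temporary arrays per second
```

a 50-100ms stall is therefore roughly 8 to 17 missed ticks at a 6 ms period.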

changes:

- enc_cred_intervals, enc_argmaxes, and the spike-bin mask move to instance attrs allocated once in __init__ and reused via .fill(0) / out= per tick.
- dropped the two np.atleast_2d calls. boolean indexing a 2d array with a 1d bool mask already returns 2d, so each wrapper was a no-op that only added redundant per-tick work.
- hoisted self.p[...] lookups out of the inner loop.
- swapped msg.tobytes() for [msg, MPI.BYTE] in the three send paths so MPI hands the numpy buffer to the network layer directly without allocating a fresh bytes object per send. wire format unchanged (raw bytes), so the bytearray + np.frombuffer receivers don't need any change.
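a minimal sketch of the scratch-reuse pattern described in the list above, not the code in this PR. DecoderScratch, bin_mask, n_slots, and the fixed-size timestamp buffer are all hypothetical names and assumptions made for illustration; only .fill(0), out=, and the boolean-indexing behaviour are taken from the description.

```python
import numpy as np


class DecoderScratch:
    """Hypothetical stand-in for the decoder's per-tick scratch state."""

    def __init__(self, cred_int_bufsize: int, n_slots: int) -> None:
        # Allocated once here and reused on every tick.
        self.enc_cred_intervals = np.zeros(cred_int_bufsize)
        self.enc_argmaxes = np.zeros(cred_int_bufsize)
        self.mask = np.zeros(n_slots, dtype=bool)
        self._tmp = np.zeros(n_slots, dtype=bool)

    def bin_mask(self, timestamps: np.ndarray, lo: int, hi: int) -> np.ndarray:
        """Mark slots with lo <= timestamp < hi, writing into existing buffers."""
        # Reset the float scratch in place instead of calling np.zeros again.
        self.enc_cred_intervals.fill(0)
        self.enc_argmaxes.fill(0)
        # Each ufunc writes its result into a preallocated boolean buffer.
        np.greater_equal(timestamps, lo, out=self.mask)
        np.less(timestamps, hi, out=self._tmp)
        np.logical_and(self.mask, self._tmp, out=self.mask)
        return self.mask


scratch = DecoderScratch(cred_int_bufsize=4, n_slots=5)
timestamps = np.array([100, 250, 300, 420, 500])
posteriors = np.arange(15.0).reshape(5, 3)  # one row per spike slot

first = scratch.bin_mask(timestamps, lo=180, hi=360)
selected = posteriors[first]  # 1d bool mask on a 2d array stays 2d
print(first.tolist())         # [False, True, True, False, False]
print(selected.shape)         # (2, 3)

second = scratch.bin_mask(timestamps, lo=360, hi=540)
print(second is first)        # True: the same mask buffer was reused
```

the trade-off is that the returned mask is only valid until the next tick, so any caller that needs to keep it has to copy it explicitly.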

no algorithmic change, semantics preserved. happy to add a microbenchmark if useful.
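to illustrate why the send-path swap should leave the wire format alone, here is a small numpy-only sketch. it does not use MPI: a uint8 view stands in for what a buffer-aware sender such as comm.Send([msg, MPI.BYTE], ...) reads, and an in-process copy stands in for the transfer. msg_dtype is a made-up layout, not the repo's actual message dtype.

```python
import numpy as np

# Hypothetical message layout, for illustration only.
msg_dtype = np.dtype([("timestamp", "<i8"), ("posterior", "<f8", (4,))])
msg = np.zeros(1, dtype=msg_dtype)
msg["timestamp"] = 123456
msg["posterior"] = [0.1, 0.2, 0.3, 0.4]

# Old send path: tobytes() builds a fresh bytes object on every call.
copied = msg.tobytes()

# New send path: a buffer-aware sender reads the array's own memory.
# A uint8 view stands in for that here; it shares memory with msg.
raw = msg.view(np.uint8)
shares = np.shares_memory(raw, msg)

# Receiver side, unchanged: fill a bytearray, then reinterpret it.
recv_buf = bytearray(msg.nbytes)
np.frombuffer(recv_buf, dtype=np.uint8)[:] = raw  # stands in for the MPI transfer
decoded = np.frombuffer(recv_buf, dtype=msg_dtype)

same_wire_bytes = bytes(recv_buf) == copied
print(shares, same_wire_bytes, int(decoded["timestamp"][0]))
# prints: True True 123456
```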

this is the first piece of a broader latency hygiene pass. the same approach is worth doing in encoder_process and ripple_process (both have similar per-tick allocations) but I kept this PR scoped to the decoder so it stays reviewable.


This was referenced May 29, 2026

preallocate encoder and ripple hot path scratch #12

Open

disable cyclic gc on hot ranks + add rt tuning docs #14

Open

jl33-ai wants to merge 1 commit into LorenFrankLab:main from jl33-ai:preallocate-decoder-scratch
