preallocate decoder hot path scratch #9

Open
jl33-ai wants to merge 1 commit into LorenFrankLab:main from jl33-ai:preallocate-decoder-scratch

@jl33-ai jl33-ai commented May 29, 2026

_process_lfp_timestamp runs at ~167 Hz (one 180-sample bin at the 30 kHz spike clock) and was allocating on every tick:

  • two np.zeros(cred_int_bufsize) scratch arrays
  • a logical_and mask
  • two no-op atleast_2d wrappers around already-2d slices
  • a fresh bytes copy of the structured numpy buffer for every MPI.Send via msg.tobytes()

in aggregate that is hundreds of short-lived numpy/bytes objects per second, which is exactly the kind of churn that can let generational gc build up enough pressure to produce the 50-100 ms tail latency spikes the lab has seen on the python side. the math here is not the bottleneck; the allocator/gc interaction is.
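one way to check whether collector passes actually line up with the spikes is to time them directly. a minimal sketch, assuming only the stdlib gc.callbacks hook; GcPauseRecorder is a hypothetical helper and not part of this PR. it only sees the cyclic collector, so allocator cost outside gc passes would not show up here.

```python
import gc
import time


class GcPauseRecorder:
    """Hypothetical helper: record how long each cyclic-GC pass takes."""

    def __init__(self):
        self.pauses = []    # list of (generation, seconds)
        self._start = None

    def __call__(self, phase, info):
        # gc invokes every entry in gc.callbacks with phase "start" or "stop"
        if phase == "start":
            self._start = time.perf_counter()
        elif phase == "stop" and self._start is not None:
            elapsed = time.perf_counter() - self._start
            self.pauses.append((info["generation"], elapsed))
            self._start = None

    def __enter__(self):
        gc.callbacks.append(self)
        return self

    def __exit__(self, *exc):
        gc.callbacks.remove(self)
        return False


# usage: wrap the region of interest, then inspect the recorded pauses
with GcPauseRecorder() as recorder:
    gc.collect()  # stand-in for a stretch of decoder ticks

worst_pause = max(seconds for _, seconds in recorder.pauses)
```

comparing the timestamps of the longest pauses against the observed tail-latency spikes would show whether the collector is actually the cause.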

changes:

  • enc_cred_intervals, enc_argmaxes, and the spike-bin mask move to instance attrs allocated once in __init__ and reused via .fill(0) / out= per tick.
  • dropped the two np.atleast_2d calls. boolean indexing a 2d array with a 1d bool mask already returns 2d, and atleast_2d hands back an input that is already 2d unchanged, so the wrapper was a no-op that only added a function call per tick.
  • hoisted self.p[...] lookups out of the inner loop.
  • swapped msg.tobytes() for [msg, MPI.BYTE] in the three send paths so MPI hands the numpy buffer to the network layer directly without allocating a fresh bytes object per send. wire format unchanged (raw bytes), so the bytearray + np.frombuffer receivers don't need any change.
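the scratch-reuse and indexing points above can be sketched in isolation. this is a minimal sketch, not the code in this PR: DecoderScratch, bin_mask, max_spikes, and the sample data are hypothetical names chosen for illustration, and it assumes the timestamp buffer has a fixed length so the preallocated masks always fit.

```python
import numpy as np


class DecoderScratch:
    """Hypothetical stand-in for the decoder's per-tick scratch state."""

    def __init__(self, cred_int_bufsize, max_spikes):
        # allocated once; every tick reuses these arrays in place
        self.enc_cred_intervals = np.zeros(cred_int_bufsize)
        self.enc_argmaxes = np.zeros(cred_int_bufsize)
        self.spike_mask = np.zeros(max_spikes, dtype=bool)
        self._in_lower = np.zeros(max_spikes, dtype=bool)

    def reset(self):
        # zero the existing buffers instead of calling np.zeros again
        self.enc_cred_intervals.fill(0)
        self.enc_argmaxes.fill(0)

    def bin_mask(self, timestamps, bin_start, bin_stop):
        # out= writes each result into a preallocated bool array,
        # so no new result array is allocated per tick
        np.greater_equal(timestamps, bin_start, out=self._in_lower)
        np.less(timestamps, bin_stop, out=self.spike_mask)
        np.logical_and(self._in_lower, self.spike_mask, out=self.spike_mask)
        return self.spike_mask


scratch = DecoderScratch(cred_int_bufsize=4, max_spikes=5)
timestamps = np.array([0, 90, 180, 270, 360])
spike_rows = np.arange(15.0).reshape(5, 3)

mask = scratch.bin_mask(timestamps, bin_start=180, bin_stop=360)
selected = spike_rows[mask]  # 1d bool mask on a 2d array keeps ndim == 2
```

the comparisons also use out= here, since `timestamps >= bin_start` on its own would still create a temporary before logical_and ever runs.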

no algorithmic change, semantics preserved. happy to add a microbenchmark if useful.
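a rough shape for that microbenchmark, as a sketch under stated assumptions: BUFSIZE, MSG_DTYPE, and all function names are hypothetical, and mpi4py is deliberately left out, so msg.view(np.uint8) stands in for the zero-copy buffer that [msg, MPI.BYTE] would expose. it reports mean per-call cost only, which says nothing about tail latency on its own.

```python
import timeit

import numpy as np

BUFSIZE = 1024  # hypothetical stand-in for cred_int_bufsize
# hypothetical message layout, not the project's real dtype
MSG_DTYPE = np.dtype([("timestamp", "<i8"), ("value", "<f8")])

msg = np.zeros(1, dtype=MSG_DTYPE)
msg["timestamp"] = 12345
msg["value"] = 0.5
scratch = np.zeros(BUFSIZE)


def alloc_per_tick():
    return np.zeros(BUFSIZE)       # old pattern: fresh array every tick


def reuse_per_tick():
    scratch.fill(0)                # new pattern: zero the buffer in place
    return scratch


def copy_for_send():
    return msg.tobytes()           # old pattern: fresh bytes object


def view_for_send():
    return msg.view(np.uint8)      # zero-copy view of the same memory


def receive(raw):
    # mimics a bytearray + np.frombuffer receiver
    buf = bytearray(MSG_DTYPE.itemsize)
    buf[:] = bytes(raw)
    return np.frombuffer(buf, dtype=MSG_DTYPE)


def run_benchmark(number=20000):
    cases = (alloc_per_tick, reuse_per_tick, copy_for_send, view_for_send)
    # mean seconds per call for each case
    return {fn.__name__: timeit.timeit(fn, number=number) / number
            for fn in cases}


if __name__ == "__main__":
    for name, seconds in run_benchmark().items():
        print(f"{name}: {seconds * 1e9:.0f} ns per call")
```

the receive helper doubles as a check on the wire-format claim: both send paths should decode to the same record.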

this is the first piece of a broader latency hygiene pass. the same approach is worth doing in encoder_process and ripple_process (both have similar per-tick allocations) but I kept this PR scoped to the decoder so it stays reviewable.
