KV cache: per-conversation slot separation + disk persistence/offload#1830
Open
aidiffuser wants to merge 5 commits into exo-explore:main
- Per-conversation slot separation: an exact prefix check prevents cross-conversation slot overwriting. Conversations sharing a system prompt now maintain separate cache slots (98-99% hit rates).
- Single hot slot in RAM with disk persistence via mlx-lm's save_prompt_cache/load_prompt_cache. Conversation switches cost ~1-2s (disk swap) instead of ~140s (cold prefill). Caches survive runner restarts. 24h TTL disk eviction.
- Tested on 2x Mac Studio M3 Ultra (512GB each) with Kimi-K2.5 (61 layers, 50k-token conversations, Thunderbolt 5 RDMA).
added 3 commits on April 7, 2026
- Size-based eviction: after the TTL pass, evicts the oldest slots if the disk cache directory exceeds 500GB. The hot slot is never evicted.
- Force flush on idle: saves the hot slot immediately when the runner has no active tasks, ensuring completed conversations are always persisted.
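The eviction order described above (a TTL pass first, then a size cap that never touches the hot slot) can be sketched roughly as follows. This is an illustrative sketch, not the PR's actual code: the metadata shape, function names, and the `timestamp` field are assumptions.

```python
import time

def select_stale_slots(metas, ttl_hours=24.0, now=None):
    """TTL pass: return slot ids whose last-flush timestamp aged past the TTL.
    `metas` maps slot id -> meta dict; the 'timestamp' field name is assumed."""
    now = time.time() if now is None else now
    return [s for s, m in metas.items() if now - m["timestamp"] > ttl_hours * 3600]

def evict_to_size_cap(slots, max_size_gb=500.0, hot_slot=None):
    """Size pass: drop oldest slots until the cache directory fits under the cap.
    `slots` maps slot id -> (timestamp, size_bytes). The hot slot is skipped."""
    cap = max_size_gb * 1024**3
    total = sum(size for _, size in slots.values())
    # Walk slots oldest-first; stop as soon as we fit under the cap.
    for slot_id, (_ts, size) in sorted(slots.items(), key=lambda kv: kv[1][0]):
        if total <= cap:
            break
        if slot_id == hot_slot:
            continue  # never evict the active conversation's slot
        del slots[slot_id]
        total -= size
    return slots
```

Running the TTL pass before the size pass means stale slots are reclaimed even when the directory is under the cap, and the size pass only has to consider slots that are still live.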
- EXO_KV_DISK_PERSISTENCE=0 to disable disk persistence entirely
- EXO_KV_DISK_PATH to set a custom storage location
- EXO_KV_DISK_TTL_HOURS to configure stale slot eviction (default 24h)
- EXO_KV_DISK_MAX_SIZE_GB to set the disk size cap (default 500GB)

All optional with sensible defaults. No behavior change without setting them.
…ures

Some models (e.g., GLM-4) produce cache layers with empty tensors, and mx.save_safetensors rejects zero-length arrays. Replaced save_prompt_cache with inline logic that filters empty arrays before saving. The load path handles missing keys gracefully via from_state() reconstruction.
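The filtering step might look roughly like this. This is a framework-agnostic sketch (the real code operates on mx.array state dicts); the helper names are illustrative.

```python
def _length(v):
    """Array length: prefer .size (mx.array / numpy expose it), else len()."""
    size = getattr(v, "size", None)
    return size if size is not None else len(v)

def filter_empty_arrays(state):
    """Drop zero-length entries before handing the state dict to a
    safetensors writer, which rejects empty arrays. Returns the kept
    entries plus the names of the dropped keys (so the load path knows
    which keys may be missing and must be reconstructed)."""
    kept = {k: v for k, v in state.items() if _length(v) > 0}
    dropped = sorted(k for k in state if k not in kept)
    return kept, dropped
```

On load, any key that was dropped here simply does not appear in the file, which is why a from_state()-style reconstruction that tolerates missing keys is needed on the read side.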
Problem
KVPrefixCache matches slots purely by token prefix overlap. When multiple conversations share a system prompt (typical in multi-channel API gateway setups), the prefix matcher sees 90%+ overlap and overwrites the existing slot on every conversation switch, destroying the previous conversation's cache. Switching to a fresh session reverts a 100k-token conversation to a ~20-30k-token system prompt match, requiring a near-full reprefill. The alternative of keeping multiple slots in RAM scales linearly with conversation count, length, and model architecture, quickly becoming prohibitive; with this patch only one slot is active at any given time, which prevents OOMs. Additionally, the KV cache is RAM-only: restarts and crashes force cold prefills from zero.

With this patch I can hold multiple conversations in multiple channels with my agent running locally on my computer, with near-zero prefill downtime: switching between sessions is instant once each conversation has been fully loaded or flushed to disk. I think of it as a context extender: I hold 3-5 conversations at a time, each around 70k-100k tokens, on a model as large as Kimi K2.5.
Solution
Per-conversation slot separation
A heuristic in
_save_prefix_cache distinguishes "same conversation extended" from "different conversation with shared prefix" using an exact prefix check:

- A legitimate update (same conversation, new messages appended) matches all cached tokens as a prefix: prefix_hit_length >= cached_len - 1. The new prompt is always longer, so cached_coverage >= hit_ratio.
- A cross-conversation collision (different channel, shared system prompt) matches only a subset of the cached tokens. Even at 98% overlap, the gap of hundreds of unmatched tokens at the end of the cached prompt fails the exact prefix check, which correctly routes the prompt to a new slot instead of overwriting.
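The heuristic can be sketched roughly as follows; the function and variable names are illustrative, not the PR's actual code.

```python
def is_same_conversation(cached_tokens, new_tokens):
    """Exact prefix check: a legitimate extension of the same conversation
    must match every cached token (allowing one trailing token of slack),
    whereas a different conversation that merely shares a system prompt
    diverges somewhere before the end of the cached prompt."""
    # Length of the longest common prefix between cached and new prompt.
    prefix_hit_length = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        prefix_hit_length += 1
    # The whole cached prompt (minus at most its final token) must be a
    # prefix of the new prompt; a high but inexact overlap is rejected.
    return prefix_hit_length >= len(cached_tokens) - 1
```

Note that a ratio threshold alone would not work here: a colliding conversation can overlap 98% of the cached tokens and still belong in a separate slot, which is exactly why the check demands an exact prefix rather than a high hit ratio.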
Single hot slot + disk persistence
Only one KV cache slot stays in RAM at a time (the "hot" slot for the active conversation). All other slots are persisted to disk via mlx-lm's
save_prompt_cache/load_prompt_cache, which call mx.save_safetensors internally, writing directly from Apple Silicon unified memory to disk with no numpy intermediate, no bf16 casting, and no GPU→CPU copy.

Conversation switch flow: the current hot slot is flushed to disk, the disk cache is searched for a slot matching the incoming prompt, and a match is loaded back into RAM (~1-2s) instead of a cold prefill (~140s).
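A minimal sketch of the single-hot-slot swap, under stated assumptions: the class, its bookkeeping, and the path scheme are illustrative, and save_fn/load_fn stand in for mlx-lm's save_prompt_cache/load_prompt_cache.

```python
import os

class SingleHotSlot:
    """Only one conversation's KV cache is ever held in RAM; everything
    else lives on disk. A hypothetical sketch, not the PR's cache.py."""

    def __init__(self, save_fn, load_fn, cache_dir):
        self.save_fn, self.load_fn = save_fn, load_fn
        self.cache_dir = cache_dir
        self.hot_id = None      # conversation currently held in RAM
        self.hot_cache = None   # its KV cache object
        self._on_disk = set()   # slot ids known to be flushed to disk

    def _path(self, conv_id):
        return os.path.join(self.cache_dir, f"{conv_id}_cache.safetensors")

    def switch(self, conv_id):
        """Flush the outgoing hot slot, then load the incoming slot from
        disk if it exists. Returns None when a cold prefill is required."""
        if self.hot_id is not None and self.hot_id != conv_id:
            self.save_fn(self._path(self.hot_id), self.hot_cache)
            self._on_disk.add(self.hot_id)
        if self.hot_id != conv_id:
            if conv_id in self._on_disk:
                self.hot_cache = self.load_fn(self._path(conv_id))
            else:
                self.hot_cache = None  # never seen: caller prefills from scratch
            self.hot_id = conv_id
        return self.hot_cache
```

Because the outgoing slot is always flushed before the incoming one is loaded, RAM usage stays bounded at a single slot regardless of how many conversations are live.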
Three flush triggers:

- Conversation switch: the outgoing hot slot is flushed before the incoming slot is loaded.
- Idle: the hot slot is saved as soon as the runner has no active tasks (AllTasksComplete).
- Shutdown: a force flush persists the hot slot before the runner exits.
Storage format
Each model gets its own directory via
model_hash (SHA-256[:16] of model_id). No cross-model contamination.

Disk eviction
Stale disk slots are cleaned up automatically after every flush: any slot whose
timestamp is older than 24 hours is deleted. Active conversations have their timestamp refreshed on every flush, so they never expire while in use. A slot's _cache.safetensors, _tokens.safetensors, and _meta.json are all removed together.

Model compatibility
This feature works with any model using the standard
KVCache from mlx-lm, which covers the vast majority of architectures:

- QuantizedKVCache: untested, but should work via the from_state() protocol
- RotatingKVCache (sliding window): untested, may need verification
- ArraysCache / SSM layers (Mamba, Jamba): SSM state snapshots are not persisted to disk; these models fall through gracefully to a fresh cache rather than serving corrupt state.

Configuration
All optional. No configuration required — the feature works out of the box with sensible defaults.
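A hedged sketch of how these settings might be read from the environment; only the variable names and defaults come from the PR description, while the helper name and return shape are illustrative.

```python
import os

def load_disk_cache_config(env=None):
    """Read the disk-persistence settings with their documented defaults.
    Hypothetical helper; the PR's actual config plumbing may differ."""
    env = os.environ if env is None else env
    return {
        # Any value other than "0" leaves persistence enabled (the default).
        "enabled": env.get("EXO_KV_DISK_PERSISTENCE", "1") != "0",
        "path": env.get("EXO_KV_DISK_PATH", os.path.expanduser("~/.exo/kv-cache")),
        "ttl_hours": float(env.get("EXO_KV_DISK_TTL_HOURS", "24")),
        "max_size_gb": float(env.get("EXO_KV_DISK_MAX_SIZE_GB", "500")),
    }
```

With no variables set this returns the documented defaults, so the feature is a no-op to configure, matching the "works out of the box" claim above.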
| Variable | Example | Default |
| --- | --- | --- |
| EXO_KV_DISK_PERSISTENCE | =0 disables disk persistence | 1 (enabled) |
| EXO_KV_DISK_PATH | =/path | ~/.exo/kv-cache |
| EXO_KV_DISK_TTL_HOURS | =48 | 24 |
| EXO_KV_DISK_MAX_SIZE_GB | =200 | 500 |

Before / After
Before (Checkpoint 1 only)

- Conversations sharing a system prompt overwrite each other's slot on every switch, forcing a near-full reprefill.
- The KV cache is RAM-only: restarts and crashes lose all cached state.

After

- Each conversation keeps its own slot (98-99% hit rates); only one hot slot lives in RAM.
- Switches cost ~1-2s (disk swap) instead of ~140s (cold prefill); caches survive runner restarts.
Measured performance

Conversation switch: ~1-2s (disk swap) vs ~140s (cold prefill); prefix cache hit rates of 98-99%.
Testing
Tested over multiple hours on:

- 2x Mac Studio M3 Ultra (512GB each), Thunderbolt 5 RDMA
- Kimi-K2.5 (61 layers), 50k-token conversations
Files changed
- cache.py: model_id param, disk persistence methods, single hot slot enforcement, TTL eviction, disk search fallback in get_kv_cache
- runner.py: model_id passed to KVPrefixCache, flush on idle (AllTasksComplete), force flush on shutdown
- generate.py: cached_coverage guard in _save_prefix_cache
- batch_generate.py: cached_coverage guard (active code path)