
KV cache: per-conversation slot separation + disk persistence/offload #1830

Open
aidiffuser wants to merge 5 commits into exo-explore:main from aidiffuser:kv-cache-disk-persistence

Conversation


@aidiffuser aidiffuser commented Apr 1, 2026

Problem

KVPrefixCache matches slots purely by token prefix overlap. When multiple conversations share a system prompt (typical in multi-channel API gateway setups), the prefix matcher sees 90%+ overlap and overwrites the existing slot on every conversation switch, destroying the previous conversation's cache. Switching to a fresh session collapses a 100k-token conversation to its ~20-30k-token system-prompt match, so returning to it requires a near-full re-prefill. The obvious alternative, keeping multiple slots in RAM, scales linearly with conversation count, length, and model size and quickly becomes prohibitive; with this patch only one slot is active at a time, which prevents OOMs. On top of that, the KV cache is currently RAM-only: restarts and crashes force cold prefills from zero.

With this patch I can hold multiple conversations across multiple channels with my agent running locally, with near-zero prefill downtime: once a conversation has been flushed to disk, switching from one session to another is effectively instant. I think of it as a context extender. I currently hold 3-5 conversations at a time, each around 70k-100k tokens, on a model as large as Kimi K2.5.

Solution

Per-conversation slot separation

A heuristic in _save_prefix_cache distinguishes "same conversation extended" from "different conversation with shared prefix" using an exact prefix check:

cached_len = len(cached_prompt)
is_extension = prefix_hit_length >= cached_len - 1 and cached_coverage >= hit_ratio

A legitimate update (same conversation, new messages appended) matches all cached tokens as a prefix, so prefix_hit_length >= cached_len - 1. Because the new prompt is at least as long as the cached one, cached_coverage >= hit_ratio also holds.

A cross-conversation collision (different channel, shared system prompt) matches only a subset of the cached tokens: even at 98% overlap, the hundreds of unmatched tokens at the end of the cached prompt fail the exact prefix check. The request is correctly routed to a new slot instead of overwriting the existing one.
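For illustration, the check can be expressed as a self-contained function. Names like `is_same_conversation` and the plain token lists are illustrative only; the real logic lives inside `_save_prefix_cache`:

```python
def is_same_conversation(cached_tokens: list[int], new_tokens: list[int]) -> bool:
    """Return True when new_tokens extends cached_tokens (same conversation)."""
    # Length of the longest common prefix between cached and new prompts.
    prefix_hit_length = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        prefix_hit_length += 1

    cached_len = len(cached_tokens)
    # Fraction of the cached prompt that matched, and of the new prompt.
    cached_coverage = prefix_hit_length / max(cached_len, 1)
    hit_ratio = prefix_hit_length / max(len(new_tokens), 1)

    # Same conversation: essentially every cached token is a prefix of the
    # new prompt (off-by-one tolerated for a re-tokenized final token), and
    # the new prompt is at least as long as the cached one.
    return prefix_hit_length >= cached_len - 1 and cached_coverage >= hit_ratio
```

A same-conversation extension matches the full cached prefix and passes; a shared-system-prompt collision leaves a tail of unmatched cached tokens and fails.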

Single hot slot + disk persistence

Only one KV cache slot stays in RAM at a time (the "hot" slot for the active conversation). All other slots are persisted to disk via mlx-lm's save_prompt_cache / load_prompt_cache, which call mx.save_safetensors internally — writing directly from Apple Silicon unified memory to disk with no numpy intermediate, no bf16 casting, and no GPU→CPU copy.

Conversation switch flow:

  1. New conversation arrives, hot slot doesn't match → flush hot slot to disk (~1-2s)
  2. Search disk for a matching slot → load if found (~1-2s)
  3. If not found → fresh prefill (first time only)
  4. Total switch cost: ~2-4s vs ~2-3 minutes cold prefill

Three flush triggers:

  • 15-second idle timer: Saves the hot slot when the runner has no active tasks for 15+ seconds. Protects against data loss from crashes.
  • Conversation switch: Immediate synchronous flush before clearing RAM and loading the new conversation.
  • Shutdown: Force flush on runner shutdown, bypassing the 15-second timer.
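The idle trigger can be sketched as a small, self-contained class. All names here are illustrative; in the actual patch the trigger lives in runner.py and fires on AllTasksComplete:

```python
import time

IDLE_FLUSH_SECONDS = 15  # matches the 15-second idle timer described above


class IdleFlusher:
    """Flush the hot slot once the runner has been idle for 15+ seconds."""

    def __init__(self, flush_fn):
        self.flush_fn = flush_fn
        self.last_activity = time.monotonic()
        self.dirty = False  # True when the hot slot has unflushed updates

    def on_task(self):
        """Called whenever a task runs; resets the idle clock."""
        self.last_activity = time.monotonic()
        self.dirty = True

    def tick(self, now=None):
        """Called periodically; flushes if idle long enough."""
        now = time.monotonic() if now is None else now
        if self.dirty and now - self.last_activity >= IDLE_FLUSH_SECONDS:
            self.flush_fn()
            self.dirty = False  # don't re-flush an unchanged slot
```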

Storage format

~/.exo/kv-cache/<model_hash>/
  ├── slot_0_cache.safetensors      # KV tensors (bf16)
  ├── slot_0_tokens.safetensors     # Token array (int32, ~200 KB)
  ├── slot_0_meta.json              # {model_id, token_count, timestamp}
  ├── slot_1_cache.safetensors
  ├── slot_1_tokens.safetensors
  └── slot_1_meta.json

Each model gets its own directory via model_hash (SHA-256[:16] of model_id). No cross-model contamination.
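The hashing scheme is as described above (SHA-256 of model_id, truncated to 16 hex characters); the helper name is assumed for illustration:

```python
import hashlib


def model_hash(model_id: str) -> str:
    # Illustrative helper: stable 16-hex-char directory name per model.
    return hashlib.sha256(model_id.encode("utf-8")).hexdigest()[:16]
```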

Disk eviction

Stale disk slots are cleaned up automatically after every flush:

  • 24-hour TTL: Any slot with a timestamp older than 24 hours is deleted. Active conversations have their timestamp refreshed on every flush, so they never expire while in use.
  • No slot count cap: Disk space is cheap on NVMe SSDs. 20 concurrent conversations ≈ 100 GB — negligible on multi-TB drives. A hard cap would risk evicting active conversations during testing bursts.
  • Hot slot protection: The currently loaded slot is never evicted, even if its metadata timestamp is stale.
  • Three files per eviction: _cache.safetensors, _tokens.safetensors, and _meta.json are all removed together.

Model compatibility

This feature works with any model using standard KVCache from mlx-lm, which includes the vast majority of architectures:

  • ✅ Standard attention (Llama, Mistral, Qwen, Gemma, Phi, etc.)
  • ✅ Multi-Latent Attention / MLA (DeepSeek V3, Kimi-K2.5)
  • ✅ Grouped Query Attention / GQA (Llama 3, Mistral, etc.)
  • ⚠️ QuantizedKVCache — untested but should work via from_state() protocol
  • ⚠️ RotatingKVCache (sliding window) — untested, may need verification
  • ❌ Models with ArraysCache / SSM layers (Mamba, Jamba) — SSM state snapshots are not persisted to disk. These models fall through gracefully to fresh cache rather than serving corrupt state.

Configuration

All optional. No configuration required — the feature works out of the box with sensible defaults.

| Variable | Default | Effect |
|---|---|---|
| `EXO_KV_DISK_PERSISTENCE=0` | `1` (enabled) | Disable disk persistence entirely (revert to RAM-only behavior) |
| `EXO_KV_DISK_PATH=/path` | `~/.exo/kv-cache` | Custom storage location (e.g., faster NVMe, larger drive) |
| `EXO_KV_DISK_TTL_HOURS=48` | `24` | Hours before stale slot eviction |
| `EXO_KV_DISK_MAX_SIZE_GB=200` | `500` | Evict oldest slots when directory exceeds this limit |
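One plausible way to read these variables (the variable names and defaults come from the table above; the helper itself is an illustrative sketch, not the actual exo code):

```python
import os


def load_kv_disk_config(env=os.environ) -> dict:
    """Read the KV-disk settings, falling back to the documented defaults."""
    return {
        "enabled": env.get("EXO_KV_DISK_PERSISTENCE", "1") != "0",
        "path": env.get("EXO_KV_DISK_PATH", os.path.expanduser("~/.exo/kv-cache")),
        "ttl_hours": float(env.get("EXO_KV_DISK_TTL_HOURS", "24")),
        "max_size_gb": float(env.get("EXO_KV_DISK_MAX_SIZE_GB", "500")),
    }
```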

Before / After

Before (Checkpoint 1 only)

Channel A:  50k tokens → saved at slot 0
Channel B:  35k tokens → matches slot 0 at 92% → OVERWRITES slot 0
Channel A:  50k tokens → slot 0 is now Channel B → full prefill (~2-3 minutes)
Runner restart: all cache lost → full prefill for everything

After

Channel A:  50k tokens → hot slot (index 0), saved to disk as slot_0
Channel B:  35k tokens → flush slot_0 to disk, load or create slot_1, 99.2% hit
Channel A:  50k tokens → flush slot_1 to disk, load slot_0 from disk, 99.1% hit
Runner restart: load from disk → 99% hit, no cold prefill

Measured performance

| Metric | Before | After |
|---|---|---|
| Same-conversation hit rate | 93-99% | 98-99% |
| Cross-conversation hit rate | 0% (overwritten) | 98-99% (separate slots) |
| Conversation switch cost | ~2-3 minutes (full 50k prefill) | ~2-4s (disk swap) |
| Cache survives restart | No | Yes |

Testing

Tested over multiple hours on:

  • Hardware: 2× Mac Studio M3 Ultra (512 GB unified memory each), Thunderbolt 5 RDMA
  • Model: Kimi-K2.5 (DeepSeek V3 architecture, MLA attention, 61 layers, bf16)
  • Conversations: 4 concurrent Discord channels with 50k-100k token contexts; zero prefill, instant switching
  • Scenarios validated:
    • Repeated channel switching with 98-99% hit rates on both channels
    • Runner reload with immediate cache restoration from disk
    • 15-second idle flush firing correctly between requests
    • Force flush on shutdown preserving cache state
    • 24-hour TTL eviction of stale slots

Files changed

| File | Changes |
|---|---|
| cache.py | model_id param, disk persistence methods, single hot slot enforcement, TTL eviction, disk search fallback in get_kv_cache |
| runner.py | Pass model_id to KVPrefixCache, flush on idle (AllTasksComplete), force flush on shutdown |
| generate.py | cached_coverage guard in _save_prefix_cache |
| batch_generate.py | Same cached_coverage guard (active code path) |

@aidiffuser aidiffuser force-pushed the kv-cache-disk-persistence branch 2 times, most recently from f8164ef to 60f4fa2 on April 1, 2026 at 21:56
- Per-conversation slot separation: exact prefix check prevents
  cross-conversation slot overwriting. Conversations sharing a system
  prompt now maintain separate cache slots (98-99% hit rates).

- Single hot slot in RAM with disk persistence via mlx-lm
  save_prompt_cache/load_prompt_cache. Conversation switches cost
  ~1-2s (disk swap) instead of ~140s (cold prefill). Caches survive
  runner restarts. 24h TTL disk eviction.

Tested on 2x Mac Studio M3 Ultra (512GB each) with Kimi-K2.5
(61 layers, 50k token conversations, Thunderbolt 5 RDMA).
@aidiffuser aidiffuser force-pushed the kv-cache-disk-persistence branch from 60f4fa2 to dfc6fc2 on April 1, 2026 at 21:58
@aidiffuser aidiffuser changed the title from "KV cache: per-conversation slot separation + disk persistence" to "KV cache: per-conversation slot separation + disk persistence/offload" on Apr 2, 2026
@aidiffuser aidiffuser force-pushed the kv-cache-disk-persistence branch from 1432d2a to cc0febb on April 3, 2026 at 07:34
aidiffuser added 3 commits April 7, 2026 13:00
- Size-based eviction: after TTL pass, evicts oldest slots if disk
  cache directory exceeds 500GB. Hot slot is never evicted.

- Force flush on idle: saves hot slot immediately when runner has no
  active tasks, ensuring completed conversations are always persisted.
- EXO_KV_DISK_PERSISTENCE=0 to disable disk persistence entirely
- EXO_KV_DISK_PATH to set custom storage location
- EXO_KV_DISK_TTL_HOURS to configure stale slot eviction (default 24h)
- EXO_KV_DISK_MAX_SIZE_GB to set disk size cap (default 500GB)

All optional with sensible defaults. No behavior change without setting them.
…ures

Some models (e.g., GLM-4) produce cache layers with empty tensors.
mx.save_safetensors rejects zero-length arrays. Replaced save_prompt_cache
with inline logic that filters empty arrays before saving. Load path
handles missing keys gracefully via from_state() reconstruction.
