
KV cache: per-conversation slot separation + disk persistence/offload #1830

Open
aidiffuser wants to merge 5 commits into exo-explore:main from aidiffuser:kv-cache-disk-persistence

Conversation


@aidiffuser aidiffuser commented Apr 1, 2026

Problem

KVPrefixCache matches slots purely by token prefix overlap. When multiple conversations share a system prompt (typical in multi-channel API gateway setups), the prefix matcher sees 90%+ overlap and overwrites the existing slot on every conversation switch, destroying the previous conversation's cache. Switching to a fresh session collapses a 100k-token conversation to its ~20-30k-token system-prompt match, so returning to it requires a near-full re-prefill. The obvious alternative, keeping multiple slots in RAM, scales linearly with conversation count, length, and model size and quickly becomes prohibitive; with this patch only one slot is active at a time, which prevents OOMs. On top of that, the KV cache is currently RAM-only: restarts and crashes force cold prefills from zero.

With this patch I can hold multiple conversations across multiple channels with my agent running locally, with near-zero prefill downtime: once a conversation has been flushed to disk, switching from one session to another is effectively instant. I think of it as a context extender. I currently hold 3-5 conversations at a time, each around 70k-100k tokens, on a model as large as Kimi K2.5.

Solution

Per-conversation slot separation

A heuristic in _save_prefix_cache distinguishes "same conversation extended" from "different conversation with shared prefix" using an exact prefix check:

cached_len = len(cached_prompt)
is_extension = prefix_hit_length >= cached_len - 1 and cached_coverage >= hit_ratio

A legitimate update (same conversation, new messages appended) matches all cached tokens as a prefix, so prefix_hit_length >= cached_len - 1. Because the new prompt is at least as long as the cached one, cached_coverage >= hit_ratio also holds.

A cross-conversation collision (different channel, shared system prompt) matches only a subset of the cached tokens: even at 98% overlap, the hundreds of unmatched tokens at the end of the cached prompt fail the exact prefix check. The request is correctly routed to a new slot instead of overwriting the existing one.
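For illustration, the check can be expressed as a self-contained function. Names like `is_same_conversation` and the plain token lists are illustrative only; the real logic lives inside `_save_prefix_cache`:

```python
def is_same_conversation(cached_tokens: list[int], new_tokens: list[int]) -> bool:
    """Return True when new_tokens extends cached_tokens (same conversation)."""
    # Length of the longest common prefix between cached and new prompts.
    prefix_hit_length = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        prefix_hit_length += 1

    cached_len = len(cached_tokens)
    # Fraction of the cached prompt that matched, and of the new prompt.
    cached_coverage = prefix_hit_length / max(cached_len, 1)
    hit_ratio = prefix_hit_length / max(len(new_tokens), 1)

    # Same conversation: essentially every cached token is a prefix of the
    # new prompt (off-by-one tolerated for a re-tokenized final token), and
    # the new prompt is at least as long as the cached one.
    return prefix_hit_length >= cached_len - 1 and cached_coverage >= hit_ratio
```

A same-conversation extension matches the full cached prefix and passes; a shared-system-prompt collision leaves a tail of unmatched cached tokens and fails.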

Single hot slot + disk persistence

Only one KV cache slot stays in RAM at a time (the "hot" slot for the active conversation). All other slots are persisted to disk via mlx-lm's save_prompt_cache / load_prompt_cache, which call mx.save_safetensors internally — writing directly from Apple Silicon unified memory to disk with no numpy intermediate, no bf16 casting, and no GPU→CPU copy.

Conversation switch flow:

  1. New conversation arrives, hot slot doesn't match → flush hot slot to disk (~1-2s)
  2. Search disk for a matching slot → load if found (~1-2s)
  3. If not found → fresh prefill (first time only)
  4. Total switch cost: ~2-4s vs ~2-3 minutes cold prefill

Three flush triggers:

  • 15-second idle timer: Saves the hot slot when the runner has no active tasks for 15+ seconds. Protects against data loss from crashes.
  • Conversation switch: Immediate synchronous flush before clearing RAM and loading the new conversation.
  • Shutdown: Force flush on runner shutdown, bypassing the 15-second timer.
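The idle trigger can be sketched as a small, self-contained class. All names here are illustrative; in the actual patch the trigger lives in runner.py and fires on AllTasksComplete:

```python
import time

IDLE_FLUSH_SECONDS = 15  # matches the 15-second idle timer described above


class IdleFlusher:
    """Flush the hot slot once the runner has been idle for 15+ seconds."""

    def __init__(self, flush_fn):
        self.flush_fn = flush_fn
        self.last_activity = time.monotonic()
        self.dirty = False  # True when the hot slot has unflushed updates

    def on_task(self):
        """Called whenever a task runs; resets the idle clock."""
        self.last_activity = time.monotonic()
        self.dirty = True

    def tick(self, now=None):
        """Called periodically; flushes if idle long enough."""
        now = time.monotonic() if now is None else now
        if self.dirty and now - self.last_activity >= IDLE_FLUSH_SECONDS:
            self.flush_fn()
            self.dirty = False  # don't re-flush an unchanged slot
```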

Storage format

~/.exo/kv-cache/<model_hash>/
  ├── slot_0_cache.safetensors      # KV tensors (bf16)
  ├── slot_0_tokens.safetensors     # Token array (int32, ~200 KB)
  ├── slot_0_meta.json              # {model_id, token_count, timestamp}
  ├── slot_1_cache.safetensors
  ├── slot_1_tokens.safetensors
  └── slot_1_meta.json

Each model gets its own directory via model_hash (SHA-256[:16] of model_id). No cross-model contamination.
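The hashing scheme is as described above (SHA-256 of model_id, truncated to 16 hex characters); the helper name is assumed for illustration:

```python
import hashlib


def model_hash(model_id: str) -> str:
    # Illustrative helper: stable 16-hex-char directory name per model.
    return hashlib.sha256(model_id.encode("utf-8")).hexdigest()[:16]
```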

Disk eviction

Stale disk slots are cleaned up automatically after every flush:

  • 24-hour TTL: Any slot with a timestamp older than 24 hours is deleted. Active conversations have their timestamp refreshed on every flush, so they never expire while in use.
  • No slot count cap: Disk space is cheap on NVMe SSDs. 20 concurrent conversations ≈ 100 GB — negligible on multi-TB drives. A hard cap would risk evicting active conversations during testing bursts.
  • Hot slot protection: The currently loaded slot is never evicted, even if its metadata timestamp is stale.
  • Three files per eviction: _cache.safetensors, _tokens.safetensors, and _meta.json are all removed together.

Model compatibility

This feature works with any model using standard KVCache from mlx-lm, which includes the vast majority of architectures:

  • ✅ Standard attention (Llama, Mistral, Qwen, Gemma, Phi, etc.)
  • ✅ Multi-Latent Attention / MLA (DeepSeek V3, Kimi-K2.5)
  • ✅ Grouped Query Attention / GQA (Llama 3, Mistral, etc.)
  • ⚠️ QuantizedKVCache — untested but should work via from_state() protocol
  • ⚠️ RotatingKVCache (sliding window) — untested, may need verification
  • ❌ Models with ArraysCache / SSM layers (Mamba, Jamba) — SSM state snapshots are not persisted to disk. These models fall through gracefully to fresh cache rather than serving corrupt state.

Configuration

All optional. No configuration required — the feature works out of the box with sensible defaults.

| Variable | Default | Effect |
|---|---|---|
| `EXO_KV_DISK_PERSISTENCE=0` | `1` (enabled) | Disable disk persistence entirely (revert to RAM-only behavior) |
| `EXO_KV_DISK_PATH=/path` | `~/.exo/kv-cache` | Custom storage location (e.g., faster NVMe, larger drive) |
| `EXO_KV_DISK_TTL_HOURS=48` | `24` | Hours before stale slot eviction |
| `EXO_KV_DISK_MAX_SIZE_GB=200` | `500` | Evict oldest slots when directory exceeds this limit |
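One plausible way to read these variables (the variable names and defaults come from the table above; the helper itself is an illustrative sketch, not the actual exo code):

```python
import os


def load_kv_disk_config(env=os.environ) -> dict:
    """Read the KV-disk settings, falling back to the documented defaults."""
    return {
        "enabled": env.get("EXO_KV_DISK_PERSISTENCE", "1") != "0",
        "path": env.get("EXO_KV_DISK_PATH", os.path.expanduser("~/.exo/kv-cache")),
        "ttl_hours": float(env.get("EXO_KV_DISK_TTL_HOURS", "24")),
        "max_size_gb": float(env.get("EXO_KV_DISK_MAX_SIZE_GB", "500")),
    }
```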

Before / After

Before (Checkpoint 1 only)

Channel A:  50k tokens → saved at slot 0
Channel B:  35k tokens → matches slot 0 at 92% → OVERWRITES slot 0
Channel A:  50k tokens → slot 0 is now Channel B → full prefill (~2-3 minutes)
Runner restart: all cache lost → full prefill for everything

After

Channel A:  50k tokens → hot slot (index 0), saved to disk as slot_0
Channel B:  35k tokens → flush slot_0 to disk, load or create slot_1, 99.2% hit
Channel A:  50k tokens → flush slot_1 to disk, load slot_0 from disk, 99.1% hit
Runner restart: load from disk → 99% hit, no cold prefill

Measured performance

| Metric | Before | After |
|---|---|---|
| Same-conversation hit rate | 93-99% | 98-99% |
| Cross-conversation hit rate | 0% (overwritten) | 98-99% (separate slots) |
| Conversation switch cost | ~2-3 minutes (full 50k prefill) | ~2-4s (disk swap) |
| Cache survives restart | No | Yes |

Testing

Tested over multiple hours on:

  • Hardware: 2× Mac Studio M3 Ultra (512 GB unified memory each), Thunderbolt 5 RDMA
  • Model: Kimi-K2.5 (DeepSeek V3 architecture, MLA attention, 61 layers, bf16)
  • Conversations: 4 concurrent Discord channels with 50k-100k token contexts; zero prefill, instant switching
  • Scenarios validated:
    • Repeated channel switching with 98-99% hit rates on both channels
    • Runner reload with immediate cache restoration from disk
    • 15-second idle flush firing correctly between requests
    • Force flush on shutdown preserving cache state
    • 24-hour TTL eviction of stale slots

Files changed

| File | Changes |
|---|---|
| cache.py | model_id param, disk persistence methods, single hot slot enforcement, TTL eviction, disk search fallback in get_kv_cache |
| runner.py | Pass model_id to KVPrefixCache, flush on idle (AllTasksComplete), force flush on shutdown |
| generate.py | cached_coverage guard in _save_prefix_cache |
| batch_generate.py | Same cached_coverage guard (active code path) |

@aidiffuser aidiffuser force-pushed the kv-cache-disk-persistence branch 2 times, most recently from f8164ef to 60f4fa2 on April 1, 2026 at 21:56
- Per-conversation slot separation: exact prefix check prevents
  cross-conversation slot overwriting. Conversations sharing a system
  prompt now maintain separate cache slots (98-99% hit rates).

- Single hot slot in RAM with disk persistence via mlx-lm
  save_prompt_cache/load_prompt_cache. Conversation switches cost
  ~1-2s (disk swap) instead of ~140s (cold prefill). Caches survive
  runner restarts. 24h TTL disk eviction.

Tested on 2x Mac Studio M3 Ultra (512GB each) with Kimi-K2.5
(61 layers, 50k token conversations, Thunderbolt 5 RDMA).
@aidiffuser aidiffuser force-pushed the kv-cache-disk-persistence branch from 60f4fa2 to dfc6fc2 on April 1, 2026 at 21:58
@aidiffuser aidiffuser changed the title from "KV cache: per-conversation slot separation + disk persistence" to "KV cache: per-conversation slot separation + disk persistence/offload" on Apr 2, 2026
@aidiffuser aidiffuser force-pushed the kv-cache-disk-persistence branch from 1432d2a to cc0febb on April 3, 2026 at 07:34
aidiffuser added 3 commits April 7, 2026 13:00
- Size-based eviction: after TTL pass, evicts oldest slots if disk
  cache directory exceeds 500GB. Hot slot is never evicted.

- Force flush on idle: saves hot slot immediately when runner has no
  active tasks, ensuring completed conversations are always persisted.
- EXO_KV_DISK_PERSISTENCE=0 to disable disk persistence entirely
- EXO_KV_DISK_PATH to set custom storage location
- EXO_KV_DISK_TTL_HOURS to configure stale slot eviction (default 24h)
- EXO_KV_DISK_MAX_SIZE_GB to set disk size cap (default 500GB)

All optional with sensible defaults. No behavior change without setting them.
…ures

Some models (e.g., GLM-4) produce cache layers with empty tensors.
mx.save_safetensors rejects zero-length arrays. Replaced save_prompt_cache
with inline logic that filters empty arrays before saving. Load path
handles missing keys gracefully via from_state() reconstruction.
