
Fix unbounded KV cache memory growth by wiring up RotatingKVCache #1860

Open
humanrouter wants to merge 1 commit into exo-explore:main from humanrouter:fix/wire-up-rotating-kv-cache

Conversation

@humanrouter

Summary

  • Bug: MAX_KV_SIZE and KEEP_KV_SIZE constants are defined in constants.py but never passed to make_kv_cache() — all call sites use the default unbounded KVCache, causing memory to grow without limit as conversations accumulate in the KV prefix cache.
  • Impact: On a 256GB Mac Studio, this caused a single exo worker process to consume 147GB of memory (40GB model weights + 107GB accumulated KV caches from past conversations).
  • Fix: Wire MAX_KV_SIZE and KEEP_KV_SIZE through all make_kv_cache() call sites so RotatingKVCache is actually used with a bounded context window.
  • Updated defaults: MAX_KV_SIZE 3200→131072 (128K tokens, matching modern model context lengths), KEEP_KV_SIZE 1600→4096 (preserves system prompt).
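The updated defaults above correspond to something like the following in constants.py (a sketch based on the names in the PR description; the surrounding file contents are an assumption):

```python
# Sketch of the updated values in constants.py (names from the PR description).
MAX_KV_SIZE = 131072   # 128K tokens: cap on each conversation's KV cache
KEEP_KV_SIZE = 4096    # tokens pinned at the head of the rotating window,
                       # so the system prompt survives rotation
```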

Details

The make_kv_cache() function already supports max_kv_size and keep parameters, and uses RotatingKVCache when they are provided. However, none of the 5 call sites across generate.py, batch_generate.py, and cache.py were passing these values — they all called make_kv_cache(model) with no size arguments, defaulting to unbounded KVCache.

The existing EXO_MEMORY_THRESHOLD env var controls LRU eviction of entire conversations from the prefix cache, but individual conversations' KV caches could still grow without bound. With this fix, each conversation's KV cache is capped at 128K tokens via the rotating window.
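The rotating-window policy can be illustrated with a toy list-based model (the real RotatingKVCache in mlx_lm operates on key/value arrays; this only demonstrates the capping and pinning behavior):

```python
def rotating_append(tokens, new, max_size, keep):
    """Toy model of the rotating window: the first `keep` tokens are pinned,
    and once the buffer reaches max_size, the oldest unpinned token is
    dropped to make room for each new one."""
    for t in new:
        if len(tokens) >= max_size:
            del tokens[keep]   # evict the oldest token after the pinned prefix
        tokens.append(t)
    return tokens

# Append 10 tokens into a window of 6 with the first 2 pinned:
buf = rotating_append([], range(10), max_size=6, keep=2)
```

The buffer never grows past max_size, and the pinned prefix (here tokens 0 and 1, standing in for the system prompt) is never evicted, which is how a 128K-token cap with KEEP_KV_SIZE=4096 preserves the system prompt.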

Test plan

  • Verified on Mac Studio (256GB) with Qwen 3.5 27B — memory usage dropped from 195GB to ~28GB after restart
  • Inference requests complete successfully with rotating cache
  • KV prefix cache hit/miss logic works as before (no behavioral change for conversations under 128K tokens)

🤖 Generated with Claude Code

MAX_KV_SIZE and KEEP_KV_SIZE constants exist in constants.py but are
never passed to make_kv_cache() — all call sites default to unbounded
KVCache. This causes memory to grow without limit as conversations
accumulate in the KV prefix cache, leading to 100GB+ memory usage on
machines with large unified memory (observed 147GB on a 256GB Mac Studio).

Pass MAX_KV_SIZE and KEEP_KV_SIZE through all make_kv_cache() call sites
in generate.py, batch_generate.py, and cache.py so that RotatingKVCache
is used with a bounded context window.

Update default values:
- MAX_KV_SIZE: 3200 → 131072 (128K tokens, matching modern model context)
- KEEP_KV_SIZE: 1600 → 4096 (preserves system prompt in rotating window)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
