
Fix unbounded KV cache memory growth by wiring up RotatingKVCache #1860

Open
humanrouter wants to merge 1 commit into exo-explore:main from humanrouter:fix/wire-up-rotating-kv-cache

Conversation

@humanrouter

Summary

  • Bug: MAX_KV_SIZE and KEEP_KV_SIZE constants are defined in constants.py but never passed to make_kv_cache() — all call sites use the default unbounded KVCache, causing memory to grow without limit as conversations accumulate in the KV prefix cache.
  • Impact: On a 256GB Mac Studio, this caused a single exo worker process to consume 147GB of memory (40GB model weights + 107GB accumulated KV caches from past conversations).
  • Fix: Wire MAX_KV_SIZE and KEEP_KV_SIZE through all make_kv_cache() call sites so RotatingKVCache is actually used with a bounded context window.
  • Updated defaults: MAX_KV_SIZE 3200→131072 (128K tokens, matching modern model context lengths), KEEP_KV_SIZE 1600→4096 (preserves system prompt).
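The updated defaults above correspond to something like the following in constants.py (a sketch based on the names in the PR description; the surrounding file contents are an assumption):

```python
# Sketch of the updated values in constants.py (names from the PR description).
MAX_KV_SIZE = 131072   # 128K tokens: cap on each conversation's KV cache
KEEP_KV_SIZE = 4096    # tokens pinned at the head of the rotating window,
                       # so the system prompt survives rotation
```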

Details

The make_kv_cache() function already supports max_kv_size and keep parameters, and uses RotatingKVCache when they are provided. However, none of the 5 call sites across generate.py, batch_generate.py, and cache.py were passing these values — they all called make_kv_cache(model) with no size arguments, defaulting to unbounded KVCache.

The existing EXO_MEMORY_THRESHOLD env var controls LRU eviction of entire conversations from the prefix cache, but individual conversations' KV caches could still grow without bound. With this fix, each conversation's KV cache is capped at 128K tokens via the rotating window.
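The rotating-window policy can be illustrated with a toy list-based model (the real RotatingKVCache in mlx_lm operates on key/value arrays; this only demonstrates the capping and pinning behavior):

```python
def rotating_append(tokens, new, max_size, keep):
    """Toy model of the rotating window: the first `keep` tokens are pinned,
    and once the buffer reaches max_size, the oldest unpinned token is
    dropped to make room for each new one."""
    for t in new:
        if len(tokens) >= max_size:
            del tokens[keep]   # evict the oldest token after the pinned prefix
        tokens.append(t)
    return tokens

# Append 10 tokens into a window of 6 with the first 2 pinned:
buf = rotating_append([], range(10), max_size=6, keep=2)
```

The buffer never grows past max_size, and the pinned prefix (here tokens 0 and 1, standing in for the system prompt) is never evicted, which is how a 128K-token cap with KEEP_KV_SIZE=4096 preserves the system prompt.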

Test plan

  • Verified on Mac Studio (256GB) with Qwen 3.5 27B — memory usage dropped from 195GB to ~28GB after restart
  • Inference requests complete successfully with rotating cache
  • KV prefix cache hit/miss logic works as before (no behavioral change for conversations under 128K tokens)

🤖 Generated with Claude Code

MAX_KV_SIZE and KEEP_KV_SIZE constants exist in constants.py but are
never passed to make_kv_cache() — all call sites default to unbounded
KVCache. This causes memory to grow without limit as conversations
accumulate in the KV prefix cache, leading to 100GB+ memory usage on
machines with large unified memory (observed 147GB on a 256GB Mac Studio).

Pass MAX_KV_SIZE and KEEP_KV_SIZE through all make_kv_cache() call sites
in generate.py, batch_generate.py, and cache.py so that RotatingKVCache
is used with a bounded context window.

Update default values:
- MAX_KV_SIZE: 3200 → 131072 (128K tokens, matching modern model context)
- KEEP_KV_SIZE: 1600 → 4096 (preserves system prompt in rotating window)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
