Fix unbounded KV cache memory growth by wiring up RotatingKVCache #1860
Open
humanrouter wants to merge 1 commit into exo-explore:main from
Conversation
MAX_KV_SIZE and KEEP_KV_SIZE constants exist in constants.py but are never passed to make_kv_cache(); all call sites default to the unbounded KVCache. This causes memory to grow without limit as conversations accumulate in the KV prefix cache, leading to 100GB+ memory usage on machines with large unified memory (147GB observed on a 256GB Mac Studio).

Pass MAX_KV_SIZE and KEEP_KV_SIZE through all make_kv_cache() call sites in generate.py, batch_generate.py, and cache.py so that RotatingKVCache is used with a bounded context window.

Update default values:
- MAX_KV_SIZE: 3200 → 131072 (128K tokens, matching modern model context)
- KEEP_KV_SIZE: 1600 → 4096 (preserves system prompt in rotating window)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
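The wiring change described above can be sketched as follows. The names mirror the PR (make_kv_cache, KVCache, RotatingKVCache, MAX_KV_SIZE, KEEP_KV_SIZE), but the classes and signatures here are simplified stand-ins for illustration, not exo's actual implementation:

```python
# Hypothetical sketch of the fix: the factory already supports bounded
# caches, but call sites never passed the size arguments.
from dataclasses import dataclass
from typing import Optional

MAX_KV_SIZE = 131072  # new default from this PR (128K tokens)
KEEP_KV_SIZE = 4096   # new default from this PR (pins the system prompt)

@dataclass
class KVCache:
    """Stand-in for the unbounded cache: grows with every token."""
    pass

@dataclass
class RotatingKVCache:
    """Stand-in for the bounded cache: at most max_size entries,
    with the first `keep` entries preserved."""
    max_size: int
    keep: int

def make_kv_cache(model, max_kv_size: Optional[int] = None, keep: int = 0):
    # The bug: every call site relied on the defaults, so this branch
    # always returned the unbounded KVCache.
    if max_kv_size is None:
        return KVCache()
    return RotatingKVCache(max_size=max_kv_size, keep=keep)

# Before the fix (all five call sites): unbounded growth.
before = make_kv_cache(model=None)
# After the fix: a bounded rotating window.
after = make_kv_cache(model=None, max_kv_size=MAX_KV_SIZE, keep=KEEP_KV_SIZE)
```

The fix is deliberately minimal: no factory signature changes, only the arguments passed at each call site.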
Summary
MAX_KV_SIZE and KEEP_KV_SIZE constants are defined in constants.py but never passed to make_kv_cache(); all call sites use the default unbounded KVCache, causing memory to grow without limit as conversations accumulate in the KV prefix cache.

Pass MAX_KV_SIZE and KEEP_KV_SIZE through all make_kv_cache() call sites so RotatingKVCache is actually used with a bounded context window.

Update defaults: MAX_KV_SIZE 3200 → 131072 (128K tokens, matching modern model context lengths), KEEP_KV_SIZE 1600 → 4096 (preserves system prompt).

Details
The make_kv_cache() function already supports max_kv_size and keep parameters, and uses RotatingKVCache when they are provided. However, none of the 5 call sites across generate.py, batch_generate.py, and cache.py were passing these values: they all called make_kv_cache(model) with no size arguments, defaulting to the unbounded KVCache.

The existing EXO_MEMORY_THRESHOLD env var controls LRU eviction of entire conversations from the prefix cache, but individual conversations' KV caches could still grow without bound. With this fix, each conversation's KV cache is capped at 128K tokens via the rotating window.

Test plan
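As an illustrative check of the rotating-window behavior, here is a minimal sketch of the general technique (an assumption about how such a cache bounds memory, not exo's or MLX's actual RotatingKVCache code): once the buffer is full, new tokens overwrite the oldest slots in the rotating region while the first `keep` entries stay pinned.

```python
MAX = 8   # tiny window for illustration (the PR uses 131072)
KEEP = 2  # pinned prefix (the PR uses 4096)

class RotatingWindow:
    """Bounded buffer: the first `keep` slots are pinned (e.g. the
    system prompt); the remaining slots form a rotating region."""
    def __init__(self, max_size: int, keep: int):
        self.max_size = max_size
        self.keep = keep
        self.entries: list[int] = []
        self._next = keep  # next rotating slot to overwrite

    def append(self, token: int) -> None:
        if len(self.entries) < self.max_size:
            self.entries.append(token)
            return
        self.entries[self._next] = token
        self._next += 1
        if self._next == self.max_size:
            self._next = self.keep  # wrap within the rotating region

window = RotatingWindow(MAX, KEEP)
for tok in range(20):
    window.append(tok)
print(window.entries)  # → [0, 1, 14, 15, 16, 17, 18, 19]
```

After 20 appends the buffer still holds only 8 entries: the pinned prefix (tokens 0 and 1) plus the most recent rotating tokens, which is the memory-bounding property the PR relies on.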
🤖 Generated with Claude Code