fix: force gc + clear_cache after KV prefix cache eviction #1832

Open
adurham wants to merge 1 commit into exo-explore:main from adurham:fix/gc-after-kv-eviction
adurham commented Apr 2, 2026

Summary

  • After `KVPrefixCache` evicts LRU entries, the MLX Metal buffers stay allocated until Python's GC runs
  • This leaks ~3-4 GB between long-context requests, which lowers the effective context ceiling for back-to-back requests
  • Calling `gc.collect()` and then `mx.clear_cache()` right after eviction frees the Metal buffers promptly
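
A minimal sketch of the pattern described above, not the actual diff. `PrefixCacheSketch`, its `max_entries` threshold, and the injected `clear_cache` callable are hypothetical names used for illustration; the real `KVPrefixCache` evicts under memory pressure rather than by entry count, and its internal layout may differ. The callable is injected so the sketch runs without MLX installed.

```python
import gc
from collections import OrderedDict
from typing import Callable, Optional


class PrefixCacheSketch:
    """Hypothetical stand-in for KVPrefixCache (assumption, not the real class)."""

    def __init__(
        self,
        max_entries: int,
        clear_cache: Optional[Callable[[], None]] = None,
    ) -> None:
        self.max_entries = max_entries
        self._entries: "OrderedDict[str, object]" = OrderedDict()
        # In an MLX process this would be mx.clear_cache; the default is a no-op
        # so the sketch stays runnable anywhere.
        self._clear_cache = clear_cache or (lambda: None)

    def put(self, key: str, kv: object) -> None:
        self._entries[key] = kv
        self._entries.move_to_end(key)  # mark as most recently used
        self._evict_if_needed()

    def _evict_if_needed(self) -> int:
        evicted = 0
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # drop the least recently used entry
            evicted += 1
        if evicted:
            # Once per eviction cycle, not per token: collect unreachable
            # Python objects first, then release the allocator's cached buffers.
            gc.collect()
            self._clear_cache()
        return evicted
```

In an MLX process the cache would be constructed with `clear_cache=mx.clear_cache` (after `import mlx.core as mx`). Running `gc.collect()` before `clear_cache()` is intended to let the evicted arrays be released first, so their buffers are in the allocator cache when it is cleared.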

Test plan

  • Measured on 2-node PP cluster with Qwen3.5-397B-A17B-4bit at 63K context
  • Before: 108.88 GB retained after eviction (3.78 GB above baseline)
  • After: 105.48 GB retained after eviction (0.38 GB above baseline; the remainder is draft model KV + minor overhead)
  • `gc.collect()` adds ~2-3 ms of latency and runs once per eviction cycle (not per token)
  • Verify with `uv run pytest`
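
The figures above can be cross-checked with a few lines of arithmetic: both measurements imply the same baseline, and the difference between them is the memory that the fix releases. The sketch below uses only the numbers reported in this test plan; the variable names and the `time_gc_collect_ms` helper are illustrative assumptions, and the helper shows just one possible way to time a single `gc.collect()` call on a given machine.

```python
import gc
import time

# Values reported in the test plan (GB).
before_retained = 108.88
before_over_baseline = 3.78
after_retained = 105.48
after_over_baseline = 0.38

# Both runs should imply the same baseline memory footprint.
baseline_from_before = round(before_retained - before_over_baseline, 2)
baseline_from_after = round(after_retained - after_over_baseline, 2)

# Memory released by the fix, and the fraction of the excess it removes.
freed_gb = round(before_retained - after_retained, 2)
excess_reduction = round(1 - after_over_baseline / before_over_baseline, 3)


def time_gc_collect_ms() -> float:
    """Wall-clock cost of one full gc.collect() call, in milliseconds."""
    start = time.perf_counter()
    gc.collect()
    return (time.perf_counter() - start) * 1000.0


if __name__ == "__main__":
    print(f"baseline: {baseline_from_before} GB / {baseline_from_after} GB")
    print(f"freed: {freed_gb} GB ({excess_reduction:.1%} of the excess)")
    print(f"gc.collect(): {time_gc_collect_ms():.2f} ms")
```

The timing from this helper depends on the size of the Python heap in the process where it runs, so it is only comparable to the ~2-3 ms figure when measured inside the inference worker itself.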

🤖 Generated with Claude Code

When `KVPrefixCache` evicts LRU entries under memory pressure, the
Python list `pop()` removes references but the underlying MLX Metal
buffers stay allocated until Python's garbage collector runs. This
causes ~3-4 GB of leaked memory between long-context requests, reducing
the effective context ceiling for back-to-back requests.

Adding `gc.collect()` + `mx.clear_cache()` after eviction ensures Metal
buffers are freed promptly. Measured: 3.78 GB leak reduced to 0.38 GB
(draft model KV + minor Python heap overhead).

The `gc.collect()` call adds a few milliseconds of latency but only runs
once per eviction cycle (not per token), so the impact on generation
throughput is negligible.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>