Add prefix cache flag to exo bench by rltakashige · Pull Request #1888 · exo-explore/exo

rltakashige · 2026-04-13T22:38:00Z

Motivation

When using Exo-Bench extensively, there are many cases where prefix caching could speed up the benchmarks, especially when the focus is on token generation.

At the same time, it's clear that prefix caching decode tokens is not very useful in most current scenarios. Surprisingly, even for non-thinking models, the chat template formats a continued conversation in a way that makes the existing cache ineffective.

We already (slightly accidentally) do this for the batch generator - we should do it for the sequential generator too.
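A minimal sketch of why caching decode tokens rarely helps, assuming the cache is matched by longest common token prefix. The token ids, the helper name `common_prefix_length`, and the way the second turn is re-templated are all hypothetical and only illustrate the effect described above; this is not the exo implementation.

```python
def common_prefix_length(cached: list[int], prompt: list[int]) -> int:
    """Count the leading tokens shared by a cached sequence and a new prompt."""
    n = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        n += 1
    return n


# Hypothetical token ids, not real tokenizer output.
turn1_prompt = [1, 10, 11, 12, 2]   # templated first user turn
turn1_decoded = [20, 21, 22, 3]     # tokens produced during decode

# Assumption: when the conversation is re-templated for the second turn,
# the assistant reply is rendered differently from the raw decoded tokens.
turn2_prompt = turn1_prompt + [30, 21, 22, 4, 1, 13, 14, 2]

cache_with_decode = turn1_prompt + turn1_decoded  # saved at decode time
cache_prefill_only = turn1_prompt                 # saved at prefill time

hit_with_decode = common_prefix_length(cache_with_decode, turn2_prompt)
hit_prefill_only = common_prefix_length(cache_prefill_only, turn2_prompt)
```

Under these assumptions both caches match only the original prompt tokens, so storing the decoded tokens adds memory without extending the hit.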

Changes

exo bench can now be sped up with a flag that enables prefix caching. For the most accurate pp (prompt processing) results it is better to leave the flag off, but it speeds up tg (token generation) and large benchmark runs significantly.
Updated the methodology to match.

Test Plan

Manual Testing

Tested on many configurations; the difference in results is negligible, even with multiple --pp options.

rltakashige · 2026-04-13T22:39:42Z

src/exo/worker/engines/mlx/generator/batch_generate.py

                    f"cached ({100 * prefix_hit_length / len(all_prompt_tokens):.1f}%)"
                )
                prompt_tokens = remaining_tokens
-            else:


We don't need to do this twice
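For context, the lines quoted above report the cached fraction of the prompt and continue with only the uncached tokens. A rough sketch of that step, using plain Python lists instead of `mx.array`; the function name `split_on_prefix_hit` and the exact log wording are assumptions, not taken from the repository.

```python
def split_on_prefix_hit(
    all_prompt_tokens: list[int], prefix_hit_length: int
) -> tuple[list[int], str]:
    """Return the tokens that still need prefill and a log message."""
    if not 0 <= prefix_hit_length <= len(all_prompt_tokens):
        raise ValueError("prefix_hit_length must lie within the prompt")
    # Only the tokens after the cached prefix need to be processed.
    remaining_tokens = all_prompt_tokens[prefix_hit_length:]
    # Guard against an empty prompt to avoid dividing by zero.
    total = len(all_prompt_tokens)
    pct = 100 * prefix_hit_length / total if total else 0.0
    message = f"{prefix_hit_length} of {total} prompt tokens cached ({pct:.1f}%)"
    return remaining_tokens, message


remaining, message = split_on_prefix_hit([1, 2, 3, 4, 5, 6, 7, 8], 6)
```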

rltakashige · 2026-04-13T22:40:59Z

src/exo/worker/engines/mlx/generator/generate.py

        )
    cache_snapshots: list[CacheSnapshot] | None = ssm_snapshots_list or None

+    if kv_prefix_cache is not None and matched_index is not None and is_exact_hit:


Save prefix cache at prefill time rather than decode time

We don't do this in the batch generator anyway atm

Evanev7

do you actually use the prefill tps in the kv cache anywhere or just plumbing?

Evanev7 · 2026-04-14T08:55:59Z

src/exo/worker/engines/mlx/generator/batch_generate.py

    all_prompt_tokens: mx.array
    prefix_hit_length: int
    matched_index: int | None
-    cache_snapshots: list[CacheSnapshot] | None


was this just unused?

rltakashige added 2 commits April 13, 2026 23:21

Add prefix cache flag to exo bench and don't prefix cache decode

91adfc6

Pass CI

1288d9c

rltakashige commented Apr 13, 2026


rltakashige changed the title ~~Add prefix cache flag to exo bench and don't prefix cache decode tokens~~ Add prefix cache flag to exo bench Apr 13, 2026

Evanev7 reviewed Apr 14, 2026


rltakashige enabled auto-merge (squash) April 14, 2026 10:12

Evanev7 approved these changes Apr 14, 2026


rltakashige merged commit f2709dc into main Apr 14, 2026
6 checks passed

rltakashige deleted the leo/speed-up-exo-bench branch April 14, 2026 10:13

rltakashige merged 2 commits into main from leo/speed-up-exo-bench
