Add prefix cache flag to exo bench #1888

Merged
rltakashige merged 2 commits into main from leo/speed-up-exo-bench
Apr 14, 2026

@rltakashige rltakashige commented Apr 13, 2026

Motivation

When using Exo-Bench extensively, there are many cases where prefix caching could speed up the benchmarks, especially when the focus is on token generation.

At the same time, it is clear that adding decode tokens to the prefix cache is not very useful in most current scenarios. Surprisingly, even for non-thinking models, the chat template formats a continued conversation in a way that makes the existing cache ineffective.

We already (slightly accidentally) skip prefix caching of decode tokens in the batch generator; we should do the same in the sequential generator.
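
As a toy illustration of the point above (not the exo implementation): the token ids below are invented, and the way the re-rendered history diverges from the decoded tokens is a hypothetical assumption. Under that assumption, caching the decode tokens gives no longer a prefix hit than caching the prompt alone.

```python
from itertools import takewhile


def common_prefix_length(cached: list[int], prompt: list[int]) -> int:
    """Count the leading tokens shared by a cached sequence and a new prompt."""
    return sum(1 for _ in takewhile(lambda pair: pair[0] == pair[1], zip(cached, prompt)))


# Hypothetical token ids, for illustration only.
first_prompt = [1, 10, 11, 12, 2]   # templated first turn
decoded = [20, 21, 22, 3]           # tokens produced during decode

# Assumption: when the chat template re-renders the history for the next turn,
# the assistant turn is not token-for-token identical to what was decoded
# (here an extra token 30 appears before it).
next_prompt = first_prompt + [30, 20, 21, 22, 3, 10, 13, 2]

hit_prompt_only = common_prefix_length(first_prompt, next_prompt)
hit_with_decode = common_prefix_length(first_prompt + decoded, next_prompt)

print(hit_prompt_only, hit_with_decode)
print(f"cached ({100 * hit_with_decode / len(next_prompt):.1f}%)")
```

Both hit lengths come out equal in this toy case, which is the situation in which storing decode tokens costs memory without helping the next request.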

Changes

exo bench now has a flag to use prefix caching. For the most accurate pp (prompt processing) results it is better to leave it off, but it speeds up tg (token generation) and large benchmark runs significantly.
Updated the methodology to match.

Test Plan

Manual Testing

Tested on many configurations: the difference in results is negligible, even with multiple --pp options.

f"cached ({100 * prefix_hit_length / len(all_prompt_tokens):.1f}%)"
)
prompt_tokens = remaining_tokens
else:

Collaborator Author

We don't need to do this twice

)
cache_snapshots: list[CacheSnapshot] | None = ssm_snapshots_list or None

if kv_prefix_cache is not None and matched_index is not None and is_exact_hit:

Collaborator Author

Save prefix cache at prefill time rather than decode time

Collaborator Author

We don't do this in the batch generator anyway atm
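
A minimal sketch of the behaviour discussed in this thread, assuming a simplified cache that stores token sequences only. `ToyPrefixCache` and `generate` are hypothetical names, not the exo code, and the sketch ignores exact-hit handling and real KV state. The cache is written once, right after prefill, and never again after decode.

```python
class ToyPrefixCache:
    """Hypothetical stand-in for a KV prefix cache; stores token sequences only."""

    def __init__(self) -> None:
        self.entries: list[list[int]] = []

    def save(self, tokens: list[int]) -> None:
        self.entries.append(list(tokens))

    def longest_hit(self, prompt: list[int]) -> int:
        """Longest run of leading tokens that any cached entry shares with prompt."""
        best = 0
        for entry in self.entries:
            n = 0
            for a, b in zip(entry, prompt):
                if a != b:
                    break
                n += 1
            best = max(best, n)
        return best


def generate(prompt: list[int], cache: ToyPrefixCache, max_tokens: int) -> tuple[list[int], int]:
    hit = cache.longest_hit(prompt)
    # Only prompt[hit:] would need prefill; the real prefill would run here.
    cache.save(prompt)  # saved once, at prefill time: prompt tokens only
    decoded = [1000 + i for i in range(max_tokens)]  # placeholder for real decoding
    # No second save after decode, so decoded tokens never enter the cache.
    return decoded, hit


cache = ToyPrefixCache()
_, first_hit = generate([1, 2, 3, 4], cache, max_tokens=3)
_, second_hit = generate([1, 2, 3, 4], cache, max_tokens=3)
print(first_hit, second_hit)
```

Repeating the same prompt, as a benchmark does, gets a full prefix hit on the second run even though nothing produced during decode was stored.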

@rltakashige rltakashige changed the title from "Add prefix cache flag to exo bench and don't prefix cache decode tokens" to "Add prefix cache flag to exo bench" Apr 13, 2026

@Evanev7 Evanev7 left a comment

Do you actually use the prefill tps in the kv cache anywhere, or is it just plumbing?

all_prompt_tokens: mx.array
prefix_hit_length: int
matched_index: int | None
cache_snapshots: list[CacheSnapshot] | None

Member

was this just unused?

@rltakashige rltakashige enabled auto-merge (squash) April 14, 2026 10:12
@rltakashige rltakashige merged commit f2709dc into main Apr 14, 2026
6 checks passed
@rltakashige rltakashige deleted the leo/speed-up-exo-bench branch April 14, 2026 10:13