Benchmark: async GPU decode via next_token_async (flare 0.2.15)#316
sauravpanda merged 1 commit into main from
Conversation
📝 Walkthrough: The benchmark's Flare integration is updated to version 0.2.15 with reversed GPU-enabling logic (now enabled by default, disabled via `?gpu=0`).
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~8 minutes
🚥 Pre-merge checks: ✅ 5 passed
Flips the Flare benchmark to GPU-backed decode using the new async API landed in flare-web 0.2.14/0.2.15.
Result on SmolLM2-135M Q8_0 (M-series Mac, Chrome)
Flare now leads decode throughput by 71% over MLC. Load is 2× faster. TTFT still trails MLC because prefill runs on CPU — closing that gap is the next upstream refactor (async prefill propagation in flare-web).
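The decode fallback this PR describes (prefer the async GPU-backed `next_token_async()` when the engine exposes it, otherwise the sync `next_token()`, with `?gpu=0` as a CPU-only escape hatch) might look roughly like the sketch below. The engine object shape, the null end-of-stream sentinel, and the `gpuRequested` helper are illustrative assumptions, not flare-web's actual API.

```javascript
// Sketch only: engine shape, null sentinel, and gpuRequested are assumptions.

// "?gpu=0 opts into CPU-only for debugging" (per the PR); any other value,
// or no flag at all, leaves the 0.2.15 default (GPU on) in place.
function gpuRequested(search) {
  return new URLSearchParams(search).get("gpu") !== "0";
}

// Prefer the async GPU-backed path when present, fall back to sync decode.
async function decodeStep(engine) {
  if (typeof engine.next_token_async === "function") {
    return await engine.next_token_async();
  }
  return engine.next_token();
}

// Drive a full decode loop; stops on an assumed null end-of-stream sentinel.
async function decode(engine, maxTokens) {
  const tokens = [];
  for (let i = 0; i < maxTokens; i++) {
    const tok = await decodeStep(engine);
    if (tok === null) break;
    tokens.push(tok);
  }
  return tokens;
}
```

Feature-detecting the method (rather than pinning on a version string) keeps the benchmark working against both 0.2.14/0.2.15 and older sync-only builds.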
Changes
- `next_token_async` + wasm32 prefill CPU fallback
- `await flareEngine.next_token_async()` when available; fall back to sync `next_token()` otherwise
- `?gpu=0` opts into CPU-only for debugging

Test plan
- `npx jest` — 62 passing

Summary by CodeRabbit
New Features
- `?gpu=0` to disable.

Improvements