Revert Flare worker — wgpu readback deadlocks in dedicated Workers #311

sauravpanda merged 1 commit into main from
Conversation
PR #309 moved Flare into a dedicated Worker to keep the UI responsive during the 138 MB GGUF parse. That fixed the freeze, but dropped WebGPU silently to CPU fallback — then the next release (flare-web 0.2.12) fixed the WebGPU-in-worker detection, and the benchmark immediately deadlocked on the first inference run. Root cause: flare-gpu's dispatch_and_readback does slice.map_async(Read, |r| sender.send(r)); device.poll(Wait); // no-op on wasm32 receiver.recv(); // blocks the worker forever The WebGPU map_async callback is serviced by browser-internal microtasks that only drain on the main thread. In a Worker, the sync recv() call deadlocks — we hung for 240+ s on the warmup run. Main-thread load still briefly freezes the UI, but that's the lesser evil compared to CPU-fallback-at-20-tok/s or a hung tab. Proper fix requires a worker-safe async readback path in flare-web. Tracked separately. The flare-worker.js helper is removed since nothing else uses it.
📝 Walkthrough

The Flare benchmark execution is refactored to run directly on the main thread instead of in a dedicated Web Worker. The worker implementation file is deleted entirely, and the main benchmark HTML is updated to remove the RPC layer and call Flare engine methods synchronously, with module loading and streaming logic moved from the worker to the main thread.

Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs
🚥 Pre-merge checks | ✅ Passed checks (3 passed)
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
examples/benchmark/index.html (1)
679-728: ⚠️ Potential issue | 🟡 Minor

TTFT is captured before the first token is generated, skewing Flare vs MLC comparisons.
`begin_stream_with_params` runs prefill and returns `void`; the first token is only produced by the subsequent `next_token()` call. Capturing `firstTokenTime` on line 692 therefore measures "prefill complete", not "time to first token". MLC's path (lines 463-464) and Transformers' path (lines 544-546) both stamp TTFT on the arrival of the first token.

Downstream effect on throughput: `decodeTime = totalTime - firstTokenTime` then covers all `tokenCount` tokens for Flare but only `tokenCount - 1` tokens for MLC, so `(tokenCount - 1) / decodeTime` on line 726 systematically under-reports Flare's tok/s relative to the other engines — which is the opposite of what you want from a benchmark harness. Since this is the first on-main-thread Flare run with WebGPU actually active (per the PR description), the skew will be more visible in the upcoming re-benchmark.

🔧 Proposed fix: stamp TTFT on the first decoded token
```diff
 flareEngine.reset();
 flareEngine.begin_stream_with_params(
   promptIds,
   opts.maxTokens,
   opts.temperature || 0.001,
   1.0,
   40,
   1.0,
   0.0,
 );
-const firstTokenTime = performance.now() - t0;
-
 // First-run-only profile snapshot
 if (typeof flareEngine.prefill_profile_json === 'function' && !window.__flareProfileLogged) {
   try {
     const profile = JSON.parse(flareEngine.prefill_profile_json());
     if (profile && profile.seq_len > 0) {
       console.log('[Flare] prefill profile:', profile);
       log(`Flare prefill profile: ${JSON.stringify(profile)}`, 'info');
       window.__flareProfileLogged = true;
     }
   } catch (e) {
     console.warn('[Flare] prefill profile read failed:', e);
   }
 }

 let tokenCount = 0;
 let output = '';
+let firstTokenTime = null;
 while (!flareEngine.stream_done) {
   const id = flareEngine.next_token();
   if (id === undefined) break;
+  if (firstTokenTime === null) firstTokenTime = performance.now() - t0;
   tokenCount++;
   output += flareEngine.decode_token_chunk(id);
 }

 const totalTime = performance.now() - t0;
-const decodeTime = totalTime - firstTokenTime;
+const decodeTime = totalTime - (firstTokenTime ?? totalTime);

 return {
   output,
-  ttft: firstTokenTime,
+  ttft: firstTokenTime ?? totalTime,
   totalTime,
   tokenCount,
   tokensPerSec: (tokenCount > 1 && decodeTime > 0)
     ? ((tokenCount - 1) / (decodeTime / 1000))
     : 0,
 };
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/benchmark/index.html` around lines 679 - 728, The TTFT for Flare is measured too early (after begin_stream_with_params) — move the firstTokenTime stamp to the moment the first token is actually produced by next_token() to match MLC/Transformers; specifically remove the current assignment to firstTokenTime after begin_stream_with_params and instead set firstTokenTime = performance.now() - t0 inside the while loop when you detect the very first token (e.g., when tokenCount === 0 or before you increment tokenCount after a successful id from flareEngine.next_token()), so decodeTime and tokensPerSec calculations use the true time-to-first-token for Flare.
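The size of the skew is easy to see in a toy model with made-up numbers (hypothetical: 400 ms prefill, then 20 tokens at 50 ms each, both engines doing identical work):

```javascript
// Toy model of the TTFT skew; all timings are hypothetical.
const prefillMs = 400, perTokenMs = 50, tokens = 20;
const totalMs = prefillMs + tokens * perTokenMs; // 1400 ms end to end

const tokensPerSec = (decodeMs) => (tokens - 1) / (decodeMs / 1000);

// Buggy Flare path: TTFT stamped when prefill returns.
const ttftAtPrefill = prefillMs; // 400 ms
// Correct MLC-style path: TTFT stamped on the first decoded token.
const ttftAtFirstToken = prefillMs + perTokenMs; // 450 ms

console.log(tokensPerSec(totalMs - ttftAtPrefill).toFixed(1));    // 19.0 (skewed low)
console.log(tokensPerSec(totalMs - ttftAtFirstToken).toFixed(1)); // 20.0 (true decode rate)
```

The early stamp folds one full token interval into the decode window, so the same engine reports 19.0 tok/s instead of 20.0 here; the gap grows as prefill gets longer relative to decode.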
🧹 Nitpick comments (1)
examples/benchmark/index.html (1)
730-732: Consider explicitly freeing the WASM instance on dispose.

wasm-bindgen–generated classes expose a `free()` method that releases the underlying linear-memory allocation immediately; nulling the reference leaves reclamation up to the GC / FinalizationRegistry, which can keep ~138 MB pinned between benchmark runs (especially when switching models back-to-back in the loop at lines 816-920). Since `free()` isn't in the TypeScript surface in `src/engines/flare-engine-wrapper.ts`, guard the call.

♻️ Optional cleanup
```diff
 function disposeFlare() {
+  if (flareEngine && typeof flareEngine.free === 'function') {
+    try { flareEngine.free(); } catch {}
+  }
   flareEngine = null;
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/benchmark/index.html` around lines 730 - 732, The disposeFlare function currently just nulls flareEngine; change it to explicitly call the wasm-bindgen free() when available: if flareEngine has a free method (guard with typeof flareEngine?.free === "function") call flareEngine.free() before setting flareEngine = null. Update any wrapper type/usage around FlareEngineWrapper/flareEngine to allow this guarded call so the underlying WASM linear memory is released immediately.
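The guard pattern can be exercised standalone with a stub in place of the wasm-bindgen object (the stub below is hypothetical; only the `disposeFlare` body mirrors the suggested cleanup):

```javascript
// Stub standing in for the wasm-bindgen FlareEngine instance (hypothetical).
let freed = false;
let flareEngine = { free() { freed = true; } };

function disposeFlare() {
  // Guard: free() is not in the TypeScript surface, and the engine may
  // already be null, so check before calling.
  if (flareEngine && typeof flareEngine.free === 'function') {
    try { flareEngine.free(); } catch {}
  }
  flareEngine = null;
}

disposeFlare(); // releases the stub's memory immediately
disposeFlare(); // safe to call again once the reference is null

console.log(freed, flareEngine); // true null
```

The `try`/`catch` also makes dispose safe against a double `free()` from the WASM side, which wasm-bindgen turns into a thrown error rather than undefined behavior.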
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@examples/benchmark/index.html`:
- Around line 679-728: The TTFT for Flare is measured too early (after
begin_stream_with_params) — move the firstTokenTime stamp to the moment the
first token is actually produced by next_token() to match MLC/Transformers;
specifically remove the current assignment to firstTokenTime after
begin_stream_with_params and instead set firstTokenTime = performance.now() - t0
inside the while loop when you detect the very first token (e.g., when
tokenCount === 0 or before you increment tokenCount after a successful id from
flareEngine.next_token()), so decodeTime and tokensPerSec calculations use the
true time-to-first-token for Flare.
---
Nitpick comments:
In `@examples/benchmark/index.html`:
- Around line 730-732: The disposeFlare function currently just nulls
flareEngine; change it to explicitly call the wasm-bindgen free() when
available: if flareEngine has a free method (guard with typeof flareEngine?.free
=== "function") call flareEngine.free() before setting flareEngine = null.
Update any wrapper type/usage around FlareEngineWrapper/flareEngine to allow
this guarded call so the underlying WASM linear memory is released immediately.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 685fda9e-364f-4038-941c-f0fb9c958aaf
📒 Files selected for processing (2)
- examples/benchmark/flare-worker.js
- examples/benchmark/index.html
💤 Files with no reviewable changes (1)
- examples/benchmark/flare-worker.js
Test plan
Summary by CodeRabbit