Revert Flare worker — wgpu readback deadlocks in dedicated Workers #311

sauravpanda merged 1 commit into main from
Conversation
PR #309 moved Flare into a dedicated Worker to keep the UI responsive during the 138 MB GGUF parse. That fixed the freeze, but dropped WebGPU silently to CPU fallback — then the next release (flare-web 0.2.12) fixed the WebGPU-in-worker detection, and the benchmark immediately deadlocked on the first inference run. Root cause: flare-gpu's dispatch_and_readback does slice.map_async(Read, |r| sender.send(r)); device.poll(Wait); // no-op on wasm32 receiver.recv(); // blocks the worker forever The WebGPU map_async callback is serviced by browser-internal microtasks that only drain on the main thread. In a Worker, the sync recv() call deadlocks — we hung for 240+ s on the warmup run. Main-thread load still briefly freezes the UI, but that's the lesser evil compared to CPU-fallback-at-20-tok/s or a hung tab. Proper fix requires a worker-safe async readback path in flare-web. Tracked separately. The flare-worker.js helper is removed since nothing else uses it.
📝 Walkthrough

The Flare benchmark execution is refactored to run directly on the main thread instead of in a dedicated Web Worker. The worker implementation file is deleted entirely, and the main benchmark HTML is updated to remove the RPC layer and call Flare engine methods synchronously, with module loading and streaming logic moved from the worker to the main thread.

Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs
🚥 Pre-merge checks | ✅ Passed checks (3 passed)
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
examples/benchmark/index.html (1)
679-728: ⚠️ Potential issue | 🟡 Minor

TTFT is captured before the first token is generated, skewing Flare vs MLC comparisons.
`begin_stream_with_params` runs prefill and returns `void`; the first token is only produced by the subsequent `next_token()` call. Capturing `firstTokenTime` on line 692 therefore measures "prefill complete", not "time to first token". MLC's path (lines 463-464) and Transformers' path (lines 544-546) both stamp TTFT on the arrival of the first token.

Downstream effect on throughput: `decodeTime = totalTime - firstTokenTime` then covers all `tokenCount` tokens for Flare but only `tokenCount - 1` tokens for MLC, so `(tokenCount - 1) / decodeTime` on line 726 systematically under-reports Flare's tok/s relative to the other engines — which is the opposite of what you want from a benchmark harness. Since this is the first on-main-thread Flare run with WebGPU actually active (per the PR description), the skew will be more visible in the upcoming re-benchmark.

🔧 Proposed fix: stamp TTFT on the first decoded token
```diff
 flareEngine.reset();
 flareEngine.begin_stream_with_params(
   promptIds,
   opts.maxTokens,
   opts.temperature || 0.001,
   1.0,
   40,
   1.0,
   0.0,
 );
-const firstTokenTime = performance.now() - t0;
-
 // First-run-only profile snapshot
 if (typeof flareEngine.prefill_profile_json === 'function' && !window.__flareProfileLogged) {
   try {
     const profile = JSON.parse(flareEngine.prefill_profile_json());
     if (profile && profile.seq_len > 0) {
       console.log('[Flare] prefill profile:', profile);
       log(`Flare prefill profile: ${JSON.stringify(profile)}`, 'info');
       window.__flareProfileLogged = true;
     }
   } catch (e) {
     console.warn('[Flare] prefill profile read failed:', e);
   }
 }

 let tokenCount = 0;
 let output = '';
+let firstTokenTime = null;
 while (!flareEngine.stream_done) {
   const id = flareEngine.next_token();
   if (id === undefined) break;
+  if (firstTokenTime === null) firstTokenTime = performance.now() - t0;
   tokenCount++;
   output += flareEngine.decode_token_chunk(id);
 }

 const totalTime = performance.now() - t0;
-const decodeTime = totalTime - firstTokenTime;
+const decodeTime = totalTime - (firstTokenTime ?? totalTime);

 return {
   output,
-  ttft: firstTokenTime,
+  ttft: firstTokenTime ?? totalTime,
   totalTime,
   tokenCount,
   tokensPerSec: (tokenCount > 1 && decodeTime > 0)
     ? ((tokenCount - 1) / (decodeTime / 1000))
     : 0,
 };
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/benchmark/index.html` around lines 679 - 728, The TTFT for Flare is measured too early (after begin_stream_with_params) — move the firstTokenTime stamp to the moment the first token is actually produced by next_token() to match MLC/Transformers; specifically remove the current assignment to firstTokenTime after begin_stream_with_params and instead set firstTokenTime = performance.now() - t0 inside the while loop when you detect the very first token (e.g., when tokenCount === 0 or before you increment tokenCount after a successful id from flareEngine.next_token()), so decodeTime and tokensPerSec calculations use the true time-to-first-token for Flare.
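The size of the skew is easy to see in a toy model with made-up numbers (hypothetical: 400 ms prefill, then 20 tokens at 50 ms each, both engines doing identical work):

```javascript
// Toy model of the TTFT skew; all timings are hypothetical.
const prefillMs = 400, perTokenMs = 50, tokens = 20;
const totalMs = prefillMs + tokens * perTokenMs; // 1400 ms end to end

const tokensPerSec = (decodeMs) => (tokens - 1) / (decodeMs / 1000);

// Buggy Flare path: TTFT stamped when prefill returns.
const ttftAtPrefill = prefillMs; // 400 ms
// Correct MLC-style path: TTFT stamped on the first decoded token.
const ttftAtFirstToken = prefillMs + perTokenMs; // 450 ms

console.log(tokensPerSec(totalMs - ttftAtPrefill).toFixed(1));    // 19.0 (skewed low)
console.log(tokensPerSec(totalMs - ttftAtFirstToken).toFixed(1)); // 20.0 (true decode rate)
```

The early stamp folds one full token interval into the decode window, so the same engine reports 19.0 tok/s instead of 20.0 here; the gap grows as prefill gets longer relative to decode.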
🧹 Nitpick comments (1)
examples/benchmark/index.html (1)
730-732: Consider explicitly freeing the WASM instance on dispose.

wasm-bindgen–generated classes expose a `free()` method that releases the underlying linear-memory allocation immediately; nulling the reference leaves reclamation up to the GC / FinalizationRegistry, which can keep ~138 MB pinned between benchmark runs (especially when switching models back-to-back in the loop at lines 816-920). Since `free()` isn't in the TypeScript surface in `src/engines/flare-engine-wrapper.ts`, guard the call.

♻️ Optional cleanup
```diff
 function disposeFlare() {
+  if (flareEngine && typeof flareEngine.free === 'function') {
+    try { flareEngine.free(); } catch {}
+  }
   flareEngine = null;
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/benchmark/index.html` around lines 730 - 732, The disposeFlare function currently just nulls flareEngine; change it to explicitly call the wasm-bindgen free() when available: if flareEngine has a free method (guard with typeof flareEngine?.free === "function") call flareEngine.free() before setting flareEngine = null. Update any wrapper type/usage around FlareEngineWrapper/flareEngine to allow this guarded call so the underlying WASM linear memory is released immediately.
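The guard pattern can be exercised standalone with a stub in place of the wasm-bindgen object (the stub below is hypothetical; only the `disposeFlare` body mirrors the suggested cleanup):

```javascript
// Stub standing in for the wasm-bindgen FlareEngine instance (hypothetical).
let freed = false;
let flareEngine = { free() { freed = true; } };

function disposeFlare() {
  // Guard: free() is not in the TypeScript surface, and the engine may
  // already be null, so check before calling.
  if (flareEngine && typeof flareEngine.free === 'function') {
    try { flareEngine.free(); } catch {}
  }
  flareEngine = null;
}

disposeFlare(); // releases the stub's memory immediately
disposeFlare(); // safe to call again once the reference is null

console.log(freed, flareEngine); // true null
```

The `try`/`catch` also makes dispose safe against a double `free()` from the WASM side, which wasm-bindgen turns into a thrown error rather than undefined behavior.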
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@examples/benchmark/index.html`:
- Around line 679-728: The TTFT for Flare is measured too early (after
begin_stream_with_params) — move the firstTokenTime stamp to the moment the
first token is actually produced by next_token() to match MLC/Transformers;
specifically remove the current assignment to firstTokenTime after
begin_stream_with_params and instead set firstTokenTime = performance.now() - t0
inside the while loop when you detect the very first token (e.g., when
tokenCount === 0 or before you increment tokenCount after a successful id from
flareEngine.next_token()), so decodeTime and tokensPerSec calculations use the
true time-to-first-token for Flare.
---
Nitpick comments:
In `@examples/benchmark/index.html`:
- Around line 730-732: The disposeFlare function currently just nulls
flareEngine; change it to explicitly call the wasm-bindgen free() when
available: if flareEngine has a free method (guard with typeof flareEngine?.free
=== "function") call flareEngine.free() before setting flareEngine = null.
Update any wrapper type/usage around FlareEngineWrapper/flareEngine to allow
this guarded call so the underlying WASM linear memory is released immediately.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 685fda9e-364f-4038-941c-f0fb9c958aaf
📒 Files selected for processing (2)
- examples/benchmark/flare-worker.js
- examples/benchmark/index.html
💤 Files with no reviewable changes (1)
- examples/benchmark/flare-worker.js
Test plan
Summary by CodeRabbit