
Revert Flare worker — wgpu readback deadlocks in dedicated Workers#311

Merged
sauravpanda merged 1 commit into main from restore-main-thread-flare
Apr 20, 2026

Conversation

@sauravpanda
Owner

@sauravpanda sauravpanda commented Apr 20, 2026

PR #309 moved Flare into a dedicated Web Worker to keep the UI responsive during the 138 MB GGUF parse. That fixed the freeze but silently fell back from WebGPU to the CPU backend (see upstream #498, which fixed worker detection in flare-web 0.2.12). Once that upstream fix landed, the worker immediately deadlocked on the first inference run.

Root cause is in flare-gpu's `dispatch_and_readback`:

```rust
// Simplified from flare-gpu's dispatch_and_readback (identifiers abridged):
slice.map_async(wgpu::MapMode::Read, move |r| {
    let _ = sender.send(r); // completion callback — never serviced in a Worker
});
device.poll(wgpu::Maintain::Wait); // no-op on wasm32; the browser drives the queue
receiver.recv().unwrap();          // blocks the worker forever
```

The WebGPU `map_async` callback is serviced by browser-internal microtasks that only drain on the main thread. In a Worker, the sync `recv()` call deadlocks — I reproduced a 240+ s hang on the warmup run.
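This failure mode does not need a GPU to demonstrate. A minimal Node sketch (illustrative only, no flare code involved) shows the same shape: a callback queued on the event loop can never fire while the thread busy-waits on its result.

```javascript
// A callback scheduled on the event loop (standing in for the map_async
// completion) cannot run while we spin synchronously waiting for it.
function syncWait(ref, budgetMs) {
  const end = Date.now() + budgetMs;
  while (!ref.done && Date.now() < end) {} // spins; callbacks cannot be serviced
  return ref.done;
}

const ref = { done: false };
setTimeout(() => { ref.done = true; }, 0); // plays the role of the readback callback
const sawCallback = syncWait(ref, 50);
console.log(sawCallback); // false — the callback never ran during the spin

// The worker-safe shape: yield to the event loop instead of blocking.
setTimeout(() => console.log(ref.done), 0); // true — callback ran once we yielded
```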

Main-thread load still freezes the UI for 2-3 s while the 138 MB GGUF is parsed, but that is the lesser evil compared to a 20 tok/s CPU fallback or a fully hung tab.

The real fix is a worker-safe async readback path in flare-web — tracked separately.
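As a sketch of what that path could look like — `readbackAsync` and the stub below are hypothetical, not flare-web's actual surface — the key change is awaiting the GPU map promise so the browser can deliver completion on the worker's own event loop:

```javascript
// Hypothetical worker-safe readback: await the map promise instead of
// blocking on a channel, so no sync wait and no deadlock.
const GPUMapModeRead = 1; // stand-in for GPUMapMode.READ outside a browser

async function readbackAsync(buffer, byteLength) {
  await buffer.mapAsync(GPUMapModeRead, 0, byteLength); // resolves via the event loop
  const bytes = new Uint8Array(buffer.getMappedRange(0, byteLength)).slice();
  buffer.unmap();
  return bytes;
}

// Stub GPUBuffer to show the call shape without a real device.
const stubBuffer = {
  mapAsync: async () => {},
  getMappedRange: () => new Uint8Array([1, 2, 3, 4]).buffer,
  unmap: () => {},
};
readbackAsync(stubBuffer, 4).then(bytes => console.log(bytes.length)); // 4
```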

Test plan

  • `npx jest` — 62 passing
  • Lint + build clean
  • Re-benchmark on main thread with flare-web 0.2.12 — expect ~70 tok/s on the webgpu backend

Summary by CodeRabbit

  • Refactor
    • Simplified Flare benchmark execution by consolidating engine processing to the main thread, removing the separate worker implementation and RPC layer for more streamlined code organization.

The flare-worker.js helper is removed since nothing else uses it.
@coderabbitai

coderabbitai bot commented Apr 20, 2026

📝 Walkthrough

Walkthrough

The Flare benchmark execution is refactored to run directly on the main thread instead of in a dedicated Web Worker. The worker implementation file is deleted entirely, and the main benchmark HTML is updated to remove the RPC layer and call Flare engine methods synchronously, with module loading and streaming logic moved from the worker to the main thread.

Changes

  • Worker Removal — examples/benchmark/flare-worker.js
    Entire dedicated Web Worker implementation deleted: message dispatch routing, RPC response handling, streaming token-chunk logic, WASM initialization flow, and error forwarding (145 lines removed).
  • Main Thread Refactor — examples/benchmark/index.html
    Worker/RPC layer removed (flareWorker, flareRpcPending, ensureFlareWorker, flareCall, message handlers). Flare module loading moved to the main thread with blob-based patching of import.meta.url. Model loading and GPU initialization now call flareLib.FlareEngine.load() and flareEngine.init_gpu() directly. Inference refactored from worker streaming to a main-thread loop using begin_stream_with_params, next_token(), and decode_token_chunk() until completion. Disposal simplified to nulling out flareEngine.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

size/M

Poem

🐰 Hop! The worker bows,
its messages now rest—
Flare dances solo,
on the main thread's quest,
streaming tokens swift,
where once threads did jest.

🚥 Pre-merge checks | ✅ 3 passed
  • Description Check — ✅ Passed. Check skipped: CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed. The title accurately describes the main change: reverting the Flare worker implementation due to wgpu readback deadlocks in dedicated Workers.
  • Docstring Coverage — ✅ Passed. No functions found in the changed files; docstring coverage check skipped.




@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
examples/benchmark/index.html (1)

679-728: ⚠️ Potential issue | 🟡 Minor

TTFT is captured before the first token is generated, skewing Flare vs MLC comparisons.

begin_stream_with_params runs prefill and returns void; the first token is only produced by the subsequent next_token() call. Capturing firstTokenTime on line 692 therefore measures "prefill complete", not "time to first token". MLC's path (lines 463-464) and Transformers' path (lines 544-546) both stamp TTFT on the arrival of the first token.

Downstream effect on throughput: decodeTime = totalTime - firstTokenTime then covers all tokenCount tokens for Flare but only tokenCount - 1 tokens for MLC, so (tokenCount - 1) / decodeTime on line 726 systematically under-reports Flare's tok/s relative to the other engines — which is the opposite of what you want from a benchmark harness. Since this is the first on-main-thread Flare run with WebGPU actually active (per the PR description), the skew will be more visible in the upcoming re-benchmark.

🔧 Proposed fix: stamp TTFT on the first decoded token
         flareEngine.reset();
         flareEngine.begin_stream_with_params(
           promptIds,
           opts.maxTokens,
           opts.temperature || 0.001,
           1.0, 40, 1.0, 0.0,
         );
 
-        const firstTokenTime = performance.now() - t0;
-
         // First-run-only profile snapshot
         if (typeof flareEngine.prefill_profile_json === 'function' && !window.__flareProfileLogged) {
           try {
             const profile = JSON.parse(flareEngine.prefill_profile_json());
             if (profile && profile.seq_len > 0) {
               console.log('[Flare] prefill profile:', profile);
               log(`Flare prefill profile: ${JSON.stringify(profile)}`, 'info');
               window.__flareProfileLogged = true;
             }
           } catch (e) {
             console.warn('[Flare] prefill profile read failed:', e);
           }
         }
 
         let tokenCount = 0;
         let output = '';
+        let firstTokenTime = null;
 
         while (!flareEngine.stream_done) {
           const id = flareEngine.next_token();
           if (id === undefined) break;
+          if (firstTokenTime === null) firstTokenTime = performance.now() - t0;
           tokenCount++;
           output += flareEngine.decode_token_chunk(id);
         }
 
         const totalTime = performance.now() - t0;
-        const decodeTime = totalTime - firstTokenTime;
+        const decodeTime = totalTime - (firstTokenTime ?? totalTime);
 
         return {
           output,
-          ttft: firstTokenTime,
+          ttft: firstTokenTime ?? totalTime,
           totalTime,
           tokenCount,
           tokensPerSec: (tokenCount > 1 && decodeTime > 0) ? ((tokenCount - 1) / (decodeTime / 1000)) : 0,
         };
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/benchmark/index.html` around lines 679 - 728, The TTFT for Flare is
measured too early (after begin_stream_with_params) — move the firstTokenTime
stamp to the moment the first token is actually produced by next_token() to
match MLC/Transformers; specifically remove the current assignment to
firstTokenTime after begin_stream_with_params and instead set firstTokenTime =
performance.now() - t0 inside the while loop when you detect the very first
token (e.g., when tokenCount === 0 or before you increment tokenCount after a
successful id from flareEngine.next_token()), so decodeTime and tokensPerSec
calculations use the true time-to-first-token for Flare.
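To make the skew concrete, a quick check with assumed timings (all numbers illustrative, not measured):

```javascript
// Same hypothetical run measured under the two TTFT conventions.
const prefillDone = 400;  // ms: begin_stream_with_params returns (Flare's current stamp)
const firstToken  = 450;  // ms: first next_token() completes (MLC's convention)
const totalTime   = 2450; // ms
const tokenCount  = 41;

const decodeFlare = totalTime - prefillDone; // 2050 ms attributed to 41 tokens
const decodeMlc   = totalTime - firstToken;  // 2000 ms attributed to 40 tokens

const tpsFlare = (tokenCount - 1) / (decodeFlare / 1000);
const tpsMlc   = (tokenCount - 1) / (decodeMlc / 1000);
console.log(tpsFlare.toFixed(1), tpsMlc.toFixed(1)); // 19.5 20.0 — Flare under-reported
```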
🧹 Nitpick comments (1)
examples/benchmark/index.html (1)

730-732: Consider explicitly freeing the WASM instance on dispose.

wasm-bindgen–generated classes expose a free() method that releases the underlying linear-memory allocation immediately; nulling the reference leaves reclamation up to the GC / FinalizationRegistry, which can keep ~138 MB pinned between benchmark runs (especially when switching models back-to-back in the loop at lines 816-920). Since free() isn't in the TypeScript surface in src/engines/flare-engine-wrapper.ts, guard the call.

♻️ Optional cleanup
       function disposeFlare() {
+        if (flareEngine && typeof flareEngine.free === 'function') {
+          try { flareEngine.free(); } catch {}
+        }
         flareEngine = null;
       }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/benchmark/index.html` around lines 730 - 732, The disposeFlare
function currently just nulls flareEngine; change it to explicitly call the
wasm-bindgen free() when available: if flareEngine has a free method (guard with
typeof flareEngine?.free === "function") call flareEngine.free() before setting
flareEngine = null. Update any wrapper type/usage around
FlareEngineWrapper/flareEngine to allow this guarded call so the underlying WASM
linear memory is released immediately.
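The guarded call can be exercised against a plain stub (`free()` here stands in for the wasm-bindgen method, which may or may not be exposed through the wrapper):

```javascript
// Stub engine exposing a wasm-bindgen-style free().
let freed = false;
let flareEngine = { free() { freed = true; } };

function disposeFlare() {
  if (flareEngine && typeof flareEngine.free === 'function') {
    try { flareEngine.free(); } catch {} // release WASM linear memory eagerly
  }
  flareEngine = null;
}

disposeFlare();
console.log(freed, flareEngine); // true null
```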

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 685fda9e-364f-4038-941c-f0fb9c958aaf

📥 Commits

Reviewing files that changed from the base of the PR and between ffd39b1 and 4427fbc.

📒 Files selected for processing (2)
  • examples/benchmark/flare-worker.js
  • examples/benchmark/index.html
💤 Files with no reviewable changes (1)
  • examples/benchmark/flare-worker.js

@sauravpanda sauravpanda merged commit 58ca248 into main Apr 20, 2026
10 checks passed
@sauravpanda sauravpanda deleted the restore-main-thread-flare branch April 20, 2026 22:38
