
feat(sdk): replace Meteor DDP transport with @rocket.chat/ddp-client#40301

Draft
ggazzo wants to merge 33 commits into develop from worktree-sdk-over-ddp-client

Conversation

@ggazzo
Member

@ggazzo ggazzo commented Apr 24, 2026

Summary

Migrates the frontend DDP transport from Meteor's WebSocket to our own
@rocket.chat/ddp-client SDK, while keeping the existing Meteor Accounts
code as the auth anchor. Every client-side method call and subscription
now runs on a single WebSocket we own; Meteor's socket stays present
only to keep Accounts' in-memory state happy.

What moves onto the DDPSDK socket

  • sdk.call / sdk.publish / sdk.stream — every consumer of
    SDKClient now dispatches against DDPSDK (no Meteor fallback).
  • ServerContext.callMethod and writeStream — delegate through the
    same SDK.
  • Meteor.apply / Meteor.call / Meteor.callAsync — intercepted in
    ddpOverSDK (formerly ddpOverREST) and routed to
    ddpSdk.client.callAsync. Includes login (password, SAML, LDAP,
    CAS, resume), UserPresence:*, setUserStatus, and logout. REST
    stays only as a transient fallback if the SDK is still handshaking.
  • Meteor.connection.subscribe / Meteor.subscribe — intercepted in
    subscribeViaSDK and translated to ddpSdk.client.subscribe, so
    Accounts' loginServiceConfiguration, autoupdate and any stray
    internal publications ride our socket too.
  • stream-user-presence — subscribed via DDPSDK and also listened to
    via a streamerCentral adapter bridging DDPSDK's onMessage into
    the existing _stream.on('message') contract.
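The interception described above can be sketched as a `_send` override. This is a minimal, hypothetical illustration — `installSendOverride`, `SdkLike`, and `sdkReady` are illustrative names, not the PR's actual identifiers:

```typescript
// Hypothetical sketch of the ddpOverSDK interception: method frames are
// routed onto the SDK socket, everything else keeps Meteor's original path.
type DdpMessage = { msg: string; method?: string; params?: unknown[]; id?: string };

interface SdkLike {
  callAsync(method: string, ...params: unknown[]): Promise<unknown>;
}

function installSendOverride(
  connection: { _send(msg: DdpMessage): void },
  sdk: SdkLike,
  sdkReady: () => boolean, // e.g. "WS handshake resolved and session authenticated"
) {
  const originalSend = connection._send.bind(connection);
  connection._send = (msg: DdpMessage) => {
    if (msg.msg === 'method' && sdkReady()) {
      // Dispatch the method call on the SDK-owned socket.
      void sdk.callAsync(msg.method!, ...(msg.params ?? []));
      return;
    }
    originalSend(msg); // fallback while the SDK is still handshaking
  };
}
```

In the real change the fallback path is REST rather than Meteor's socket, but the shape of the override is the same: one chokepoint through which every outgoing method frame passes.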

Supporting changes

  • DDPSDK is instantiated lazily and connects eagerly; it queues writes
    until the WS handshake resolves and authenticates with Meteor's
    resume token as soon as userIdStore shows a uid.
  • Wire encoding is forced to EJSON (vs. DDPSDK's default JSON) so
    Dates and other EJSON extensions round-trip identically to Meteor's
    native frames.
  • ServerProvider.getStatus flipped to DDPSDK-primary — connected /
    status now derive from ddpSdk.connection.status, with retry
    counters falling back to Meteor.status().
  • ServerProvider.disconnect / reconnect drive both transports.
  • CachedStore version bumped to invalidate entries persisted before
    the EJSON switch (ISO-string dates would fail .getTime() on
    fields like subscription.ls).
  • Two residual client-side Meteor.methods stubs (setReaction,
    sendMessage) are converted into explicit runOptimistic* calls
    in their flows/* so the optimistic UI no longer relies on
    Meteor's stub-dispatch machinery.
  • @rocket.chat/ddp-client promoted from a transitive dep (via
    ui-contexts) to a direct workspace dep of apps/meteor.
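The EJSON point above is the crux of the CachedStore bump: plain JSON degrades a `Date` to an ISO string, so a persisted `subscription.ls` would later fail `.getTime()`. A minimal illustration of the difference (this is not the real `@rocket.chat/ejson` implementation, just the `{$date: ms}` idea):

```typescript
// Minimal EJSON-style Date round-trip vs. plain JSON.
function ejsonStringify(value: unknown): string {
  return JSON.stringify(value, function (this: Record<string, unknown>, key, val) {
    // `val` has already been through Date.prototype.toJSON (an ISO string),
    // so read the raw value off the containing object instead.
    const raw = this[key];
    return raw instanceof Date ? { $date: raw.getTime() } : val;
  });
}

function ejsonParse(text: string): unknown {
  return JSON.parse(text, (_key, val) =>
    val && typeof val === 'object' && '$date' in val ? new Date(val.$date) : val,
  );
}
```

With plain `JSON.parse(JSON.stringify(...))` the same field comes back as a string, which is exactly the stale-cache shape the version bump invalidates.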

What still lives on Meteor

  • Accounts state (userId, loginToken, onLogin/onLogout callbacks) — by
    design; removing is a separate phase.
  • Meteor.connection's WebSocket still opens on page load. Every
    outgoing message is intercepted, but the socket itself is kept so
    Meteor's internal reactive status dependencies (used by a few
    Tracker.autoruns outside this PR's scope) don't see a permanently
    "connecting" state. Neutralising that socket is the next step.

Test plan

  • TypeScript passes across the meteor workspace.
  • EJSON round-trip verified (subscription.ls, message.ts, etc.).
  • CI green (unit, storybook, UI E2E, API).
  • Manual (dev): inspect DevTools → Network → WS. After login,
    two sockets to /websocket are open — one Meteor, one DDPSDK.
    Every method / sub frame should appear on the DDPSDK socket;
    Meteor's is effectively idle.
  • Manual: login with password, logout, SAML login, LDAP login
    still work end-to-end. Session resume on reload takes the user
    straight into the app.
  • Manual: user-presence updates (online/away) from another tab
    appear in near-realtime.
  • Manual: ConnectionStatusBar reflects DDPSDK outages — disable
    the DDPSDK socket via DevTools and verify the offline banner
    appears.

@dionisio-bot
Contributor

dionisio-bot Bot commented Apr 24, 2026

Looks like this PR is not ready to merge, because of the following issues:

  • This PR is missing the 'stat: QA assured' label
  • This PR is missing the required milestone or project

Please fix the issues and try again

If you have any trouble, please check the PR guidelines

@changeset-bot

changeset-bot Bot commented Apr 24, 2026

⚠️ No Changeset found

Latest commit: 42a6987

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


@coderabbitai
Contributor

coderabbitai Bot commented Apr 24, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 27133580-6815-4e5e-b866-b3e335982738

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


@ggazzo ggazzo changed the title chore: Worktree sdk over ddp client feat(sdk): migrate frontend transport to @rocket.chat/ddp-client (phase 2) Apr 24, 2026
@codecov

codecov Bot commented Apr 24, 2026

Codecov Report

❌ Patch coverage is 47.40484% with 152 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.99%. Comparing base (981af32) to head (42a6987).
⚠️ Report is 21 commits behind head on develop.

Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff             @@
##           develop   #40301      +/-   ##
===========================================
+ Coverage    69.86%   69.99%   +0.13%     
===========================================
  Files         3298     3300       +2     
  Lines       119347   120583    +1236     
  Branches     21530    21613      +83     
===========================================
+ Hits         83377    84399    +1022     
- Misses       32672    32915     +243     
+ Partials      3298     3269      -29     
Flag Coverage Δ
unit 70.78% <47.40%> (+0.15%) ⬆️

Flags with carried forward coverage won't be shown.


@ggazzo ggazzo force-pushed the worktree-sdk-over-ddp-client branch 3 times, most recently from e0d347f to de8a406 Compare April 24, 2026 19:38
@ggazzo ggazzo changed the title feat(sdk): migrate frontend transport to @rocket.chat/ddp-client (phase 2) feat(sdk): replace Meteor DDP transport with @rocket.chat/ddp-client Apr 24, 2026
@ggazzo ggazzo force-pushed the worktree-sdk-over-ddp-client branch 6 times, most recently from 4dfa46d to 56c5e1e Compare April 29, 2026 19:33
@ggazzo
Member Author

ggazzo commented Apr 29, 2026

@copilot resolve the merge conflicts in this pull request

Contributor

Copilot AI commented Apr 29, 2026

@copilot resolve the merge conflicts in this pull request

Merge conflicts resolved in d51c070. The only conflict was in apps/meteor/package.json — kept @rocket.chat/ddp-client: "workspace:^" from this branch and took the updated @rocket.chat/css-in-js: "^0.32.0" from develop.

@ggazzo ggazzo force-pushed the worktree-sdk-over-ddp-client branch from d51c070 to c2c5bd8 Compare April 30, 2026 00:04
ggazzo added 10 commits April 30, 2026 16:54
DDPDispatcher.dispatch() was pushing every payload (including the
`connect`, `sub`/`unsub`, and `ping`/`pong` frames) into the
serialized queue, which enforces wait-block ordering at its head. A
login frame dispatched while the socket is still connecting therefore
queues ahead of the `connect` frame that ws.onopen later emits — the
`connect` ends up wedged in a non-wait block behind the wait block
and never flushes, leaving the socket open but DDP-unhandshaked.

Bypass the queue for non-method frames so they go straight to the
wire while wait-block ordering still applies to method calls.
…fns crash

The hook called formatDate(time, String(undefined)) when the
Message_DateFormat setting was momentarily unloaded — passing the
literal string "undefined" to date-fns, which throws because it
contains an unescaped 'n'. Reachable any time the setting is
mid-load (e.g. /admin/info mounted via dynamic import while
public settings are still streaming in).

Pass 'LL' as the fallback so the formatter never sees a non-string
format token. Drops the now-redundant String() coercion.
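The guard described above is essentially a nullish fallback before the formatter ever sees the setting. A hypothetical sketch — `formatDate` here is a stand-in that mimics date-fns throwing on a stray unescaped token, not the real helper:

```typescript
// Stand-in formatter: mimics date-fns rejecting a format string that
// contains an unescaped letter token (like the 'n' in "undefined").
function formatDate(time: Date, format: string): string {
  if (format.includes('n')) {
    throw new Error(`unescaped token in format: ${format}`);
  }
  return `${time.toISOString()} [${format}]`;
}

function useFormattedDate(time: Date, settingValue: string | undefined): string {
  // Before: String(settingValue) produced the literal "undefined" → formatter threw.
  // After: fall back to 'LL' whenever the setting hasn't loaded yet.
  return formatDate(time, settingValue ?? 'LL');
}
```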
Stand up a singleton DDPSDK instance (apps/meteor/client/lib/sdk/ddpSdk.ts)
pointing at the current origin and keep its connection in sync with the
authenticated session via a userIdStore subscription:

- ensureConnectedAndAuthenticated() drives DDPSDK.connect() + login-with-token
  and is awaited from runUserDataSync(uid) so subscribe-on-resume races
  resolve in the right order. Recognized auth-rejection reasons trigger
  Meteor.logout() so a server-side force-logout cleanly drains client state;
  a token-stable guard avoids that path firing on transient 401s where a
  parallel flow has already swapped the stored token.
- adoptAccountFromMeteorLoginResult() syncs DDPSDK.account from a Meteor
  login result so a downstream ensureConnectedAndAuthenticated() doesn't
  fire a second redundant login on the same socket.
- onLoggedIn now bridges off Accounts.onLogin AND userIdStore so the
  callback fires reliably when Meteor's autorun chain is wedged
  (logout → fresh login over the SDK socket).
- CachedStore version bumped (18 → 19) to invalidate caches written
  before the DDP wire encoding switched from JSON to EJSON, since
  fields like Date were stringified incorrectly in the JSON window.

Also wires the SDK module into client/main.ts and bumps
@rocket.chat/ddp-client into apps/meteor's manifest. The streamerAdapter
file gets the symbol the override layer (next commit) will consume.
Switch the live DDP transport from Meteor's bundled WebSocket to the
DDPSDK socket, while preserving Meteor.connection's invoker/Accounts
machinery so existing flows (login resume, methods with stubs, the
Mongo.Collection registry) keep working unchanged:

- ddpOverREST: new `Meteor.connection._send` override routes method
  calls through DDPSDK.client.callAsync when the SDK is connected and
  the session is authenticated (or the call is the login that does the
  authenticating). Falls back to REST while DDPSDK is still booting.
  Login results are fed back through Meteor's invoker so Accounts
  state updates the same way it would over Meteor's own WS.
- ddpSdkCollectionBridge: re-feeds DDPSDK collection frames into
  Meteor.connection._streamHandlers so the Mongo.Collection registry
  keeps updating as the user logs in.
- subscribeViaSDK: routes Meteor.connection.subscribe through DDPSDK,
  with a recursion guard so the bridge above doesn't double-emit.
- killMeteorStream: permanently closes Meteor's own WS at boot now
  that DDPSDK owns the transport. Drains _outstandingMethodBlocks /
  _methodsBlockingQuiescence on logout so a logout invoker can't get
  stuck after Meteor's WS goes down.
- SDKClient (sdk.call): block on ensureConnectedAndAuthenticated()
  before dispatching, so cached-store gets that fire on the SDK socket
  immediately after a re-login don't hit an unauthenticated session
  and persist empty arrays.
- Presence streamer: route notify-logged-user-presence subscriptions
  through DDPSDK and bridge frames back into the SDK.stream event bus.
- ServerProvider: combine Meteor.connection + DDPSDK status so the
  status indicator reflects the actual transport.
The helper waited on a REST response matching `/api/v1/method.call/sendMessage`
to extract the new message's id. Once Meteor.connection._send routes
through DDPSDK, sendMessage goes over DDP/WS instead and that REST
waiter never resolves — every spec calling this helper times out at 60s.

Wait for the optimistic list item to appear and reach a non-pending
state instead. Use `>= before + 1` rather than `==` because some
flows (e.g. just-created encrypted channels) drop additional list
items in alongside the user's send.
The previous routing in ddpOverREST shipped methods over the DDPSDK
WebSocket whenever the SDK socket was up — including cached-store gets
that fire immediately after re-login, where the SDK session was briefly
unauthenticated and the server returned [] for everything. The earlier
mitigation (auth-gating) added complexity without resolving the deeper
mismatch with develop's transport choice. Realign on develop's logic
and only use DDPSDK for what it's specialised for:

- All non-bypassed DDP methods now route to REST (`method.call` /
  `method.callAnon`), matching develop. `bypassMethods` is restored to
  ['setUserStatus', 'logout'] and `UserPresence:*` / `stream-*` keep
  bypassing.
- `login` (resume + password) routes through the DDPSDK socket when
  it's connected so the same login authenticates Meteor's session AND
  the SDK socket in one hop. `adoptAccountFromMeteorLoginResult` syncs
  `sdk.account` so a downstream `ensureConnectedAndAuthenticated`
  short-circuits instead of firing a redundant second login.
- `sdk.call` (used by cached stores) now goes via `Meteor.callAsync`,
  same as develop, so methods that previously bypassed ddpOverREST
  (`permissions/get`, `subscriptions/get`, `private-settings/get`)
  now hit REST too.

Two sub-fixes that fall out of this:

- ddpOverREST's REST error path was rethrowing the API middleware's
  parsed JSON body (a plain string `error.message`) into
  `processResult`. Meteor's stream handler couldn't parse it as a DDP
  result frame, so the resume invoker never saw the rejection, the
  stale token stayed in localStorage, and the user wedged on /home with
  no main UI. Re-encode it as a proper DDP error result.
- `ensureConnectedAndAuthenticated` now drops the local credentials on
  an authentication error (`Accounts._unstoreLoginToken` +
  `Meteor.connection.setUserId(null)`) when the stored token didn't
  change mid-flight. Keeps the dead-WS path off limits — the previous
  fix using `Meteor.logout()` flaked in CI's parallel-shard runs by
  racing fresh registration / re-auth.
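The re-encoding in the first sub-fix amounts to wrapping the REST body in a DDP `result` frame carrying an `error` field. A hypothetical sketch — the function name and the exact error-field shape (loosely following Meteor.Error's `error`/`reason` fields) are illustrative:

```typescript
// Sketch: turn a parsed REST error body into a DDP result frame so
// Meteor's stream handler can deliver the rejection to the waiting invoker.
function restErrorToDdpResult(
  methodId: string,
  restBody: { error?: string; message?: string },
) {
  return {
    msg: 'result' as const,
    id: methodId, // must match the invoker's id or the rejection is lost
    error: {
      error: restBody.error ?? 'unknown-error',
      reason: restBody.message ?? 'REST call failed',
      isClientSafe: true,
    },
  };
}
```

Without this, the invoker waiting on the resume never settles, which is exactly the "stale token stays in localStorage, user wedged on /home" failure described above.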

Reverts `killMeteorStream` to leave Meteor's WS connected: the
permanent-disconnect path broke
`MethodInvoker.sendMessage()`'s `if (this.connection._stream._connected)
{ _send(...) }` gate, leaving every method invoker queued behind a
connection that never returned and ddpOverREST's `_send` wrapper never
firing.

Verified locally:
- Reload with valid token: `/home` renders, 18 REST method calls, 2 WS
  frames (login resume on DDPSDK + the SDK login from
  ensureConnectedAndAuthenticated).
- Reload with invalid token: localStorage cleared, userId null, login
  form rendered.
- administration-settings.spec.ts:26 passes locally in 8.3s.
…nticated

Stream subscriptions fired immediately after re-login (notably the
SubscriptionsCachedStore listener that re-arms via onLoggedIn) hit the
SDK socket while it was still anonymous. The server rejected those
subs with `not-allowed`/`nosub`, the stream's `ready` promise
emitted an error, and the cached store never received subsequent
server events — its in-memory state stayed frozen at the boot
snapshot.

Visible failure: an agent that took a livechat chat after relogout +
relogin saw the chat work (composer enabled, "joined" system
message), but the "Move to the queue" quick-action never appeared.
Tracing it back: in canMoveQueue (`!!routeConfig?.returnQueue && room?.u !== undefined`),
`room.u` is missing — there's a `// TODO: Solve u missing issue`
in app/livechat/server/lib/Helper.ts and livechat rooms in DB never
have `u`. RoomProvider compensates by spreading the subscription
into pseudoRoom (`{...sub, ...room}`), and the agent's subscription
brings its own `u`. With the missing notify-user/<uid>/subscriptions-changed
event, that subscription never lands in the client store, so the
spread leaves room.u undefined and the button hides.

Wrap the DDPSDK subscribe call in createNewDdpSdkStream with
`await ensureConnectedAndAuthenticated()` so subscriptions are only
sent once the socket has the agent's identity. Same pattern as the
sdk.call gate; resolves the same race for streams. `stop()` becomes
nullable-safe because subscribe might still be pending when the
caller unsubscribes.
…rvices

ddp-streamer-service (ee/apps/ddp-streamer/src/configureServer.ts) registers
`login`, `logout`, `setUserStatus` and `UserPresence:*` as native
methods on its own WebSocket — every other method delegates to the Meteor
service via `MeteorService.callMethodWithToken`, paying an extra hop that
goes (client WS → ddp-streamer → Meteor service → response back). The
develop `shouldBypass` is shaped to keep exactly those methods on the
client's own DDP WS for the fast path, and route everything else through
REST.

Our PR had aligned bypassMethods + UserPresence:* + stream-* but dropped
the `login + resume` bypass, on the rationale that killMeteorStream tore
down Meteor's WS and a bypass would deadlock. After we stopped
disconnecting Meteor's stream, that constraint went away — restoring the
resume bypass routes the fast-path back through ddp-streamer in CI's
microservices runs and lifts the post-relogin slowness that was pushing
several specs (auth.ts:9, login.ts:24) past the 5s toBeVisible timeout.

Verified locally: invalid-token reload still clears storage and shows the
login form; admin-settings/login/presence/e2ee-encryption-decryption/
omnichannel-manual-selection-logout (12 tests) all pass.
ddp-streamer-service in microservices CI only registers `login` natively
for the `{resume}` shape (see ee/apps/ddp-streamer/src/configureServer.ts).
Anything else — `{saml: true, credentialToken}`, `{user, password}`,
OAuth credentials — falls through to MeteorService.callMethodWithToken,
which is the slow extra-hop path the bypass list was designed to avoid.

The previous SDK route fired any `login` call through DDPSDK whenever
the socket was up. That meant SAML credential exchange went DDPSDK →
ddp-streamer → MeteorService.callMethodWithToken, and the success
handler then triggered `Meteor.loginWithToken(result.token)` which
queues a follow-up resume that goes via Meteor's WS bypass to
ddp-streamer again — two distinct logins for the same user, on
different sockets, with diverging account state.

The 5 SAML specs (Login, Allow password change, Logout × 2, User Merge)
in CI EE shard 5/5 all bailed at `getUserInfo` for samluser1 right
after the URL navigated to /home, even though the SAML credential
exchange completed: the user document and ensuing onLogin chain were
trapped in the cross-socket race.

Drop the SDK route for non-resume logins. They go through REST →
rocketchat-main (one hop, no extra delegation), the success handler's
`Meteor.loginWithToken` resume hits the existing bypass to Meteor's WS
→ ddp-streamer's native handler (fast path), and the SDK socket
authenticates via that resume's onLogin chain through
ensureConnectedAndAuthenticated. Login resume itself is still bypassed
in shouldBypass; this hunk just removes the dead pre-bypass SDK detour
that only fired for non-resume callers.

Sanity locally: admin-settings, login × 5, presence × 4,
e2ee-encryption-decryption, omnichannel-manual-selection-logout —
12/12 pass.
Replace Meteor.connection._stream with a stub that pretends to be
connected and forwards outbound DDP frames through the DDPSDK socket.
Goal: one WebSocket per page (the DDPSDK one), eliminating the second
auth roundtrip that was inflating boot time by ~1.5s in EE microservices.

- stubMeteorStream: disconnects Meteor's real stream, swaps in a stub
  with currentStatus.connected=true, routes method/sub/unsub frames via
  sdk.client.ddp.emit('send', ...) using Meteor's id namespace, and
  answers heartbeat pings locally with synthetic pongs. Carries the
  message/reset/disconnect listeners that Meteor registered before the
  swap onto the stub. Synthesizes a `connected` frame after a microtask
  if Meteor's WS hadn't finished its DDP handshake yet.
- ddpSdkCollectionBridge: also forwards `result` and `updated` frames so
  bypassed methods routed through the SDK socket reach Meteor's
  _methodInvokers (SDK-internal ids never collide with Meteor's numeric
  ids, so the bridge is a no-op for SDK's own callers).
- overrides/index: import order now guarantees _send override and the
  inbound bridge are wired before the stream swap.
ggazzo and others added 13 commits April 30, 2026 16:55
Two follow-ups to the stub-stream prototype:

- ddpSdkCollectionBridge: result/updated frames are only forwarded to
  Meteor when their id is Meteor-shaped. Previously bridging
  unconditionally caused "No callback invoker for method
  rc-ddp-client-1" because Meteor's _livedata_data throws when the
  methodId in an `updated` frame isn't in _methodInvokers
  (document_processors.js:168). SDK-internal ids start with
  rc-ddp-client- so a simple prefix check isolates them.
- stubMeteorStream: when a `login` method frame goes through the stub
  to the SDK socket, register an onResult listener and call
  adoptAccountFromMeteorLoginResult on success. This populates
  sdk.account.uid/user/token from Meteor's login result so
  ensureConnectedAndAuthenticated short-circuits its own
  loginWithToken — eliminating the duplicate login on every page boot.
When ensureConnectedAndAuthenticated runs at boot (from the userIdStore
subscriber) it can race Meteor's own resume login that's routed through
stubMeteorStream. Both end up as `login` method frames on the SDK socket
within the same tick. ddp-streamer's Account.login has no dedup, so each
fires Accounts.onLogin → Presence.newConnection → a duplicate row in
usersSessions for the same session id.

The duplicate stays ONLINE while the active connection flips to AWAY on
idle. processConnectionStatus prefers ONLINE over AWAY in the aggregate,
so the user.status update is a no-op (modifiedCount=0) and the
`presence.status` broadcast never fires — the navbar badge stays online
even after `UserPresence:away` succeeds server-side.

Fixes:
- Drop the boot-time `ensureConnectedAndAuthenticated` call. The Meteor
  login resume going through stubMeteorStream (with
  adoptAccountFromMeteorLoginResult populating sdk.account on the result)
  is the only auth path needed at boot.
- Gate the userIdStore-subscriber path on `Accounts.loggingIn()` and
  `sdk.account.uid`: if Meteor's login is in flight (or has just
  finished and the adopt callback set sdk.account.uid), short-circuit
  instead of issuing a redundant loginWithToken.
- Single-flight `inflightLogin` so concurrent calls share one promise.

Verified: tests/e2e/omnichannel/omnichannel-rooms-forward.spec.ts
"should be set to the queue" passes locally in 7.2s with IS_EE=true.
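The single-flight `inflightLogin` fix above is a standard share-one-promise pattern. A minimal sketch, with `loginWithTokenOnce` and the module-level slot as illustrative names:

```typescript
// Sketch: concurrent login attempts within the same window share one
// in-flight promise, so only one `login` frame hits the SDK socket.
let inflightLogin: Promise<void> | null = null;
let loginCalls = 0; // demo counter, not part of the pattern

function loginWithTokenOnce(doLogin: () => Promise<void>): Promise<void> {
  if (!inflightLogin) {
    loginCalls++;
    inflightLogin = doLogin().finally(() => {
      inflightLogin = null; // release the slot once settled
    });
  }
  return inflightLogin; // later callers await the same attempt
}
```

This is what prevents the duplicate `usersSessions` row: the second caller in the same tick awaits the first login instead of firing its own.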
- new-cap: rename local 'tracker' → 'TrackerDependency' so 'new ...()' is OK
- prefer-template: use template string for thrown Error message
- prettier formatting in stubMeteorStream pong helper
- prefer-destructuring on sdk.client.ddp
- no-useless-return on the discard-only switch cases
- drop the leftover eslint-disable directive in the bridge
…bscriber

Both `onLoggedIn` (accounts-base) and `userIdStore.subscribe` fire for
the same uid on a successful login. runUserDataSync calls
userSetUtcOffset which is rate-limited in CI/prod, so the unguarded
second invocation returns 400 too-many-requests. Worse, follow-up REST
calls (sessions/list etc.) start coming back 401 because the limiter
throttles auth checks for the rest of the window — the user lands on
'Manage Devices' and sees 'Something went wrong' because /sessions/list
got rejected.

Use a single guarded gate (`syncOnce` with a shared `lastSyncedUid`)
across both call sites and the boot fast-path.
…nnect

DDPSDK auto-fires loginWithToken on every `connected` event using the
in-memory account.user.token (DDPSDK.create line 115-122). When the
server force-logs the user out (resetUserE2EKey,
account-manage-devices logout, admin device management, etc.), the
flow on the server is:

1. broadcast 'user.forceLogout' → meteor.service listener closes the
   user's WebSocket sessions
2. Users.unsetLoginTokens(uid) wipes services.resume.loginTokens

The client sees the WS close, the SDK reconnects, and on the new
`connected` it auto-retries loginWithToken with the now-dead token.
DDPSDK calls this with `void` so the rejection is swallowed —
`account.user` stays populated, `Meteor.userId()` stays set, the
router never falls back to Login, and the navbar continues to render
Home with a stale session.

Wrap account.loginWithToken to observe rejections from this auto-retry
path: on auth error (and only when the token in localStorage is still
the same one we tried — guard against a concurrent fresh login that
already rotated it), drop local credentials so the router goes to
Login. Mirrors what ensureConnectedAndAuthenticated already does for
its own loginWithToken call.

Verified: e2ee-key-reset and e2ee-passphrase-management now pass.
The SAML login flow exposed a third concurrent-login window: after
ddpOverREST routes the SAML credential exchange via REST, line 98
fires `Meteor.loginWithToken(token)` (a fresh resume) which goes
through stubMeteorStream → SDK socket. While that resume is in flight,
`onLoggedIn` synchronously runs syncOnce → ensureConnectedAndAuthenticated.
The Accounts.loggingIn() gate doesn't help here — Meteor's accounts
package treats _loggingIn as a flag, not a counter, so the first
login's onLogin callback resets it to false even though the resume
from line 98 is still pending. ensureConnectedAndAuthenticated then
proceeds and dispatches its own loginWithToken on the SDK socket,
giving us TWO concurrent logins on the same socket again — which
ends up creating duplicate connections in usersSessions and stalls
the SAML test on PageLoading because synchronizeUserData waits on
divergent auth state.

Wire stubMeteorStream's outbound login frames into the same
`inflightLogin` slot ensureConnectedAndAuthenticated checks. Now any
Meteor-routed login (via the stub) holds the lock; the boot path
awaits it (and then short-circuits because adopt has populated
sdk.account.uid). One login on the SDK socket per page boot,
regardless of whether the trigger is a resume, SAML, password, or
OAuth flow.
…thToken wrap

The previous `account.loginWithToken` wrap (commit 08a8238) was
clearing local credentials whenever the SDK socket's auto-retry
loginWithToken hit an auth error. That fixed the e2ee-key-reset force
logout chain but broke SAML: in some flows the wrap was triggered by
SDK's auto-relogin while a fresh login was concurrently completing,
clearing the just-stored token and stranding the user mid-login.

Move the cleanup to useForceLogout instead. When the
`notify-user/<uid>/force_logout` stream fires, do an actual local
logout (`Accounts._unstoreLoginToken` + `Meteor.connection.setUserId(null)`)
on top of the existing `forceLogout` session flag. This is targeted
to the actual force-logout signal rather than any auth error and
doesn't race with normal login flows.
Two related issues showed up in SAML and post-logout-relogin tests
(e2ee-passphrase-management, account-manage-devices in EE):

- The Accounts.loggingIn() gate in ensureConnectedAndAuthenticated
  could loop forever if Meteor's _loggingIn flag stayed set (e.g.
  while the SAML follow-up resume from ddpOverREST line 98 was still
  in flight on the SDK socket). Cap it at 2s so we don't wedge the
  page on PageLoading.
- syncOnce was deduping runUserDataSync per uid, but if the first
  call rejected (notify-user/userData stream sub coming back nosub
  because the SDK socket session wasn't auth'd yet), useUserDataSyncReady
  stayed false and useMainReady never flipped. Allow a retry by clearing
  lastSyncedUid on rejection, so the next userIdStore subscriber fire
  retries against a now-authenticated session.

Also drop the setInflightStubLogin shared lock — the stub-routed
Meteor login can complete in scenarios where the response frame
doesn't reach the SDK socket (e.g. server's force_logout listener
closes the socket before the matching frame is delivered), and the
shared lock would then hold ensureConnectedAndAuthenticated open
indefinitely. The Accounts.loggingIn gate (now bounded) and the adopt
short-circuit cover the common case.
Re-add the auto-relogin auth-error handler that fixes e2ee-key-reset and
device-management force-logout flows, with two extra guards to avoid
the SAML regression that the previous version caused:

  - readStoredLoginToken() === token: nothing rotated the stored token
    mid-flight (a concurrent SAML/password/OAuth login already wrote a
    fresh one)
  - sdk.account.uid === triedWithUid: the SDK account didn't get
    refreshed by a successful adopt while we were awaiting (a parallel
    Meteor-routed login completed and updated the in-memory state)

If both guards hold, the only plausible explanation is a true server
force-logout (Users.unsetLoginTokens), so we drop local credentials and
let the router fall back to Login.

Verified locally: login.spec, account-login, e2ee-key-reset and
omnichannel-rooms-forward all pass.
…sh logins

The previous wrap cleared local credentials synchronously when DDPSDK's
auto-relogin on `connected` came back with an auth error. That fixed
the e2ee-key-reset flow (server force-logs out, SDK reconnects with
dead token, wrap clears creds, user falls back to Login) but raced
with concurrent fresh logins on:

  - e2ee-passphrase-management :76/:87 (loginByUserState +
    _pollStoredLoginToken inject a fresh token while the auto-retry
    with the dead one is still in flight)
  - saml :307 SLO (post-logout redirect chain rotates state under us)

Defer the cleanup by 500ms and re-verify the guards at the deadline.
If a concurrent fresh login completed in the meantime it will have
rotated either the stored token or sdk.account.uid; the deferred
check then bails out instead of nuking the just-stored credentials.
For genuine force-logout flows nothing else touches the state, so
the cleanup runs as before — just half a second later, well within
test timeouts.
The previous `Accounts.loggingIn?.()` gate threw at runtime
('Cannot read properties of undefined reading _loggingIn') because it
was called as a free reference and lost its `this` binding. The throw
was caught silently by runUserDataSync's try/catch, which meant the
gate never actually ran — boot timing happened to work because the
microtask between the throw and synchronizeUserData's stream subscribe
gave adopt enough time to populate sdk.account.

Replace it with an explicit ~500ms poll on sdk.account.uid. If the
Meteor-routed resume login (via stubMeteorStream → SDK socket → adopt)
completes within the window, we short-circuit and avoid issuing a
duplicate loginWithToken on the same socket — the duplicate causes a
second Presence connection, the aggregate stays online instead of
flipping to away, and the omnichannel idle/away tests fail. If adopt
doesn't fire in time, fall back to our own loginWithToken (existing
inflightLogin gate keeps it idempotent).

Verified: login, account-login, e2ee-key-reset and the full
omnichannel-rooms-forward suite pass locally.
The loginWithToken wrap clears creds via Accounts._unstoreLoginToken() +
Meteor.connection.setUserId(null), which does NOT fire Accounts.onLogout.
Without resetting lastSyncedUid on the resulting uid=undefined transition,
a subsequent re-login (e.g. e2ee-passphrase-management's loginByUserState
of the same user) is deduped and runUserDataSync never runs — the page
stays wedged on PageLoading and the Login button never hides.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… invoker

After force-logout cycles (e.g. resetUserE2EKey → ws.terminate → SDK
reconnect), stale result/updated frames for methods whose Meteor invoker
already cleared can arrive over the SDK socket. _process_updated then
throws "No callback invoker for method N" out of an async generator —
the throw escapes the bridge's try/catch as an unhandled rejection and
aborts Meteor's frame queue, so subsequent login result frames never
land and the page stays wedged on /login.

Gate the result/updated bridge on _methodInvokers[id] existing so stale
frames are silently dropped instead of corrupting Meteor's frame
processing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ggazzo ggazzo force-pushed the worktree-sdk-over-ddp-client branch from dc87f2e to f6cd3b7 on April 30, 2026 20:00
ggazzo and others added 10 commits April 30, 2026 17:50
…raining

Meteor's _streamHandlers.onMessage returns a Promise. Sync try/catch
around the bridge call doesn't catch throws inside _process_updated
("No callback invoker for method N" when a stale frame arrives) — the
throw escapes as an unhandled rejection and stops Meteor's frame queue,
so the next login's result never gets processed and the page stays
on /login.

Revert the prior _methodInvokers gating (it dropped legitimate login
result frames in the same cycle) and instead capture the async
rejection at the bridge boundary so individual bad frames don't poison
subsequent ones.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
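Capturing the rejection at the bridge boundary can be sketched as below. `onMessage` stands in for Meteor's `_streamHandlers.onMessage`; the warn text is illustrative.

```typescript
// A sync try/catch around an async handler misses rejections thrown
// inside it; attach the handler to the returned promise instead, so one
// bad frame can't abort the frame queue.
function bridgeFrame(
  onMessage: (raw: string) => Promise<void>, // stand-in for Meteor's handler
  raw: string,
): void {
  onMessage(raw).catch((err: unknown) => {
    // e.g. "No callback invoker for method N" from a stale frame
    console.warn('dropped bad DDP frame:', err);
  });
}
```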
…h-error

The 500ms-deferred cleanup wasn't enough to handle e2ee-passphrase-management:
the test's loginByUserState fires _pollStoredLoginToken with the same token
already in localStorage, so Meteor's poller bails (cached token == current).
By the time the wrap's setTimeout fires, the test has already injected the
SAME token (mongo $addToSet re-added it server-side after unsetLoginTokens),
but the wrap was clearing creds anyway, leaving the page stuck on /login
with no follow-up login firing.

Replace the unconditional clear with a Meteor.loginWithToken retry against
whatever's in localStorage right now. If the token was rotated (or re-added
to mongo concurrently), the retry succeeds; if it's truly stale (real
force-logout, no concurrent recovery), Meteor's callback invokes
forceClientLoggedOut to drive the user to /login as before.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…failure

Add console.warn diagnostics around the wrap's deferred recovery so the
trace shows whether the retry is firing, what the guard values are, and
whether Meteor.loginWithToken throws/resolves/rejects. Will revert once
we have a fix locked in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Meteor's accounts-base registers a per-call DDP.onReconnect handler in
callLoginMethod (accounts_client.js:292) that retries login with the
latest stored token and calls makeClientLoggedOut on failure. With our
stub keeping Meteor's connection permanently 'connected', that handler
never fires when the underlying SDK socket reconnects, so server-side
force-logouts (resetUserE2EKey → ws.terminate) leave the user with stale
credentials and no automatic recovery.

Listen for the SDK's 'connected' event in stubMeteorStream and fire
'reset' on every subsequent reconnect — Meteor's _streamHandlers.onReset
then drives _callOnReconnectAndSendAppropriateOutstandingMethods, which
in turn invokes the onReconnect callback. That covers both branches:
  - if localStorage was rotated by a concurrent flow, the resume retry
    succeeds and setUserId fires;
  - if the token is genuinely stale, the resume retry fails and Meteor's
    own makeClientLoggedOut clears creds and routes to /login.

Now that Meteor's flow handles the recovery, simplify the
sdk.account.loginWithToken wrap to just swallow the auth-error
rejection (so DDPSDK's `void` auto-relogin doesn't surface as an
unhandled rejection / pageError) — no more deferred cleanup, no
retry-through-Meteor duplication.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous full _streamHandlers.onReset path was too aggressive: it
re-sent the outstanding method blocks, but those methods had already
completed on the prior SDK socket session, and re-feeding their
result/updated frames through the bridge yielded "method result but no
methods outstanding" + "No callback invoker for method N" warnings.

Drive only what we actually need on SDK reconnect — the DDP._reconnectHook
callbacks. The accounts-base _reconnectStopper registered by
callLoginMethod is in there, and that's what retries login with the
latest stored token and calls makeClientLoggedOut on auth failure.
DDPSDK already handles its own subscription resends on reconnect, so we
don't need _resendSubscriptions either.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Surgical onReconnect-only call broke message-actions, report-message,
and e2ee-encryption-decryption — those tests fire methods exactly when
the SDK socket churns from a force-logout cycle, and without the full
reset's _callOnReconnectAndSendAppropriateOutstandingMethods step the
methods stay marked sentMessage=true forever. The bridge's async catch
already absorbs the "method result but no methods outstanding" warnings
that the resent blocks generate, so the noise is harmless.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Log every SDK 'connected' event, the stored token at that moment, the
Meteor userId, and every _pollStoredLoginToken call (with the
current/last token and whether it would fire). Will revert once we
have :87 nailed down — the trace currently shows nothing between the
force-logout disconnect and the failing waitForLogin assertion, so we
need eyes on what actually runs in that window.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The microservices ddp-streamer was using ws.terminate() (TCP RST) for
broadcast force-logouts. The monolith path uses session.connectionHandle.close()
which is graceful and flushes the WS buffer first — letting the
`notify-user/<uid>/force_logout` stream message (queued by
apps/meteor/server/modules/listeners/listeners.module.ts:49) reach the
client before the socket goes down.

In EE that stream message races with the terminate, terminate wins, the
client's useForceLogout hook never fires, and tests like
e2e-encryption/e2ee-passphrase-management.spec.ts:87 are left with
stale localStorage credentials and a Login button that never hides.

Switch to ws.close() with a 5s setTimeout fallback to terminate() for
unresponsive sockets — matches the graceful-close semantics the
monolith already relies on without losing the safety net for zombies.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
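The graceful-close-with-fallback pattern can be sketched as below, assuming a ws-package-like socket shape (`close` / `terminate` / `once`); this is an illustration, not the ddp-streamer code.

```typescript
// Sketch: graceful close first, escalate to terminate() only if the
// peer never completes the close handshake (zombie socket).
interface SocketLike {
  close(): void;
  terminate(): void;
  once(event: 'close', cb: () => void): void;
}

function gracefulClose(ws: SocketLike, fallbackMs = 5000): void {
  const timer = setTimeout(() => ws.terminate(), fallbackMs);
  ws.once('close', () => clearTimeout(timer)); // clean close: no RST
  // close() flushes the WS buffer first, so an already-queued stream
  // message (e.g. notify-user/<uid>/force_logout) can still reach the
  // client before the socket goes down.
  ws.close();
}
```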
…tStopper

The full fire('reset') was firing accounts-base's _reconnectStopper, which
retries login with the captured `result.token` from the original
callLoginMethod scope. After force-logout, that token is the stale one
the server just invalidated. The retry runs on a wait:true block that
queues AHEAD of the test's own loginByUserState; the stopper's
userCallback then calls makeClientLoggedOut, which clobbers the
credentials the test just injected, and the test's queued login never
sends a frame.

With the ddp-streamer ws.close() change, useForceLogout now reliably
fires (the notify-user/<uid>/force_logout stream message arrives before
the socket closes), so we don't need accounts-base's reconnect-time
relogin retry at all. We still need to resend in-flight methods so that
tests like message-actions / report-message / e2ee-encryption-decryption
don't wedge.

Mirror onReset's _handleOutstandingMethodsOnReset +
_sendOutstandingMethodBlocksMessages + _resendSubscriptions directly,
skipping _callOnReconnectAndSendAppropriateOutstandingMethods (which is
where _reconnectStopper would fire).

Also drop the diagnostic logging now that we have a fix in mind.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 'skip _reconnectStopper' approach broke force-logout coverage:
e2ee-key-reset, account-manage-devices, admin-device-management all
failed because, even with the ddp-streamer ws.close() server fix, the
notify-user/<uid>/force_logout stream message still races with the
close in microservices (it travels rocketchat-main → broker → ddp-streamer
→ WS, while the close listener fires directly on ddp-streamer). The
graceful close only flushes what's already in the WS buffer at that
moment — if the stream message is still in transit, it's still lost.

So the per-call _reconnectStopper from callLoginMethod IS the
load-bearing failsafe in microservices: it retries login with the
latest stored token and calls makeClientLoggedOut on auth failure,
which is what drives the user to /login when the stream message is
lost.

Restore the full fire('reset'). :87 is the only remaining EE 2/5
regression and a separate problem to solve.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>