
fix(deps): backport matrix-rust-sdk#6361 to stop idle sync loop #101

Open
TigerInYourDream wants to merge 2 commits into main from fix/30-room-list-long-poll-backport

Conversation

TigerInYourDream commented Apr 16, 2026

Summary

Backport of upstream matrix-org/matrix-rust-sdk#6361 onto our space_room_suggested base, to address the idle sliding-sync request loop tracked in #30.

Pin matrix-sdk{, -base, -ui} to a SHA on the Robius-China fork that is 627563bb (our previous space_room_suggested tip) plus one cherry-pick: upstream #6361's production-code commit (c7573469b).

What the upstream fix does

RoomListService::requires_timeout was forcing timeout=0 for all post-init states — SettingUp, Recovering, and Running (before fully_loaded). While idle, the client kept re-sending the same pos right away instead of long-polling, producing the pos=<n>&timeout=0 spam tracked in #30.

After the fix:

  • State::Init → PollTimeout::Some(0) (unchanged; the first sync still returns immediately so the session establishes fast)
  • State::SettingUp | Recovering | Running → PollTimeout::Default (the server long-polls when idle, and still responds immediately when it has pending changes)
  • State::Error { .. } | State::Terminated { .. } → PollTimeout::Some(0)
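
The post-fix state-to-timeout mapping above can be sketched as a plain Rust match. This is an illustrative stand-in, not the SDK's actual code: PollTimeout and SyncState here are simplified local types, not the real matrix-sdk-ui API.

```rust
// Simplified stand-ins for the SDK's state and timeout types (illustrative only).
#[derive(Debug, Clone, Copy, PartialEq)]
enum PollTimeout {
    Some(u64), // explicit timeout in milliseconds
    Default,   // let sliding sync long-poll with its default timeout
}

#[derive(Debug, Clone, Copy)]
enum SyncState {
    Init,
    SettingUp,
    Recovering,
    Running,
    Error,
    Terminated,
}

// Post-fix behavior: only Init (and terminal states) force timeout=0;
// the steady states long-poll, so an idle client stops re-sending the same pos.
fn poll_timeout(state: SyncState) -> PollTimeout {
    match state {
        // First sync returns immediately so the session establishes fast.
        SyncState::Init => PollTimeout::Some(0),
        // Long-poll while idle; the server still answers early on changes.
        SyncState::SettingUp | SyncState::Recovering | SyncState::Running => {
            PollTimeout::Default
        }
        // Error / terminated states keep the immediate return.
        SyncState::Error | SyncState::Terminated => PollTimeout::Some(0),
    }
}

fn main() {
    assert_eq!(poll_timeout(SyncState::Init), PollTimeout::Some(0));
    assert_eq!(poll_timeout(SyncState::Running), PollTimeout::Default);
    println!("mapping ok");
}
```

Pre-fix, the SettingUp/Recovering/Running arm also returned Some(0), which is what produced the idle re-send loop.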

Why minimum-risk

  • Runtime delta vs. previous dep is exactly one commit — a closure-body rewrite in crates/matrix-sdk-ui/src/room_list_service/mod.rs. No public API / type / feature-flag changes, so zero call-site ripple in Robrix2.
  • space_room_suggested customizations (thread-subscriptions, ring crypto provider, suggested field on SpaceRoom, etc.) are all preserved on the same base commit 627563bb.
  • SHA-pinned (rev = "..."), not branch-tracked — the release build cannot silently absorb later drift on the fork.
  • Cherry-pick applied conflict-free. space_room_suggested modifies other parts of room_list_service/mod.rs but does not overlap the requires_timeout closure; pre-verified with git merge-tree --write-tree.

Test plan

  • cargo check --all-targets passes locally (22.77s with incremental reuse)
  • Smoke-test against local Palpo — confirm the pos=<n>&timeout=0 spam stops once the client reaches idle (verified 2026-04-16)
  • Smoke-test against matrix.org — normal room list / timeline / message send work unchanged
  • Confirm no regression in thread-subscriptions or SpaceRoom suggested field behavior (the space_room_suggested customizations that sit alongside the fix)

Rollback plan

Once upstream #6361 merges and propagates onto project-robius/matrix-rust-sdk's space_room_suggested branch:

  1. Revert the 3 Cargo.toml lines to project-robius/matrix-rust-sdk branch = "space_room_suggested"
  2. cargo update -p matrix-sdk -p matrix-sdk-base -p matrix-sdk-ui
  3. Delete the Project-Robius-China/matrix-rust-sdk@fix/room-list-long-poll-after-initial-sync branch
  4. Close #30 (Track idle sliding-sync request loop after initial sync)
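
The step-1 revert amounts to swapping a SHA pin back to branch tracking. Roughly (illustrative line shape; the fork names and rev are the ones quoted in this PR, and the same change applies to matrix-sdk-base and matrix-sdk-ui):

```toml
# Current backport pin (SHA-locked, cannot silently drift):
matrix-sdk = { git = "https://github.com/Project-Robius-China/matrix-rust-sdk", rev = "cb391f70" }

# After upstream #6361 propagates, back to branch tracking:
matrix-sdk = { git = "https://github.com/project-robius/matrix-rust-sdk", branch = "space_room_suggested" }
```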

Refs #30

Pin matrix-sdk{, -base, -ui} to Project-Robius-China/matrix-rust-sdk
@ cb391f70 — which is 627563bb (our previous space_room_suggested
tip) plus a single cherry-pick of upstream matrix-org#6361's
production-code commit (c7573469b).

The `requires_timeout` closure in RoomListService was forcing
`timeout=0` for all post-init states, so idle clients kept re-sending
the same `pos` immediately instead of long-polling. The backport
restricts `timeout=0` to `State::Init` only and lets SettingUp /
Recovering / Running use `PollTimeout::Default`, so the server can
long-poll while idle (and still answer immediately when it has
pending changes).

Runtime delta is exactly one closure-body rewrite in
crates/matrix-sdk-ui/src/room_list_service/mod.rs — no public API,
type, or feature-flag changes.

Refs #30

TigerInYourDream commented Apr 16, 2026

Smoke-test verification against local Palpo

Ran the release binary (target/release/robrix, built 2026-04-16 13:34 local, same branch as this PR) against the local robrix2-testenv-palpo-1 container. Observed the MSC3575 /sync request pattern via docker logs --timestamps robrix2-testenv-palpo-1.

Phase 1 — catching up with backlog (~27s)

05:35:02  pos=4278  timeout=0       ← State::Init, fix preserves timeout=0 by design
05:35:08  pos=4278  timeout=30000   ← switched to long-poll immediately after Init
05:35:11  pos=4279  timeout=30000   ← server returns early because it has data
05:35:14  pos=4281  timeout=30000
...
05:35:29  pos=4304  timeout=30000   ← caught up

The fast cadence here is textbook long-poll "server returns early when it has pending changes" — not the old bug.

Phase 2 — idle, pos stable at 4306

05:35:29.246  pos=4306  timeout=30000
05:35:59.258  pos=4306  timeout=30000   (Δ = 30.012 s)
05:36:29.278  pos=4306  timeout=30000   (Δ = 30.020 s)
05:36:59.301  pos=4306  timeout=30000   (Δ = 30.022 s)
05:37:29.318  pos=4306  timeout=30000   (Δ = 30.017 s)
05:37:59.351  pos=4306  timeout=30000   (Δ = 30.033 s)

Each request is held by Palpo for exactly ~30 s (the requested timeout), then returns empty, and the client re-sends. This is correct long-polling — the exact behavior the upstream PR targets.
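
The ~30 s spacing can be checked mechanically from the logged timestamps. A small sketch, using the six Phase 2 timestamps copied from the log above:

```rust
fn main() {
    // (hour, minute, second.millis) of each idle-phase request, from the log above.
    let ts = [
        (5, 35, 29.246),
        (5, 35, 59.258),
        (5, 36, 29.278),
        (5, 36, 59.301),
        (5, 37, 29.318),
        (5, 37, 59.351),
    ];
    // Convert to seconds-of-day and check that every gap sits just above the
    // requested 30 s long-poll timeout.
    let secs: Vec<f64> = ts
        .iter()
        .map(|&(h, m, s)| f64::from(h * 3600 + m * 60) + s)
        .collect();
    for w in secs.windows(2) {
        let delta = w[1] - w[0];
        assert!(delta > 30.0 && delta < 30.1, "unexpected gap: {delta:.3}");
    }
    println!("all idle gaps between 30.0 and 30.1 s");
}
```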

Quantified comparison

Metric             Bug (pre-fix)                Observed (post-fix)
timeout=0 share    ~100%                        1/112 ≈ 0.9% (only the State::Init sync)
Idle request rate  >1000 req/min (tight loop)   5.6 req/min
Same-pos spacing   <1 ms                        ~30 s (= requested timeout)

Verdict

Running binary confirmed to be built from this branch (mtime Apr 16 13:34 > commit d18631a3 at 13:11:43).


Real pre/post measurement from the 25-hour Palpo log

The same Palpo container has been running for 25 hours — covering both the buggy pre-fix binary and this fix. Aggregating docker logs --timestamps of robrix2-testenv-palpo-1 gives direct before/after numbers (not estimates):

MSC3575 /sync request volume per hour

Hour (UTC, 2026-04-16)  Robrix binary    Requests   Rate
02:00                   pre-fix (buggy)  12,329     ~205 req/min
03:00                   pre-fix (buggy)  21,409     ~357 req/min (peak)
04:00                   pre-fix (tail)   3,839      ~64 req/min
05:00                   transition       102
05:36 – 05:56           post-fix         84 total   ~4 req/min

Busiest single minute across 25 hours vs. idle minute now

Minute (UTC)      Binary         Requests    timeout=0 share
2026-04-16T02:16  pre-fix peak   738 / min   731 / 738 = 99.1%
2026-04-16T05:55  post-fix idle  4 / min     0 / 4 = 0%

Reduction factor

  • Request rate: 738 / 4 ≈ 184× reduction at the peak-vs-idle point
  • Sustained hourly load: 21,409 / 240 (extrapolated 4 req/min × 60) ≈ 89× reduction
  • timeout=0 share: 99.1% → 0.9% ≈ 110× reduction
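
As a pure sanity check, the reduction factors quoted above recompute directly from the raw counts in the tables:

```rust
fn main() {
    // Peak pre-fix minute (738 req) vs idle post-fix minute (4 req).
    let peak_vs_idle = 738.0 / 4.0; // 184.5x
    // Peak pre-fix hour (21,409 req) vs extrapolated post-fix hour (4 req/min * 60).
    let hourly = 21_409.0 / (4.0 * 60.0); // ~89.2x
    // timeout=0 share: 99.1% pre-fix vs 0.9% post-fix.
    let timeout0 = 99.1 / 0.9; // ~110.1x
    assert!((peak_vs_idle - 184.5_f64).abs() < f64::EPSILON);
    assert!((hourly - 89.2_f64).abs() < 0.01);
    assert!((timeout0 - 110.1_f64).abs() < 0.05);
    println!("{peak_vs_idle:.1}x / {hourly:.1}x / {timeout0:.1}x");
}
```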

Server-side CPU impact (docker stats --no-stream)

Container                         CPU % (post-fix, idle)
robrix2-testenv-palpo-1           0.05%
robrix2-testenv-palpo_postgres-1  0.32%

Palpo is essentially idle on this route now. Before the fix, at ~357 req/min sustained, it was doing MSC3575 room-list diff computation + Postgres lookups hundreds of times per minute; that load is gone.

Bottom line

This is not a "theoretical cleanup" — it is an observable ~100× reduction in sync traffic and a corresponding drop in Palpo CPU load from the same local test env, captured from real logs spanning the binary swap.

Review thread on Cargo.lock (outdated)

  dependencies = [
   "libc",
- "windows-sys 0.59.0",
+ "windows-sys 0.61.1",

Need to keep the windows-sys version unchanged.


It seems that Project-robius Robrix does not have this issue.

TigerInYourDream (Author) replied:

Good catch, thanks for flagging this — you're right that cargo update -p matrix-sdk ... opportunistically re-resolved unrelated transitive deps. That was scope creep, not intentional.

Fixed in e3fbcae:

  1. Reset Cargo.lock to origin/main baseline.
  2. Surgically replaced only the 8 matrix-sdk{, -base, -common, -crypto, -sqlite, -store-encryption, -ui, -indexeddb-stores} source = URL lines to point at the backport fork at rev cb391f70.
  3. windows-sys 0.59.0 at line 2037 (and all other transitive entries) now match origin/main byte-for-byte — verified: 8 / 8 occurrences identical.
  4. cargo check --locked passes in 34.63s with zero warnings.

The only extra diff line is robrix = 0.0.1-pre-alpha-4 (vs the lock's stale 0.1.0-pre-alpha-1) — that's Cargo auto-correcting a pre-existing inconsistency in origin/main's lockfile under --locked mode; unavoidable without a separate Cargo.toml version bump.

Final PR scope: 3 Cargo.toml lines + 8 Cargo.lock source-URL lines + 1 auto-corrected version field. Nothing else changed.

Happy to squash d18631a3 + e3fbcae8 into a single clean commit if you'd prefer that for merge.

Reviewer flagged that the previous `cargo update -p matrix-sdk ...`
opportunistically re-resolved unrelated transitive deps (windows-sys
0.59.0 → 0.61.1 for errno, etc.).

This commit resets Cargo.lock to `origin/main` and then surgically
patches only the 8 matrix-sdk{, -base, -common, -crypto, -sqlite,
-store-encryption, -ui, -indexeddb-stores} source URLs to point to
the backport fork at rev cb391f70. All other transitive entries now
exactly match `origin/main`.

The single extra line (`robrix = 0.0.1-pre-alpha-4` vs the lock's
stale `0.1.0-pre-alpha-1`) is Cargo auto-correcting a pre-existing
inconsistency in `origin/main` — unavoidable under `--locked`.

Verified with `cargo check --locked` (34.63s, zero warnings).

Refs #30, addresses review comment from @alanpoon on PR #101
TigerInYourDream added a commit that referenced this pull request Apr 16, 2026
TigerInYourDream force-pushed the fix/30-room-list-long-poll-backport branch from e3fbcae to 5b74337 (April 16, 2026 14:44)
TigerInYourDream (Author) commented:

Pre-fix (25h Palpo log, same container):

  • Peak minute 2026-04-16T02:16: 738 requests/min, 731 of them with timeout=0 (99.1%)
  • Peak hour: 21,409 requests, ~357 req/min sustained
  • Palpo doing MSC3575 room-list diff + Postgres lookups hundreds of times/min

Post-fix (same binary, same container, idle):

  • 4 requests/min, 0% timeout=0, 30.012–30.033s spacing (= long-poll held)
  • Palpo CPU 0.05%

180× fewer requests, ≈ 110× drop in timeout=0 share. Full breakdown in the previous comment.

The bug window is State::Running && !is_fully_loaded. robrix2's sliding_sync config + Octos appservice injection keeps that window open much longer than a vanilla setup — which plausibly explains why it's not observable in project-robius/robrix. The underlying SDK code path is the same regardless.


Successfully merging this pull request may close: Track idle sliding-sync request loop after initial sync (#30)