
fix(deps): backport matrix-rust-sdk#6361 to stop idle sync loop #101

Open
TigerInYourDream wants to merge 2 commits into main from fix/30-room-list-long-poll-backport

Conversation

TigerInYourDream commented Apr 16, 2026

Summary

Backport of upstream matrix-org/matrix-rust-sdk#6361 onto our space_room_suggested base, to address the idle sliding-sync request loop tracked in #30.

Pin matrix-sdk{, -base, -ui} to a SHA on the Robius-China fork that is 627563bb (our previous space_room_suggested tip) plus one cherry-pick: upstream #6361's production-code commit (c7573469b).

What the upstream fix does

RoomListService::requires_timeout was forcing timeout=0 for all post-init states — SettingUp, Recovering, and Running (before fully_loaded). While idle, the client kept re-sending the same pos right away instead of long-polling, producing the pos=<n>&timeout=0 spam tracked in #30.

After the fix:

  • State::Init → PollTimeout::Some(0) (unchanged; the first sync still returns immediately so the session establishes fast)
  • State::SettingUp | Recovering | Running → PollTimeout::Default (the server long-polls when idle, and still responds immediately when it has pending changes)
  • State::Error { .. } | State::Terminated { .. } → PollTimeout::Some(0)
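
The post-fix state-to-timeout mapping above can be sketched as a plain Rust match. This is an illustrative stand-in, not the SDK's actual code: PollTimeout and SyncState here are simplified local types, not the real matrix-sdk-ui API.

```rust
// Simplified stand-ins for the SDK's state and timeout types (illustrative only).
#[derive(Debug, Clone, Copy, PartialEq)]
enum PollTimeout {
    Some(u64), // explicit timeout in milliseconds
    Default,   // let sliding sync long-poll with its default timeout
}

#[derive(Debug, Clone, Copy)]
enum SyncState {
    Init,
    SettingUp,
    Recovering,
    Running,
    Error,
    Terminated,
}

// Post-fix behavior: only Init (and terminal states) force timeout=0;
// the steady states long-poll, so an idle client stops re-sending the same pos.
fn poll_timeout(state: SyncState) -> PollTimeout {
    match state {
        // First sync returns immediately so the session establishes fast.
        SyncState::Init => PollTimeout::Some(0),
        // Long-poll while idle; the server still answers early on changes.
        SyncState::SettingUp | SyncState::Recovering | SyncState::Running => {
            PollTimeout::Default
        }
        // Error / terminated states keep the immediate return.
        SyncState::Error | SyncState::Terminated => PollTimeout::Some(0),
    }
}

fn main() {
    assert_eq!(poll_timeout(SyncState::Init), PollTimeout::Some(0));
    assert_eq!(poll_timeout(SyncState::Running), PollTimeout::Default);
    println!("mapping ok");
}
```

Pre-fix, the SettingUp/Recovering/Running arm also returned Some(0), which is what produced the idle re-send loop.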

Why minimum-risk

  • Runtime delta vs. previous dep is exactly one commit — a closure-body rewrite in crates/matrix-sdk-ui/src/room_list_service/mod.rs. No public API / type / feature-flag changes, so zero call-site ripple in Robrix2.
  • space_room_suggested customizations (thread-subscriptions, ring crypto provider, suggested field on SpaceRoom, etc.) are all preserved on the same base commit 627563bb.
  • SHA-pinned (rev = "..."), not branch-tracked — the release build cannot silently absorb later drift on the fork.
  • Cherry-pick applied conflict-free. space_room_suggested modifies other parts of room_list_service/mod.rs but does not overlap the requires_timeout closure; pre-verified with git merge-tree --write-tree.

Test plan

  • cargo check --all-targets passes locally (22.77s with incremental reuse)
  • Smoke-test against local Palpo — confirm the pos=<n>&timeout=0 spam stops once the client reaches idle (verified 2026-04-16)
  • Smoke-test against matrix.org — normal room list / timeline / message send work unchanged
  • Confirm no regression in thread-subscriptions or SpaceRoom suggested field behavior (the space_room_suggested customizations that sit alongside the fix)

Rollback plan

Once upstream #6361 merges and propagates onto project-robius/matrix-rust-sdk's space_room_suggested branch:

  1. Revert the 3 Cargo.toml lines to project-robius/matrix-rust-sdk branch = "space_room_suggested"
  2. cargo update -p matrix-sdk -p matrix-sdk-base -p matrix-sdk-ui
  3. Delete the Project-Robius-China/matrix-rust-sdk@fix/room-list-long-poll-after-initial-sync branch
  4. Close #30 (Track idle sliding-sync request loop after initial sync)
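
The step-1 revert amounts to swapping a SHA pin back to branch tracking. Roughly (illustrative line shape; the fork names and rev are the ones quoted in this PR, and the same change applies to matrix-sdk-base and matrix-sdk-ui):

```toml
# Current backport pin (SHA-locked, cannot silently drift):
matrix-sdk = { git = "https://github.com/Project-Robius-China/matrix-rust-sdk", rev = "cb391f70" }

# After upstream #6361 propagates, back to branch tracking:
matrix-sdk = { git = "https://github.com/project-robius/matrix-rust-sdk", branch = "space_room_suggested" }
```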

Refs #30

Pin matrix-sdk{, -base, -ui} to Project-Robius-China/matrix-rust-sdk
@ cb391f70 — which is 627563bb (our previous space_room_suggested
tip) plus a single cherry-pick of upstream matrix-org#6361's
production-code commit (c7573469b).

The `requires_timeout` closure in RoomListService was forcing
`timeout=0` for all post-init states, so idle clients kept re-sending
the same `pos` immediately instead of long-polling. The backport
restricts `timeout=0` to `State::Init` only and lets SettingUp /
Recovering / Running use `PollTimeout::Default`, so the server can
long-poll while idle (and still answer immediately when it has
pending changes).

Runtime delta is exactly one closure-body rewrite in
crates/matrix-sdk-ui/src/room_list_service/mod.rs — no public API,
type, or feature-flag changes.

Refs #30

TigerInYourDream commented Apr 16, 2026

Smoke-test verification against local Palpo

Ran the release binary (target/release/robrix, built 2026-04-16 13:34 local, same branch as this PR) against the local robrix2-testenv-palpo-1 container. Observed the MSC3575 /sync request pattern via docker logs --timestamps robrix2-testenv-palpo-1.

Phase 1 — catching up with backlog (~27s)

05:35:02  pos=4278  timeout=0       ← State::Init, fix preserves timeout=0 by design
05:35:08  pos=4278  timeout=30000   ← switched to long-poll immediately after Init
05:35:11  pos=4279  timeout=30000   ← server returns early because it has data
05:35:14  pos=4281  timeout=30000
...
05:35:29  pos=4304  timeout=30000   ← caught up

The fast cadence here is textbook long-poll "server returns early when it has pending changes" — not the old bug.

Phase 2 — idle, pos stable at 4306

05:35:29.246  pos=4306  timeout=30000
05:35:59.258  pos=4306  timeout=30000   (Δ = 30.012 s)
05:36:29.278  pos=4306  timeout=30000   (Δ = 30.020 s)
05:36:59.301  pos=4306  timeout=30000   (Δ = 30.022 s)
05:37:29.318  pos=4306  timeout=30000   (Δ = 30.017 s)
05:37:59.351  pos=4306  timeout=30000   (Δ = 30.033 s)

Each request is held by Palpo for exactly ~30 s (the requested timeout), then returns empty, and the client re-sends. This is correct long-polling — the exact behavior the upstream PR targets.
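
The ~30 s spacing can be checked mechanically from the logged timestamps. A small sketch, using the six Phase 2 timestamps copied from the log above:

```rust
fn main() {
    // (hour, minute, second.millis) of each idle-phase request, from the log above.
    let ts = [
        (5, 35, 29.246),
        (5, 35, 59.258),
        (5, 36, 29.278),
        (5, 36, 59.301),
        (5, 37, 29.318),
        (5, 37, 59.351),
    ];
    // Convert to seconds-of-day and check that every gap sits just above the
    // requested 30 s long-poll timeout.
    let secs: Vec<f64> = ts
        .iter()
        .map(|&(h, m, s)| f64::from(h * 3600 + m * 60) + s)
        .collect();
    for w in secs.windows(2) {
        let delta = w[1] - w[0];
        assert!(delta > 30.0 && delta < 30.1, "unexpected gap: {delta:.3}");
    }
    println!("all idle gaps between 30.0 and 30.1 s");
}
```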

Quantified comparison

Metric             Bug (pre-fix)                Observed (post-fix)
timeout=0 share    ~100%                        1/112 ≈ 0.9% (only the State::Init sync)
Idle request rate  >1000 req/min (tight loop)   5.6 req/min
Same-pos spacing   <1 ms                        ~30 s (= requested timeout)

Verdict

Running binary confirmed to be built from this branch (mtime Apr 16 13:34 > commit d18631a3 at 13:11:43).


Real pre/post measurement from the 25-hour Palpo log

The same Palpo container has been running for 25 hours — covering both the buggy pre-fix binary and this fix. Aggregating docker logs --timestamps of robrix2-testenv-palpo-1 gives direct before/after numbers (not estimates):

MSC3575 /sync request volume per hour

Hour (UTC, 2026-04-16)  Robrix binary    Requests   Rate
02:00                   pre-fix (buggy)  12,329     ~205 req/min
03:00                   pre-fix (buggy)  21,409     ~357 req/min (peak)
04:00                   pre-fix (tail)   3,839      ~64 req/min
05:00                   transition       102
05:36 – 05:56           post-fix         84 total   ~4 req/min

Busiest single minute across 25 hours vs. idle minute now

Minute (UTC)      Binary         Requests    timeout=0 share
2026-04-16T02:16  pre-fix peak   738 / min   731 / 738 = 99.1%
2026-04-16T05:55  post-fix idle  4 / min     0 / 4 = 0%

Reduction factor

  • Request rate: 738 / 4 ≈ 184× reduction at the peak-vs-idle point
  • Sustained hourly load: 21,409 / 240 (extrapolated 4 req/min × 60) ≈ 89× reduction
  • timeout=0 share: 99.1% → 0.9% ≈ 110× reduction
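
As a pure sanity check, the reduction factors quoted above recompute directly from the raw counts in the tables:

```rust
fn main() {
    // Peak pre-fix minute (738 req) vs idle post-fix minute (4 req).
    let peak_vs_idle = 738.0 / 4.0; // 184.5x
    // Peak pre-fix hour (21,409 req) vs extrapolated post-fix hour (4 req/min * 60).
    let hourly = 21_409.0 / (4.0 * 60.0); // ~89.2x
    // timeout=0 share: 99.1% pre-fix vs 0.9% post-fix.
    let timeout0 = 99.1 / 0.9; // ~110.1x
    assert!((peak_vs_idle - 184.5_f64).abs() < f64::EPSILON);
    assert!((hourly - 89.2_f64).abs() < 0.01);
    assert!((timeout0 - 110.1_f64).abs() < 0.05);
    println!("{peak_vs_idle:.1}x / {hourly:.1}x / {timeout0:.1}x");
}
```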

Server-side CPU impact (docker stats --no-stream)

Container                         CPU % (post-fix, idle)
robrix2-testenv-palpo-1           0.05%
robrix2-testenv-palpo_postgres-1  0.32%

Palpo is essentially idle on this route now. Before the fix, at ~357 req/min sustained, it was doing MSC3575 room-list diff computation + Postgres lookups hundreds of times per minute; that load is gone.

Bottom line

This is not a "theoretical cleanup" — it is an observable ~100× reduction in sync traffic and a corresponding drop in Palpo CPU load from the same local test env, captured from real logs spanning the binary swap.

Review thread on Cargo.lock (outdated)

  dependencies = [
   "libc",
- "windows-sys 0.59.0",
+ "windows-sys 0.61.1",

Need to keep the windows-sys version unchanged.


It seems that Project-robius Robrix does not have this issue.

TigerInYourDream (Author) replied:

Good catch, thanks for flagging this — you're right that cargo update -p matrix-sdk ... opportunistically re-resolved unrelated transitive deps. That was scope creep, not intentional.

Fixed in e3fbcae:

  1. Reset Cargo.lock to origin/main baseline.
  2. Surgically replaced only the 8 matrix-sdk{, -base, -common, -crypto, -sqlite, -store-encryption, -ui, -indexeddb-stores} source = URL lines to point at the backport fork at rev cb391f70.
  3. windows-sys 0.59.0 at line 2037 (and all other transitive entries) now match origin/main byte-for-byte — verified: 8 / 8 occurrences identical.
  4. cargo check --locked passes in 34.63s with zero warnings.

The only extra diff line is robrix = 0.0.1-pre-alpha-4 (vs the lock's stale 0.1.0-pre-alpha-1) — that's Cargo auto-correcting a pre-existing inconsistency in origin/main's lockfile under --locked mode; unavoidable without a separate Cargo.toml version bump.

Final PR scope: 3 Cargo.toml lines + 8 Cargo.lock source-URL lines + 1 auto-corrected version field. Nothing else changed.

Happy to squash d18631a3 + e3fbcae8 into a single clean commit if you'd prefer that for merge.

Reviewer flagged that the previous `cargo update -p matrix-sdk ...`
opportunistically re-resolved unrelated transitive deps (windows-sys
0.59.0 → 0.61.1 for errno, etc.).

This commit resets Cargo.lock to `origin/main` and then surgically
patches only the 8 matrix-sdk{, -base, -common, -crypto, -sqlite,
-store-encryption, -ui, -indexeddb-stores} source URLs to point to
the backport fork at rev cb391f70. All other transitive entries now
exactly match `origin/main`.

The single extra line (`robrix = 0.0.1-pre-alpha-4` vs the lock's
stale `0.1.0-pre-alpha-1`) is Cargo auto-correcting a pre-existing
inconsistency in `origin/main` — unavoidable under `--locked`.

Verified with `cargo check --locked` (34.63s, zero warnings).

Refs #30, addresses review comment from @alanpoon on PR #101
TigerInYourDream added a commit that referenced this pull request Apr 16, 2026
TigerInYourDream force-pushed the fix/30-room-list-long-poll-backport branch from e3fbcae to 5b74337 (April 16, 2026 14:44)
TigerInYourDream (Author) commented:

Pre-fix (25h Palpo log, same container):

  • Peak minute 2026-04-16T02:16: 738 requests/min, 731 of them with timeout=0 (99.1%)
  • Peak hour: 21,409 requests, ~357 req/min sustained
  • Palpo doing MSC3575 room-list diff + Postgres lookups hundreds of times/min

Post-fix (same binary, same container, idle):

  • 4 requests/min, 0% timeout=0, 30.012–30.033s spacing (= long-poll held)
  • Palpo CPU 0.05%

180× fewer requests, ≈ 110× drop in timeout=0 share. Full breakdown in the previous comment.

The bug window is State::Running && !is_fully_loaded. robrix2's sliding_sync config + Octos appservice injection keeps that window open much longer than a vanilla setup — which plausibly explains why it's not observable in project-robius/robrix. The underlying SDK code path is the same regardless.


Successfully merging this pull request may close: Track idle sliding-sync request loop after initial sync (#30)