[WIP, AIT-569] Investigation of push registration failures by lawrence-forooghian · Pull Request #2204 · ably/ably-cocoa

lawrence-forooghian · 2026-04-17T19:39:47Z

Note: This PR is based on top of #2203, which provides groundwork for the demo app contained in this PR.

This contains a WIP investigation of https://ably.atlassian.net/browse/AIT-569. I have not yet reproduced the customer's issue and am awaiting further information. However, the investigation document and corresponding specification changes do contain some things that we need to fix no matter what.

The two main things contained in this PR are:

- An investigation document that records conversations I had with Claude and my own thinking about what could have caused this issue. Take it with a pinch of salt: I have not yet read through it in its entirety and have just been letting Claude update it as it goes along.
- A test harness app (Examples/LocalDeviceStorageBugTest) that lets us explore how the SDK behaves under before-first-unlock data inaccessibility. See its README for information on how to use it.

There are some corresponding specification changes that I've drafted; see ably/specification#450. Some of these may be valuable and some may not: until I actually reproduce the issue, some of the things they try to address may be speculative and not actually in need of "fixing".

Document two possible approaches for handling legacy data where the deviceIdentityToken may not match the current device id:

- Direction A: validate the token with a GET, then re-register if rejected. Currently written into the spec (RSH3i/RSH3j). The spec has a TODO in RSH3j2a for a possible improvement: using a PATCH with the deviceSecret to preserve the registration rather than discarding it.
- Direction B: skip validation, discard the token, and go through the normal registration flow. The POST is an upsert on the server (confirmed in realtime code) so the existing registration is preserved. Simpler but relies on undocumented server behaviour.

Also adds context on how much we care about preserving the registration: devices that have been through the Keychain bug already have orphaned registrations, but many devices may never have been affected.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add analysis of RSH3a2a, the existing "validation" mechanism in the spec. It appears to be unreachable through normal state machine transitions (all paths into NotActivated clear the token), and its failure path loops on 401 errors.

Add direction C: hook into RSH3a2a by keeping the token and starting in NotActivated. Would fix the token-mismatch loop for all cases (not just legacy migration) by modifying the sync failure path to discard the token on 401 and fall through to fresh registration. But loses context about why the 401 happened, making it impossible to distinguish legacy migration recovery from other auth failures.

Note that directions B and C are not yet written into the spec (only direction A is specced). Update recommendation to reflect that we haven't reached a decision between the three.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Background research on which push activation states are persistent vs non-persistent across ably-cocoa, ably-java, and ably-js. Traces the history of these decisions, the motivation (ably-java#546), the #966 bug caused by non-persistent states, and the unmerged attempt to fix it (0e92186). Out of scope for the current push registration failure investigation but captured for future reference, particularly for informing ably-swift's push implementation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Add comprehensive inventory of everything ably-cocoa persists, with storage mechanisms and consequences of loss for each item
- Note Apple's APNS token caching guidance and how RSH8i addresses it (see specification#25)
- Add direction C (hook into RSH3a2a) as an alternative approach
- Add analysis of RSH3a2a's purpose, reachability, and whether we can use it for recovery
- Clarify that directions B and C are not yet written into the spec
- Update direction A/B comparison to be more balanced

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add section clarifying two implicit assumptions in the spec:

1. LocalDevice is treated as a single atomic blob: the spec doesn't anticipate split storage (Keychain + NSUserDefaults). RSH8a2 acknowledges this but we haven't proposed how ably-cocoa would achieve atomicity going forward.
2. RSH3h1's failure recovery is a safety net, not a routine code path. If it fires frequently (due to sometimes-unavailable storage), orphaned registrations accumulate with no cleanup mechanism. Paddy's comment on #1109 confirms always-available storage was the intended approach.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Discuss whether the spec's atomic unit should be just the (id, secret, token) tuple (as proposed in RSH8a2) or the entire set of persisted data including state machine state. The (id, secret, token) tuple is the only group where atomicity is critical for correctness. Other items (clientId, APNS tokens, state machine state) are self-correcting or have fallbacks. But the state machine state is logically part of the same unit, and storing everything as one blob in ably-cocoa would be simplest. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Covers:

- Test app requirements (display device state, debug tampering)
- How to create the stale token state directly (debug button)
- How to reproduce the actual Keychain failure (reboot + silent push before unlock)
- Migration testing (old SDK → new SDK, with and without stale token, with and without Keychain availability)
- Verification criteria for all cases

These test plans have not been verified and are initial proposals to ensure we have a testing strategy.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Create two Ably instances: mainAbly (the client under test) and eventLoggingAbly (for publishing events to the LocalDeviceStorageBugTest-events channel). Wire up a custom ARTLog handler on mainAbly that publishes all log messages to the events channel via eventLoggingAbly. Also add a Secrets.example.swift template and .gitignore to keep API keys out of source control, and a README describing the app. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Set up PushKit to register for VoIP push tokens, publishing them to the events channel as voipToken events. Add a minimal CallKit handler (CXProvider/CXProviderDelegate) to satisfy the iOS requirement that VoIP pushes must report an incoming call. Add a shell script (send-voip-push.sh) that fetches the latest VoIP token from the events channel via the Ably CLI and sends a push to APNs. Also configure the Xcode project with the required capabilities: push notification entitlement, VoIP background mode, and usage description. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
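A minimal sketch of the PushKit and CallKit wiring this commit describes, assuming iOS 14 or later. The `VoIPPushHandler` class, the `onToken` hook, and the `hexString(from:)` helper are hypothetical names for illustration; the app in this PR publishes a voipToken event rather than calling a closure, and its actual handler may differ.

```swift
import CallKit
import PushKit

/// Hex-encodes a push token so that it can be published as a string.
func hexString(from token: Data) -> String {
    token.map { String(format: "%02x", $0) }.joined()
}

/// Hypothetical handler: registers for VoIP pushes and reports a call for each one.
final class VoIPPushHandler: NSObject, PKPushRegistryDelegate, CXProviderDelegate {
    private let registry = PKPushRegistry(queue: .main)
    private let provider = CXProvider(configuration: CXProviderConfiguration())

    /// Hypothetical hook standing in for "publish a voipToken event".
    var onToken: ((String) -> Void)?

    override init() {
        super.init()
        provider.setDelegate(self, queue: nil)
        registry.delegate = self
        registry.desiredPushTypes = [.voIP]
    }

    func pushRegistry(_ registry: PKPushRegistry,
                      didUpdate pushCredentials: PKPushCredentials,
                      for type: PKPushType) {
        guard type == .voIP else { return }
        onToken?(hexString(from: pushCredentials.token))
    }

    func pushRegistry(_ registry: PKPushRegistry,
                      didReceiveIncomingPushWith payload: PKPushPayload,
                      for type: PKPushType,
                      completion: @escaping () -> Void) {
        // iOS expects every VoIP push to be reported to CallKit as an incoming call.
        let update = CXCallUpdate()
        update.remoteHandle = CXHandle(type: .generic, value: "LocalDeviceStorageBugTest")
        provider.reportNewIncomingCall(with: UUID(), update: update) { _ in
            completion()
        }
    }

    // The only required CXProviderDelegate method; nothing to reset in this sketch.
    func providerDidReset(_ provider: CXProvider) {}
}
```

The handler must be kept alive for as long as the app should receive pushes, for example by storing it on the app delegate.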

Introduce a Codable Event enum that unifies all events published to the events channel, with documented associated-value structs for each case. Custom Codable conformance avoids the _0 wrapper that Swift's default enum synthesis produces. Add an ARTRealtimeChannel extension in a separate file for publishing Event values directly. Update all call sites and the send-voip-push script to use the new event names and JSON structure. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
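To illustrate the kind of custom Codable conformance this commit refers to, here is a sketch that encodes each case as a flat name/data pair instead of the `_0` wrapper. The case names voipToken and ablyLog come from this PR; the payload structs, their fields, and the `name`/`data` keys are assumptions and may not match the app's real JSON structure.

```swift
import Foundation

// Hypothetical payload structs; the real associated-value types may differ.
struct VoIPTokenEvent: Codable, Equatable { let token: String }
struct AblyLogEvent: Codable, Equatable { let level: String; let message: String }

enum Event: Codable, Equatable {
    case voipToken(VoIPTokenEvent)
    case ablyLog(AblyLogEvent)

    // Assumed wire format: {"name": "<case>", "data": { ...payload... }}
    private enum CodingKeys: String, CodingKey { case name, data }
    private enum Name: String, Codable { case voipToken, ablyLog }

    init(from decoder: Decoder) throws {
        let container = try decoder.container(keyedBy: CodingKeys.self)
        switch try container.decode(Name.self, forKey: .name) {
        case .voipToken:
            self = .voipToken(try container.decode(VoIPTokenEvent.self, forKey: .data))
        case .ablyLog:
            self = .ablyLog(try container.decode(AblyLogEvent.self, forKey: .data))
        }
    }

    func encode(to encoder: Encoder) throws {
        var container = encoder.container(keyedBy: CodingKeys.self)
        switch self {
        case .voipToken(let payload):
            try container.encode(Name.voipToken, forKey: .name)
            try container.encode(payload, forKey: .data)
        case .ablyLog(let payload):
            try container.encode(Name.ablyLog, forKey: .name)
            try container.encode(payload, forKey: .data)
        }
    }
}
```

With the synthesised conformance, the payload would instead be nested under a key named after the case and then under `_0`, which is awkward to consume from a shell script.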

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add buttons to activate push and subscribe to a push channel, with result display. Each action publishes attempt/result event pairs to the events channel, linked by an attempt ID and tagged with a reason (currently userTappedButton, extensible to automatic triggers). Add CodableErrorInfo to capture the full ARTErrorInfo in event payloads (code, statusCode, message, reason, href, requestId, cause). Add an AppDelegate to forward APNs device tokens to ARTPush, as required by the Ably push activation flow. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add a settings section with toggles for auto-activating push and auto-subscribing to the push channel on app launch. When both are enabled, activation runs first and subscription follows on success. Settings are stored in a JSON file with FileProtectionType.none so they remain readable before first unlock — necessary because a VoIP push can launch the app while the device is still locked. Add appLaunch as an ActionReason to distinguish automatic actions from user-initiated ones. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
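The launch-time sequencing described here could look roughly like the sketch below. `LaunchSettings`, `runLaunchActions`, and the closure parameters are hypothetical names; the behaviour when only auto-subscribe is enabled is an assumption, since the commit message only specifies the case where both toggles are on.

```swift
import Foundation

// Hypothetical settings model; the app's real property names may differ.
struct LaunchSettings: Codable, Equatable {
    var autoActivatePush = false
    var autoSubscribeToPushChannel = false
}

/// Runs the automatic launch actions. `activate` reports success or failure
/// through its completion handler; `subscribe` starts the channel subscription.
func runLaunchActions(
    settings: LaunchSettings,
    activate: (@escaping (Bool) -> Void) -> Void,
    subscribe: @escaping () -> Void
) {
    if settings.autoActivatePush {
        activate { succeeded in
            // Subscription follows activation, and only if activation succeeded.
            if succeeded && settings.autoSubscribeToPushChannel {
                subscribe()
            }
        }
    } else if settings.autoSubscribeToPushChannel {
        // Assumption: with auto-activate off, subscribe straight away.
        subscribe()
    }
}
```

Passing the two actions as closures keeps the ordering logic independent of the SDK, so it can be exercised without a device.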

Add a setup step to enable message persistence on the events channel, since the default 2-minute retention is too short for the send script to find the VoIP token. Also increase the history limit in send-voip-push.sh from 100 to 1000. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Generate a UUID on first launch and persist it to an unprotected file (installation-id.txt). Include it in every event alongside appLaunchID, so events can be correlated across launches of the same installation. The ID does not survive reinstallation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
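As a rough sketch of this load-or-create pattern: the function name and the `directory` parameter are hypothetical, and the app's real implementation may differ. The file is written without file protection on iOS so that it stays readable before first unlock.

```swift
import Foundation

/// Returns the installation ID stored in `installation-id.txt` inside `directory`,
/// creating and persisting a new UUID if the file is missing or empty.
func loadOrCreateInstallationID(in directory: URL) throws -> String {
    let file = directory.appendingPathComponent("installation-id.txt")

    if let existing = try? String(contentsOf: file, encoding: .utf8), !existing.isEmpty {
        return existing
    }

    let id = UUID().uuidString
    var options: Data.WritingOptions = [.atomic]
    #if os(iOS)
    // Keep the file readable while the device is still locked after a reboot.
    options.insert(.noFileProtection)
    #endif
    try Data(id.utf8).write(to: file, options: options)
    return id
}
```

Because the file lives in the app's container, deleting the app removes it, which matches the note that the ID does not survive reinstallation.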

Publish an appLaunched event before any other setup, capturing whether protected data was available at launch time. This is key for identifying launches that occur before first unlock (e.g. from a VoIP push). Also observe protectedDataDidBecomeAvailable and protectedDataWillBecomeUnavailable notifications, publishing a protectedDataAvailability event on each subsequent change. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
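A sketch of how these availability changes can be observed using UIKit's notifications. `ProtectedDataMonitor` and its `onChange` callback are hypothetical; in the app the callback would publish a protectedDataAvailability event. The launch-time value would come from `UIApplication.shared.isProtectedDataAvailable`, read on the main thread.

```swift
import UIKit

/// Hypothetical helper that reports protected-data availability changes.
final class ProtectedDataMonitor {
    private let center: NotificationCenter
    private var tokens: [NSObjectProtocol] = []

    init(center: NotificationCenter = .default, onChange: @escaping (Bool) -> Void) {
        self.center = center
        let changes: [(Notification.Name, Bool)] = [
            (UIApplication.protectedDataDidBecomeAvailableNotification, true),
            (UIApplication.protectedDataWillBecomeUnavailableNotification, false),
        ]
        // queue: nil delivers the callback on whichever thread posts the notification.
        tokens = changes.map { name, available in
            center.addObserver(forName: name, object: nil, queue: nil) { _ in
                onChange(available)
            }
        }
    }

    deinit {
        tokens.forEach { center.removeObserver($0) }
    }
}
```

The monitor stops observing when it is deallocated, so it needs to be retained for the lifetime of the app.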

Add CodableLocalDevice (and CodablePushDetails, CodableIdentityTokenDetails) to capture the full state of ARTLocalDevice. Include it in the pushActivateResult event so that changes to device details (e.g. ID, secret) can be detected when the SDK is unable to load persisted data — as happens when the app is launched before first unlock. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Use appInstallation-<UUID> as the clientId so that multiple device registrations from the same installation are easy to identify — this is the failure mode under investigation where the device ID gets unnecessarily recreated. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Read the file protection attribute of the NSUserDefaults plist at startup and include it in the appLaunched event. This tells us whether the file that ARTLocalDeviceStorage uses to persist device details is accessible before first unlock. The plist path (Library/Preferences/<bundle-id>.plist) is an implementation detail of NSUserDefaults. The file may not exist on a fresh install before any defaults have been written. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
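Reading the protection attribute might look like the following sketch. Both function names are hypothetical, and the plist location is the implementation detail mentioned above rather than a documented path.

```swift
import Foundation

/// Returns the protection class of the file at `path`, or nil if the file does not
/// exist or carries no protection attribute.
func fileProtection(atPath path: String) -> FileProtectionType? {
    let attributes = try? FileManager.default.attributesOfItem(atPath: path)
    return attributes?[.protectionKey] as? FileProtectionType
}

/// Builds the assumed path Library/Preferences/<bundle-id>.plist for this app.
func userDefaultsPlistPath(bundleID: String) -> String? {
    FileManager.default
        .urls(for: .libraryDirectory, in: .userDomainMask)
        .first?
        .appendingPathComponent("Preferences")
        .appendingPathComponent("\(bundleID).plist")
        .path
}
```

A nil result is expected on a fresh install, since the plist may not exist until some defaults have been written.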

Include a full dump of UserDefaults.standard in both the appLaunched and pushActivateResult events, so that we can compare before and after to see whether the SDK wrote new values during activation — even if the file was previously unavailable (before first unlock). Non-JSON-serialisable values (e.g. Data) are sanitised to string representations. The dump is inlined as a dictionary in the Ably message payload rather than a JSON string. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
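One way to sanitise such a dump is sketched below. The `sanitise` function and the exact string forms it produces are assumptions; the app's own representations may differ.

```swift
import Foundation

/// Recursively converts a value into something JSONSerialization accepts,
/// replacing unsupported values with string representations.
func sanitise(_ value: Any) -> Any {
    switch value {
    case let data as Data:
        // Assumed representation: base64 with a marker prefix.
        return "<Data: \(data.base64EncodedString())>"
    case is String, is Bool, is Int, is Double, is NSNumber, is NSNull:
        return value
    case let dictionary as [String: Any]:
        return dictionary.mapValues(sanitise)
    case let array as [Any]:
        return array.map(sanitise)
    default:
        // Dates and anything else fall back to their description.
        return String(describing: value)
    }
}

/// Snapshot of the standard defaults, safe to inline into a message payload.
func userDefaultsDump() -> Any {
    sanitise(UserDefaults.standard.dictionaryRepresentation())
}
```

Returning a dictionary rather than a JSON string lets the dump be inlined in the message payload, as the commit describes.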

Forthcoming work on this branch needs a test-only Ably API that is only exposed through the `Ably.Private` submodule, so import it here. Importing `Ably.Private` exposes additional designated initialisers on `ARTLog` (notably `-initCapturingOutput:` and `-initCapturingOutput:historyLines:`, declared in `ARTLog+Private.h`), and that has a knock-on effect on the `EventLoggingLogHandler: ARTLog` subclass defined in this file. Without any action, the app now fatally errors at launch with "Use of unimplemented initializer 'init(capturingOutput:)'". Work around it by routing `super.init(...)` through the terminal 3-arg initialiser rather than `-init`, which avoids the self-dispatch that triggers the Swift-synthesised trap. The comment in the code flags that the reasoning is Claude's speculation and should be verified if it ever becomes load-bearing.

Use the private test-only options added in 571c79ae and 4015f8e6:

- Set `disableLocalDevice = true` on the event-logging client so it doesn't become the owner of the shared `ARTLocalDevice` and the first client to access `rest.device_nosync` (the main client) ends up bound to the shared device instead, keeping its device-storage activity attributed to its own logger.
- Set `logLocalDeviceStorageValues = true` on the main client so that storage read/write log lines include the persisted values rather than the `(retracted)` placeholder. This is what we're here to investigate.

coderabbitai · 2026-04-17T19:39:53Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


lawrence-forooghian and others added 30 commits April 17, 2026 16:37

notes on investigation

65a3411

Further notes

3ebfeba

Add thoughts on storage availability

21d5a99

Further thoughts from conversation with Claude

6ff2720

Add empty app for reproducing LocalDevice storage bug

e144f43

Add Ably to example app

153140e

Rename log event case to ablyLog

ed219b2

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Update README with VoIP push details and send script usage

04a189b

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Document push-test channel rule setup

43c797b

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Log installation and launch IDs in send-voip-push script

41efc2d

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Document known iPad reproduction issue

f000761

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

lawrence-forooghian added 3 commits April 17, 2026 16:37

Describe how before-first-unlock VoIP behaves a bit weirdly

87d6027

lawrence-forooghian changed the base branch from main to AIT-569-keychain-investigation-groundwork April 17, 2026 19:40

lawrence-forooghian mentioned this pull request Apr 17, 2026

[AIT-569] Groundwork for investigation of LocalDevice-related issues #2203

Open

github-actions bot temporarily deployed to staging/pull/2204/features April 17, 2026 19:40 Inactive

github-actions bot temporarily deployed to staging/pull/2204/jazzydoc April 17, 2026 19:44 Inactive

github-actions bot temporarily deployed to staging/pull/2204/markdown-api-reference April 17, 2026 19:44 Inactive

lawrence-forooghian changed the title from "[AIT-569] Investigation of push registration failures" to "[WIP, AIT-569] Investigation of push registration failures" Apr 17, 2026

lawrence-forooghian mentioned this pull request Apr 17, 2026

[AIT-569] Possible spec directions relating to ably-cocoa push registration failure ably/specification#450

Draft

github-actions bot temporarily deployed to staging/pull/2204/features April 17, 2026 20:16 Inactive

Mark testing plans as outdated in investigation doc

d25a26a

Point to the LocalDeviceStorageBugTest app as the current test approach. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

lawrence-forooghian force-pushed the AIT-569-investigating-push-registration-failures branch from c36916f to d25a26a Compare April 17, 2026 20:16

github-actions bot deployed to staging/pull/2204/features April 17, 2026 20:17 View deployment

github-actions bot deployed to staging/pull/2204/jazzydoc April 17, 2026 20:21 View deployment

github-actions bot deployed to staging/pull/2204/markdown-api-reference April 17, 2026 20:21 View deployment

lawrence-forooghian wants to merge 34 commits into AIT-569-keychain-investigation-groundwork from AIT-569-investigating-push-registration-failures
