[WIP, AIT-569] Investigation of push registration failures#2204
Draft
lawrence-forooghian wants to merge 34 commits intoAIT-569-keychain-investigation-groundworkfrom
Draft
Conversation
Document two possible approaches for handling legacy data where the deviceIdentityToken may not match the current device id: Direction A: validate the token with a GET, then re-register if rejected. Currently written into the spec (RSH3i/RSH3j). The spec has a TODO in RSH3j2a for a possible improvement: using a PATCH with the deviceSecret to preserve the registration rather than discarding it. Direction B: skip validation, discard the token, and go through the normal registration flow. The POST is an upsert on the server (confirmed in realtime code) so the existing registration is preserved. Simpler but relies on undocumented server behaviour. Also adds context on how much we care about preserving the registration — devices that have been through the Keychain bug already have orphaned registrations, but many devices may never have been affected. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add analysis of RSH3a2a — the existing "validation" mechanism in the spec. It appears to be unreachable through normal state machine transitions (all paths into NotActivated clear the token), and its failure path loops on 401 errors. Add direction C: hook into RSH3a2a by keeping the token and starting in NotActivated. Would fix the token-mismatch loop for all cases (not just legacy migration) by modifying the sync failure path to discard the token on 401 and fall through to fresh registration. But loses context about why the 401 happened, making it impossible to distinguish legacy migration recovery from other auth failures. Note that directions B and C are not yet written into the spec (only direction A is specced). Update recommendation to reflect that we haven't reached a decision between the three. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Background research on which push activation states are persistent vs non-persistent across ably-cocoa, ably-java, and ably-js. Traces the history of these decisions, the motivation (ably-java#546), the #966 bug caused by non-persistent states, and the unmerged attempt to fix it (0e92186). Out of scope for the current push registration failure investigation but captured for future reference, particularly for informing ably-swift's push implementation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add comprehensive inventory of everything ably-cocoa persists, with storage mechanisms and consequences of loss for each item - Note Apple's APNS token caching guidance and how RSH8i addresses it (see specification#25) - Add direction C (hook into RSH3a2a) as an alternative approach - Add analysis of RSH3a2a's purpose, reachability, and whether we can use it for recovery - Clarify that directions B and C are not yet written into the spec - Update direction A/B comparison to be more balanced Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add section clarifying two implicit assumptions in the spec: 1. LocalDevice is treated as a single atomic blob — the spec doesn't anticipate split storage (Keychain + NSUserDefaults). RSH8a2 acknowledges this but we haven't proposed how ably-cocoa would achieve atomicity going forward. 2. RSH3h1's failure recovery is a safety net, not a routine code path. If it fires frequently (due to sometimes-unavailable storage), orphaned registrations accumulate with no cleanup mechanism. Paddy's comment on #1109 confirms always-available storage was the intended approach. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Discuss whether the spec's atomic unit should be just the (id, secret, token) tuple (as proposed in RSH8a2) or the entire set of persisted data including state machine state. The (id, secret, token) tuple is the only group where atomicity is critical for correctness. Other items (clientId, APNS tokens, state machine state) are self-correcting or have fallbacks. But the state machine state is logically part of the same unit, and storing everything as one blob in ably-cocoa would be simplest. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Covers: - Test app requirements (display device state, debug tampering) - How to create the stale token state directly (debug button) - How to reproduce the actual Keychain failure (reboot + silent push before unlock) - Migration testing (old SDK → new SDK, with and without stale token, with and without Keychain availability) - Verification criteria for all cases These test plans have not been verified and are initial proposals to ensure we have a testing strategy. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Create two Ably instances: mainAbly (the client under test) and eventLoggingAbly (for publishing events to the LocalDeviceStorageBugTest-events channel). Wire up a custom ARTLog handler on mainAbly that publishes all log messages to the events channel via eventLoggingAbly. Also add a Secrets.example.swift template and .gitignore to keep API keys out of source control, and a README describing the app. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Set up PushKit to register for VoIP push tokens, publishing them to the events channel as voipToken events. Add a minimal CallKit handler (CXProvider/CXProviderDelegate) to satisfy the iOS requirement that VoIP pushes must report an incoming call. Add a shell script (send-voip-push.sh) that fetches the latest VoIP token from the events channel via the Ably CLI and sends a push to APNs. Also configure the Xcode project with the required capabilities: push notification entitlement, VoIP background mode, and usage description. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Introduce a Codable Event enum that unifies all events published to the events channel, with documented associated-value structs for each case. Custom Codable conformance avoids the _0 wrapper that Swift's default enum synthesis produces. Add an ARTRealtimeChannel extension in a separate file for publishing Event values directly. Update all call sites and the send-voip-push script to use the new event names and JSON structure. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add buttons to activate push and subscribe to a push channel, with result display. Each action publishes attempt/result event pairs to the events channel, linked by an attempt ID and tagged with a reason (currently userTappedButton, extensible to automatic triggers). Add CodableErrorInfo to capture the full ARTErrorInfo in event payloads (code, statusCode, message, reason, href, requestId, cause). Add an AppDelegate to forward APNs device tokens to ARTPush, as required by the Ably push activation flow. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add a settings section with toggles for auto-activating push and auto-subscribing to the push channel on app launch. When both are enabled, activation runs first and subscription follows on success. Settings are stored in a JSON file with FileProtectionType.none so they remain readable before first unlock — necessary because a VoIP push can launch the app while the device is still locked. Add appLaunch as an ActionReason to distinguish automatic actions from user-initiated ones. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add a setup step to enable message persistence on the events channel, since the default 2-minute retention is too short for the send script to find the VoIP token. Also increase the history limit in send-voip-push.sh from 100 to 1000. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Generate a UUID on first launch and persist it to an unprotected file (installation-id.txt). Include it in every event alongside appLaunchID, so events can be correlated across launches of the same installation. The ID does not survive reinstallation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Publish an appLaunched event before any other setup, capturing whether protected data was available at launch time. This is key for identifying launches that occur before first unlock (e.g. from a VoIP push). Also observe protectedDataDidBecomeAvailable and protectedDataWillBecomeUnavailable notifications, publishing a protectedDataAvailability event on each subsequent change. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add CodableLocalDevice (and CodablePushDetails, CodableIdentityTokenDetails) to capture the full state of ARTLocalDevice. Include it in the pushActivateResult event so that changes to device details (e.g. ID, secret) can be detected when the SDK is unable to load persisted data — as happens when the app is launched before first unlock. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use appInstallation-<UUID> as the clientId so that multiple device registrations from the same installation are easy to identify — this is the failure mode under investigation where the device ID gets unnecessarily recreated. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Read the file protection attribute of the NSUserDefaults plist at startup and include it in the appLaunched event. This tells us whether the file that ARTLocalDeviceStorage uses to persist device details is accessible before first unlock. The plist path (Library/Preferences/<bundle-id>.plist) is an implementation detail of NSUserDefaults. The file may not exist on a fresh install before any defaults have been written. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Include a full dump of UserDefaults.standard in both the appLaunched and pushActivateResult events, so that we can compare before and after to see whether the SDK wrote new values during activation — even if the file was previously unavailable (before first unlock). Non-JSON-serialisable values (e.g. Data) are sanitised to string representations. The dump is inlined as a dictionary in the Ably message payload rather than a JSON string. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Forthcoming work on this branch needs a test-only Ably API that is only exposed through the `Ably.Private` submodule, so import it here. Importing `Ably.Private` exposes additional designated initialisers on `ARTLog` (notably `-initCapturingOutput:` and `-initCapturingOutput:historyLines:`, declared in `ARTLog+Private.h`), and that has a knock-on effect on the `EventLoggingLogHandler: ARTLog` subclass defined in this file. Without any action, the app now fatally errors at launch with "Use of unimplemented initializer 'init(capturingOutput:)'". Work around it by routing `super.init(...)` through the terminal 3-arg initialiser rather than `-init`, which avoids the self-dispatch that triggers the Swift-synthesised trap. The comment in the code flags that the reasoning is Claude's speculation and should be verified if it ever becomes load-bearing.
Use the private test-only options added in 571c79ae and 4015f8e6: - Set `disableLocalDevice = true` on the event-logging client so it doesn't become the owner of the shared `ARTLocalDevice` and the first client to access `rest.device_nosync` (the main client) ends up bound to the shared device instead, keeping its device-storage activity attributed to its own logger. - Set `logLocalDeviceStorageValues = true` on the main client so that storage read/write log lines include the persisted values rather than the `(retracted)` placeholder. This is what we're here to investigate.
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Point to the LocalDeviceStorageBugTest app as the current test approach. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
c36916f to
d25a26a
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Note: This PR is based on top of #2203, which provides groundwork for the demo app contained in this PR.
This contains a WIP investigation of https://ably.atlassian.net/browse/AIT-569. I have not yet reproduced the customer's issue and am awaiting further information. However, the investigation document and corresponding specification changes do contain some things that we need to fix no matter what.
The two main things contained in this PR are:
Examples/LocalDeviceStorageBugTest), which allows us to explore how the SDK behaves under before-first-unlock data inaccessibility; see its README for information on how to use itThere are some corresponding specification changes that I've drafted; see ably/specification#450. Some of these may be valuable, some may not, because until I actually reproduce the issue some of the things it's trying to address may just be speculative and not actually in need of "fixing".