[WIP, AIT-569] Investigation of push registration failures #2204

Draft
lawrence-forooghian wants to merge 34 commits into AIT-569-keychain-investigation-groundwork from AIT-569-investigating-push-registration-failures

Conversation

@lawrence-forooghian
Collaborator

@lawrence-forooghian lawrence-forooghian commented Apr 17, 2026

Note: This PR is based on top of #2203, which provides groundwork for the demo app contained in this PR.

This contains a WIP investigation of https://ably.atlassian.net/browse/AIT-569. I have not yet reproduced the customer's issue and am awaiting further information. However, the investigation document and corresponding specification changes do contain some things that we need to fix no matter what.

The two main things contained in this PR are:

  • an investigation document that records the results of my conversations with Claude and my own thinking about what could have caused this issue. Take it with a pinch of salt: I have not yet read through it in its entirety and have been letting Claude update it as it goes along
  • a test harness app (Examples/LocalDeviceStorageBugTest), which allows us to explore how the SDK behaves under before-first-unlock data inaccessibility; see its README for information on how to use it

I've drafted some corresponding specification changes; see ably/specification#450. Some of these may be valuable and some may not: until I actually reproduce the issue, some of the problems they try to address are speculative and may not need fixing.

lawrence-forooghian and others added 30 commits April 17, 2026 16:37
Document two possible approaches for handling legacy data where
the deviceIdentityToken may not match the current device id:

Direction A: validate the token with a GET, then re-register if
rejected. Currently written into the spec (RSH3i/RSH3j). The spec
has a TODO in RSH3j2a for a possible improvement: using a PATCH
with the deviceSecret to preserve the registration rather than
discarding it.

Direction B: skip validation, discard the token, and go through
the normal registration flow. The POST is an upsert on the server
(confirmed in realtime code) so the existing registration is
preserved. Simpler but relies on undocumented server behaviour.

Also adds context on how much we care about preserving the
registration — devices that have been through the Keychain bug
already have orphaned registrations, but many devices may never
have been affected.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add analysis of RSH3a2a — the existing "validation" mechanism in
the spec. It appears to be unreachable through normal state machine
transitions (all paths into NotActivated clear the token), and its
failure path loops on 401 errors.

Add direction C: hook into RSH3a2a by keeping the token and
starting in NotActivated. Would fix the token-mismatch loop for
all cases (not just legacy migration) by modifying the sync
failure path to discard the token on 401 and fall through to fresh
registration. But loses context about why the 401 happened, making
it impossible to distinguish legacy migration recovery from other
auth failures.

Note that directions B and C are not yet written into the spec
(only direction A is specced). Update recommendation to reflect
that we haven't reached a decision between the three.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Background research on which push activation states are persistent
vs non-persistent across ably-cocoa, ably-java, and ably-js. Traces
the history of these decisions, the motivation (ably-java#546), the
#966 bug caused by non-persistent states, and the unmerged attempt
to fix it (0e92186).

Out of scope for the current push registration failure
investigation but captured for future reference, particularly for
informing ably-swift's push implementation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add comprehensive inventory of everything ably-cocoa persists,
  with storage mechanisms and consequences of loss for each item
- Note Apple's APNS token caching guidance and how RSH8i addresses
  it (see specification#25)
- Add direction C (hook into RSH3a2a) as an alternative approach
- Add analysis of RSH3a2a's purpose, reachability, and whether we
  can use it for recovery
- Clarify that directions B and C are not yet written into the spec
- Update direction A/B comparison to be more balanced

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add section clarifying two implicit assumptions in the spec:

1. LocalDevice is treated as a single atomic blob — the spec
   doesn't anticipate split storage (Keychain + NSUserDefaults).
   RSH8a2 acknowledges this but we haven't proposed how ably-cocoa
   would achieve atomicity going forward.

2. RSH3h1's failure recovery is a safety net, not a routine code
   path. If it fires frequently (due to sometimes-unavailable
   storage), orphaned registrations accumulate with no cleanup
   mechanism. Paddy's comment on #1109 confirms always-available
   storage was the intended approach.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Discuss whether the spec's atomic unit should be just the (id,
secret, token) tuple (as proposed in RSH8a2) or the entire set
of persisted data including state machine state.

The (id, secret, token) tuple is the only group where atomicity
is critical for correctness. Other items (clientId, APNS tokens,
state machine state) are self-correcting or have fallbacks. But
the state machine state is logically part of the same unit, and
storing everything as one blob in ably-cocoa would be simplest.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Covers:
- Test app requirements (display device state, debug tampering)
- How to create the stale token state directly (debug button)
- How to reproduce the actual Keychain failure (reboot + silent
  push before unlock)
- Migration testing (old SDK → new SDK, with and without stale
  token, with and without Keychain availability)
- Verification criteria for all cases

These test plans have not been verified and are initial proposals
to ensure we have a testing strategy.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Create two Ably instances: mainAbly (the client under test) and
eventLoggingAbly (for publishing events to the
LocalDeviceStorageBugTest-events channel). Wire up a custom ARTLog
handler on mainAbly that publishes all log messages to the events
channel via eventLoggingAbly.

Also add a Secrets.example.swift template and .gitignore to keep API
keys out of source control, and a README describing the app.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Set up PushKit to register for VoIP push tokens, publishing them to
the events channel as voipToken events. Add a minimal CallKit handler
(CXProvider/CXProviderDelegate) to satisfy the iOS requirement that
VoIP pushes must report an incoming call.

Add a shell script (send-voip-push.sh) that fetches the latest VoIP
token from the events channel via the Ably CLI and sends a push to
APNs.

Also configure the Xcode project with the required capabilities: push
notification entitlement, VoIP background mode, and usage description.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Introduce a Codable Event enum that unifies all events published to
the events channel, with documented associated-value structs for each
case. Custom Codable conformance avoids the _0 wrapper that Swift's
default enum synthesis produces.

Add an ARTRealtimeChannel extension in a separate file for publishing
Event values directly. Update all call sites and the send-voip-push
script to use the new event names and JSON structure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
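
To illustrate the custom Codable conformance described in the commit above, here is a minimal sketch. The payload structs, their fields, and the `name`/`data` wire keys are assumptions made for illustration; the test app's actual event structure is not shown in this conversation. With synthesised conformance, a case such as `voipToken(VoIPTokenPayload)` would encode as `{"voipToken":{"_0":{...}}}`; writing the conformance by hand lets the payload sit directly under a chosen key.

```swift
import Foundation

// Hypothetical payload structs; the real app's fields may differ.
struct VoIPTokenPayload: Codable, Equatable {
    let token: String
}

struct AppLaunchedPayload: Codable, Equatable {
    let protectedDataAvailable: Bool
}

enum Event: Equatable {
    case voipToken(VoIPTokenPayload)
    case appLaunched(AppLaunchedPayload)
}

extension Event: Codable {
    // Assumed wire format: {"name": "<case name>", "data": {<payload>}}
    private enum CodingKeys: String, CodingKey {
        case name
        case data
    }

    private enum Name: String, Codable {
        case voipToken
        case appLaunched
    }

    init(from decoder: Decoder) throws {
        let container = try decoder.container(keyedBy: CodingKeys.self)
        // Read the discriminator first, then decode the matching payload type.
        switch try container.decode(Name.self, forKey: .name) {
        case .voipToken:
            self = .voipToken(try container.decode(VoIPTokenPayload.self, forKey: .data))
        case .appLaunched:
            self = .appLaunched(try container.decode(AppLaunchedPayload.self, forKey: .data))
        }
    }

    func encode(to encoder: Encoder) throws {
        var container = encoder.container(keyedBy: CodingKeys.self)
        // Write the payload directly under "data", with no "_0" wrapper.
        switch self {
        case .voipToken(let payload):
            try container.encode(Name.voipToken, forKey: .name)
            try container.encode(payload, forKey: .data)
        case .appLaunched(let payload):
            try container.encode(Name.appLaunched, forKey: .name)
            try container.encode(payload, forKey: .data)
        }
    }
}
```

Under these assumptions, `Event.voipToken(VoIPTokenPayload(token: "abc"))` encodes to a JSON object whose `name` is `"voipToken"` and whose `data` object holds the `token` field, which is a shape that a shell script can pick apart without knowing about Swift's enum synthesis.
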
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add buttons to activate push and subscribe to a push channel, with
result display. Each action publishes attempt/result event pairs to
the events channel, linked by an attempt ID and tagged with a reason
(currently userTappedButton, extensible to automatic triggers).

Add CodableErrorInfo to capture the full ARTErrorInfo in event
payloads (code, statusCode, message, reason, href, requestId, cause).

Add an AppDelegate to forward APNs device tokens to ARTPush, as
required by the Ably push activation flow.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add a settings section with toggles for auto-activating push and
auto-subscribing to the push channel on app launch. When both are
enabled, activation runs first and subscription follows on success.

Settings are stored in a JSON file with FileProtectionType.none so
they remain readable before first unlock — necessary because a VoIP
push can launch the app while the device is still locked.

Add appLaunch as an ActionReason to distinguish automatic actions
from user-initiated ones.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add a setup step to enable message persistence on the events channel,
since the default 2-minute retention is too short for the send script
to find the VoIP token. Also increase the history limit in
send-voip-push.sh from 100 to 1000.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Generate a UUID on first launch and persist it to an unprotected file
(installation-id.txt). Include it in every event alongside
appLaunchID, so events can be correlated across launches of the same
installation. The ID does not survive reinstallation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
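
The installation ID behaviour described in the commit above could look roughly like the following sketch. The function name and the choice to propagate read errors are assumptions, not taken from the test app. The `.noFileProtection` write option is applied only when building for iOS, where it keeps the file readable before first unlock.

```swift
import Foundation

/// Returns the installation ID stored at `fileURL`, generating and persisting
/// a new UUID if the file does not exist yet. (Hypothetical helper.)
func loadOrCreateInstallationID(at fileURL: URL) throws -> String {
    if FileManager.default.fileExists(atPath: fileURL.path) {
        // The file exists, so propagate any read error instead of silently
        // generating a new ID; a regenerated ID would hide the very kind of
        // storage inaccessibility that is being investigated.
        let contents = try String(contentsOf: fileURL, encoding: .utf8)
        return contents.trimmingCharacters(in: .whitespacesAndNewlines)
    }

    let newID = UUID().uuidString
    var options: Data.WritingOptions = [.atomic]
    #if os(iOS)
    // Leave the file unprotected so it stays readable before first unlock.
    options.insert(.noFileProtection)
    #endif
    try Data(newID.utf8).write(to: fileURL, options: options)
    return newID
}
```

Calling the function twice with the same URL returns the same ID; deleting the file (for example, by reinstalling the app) produces a new one.
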
Publish an appLaunched event before any other setup, capturing whether
protected data was available at launch time. This is key for
identifying launches that occur before first unlock (e.g. from a VoIP
push).

Also observe protectedDataDidBecomeAvailable and
protectedDataWillBecomeUnavailable notifications, publishing a
protectedDataAvailability event on each subsequent change.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add CodableLocalDevice (and CodablePushDetails,
CodableIdentityTokenDetails) to capture the full state of
ARTLocalDevice. Include it in the pushActivateResult event so that
changes to device details (e.g. ID, secret) can be detected when the
SDK is unable to load persisted data — as happens when the app is
launched before first unlock.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use appInstallation-<UUID> as the clientId so that multiple device
registrations from the same installation are easy to identify — this
is the failure mode under investigation where the device ID gets
unnecessarily recreated.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Read the file protection attribute of the NSUserDefaults plist at
startup and include it in the appLaunched event. This tells us
whether the file that ARTLocalDeviceStorage uses to persist device
details is accessible before first unlock.

The plist path (Library/Preferences/<bundle-id>.plist) is an
implementation detail of NSUserDefaults. The file may not exist on a
fresh install before any defaults have been written.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Include a full dump of UserDefaults.standard in both the appLaunched
and pushActivateResult events, so that we can compare before and after
to see whether the SDK wrote new values during activation — even if
the file was previously unavailable (before first unlock).

Non-JSON-serialisable values (e.g. Data) are sanitised to string
representations. The dump is inlined as a dictionary in the Ably
message payload rather than a JSON string.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
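
As an illustration of the sanitisation step described in the commit above, here is a minimal sketch of a recursive pass over a defaults dump. The function name and the choice of Base64 for `Data` are assumptions; the commit only says that such values become "string representations". The sketch does not handle non-finite floating-point numbers, which JSON also rejects.

```swift
import Foundation

/// Recursively replaces values that JSONSerialization cannot encode
/// with string representations. (Hypothetical helper.)
func sanitiseForJSON(_ value: Any) -> Any {
    switch value {
    case let dictionary as [String: Any]:
        return dictionary.mapValues(sanitiseForJSON)
    case let array as [Any]:
        return array.map(sanitiseForJSON)
    case let data as Data:
        // Assumption: Base64 is used as the string form of binary data.
        return data.base64EncodedString()
    case is String, is Bool, is Int, is Double, is NSNumber, is NSNull:
        // Already representable in JSON.
        return value
    default:
        // Dates and anything else fall back to their textual description.
        return String(describing: value)
    }
}
```

In an app, the input would typically be `UserDefaults.standard.dictionaryRepresentation()`, and the returned dictionary can be placed directly in a message payload rather than being pre-encoded as a JSON string.
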
Forthcoming work on this branch needs a test-only Ably API that is
only exposed through the `Ably.Private` submodule, so import it here.

Importing `Ably.Private` exposes additional designated initialisers on
`ARTLog` (notably `-initCapturingOutput:` and
`-initCapturingOutput:historyLines:`, declared in
`ARTLog+Private.h`), and that has a knock-on effect on the
`EventLoggingLogHandler: ARTLog` subclass defined in this file.
Without any action, the app now fatally errors at launch with "Use of
unimplemented initializer 'init(capturingOutput:)'".

Work around it by routing `super.init(...)` through the terminal 3-arg
initialiser rather than `-init`, which avoids the self-dispatch that
triggers the Swift-synthesised trap. The comment in the code flags
that the reasoning is Claude's speculation and should be verified if
it ever becomes load-bearing.
Use the private test-only options added in 571c79ae and 4015f8e6:

- Set `disableLocalDevice = true` on the event-logging client so it
  doesn't become the owner of the shared `ARTLocalDevice` and the
  first client to access `rest.device_nosync` (the main client)
  ends up bound to the shared device instead, keeping its
  device-storage activity attributed to its own logger.
- Set `logLocalDeviceStorageValues = true` on the main client so that
  storage read/write log lines include the persisted values rather
  than the `(retracted)` placeholder. This is what we're here to
  investigate.
@coderabbitai

coderabbitai bot commented Apr 17, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7cab16f5-8385-49d1-b240-95cb19947c92

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


@lawrence-forooghian lawrence-forooghian changed the base branch from main to AIT-569-keychain-investigation-groundwork April 17, 2026 19:40
@github-actions github-actions bot temporarily deployed to staging/pull/2204/features April 17, 2026 19:40 Inactive
@github-actions github-actions bot temporarily deployed to staging/pull/2204/jazzydoc April 17, 2026 19:44 Inactive
@github-actions github-actions bot temporarily deployed to staging/pull/2204/markdown-api-reference April 17, 2026 19:44 Inactive
@lawrence-forooghian lawrence-forooghian changed the title from "[AIT-569] Investigation of push registration failures" to "[WIP, AIT-569] Investigation of push registration failures" Apr 17, 2026
@github-actions github-actions bot temporarily deployed to staging/pull/2204/features April 17, 2026 20:16 Inactive
Point to the LocalDeviceStorageBugTest app as the current test
approach.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>