Fix #5989, #6053, #6059, #6060, #6097, #6200: Rebuild app & WorkManager initialization#6062
Fix #5989, #6053, #6059, #6060, #6097, #6200: Rebuild app & WorkManager initialization#6062BenHenning wants to merge 31 commits intodevelopfrom
WorkManager initialization#6062Conversation
This includes substantial changes to how the app starts up in all cases, but especially for workers. The new approach is robust against worker changes (including class renames or input changes), centralizes intialization so that all workers always run in a sane environment, and ensures all app initialization is done in a completely robust and safe way. Also, re-write much of the work manager wiki page to include new design details and significant debugging recommendations. Testing, code, and wiki documentation are not yet complete. This also has only been tested on the developer build of the app, so more validation will be necessary.
|
Thanks for submitting this pull request! Some main reviewers have taken time off for the next few weeks, so it may take a little while before we can look at this PR. We appreciate your patience while some of our team members recharge. We'll be fully returning on 05 January 2026. |
|
Note that this is probably in a good enough place for me to work off of it to make progress on #3506 (which was my original goal heading into this), and I'll need to circle back to this PR. It still needs a lot of documentation and testing work, including figuring out hopefully a better testing strategy than using fakes (I'm hoping it's possible to actually properly orchestrate kicking off the jobs but I have a lot more digging to do in order to figure that out fully). |
|
Hi @BenHenning, I'm going to mark this PR as stale because it hasn't had any updates for 7 days. If no further activity occurs within 7 days, it will be automatically closed so that others can take up the issue. |
|
Hi @BenHenning, I'm going to mark this PR as stale because it hasn't had any updates for 7 days. If no further activity occurs within 7 days, it will be automatically closed so that others can take up the issue. |
Some previous code was missed when converting the jobs, and there were some API simplifications possible. Also, this ensures that platform parameter syncing intervals can't be tweaked to the point of essentially bricking app syncing.
This is needed for testing, though this will likely not be a permanent upgrade since it requires Kotlin 1.8. For proof-of-concept purposes this also includes ignoring the metadata check so that tests can run with 1.8 dependencies in the 1.6 environment.
These may not land in this PR but they were discovered as gaps when trying to make work manager synchronization work in tests.
This includes some visibility changes as well as alterations to TestPlatformParameterModule to ensure parameters can actually be initialized correctly with the work manager pathway. This commit essentially establishes the baseline for work manager job testing and will be used as the foundation for adding tests for all of the other work manager jobs.
This introduces new utilities to simplify testing with WorkManager and one of which (the mixin) actually provides tests with the ability to test multiple rounds of jobs with just time management by leveraging a custom scheduler (or, rather, a simple self-scheduler to simulate WorkManager).
Also refined some of the shared utilities including the fake performance logger.
Also a bunch more shared utility refinement, plus some integration tests for MetricLogSchedulingWorker in how it relates to LogUploadWorker.
Not much to add yet since downloading platform parameters isn't fully supported yet.
Still many more to go, but this is a good start.
Conflicts: domain/src/main/java/org/oppia/android/domain/oppialogger/analytics/ApplicationLifecycleObserver.kt domain/src/main/java/org/oppia/android/domain/oppialogger/analytics/LearnerAnalyticsLogger.kt
This makes the new test synching mechanism work correctly for UI tests (which use a different sync-first pathway).
This fixes the dispatcher logic to actually behave correctly. This will be moved outside of the PR so that it can be more iterated.
@Multibinds should also be exempted.
These are all managed through the central test utilities, now.
This isolates test WorkManager interactions to just a single utility now (instead of being across multiple tests and multiple utilities).
The new version is hopefully much more streamlined and resilient to user error.
BenHenning
left a comment
There was a problem hiding this comment.
Self-reviewed fixes.
PR should be up-to-date with develop again and seems to be in a pretty good shape. Just need to finish main PR description and it should be ready for full review (assuming CI passes now which I am hoping it will).
Coverage ReportResultsNumber of files assessed: 189 Passing coverageFiles with passing code coverage
Exempted coverageFiles exempted from coverage
|
|
Hi @BenHenning This refactor is really impressive — I can see how introducing the Bootstrap worker and centralizing scheduling resolves the failure modes I was exploring earlier around WorkManager’s KEEP policy and permanent sync stalls. It’s much clearer now how the new architecture avoids those dead states. I especially found the separation of scheduling concerns and the abstraction over WorkManager interactions quite elegant — it makes the system easier to reason about and extend. I had one question while going through the changes: with the new scheduling flow, are intermittent failures (like transient network issues) primarily handled within individual workers (via retry/backoff), or is there also a mechanism at the scheduler/Bootstrap level that ensures eventual recovery? Happy to dig deeper into related areas or help validate edge cases if that would be useful. Really enjoyed going through this — thanks for putting together such a thorough redesign! |
|
@akkicodes-dev the current implementation expects workers (that is, |
|
Got it, that clarifies things — thanks! I was thinking in terms of additional retry/backoff handling, but this makes sense given the periodic nature of the workers and existing flow. Appreciate the explanation! |
|
PTAL @adhiamboperes. This is now ready for full review (the PR description is finally finished!). |
|
Hi @BenHenning, I'm going to mark this PR as stale because it hasn't had any updates for 7 days. If no further activity occurs within 7 days, it will be automatically closed so that others can take up the issue. |
adhiamboperes
left a comment
There was a problem hiding this comment.
Thanks @BenHenning! PR LGTM mostly. PTAL at my comments.
| @Test | ||
| fun testFakeExceptionLogger_getMostRecentExceptions_two_noExceptionsLogged_returnsEmptyList() { | ||
| val mostRecentExceptions = fakeExceptionLogger.getMostRecentExceptions(count = 2) | ||
|
|
||
| assertThat(mostRecentExceptions).isEmpty() | ||
| } |
There was a problem hiding this comment.
It is not clear what this test is checking for.
| @Test | ||
| fun testFakeExceptionLogger_getMostRecentExceptions_two_twoExceptionsLogged_returnsInOrder() { | ||
| exceptionLogger.logException(exception2) | ||
| exceptionLogger.logException(exception1) | ||
|
|
||
| val mostRecentExceptions = fakeExceptionLogger.getMostRecentExceptions(count = 2) | ||
|
|
||
| val messages = mostRecentExceptions.map { it.message } | ||
| assertThat(messages).containsExactly("Second Exception", "First Exception").inOrder() | ||
| } | ||
|
|
||
| @Test | ||
| fun testFakeExceptionLogger_getMostRecentExceptions_one_twoExceptionsLogged_returnsLatest() { | ||
| exceptionLogger.logException(exception2) | ||
| exceptionLogger.logException(exception1) | ||
|
|
||
| val mostRecentExceptions = fakeExceptionLogger.getMostRecentExceptions(count = 1) | ||
|
|
||
| assertThat(mostRecentExceptions).hasSize(1) | ||
| assertThat(mostRecentExceptions.single()).hasMessageThat().isEqualTo("First Exception") | ||
| } |
There was a problem hiding this comment.
Nit on naming -- could these and adjacent tests be better as:
testFakeExceptionLogger_getMostRecentException_twoExceptionsLogged_returnsLatest
and
testFakeExceptionLogger_getTwoMostRecentExceptions_twoExceptionsLogged_returnsInOrder
| /** | ||
| * A general-purpose, all-in-one test utility when interacting with [WorkManager] in Oppia tests. | ||
| * | ||
| * This utility contains a bunch of critical setup and interaction pathways that tests needing to |
There was a problem hiding this comment.
This sentence seems a bit broken?
WorkManager initializationWorkManager initialization
Explanation
Fixes #5989
Fixes #6053
Fixes #6059
Fixes #6060
Fixes #6097
Fixes #6200
This PR fixes the app's
WorkManagerconfiguration by essentially rebuilding how it works on a fundamental level.Existing behaviors
Previously, the app attempted to provide a custom
WorkManagerConfiguration(viaAbstractOppiaApplication) which, perWorkManagerConfigurationModule, added a bunch of customWorkerFactorys per a providedDelegatingWorkerFactory. However, there are two problems with this setup (#6053 covers both of these):Configurationthe app must remove the custom initialization provider (https://developer.android.com/develop/background-work/background-tasks/persistent/configuration/custom-configuration). Technically this is being done in the app today, however it's removing all app initialization which can affect more than justWorkManager.exportedtofalsefor theprovider. This is technically redundant sinceWorkManageralready declares it as not exported, but the Android Linter is only reviewing the localAndroidManifest.xmland not the fully merged manifest so it's a reasonable way to silence the warning (and be a bit more explicit about safety).WorkerFactorys are not adhering toDelegatingWorkerFactory's contract. The delegating factory expects subsequent factories to returnnullif they do not match the request worker being created, but Oppia's workers always return non-null. This means that the first worker always wins which is in this case theMetricLogSchedulingWorker(hence why we didn't notice this easily before because the majority of analytics events are successfully being uploaded, and most performance metrics are captured via direct app callbacks). This also wasn't caught in testing because of tests directly validating the jobs and not their scheduled and periodic behaviors. [BUG]: PlatformParemeterSyncUpWorkerFactory and MetricLogSchedulingWorkerFactory are not being initialised correctly #5989 is tracking this issue.Another issue with the current setup is that platform parameters cannot be safely initialized when the app starts up via a worker which has led to a top crash: #6060. That issue thread plus the documentation in
AbstractOppiaApplicationgo into a lot more detail about the initialization order changes.New design
BootstrapOppiaWorkerThe new design involves introducing a single
WorkManagerworker:BootstrapOppiaWorker. This introduces a very elegant improvement over the previous solution since it allows us to solve several classes of common issues every worker will face:WorkManager's own internal workers (which run through the same factory).WorkManager; these now clear themselves). This behavior works retroactively for past Oppia workers, as well, once this code is deployed to user's devices.Much of these weren't actually handled in the existing design. Using a common worker allows all of that to be solved in one place, and additional benefits could also be realized:
Configurationimplementation since the worker itself never changes.WorkManager(which actually, combined with some testing requirements, led to a complete isolation ofWorkManagerthroughout the codebase).OppiaWorkerinterface with Dagger-managed bindings. In the case of subtasks, this is mainly establishing an architectural pattern around an emergent behavior that was already the case for basically all of the current workers. Each subtask operates as a distinct top-level worker from the eyes ofWorkManager.Note that
BootstrapOppiaWorkerandWorkManagerConfigurationModuleare both being very thoroughly tested to ensure everything built on them can be comfortably relied upon and to cover the specific core cases that led to needing this PR in the first place (to help guard against regressions, however unlikely with these two classes rarely needing to change).New initialization logic
Beyond that, the recreation of initialization led to some other nice improvements:
ApplicationStartupListenerwas updated to have two different callbacks depending on whether platform parameters are initialized. This actually provides significant more stability in listeners that rely on platform parameters.WorkManagertrying to run a worker that relies on platform parameters andAbstractOppiaApplicationattempting to asynchronously load platform parameters. This two-step initialization process fully eliminates that race condition specifically for the path of a worker immediately running upon scheduling it (see next point). The other part of [BUG]: java.lang.IllegalStateException - Attempting to access feature flag LOGGING_LEARNER_STUDY_IDS before initialization #6060 was fixed by forcingBootstrapOppiaWorkerto synchronously load platform parameters before proceeding with creating the actual underlyingOppiaWorker.AnalyticsStartupListenerwas replaced withStartupWorkerScheduleReadinessListenersince that's effectively what it was: an application-bound listener that wanted to run on startup to schedule periodic jobs to run. Now that's done completely in an initialization order-safe way and with a custom scheduler (described later in this description).New scheduler
A new
WorkManagerSchedulerhas been introduced to simplify interacting withWorkManagerwhen wanting to schedule jobs by reusing common and obvious constraints across jobs and also ensuring that jobs cannot be scheduled in a way where they'd effectively never run (to address the potential of #6097 being an issue). Jobs are enforced to run no more than every 15 minutes (which is also a hardWorkManagerrule) and no less often than every 14 days.This scheduler works in tandem with the new
StartupWorkerScheduleReadinessListenerto provide an easy-to-use method for defining custom schedulers for a job.StartupWorkerScheduleReadinessMonitoris a new utility (bound itself as anApplicationStartupListener) which injects all of theStartupWorkerScheduleReadinessListeners and calls each at an initialization order safe time to schedule jobs (shortly after platform parameters are fully initialized). A few things to note:StartupWorkerScheduleReadinessListenerimplementations to use platform parameters directly (which is actually necessary since they may use platform parameters for configurations like job periodicity intervals as is the case forMetricLogSchedulingWorkerScheduler.ApplicationStartupListenerhelps to keepAbstractOppiaApplicationas lean as possible. One thing found during the development of this PR was that reasoning aboutAbstractOppiaApplication's initialization order and nuances was really challenging. Moving responsibilities out of theonCreate()should hopefully help make it much easier to understand and perform the necessary isolation for correctness.One limitation to note: the current version of
WorkManagermeans thatWorkManagerSchedulermust useExistingPeriodicWorkPolicy.KEEPwhen scheduling jobs whereasExistingPeriodicWorkPolicy.UPDATEwould actually be better since the latter properly supports property changes being applied without needing to cancel and restart the job. Practically this isn't expected to cause any problems for a while but it could potentially in the future (such as if we tweak the periodicity of performance metric captures or platform parameter syncing). #6115 is capturing the upgrade work needed to address this and there's a TODO pointing to that issue in code to ensure the update happens once it's possible.Finally, actual scheduling logic relies on
WorkManager. Even though we attempt to schedule jobs upon startup they may not actually run unless it's time for them to run again (i.e. their constraints are met and they haven't run within their configured interval). In most cases, requesting to schedule a job is a no-op since it's will already be tracked byWorkManager.OppiaWorkerOppiaWorkeris described a bit above but it's meant to be an extremely straightforward way to represent scheduled workers without having to worry about the extra complexity ofWorkManagerand edge integration-specific quirks. The idea is that all a developer needs to do is:OppiaWorkerimplementation with whatever subtasks are desired.StartupWorkerScheduleReadinessListenerto schedule the worker's subtasks upon application startup.And that's basically it. An
OppiaWorkerimplementation and itsStartupWorkerScheduleReadinessListenerimplementation can both directly depend on platform parameters and feature flags as needed and perform whatever background logic that needs to be done for the worker. The hope is this significantly simplifies implementing such workers in the future.Developer reference improvements
To help illustrate how to set up a new
OppiaWorkerand test it, a developer-onlyDebugWorkerhas been introduced along with its scheduler and module classes. This is a fully functioning worker that actually runs in//:oppia_devbuilds of the app and can be used as both a coding and maintenance reference (see wiki bits below).A bunch of updates were made to the
WorkManagerdocumentation covering not only the new implementation and testing patterns but also how to actually forcibly run workers usingadb. Unfortunately these instructions basically only work whenadbcan be rooted which is limited for personal devices (but should be fine for emulators). Hopefully these instructions significantly help when debugging workers in the future.Testing infrastructure changes
Testing workers is surprisingly complex which is likely why existing tests weren't actually validating the periodicity part of workers, and were introducing a lot of "backdoors" to directly access configuration values rather than reading what was passed to
WorkManager. A lot of investigation and iteration went into the creation of a newOppiaWorkManagerTestDriverutility which solves all of these (and more) problems:WorkManager.WorkManager's own internal testing library for managing constraints (thoughWorkManageritself doesn't do a great job here so the utility is limited).WorkManager's own behavior in a way that's interoperable withTestCoroutineDispatchers. I suggest looking at the code to understand what this does more.USE_TIME_BASED_SCHEDULINGcan be used which allows direct binding to theFakeOppiaClockand seems to actually "just work" with Oppia's test coroutine management (with some other tweaks).FakeLogScheduler,FakeLogUploader, andFirebaseLogUploader(andFirebaseLogUploaderModule) since workers should actually just be tested directly, instead.It's worth noting that the configuration test setup for the driver is rather complex and won't be fully described here since the documentation and code do a better job (though one thing to note is that using the driver will change the time management mode of the test away from the standard Oppia Android default). This and the other overall complexity of properly interacting with
WorkManager, its driver, orTestCoroutineDispatchersis ultimately why this PR also mandates using this driver overWorkManageror its test utility directly. Trying to make jobs work correctly in a test environment is a complete minefield, so it makes sense to just eliminate whole classes of developer frustration and confusion by requiring using a utility that trivializes the majority of these interactions.Some other notes:
WorkManager's internal database) that might break in future versions, but it's the best solution for what it's being used for currently.CoroutineExecutorServiceis needed as part of helping to bridge the gap and ensureTestCoroutineDispatcherinteroperability. This will continue to be the case even after [Feature Request]: Upgrade work manager to 2.9.0 #6115 is addressed, and will likely continue until Introduce managed executors to back dispatchers instead of using managed dispatchers #3715 is addressed.Changes to worker implementations & simplifications
The event log upload worker, performance metrics capture worker, and platform parameter worker all obviously require substantial reworking (along with their corresponding peer classes) to fit the new architectural patterns. Note that some attempt was also made to ensure naming consistency when it was easy (which is why
LogUploadWorkerremains inconsistent for now--updating it to be consistent would be a rather large codebase-wide change on top of an already very large PR for minimal gain). Fortunately the new architectural pattern makes renaming these classes trivial in the future.Note that this change does migrate over the platform parameter work which means it actually should run now. However, it doesn't actually perform syncing logic anymore and that will be addressed in #6061 (which actually prompted the investigation that led to this PR).
Miscellaneous non-test changes
There are a bunch of other changes that were made in passing:
ApplicationLifecycleListenerhas a new KDoc update that comes from the fact thatProcessLifecycleOwnerhas built-in delays for foreground/background changes to avoid invalid "went to background"/"wen to foreground" subsequent events just for navigating activities normally (sinceonStop/onStartis called between navigation moments). This was found to cause some initialization order complexities in tests so the documentation is meant to help make this clear for other developers in the future.ApplicationLifecycleObserverhad a bunch of changes:ApplicationLifecycleLoggerwhich is responsible for the actual metric logging so that the observer can focus on activity-level state tracking.Provider.AbstractOppiaApplication.onCreatewas further simplified by abstracting awayAppCheckProviderFactoryto the class's constructor. This allows it to default to the default production-friendly factory but also allowDeveloperOppiaApplicationto useDebugAppCheckProviderFactoryas needed. This led to an overall cleanup of no longer needing the build flavor as part of the root application component anymore.AndroidManifest.xmlinstructions was added to clarify how to actually enable faster event logs. This isn't directly related to the PR but it was noticed in passing. The previous support article was changed over the last few years and it's much harder to find the actualadbcommand needed.CpuPerformanceSnapshotterwas made injectable to simplify the Dagger graph a bit and which modules are actually needed for certain test scenarios. It's not obvious why it wasn't done this way originally from looking at code history.OppiaWorkManagerTestDriver).ExceptionsControllerwas noticed in passing that it wasn't a singleton, and this has been fixed. It needs to be to avoid potential race conditions against itsPersistentCacheStoreif/when more than one is created. This almost certainly has caused unseen problems.UncaughtExceptionLoggerStartupListenerwas updated in a few ways:Miscellaneous test & test infrastructural changes
There are other miscellaneous test and test infrastructure changes outside the main work manager bits:
WorkManagerinitialization was removed in tests that didn't actually use workers (e.g.ProfileAndDeviceIdActivityTest).ExplorationActivityLocalTestdid have an event reordering update but I don't remember precisely why this was correct.ApplicationLifecycleObserverTestwas updated to use an actual real activity to validate lifecycle integration (since before it wasn't actually testing that the utility was wired correctly). There's some messy reflection to ensureProcessLifecycleOwnerproperly resets across tests because it uses static state internally and the current version being used doesn't allow it to be properly reset. Addressing this is being tracked in [Feature Request]: Migrate to a cleaner method for resettingProcessLifecycleOwnerin tests #6187.CpuPerformanceSnapshotterTesthas been updated so that the snapshotter itself is being tested rather than testing happening viaApplicationLifecycleObserverfor more correct separations of concern.TestPlatformParameterModulehas become even more complex due to a realization during debugging thatTestCoroutineDispatchers.runCurrent()cannot be safely called off the main thread (see [BUG]: Fix dispatcher synchronization issues #6116), yet it still being possible via certain worker initialization pathways for platform parameters to now be needed off the main thread. This has been addressed by specifically delegating back to the main thread and correctly blocking on it.Loopercheck done is actually an SDK 21-compatible variant toisCurrentThreadwhich is, unfortunately, SDK 23+. This technically doesn't matter for Robolectric tests but Android Lint is unhappy without this approach.FakeAnalyticsEventLogger,FakeExceptionLogger,FakeFirestoreEventLogger, andFakePerformanceMetricsEventLoggerhave been updaed to have similar APIs (particularly for event count and exceptions) since this was needed for a bunch of tests. The loggers' tests themselves have also been updated to reflect these changes.Script, pattern, an exemption updates
Other non-app changes:
KdocValidityCheckwas updated to also ignoreMultibinds(which it should've done before, this was just missed since it's a rarely used annotation). A new test was added to validate this behavior change.WorkManagerlibrary interaction as mentioned in previous sections.StartupWorkerScheduleReadinessMonitorfor simplicity since it's a large PR and a relatively simple utility. [Feature Request]: Add tests forStartupWorkerScheduleReadinessMonitor#6189 is tracking adding tests for this in a future PR.ApplicationLifecycleLoggerbecauseApplicationLifecycleObserver's tests need to be audited to determine which should move into a dedicated logger test suite, and what additional logger-specific tests should be added. This is being tracked in [Feature Request]: Add tests forApplicationLifecycleLogger#6190.OppiaWorkManagerTestDriverbecause it's highly complex to test and not worth it at this point because it will actually be simplified with the upgrade to 2.9.0 per [Feature Request]: Upgrade work manager to 2.9.0 #6115. Nevertheless, adding tests for this driver is being tracked by [Feature Request]: Add tests forOppiaWorkManagerTestDriver#6191.MetricLogSchedulingWorkerScheduler). This revealed that theTestFileCheckutility doesn't correctly fail when an exemption is added but a test actually exists. [BUG]: Test file checker doesn't catch existing tests #6188 is tracking fixing this.DebugWorkerDebugModulewas exempted for coverage compatibility since it actually caused a failure during the finalization of the PR. I didn't dig into the specifics as to why it has an incompatibility, but this seems to be a reasonable workaround for now.Disclosure of LLM usage
I did use Gemini pretty heavily for diagnosing the early issues with the work manager integration and how to effectively debug them. This actually all began with #6061 when I was trying to figure out why I couldn't get the jobs to run, and pulling on that thread led (with LLM help) uncovered all of the broader problems that ended up being addressed here. Gemini did help a bit with ideating on an approach but the vast majority of the bootstrapping pattern and all of its corresponding code came from me. I used Gemini CLI a bit for some of the testing for
UncaughtExceptionLoggerStartupListenerbut that was it.Essential Checklist
For UI-specific PRs only
Largely this PR should have zero impact on direct UIs in the app, though it will indirectly affect the overall UX since it will enable platform parameters to actually work correctly (though not entirely, #6061 still needs to be finished). No screens are being updated as part of these changes.