edge-api served a partial environment document (missing feature state) for a newly-created environment on SaaS. The document was corrected by running the "rebuild environment document" Django admin action. Nothing surfaced in Sentry; we only caught it via two tickets from the same customer landing in Pylon.
This is a known race, documented inline at api/core/signals.py:21-30: the audit log post_save signal enqueues the Dynamo write, but the write can be scheduled before the related FeatureState rows are visible. Today's mitigation is a 1-second delay_until on the audit-log task when TASK_RUN_METHOD is TASK_PROCESSOR — a timer, not a guarantee.
We don't want to fix this by wrapping environment/feature creation in a transaction: these tables are on the hot path for flag evaluation and we are conservative about introducing locks there. The fix should live inside the audit-log-driven sync path.
Partial writes are also effectively invisible today: the Dynamo write path (environment_wrapper.write_environments / _write_environments at api/environments/dynamodb/wrappers/environment_wrapper.py:65-105) has no Prometheus counter and emits no structured event on success. A snapshot taken before the FeatureState rows are visible produces an incomplete doc on edge-api with no signal we can alert on.
Proposed scope:
Short-term (correctness) — extend is_creating to cover initial seeding
- Environment.is_creating already exists and is set by Environment.clone() (api/environments/models.py:242). Extend it to also cover standard creation: set it to True in a BEFORE_CREATE hook, and clear it after create_initial_feature_states_for_environment has returned (api/environments/models.py:168-170, api/features/models.py:857-861). The clear should happen via an .update(is_creating=False) on the row to avoid re-triggering hooks.
- Add an equivalent flag on Feature. Set it before create_initial_feature_states_for_feature seeds per-environment FeatureStates (api/features/models.py:863-866), and clear it when seeding returns.
- In process_environment_update (api/environments/tasks.py:32-38), before calling write_environment_documents, check whether any environment in scope has is_creating=True, or whether any feature in the project has is_creating=True. If so, re-enqueue the task with exponential backoff (e.g. 1s → 2s → 4s) up to a bounded attempt count; past that, log a warning and give up. Otherwise proceed to write.
- Once the flag-based guard is in place, remove the 1-second delay_until shim at api/core/signals.py:32-36. The guard handles the race directly; the timer becomes dead code.
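
The retry decision in the guard can be sketched as a small pure function, kept separate from the task processor so it is easy to unit test. This is only an illustration under stated assumptions: decide_sync_action, SyncDecision, BASE_DELAY_SECONDS and MAX_ATTEMPTS are hypothetical names, the attempt counter is assumed to be passed along when the task is re-enqueued, and the two booleans stand in for the is_creating queries against Environment and Feature.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical tuning values; the proposal only says "e.g. 1s -> 2s -> 4s"
# with a bounded attempt count.
BASE_DELAY_SECONDS = 1.0
MAX_ATTEMPTS = 3


@dataclass(frozen=True)
class SyncDecision:
    action: str  # "write", "retry", or "give_up"
    delay_seconds: Optional[float] = None


def decide_sync_action(
    environments_creating: bool,
    features_creating: bool,
    attempt: int,
    base_delay: float = BASE_DELAY_SECONDS,
    max_attempts: int = MAX_ATTEMPTS,
) -> SyncDecision:
    """Decide what the sync task should do next.

    `attempt` is the number of retries already made (0 on the first run).
    The two booleans are assumed to come from queries such as
    "does any environment in scope have is_creating=True".
    """
    if not (environments_creating or features_creating):
        # Seeding has finished everywhere, so it is safe to write.
        return SyncDecision("write")
    if attempt >= max_attempts:
        # Bounded retries exhausted: the caller logs a warning and stops.
        return SyncDecision("give_up")
    # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
    return SyncDecision("retry", base_delay * 2 ** attempt)
```

With the defaults above, a task that keeps finding is_creating=True would be re-enqueued after 1, 2 and 4 seconds and then give up on the fourth check. How the delay is passed to the task processor on re-enqueue is left out here on purpose.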
Defensive (observability)
- Add a Prometheus counter (e.g. flagsmith_environment_document_writes_total with a result label of success/failure) so we can alert on non-zero failure rates rather than finding out via support.
- Emit a structlog event (environment_document.written) with environment__id, feature_states__count, document__bytes so writes can be correlated against what edge-api actually serves.
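
A rough sketch of how the two observability items could wrap the write call. Everything here is an assumption rather than the existing code: write_with_telemetry is a hypothetical helper, the counter and logger are injected (in practice a prometheus_client Counter declared with a result label, and a structlog logger), and each document is assumed to be a dict with an "id" and a "feature_states" list.

```python
import json
from typing import Any, Callable, Iterable, List, Mapping


def write_with_telemetry(
    documents: Iterable[Mapping[str, Any]],
    write_fn: Callable[[List[Mapping[str, Any]]], None],
    counter: Any,
    logger: Any,
) -> None:
    """Call write_fn and record the outcome.

    `counter` is expected to expose .labels(result=...).inc() and
    `logger` to expose .info(event, **fields).
    """
    docs = list(documents)
    try:
        write_fn(docs)
    except Exception:
        # Count the failure, then let the caller's error handling run.
        counter.labels(result="failure").inc()
        raise
    counter.labels(result="success").inc()
    for doc in docs:
        # One event per written document so it can be matched against
        # what edge-api serves for that environment.
        logger.info(
            "environment_document.written",
            environment__id=doc.get("id"),
            feature_states__count=len(doc.get("feature_states", [])),
            # default=str keeps the size estimate from failing on values
            # that json cannot serialise natively.
            document__bytes=len(json.dumps(doc, default=str).encode("utf-8")),
        )
```

Whether the counter should increment once per batch (as here) or once per document is an open choice; per-batch keeps the failure rate aligned with the number of Dynamo write calls.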