Environment document can sync to DynamoDB before feature states commit #7281

edge-api served a partial environment document (missing feature state) for a newly-created environment on SaaS. The document was corrected by running the "rebuild environment document" Django admin action. Nothing surfaced in Sentry; we only caught it via two tickets from the same customer landing in Pylon.

This is a known race, documented inline at api/core/signals.py:21-30: the audit log post_save signal enqueues the Dynamo write, but the write can be scheduled before the related FeatureState rows are visible. Today's mitigation is a 1-second `delay_until` on the audit-log task when TASK_RUN_METHOD is TASK_PROCESSOR — a timer, not a guarantee.

We don't want to fix this by wrapping environment/feature creation in a transaction: these tables are on the hot path for flag evaluation and we are conservative about introducing locks there. The fix should live inside the audit-log-driven sync path.

Partial writes are also effectively invisible today: the Dynamo write path (`environment_wrapper.write_environments` / `_write_environments` at api/environments/dynamodb/wrappers/environment_wrapper.py:65-105) has no Prometheus counter and emits no structured event on success. A snapshot taken before the feature states commit leaves an incomplete doc on edge-api, with no signal we can alert on.

Proposed scope:

Short-term (correctness) — extend `is_creating` to cover initial seeding

- `Environment.is_creating` already exists and is set by `Environment.clone()` (api/environments/models.py:242). Extend it to also cover standard creation: set `True` in a `BEFORE_CREATE` hook, clear it after `create_initial_feature_states_for_environment` has returned (api/environments/models.py:168-170, api/features/models.py:857-861). The clear should happen via an `.update(is_creating=False)` on the row to avoid re-triggering hooks.
- Add an equivalent flag on `Feature`. Set it before `create_initial_feature_states_for_feature` seeds per-environment FeatureStates (api/features/models.py:863-866), clear it when seeding returns.
- In `process_environment_update` (api/environments/tasks.py:32-38), before calling `write_environment_documents`, check whether any environment in scope is `is_creating=True`, or whether any feature in the project is `is_creating=True`. If so, re-enqueue the task with exponential backoff (e.g. 1s → 2s → 4s) up to a bounded attempt count; log a warning and give up past that. Otherwise proceed to write.
- Once the flag-based guard is in place, remove the 1-second `delay_until` shim at api/core/signals.py:32-36. The guard handles the race directly; the timer becomes dead code.
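
The retry decision in the guard above could look roughly like the sketch below. It is plain Python with no Django or task-processor imports, so it only shows the backoff arithmetic and the give-up bound. `MAX_RETRIES`, `Decision`, `backoff_delay` and `decide` are hypothetical names, and the bound of 3 retries is an assumption taken from the 1s → 2s → 4s example, not an agreed value.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Assumption: three retries (1s, 2s, 4s) before giving up.
MAX_RETRIES = 3
BASE_DELAY = timedelta(seconds=1)


def backoff_delay(attempt: int) -> timedelta:
    """Delay before the retry that follows run number `attempt` (0-based)."""
    return BASE_DELAY * (2 ** attempt)


@dataclass
class Decision:
    action: str  # "write", "retry" or "give_up"
    delay_until: Optional[datetime] = None
    next_attempt: Optional[int] = None


def decide(any_creating: bool, attempt: int, now: datetime) -> Decision:
    """Decide what process_environment_update should do on this run.

    `any_creating` stands in for the real check: any environment in scope,
    or any feature in the project, still has is_creating=True.
    """
    if not any_creating:
        # Seeding has finished, so the document can be written.
        return Decision("write")
    if attempt >= MAX_RETRIES:
        # Bounded: the caller logs a warning and stops re-enqueueing.
        return Decision("give_up")
    return Decision(
        "retry",
        delay_until=now + backoff_delay(attempt),
        next_attempt=attempt + 1,
    )
```

In the real task, a `"retry"` decision would presumably re-enqueue `process_environment_update` with the returned `delay_until` and pass `next_attempt` through as a task argument; how the attempt count is carried is left open here.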

Defensive (observability)

- Add a Prometheus counter (e.g. `flagsmith_environment_document_writes_total` with a `result` label of success/failure) so we can alert on non-zero failure rates rather than finding out via support.
- Emit a structlog event (`environment_document.written`) with `environment__id`, `feature_states__count`, `document__bytes` so writes can be correlated against what edge-api actually serves.
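
As a rough illustration of the two observability items above, the sketch below wraps a write with a result counter and a structured event. It uses only the standard library so it runs as-is: `record_result` and `emit` are injected stand-ins (assumptions), and `write_with_telemetry` is a hypothetical helper, not an existing function. In the real code `record_result` would presumably call `.labels(result=...).inc()` on a `prometheus_client` counter, and `emit` would be the `.info` method of a structlog logger.

```python
import json
from typing import Any, Callable, Dict, Sequence


def write_with_telemetry(
    environment_id: int,
    feature_states: Sequence[Dict[str, Any]],
    write_fn: Callable[[Dict[str, Any]], None],
    record_result: Callable[[str], None],
    emit: Callable[..., None],
) -> None:
    """Write one environment document and report the outcome.

    The document shape here is a placeholder; the real document is built
    elsewhere and is considerably richer.
    """
    document = {"id": environment_id, "feature_states": list(feature_states)}
    try:
        write_fn(document)
    except Exception:
        # Count the failure, then let the error propagate to the caller.
        record_result("failure")
        raise
    record_result("success")
    emit(
        "environment_document.written",
        environment__id=environment_id,
        feature_states__count=len(document["feature_states"]),
        document__bytes=len(json.dumps(document).encode("utf-8")),
    )
```

Logging `feature_states__count` on every write is what would make a partial document detectable after the fact: a newly created environment that writes with a count of zero is visible in the event stream even when the write itself succeeds.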

