Environment document can sync to DynamoDB before feature states commit #7281

@khvn26

Description

edge-api served a partial environment document (missing feature state) for a newly-created environment on SaaS. The document was corrected by running the "rebuild environment document" Django admin action. Nothing surfaced in Sentry; we only caught it via two tickets from the same customer landing in Pylon.

This is a known race, documented inline at api/core/signals.py:21-30: the audit log post_save signal enqueues the Dynamo write, but the write can be scheduled before the related FeatureState rows are visible. Today's mitigation is a 1-second delay_until on the audit-log task when TASK_RUN_METHOD is TASK_PROCESSOR — a timer, not a guarantee.

We don't want to fix this by wrapping environment/feature creation in a transaction: these tables are on the hot path for flag evaluation and we are conservative about introducing locks there. The fix should live inside the audit-log-driven sync path.

Partial writes are also effectively invisible today: the Dynamo write path (environment_wrapper.write_environments / _write_environments at api/environments/dynamodb/wrappers/environment_wrapper.py:65-105) has no Prometheus counter and emits no structured event on success. A snapshot taken before the related feature states are committed produces a stale doc on edge-api with no signal we can alert on.

Proposed scope:

Short-term (correctness) — extend is_creating to cover initial seeding

  • Environment.is_creating already exists and is set by Environment.clone() (api/environments/models.py:242). Extend it to also cover standard creation: set True in a BEFORE_CREATE hook, clear it after create_initial_feature_states_for_environment has returned (api/environments/models.py:168-170, api/features/models.py:857-861). The clear should happen via an .update(is_creating=False) on the row to avoid re-triggering hooks.
  • Add an equivalent flag on Feature. Set it before create_initial_feature_states_for_feature seeds per-environment FeatureStates (api/features/models.py:863-866), clear it when seeding returns.
  • In process_environment_update (api/environments/tasks.py:32-38), before calling write_environment_documents, check whether any environment in scope has is_creating=True, or whether any feature in the project has is_creating=True. If so, re-enqueue the task with exponential backoff (e.g. 1s → 2s → 4s) up to a bounded attempt count; past that limit, log a warning and give up. Otherwise proceed to write.
  • Once the flag-based guard is in place, remove the 1-second delay_until shim at api/core/signals.py:32-36. The guard handles the race directly; the timer becomes dead code.
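
A minimal sketch of the guard described in the bullets above, to make the retry semantics concrete. It is not the actual task code: guarded_environment_update, is_seeding, write_documents, reenqueue and warn are hypothetical names, the task processor is replaced by injected callables, and the limit of three retries is an assumption taken from the 1s → 2s → 4s example.

```python
from typing import Callable, Iterable, List, Tuple

BASE_DELAY_SECONDS = 1.0  # assumption: first retry after 1s
MAX_RETRIES = 3           # assumption: 1s, 2s, 4s, then give up


def backoff_delay(attempt: int, base: float = BASE_DELAY_SECONDS) -> float:
    """Delay before the retry that follows 0-based `attempt`: 1s, 2s, 4s, ..."""
    return base * (2 ** attempt)


def guarded_environment_update(
    environment_ids: Iterable[int],
    attempt: int,
    *,
    is_seeding: Callable[[List[int]], bool],
    write_documents: Callable[[List[int]], None],
    reenqueue: Callable[[List[int], int, float], None],
    warn: Callable[[str], None],
) -> str:
    """Write only when nothing in scope is still being seeded.

    `is_seeding` stands in for the is_creating=True checks on environments
    and features; `reenqueue` stands in for scheduling the task again.
    """
    ids = list(environment_ids)
    if not is_seeding(ids):
        write_documents(ids)
        return "written"
    if attempt >= MAX_RETRIES:
        warn(f"environments {ids} still seeding after {attempt} retries")
        return "gave_up"
    reenqueue(ids, attempt + 1, backoff_delay(attempt))
    return "retried"


def simulate(checks_until_clear: int) -> Tuple[str, List[float]]:
    """Drive the guard with a fake queue; seeding clears after N checks."""
    checks = {"n": 0}
    delays: List[float] = []
    queue: List[int] = [0]  # pending attempts
    outcome = ""

    def is_seeding(_ids: List[int]) -> bool:
        checks["n"] += 1
        return checks["n"] <= checks_until_clear

    def reenqueue(_ids: List[int], next_attempt: int, delay: float) -> None:
        delays.append(delay)
        queue.append(next_attempt)

    while queue:
        outcome = guarded_environment_update(
            [1],
            queue.pop(),
            is_seeding=is_seeding,
            write_documents=lambda _ids: None,
            reenqueue=reenqueue,
            warn=lambda _msg: None,
        )
    return outcome, delays
```

In this sketch, giving up skips the write entirely. Whether the real task should instead write a best-effort document after the final attempt is a design choice left open here.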

Defensive (observability)

  • Add a Prometheus counter (e.g. flagsmith_environment_document_writes_total with a result label of success/failure) so we can alert on non-zero failure rates rather than finding out via support.
  • Emit a structlog event (environment_document.written) with environment__id, feature_states__count, document__bytes so writes can be correlated against what edge-api actually serves.
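
A rough sketch of how the two observability bullets could wrap a write, under stated assumptions. write_with_telemetry is a hypothetical helper; counter and log are injected and only need the call shapes used here (labels(result=...).inc() as on a prometheus_client Counter, and info(event, **fields) as on a structlog logger). The "feature_states" document key and measuring size via JSON encoding are also assumptions. Small stand-ins are included so the example runs without either library.

```python
import json
from typing import Any, Callable, Dict, List, Tuple


def write_with_telemetry(
    environment_id: int,
    document: Dict[str, Any],
    *,
    write: Callable[[Dict[str, Any]], None],
    counter: Any,
    log: Any,
) -> None:
    """Perform the write, count the result, and log a structured event."""
    try:
        write(document)
    except Exception:
        counter.labels(result="failure").inc()
        raise
    counter.labels(result="success").inc()
    log.info(
        "environment_document.written",
        environment__id=environment_id,
        # assumption: feature states live under the "feature_states" key
        feature_states__count=len(document.get("feature_states", [])),
        # assumption: JSON size is a good enough proxy for document size
        document__bytes=len(json.dumps(document).encode("utf-8")),
    )


class _FakeChild:
    def __init__(self, counts: Dict[str, int], result: str) -> None:
        self._counts, self._result = counts, result

    def inc(self) -> None:
        self._counts[self._result] = self._counts.get(self._result, 0) + 1


class FakeCounter:
    """Stand-in exposing the labels(...).inc() shape used above."""

    def __init__(self) -> None:
        self.counts: Dict[str, int] = {}

    def labels(self, *, result: str) -> _FakeChild:
        return _FakeChild(self.counts, result)


class FakeLogger:
    """Stand-in exposing the info(event, **fields) shape used above."""

    def __init__(self) -> None:
        self.events: List[Tuple[str, Dict[str, Any]]] = []

    def info(self, event: str, **fields: Any) -> None:
        self.events.append((event, fields))
```

With feature_states__count in the event, a document written with zero or unexpectedly few feature states becomes something that can be queried, which is the case that went unnoticed in this incident.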

Labels: api (Issue related to the REST API)
