Skip to content

Race condition: Reconciler ignores newly created Terraform CR due to informer cache lag #1721

@OS-joaocastilho

Description

@OS-joaocastilho

Summary

When a Terraform Custom Resource is created, there's a race condition where the controller's informer cache may not have synced the new resource yet. The reconciler fetches the resource, receives a "not found" error, and silently exits without requeuing. This causes the resource to fall out of the reconciliation loop entirely.

Environment

  • tofu-controller version: Latest (also affects tf-controller)
  • Kubernetes version: 1.28+
  • Flux version: 2.x

Expected Behavior

When the controller receives a watch event for a Terraform CR and cannot find it in the cache, it should:

  1. Recognize this may be a transient cache synchronization issue
  2. Requeue the reconciliation after a short delay (e.g., 5-10 seconds)
  3. Only ignore the resource after multiple failed attempts

Actual Behavior

The controller silently ignores the resource:

// From tf_controller.go - Reconcile function
terraform := &infrav1.Terraform{}
if err := r.Get(ctx, req.NamespacedName, terraform); err != nil {
    return ctrl.Result{}, client.IgnoreNotFound(err)  // Silently exits
}

The retryInterval setting does not help because the reconciliation doesn't "fail" - it returns successfully with an empty result.

Evidence

We observed this in production during a Helm deployment:

Helm logs (resource created):

16:38:57.956Z: Replaced "my-terraform-resource" with kind Terraform

Terraform Controller logs (3ms later):

16:38:57.959Z: error: "Terraform.infra.contrib.fluxcd.io 'my-terraform-resource' not found"

The 3ms gap between creation and "not found" error indicates the informer cache hadn't synced yet.

Impact

  • Affects first-time deployments to new namespaces/clusters
  • Resource never gets reconciled until the next periodic sync (based on interval)
  • If interval is long (e.g., 12h), the deployment times out waiting for Terraform to run
  • We observed ~3.4% failure rate on first-time deployments across multiple clusters

Suggested Fix

Modify the reconciler to requeue on "not found" when triggered by a watch event:

terraform := &infrav1.Terraform{}
if err := r.Get(ctx, req.NamespacedName, terraform); err != nil {
    if apierrors.IsNotFound(err) {
        // Could be informer cache lag - requeue with short delay
        log.Info("Terraform resource not found, may be cache lag, requeueing")
        return ctrl.Result{RequeueAfter: 5 * time.Second}, nil
    }
    return ctrl.Result{}, err
}

Alternatively, track reconciliation attempts and only ignore after N consecutive "not found" errors.

Workarounds

Current workarounds (all have drawbacks):

  1. Reduce interval from hours to minutes (increases API server load)
  2. Add Helm pre-install hook with sleep delay (adds latency to all deployments)
  3. Accept occasional first-deployment failures (requires manual intervention)

Reproduction Steps

  1. Create a new Kubernetes namespace
  2. Immediately deploy a Terraform CR to that namespace via Helm
  3. Observe that the Terraform Controller may miss the resource due to cache lag
  4. The resource won't be reconciled until the next periodic interval

This is more likely to occur on:

  • First-time deployments to new clusters/namespaces
  • High-load clusters where informer sync may be slower
  • When the controller is processing many resources simultaneously

Related

  • Similar to eventual consistency issues in other Kubernetes controllers
  • The standard client.IgnoreNotFound() pattern assumes "not found" means "deleted", which isn't always true

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions