Summary
When a Terraform Custom Resource is created, there's a race condition where the controller's informer cache may not have synced the new resource yet. The reconciler fetches the resource, receives a "not found" error, and silently exits without requeuing. This causes the resource to fall out of the reconciliation loop entirely.
Environment
- tofu-controller version: Latest (also affects tf-controller)
- Kubernetes version: 1.28+
- Flux version: 2.x
Expected Behavior
When the controller receives a watch event for a Terraform CR and cannot find it in the cache, it should:
- Recognize this may be a transient cache synchronization issue
- Requeue the reconciliation after a short delay (e.g., 5-10 seconds)
- Only ignore the resource after multiple failed attempts
Actual Behavior
The controller silently ignores the resource:
// From tf_controller.go - Reconcile function
terraform := &infrav1.Terraform{}
if err := r.Get(ctx, req.NamespacedName, terraform); err != nil {
return ctrl.Result{}, client.IgnoreNotFound(err) // Silently exits
}
The retryInterval setting does not help because the reconciliation doesn't "fail" - it returns successfully with an empty result.
Evidence
We observed this in production during a Helm deployment:
Helm logs (resource created):
16:38:57.956Z: Replaced "my-terraform-resource" with kind Terraform
Terraform Controller logs (3ms later):
16:38:57.959Z: error: "Terraform.infra.contrib.fluxcd.io 'my-terraform-resource' not found"
The 3ms gap between creation and "not found" error indicates the informer cache hadn't synced yet.
Impact
- Affects first-time deployments to new namespaces/clusters
- Resource never gets reconciled until the next periodic sync (based on
interval)
- If
interval is long (e.g., 12h), the deployment times out waiting for Terraform to run
- We observed ~3.4% failure rate on first-time deployments across multiple clusters
Suggested Fix
Modify the reconciler to requeue on "not found" when triggered by a watch event:
terraform := &infrav1.Terraform{}
if err := r.Get(ctx, req.NamespacedName, terraform); err != nil {
if apierrors.IsNotFound(err) {
// Could be informer cache lag - requeue with short delay
log.Info("Terraform resource not found, may be cache lag, requeueing")
return ctrl.Result{RequeueAfter: 5 * time.Second}, nil
}
return ctrl.Result{}, err
}
Alternatively, track reconciliation attempts and only ignore after N consecutive "not found" errors.
Workarounds
Current workarounds (all have drawbacks):
- Reduce
interval from hours to minutes (increases API server load)
- Add Helm pre-install hook with sleep delay (adds latency to all deployments)
- Accept occasional first-deployment failures (requires manual intervention)
Reproduction Steps
- Create a new Kubernetes namespace
- Immediately deploy a Terraform CR to that namespace via Helm
- Observe that the Terraform Controller may miss the resource due to cache lag
- The resource won't be reconciled until the next periodic interval
This is more likely to occur on:
- First-time deployments to new clusters/namespaces
- High-load clusters where informer sync may be slower
- When the controller is processing many resources simultaneously
Related
- Similar to eventual consistency issues in other Kubernetes controllers
- The standard
client.IgnoreNotFound() pattern assumes "not found" means "deleted", which isn't always true
Summary
When a Terraform Custom Resource is created, there's a race condition where the controller's informer cache may not have synced the new resource yet. The reconciler fetches the resource, receives a "not found" error, and silently exits without requeuing. This causes the resource to fall out of the reconciliation loop entirely.
Environment
Expected Behavior
When the controller receives a watch event for a Terraform CR and cannot find it in the cache, it should:
Actual Behavior
The controller silently ignores the resource:
The
retryIntervalsetting does not help because the reconciliation doesn't "fail" - it returns successfully with an empty result.Evidence
We observed this in production during a Helm deployment:
Helm logs (resource created):
Terraform Controller logs (3ms later):
The 3ms gap between creation and "not found" error indicates the informer cache hadn't synced yet.
Impact
interval)intervalis long (e.g., 12h), the deployment times out waiting for Terraform to runSuggested Fix
Modify the reconciler to requeue on "not found" when triggered by a watch event:
Alternatively, track reconciliation attempts and only ignore after N consecutive "not found" errors.
Workarounds
Current workarounds (all have drawbacks):
intervalfrom hours to minutes (increases API server load)Reproduction Steps
This is more likely to occur on:
Related
client.IgnoreNotFound()pattern assumes "not found" means "deleted", which isn't always true