Race condition: Reconciler ignores newly created Terraform CR due to informer cache lag #1721

## Summary

When a Terraform custom resource (CR) is created, the controller's informer cache may not have synced the new resource by the time the reconciler runs. The reconciler's `Get` call then returns a "not found" error, and the reconciler exits without requeuing. The resource drops out of the reconciliation loop until the next periodic sync.

## Environment

- **tofu-controller version:** Latest (also affects tf-controller)
- **Kubernetes version:** 1.28+
- **Flux version:** 2.x

## Expected Behavior

When the controller receives a watch event for a Terraform CR and cannot find it in the cache, it should:
1. Recognize this may be a transient cache synchronization issue
2. Requeue the reconciliation after a short delay (e.g., 5-10 seconds)
3. Only ignore the resource after multiple failed attempts

## Actual Behavior

The controller silently ignores the resource:

```go
// From tf_controller.go - Reconcile function
terraform := &infrav1.Terraform{}
if err := r.Get(ctx, req.NamespacedName, terraform); err != nil {
    return ctrl.Result{}, client.IgnoreNotFound(err)  // Silently exits
}
```

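The `retryInterval` setting does not help because the reconciliation doesn't "fail": `client.IgnoreNotFound(err)` turns the "not found" error into `nil`, so `Reconcile` returns an empty `ctrl.Result{}` with no error, and the request is treated as successfully handled and is not requeued.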
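
To make the control flow concrete, here is a minimal, dependency-free sketch of why a "not found" result ends the retry cycle. Everything in it is a stand-in for illustration: `errNotFound`, `errUnavailable`, `result`, `ignoreNotFound`, `reconcile`, and `willRetry` are hypothetical names that only mimic the behavior of `client.IgnoreNotFound` and of how a returned `ctrl.Result` and error are interpreted. It is not the controller's actual code.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical sentinel errors. Real code would use apierrors.IsNotFound
// on the error returned by the client instead of comparing sentinels.
var (
	errNotFound    = errors.New("not found")
	errUnavailable = errors.New("connection refused")
)

// result mirrors the two fields of ctrl.Result that matter here.
type result struct {
	Requeue      bool
	RequeueAfter time.Duration
}

// ignoreNotFound mimics client.IgnoreNotFound: a "not found" error
// becomes nil, any other error passes through unchanged.
func ignoreNotFound(err error) error {
	if errors.Is(err, errNotFound) {
		return nil
	}
	return err
}

// reconcile has the same shape as the snippet above; get stands in for r.Get.
func reconcile(get func() error) (result, error) {
	if err := get(); err != nil {
		return result{}, ignoreNotFound(err)
	}
	return result{}, nil // real reconciliation work would happen here
}

// willRetry reports whether the request would be processed again:
// either an error was returned or a requeue was explicitly requested.
func willRetry(res result, err error) bool {
	return err != nil || res.Requeue || res.RequeueAfter > 0
}

func main() {
	res, err := reconcile(func() error { return errNotFound })
	fmt.Println("retry after cache miss: ", willRetry(res, err))

	res, err = reconcile(func() error { return errUnavailable })
	fmt.Println("retry after other error:", willRetry(res, err))
}
```

In this sketch the cache miss is the only failure mode that yields neither an error nor a requeue request, which is why no retry setting ever applies to it.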

## Evidence

We observed this in production during a Helm deployment:

**Helm logs (resource created):**
```
16:38:57.956Z: Replaced "my-terraform-resource" with kind Terraform
```

**Terraform Controller logs (3ms later):**
```
16:38:57.959Z: error: "Terraform.infra.contrib.fluxcd.io 'my-terraform-resource' not found"
```

The 3ms gap between the resource being created and the "not found" error suggests the informer cache had not yet synced the new object when the reconciler ran.

## Impact

- Affects first-time deployments to new namespaces/clusters
- Resource never gets reconciled until the next periodic sync (based on `interval`)
- If `interval` is long (e.g., 12h), the deployment times out waiting for Terraform to run
- We observed ~3.4% failure rate on first-time deployments across multiple clusters

## Suggested Fix

Modify the reconciler to requeue on "not found" when triggered by a watch event:

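```go
// Assumes apierrors is "k8s.io/apimachinery/pkg/api/errors" and that log is
// the reconciler's request-scoped logger.
terraform := &infrav1.Terraform{}
if err := r.Get(ctx, req.NamespacedName, terraform); err != nil {
    if apierrors.IsNotFound(err) {
        // Could be informer cache lag: requeue with a short delay.
        // Caveat: as written, this also requeues every 5s for resources that
        // really were deleted, so it should be paired with a retry cap.
        log.Info("Terraform resource not found, may be cache lag, requeueing")
        return ctrl.Result{RequeueAfter: 5 * time.Second}, nil
    }
    return ctrl.Result{}, err
}
```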

Alternatively, track reconciliation attempts and only ignore after N consecutive "not found" errors.
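
A rough sketch of that attempt-tracking idea, using only the standard library. `notFoundTracker` and its methods are hypothetical names introduced for illustration and do not exist in tofu-controller; the limit of 3 attempts and the 5 second delay are likewise assumptions, not tested values.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// notFoundTracker counts consecutive "not found" results per object key
// (for example "namespace/name"). Hypothetical helper for illustration.
type notFoundTracker struct {
	mu       sync.Mutex
	attempts map[string]int
	max      int           // give up once this many consecutive misses are seen
	delay    time.Duration // requeue delay while still retrying
}

func newNotFoundTracker(max int, delay time.Duration) *notFoundTracker {
	return &notFoundTracker{attempts: make(map[string]int), max: max, delay: delay}
}

// onNotFound records a miss for key. It returns the requeue delay and true
// while retries remain. On the max-th consecutive miss it forgets the key and
// returns false, meaning the caller should treat the object as deleted.
func (t *notFoundTracker) onNotFound(key string) (time.Duration, bool) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.attempts[key]++
	if t.attempts[key] >= t.max {
		delete(t.attempts, key)
		return 0, false
	}
	return t.delay, true
}

// onFound resets the counter once the object is visible again.
func (t *notFoundTracker) onFound(key string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.attempts, key)
}

func main() {
	tracker := newNotFoundTracker(3, 5*time.Second)
	key := "default/my-terraform-resource"
	for i := 1; i <= 3; i++ {
		delay, requeue := tracker.onNotFound(key)
		fmt.Printf("miss %d: requeue=%v delay=%s\n", i, requeue, delay)
	}
}
```

In a reconciler, the "not found" branch would call `onNotFound(req.NamespacedName.String())` and return `ctrl.Result{RequeueAfter: delay}, nil` while the second return value is true, and a successful `Get` would call `onFound`. With the values above, the first two misses requeue and the third gives up, which bounds the extra work spent on resources that really were deleted. The counter lives in memory only, so it resets when the controller restarts.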

## Workarounds

Current workarounds (all have drawbacks):
1. Reduce `interval` from hours to minutes (increases API server load)
2. Add Helm pre-install hook with sleep delay (adds latency to all deployments)
3. Accept occasional first-deployment failures (requires manual intervention)

## Reproduction Steps

1. Create a new Kubernetes namespace
2. Immediately deploy a Terraform CR to that namespace via Helm
3. Observe that the Terraform Controller may miss the resource due to cache lag
4. The resource won't be reconciled until the next periodic interval

This is more likely to occur on:
- First-time deployments to new clusters/namespaces
- High-load clusters where informer sync may be slower
- When the controller is processing many resources simultaneously

## Related

- Similar to eventual consistency issues in other Kubernetes controllers
- The standard `client.IgnoreNotFound()` pattern assumes "not found" means "deleted", which isn't always true
