
Kubernetes Pod Operator: handle unknown pod phase#65202

Merged
jscheffl merged 2 commits into apache:main from johnhoran:pod_phase_unknown
Apr 14, 2026

Conversation

@johnhoran
Contributor

We had a recent task failure that we were only alerted to once it hit the dagrun timeout. The logs looked like:

[2026-04-10, 09:02:25 UTC] {pod.py:1425} INFO - Building pod ...-fwg83fmr with labels: {'dag_id': '...', 'task_id': '...-cdf8135eb', 'run_id': 'scheduled__2026-04-09T0900000000-020f81fe0', 'kubernetes_pod_operator': 'True', 'try_number': '1'}
[2026-04-10, 09:02:25 UTC] {pod.py:601} INFO - Found matching pod ...-fwg83fmr with labels {'airflow_kpo_in_cluster': 'True', 'airflow_version': '2.11.2-astro.2', 'app': 'airflow', 'astronomer.io/cloud_provider': 'aws', 'astronomer.io/cloud_region': 'us-west-2', 'astronomer.io/deploymentId': '...', 'astronomer.io/organizationId': '...', 'astronomer.io/workspaceId': '...', 'dag_id': '...', 'kubernetes_pod_operator': 'True', 'run_id': 'scheduled__2026-04-09T0900000000-020f81fe0', 'task_id': '...-cdf8135eb', 'try_number': '1'}
[2026-04-10, 09:02:25 UTC] {pod.py:602} INFO - `try_number` of task_instance: 1
[2026-04-10, 09:02:25 UTC] {pod.py:603} INFO - `try_number` of pod: 1
[2026-04-10, 09:02:25 UTC] {pod.py:895} WARNING - Could not resolve connection extras for deferral: connection `kubernetes_default` not found. Triggerer will try to resolve it from its own environment.
[2026-04-10, 09:02:25 UTC] {taskinstance.py:297} INFO - Pausing task as DEFERRED. dag_id=..., task_id=..._opportunity_daily_stage_run, run_id=scheduled__2026-04-09T09:00:00+00:00, execution_date=20260409T090000, start_date=20260410T090223
[2026-04-10, 09:02:25 UTC] {taskinstance.py:349} ▶ Post task execution logs
[2026-04-10, 09:02:26 UTC] {pod.py:177} INFO - Checking pod '...-fwg83fmr' in namespace '...' with poll interval 2.
[2026-04-10, 09:02:26 UTC] {pod_manager.py:138} ▼ Waiting until 600s to get the POD scheduled...
[2026-04-10, 09:02:26 UTC] {kubernetes.py:1160} WARNING - Kubernetes API does not permit watching events; falling back to polling: (403)
Reason: Forbidden: events is forbidden: User "system:serviceaccount:...:...-triggerer-serviceaccount" cannot watch resource "events" in API group "" in the namespace "..."
[2026-04-10, 09:02:26 UTC] {pod_manager.py:116} INFO - The Pod has an Event: 0/13 nodes are available: 1 node(s) had untolerated taint {karpenter.sh/disrupted: }, 2 node(s) didn't match Pod's node affinity/selector, 2 node(s) had untolerated taint {eks.amazonaws.com/compute-type: fargate}, 3 node(s) had untolerated taint {astronomer.io/node-group: airflow-system}, 5 Insufficient memory. preemption: not eligible due to preemptionPolicy=Never. from None
[2026-04-10, 09:02:31 UTC] {pod_manager.py:116} INFO - The Pod has an Event: Pod should schedule on: nodeclaim/airflow-worker-primary-9km4s from None
[2026-04-10, 09:02:36 UTC] {pod_manager.py:116} INFO - The Pod has an Event: 0/14 nodes are available: 1 node(s) had untolerated taint {ebs.csi.aws.com/agent-not-ready: }, 1 node(s) had untolerated taint {karpenter.sh/disrupted: }, 2 node(s) didn't match Pod's node affinity/selector, 2 node(s) had untolerated taint {eks.amazonaws.com/compute-type: fargate}, 3 node(s) had untolerated taint {astronomer.io/node-group: airflow-system}, 5 Insufficient memory. preemption: not eligible due to preemptionPolicy=Never. from None
[2026-04-10, 09:02:52 UTC] {pod_manager.py:150} ▲▲▲ Log group end
[2026-04-10, 11:07:15 UTC] {pod.py:177} INFO - Checking pod '...-fwg83fmr' in namespace '...' with poll interval 2.
[2026-04-10, 11:07:15 UTC] {pod_manager.py:138} ▼ Waiting until 600s to get the POD scheduled...
[2026-04-10, 11:07:16 UTC] {pod_manager.py:150} ▲▲▲ Log group end
[2026-04-10, 11:11:54 UTC] {pod.py:177} INFO - Checking pod '...-fwg83fmr' in namespace '...' with poll interval 2.
[2026-04-10, 11:11:54 UTC] {pod_manager.py:138} ▼ Waiting until 600s to get the POD scheduled...
[2026-04-10, 11:11:54 UTC] {pod_manager.py:150} ▲▲▲ Log group end
[2026-04-10, 11:18:55 UTC] {pod.py:177} INFO - Checking pod '...-fwg83fmr' in namespace '...' with poll interval 2.
[2026-04-10, 11:18:56 UTC] {pod_manager.py:138} ▼ Waiting until 600s to get the POD scheduled...
[2026-04-10, 11:18:56 UTC] {pod_manager.py:150} ▲▲▲ Log group end
[2026-04-10, 12:00:03 UTC] {pod.py:448} INFO - Deleting pod ...-fwg83fmr in namespace ....
[2026-04-10, 12:00:03 UTC] {pod.py:456} ERROR - Unexpected error while deleting pod ...-fwg83fmr
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/airflow/jobs/triggerer_job_runner.py", line 558, in cleanup_finished_triggers
    result = details["task"].result()
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/asgiref/sync.py", line 518, in thread_handler
    raise exc_info[1]
  File "/usr/local/lib/python3.12/site-packages/airflow/jobs/triggerer_job_runner.py", line 630, in run_trigger
    async for event in trigger.run():
  File "/usr/local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/triggers/pod.py", line 206, in run
    event = await self._wait_for_container_completion()
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/triggers/pod.py", line 340, in _wait_for_container_completion
    await asyncio.sleep(self.poll_interval)
  File "/usr/local/lib/python3.12/asyncio/tasks.py", line 665, in sleep
    return await future
           ^^^^^^^^^^^^
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/triggers/pod.py", line 450, in cleanup
    await self.hook.delete_pod(
  File "/usr/local/lib/python3.12/site-packages/tenacity/asyncio/__init__.py", line 189, in async_wrapped
    return await copy(fn, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/tenacity/asyncio/__init__.py", line 111, in __call__
    do = await self.iter(retry_state=retry_state)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/tenacity/asyncio/__init__.py", line 153, in iter
    result = await action(retry_state)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/tenacity/_utils.py", line 99, in inner
    return call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/tenacity/__init__.py", line 400, in <lambda>
    self._add_action_func(lambda rs: rs.outcome.result())
                                     ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.12/site-packages/tenacity/asyncio/__init__.py", line 114, in __call__
    result = await fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/hooks/kubernetes.py", line 1008, in delete_pod
    await v1_api.delete_namespaced_pod(
  File "/usr/local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/hooks/kubernetes.py", line 117, in call_api
    return await super().call_api(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/kubernetes_asyncio/client/api_client.py", line 192, in __call_api
    raise e
  File "/usr/local/lib/python3.12/site-packages/kubernetes_asyncio/client/api_client.py", line 185, in __call_api
    response_data = await self.request(
                    ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/kubernetes_asyncio/client/rest.py", line 239, in DELETE
    return (await self.request("DELETE", url,
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/kubernetes_asyncio/client/rest.py", line 206, in request
    raise ApiException(http_resp=r)
kubernetes_asyncio.client.exceptions.ApiException: (403)
Reason: Forbidden
HTTP response headers: <CIMultiDictProxy('Audit-Id': '96014b60-12ff-4e23-8ef2-15949b6bb0c4', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'X-Kubernetes-Pf-Flowschema-Uid': '332d44d3-abc1-4edf-9669-08749324024e', 'X-Kubernetes-Pf-Prioritylevel-Uid': '04963fcf-132d-4951-a31a-17392195da29', 'Date': 'Fri, 10 Apr 2026 12:00:03 GMT', 'Content-Length': '499')>
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \"...-fwg83fmr\" is forbidden: User \"system:serviceaccount:...:...-triggerer-serviceaccount\" cannot delete resource \"pods\" in API group \"\" in the namespace \"...\"","reason":"Forbidden","details":{"name":"...-fwg83fmr","kind":"pods"},"code":403}

Most notable are the `{pod_manager.py:150} ▲▲▲ Log group end` lines, which indicate that by this point the pod's phase was at least no longer pending. Given the other phases the pod could have been in, I believe the most likely scenario is a node communication issue that left the pod phase as `Unknown`. That this lets us break out of the `await_pod_start` loop feels incorrect: the operator should remain in the loop and be allowed to hit the scheduling timeout, just as if the pod were stuck in `Pending`.
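
The proposed behaviour can be sketched as follows. This is an illustrative minimal model, not the actual provider code: the function name `should_keep_waiting` and the loop structure are hypothetical, standing in for the phase check inside `await_pod_start`.

```python
from enum import Enum


class PodPhase(str, Enum):
    """The five pod phases defined by the Kubernetes API."""

    PENDING = "Pending"
    RUNNING = "Running"
    SUCCEEDED = "Succeeded"
    FAILED = "Failed"
    UNKNOWN = "Unknown"


def should_keep_waiting(phase: str) -> bool:
    """Return True while the startup loop should keep polling the pod.

    Previously, only `Pending` kept the loop alive, so a pod reporting
    `Unknown` (e.g. the kubelet lost contact with the API server) fell
    through and the startup wait ended without the pod ever running.
    Treating `Unknown` like `Pending` keeps the pod in the loop until the
    scheduling timeout, and still lets it recover if the node comes back.
    """
    return phase in (PodPhase.PENDING, PodPhase.UNKNOWN)
```

With this check, an `Unknown` phase keeps polling (and can still recover to `Running`), while any terminal or running phase exits the wait loop as before.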

boring-cyborg bot added the area:providers and provider:cncf-kubernetes (Kubernetes (k8s) provider related issues) labels Apr 14, 2026
@johnhoran
Contributor Author

I think the 403 error in the logs is irrelevant to the provider; it is just a missing permission in Astronomer.

Contributor

@Nataneljpwd left a comment


Looks good! This still allows the pod to recover if the node comes back, which is great!

@jscheffl jscheffl merged commit 4183ae5 into apache:main Apr 14, 2026
111 checks passed
