
Kubernetes Pod Operator: handle unknown pod phase#65202

Merged
jscheffl merged 2 commits into apache:main from johnhoran:pod_phase_unknown
Apr 14, 2026

Conversation

@johnhoran
Contributor

We had a recent task failure that we were only alerted to once it hit the dagrun timeout. The logs looked like:

[2026-04-10, 09:02:25 UTC] {pod.py:1425} INFO - Building pod ...-fwg83fmr with labels: {'dag_id': '...', 'task_id': '...-cdf8135eb', 'run_id': 'scheduled__2026-04-09T0900000000-020f81fe0', 'kubernetes_pod_operator': 'True', 'try_number': '1'}
[2026-04-10, 09:02:25 UTC] {pod.py:601} INFO - Found matching pod ...-fwg83fmr with labels {'airflow_kpo_in_cluster': 'True', 'airflow_version': '2.11.2-astro.2', 'app': 'airflow', 'astronomer.io/cloud_provider': 'aws', 'astronomer.io/cloud_region': 'us-west-2', 'astronomer.io/deploymentId': '...', 'astronomer.io/organizationId': '...', 'astronomer.io/workspaceId': '...', 'dag_id': '...', 'kubernetes_pod_operator': 'True', 'run_id': 'scheduled__2026-04-09T0900000000-020f81fe0', 'task_id': '...-cdf8135eb', 'try_number': '1'}
[2026-04-10, 09:02:25 UTC] {pod.py:602} INFO - `try_number` of task_instance: 1
[2026-04-10, 09:02:25 UTC] {pod.py:603} INFO - `try_number` of pod: 1
[2026-04-10, 09:02:25 UTC] {pod.py:895} WARNING - Could not resolve connection extras for deferral: connection `kubernetes_default` not found. Triggerer will try to resolve it from its own environment.
[2026-04-10, 09:02:25 UTC] {taskinstance.py:297} INFO - Pausing task as DEFERRED. dag_id=..., task_id=..._opportunity_daily_stage_run, run_id=scheduled__2026-04-09T09:00:00+00:00, execution_date=20260409T090000, start_date=20260410T090223
[2026-04-10, 09:02:25 UTC] {taskinstance.py:349} ▶ Post task execution logs
[2026-04-10, 09:02:26 UTC] {pod.py:177} INFO - Checking pod '...-fwg83fmr' in namespace '...' with poll interval 2.
[2026-04-10, 09:02:26 UTC] {pod_manager.py:138} ▼ Waiting until 600s to get the POD scheduled...
[2026-04-10, 09:02:26 UTC] {kubernetes.py:1160} WARNING - Kubernetes API does not permit watching events; falling back to polling: (403)
Reason: Forbidden: events is forbidden: User "system:serviceaccount:...:...-triggerer-serviceaccount" cannot watch resource "events" in API group "" in the namespace "..."
[2026-04-10, 09:02:26 UTC] {pod_manager.py:116} INFO - The Pod has an Event: 0/13 nodes are available: 1 node(s) had untolerated taint {karpenter.sh/disrupted: }, 2 node(s) didn't match Pod's node affinity/selector, 2 node(s) had untolerated taint {eks.amazonaws.com/compute-type: fargate}, 3 node(s) had untolerated taint {astronomer.io/node-group: airflow-system}, 5 Insufficient memory. preemption: not eligible due to preemptionPolicy=Never. from None
[2026-04-10, 09:02:31 UTC] {pod_manager.py:116} INFO - The Pod has an Event: Pod should schedule on: nodeclaim/airflow-worker-primary-9km4s from None
[2026-04-10, 09:02:36 UTC] {pod_manager.py:116} INFO - The Pod has an Event: 0/14 nodes are available: 1 node(s) had untolerated taint {ebs.csi.aws.com/agent-not-ready: }, 1 node(s) had untolerated taint {karpenter.sh/disrupted: }, 2 node(s) didn't match Pod's node affinity/selector, 2 node(s) had untolerated taint {eks.amazonaws.com/compute-type: fargate}, 3 node(s) had untolerated taint {astronomer.io/node-group: airflow-system}, 5 Insufficient memory. preemption: not eligible due to preemptionPolicy=Never. from None
[2026-04-10, 09:02:52 UTC] {pod_manager.py:150} ▲▲▲ Log group end
[2026-04-10, 11:07:15 UTC] {pod.py:177} INFO - Checking pod '...-fwg83fmr' in namespace '...' with poll interval 2.
[2026-04-10, 11:07:15 UTC] {pod_manager.py:138} ▼ Waiting until 600s to get the POD scheduled...
[2026-04-10, 11:07:16 UTC] {pod_manager.py:150} ▲▲▲ Log group end
[2026-04-10, 11:11:54 UTC] {pod.py:177} INFO - Checking pod '...-fwg83fmr' in namespace '...' with poll interval 2.
[2026-04-10, 11:11:54 UTC] {pod_manager.py:138} ▼ Waiting until 600s to get the POD scheduled...
[2026-04-10, 11:11:54 UTC] {pod_manager.py:150} ▲▲▲ Log group end
[2026-04-10, 11:18:55 UTC] {pod.py:177} INFO - Checking pod '...-fwg83fmr' in namespace '...' with poll interval 2.
[2026-04-10, 11:18:56 UTC] {pod_manager.py:138} ▼ Waiting until 600s to get the POD scheduled...
[2026-04-10, 11:18:56 UTC] {pod_manager.py:150} ▲▲▲ Log group end
[2026-04-10, 12:00:03 UTC] {pod.py:448} INFO - Deleting pod ...-fwg83fmr in namespace ....
[2026-04-10, 12:00:03 UTC] {pod.py:456} ERROR - Unexpected error while deleting pod ...-fwg83fmr
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/airflow/jobs/triggerer_job_runner.py", line 558, in cleanup_finished_triggers
    result = details["task"].result()
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/asgiref/sync.py", line 518, in thread_handler
    raise exc_info[1]
  File "/usr/local/lib/python3.12/site-packages/airflow/jobs/triggerer_job_runner.py", line 630, in run_trigger
    async for event in trigger.run():
  File "/usr/local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/triggers/pod.py", line 206, in run
    event = await self._wait_for_container_completion()
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/triggers/pod.py", line 340, in _wait_for_container_completion
    await asyncio.sleep(self.poll_interval)
  File "/usr/local/lib/python3.12/asyncio/tasks.py", line 665, in sleep
    return await future
           ^^^^^^^^^^^^
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/triggers/pod.py", line 450, in cleanup
    await self.hook.delete_pod(
  File "/usr/local/lib/python3.12/site-packages/tenacity/asyncio/__init__.py", line 189, in async_wrapped
    return await copy(fn, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/tenacity/asyncio/__init__.py", line 111, in __call__
    do = await self.iter(retry_state=retry_state)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/tenacity/asyncio/__init__.py", line 153, in iter
    result = await action(retry_state)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/tenacity/_utils.py", line 99, in inner
    return call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/tenacity/__init__.py", line 400, in <lambda>
    self._add_action_func(lambda rs: rs.outcome.result())
                                     ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.12/site-packages/tenacity/asyncio/__init__.py", line 114, in __call__
    result = await fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/hooks/kubernetes.py", line 1008, in delete_pod
    await v1_api.delete_namespaced_pod(
  File "/usr/local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/hooks/kubernetes.py", line 117, in call_api
    return await super().call_api(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/kubernetes_asyncio/client/api_client.py", line 192, in __call_api
    raise e
  File "/usr/local/lib/python3.12/site-packages/kubernetes_asyncio/client/api_client.py", line 185, in __call_api
    response_data = await self.request(
                    ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/kubernetes_asyncio/client/rest.py", line 239, in DELETE
    return (await self.request("DELETE", url,
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/kubernetes_asyncio/client/rest.py", line 206, in request
    raise ApiException(http_resp=r)
kubernetes_asyncio.client.exceptions.ApiException: (403)
Reason: Forbidden
HTTP response headers: <CIMultiDictProxy('Audit-Id': '96014b60-12ff-4e23-8ef2-15949b6bb0c4', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'X-Kubernetes-Pf-Flowschema-Uid': '332d44d3-abc1-4edf-9669-08749324024e', 'X-Kubernetes-Pf-Prioritylevel-Uid': '04963fcf-132d-4951-a31a-17392195da29', 'Date': 'Fri, 10 Apr 2026 12:00:03 GMT', 'Content-Length': '499')>
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \"...-fwg83fmr\" is forbidden: User \"system:serviceaccount:...:...-triggerer-serviceaccount\" cannot delete resource \"pods\" in API group \"\" in the namespace \"...\"","reason":"Forbidden","details":{"name":"...-fwg83fmr","kind":"pods"},"code":403}

Most notable are the `{pod_manager.py:150} ▲▲▲ Log group end` lines, which indicate that by this point the pod's phase was at least no longer pending. Given the other phases the pod could have been in, I believe the most likely scenario is a node communication issue that left the pod phase as `Unknown`. That this lets us break out of the `await_pod_start` loop feels incorrect: the operator should remain in the loop and be allowed to hit the scheduling timeout, just as if the pod were stuck in `Pending`.
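
The proposed behaviour can be sketched as follows. This is an illustrative minimal model, not the actual provider code: the function name `should_keep_waiting` and the loop structure are hypothetical, standing in for the phase check inside `await_pod_start`.

```python
from enum import Enum


class PodPhase(str, Enum):
    """The five pod phases defined by the Kubernetes API."""

    PENDING = "Pending"
    RUNNING = "Running"
    SUCCEEDED = "Succeeded"
    FAILED = "Failed"
    UNKNOWN = "Unknown"


def should_keep_waiting(phase: str) -> bool:
    """Return True while the startup loop should keep polling the pod.

    Previously, only `Pending` kept the loop alive, so a pod reporting
    `Unknown` (e.g. the kubelet lost contact with the API server) fell
    through and the startup wait ended without the pod ever running.
    Treating `Unknown` like `Pending` keeps the pod in the loop until the
    scheduling timeout, and still lets it recover if the node comes back.
    """
    return phase in (PodPhase.PENDING, PodPhase.UNKNOWN)
```

With this check, an `Unknown` phase keeps polling (and can still recover to `Running`), while any terminal or running phase exits the wait loop as before.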

boring-cyborg bot added the area:providers and provider:cncf-kubernetes (Kubernetes (k8s) provider related issues) labels Apr 14, 2026
@johnhoran
Contributor Author

I think the 403 error in the logs is irrelevant to the provider; it is just a missing permission in Astronomer.

Contributor

@Nataneljpwd left a comment


Looks good! This still allows the pod to recover if the node comes back, which is great!

@jscheffl jscheffl merged commit 4183ae5 into apache:main Apr 14, 2026
111 checks passed
