Reduce irrelevant ERROR level logs from KubernetesPodOperator on pod runtime failure #36077
Related closed issue with a vague scope: #10861. The logging of all events as ERROR still exists in main (airflow/airflow/providers/cncf/kubernetes/operators/pod.py, lines 780 to 785 at ace97c0).

Understandably, Kubernetes does not isolate stdout from stderr in its container log API, which is why KPO was implemented to log container output at INFO. The options I can think of are:

- make this function aware of the container status, beyond just "running" (sketched below): airflow/airflow/providers/cncf/kubernetes/utils/pod_manager.py, lines 398 to 473 at ace97c0, or
- give the user the ability to write stderr (or anything) to the pod's termination message (https://kubernetes.io/docs/tasks/debug/debug-application/determine-reason-pod-failure/).

The last part seems out of scope; I can make another ticket if there's interest.
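To make the first option concrete, here's a rough sketch (not the provider's code; the helper name and the "base" container name are just illustrative): only use ERROR once the container is known to have terminated with a nonzero exit code, and keep everything else at INFO.

```python
# Rough sketch only; not the actual pod_manager implementation.
import logging

from kubernetes import client


def level_for_container(pod: client.V1Pod, container_name: str = "base") -> int:
    """Pick a log level from the container's state: ERROR only if it terminated nonzero."""
    for status in pod.status.container_statuses or []:
        if status.name != container_name:
            continue
        terminated = status.state.terminated if status.state else None
        if terminated is not None and (terminated.exit_code or 0) != 0:
            return logging.ERROR
    return logging.INFO


# e.g. inside the loop that consumes fetched log lines / events:
#     log.log(level_for_container(remote_pod), line)
```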
Hi, I am an MLH fellowship intern at Airflow. I would like to work on this issue.

Assigned you :)
Hi @ketozhang, I'm currently trying to reproduce the problem. I've set up these test DAGs. As per the settings, I should receive pod events in the log on failure. Any thoughts on how to reproduce the issue?
A pod event failure happens on the Kubernetes side when a Pod fails. An example of a failure is when the pod's requested resources are much higher than what's available in the cluster. In your test, your pod launched successfully and ran the Python code, even though Python raised an Exception. This is not a Pod failure, since the pod did what it's supposed to do (run the Python code). You can try various realistic scenarios like requesting a large amount of CPU and RAM. Perhaps others here can point you to existing test cases that demonstrate a Pod failure, either with real or mocked scenarios.
Thanks 🚀 I was also thinking about that but wasn't sure.
I attempted to update the DAGs in various ways to generate a Pod failure event. However, I have not been able to generate a log for the Pod failure event. DAGs I am using: dev/dags. Any feedback from an experienced user/maintainer would be helpful.
You seem to be getting a different error, more of a Kubernetes API permission issue, which isn't related to the Pod. Let's try to reduce your DAG file to a minimal viable example:

```python
from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

dag = DAG('test_fail_dag', schedule_interval='@once', catchup=False)
task = KubernetesPodOperator(
    task_id='test_fail_task',
    dag=dag,
    image='debian:bookworm-slim',  # any small image with bash
    cmds=["/bin/bash", "-c", "-x"],
    arguments=["eccho"],  # purposeful typo
    container_resources={"requests": {"memory": "1000Gi", "cpu": "1000"}},
    log_events_on_failure=True,
)
```

Assuming your cluster isn't that large, this should produce a Pod failure event in the logs.
@ketozhang Thanks! Will try this shortly.
If you wouldn't mind, can you do one run with and another without? I forgot that the original issue I outlined here was that we are getting Pod Event failures when we are not supposed to.
Yes, sure. Just to mention, I already tried this DAG, which is similar to the one you suggested, but did not get a Pod event. Testing your suggested DAG now.
Hi @ketozhang
It looks like you're not getting any failure events because of a Kubernetes permission issue. In your logs, you're getting a 403 error. I'm not familiar with the Helm chart you're using, but you'll likely need to adjust its RBAC so that the service account Airflow uses can read pod events; see the sketch below.
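A rough sketch of the kind of grant that's usually needed, here via the Kubernetes Python client; the namespace, role, and service account names are placeholders, so adjust them to whatever your chart actually creates:

```python
from kubernetes import client, config

NAMESPACE = "airflow"  # placeholder; use the namespace your tasks run in

config.load_kube_config()
rbac = client.RbacAuthorizationV1Api()

# Role: allow reading/listing events in the namespace (what fetching Pod events requires).
rbac.create_namespaced_role(
    namespace=NAMESPACE,
    body={
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": "pod-event-reader"},
        "rules": [{"apiGroups": [""], "resources": ["events"], "verbs": ["get", "list"]}],
    },
)

# RoleBinding: bind the role to the service account your Airflow workers/tasks use.
rbac.create_namespaced_role_binding(
    namespace=NAMESPACE,
    body={
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": "pod-event-reader-binding"},
        "roleRef": {"apiGroup": "rbac.authorization.k8s.io", "kind": "Role", "name": "pod-event-reader"},
        "subjects": [{"kind": "ServiceAccount", "name": "airflow-worker", "namespace": NAMESPACE}],
    },
)
```

The same rules can of course go into the chart's RBAC values or a plain manifest instead of being created from Python.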
Thanks for the hint ✅
@ketozhang Thank you so much. There was a problem in the RBAC; I updated it, and now I can get Pod events.
Hi @ketozhang, I implemented a solution here: #37944. Could you please take a look?
Apache Airflow version
Other Airflow 2 version (please specify below)
What happened
Using a KPO task that fails at runtime with log_events_on_failure turned on (a trivial example) returns various lines of ERROR-level logs that are irrelevant to the reason for the failure (i.e., the runtime container exited nonzero with something written to $STDERR).
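A minimal sketch of the kind of trivial example meant here (the image and names are illustrative, not the exact DAG from the report): the container itself exits nonzero and writes to stderr, while the Pod is scheduled and runs fine.

```python
from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG("kpo_runtime_failure_example", schedule_interval="@once", catchup=False) as dag:
    KubernetesPodOperator(
        task_id="fail_at_runtime",
        image="python:3.11-slim",  # placeholder image
        cmds=["python", "-c"],
        arguments=["import sys; sys.exit('boom')"],  # 'boom' goes to stderr, exit code 1
        log_events_on_failure=True,
    )
```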
What you think should happen instead
logging.DEBUG
logging.ERROR
logging.ERROR
How to reproduce
See above.
Operating System
Debian Bookworm
Versions of Apache Airflow Providers
Deployment
Docker-Compose
Deployment details
No response
Anything else
No response
Are you willing to submit PR?
Code of Conduct