
Reduce irrelevant ERROR level logs from KubernetesPodOperator on pod runtime failure #36077

Closed
ketozhang opened this issue Dec 6, 2023 · 16 comments · Fixed by #37944
Labels: kind:bug, provider:cncf-kubernetes

Comments

@ketozhang
Contributor

Apache Airflow version

Other Airflow 2 version (please specify below)

What happened

Using a KubernetesPodOperator (KPO) that fails at runtime, with log_events_on_failure turned on, e.g. this trivial example,

KubernetesPodOperator(
  task_id=...,
  image=...,
  cmds=["/bin/bash", "-c", "-x"],
  arguments=["eccho"],  # typo
  log_events_on_failure=True,
)

produces log lines such as

[2023-12-05, 02:53:08 UTC] {before.py:40} INFO - Starting call to 'airflow.providers.cncf.kubernetes.utils.pod_manager.PodManager.fetch_container_logs.<locals>.consume_logs', this is the 1st time calling it.
[2023-12-05, 02:53:08 UTC] {pod_manager.py:381} INFO - + eccho
[2023-12-05, 02:53:08 UTC] {pod_manager.py:381} INFO - bash: eccho: command not found
[2023-12-05, 02:53:10 UTC] {pod.py:720} ERROR - Pod Event: Scheduled - Successfully assigned airflow/redacted_pod_name
[2023-12-05, 02:53:10 UTC] {pod.py:720} ERROR - Pod Event: Pulling - Pulling image "redacted:latest"
[2023-12-05, 02:53:10 UTC] {pod.py:720} ERROR - Pod Event: Pulled - Successfully pulled image "redacted:latest" in 142.533921ms (142.546182ms including waiting)
[2023-12-05, 02:53:10 UTC] {pod.py:720} ERROR - Pod Event: Created - Created container base
[2023-12-05, 02:53:10 UTC] {pod.py:720} ERROR - Pod Event: Started - Started container base
[2023-12-05, 02:53:10 UTC] {pod.py:726} INFO - Deleting pod: redacted_pod_name
[2023-12-05, 02:53:10 UTC] {taskinstance.py:1824} ERROR - Task failed with exception
Traceback (most recent call last):
remote_pod: {'api_version': 'v1',
 'kind': 'Pod',
 'metadata': {'annotations': None,
  ... # VERY LONG JSON
}
...

most of which are irrelevant to the reason for the failure (i.e., the runtime container exited nonzero, with the cause in stderr).

What you think should happen instead

  • Use a reasonable log level for each pod event type and reason (see the sketch below):
    • Normal -> logging.DEBUG
    • (Warning, Failed) -> logging.ERROR
  • Print the pod's runtime stderr at logging.ERROR
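
For illustration, a minimal sketch of that mapping (a hypothetical helper, not the provider's current code):

import logging

def log_pod_event(log: logging.Logger, event) -> None:
    # Kubernetes events only carry a `type` of "Normal" or "Warning";
    # reasons such as Failed or FailedScheduling come with type "Warning".
    level = logging.DEBUG if event.type == "Normal" else logging.ERROR
    log.log(level, "Pod Event: %s - %s", event.reason, event.message)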

How to reproduce

See above.

Operating System

Debian Bookworm

Versions of Apache Airflow Providers

apache-airflow==2.6.2
apache-airflow-providers-amazon==8.2.0
apache-airflow-providers-cncf-kubernetes==7.1.0
apache-airflow-providers-common-sql==1.5.1
apache-airflow-providers-ftp==3.4.1
apache-airflow-providers-http==4.4.1
apache-airflow-providers-imap==3.2.1
apache-airflow-providers-postgres==5.5.1
apache-airflow-providers-sqlite==3.4.1

Deployment

Docker-Compose

Deployment details

No response

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@ketozhang added the area:core, kind:bug, and needs-triage labels on Dec 6, 2023
@ketozhang
Contributor Author

A related closed issue with a vague scope: #10861


Logging all events at ERROR level still exists in main:

def _read_pod_events(self, pod, *, reraise=True):
    """Will fetch and emit events from pod."""
    with _optionally_suppress(reraise=reraise):
        for event in self.pod_manager.read_pod_events(pod).items:
            self.log.error("Pod Event: %s - %s", event.reason, event.message)


Understandably, Kubernetes does not separate stdout from stderr in its container log API, which is why KPO logs container output at INFO.

One option I can think of is to make this function aware of the container status (beyond just whether it is running):

def consume_logs(*, since_time: DateTime | None = None) -> DateTime | None:
    """
    Try to follow container logs until container completes.
    For a long-running container, sometimes the log read may be interrupted
    Such errors of this kind are suppressed.
    Returns the last timestamp observed in logs.
    """
    last_captured_timestamp = None
    try:
        logs = self.read_pod_logs(
            pod=pod,
            container_name=container_name,
            timestamps=True,
            since_seconds=(
                math.ceil((pendulum.now() - since_time).total_seconds()) if since_time else None
            ),
            follow=follow,
            post_termination_timeout=post_termination_timeout,
        )
        message_to_log = None
        message_timestamp = None
        progress_callback_lines = []
        try:
            for raw_line in logs:
                line = raw_line.decode("utf-8", errors="backslashreplace")
                line_timestamp, message = self.parse_log_line(line)
                if line_timestamp:  # detect new log line
                    if message_to_log is None:  # first line in the log
                        message_to_log = message
                        message_timestamp = line_timestamp
                        progress_callback_lines.append(line)
                    else:  # previous log line is complete
                        if self._progress_callback:
                            for line in progress_callback_lines:
                                self._progress_callback(line)
                        self.log.info("[%s] %s", container_name, message_to_log)
                        last_captured_timestamp = message_timestamp
                        message_to_log = message
                        message_timestamp = line_timestamp
                        progress_callback_lines = [line]
                else:  # continuation of the previous log line
                    message_to_log = f"{message_to_log}\n{message}"
                    progress_callback_lines.append(line)
        finally:
            # log the last line and update the last_captured_timestamp
            if self._progress_callback:
                for line in progress_callback_lines:
                    self._progress_callback(line)
            self.log.info("[%s] %s", container_name, message_to_log)
            last_captured_timestamp = message_timestamp
    except BaseHTTPError:
        self.log.exception(
            "Reading of logs interrupted for container %r; will retry.",
            container_name,
        )
    return last_captured_timestamp or since_time

# note: `read_pod_logs` follows the logs, so we shouldn't necessarily *need* to
# loop as we do here. But in a long-running process we might temporarily lose connectivity.
# So the looping logic is there to let us resume following the logs.
last_log_time = since_time
while True:
    last_log_time = consume_logs(since_time=last_log_time)
    if not self.container_is_running(pod, container_name=container_name):
        return PodLoggingStatus(running=False, last_log_time=last_log_time)
    if not follow:
        return PodLoggingStatus(running=True, last_log_time=last_log_time)
    else:
        self.log.warning(
            "Pod %s log read interrupted but container %s still running",
            pod.metadata.name,
            container_name,
        )
        time.sleep(1)
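
For example, a minimal sketch (a hypothetical free-standing helper, not the PodManager implementation) of choosing the level for container output from the container's terminal state:

import logging

def log_container_line(log: logging.Logger, pod, container_name: str, message: str) -> None:
    # Emit container output at ERROR only when the container has terminated
    # with a nonzero exit code; otherwise keep it at INFO.
    exit_code = None
    for status in pod.status.container_statuses or []:
        if status.name == container_name and status.state and status.state.terminated:
            exit_code = status.state.terminated.exit_code
    level = logging.ERROR if exit_code not in (None, 0) else logging.INFO
    log.log(level, "[%s] %s", container_name, message)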

The other option is to give the user the ability to write stderr (or anything else) to /dev/termination-log in KPO (the same can be done in @task.kubernetes with the task logger):

https://kubernetes.io/docs/tasks/debug/debug-application/determine-reason-pod-failure/
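
For context, a hedged sketch (pod and namespace names here are placeholders) of how a nonzero exit and its termination message surface in the pod status via the Kubernetes Python client:

from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

pod = v1.read_namespaced_pod(name="redacted-pod-name", namespace="airflow")
for status in pod.status.container_statuses or []:
    terminated = status.state.terminated if status.state else None
    if terminated and terminated.exit_code != 0:
        # `message` is filled from /dev/termination-log (or from the container logs
        # when terminationMessagePolicy is FallbackToLogsOnError).
        print(f"[{status.name}] exited {terminated.exit_code}: {terminated.message}")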

That last part seems out of scope; I can open another ticket if there is interest.

@jscheffl added the provider:cncf-kubernetes label and removed the area:core and needs-triage labels on Dec 20, 2023
@sudiptob2
Contributor

Hi, I am an MLH fellowship intern at Airflow. I would like to work on this issue.

@potiuk
Member

potiuk commented Feb 10, 2024

Assigned you :)

@sudiptob2
Contributor

Hi @ketozhang

I'm currently trying to reproduce the problem. I've set up these test DAGs. As per the settings, I should receive pod events in the log for test_fail_dag, but despite setting log_events_on_failure=True, I haven't received any. There may be a misconfiguration in my DAG files.
I'm working with Airflow version 2.8.1, although this shouldn't be the reason pod events aren't being logged.

Any thoughts on how to reproduce the issue?
Test run log: Here
Ref: https://github.com/sudiptob2/airflow-kpo-test

@ketozhang
Contributor Author

A Pod Event failure happens on the Kubernetes side when the Pod itself fails. An example is when the pod's requested resources are much higher than what's available in the cluster.

In your test, the pod launched and ran the Python code successfully, even though the Python code raised an exception. That is not a Pod failure, since the pod did what it was supposed to do (run the Python code).

You can try more realistic scenarios, such as requesting far more CPU and RAM than the cluster has.

Perhaps others here can help point you to existing test cases that demonstrate a Pod failure, with either real or mocked scenarios.

@sudiptob2
Contributor

Thanks 🚀 I was also thinking about that but wasn't sure.

@sudiptob2
Contributor

I attempted to update the DAGs in various ways to trigger a Pod failure event, but I have not been able to get one logged.

DAGs I am using: dev/dags
Test run log for dev/dags/test_fail_dag.py: LOGFILE
Ref: https://github.com/sudiptob2/airflow-kpo-test

Any feedback from an experienced user or maintainer would be helpful.

@ketozhang
Contributor Author

ketozhang commented Feb 15, 2024

You seem to be getting a different error, more of a Kubernetes API permissions issue, which isn't related to the Pod itself.

Let's try to reduce your DAG file to a minimal viable example:

from kubernetes.client import models as k8s

dag = DAG('test_fail_dag', schedule_interval='@once', catchup=False)

task = KubernetesPodOperator(
    task_id='test_fail_task',
    dag=dag,
    cmds=["/bin/bash", "-c", "-x"],
    arguments=["eccho"],  # purposeful typo
    container_resources=k8s.V1ResourceRequirements(requests={"memory": "1000Gi", "cpu": "1000"}),
    log_events_on_failure=True,
)

Assuming your cluster isn't that large, this should give you something like Pod Event: FailedScheduling - ... in the logs.

@sudiptob2
Contributor

@ketozhang Thanks! Will try this shortly.

@ketozhang
Contributor Author

If you wouldn't mind, can you do one run with container_resources set and another without?

I forgot that the original issue I outlined here is that we get Pod Event failures logged when we are not supposed to. With container_resources, you should expect a Pod Event failure. Without container_resources, you should expect no Pod Event failure; I did see one, which is why I raised this ticket.

@sudiptob2
Contributor

Yes, sure. Just to mention, I already tried this DAG, which is similar to the one you suggested, but did not get any Pod events.

Testing your suggested DAG now.

@sudiptob2
Contributor

Hi @ketozhang
I ran two DAGs according to your suggestion but could not get the Pod Event log.

  1. minimal_test_fail_dag_without_cr.py: run without container_resources set, see Logfile
  2. minimal_test_fail.py: run with container_resources set, see Logfile.

@ketozhang
Contributor Author

It looks like you're not getting any event failures because of a Kubernetes permissions issue. In your logs, you're getting a 403 error:

HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"events is forbidden: User "system:serviceaccount:default:default" cannot list resource "events" in API group "" in the namespace "default"","reason":"Forbidden","details":{"kind":"events"},"code":403}

I'm not familiar with the Helm chart you're using, airflow-k8s-maxcotec. I'd look into your chart's documentation or try a different Helm chart.
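
(For reference, the missing permission can typically be granted with an RBAC Role/RoleBinding along these lines; the role name and namespace below are assumptions, adjust them to your deployment.)

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: airflow-pod-events-reader  # hypothetical name
  namespace: default
rules:
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: airflow-pod-events-reader
  namespace: default
subjects:
  - kind: ServiceAccount
    name: default  # the service account named in the 403 message
    namespace: default
roleRef:
  kind: Role
  name: airflow-pod-events-reader
  apiGroup: rbac.authorization.k8s.io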

@sudiptob2
Contributor

Thanks for the hint ✅

@sudiptob2
Contributor

@ketozhang Thank you so much. There was a problem with the RBAC; I updated it and now I can get Pod events:

[2024-02-15, 23:26:26 UTC] {pod.py:980} ERROR - (403)
[2024-02-15, 23:26:26 UTC] {taskinstance.py:2698} ERROR - Task failed with exception
[2024-02-15, 23:26:26 UTC] {standard_task_runner.py:107} ERROR - Failed to execute job 2 for task minimal_test_fail_task_without_cr (Pod minimal-test-fail-task-without-cr-rp9u952i returned a failure.
[2024-02-16, 00:45:26 UTC] {pod.py:732} ERROR - Pod Event: Scheduled - Successfully assigned default/minimal-test-fail-task-without-cr-kfnmnrcd to kind-worker
[2024-02-16, 00:45:26 UTC] {pod.py:732} ERROR - Pod Event: Pulling - Pulling image "alpine:latest"
[2024-02-16, 00:45:26 UTC] {pod.py:732} ERROR - Pod Event: Pulled - Successfully pulled image "alpine:latest" in 566ms (566ms including waiting)
[2024-02-16, 00:45:26 UTC] {pod.py:732} ERROR - Pod Event: Created - Created container base
[2024-02-16, 00:45:26 UTC] {pod.py:732} ERROR - Pod Event: Started - Started container base
[2024-02-16, 00:45:26 UTC] {taskinstance.py:2698} ERROR - Task failed with exception
[2024-02-16, 00:45:26 UTC] {standard_task_runner.py:107} ERROR - Failed to execute job 2 for task minimal_test_fail_task_without_cr (Pod minimal-test-fail-task-without-cr-kfnmnrcd returned a failure.

@sudiptob2
Contributor

Hi @ketozhang, I implemented a solution here: #37944. Could you please take a look?
