
Reduce irrelevant ERROR level logs from KubernetesPodOperator on pod runtime failure #36077

Closed
ketozhang opened this issue Dec 6, 2023 · 16 comments · Fixed by #37944
Labels: kind:bug, provider:cncf-kubernetes

Comments

@ketozhang
Contributor

Apache Airflow version

Other Airflow 2 version (please specify below)

What happened

Using a KubernetesPodOperator (KPO) that fails at runtime, with log_events_on_failure turned on, e.g. this trivial example,

KubernetesPodOperator(
  task_id=...,
  image=...,
  cmds=["/bin/bash", "-c", "-x"],
  arguments=["eccho"],  # typo
  log_events_on_failure=True,
)

produces log lines such as

[2023-12-05, 02:53:08 UTC] {before.py:40} INFO - Starting call to 'airflow.providers.cncf.kubernetes.utils.pod_manager.PodManager.fetch_container_logs.<locals>.consume_logs', this is the 1st time calling it.
[2023-12-05, 02:53:08 UTC] {pod_manager.py:381} INFO - + eccho
[2023-12-05, 02:53:08 UTC] {pod_manager.py:381} INFO - bash: eccho: command not found
[2023-12-05, 02:53:10 UTC] {pod.py:720} ERROR - Pod Event: Scheduled - Successfully assigned airflow/redacted_pod_name
[2023-12-05, 02:53:10 UTC] {pod.py:720} ERROR - Pod Event: Pulling - Pulling image "redacted:latest"
[2023-12-05, 02:53:10 UTC] {pod.py:720} ERROR - Pod Event: Pulled - Successfully pulled image "redacted:latest" in 142.533921ms (142.546182ms including waiting)
[2023-12-05, 02:53:10 UTC] {pod.py:720} ERROR - Pod Event: Created - Created container base
[2023-12-05, 02:53:10 UTC] {pod.py:720} ERROR - Pod Event: Started - Started container base
[2023-12-05, 02:53:10 UTC] {pod.py:726} INFO - Deleting pod: redacted_pod_name
[2023-12-05, 02:53:10 UTC] {taskinstance.py:1824} ERROR - Task failed with exception
Traceback (most recent call last):
remote_pod: {'api_version': 'v1',
 'kind': 'Pod',
 'metadata': {'annotations': None,
  ... # VERY LONG JSON
}
...

most of which are irrelevant to the reason for the failure (i.e., the runtime container exited nonzero, with the cause in stderr).

What you think should happen instead

  • Use a reasonable log level for each pod event type and reason (see the sketch below):
    • Normal -> logging.DEBUG
    • (Warning, Failed) -> logging.ERROR
  • Print the pod's runtime stderr at logging.ERROR
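
For illustration, a minimal sketch of that mapping (a hypothetical helper, not the provider's current code):

import logging

def log_pod_event(log: logging.Logger, event) -> None:
    # Kubernetes events only carry a `type` of "Normal" or "Warning";
    # reasons such as Failed or FailedScheduling come with type "Warning".
    level = logging.DEBUG if event.type == "Normal" else logging.ERROR
    log.log(level, "Pod Event: %s - %s", event.reason, event.message)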

How to reproduce

See above.

Operating System

Debian Bookworm

Versions of Apache Airflow Providers

apache-airflow==2.6.2
apache-airflow-providers-amazon==8.2.0
apache-airflow-providers-cncf-kubernetes==7.1.0
apache-airflow-providers-common-sql==1.5.1
apache-airflow-providers-ftp==3.4.1
apache-airflow-providers-http==4.4.1
apache-airflow-providers-imap==3.2.1
apache-airflow-providers-postgres==5.5.1
apache-airflow-providers-sqlite==3.4.1

Deployment

Docker-Compose

Deployment details

No response

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@ketozhang added the area:core, kind:bug, and needs-triage labels on Dec 6, 2023
@ketozhang
Contributor Author

A related closed issue with a vague scope: #10861


Logging all events at ERROR level still exists in main:

def _read_pod_events(self, pod, *, reraise=True):
    """Will fetch and emit events from pod."""
    with _optionally_suppress(reraise=reraise):
        for event in self.pod_manager.read_pod_events(pod).items:
            self.log.error("Pod Event: %s - %s", event.reason, event.message)


Understandably, Kubernetes does not separate stdout from stderr in its container log API, which is why KPO logs container output at INFO.

One option I can think of is to make this function aware of the container status (beyond just whether it is running):

def consume_logs(*, since_time: DateTime | None = None) -> DateTime | None:
    """
    Try to follow container logs until container completes.
    For a long-running container, sometimes the log read may be interrupted
    Such errors of this kind are suppressed.
    Returns the last timestamp observed in logs.
    """
    last_captured_timestamp = None
    try:
        logs = self.read_pod_logs(
            pod=pod,
            container_name=container_name,
            timestamps=True,
            since_seconds=(
                math.ceil((pendulum.now() - since_time).total_seconds()) if since_time else None
            ),
            follow=follow,
            post_termination_timeout=post_termination_timeout,
        )
        message_to_log = None
        message_timestamp = None
        progress_callback_lines = []
        try:
            for raw_line in logs:
                line = raw_line.decode("utf-8", errors="backslashreplace")
                line_timestamp, message = self.parse_log_line(line)
                if line_timestamp:  # detect new log line
                    if message_to_log is None:  # first line in the log
                        message_to_log = message
                        message_timestamp = line_timestamp
                        progress_callback_lines.append(line)
                    else:  # previous log line is complete
                        if self._progress_callback:
                            for line in progress_callback_lines:
                                self._progress_callback(line)
                        self.log.info("[%s] %s", container_name, message_to_log)
                        last_captured_timestamp = message_timestamp
                        message_to_log = message
                        message_timestamp = line_timestamp
                        progress_callback_lines = [line]
                else:  # continuation of the previous log line
                    message_to_log = f"{message_to_log}\n{message}"
                    progress_callback_lines.append(line)
        finally:
            # log the last line and update the last_captured_timestamp
            if self._progress_callback:
                for line in progress_callback_lines:
                    self._progress_callback(line)
            self.log.info("[%s] %s", container_name, message_to_log)
            last_captured_timestamp = message_timestamp
    except BaseHTTPError:
        self.log.exception(
            "Reading of logs interrupted for container %r; will retry.",
            container_name,
        )
    return last_captured_timestamp or since_time

# note: `read_pod_logs` follows the logs, so we shouldn't necessarily *need* to
# loop as we do here. But in a long-running process we might temporarily lose connectivity.
# So the looping logic is there to let us resume following the logs.
last_log_time = since_time
while True:
    last_log_time = consume_logs(since_time=last_log_time)
    if not self.container_is_running(pod, container_name=container_name):
        return PodLoggingStatus(running=False, last_log_time=last_log_time)
    if not follow:
        return PodLoggingStatus(running=True, last_log_time=last_log_time)
    else:
        self.log.warning(
            "Pod %s log read interrupted but container %s still running",
            pod.metadata.name,
            container_name,
        )
        time.sleep(1)
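
For example, a minimal sketch (a hypothetical free-standing helper, not the PodManager implementation) of choosing the level for container output from the container's terminal state:

import logging

def log_container_line(log: logging.Logger, pod, container_name: str, message: str) -> None:
    # Emit container output at ERROR only when the container has terminated
    # with a nonzero exit code; otherwise keep it at INFO.
    exit_code = None
    for status in pod.status.container_statuses or []:
        if status.name == container_name and status.state and status.state.terminated:
            exit_code = status.state.terminated.exit_code
    level = logging.ERROR if exit_code not in (None, 0) else logging.INFO
    log.log(level, "[%s] %s", container_name, message)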

The other option is to give the user the ability to write stderr (or anything else) to /dev/termination-log in KPO (the same can be done in @task.kubernetes with the task logger):

https://kubernetes.io/docs/tasks/debug/debug-application/determine-reason-pod-failure/
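
For context, a hedged sketch (pod and namespace names here are placeholders) of how a nonzero exit and its termination message surface in the pod status via the Kubernetes Python client:

from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

pod = v1.read_namespaced_pod(name="redacted-pod-name", namespace="airflow")
for status in pod.status.container_statuses or []:
    terminated = status.state.terminated if status.state else None
    if terminated and terminated.exit_code != 0:
        # `message` is filled from /dev/termination-log (or from the container logs
        # when terminationMessagePolicy is FallbackToLogsOnError).
        print(f"[{status.name}] exited {terminated.exit_code}: {terminated.message}")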

That last part seems out of scope; I can open another ticket if there is interest.

@jscheffl added the provider:cncf-kubernetes label and removed the area:core and needs-triage labels on Dec 20, 2023
@sudiptob2
Contributor

Hi, I am an MLH fellowship intern at Airflow. I would like to work on this issue.

@potiuk
Member

potiuk commented Feb 10, 2024

Assigned you :)

@sudiptob2
Contributor

Hi @ketozhang

I'm currently trying to reproduce the problem. I've set up these test DAGs. As per the settings, I should receive pod events in the log for test_fail_dag, but despite setting log_events_on_failure=True, I haven't received any. There may be a misconfiguration in my DAG files.
I'm working with Airflow version 2.8.1, although this shouldn't be the reason pod events aren't being logged.

Any thoughts on how to reproduce the issue?
Test run log: Here
Ref: https://github.com/sudiptob2/airflow-kpo-test

@ketozhang
Contributor Author

A Pod Event failure happens on the Kubernetes side when the Pod itself fails. An example is when the pod's requested resources are much higher than what's available in the cluster.

In your test, the pod launched and ran the Python code successfully, even though the Python code raised an exception. That is not a Pod failure, since the pod did what it was supposed to do (run the Python code).

You can try more realistic scenarios, such as requesting far more CPU and RAM than the cluster has.

Perhaps others here can help point you to existing test cases that demonstrate a Pod failure, with either real or mocked scenarios.

@sudiptob2
Contributor

Thanks 🚀 I was also thinking about that but wasn't sure.

@sudiptob2
Contributor

I attempted to update the DAGs in various ways to trigger a Pod failure event, but I have not been able to get one logged.

DAGs I am using: dev/dags
Test run log for dev/dags/test_fail_dag.py: LOGFILE
Ref: https://github.com/sudiptob2/airflow-kpo-test

Any feedback from an experienced user or maintainer would be helpful.

@ketozhang
Contributor Author

ketozhang commented Feb 15, 2024

You seem to be getting a different error, more of a Kubernetes API permissions issue, which isn't related to the Pod itself.

Let's try to reduce your DAG file to a minimal viable example:

from kubernetes.client import models as k8s

dag = DAG('test_fail_dag', schedule_interval='@once', catchup=False)

task = KubernetesPodOperator(
    task_id='test_fail_task',
    dag=dag,
    cmds=["/bin/bash", "-c", "-x"],
    arguments=["eccho"],  # purposeful typo
    container_resources=k8s.V1ResourceRequirements(requests={"memory": "1000Gi", "cpu": "1000"}),
    log_events_on_failure=True,
)

Assuming your cluster isn't that large, this should give you something like Pod Event: FailedScheduling - ... in the logs.

@sudiptob2
Contributor

@ketozhang Thanks! Will try this shortly.

@ketozhang
Contributor Author

If you wouldn't mind, can you do one run with container_resources set and another without?

I forgot that the original issue I outlined here is that we get Pod Event failures logged when we are not supposed to. With container_resources, you should expect a Pod Event failure. Without container_resources, you should expect no Pod Event failure; I did see one, which is why I raised this ticket.

@sudiptob2
Contributor

Yes, sure. Just to mention, I already tried this DAG, which is similar to the one you suggested, but did not get any Pod events.

Testing your suggested DAG now.

@sudiptob2
Contributor

Hi @ketozhang
I ran two DAGs according to your suggestion but could not get the Pod Event log.

  1. minimal_test_fail_dag_without_cr.py: run without container_resources set, see Logfile
  2. minimal_test_fail.py: run with container_resources set, see Logfile.

@ketozhang
Contributor Author

It looks like you're not getting any event failures because of a Kubernetes permissions issue. In your logs, you're getting a 403 error:

HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"events is forbidden: User "system:serviceaccount:default:default" cannot list resource "events" in API group "" in the namespace "default"","reason":"Forbidden","details":{"kind":"events"},"code":403}

I'm not familiar with the Helm chart you're using, airflow-k8s-maxcotec. I'd look into your chart's documentation or try a different Helm chart.
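
(For reference, the missing permission can typically be granted with an RBAC Role/RoleBinding along these lines; the role name and namespace below are assumptions, adjust them to your deployment.)

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: airflow-pod-events-reader  # hypothetical name
  namespace: default
rules:
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: airflow-pod-events-reader
  namespace: default
subjects:
  - kind: ServiceAccount
    name: default  # the service account named in the 403 message
    namespace: default
roleRef:
  kind: Role
  name: airflow-pod-events-reader
  apiGroup: rbac.authorization.k8s.io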

@sudiptob2
Contributor

Thanks for the hint ✅

@sudiptob2
Contributor

@ketozhang Thank you so much. There was a problem with the RBAC; I updated it and now I can get Pod events:

[2024-02-15, 23:26:26 UTC] {pod.py:980} ERROR - (403)
[2024-02-15, 23:26:26 UTC] {taskinstance.py:2698} ERROR - Task failed with exception
[2024-02-15, 23:26:26 UTC] {standard_task_runner.py:107} ERROR - Failed to execute job 2 for task minimal_test_fail_task_without_cr (Pod minimal-test-fail-task-without-cr-rp9u952i returned a failure.
[2024-02-16, 00:45:26 UTC] {pod.py:732} ERROR - Pod Event: Scheduled - Successfully assigned default/minimal-test-fail-task-without-cr-kfnmnrcd to kind-worker
[2024-02-16, 00:45:26 UTC] {pod.py:732} ERROR - Pod Event: Pulling - Pulling image "alpine:latest"
[2024-02-16, 00:45:26 UTC] {pod.py:732} ERROR - Pod Event: Pulled - Successfully pulled image "alpine:latest" in 566ms (566ms including waiting)
[2024-02-16, 00:45:26 UTC] {pod.py:732} ERROR - Pod Event: Created - Created container base
[2024-02-16, 00:45:26 UTC] {pod.py:732} ERROR - Pod Event: Started - Started container base
[2024-02-16, 00:45:26 UTC] {taskinstance.py:2698} ERROR - Task failed with exception
[2024-02-16, 00:45:26 UTC] {standard_task_runner.py:107} ERROR - Failed to execute job 2 for task minimal_test_fail_task_without_cr (Pod minimal-test-fail-task-without-cr-kfnmnrcd returned a failure.

@sudiptob2
Contributor

Hi @ketozhang, I implemented a solution here: #37944. Could you please take a look?
