Running task logs is incomplete #91
Comments
I could not reproduce the error. I may be able to check this more later on, but it may take a couple of weeks. Looking at the log, I think there are two possible scenarios.
This leads me to think that you may have some time limit in the underlying Airflow, or maybe there is a missing command in the operator to update the task in the case of long runs... not sure. Also, I find the 4-hour mark very strange: if this were an error/disconnect and not a timeout, I would expect random execution times rather than a stable time span. Is there a max connection time there? Were there other runs that lasted less or more time? To check that the timeout is not coming from Airflow:
To check whether the timeout is a connection/Kubernetes timeout (a control DAG along the lines of the sketch below can help separate the two cases):
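For example, a hypothetical control DAG for these two checks might look like the following. This is a sketch only, not from this thread's attachments: it assumes the standard BashOperator and the cncf-kubernetes provider's KubernetesPodOperator (whose import path varies by provider version), and the task names and namespace are placeholders.

# Control DAG sketch: the BashOperator sleeps past the 4-hour mark entirely on the
# Airflow worker, so a cutoff there points at Airflow itself; the KubernetesPodOperator
# does the same sleep in the cluster and streams logs over the Kubernetes API, so a
# cutoff there points at the connection/cluster side.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG(
    "timeout_controls",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Control 1: no Kubernetes involved.
    BashOperator(
        task_id="bash_sleep_5h",
        bash_command="sleep 18000 && echo done",
        execution_timeout=timedelta(hours=6),
    )

    # Control 2: pod logs streamed back through the Kubernetes API.
    KubernetesPodOperator(
        task_id="pod_sleep_5h",
        name="pod-sleep-5h",
        namespace="default",
        image="ubuntu",
        cmds=["bash", "-c", "sleep 18000 && echo done"],
        get_logs=True,
    )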
Also, can you specify what kind of cloud infrastructure and which cloud provider you are running on? I know there is an issue with the AWS Kubernetes cloud described in this thread: #54 |
Following your suggestion, I ran three tasks: BashOperator, KubernetesPodOperator, and KubernetesJobOperator. Based on the log results, only the KubernetesJobOperator task experienced the four-hour log interruption. Here are my dag.py files and the corresponding pod and job YAML files.
DAG file for the KubernetesPodOperator
pod.yaml
DAG file for the KubernetesJobOperator
job.yaml
I don't know if you have any other suggestions; I look forward to your reply. |
Hi. Let me set up the system and give that a go. It seems to be coming from the operator. This may take a few weeks since I'm dealing with some things at work. Can you share all of the cloud/OS/Airflow details? |
Cloud: BCC (Baidu Cloud Compute)
Airflow: 2.5.0 |
I was able to reproduce the issue, not sure of the cause yet. Using: k0s (local server)

from datetime import timedelta

from utils import default_args, name_from_file

from airflow import DAG
from airflow_kubernetes_job_operator.kubernetes_job_operator import (
    KubernetesJobOperator,
)

dag = DAG(
    name_from_file(__file__),
    default_args=default_args,
    description="Test base job operator",
    schedule_interval=None,
    catchup=False,
)

envs = {
    "PASS_ARG": "a test",
}

total_time_seconds = round(timedelta(hours=4.5).total_seconds())

KubernetesJobOperator(
    task_id="test-long-job-success",
    body_filepath="./templates/test_long_job.yaml",
    envs={
        "PASS_ARG": "a long test",
        "TIC_COUNT": str(total_time_seconds),
    },
    dag=dag,
)

if __name__ == "__main__":
    dag.test()

apiVersion: batch/v1
kind: Job
metadata: {}
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: job-executor
          image: ubuntu
          command:
            - bash
            - -c
            - |
              #!/usr/bin/env bash
              : "${SLEEP_INTERVAL:=10}"
              echo "Starting $PASS_ARG (Sleep interval $SLEEP_INTERVAL)"
              elapsed_time=0
              while true; do
                sleep $SLEEP_INTERVAL
                elapsed_time=$((elapsed_time + $SLEEP_INTERVAL))
                echo "Elapsed $elapsed_time [seconds]"
                if [ "$elapsed_time" -ge "$TIC_COUNT" ]; then
                  break
                fi
              done
              echo "Complete"
          env:
            - name: TIC_COUNT
              value: '10800' # 3 hrs
            - name: SLEEP_INTERVAL
              value: '60'
  backoffLimit: 0
|
I have made no progress on this issue yet. Thank you for the information, and I hope to find a solution soon. Thanks again. |
Found and (hopefully) fixed the issue in #93 and merged it to master. The issue was that once 4 hours have passed, Kubernetes closes the connection on its side but returns no error. Since it is a follow (streaming) query, the log query should simply restart when that happens.
This allows the logs query to essentially last forever. This does add some call overhead while Kubernetes is deleting the resources, but it should not be much (maybe 10 more calls at the end of the sequence). Please, if you can, test by installing the package from the master branch and let me know. |
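For reference, a minimal sketch of the restart-on-silent-close pattern described above, using the official kubernetes Python client. This is illustrative only, not the operator's actual implementation; the pod and namespace names are placeholders, and de-duplication of already-seen log lines is omitted.

from kubernetes import client, config, watch

def follow_logs_until_done(namespace: str, pod_name: str) -> None:
    config.load_kube_config()
    core = client.CoreV1Api()
    while True:
        # Follow-mode log stream; the API server may close it after a few
        # hours without raising any error.
        for line in watch.Watch().stream(
            core.read_namespaced_pod_log, name=pod_name, namespace=namespace
        ):
            print(line)
        # The stream ended without an exception: only treat this as completion
        # if the pod is actually done, otherwise restart the query so long
        # runs keep logging.
        pod = core.read_namespaced_pod(name=pod_name, namespace=namespace)
        if pod.status.phase in ("Succeeded", "Failed"):
            break

# Example usage (placeholder names):
# follow_logs_until_done("default", "my-long-job-pod")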
My tests were successful: a 4.5-hour run completed without issue. Once you confirm on your end, I'll make another release. |
Thank you very much for your support. I will quickly use the master branch to verify the issue, and I will respond to you with the exact results as soon as possible. |
Hi, any update? I want to create a new release |
I apologize for the delayed response, as I was validating multiple long tasks. I have now verified the fix in my environment, and the issue has been resolved. Thank you very much for your support. |
Describe the bug
Airflow: 2.5.0
kubernetesjoboperator: 2.0.12
Description:
The task's running log output is incomplete. When a task runs for more than 4 hours, the Airflow task log stops updating and shows the error "Task is not able to be run", but the container in Kubernetes continues to output logs normally.