Wrong timeout value for ExternalTaskSensor running in deferrable mode #43948
Comments
Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval. |
I can only contribute at weekend, so feel free to pick this up if anyone feels like it. |
@kien-truong I would like to work on this. |
Assigned you |
I looked into this a bit but it seems like there's a fundamental issue here, I'll try to explain below. The expected behavior would be to have a sensor that can run with retries, in case something fails during the sensor check, e.g. infra issues. The retries are not about the sensor not finding what it was supposed to, e.g. "the task is not there", but to recover from infra failures, e.g. the database being temporarily unavailable. This behavior works as expected with sensors in general. However, when combining retries on sensors with timeouts, that's where things start getting interesting:
It seems like the user would want the same behavior between the deferred and non-deferred versions of the sensor for timeouts with retries, but I couldn't find a way to solve it without adding a new table to Airflow. Is the original first start time saved somewhere? |
Yeah, even in The document said,
However, this is only correct if the sensor doesn't fail and retry. |
I think the document is correct with regards to what it says: it says "the I agree with you that they should behave the same way. I think it'd be a relatively simple fix, but there needs to be a state that stores the attempts. |
Actually, testing this out, it indeed does not work across attempts; instead, the timeout is only enforced within the same attempt, across different reschedules. I think @kien-truong is right. |
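To make the distinction in the comment above concrete, here is a minimal sketch in plain Python (no Airflow imports). It only illustrates the two ways of measuring the timeout discussed in this thread; `timed_out`, `first_start`, and `retry_start` are hypothetical names, not Airflow APIs, and the timestamps are made up.

```python
from datetime import datetime, timedelta


def timed_out(clock_start: datetime, now: datetime, timeout: timedelta) -> bool:
    """Return True once more than `timeout` has elapsed since `clock_start`."""
    return now - clock_start > timeout


timeout = timedelta(minutes=10)
first_start = datetime(2024, 1, 1, 12, 0)          # attempt 1 starts (made-up time)
retry_start = first_start + timedelta(minutes=8)   # attempt 2 starts after a failure
now = first_start + timedelta(minutes=15)

# Clock restarted per attempt: only 7 minutes have passed in attempt 2.
per_attempt = timed_out(retry_start, now, timeout)
# Clock kept across attempts: 15 minutes have passed since the first start.
across_attempts = timed_out(first_start, now, timeout)

print(per_attempt, across_attempts)  # False True
```

Measuring from the first start requires that timestamp to survive a retry, which is the stored state asked about earlier in the thread.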
There is currently a behavioural discrepancy between regular and deferred sensors that has been outstanding for many months which, it seems, will finally be addressed in 2.11.x. |
I think it is correct that sensors that have exhausted their timeout should not retry; instead the timeout should be longer to cover a wide enough window of waiting. |
This issue has been automatically marked as stale because it has been open for 14 days with no response from the author. It will be closed in next 7 days if no further activity occurs from the issue author. |
Still relevant; commenting to remove the stale label. |
Actually, unless I'm missing some other issue, I believe this is a duplicate of a previously logged issue, and it looks to be addressed in the forthcoming Airflow 2.11. |
It partially overlaps with #33718, which makes the deferrable sensor not retry when However, even with that fix in place, the sensor still needs to call defer with the correct |
@dstandish Are you able to comment on this given you worked on #33718? |
This issue has been automatically marked as stale because it has been open for 14 days with no response from the author. It will be closed in next 7 days if no further activity occurs from the issue author. |
This issue has been closed because it has not received response from the issue author. |
Apache Airflow version
2.10.3
If "Other Airflow 2 version" selected, which one?
No response
What happened?
The WorkflowTrigger used by ExternalTaskSensor should have a time limit set from the timeout attribute instead of execution_timeout.
airflow/airflow/sensors/external_task.py
Line 349 in bb234dc
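As a rough illustration of the report, the sketch below contrasts the two values that could be used as the deferral time limit. It is plain Python and does not import Airflow: `SensorSettings`, `current_defer_timeout`, and `expected_defer_timeout` are hypothetical names used only for this example. It assumes that the sensor's timeout is a number of seconds and that execution_timeout is an optional timedelta.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass
class SensorSettings:
    """Hypothetical stand-in for the two sensor attributes discussed here."""

    timeout: float  # sensor timeout, assumed to be in seconds
    execution_timeout: Optional[timedelta] = None  # generic per-task limit


def current_defer_timeout(s: SensorSettings) -> Optional[timedelta]:
    # Behaviour described in this report: the limit comes from execution_timeout.
    return s.execution_timeout


def expected_defer_timeout(s: SensorSettings) -> timedelta:
    # Behaviour the report asks for: the limit comes from the sensor's timeout.
    return timedelta(seconds=s.timeout)


settings = SensorSettings(timeout=600)
print(current_defer_timeout(settings))   # None
print(expected_defer_timeout(settings))  # 0:10:00
```

With only timeout set, the first function yields no limit at all, which matches the symptom described: the sensor's own timeout is ignored while deferred.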
What you think should happen instead?
No response
How to reproduce
execution_timeout instead of timeout
Operating System
Linux
Versions of Apache Airflow Providers
No response
Deployment
Google Cloud Composer
Deployment details
No response
Anything else?
No response
Are you willing to submit PR?
Code of Conduct