
Wrong timeout value for ExternalTaskSensor running in deferrable mode #43948

Closed
kien-truong opened this issue Nov 13, 2024 · 17 comments

Comments

@kien-truong

kien-truong commented Nov 13, 2024

Apache Airflow version

2.10.3

If "Other Airflow 2 version" selected, which one?

No response

What happened?

The WorkflowTrigger used by ExternalTaskSensor should have its time limit set from the timeout attribute instead of execution_timeout:

timeout=self.execution_timeout,
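
For context, here is an abridged sketch of the deferrable branch in ExternalTaskSensor.execute as of 2.10.x; the exact field list and surrounding code may differ between versions, so treat it as illustrative rather than the literal source:

```python
# Abridged, illustrative sketch of the deferrable branch in ExternalTaskSensor.execute.
def execute(self, context):
    if not self.deferrable:
        super().execute(context)
    else:
        self.defer(
            # Reported bug: the deferral deadline comes from execution_timeout
            # (None by default) instead of the sensor's own timeout attribute.
            timeout=self.execution_timeout,
            trigger=WorkflowTrigger(
                external_dag_id=self.external_dag_id,
                execution_dates=self._get_dttm_filter(context),
                poke_interval=self.poll_interval,
                # other trigger fields omitted
            ),
            method_name="execute_complete",
        )
```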

What do you think should happen instead?

No response

How to reproduce

  1. Start an ExternalTaskSensor in deferrable mode to monitor a non-running DAG, using default arguments (see the sketch below these steps).
  2. The sensor never times out, because the monitored DAGs do not update and the timeout value is wrongly taken from execution_timeout instead of timeout.
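
A minimal reproduction sketch along the lines of the steps above; the DAG ids and dates are placeholders, not taken from the original report:

```python
# Minimal reproduction sketch; DAG ids and dates are illustrative placeholders.
import pendulum
from airflow import DAG
from airflow.sensors.external_task import ExternalTaskSensor

with DAG(
    dag_id="repro_deferrable_external_task_sensor",
    start_date=pendulum.datetime(2024, 11, 1, tz="UTC"),
    schedule="@daily",
    catchup=False,
):
    # "upstream_dag" is assumed to exist but never run, so the trigger keeps polling.
    # With the default sensor timeout (7 days) the task should eventually fail,
    # but in deferrable mode the deadline is taken from execution_timeout
    # (None by default), so the deferred task never times out.
    ExternalTaskSensor(
        task_id="wait_for_upstream",
        external_dag_id="upstream_dag",
        deferrable=True,
    )
```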

Operating System

Linux

Versions of Apache Airflow Providers

No response

Deployment

Google Cloud Composer

Deployment details

No response

Anything else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

kien-truong added the area:core, kind:bug, and needs-triage labels Nov 13, 2024

boring-cyborg bot commented Nov 13, 2024

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

dosubot bot added the area:async-operators label Nov 13, 2024
@kien-truong
Author

I can only contribute on weekends, so feel free to pick this up if anyone feels like it.

kien-truong changed the title from "Wrong timeout value for ExternalTaskSensor running in Deferrable mode" to "Wrong timeout value for ExternalTaskSensor running in deferrable mode" Nov 13, 2024
@venkateshwaracholan

@kien-truong I would like to work on this.

@potiuk
Member

potiuk commented Nov 16, 2024

Assigned you

@karakanb
Contributor

I looked into this a bit, but it seems like there's a fundamental issue here; I'll try to explain below.

The expected behavior would be to have a sensor that can run with retries, in case something fails during the sensor check, e.g. infra issues. The retries are not about the sensor not finding what it was supposed to, e.g. "the task is not there", but to recover from infra failures, e.g. the database being temporarily unavailable. This behavior works as expected with sensors in general.

However, when combining retries on sensors with timeouts, that's where things start getting interesting:

  • When the user sets a timeout, the intention is "wait this long from the beginning of the first try", which is a very important factor that is also highlighted in the Timeouts section of the docs. This behavior seems to work correctly with reschedule mode thanks to the task_reschedule table that records the start timestamp for the first try.
  • However, when deferrable mode is used, the timeouts do not work with retries since there's no way to retrieve the start time of the first attempt of a task instance.

It seems like the user would want the same behavior between the deferred and non-deferred versions of the sensor for timeouts with retries, but I couldn't find a way to solve it without adding a new table to Airflow. Is the original first start time information saved somewhere?
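
To make the distinction concrete, here is a small self-contained sketch of the anchoring behavior described above; the function and parameter names are illustrative, not actual Airflow internals:

```python
from datetime import datetime, timedelta

def sensor_timed_out(
    mode: str,
    first_try_start: datetime,    # recoverable from task_reschedule in reschedule mode
    current_try_start: datetime,  # all that poke/deferrable effectively know after a retry
    now: datetime,
    timeout: timedelta,
) -> bool:
    # Reschedule mode can anchor the timeout to the start of the first try because
    # the task_reschedule table preserves that timestamp; poke and deferrable modes
    # only know about the current attempt, so the clock restarts after each retry.
    started_at = first_try_start if mode == "reschedule" else current_try_start
    return now - started_at > timeout
```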

@kien-truong
Author

kien-truong commented Nov 29, 2024

When the user sets a timeout, the intention is "wait this long from the beginning of the first try"

Yeah, even in poke mode the sensor does not respect this behavior but uses the first poke of the current try as the starting point, which I think is a bug in itself.

The documentation says:

In case that the mode is poke (see below), both of them (timeout and execution_timeout | emphasis mine) are equivalent (as the sensor is never rescheduled)

However, this is only correct if the sensor doesn't fail and retry.

@karakanb
Contributor

I think the document is correct with regard to what it says: "the execution_timeout and the timeout behave the same for poke", meaning that they don't do what we'd expect them to do and instead just set the timeout for the specific retry attempt.

I agree with you that they should behave the same way. I think it'd be a relatively simple fix, but there needs to be a state that stores the attempts.

@karakanb
Contributor

karakanb commented Dec 2, 2024

Actually, testing this out, it indeed does not work across attempts; instead, the timeout is only enforced within the same attempt across different reschedules. I think @kien-truong is right.

@nathadfield
Collaborator

There is currently a behavioural discrepancy between regular and deferred sensors that has been outstanding for many months and which, it seems, will finally be addressed in 2.11.x.

@nathadfield
Collaborator

I think it is correct that sensors that have exhausted their timeout should not retry; instead the timeout should be longer to cover a wide enough window of waiting.

github-actions bot commented Dec 31, 2024

This issue has been automatically marked as stale because it has been open for 14 days with no response from the author. It will be closed in the next 7 days if no further activity occurs from the issue author.

github-actions bot added the stale label Dec 31, 2024
@seifrajhi

Still relevant; please remove the stale label.

@nathadfield
Collaborator

Actually, unless I'm missing some other issue, I believe this is a duplicate of a previously logged issue and looks to be addressed in the forthcoming Airflow 2.11.

github-actions bot removed the stale label Jan 3, 2025
@kien-truong
Author

It partially overlaps with #33718, which makes the deferrable sensor not retry when the timeout is reached.

However, even with that fix in place, the sensor still needs to call defer with the correct timeout parameter, which is not the case with ExternalTaskSensor: it passes execution_timeout instead of timeout.
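
A minimal sketch of the change described here, assuming self.timeout holds the sensor timeout in seconds inherited from BaseSensorOperator; the conversion and surrounding call are illustrative, not a committed patch:

```python
from datetime import timedelta

# Illustrative fix: derive the deferral deadline from the sensor's own timeout
# (seconds on BaseSensorOperator) rather than from execution_timeout.
self.defer(
    timeout=timedelta(seconds=self.timeout) if self.timeout else None,
    trigger=trigger,  # the WorkflowTrigger built by ExternalTaskSensor, as before
    method_name="execute_complete",
)
```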

@nathadfield
Collaborator

@dstandish Are you able to comment on this given you worked on #33718?


github-actions bot commented Feb 1, 2025

This issue has been automatically marked as stale because it has been open for 14 days with no response from the author. It will be closed in the next 7 days if no further activity occurs from the issue author.

github-actions bot added the stale label Feb 1, 2025

github-actions bot commented Feb 9, 2025

This issue has been closed because it has not received a response from the issue author.

github-actions bot closed this as not planned Feb 9, 2025