Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Attempt to fix the flaky TestTriggererJob test #34075

Merged
merged 1 commit into from
Sep 4, 2023

Conversation

potiuk
Copy link
Member

@potiuk potiuk commented Sep 4, 2023

As documented in #33323, we have frequent failures of the flaky triggerer job tests.

The flaky failures are about some errors when we close all the sessions in teardown of the test. It turns out that the tests had side-effect - they have not waited for the TriggererJob thread to complete, they merely marked them to be stopped, but they have not waited for those to complete - this is quite plausible explanation of the flaky test failures - since those threads have 1 second sleep, it's more than likely that the session has been created and used by the thread while the teardown has been attempting to close all the sessions.

This side effect could also have an effect for other tests that were run after - because in a busy test run machine, the side effects could propagate further than just to the teardown, so it could also explain why sometimes (very rarely) other job tests failed with similar errors.

The fix is to join the runner after marking it to be stopped.

Fixes: #33323 (Hopefully)


^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

As documented in apache#33323, we have frequent failures of the flaky
triggerer job tests.

The flaky failures are about some errors when we close all the
sessions in teardown of the test. It turns out that the tests
had side-effect - they have not waited for the TriggererJob thread
to complete, they merely marked them to be stopped, but they have
not waited for those to complete - this is quite plausible
explanation of the flaky test failures - since those threads have
1 second sleep, it's more than likely that the session has been
created and used by the thread while the teardown has been attempting
to close all the sessions.

This side effect could also have an effect for other tests that
were run after - because in a busy test run machine, the side
effects could propagate further than just to the teardown, so it
could also explain why sometimes (very rarely) other job tests
failed with similar errors.

The fix is to join the runner after marking it to be stopped.

Fixes: apache#33323 (Hopefully)
@boring-cyborg boring-cyborg bot added area:Scheduler including HA (high availability) scheduler area:Triggerer labels Sep 4, 2023
@potiuk potiuk added this to the Airflow 2.7.1 milestone Sep 4, 2023
@ephraimbuddy ephraimbuddy added the changelog:skip Changes that should be skipped from the changelog (CI, tests, etc..) label Sep 4, 2023
@Taragolis
Copy link
Contributor

I'm not sure this would help, because timeout it is only for how long block Main Thread, but thread might still exists.
In the other case if parameter timeout not set or None then it would block Main Thread until required thread completed.

Some example

from threading import Thread
from time import sleep


def infinity():
    while True:
        sleep(1)


if __name__ == "__main__":
    thr = Thread(target=infinity)
    thr.start()
    print("Alive? ", thr.is_alive())
    thr.join(5)
    print("5 second? I'm stile Alive! ", thr.is_alive())
    thr.join(30)
    print("30 second? Nope! I'm stile Alive! ", thr.is_alive())
    print("Let's block main thread, but this would not help either.")
    thr.join()

I assume that we should do something on Triggerer side, however killing thread it's not something straight forward, last time when I tried to do that stuff, with events. queues and etc I lost my mind, and I'm not sure that I recovered after that.

@Taragolis
Copy link
Contributor

Some additional "simulation" within pytest and long living threads

If run entire module in Pytest, all tests will failed:

  • TestMultiThreading::test_thread_is_alive - Thread still alive
  • TestMultiThreading::test_check_threads - Thread with name "foo-bar" still exists
  • TestOtherTest::test_check_threads - Thread with name "foo-bar" still exists
import threading
from time import sleep


class TestMultiThreading:
    def test_thread_is_alive(self):

        def infinity():
            while True:
                sleep(1)

        thr = threading.Thread(target=infinity, name="foo-bar")
        try:
            thr.start()
        finally:
            thr.join(2)
            assert not thr.is_alive()

    def test_check_threads(self):
        threads_names = [thread.name for thread in threading.enumerate()]
        assert "foo-bar" not in threads_names, threads_names


class TestOtherTest:
    def test_check_threads(self):
        threads_names = [thread.name for thread in threading.enumerate()]
        assert "foo-bar" not in threads_names, threads_names

That mean if our side effect that we block main thread and as result we ignore all signals, then solution with timeout would work.

But if our problem that thread existed after test completed and this affect other test, then solution with timeout would not work. And better spawn separate process which would run TriggererJob and kill that process in the end of the test.

@potiuk
Copy link
Member Author

potiuk commented Sep 4, 2023

I think it will help. The way it was before had almost guarantee that those races will happen because we have not waited for the thread at all.

Here we give 30 seconds (way more than the 1 sec. Idle) for the thread to complete. And this is all about to make it far far less probable and 'fast enough' even if the thread hangs for some reason or the machine is very, very busy.

In most cases even if the thread will hang for 30 s and we won't complete waiting, nothing bad will happen. We would have to be really unlucky to get the thread do something with session while we are closing it. Previously it was just the question of how close to the end of the 1s sleep the thread was so If the 'race window' was say 10ms - we woud have 1% chance to hit it. With 30 seconds time for the thread to complete IMHO the probability is wery low.

If we have infinite join and the thread does not complete, then we will fail at 60s anyway with test timeout, and the test would fail in this case - even if it would have side effects would be harmless, so I prefer to just 'let it go' after half the time and continue hoping that the side effects won't kick in.

@potiuk
Copy link
Member Author

potiuk commented Sep 4, 2023

I think it will. The way it was before had almost guarantee that those races will happen because we have not waited for the thread at all.

Here we give 30 seconds (way more than the 1 sec. Idle) for the thread to complete. And this is all about to make it far far less probable and 'fast enough' even if the thread hangs for some reason or the machine is very, very busy.

In most cases even if the thread will hang for 30 s and we won't complete waiting, nothing bad will happen. We would have to be really unlucky to get the thread do something with session while we are closing it. Previously it was just the question of how close to the end of the 1s sleep the thread was so If the 'race window' was say 10ms - we woud have 1% chance to hit it. With 30 seconds time for the thread to complete IMHO the probability is wery low.

If we have infinite join and the thread does not complete, then we will fail at 60s anyway with test timeout, and the test would fail in this case - even if it would have side effects would be harmless, so I prefer to just 'let it go' after half the time and continue hoping that the side effects won't kick in.

And yeah likely having separate process is 'better' but I wanted to have a quick fix now - to Alo make it easy to cherry-pick to 2.7

Copy link
Contributor

@Taragolis Taragolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I don't see initially that we do not wait at all. In this case it should dramatically reduce that side effect.

@potiuk potiuk merged commit 47f79b9 into apache:main Sep 4, 2023
@potiuk potiuk deleted the attempt-to-fix-flaky-triggerer-tests branch September 4, 2023 14:40
@potiuk
Copy link
Member Author

potiuk commented Sep 4, 2023

cc: @ephraimbuddy -> should be good to cherry-pick

ephraimbuddy pushed a commit that referenced this pull request Sep 4, 2023
As documented in #33323, we have frequent failures of the flaky
triggerer job tests.

The flaky failures are about some errors when we close all the
sessions in teardown of the test. It turns out that the tests
had side-effect - they have not waited for the TriggererJob thread
to complete, they merely marked them to be stopped, but they have
not waited for those to complete - this is quite plausible
explanation of the flaky test failures - since those threads have
1 second sleep, it's more than likely that the session has been
created and used by the thread while the teardown has been attempting
to close all the sessions.

This side effect could also have an effect for other tests that
were run after - because in a busy test run machine, the side
effects could propagate further than just to the teardown, so it
could also explain why sometimes (very rarely) other job tests
failed with similar errors.

The fix is to join the runner after marking it to be stopped.

Fixes: #33323 (Hopefully)
(cherry picked from commit 47f79b9)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:Scheduler including HA (high availability) scheduler area:Triggerer changelog:skip Changes that should be skipped from the changelog (CI, tests, etc..)
Projects
None yet
3 participants