Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restore stability and unquarantine all test_scheduler_job tests #19860

Merged
merged 2 commits into from
Dec 13, 2021

Conversation

potiuk
Copy link
Member

@potiuk potiuk commented Nov 28, 2021

The scheduler job tests were pretty flaky and some of them were
quarantined already (especially the query count). This PR improves
the stability in the following ways:

  • clean the database between tests for TestSchedulerJob to avoid
    side effects
  • forces UTC timezone in tests where date missed timezone specs
  • updates number of queries expected in the query count tests
  • stabilizes the sequence of retrieval of tasks in case tests
    depended on it
  • adds more stack trace levels (5) to compare where extra
    methods were called.
  • increase number of scheduler runs where it was needed
  • add session.flush() where it was missing
  • add requirement to have serialized dags ready when needed

Hopefully:

(Based on #20054 to test if all green).


^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.

@boring-cyborg boring-cyborg bot added the area:Scheduler including HA (high availability) scheduler label Nov 28, 2021
@potiuk potiuk force-pushed the restore-stability-in-test-scheduler-job branch from 16f174e to a8fcdc0 Compare November 28, 2021 13:32
@potiuk potiuk force-pushed the restore-stability-in-test-scheduler-job branch from a8fcdc0 to 9cf5055 Compare November 28, 2021 14:57
@potiuk potiuk added the full tests needed We need to run full set of tests for this PR to merge label Nov 28, 2021
@potiuk potiuk force-pushed the restore-stability-in-test-scheduler-job branch from 9cf5055 to 2438009 Compare November 28, 2021 14:58
@potiuk
Copy link
Member Author

potiuk commented Nov 28, 2021

Running "full" tests, to get a few runs and check flakiness (locally I run those 100s of times - but seems that some scenarios might only happen in parallel run scenarios).

@potiuk potiuk force-pushed the restore-stability-in-test-scheduler-job branch from 2438009 to f444f47 Compare November 28, 2021 15:01
@potiuk potiuk force-pushed the restore-stability-in-test-scheduler-job branch from f444f47 to 53a1c74 Compare November 28, 2021 17:02
@eladkal
Copy link
Contributor

eladkal commented Nov 28, 2021

test_scheduler_keeps_scheduling_pool_full still seems flaky

==== Core mssql: 1 failure ====

tests/jobs/test_scheduler_job.py::TestSchedulerJob::test_scheduler_keeps_scheduling_pool_full: assert 0 > 0
 +  where 0 = len([])
Error: Process completed with exit code 1.

@potiuk
Copy link
Member Author

potiuk commented Nov 28, 2021

Yep. Seems it's still 'flaky`... looking at it.

@potiuk potiuk force-pushed the restore-stability-in-test-scheduler-job branch 2 times, most recently from ec3a4e8 to 9be6803 Compare November 29, 2021 11:51
@potiuk
Copy link
Member Author

potiuk commented Nov 29, 2021

I could not reproduce it locally at all, but I added an extra flush(). I believe the problem was because the queries might have not been yet flushed after ORM modifications (but I am a bit guessing here). I will try to run it more times.

@potiuk potiuk force-pushed the restore-stability-in-test-scheduler-job branch from 9be6803 to d65807d Compare November 29, 2021 18:00
@potiuk
Copy link
Member Author

potiuk commented Nov 29, 2021

Hmm. @ashb @uranusjr @ephraimbuddy maybe you could take a look at this one and help me to figure out what happened. I have a theory and I am trying it in the latest commit, but I just debugged and learned more about how dag_maker works internally (pretty cool actually).

Below you will see coloured dump pf the failing test_scheduler_keeps_scheduling_pool_full tests.

Here is my line of thinking:

What I seee there is that when the "create_dagrun" loop for the d2 dag failed with Couldn't find dag (scheduler_job.py:992). You can see five of those failing (this is understood as we try to create 5 dagruns).

When (normally) the test runs successfully - as expected there are no error messages, DagRuns are created and life is good. So clearly the reason for the test failin are those 'Couldn't find dag'. But from what I understand how get_dag() works, the only way to get Couldn't find dag is that there is no SerializedDagModel available for SQLAlchemy.

For whatever reason when the tests get to the d2 create_dagruns loop, the SerializedDagModel for it is not available in sqlalchemy. The whole test is done (as I understand it) without a single commit() - so all the SerializedDag objects should be in the session (because they are not commited te the database). This (I think) means that the tests has only a chance to work if both dag_makers above use the same session as is used later to create dag_runs.

My theory is that for some reason the second run of the dag_maker used different session from sqlalchemy. This is the only way I could explain this behaviour, but I am not sure how it could happen. Yes the dag_makers did not have session passed to them - which triggers (own_session) - so each of the dag_maker contexts would create new Session(). But at least in theory they should be the same scope. From what I understand sqlalchemy stores the actuall session in Threadlocal and they will be reused if Session() is called twice. But maybe I do not understand SqlAlchemy session behaviour and there is way same thread will get two different sessions (with different ORM models) if it calls Session() twice.

I am trying out my theory by adding a change where the two dag_makers, actually share session created before. I moved session creation up and passed it to dag_makers. Previously those tests failed in 1/2 tests consistently before so I hope we will see the result of it soon. But If it would confirm, then I wonder under what circumstances this could happen.

Screenshot from 2021-11-29 19-07-05

@potiuk
Copy link
Member Author

potiuk commented Nov 29, 2021

Hmm. Interesting one - it did not fix the problem (so my hypothesis was wrong): https://github.com/apache/airflow/runs/4357877756?check_suite_focus=true#step:6:4065

Also we have another flaky test here: test_scheduler_task_start_date

https://github.com/apache/airflow/runs/4357877606?check_suite_focus=true#step:6:4013

@potiuk potiuk force-pushed the restore-stability-in-test-scheduler-job branch 5 times, most recently from f965207 to c5220b7 Compare November 30, 2021 09:43
@potiuk
Copy link
Member Author

potiuk commented Nov 30, 2021

OK. I have very interesting findings and I might need some brainstorming. @ashb @ephraimbuddy @uranusjr.

I think know what is causing the problem wut but I have not figured out the mechanism (yet). I am running some experiments to get more data but the problem with test_scheduler_keeps_scheduling_pool_full is most likely caused by side effect of
test_scheduler_multiprocessing_with_spawn_method - and I am afraid it could have some real consequences on systems that are using spawn method for multiprocessing.

This seems to be caused by a deffered re-importing of airflow.settings generated by DagProcessor from the spawn test. However I scratch my head and cannot figure out how such reload in a - seamingly - different process, with python interpreter started from scratch (this is how spawn works) could influence the tests running in a different process. It might be related to some shared memory or resources on the "driver" level for the database (this is the only thing left as a hypothesis)

Some Observations:

This problem is accompanied always by this log entry (which is generated by reloading settings by dag processor I believe):

  [2021-11-29 19:14:45,094] {settings.py:52} INFO - Configured default timezone Timezone('UTC')

This entry asynchronously appears at various places - sometimes during the test, sometimes in setup (the latter is a proof that this is a side effect from an earlier test and not generated by the test itself). Depending on when the "reload" happens, the effects on the test might be slighly different.

  • First of all the test_scheduler_keeps_scheduling_pool_full runs right after test_scheduler_multiprocessing_with_spawn_method because that's the sequence in the test.
  • I tested it by disalbling the test_scheduler_multiprocessing_with_spawn_method test. When the test is enabled, the problem happens rather consistently (1/2 failures for each CI builds). When the test is disabled, the test_scheduler_keeps_scheduling_pool_full consistently works fine
  • I am (just now) running an experiment where i moved the spawn test case after the test_scheduler_keeps_scheduling_pool_full and I will see what effect it will have (I guess the following tests might start failing).

I see three interesting cases:

Reload happening after or around second DAG syncing - here only d2 was missing:

  ---------------------------- Captured stdout setup -----------------------------
  [2021-11-29 19:14:44,996] {dagbag.py:500} INFO - Filling up the DagBag from /dev/null
  ------------------------------ Captured log setup ------------------------------
  INFO     airflow.models.dagbag.DagBag:dagbag.py:500 Filling up the DagBag from /dev/null
  ----------------------------- Captured stdout call -----------------------------
  [2021-11-29 19:14:45,011] {dag.py:2395} INFO - Sync 1 DAGs
  [2021-11-29 19:14:45,015] {dag.py:2414} INFO - Creating ORM DAG for test_scheduler_keeps_scheduling_pool_full_d1
  [2021-11-29 19:14:45,021] {dag.py:2934} INFO - Setting next_dagrun for test_scheduler_keeps_scheduling_pool_full_d1 to 2016-01-01T00:00:00+00:00
  [2021-11-29 19:14:45,041] {dag.py:2395} INFO - Sync 1 DAGs
  [2021-11-29 19:14:45,044] {dag.py:2414} INFO - Creating ORM DAG for test_scheduler_keeps_scheduling_pool_full_d2
  [2021-11-29 19:14:45,050] {dag.py:2934} INFO - Setting next_dagrun for test_scheduler_keeps_scheduling_pool_full_d2 to 2016-01-01T00:00:00+00:00
  [2021-11-29 19:14:45,094] {settings.py:52} INFO - Configured default timezone Timezone('UTC')  <--- HERE RELOAD HAPPENED
  [2021-11-29 19:14:45,190] {scheduler_job.py:992} ERROR - Couldn't find dag test_scheduler_keeps_scheduling_pool_full_d2 in DagBag/DB!
  [2021-11-29 19:14:45,198] {scheduler_job.py:992} ERROR - Couldn't find dag test_scheduler_keeps_scheduling_pool_full_d2 in DagBag/DB!
  [2021-11-29 19:14:45,207] {scheduler_job.py:992} ERROR - Couldn't find dag test_scheduler_keeps_scheduling_pool_full_d2 in DagBag/DB!
  [2021-11-29 19:14:45,221] {scheduler_job.py:992} ERROR - Couldn't find dag test_scheduler_keeps_scheduling_pool_full_d2 in DagBag/DB!
  [2021-11-29 19:14:45,235] {scheduler_job.py:992} ERROR - Couldn't find dag test_scheduler_keeps_scheduling_pool_full_d2 in DagBag/DB!
  [2021-11-29 19:14:45,248] {scheduler_job.py:287} INFO - 2 tasks up for execution:

Reload happening before DAG syncing (here both dags are missing). This is VERY interesting because it seems that the reload affects creation of the SerializedDagModels created after the reload - and not retrieval of them later.

  ---------------------------- Captured stdout setup -----------------------------
  [2021-11-28 17:37:58,370] {settings.py:52} INFO - Configured default timezone Timezone('UTC')  <-- HERE RELOAD HAPPENED
  [2021-11-28 17:37:58,372] {dagbag.py:500} INFO - Filling up the DagBag from /dev/null
  ------------------------------ Captured log setup ------------------------------
  INFO     airflow.models.dagbag.DagBag:dagbag.py:500 Filling up the DagBag from /dev/null
  ----------------------------- Captured stdout call -----------------------------
  [2021-11-28 17:37:58,381] {dag.py:2395} INFO - Sync 1 DAGs
  [2021-11-28 17:37:58,384] {dag.py:2414} INFO - Creating ORM DAG for test_scheduler_keeps_scheduling_pool_full_d1
  [2021-11-28 17:37:58,388] {dag.py:2928} INFO - Setting next_dagrun for test_scheduler_keeps_scheduling_pool_full_d1 to 2016-01-01T00:00:00+00:00
  [2021-11-28 17:37:58,403] {dag.py:2395} INFO - Sync 1 DAGs
  [2021-11-28 17:37:58,406] {dag.py:2414} INFO - Creating ORM DAG for test_scheduler_keeps_scheduling_pool_full_d2
  [2021-11-28 17:37:58,410] {dag.py:2928} INFO - Setting next_dagrun for test_scheduler_keeps_scheduling_pool_full_d2 to 2016-01-01T00:00:00+00:00
  [2021-11-28 17:37:58,429] {scheduler_job.py:992} ERROR - Couldn't find dag test_scheduler_keeps_scheduling_pool_full_d1 in DagBag/DB!
  [2021-11-28 17:37:58,438] {scheduler_job.py:992} ERROR - Couldn't find dag test_scheduler_keeps_scheduling_pool_full_d1 in DagBag/DB!
  [2021-11-28 17:37:58,448] {scheduler_job.py:992} ERROR - Couldn't find dag test_scheduler_keeps_scheduling_pool_full_d1 in DagBag/DB!
  [2021-11-28 17:37:58,461] {scheduler_job.py:992} ERROR - Couldn't find dag test_scheduler_keeps_scheduling_pool_full_d1 in DagBag/DB!
  [2021-11-28 17:37:58,474] {scheduler_job.py:992} ERROR - Couldn't find dag test_scheduler_keeps_scheduling_pool_full_d1 in DagBag/DB!
  [2021-11-28 17:37:58,487] {scheduler_job.py:992} ERROR - Couldn't find dag test_scheduler_keeps_scheduling_pool_full_d2 in DagBag/DB!
  [2021-11-28 17:37:58,501] {scheduler_job.py:992} ERROR - Couldn't find dag test_scheduler_keeps_scheduling_pool_full_d2 in DagBag/DB!
  [2021-11-28 17:37:58,514] {scheduler_job.py:992} ERROR - Couldn't find dag test_scheduler_keeps_scheduling_pool_full_d2 in DagBag/DB!
  [2021-11-28 17:37:58,527] {scheduler_job.py:992} ERROR - Couldn't find dag test_scheduler_keeps_scheduling_pool_full_d2 in DagBag/DB!
  [2021-11-28 17:37:58,539] {scheduler_job.py:992} ERROR - Couldn't find dag test_scheduler_keeps_scheduling_pool_full_d2 in DagBag/DB!

Reload happening between the dags syncing (here again d2 was missing):

  ---------------------------- Captured stdout setup -----------------------------
  [2021-11-28 13:41:46,975] {dagbag.py:500} INFO - Filling up the DagBag from /dev/null
  ------------------------------ Captured log setup ------------------------------
  INFO     airflow.models.dagbag.DagBag:dagbag.py:500 Filling up the DagBag from /dev/null
  ----------------------------- Captured stdout call -----------------------------
  [2021-11-28 13:41:46,993] {dag.py:2395} INFO - Sync 1 DAGs
  [2021-11-28 13:41:46,998] {dag.py:2414} INFO - Creating ORM DAG for test_scheduler_keeps_scheduling_pool_full_d1
  [2021-11-28 13:41:47,004] {dag.py:2928} INFO - Setting next_dagrun for test_scheduler_keeps_scheduling_pool_full_d1 to 2016-01-01T00:00:00+00:00
  [2021-11-28 13:41:47,011] {settings.py:52} INFO - Configured default timezone Timezone('UTC')  <-- HERE RELOAD HAPPENS
  [2021-11-28 13:41:47,030] {dag.py:2395} INFO - Sync 1 DAGs
  [2021-11-28 13:41:47,035] {dag.py:2414} INFO - Creating ORM DAG for test_scheduler_keeps_scheduling_pool_full_d2
  [2021-11-28 13:41:47,042] {dag.py:2928} INFO - Setting next_dagrun for test_scheduler_keeps_scheduling_pool_full_d2 to 2016-01-01T00:00:00+00:00
  [2021-11-28 13:41:47,198] {scheduler_job.py:992} ERROR - Couldn't find dag test_scheduler_keeps_scheduling_pool_full_d2 in DagBag/DB!
  [2021-11-28 13:41:47,209] {scheduler_job.py:992} ERROR - Couldn't find dag test_scheduler_keeps_scheduling_pool_full_d2 in DagBag/DB!
  [2021-11-28 13:41:47,220] {scheduler_job.py:992} ERROR - Couldn't find dag test_scheduler_keeps_scheduling_pool_full_d2 in DagBag/DB!
  [2021-11-28 13:41:47,231] {scheduler_job.py:992} ERROR - Couldn't find dag test_scheduler_keeps_scheduling_pool_full_d2 in DagBag/DB!
  [2021-11-28 13:41:47,241] {scheduler_job.py:992} ERROR - Couldn't find dag test_scheduler_keeps_scheduling_pool_full_d2 in DagBag/DB!

It looks like it is not connected with exactly the moment of the log, but I believe it is caused with what happens next after the "Timezone" log is logged. When I look at the code, right after re-importing the settings, DagProcessor calls initialize() method there are plenty of possibilities where some fo those (configure_orm() ?) might cause disruption to SQLAlchemy models stored in session.

It looks literally as if SerializedDAGModels created by dag_maker after this mysterious "reload effect" happen are not visible to get_dag(dag_id) query.

def initialize():
    """Initialize Airflow with all the settings from this file"""
    configure_vars()
    prepare_syspath()
    import_local_settings()
    global LOGGING_CLASS_PATH
    LOGGING_CLASS_PATH = configure_logging()
    configure_adapters()
    # The webservers import this file from models.py with the default settings.
    configure_orm()
    configure_action_logging()

    # Ensure we close DB connections at scheduler and gunicorn worker terminations
    atexit.register(dispose_orm)

However I failed to come up with a theory why such reload in a spawned process might cause this really? And why the spawned DagProcessor is not killed by this in pytest fixture ?

        if self.scheduler_job and self.scheduler_job.processor_agent:
            self.scheduler_job.processor_agent.end()
            self.scheduler_job = None

Any help and brainstorming appreciated.

@potiuk potiuk added the use public runners Makes sure that Public runners are used even if commiters creates the PR (useful for testing) label Dec 12, 2021
@potiuk
Copy link
Member Author

potiuk commented Dec 12, 2021

All green with test_scheduler_job :). Closing/reopenig with use public runners label to see if public runners will handle it too.

@potiuk potiuk closed this Dec 12, 2021
@potiuk potiuk reopened this Dec 12, 2021
@potiuk
Copy link
Member Author

potiuk commented Dec 12, 2021

This one looks really good - both public runners and self-hosted do not seem to be flaky at least the few times I run full test suite). I think it is ready to merge.

@potiuk
Copy link
Member Author

potiuk commented Dec 12, 2021

Seems like for our test the time of flaky pandemic is close to be over and our quarantined test number will go down ;). Wish that was true in real life.

After that one is merged, those are the only left quanantined tests (looking at reviewing/fixing them after this one gets merged).

./tests/executors/test_celery_executor.py:    @pytest.mark.quarantined
./tests/executors/test_celery_executor.py:@pytest.mark.quarantined
./tests/jobs/test_local_task_job.py:    @pytest.mark.quarantined
./tests/jobs/test_local_task_job.py:    @pytest.mark.quarantined
./tests/jobs/test_local_task_job.py:    @pytest.mark.quarantined
./tests/www/views/test_views_trigger_dag.py:@pytest.mark.quarantined
./tests/cli/commands/test_webserver_command.py:    @pytest.mark.quarantined
./tests/cli/commands/test_task_command.py:    @pytest.mark.quarantined
./tests/cli/commands/test_task_command.py:    @pytest.mark.quarantined
./tests/core/test_impersonation_tests.py:@pytest.mark.quarantined
./tests/core/test_impersonation_tests.py:@pytest.mark.quarantined
./tests/sensors/test_external_task_sensor.py:@pytest.mark.quarantined

@potiuk potiuk closed this Dec 12, 2021
@potiuk potiuk reopened this Dec 12, 2021
@potiuk
Copy link
Member Author

potiuk commented Dec 12, 2021

Running once again - all tests were ok except the mssql - which is unrelated problem we have to solve that happens occasionally with Public runner.

@potiuk
Copy link
Member Author

potiuk commented Dec 12, 2021

Close/reopen to check again (and again).

@potiuk potiuk closed this Dec 12, 2021
@potiuk potiuk reopened this Dec 12, 2021
@potiuk
Copy link
Member Author

potiuk commented Dec 12, 2021

The MSSQL failures should be handled by #20231

@potiuk
Copy link
Member Author

potiuk commented Dec 12, 2021

4 x integration test failures (those that #20231 should handle nicely). Running again, but seems those tests are pretty stable. now.

@potiuk potiuk closed this Dec 12, 2021
@potiuk potiuk reopened this Dec 12, 2021
@potiuk
Copy link
Member Author

potiuk commented Dec 13, 2021

Yep . Run it many times now. Only unrelated errors (flaky k8s test for public runners this time). Shall we?

Copy link
Member

@uranusjr uranusjr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix looks right to me.

tests/jobs/test_scheduler_job.py Outdated Show resolved Hide resolved
Co-authored-by: Tzu-ping Chung <uranusjr@gmail.com>
@potiuk
Copy link
Member Author

potiuk commented Dec 13, 2021

Let's get it run one more time :)

@potiuk potiuk merged commit 9b277db into apache:main Dec 13, 2021
@potiuk potiuk deleted the restore-stability-in-test-scheduler-job branch December 13, 2021 17:55
@potiuk potiuk added this to the Airflow 2.2.4 milestone Jan 22, 2022
@potiuk potiuk added the changelog:skip Changes that should be skipped from the changelog (CI, tests, etc..) label Jan 22, 2022
potiuk added a commit that referenced this pull request Jan 22, 2022
* Restore stability and unquarantine all test_scheduler_job tests

The scheduler job tests were pretty flaky and some of them were
quarantined already (especially the query count). This PR improves
the stability in the following ways:

* clean the database between tests for TestSchedulerJob to avoid
  side effects
* forces UTC timezone in tests where date missed timezone specs
* updates number of queries expected in the query count tests
* stabilizes the sequence of retrieval of tasks in case tests
  depended on it
* adds more stack trace levels (5) to compare where extra
  methods were called.
* increase number of scheduler runs where it was needed
* add session.flush() where it was missing
* add requirement to have serialized dags ready when needed
* increase dagruns number to process where we could have
  some "too slow" tests comparing to fast processing of
  dag runs.

Hopefully:

* Fixes: #18777
* Fixes: #17291
* Fixes: #17224
* Fixes: #15255
* Fixes: #15085

* Update tests/jobs/test_scheduler_job.py

Co-authored-by: Tzu-ping Chung <uranusjr@gmail.com>

Co-authored-by: Tzu-ping Chung <uranusjr@gmail.com>
(cherry picked from commit 9b277db)
jedcunningham pushed a commit that referenced this pull request Jan 27, 2022
* Restore stability and unquarantine all test_scheduler_job tests

The scheduler job tests were pretty flaky and some of them were
quarantined already (especially the query count). This PR improves
the stability in the following ways:

* clean the database between tests for TestSchedulerJob to avoid
  side effects
* forces UTC timezone in tests where date missed timezone specs
* updates number of queries expected in the query count tests
* stabilizes the sequence of retrieval of tasks in case tests
  depended on it
* adds more stack trace levels (5) to compare where extra
  methods were called.
* increase number of scheduler runs where it was needed
* add session.flush() where it was missing
* add requirement to have serialized dags ready when needed
* increase dagruns number to process where we could have
  some "too slow" tests comparing to fast processing of
  dag runs.

Hopefully:

* Fixes: #18777
* Fixes: #17291
* Fixes: #17224
* Fixes: #15255
* Fixes: #15085

* Update tests/jobs/test_scheduler_job.py

Co-authored-by: Tzu-ping Chung <uranusjr@gmail.com>

Co-authored-by: Tzu-ping Chung <uranusjr@gmail.com>
(cherry picked from commit 9b277db)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:Scheduler including HA (high availability) scheduler changelog:skip Changes that should be skipped from the changelog (CI, tests, etc..) full tests needed We need to run full set of tests for this PR to merge use public runners Makes sure that Public runners are used even if commiters creates the PR (useful for testing)
Projects
None yet
5 participants