Restore stability and unquarantine all test_scheduler_job tests #19860
Conversation
Force-pushed from 16f174e to a8fcdc0
Force-pushed from a8fcdc0 to 9cf5055
Force-pushed from 9cf5055 to 2438009
Running "full" tests to get a few runs and check flakiness (locally I ran those hundreds of times, but it seems some scenarios might only happen in parallel runs).
Force-pushed from 2438009 to f444f47
Force-pushed from f444f47 to 53a1c74
Yep. Seems it's still flaky... looking at it.
Force-pushed from ec3a4e8 to 9be6803
I could not reproduce it locally at all, but I added an extra flush(). I believe the problem was that the queries might not have been flushed yet after the ORM modifications (but I am guessing a bit here). I will try to run it more times.
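For context, a minimal standalone sketch (plain SQLAlchemy, not the actual test code) of why an explicit flush can matter: flush() emits the pending INSERT/UPDATE statements within the current transaction, so SQL issued afterwards on the same connection sees them even before commit. The model name here is a hypothetical stand-in.

```python
# Illustrative sketch only. With the default autoflush the ORM usually flushes
# before a query anyway, but an explicit flush() makes the ordering
# deterministic when other code reads the tables mid-test.
from sqlalchemy import Column, Integer, String, create_engine, select
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class DagRunStub(Base):  # hypothetical stand-in for an ORM model touched by the test
    __tablename__ = "dag_run_stub"
    id = Column(Integer, primary_key=True)
    dag_id = Column(String(250))


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(DagRunStub(dag_id="d1"))
    # flush() sends the pending INSERT to the database inside the current
    # transaction; later SQL on this connection can see the row before commit.
    session.flush()
    assert session.execute(select(DagRunStub.dag_id)).scalars().all() == ["d1"]
```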
Force-pushed from 9be6803 to d65807d
Hmm. @ashb @uranusjr @ephraimbuddy maybe you could take a look at this one and help me figure out what happened. I have a theory and I am trying it in the latest commit, but I just debugged and learned more about how dag_maker works internally (pretty cool actually). Below you will see a coloured dump of the failing …

Here is my line of thinking: what I see there is that when the "create_dagrun" loop for the … When (normally) the test runs successfully, there are, as expected, no error messages, DagRuns are created and life is good. So clearly the reason for the test failing are those 'Couldn't find dag' errors. But from what I understand of how get_dag() works, the only way to get …

For whatever reason, when the tests get to the …

My theory is that for some reason the second run of the dag_maker used a different SQLAlchemy session. This is the only way I could explain this behaviour, but I am not sure how it could happen. Yes, the dag_makers did not have a session passed to them - which triggers own_session - so each of the dag_maker contexts would create a new Session(). But at least in theory they should be in the same scope. From what I understand, SQLAlchemy stores the actual session in a thread-local and it is reused if Session() is called twice. But maybe I do not understand SQLAlchemy session behaviour and there is a way the same thread gets two different sessions (with different ORM models) if it calls Session() twice.

I am trying out my theory by adding a change where the two dag_makers actually share a session created beforehand. I moved session creation up and passed it to the dag_makers. Previously those tests failed in 1/2 runs consistently, so I hope we will see the result soon. But if it is confirmed, then I wonder under what circumstances this could happen.
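For reference, a small standalone sketch (not Airflow code) of the SQLAlchemy behaviour the theory relies on: a scoped_session registry hands out one session per thread, so calling Session() twice from the same thread returns the same object unless the registry is cleared or rebuilt in between.

```python
# Standalone illustration of the assumption in the comment above: a
# scoped_session is registered per thread, so Session() called twice in the
# same thread returns the same Session instance (and the same identity map).
from sqlalchemy import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker

engine = create_engine("sqlite://")
Session = scoped_session(sessionmaker(bind=engine))

s1 = Session()
s2 = Session()
assert s1 is s2  # same thread -> same session from the registry

Session.remove()  # after remove(), the next call creates a fresh session
s3 = Session()
assert s3 is not s1
```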
Hmm. Interesting one - it did not fix the problem (so my hypothesis was wrong): https://github.com/apache/airflow/runs/4357877756?check_suite_focus=true#step:6:4065

Also we have another flaky test here: test_scheduler_task_start_date https://github.com/apache/airflow/runs/4357877606?check_suite_focus=true#step:6:4013
Force-pushed from f965207 to c5220b7
OK. I have very interesting findings and I might need some brainstorming @ashb @ephraimbuddy @uranusjr. I think I know what is causing the problem, but I have not figured out the mechanism (yet). I am running some experiments to get more data, but the problem with … seems to be caused by a deferred re-importing of …

Some observations: this problem is always accompanied by this log entry (which is generated by the dag processor reloading settings, I believe):

This entry appears asynchronously at various places - sometimes during the test, sometimes in …
I see three interesting cases:

1. Reload happening after or around the second DAG syncing - here only d2 was missing.
2. Reload happening before DAG syncing - here both dags are missing. This is VERY interesting because it seems that the reload affects creation of the SerializedDagModels created after the reload, and not retrieval of them later.
3. Reload happening between the two DAG syncs - here again d2 was missing.
It looks like it is not connected with exactly the moment of the log, but I believe it is caused by what happens next, after the "Timezone" log is logged. When I look at the code, right after re-importing the settings, DagProcessor calls … It looks literally as if SerializedDAGModels created by …

However, I failed to come up with a theory for why such a reload in a spawned process might really cause this. And why is the spawned DagProcessor not killed by the pytest fixture?

Any help and brainstorming appreciated.
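To make the suspicion concrete, here is a toy reproduction in plain SQLAlchemy (not the Airflow code path, and only a guess at the mechanism): if a settings reload rebuilds the Session registry, rows added through the old registry but not yet committed are invisible to queries issued through the new registry, because the two registries hand out sessions on different connections and transactions. The model name is a hypothetical stand-in for SerializedDagModel; SQLAlchemy 1.4+ is assumed.

```python
# Toy sketch of one way a rebuilt Session registry could make freshly
# created rows invisible to later lookups. Not the confirmed cause.
import os
import tempfile

from sqlalchemy import Column, Integer, String, create_engine, select
from sqlalchemy.orm import declarative_base, scoped_session, sessionmaker

Base = declarative_base()


class SerializedDagStub(Base):  # hypothetical stand-in for SerializedDagModel
    __tablename__ = "serialized_dag_stub"
    id = Column(Integer, primary_key=True)
    dag_id = Column(String(250))


fd, db_path = tempfile.mkstemp(suffix=".db")
os.close(fd)
engine = create_engine(f"sqlite:///{db_path}")
Base.metadata.create_all(engine)

OldSession = scoped_session(sessionmaker(bind=engine))  # registry before the "reload"
NewSession = scoped_session(sessionmaker(bind=engine))  # registry after the "reload"

old = OldSession()
old.add(SerializedDagStub(dag_id="d2"))
old.flush()  # flushed to the DB, but not committed

new = NewSession()
found = new.execute(
    select(SerializedDagStub).where(SerializedDagStub.dag_id == "d2")
).first()
assert found is None  # the "new" session cannot see the uncommitted row

old.rollback()
OldSession.remove()
NewSession.remove()
os.unlink(db_path)
```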
All green with test_scheduler_job :). Closing/reopening with …
This one looks really good - both public runners and self-hosted do not seem to be flaky (at least the few times I ran the full test suite). I think it is ready to merge.
Seems like for our tests the time of the flaky pandemic is close to being over and our quarantined test count will go down ;). Wish that were true in real life. After this one is merged, those are the only quarantined tests left (I am looking at reviewing/fixing them after this one gets merged).
Running once again - all tests were OK except the MSSQL one, which is an unrelated problem we have to solve that happens occasionally with the public runner.
Close/reopen to check again (and again).
The MSSQL failures should be handled by #20231 |
4 x integration test failures (those that #20231 should handle nicely). Running again, but it seems those tests are pretty stable now.
Yep. I've run it many times now. Only unrelated errors (a flaky k8s test for public runners this time). Shall we?
The fix looks right to me.
Co-authored-by: Tzu-ping Chung <uranusjr@gmail.com>
Let's get it run one more time :)
* Restore stability and unquarantine all test_scheduler_job tests

  The scheduler job tests were pretty flaky and some of them were quarantined already (especially the query count). This PR improves the stability in the following ways:

  * clean the database between tests for TestSchedulerJob to avoid side effects
  * force UTC timezone in tests where dates missed timezone specs
  * update the number of queries expected in the query count tests
  * stabilize the sequence of retrieval of tasks in case tests depended on it
  * add more stack trace levels (5) to compare where extra methods were called
  * increase the number of scheduler runs where it was needed
  * add session.flush() where it was missing
  * add the requirement to have serialized dags ready when needed
  * increase the number of dagruns to process where we could have some "too slow" tests compared to fast processing of dag runs

  Hopefully:

  * Fixes: #18777
  * Fixes: #17291
  * Fixes: #17224
  * Fixes: #15255
  * Fixes: #15085

* Update tests/jobs/test_scheduler_job.py

  Co-authored-by: Tzu-ping Chung <uranusjr@gmail.com>

Co-authored-by: Tzu-ping Chung <uranusjr@gmail.com>
(cherry picked from commit 9b277db)
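As an illustration of the first stabilization step above (cleaning the database between tests), here is a hedged sketch of what such a pytest fixture can look like. The fixture name is hypothetical and the actual cleanup in the PR may differ; DagRun, TaskInstance, DagModel, SerializedDagModel, and create_session are existing Airflow objects.

```python
# Illustrative sketch only -- not the exact fixture used in the PR.
# Wiping scheduler-related tables prevents DagRuns or serialized DAGs left
# over from one test from changing query counts or scheduling decisions in
# the next one.
import pytest

from airflow.models import DagModel, DagRun, TaskInstance
from airflow.models.serialized_dag import SerializedDagModel
from airflow.utils.session import create_session


@pytest.fixture(autouse=True)
def clean_scheduler_tables():
    def _clean():
        with create_session() as session:
            # Delete children before parents to respect FK constraints.
            session.query(TaskInstance).delete()
            session.query(DagRun).delete()
            session.query(SerializedDagModel).delete()
            session.query(DagModel).delete()

    _clean()
    yield
    _clean()
```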
The scheduler job tests were pretty flaky and some of them were quarantined already (especially the query count). This PR improves the stability in the following ways:

* clean the database between tests for TestSchedulerJob to avoid side effects
* force UTC timezone in tests where dates missed timezone specs
* update the number of queries expected in the query count tests
* stabilize the sequence of retrieval of tasks in case tests depended on it
* add more stack trace levels (5) to compare where extra methods were called
* increase the number of scheduler runs where it was needed
* add session.flush() where it was missing
* add the requirement to have serialized dags ready when needed
* increase the number of dagruns to process where we could have some "too slow" tests compared to fast processing of dag runs

Hopefully:

* Fixes: #18777
* Fixes: #17291
* Fixes: #17224
* Fixes: #15255
* Fixes: #15085
(Based on #20054 to test if all green).
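For the "force UTC timezone" bullet, a minimal sketch of the difference; airflow.utils.timezone.datetime is Airflow's helper for tz-aware UTC datetimes, and the specific call sites changed in the PR are not shown here.

```python
# Naive datetimes cannot be compared with the tz-aware values Airflow stores,
# and their interpretation can depend on configuration, which makes
# date-based assertions flaky.
from datetime import datetime

from airflow.utils import timezone

naive = datetime(2021, 11, 26)           # no tzinfo attached
aware = timezone.datetime(2021, 11, 26)  # tz-aware, pinned to UTC

assert naive.tzinfo is None
assert aware.tzinfo is not None
assert aware.utcoffset().total_seconds() == 0
```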