Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix flaky test in service_domain_db_SUITE #4072

Merged
merged 1 commit into from
Aug 3, 2023

Conversation

arcusfelis
Copy link
Contributor

@arcusfelis arcusfelis commented Aug 1, 2023

Fix a race condition in service_domain_db_suite.db_keeps_syncing_after_cluster_join testcase.

Our tests assume that sending message from mim() to mim2() node is instant.
But it is not true, so check_for_updates message is received after sync_local call in tests.
I.e. the following scenario fails:

  1. Test node asks mim2() to insert a domain.
  2. mim2() sends check_for_updates to mim() node.
  3. Test node calls sync_local on mim() node.
  4. mim() receives sync_local and replies to the test node.
  5. mim() finally receives check_for_updates from mim2(), send in step 2.

This PR addresses https://circleci-mim-results.s3.eu-central-1.amazonaws.com/PR/4066/183250/pgsql_cets.25.3-amd64/big/ct_run.test%4048ff31232359.2023-08-01_06.11.42/big_tests.tests.service_domain_db_SUITE.logs/run.2023-08-01_06.28.17/service_domain_db_suite.db_keeps_syncing_after_cluster_join.html

test_server:ts_tc failed on line 1782
Reason: {test_case_failed,{[<<"example1.com">>,<<"example2.com">>,<<"example3.com">>],
                   [<<"example1.com">>,<<"example2.com">>,<<"example3.com">>,
                    <<"example4.com">>]}}

If we add some debug logging we can see that ping sometimes arrives earlier than check_for_updates:

when=2023-08-01T11:31:32.523697+00:00 level=error 
what=ping pid=<0.42026.0> at=service_domain_db:handle_call/3:107 
(Ping from sync_local())

when=2023-08-01T11:31:32.523827+00:00 level=error what=check_for_updates pid=<0.42026.0> at=service_domain_db:handle_check_for_updates/3:143 
doms="[{<<\"example1.com\">>,<<\"dummy auth\">>}]" info_from_node=mongooseim2@localhost info_msg_host_type="dummy auth" info_msg_domain=example2.com 
(check_for_updates from insert_domain call)

Proposed changes include:

  • Actually check that check_for_updates is received by the local server, even if it is send by a remote server.

It is easy to reproduce the failure with repeat_until_any_fail for this testcase (and after the fix I cannot trigger the error anymore).

How the flaky test looks like:

db_keeps_syncing_after_cluster_join(Config) ->
    HostType = dummy_auth_host_type(),
    %% GIVING mim1 and mim2 are not clustered.
    %% Ask mim1 to join mim2's cluster
    %% (and mongooseim application gets restarted on mim1)
    leave_cluster(Config),
    service_enabled(mim()),
    {ok, _} = insert_domain(mim(), <<"example1.com">>, HostType),
    {ok, _} = insert_domain(mim2(), <<"example2.com">>, HostType),
    sync(),
    %% Nodes don't have to be clustered to sync the domains.
    %% XXX Could fail here (flaky!)
    assert_domains_are_equal(HostType),
    %% WHEN Adding mim1 joins into a cluster
    %% (and mongooseim application gets restarted on mim1)
    join_cluster(Config),
    service_enabled(mim()),
    %% THEN Sync is successful
    {ok, _} = insert_domain(mim(), <<"example3.com">>, HostType),
    {ok, _} = insert_domain(mim2(), <<"example4.com">>, HostType),
    sync(),
    %% XXX Could fail here too (flaky!)
    assert_domains_are_equal(HostType).

So, there are two places assert_domains_are_equal could fail before the fix.

It does not change any server logic, it is only fixes the test (because the correct state is achieved eventually, the test is just checking the condition too fast) .

…r_cluster_join

sync() assumes that sending message from mim() to mim2() node is instant

But it is not true, so check_for_updates message is received after sync_local call
in tests.

I.e. the following scenario fails:

1. Test node asks mim2() to insert a domain.
2. mim2() sends check_for_updates to mim() node.
3. Test node calls sync_local on mim() node.
4. mim() calls sync_local and replies to the test node.
5. mim() finally receives check_for_updates from mim2(), send in step 2.
@mongoose-im
Copy link
Collaborator

mongoose-im commented Aug 1, 2023

small_tests_24 / small_tests / 139130e
Reports root / small


elasticsearch_and_cassandra_25 / elasticsearch_and_cassandra_mnesia / 139130e
Reports root/ big
OK: 369 / Failed: 0 / User-skipped: 38 / Auto-skipped: 0


small_tests_25 / small_tests / 139130e
Reports root / small


small_tests_25_arm64 / small_tests / 139130e
Reports root / small


ldap_mnesia_24 / ldap_mnesia / 139130e
Reports root/ big
OK: 2258 / Failed: 0 / User-skipped: 827 / Auto-skipped: 0


dynamic_domains_pgsql_mnesia_24 / pgsql_mnesia / 139130e
Reports root/ big
OK: 4221 / Failed: 0 / User-skipped: 82 / Auto-skipped: 0


ldap_mnesia_25 / ldap_mnesia / 139130e
Reports root/ big
OK: 2258 / Failed: 0 / User-skipped: 827 / Auto-skipped: 0


dynamic_domains_mysql_redis_25 / mysql_redis / 139130e
Reports root/ big
OK: 4195 / Failed: 0 / User-skipped: 108 / Auto-skipped: 0


dynamic_domains_pgsql_mnesia_25 / pgsql_mnesia / 139130e
Reports root/ big
OK: 4221 / Failed: 0 / User-skipped: 82 / Auto-skipped: 0


dynamic_domains_mssql_mnesia_25 / odbc_mssql_mnesia / 139130e
Reports root/ big
OK: 4223 / Failed: 4 / User-skipped: 85 / Auto-skipped: 0

disco_and_caps_SUITE:disco_with_caps:user_can_query_friend_resources
{error,{{assertion_failed,assert_many,false,[is_roster_set],[],[]},
    [{escalus_new_assert,assert_true,2,
               [{file,"/home/circleci/project/big_tests/_build/default/lib/escalus/src/escalus_new_assert.erl"},
                {line,84}]},
     {escalus_story,'-make_all_clients_friends/1-fun-0-',2,
            [{file,"/home/circleci/project/big_tests/_build/default/lib/escalus/src/escalus_story.erl"},
             {line,108}]},
     {escalus_utils,'-each_with_index/3-fun-0-',3,
            [{file,"/home/circleci/project/big_tests/_build/default/lib/escalus/src/escalus_utils.erl"},
             {line,87}]},
     {lists,foldl_1,3,[{file,"lists.erl"},{line,1355}]},
     {escalus_utils,'-each_with_index/3-fun-0-',3,
            [{file,"/home/circleci/project/big_tests/_build/default/lib/escalus/src/escalus_utils.erl"},
             {line,87}]},
     {lists,foldl,3,[{file,"lists.erl"},{line,1350}]},
     {escalus_utils,distinct_pairs,2,
            [{file,"/home/circleci/project/big_tests/_build/default/lib/escalus/src/escalus_utils.erl"},
             {line,60}]},
     {escalus_story,make_all_clients_friends,1,
            [{file,"/home/circleci/project/big_tests/_build/default/lib/escalus/src/escalus_story.erl"},
             {line,106}]}]}}

Report log

disco_and_caps_SUITE:disco_with_caps:user_can_query_friend_features
{error,{{assertion_failed,assert_many,false,[is_roster_set],[],[]},
    [{escalus_new_assert,assert_true,2,
               [{file,"/home/circleci/project/big_tests/_build/default/lib/escalus/src/escalus_new_assert.erl"},
                {line,84}]},
     {escalus_story,'-make_all_clients_friends/1-fun-0-',2,
            [{file,"/home/circleci/project/big_tests/_build/default/lib/escalus/src/escalus_story.erl"},
             {line,108}]},
     {escalus_utils,'-each_with_index/3-fun-0-',3,
            [{file,"/home/circleci/project/big_tests/_build/default/lib/escalus/src/escalus_utils.erl"},
             {line,87}]},
     {lists,foldl_1,3,[{file,"lists.erl"},{line,1355}]},
     {escalus_utils,'-each_with_index/3-fun-0-',3,
            [{file,"/home/circleci/project/big_tests/_build/default/lib/escalus/src/escalus_utils.erl"},
             {line,87}]},
     {lists,foldl,3,[{file,"lists.erl"},{line,1350}]},
     {escalus_utils,distinct_pairs,2,
            [{file,"/home/circleci/project/big_tests/_build/default/lib/escalus/src/escalus_utils.erl"},
             {line,60}]},
     {escalus_story,make_all_clients_friends,1,
            [{file,"/home/circleci/project/big_tests/_build/default/lib/escalus/src/escalus_story.erl"},
             {line,106}]}]}}

Report log

disco_and_caps_SUITE:disco_with_caps:user_cannot_query_stranger_features
{error,
  {timeout_when_waiting_for_stanza,
    [{escalus_client,wait_for_stanza,
       [{client,
          <<"alice_user_cannot_query_stranger_features_619@domain.example.com/res1">>,
          escalus_tcp,<0.12918.0>,
          [{event_manager,<0.12882.0>},
           {server,<<"domain.example.com">>},
           {username,
             <<"alicE_user_cannot_query_stranger_features_619">>},
           {resource,<<"res1">>}],
          [{event_client,
             [{event_manager,<0.12882.0>},
            {server,<<"domain.example.com">>},
            {username,
              <<"alicE_user_cannot_query_stranger_features_619">>},
            {resource,<<"res1">>}]},
           {resource,<<"res1">>},
           {username,
             <<"alice_user_cannot_query_stranger_features_619">>},
           {server,<<"domain.example.com">>},
           {host,<<"localhost">>},
           {port,5222},
           {auth,{escalus_auth,auth_plain}},
           {wspath,undefined},
           {username,
             <<"alicE_user_cannot_query_stranger_features_619">>},
           {server,<<"domain.example.com">>},
           {host,<<"localhost">>},
           {password,<<"matygrysa">>},
           {stream_id,<<"3bbf4c7dd1186876">>}]},
        5000],
       [{file,
          "/home/circleci/project/big_tests/_build/default/lib/escalus/src/escalus_client.erl"},
        {line,136}]},
     {disco_and_caps_SUITE,
       '-user_cannot_query_stranger_features/1-fun-0-',2,
       [{file,
          "/h...

Report log

disco_and_caps_SUITE:disco_with_caps:user_cannot_query_friend_resources_with_unknown_node
{error,{{assertion_failed,assert_many,false,[is_roster_set],[],[]},
    [{escalus_new_assert,assert_true,2,
               [{file,"/home/circleci/project/big_tests/_build/default/lib/escalus/src/escalus_new_assert.erl"},
                {line,84}]},
     {escalus_story,'-make_all_clients_friends/1-fun-0-',2,
            [{file,"/home/circleci/project/big_tests/_build/default/lib/escalus/src/escalus_story.erl"},
             {line,108}]},
     {escalus_utils,'-each_with_index/3-fun-0-',3,
            [{file,"/home/circleci/project/big_tests/_build/default/lib/escalus/src/escalus_utils.erl"},
             {line,87}]},
     {lists,foldl_1,3,[{file,"lists.erl"},{line,1355}]},
     {escalus_utils,'-each_with_index/3-fun-0-',3,
            [{file,"/home/circleci/project/big_tests/_build/default/lib/escalus/src/escalus_utils.erl"},
             {line,87}]},
     {lists,foldl,3,[{file,"lists.erl"},{line,1350}]},
     {escalus_utils,distinct_pairs,2,
            [{file,"/home/circleci/project/big_tests/_build/default/lib/escalus/src/escalus_utils.erl"},
             {line,60}]},
     {escalus_story,make_all_clients_friends,1,
            [{file,"/home/circleci/project/big_tests/_build/default/lib/escalus/src/escalus_story.erl"},
             {line,106}]}]}}

Report log


internal_mnesia_25 / internal_mnesia / 139130e
Reports root/ big
OK: 2403 / Failed: 1 / User-skipped: 681 / Auto-skipped: 0

pubsub_SUITE:dag+basic:publish_with_max_items_test
{error,{{badmatch,false},
    [{pubsub_tools,check_response,2,
             [{file,"/home/circleci/project/big_tests/tests/pubsub_tools.erl"},
            {line,444}]},
     {pubsub_tools,receive_response,3,
             [{file,"/home/circleci/project/big_tests/tests/pubsub_tools.erl"},
            {line,434}]},
     {pubsub_tools,receive_and_check_response,4,
             [{file,"/home/circleci/project/big_tests/tests/pubsub_tools.erl"},
            {line,424}]},
     {escalus_story,story,4,
            [{file,"/home/circleci/project/big_tests/_build/default/lib/escalus/src/escalus_story.erl"},
             {line,72}]},
     {test_server,ts_tc,3,[{file,"test_server.erl"},{line,1782}]},
     {test_server,run_test_case_eval1,6,
            [{file,"test_server.erl"},{line,1291}]},
     {test_server,run_test_case_eval,9,
            [{file,"test_server.erl"},{line,1223}]}]}}

Report log


pgsql_mnesia_24 / pgsql_mnesia / 139130e
Reports root/ big
OK: 4604 / Failed: 0 / User-skipped: 89 / Auto-skipped: 0


pgsql_mnesia_25 / pgsql_mnesia / 139130e
Reports root/ big
OK: 4604 / Failed: 0 / User-skipped: 89 / Auto-skipped: 0


mysql_redis_25 / mysql_redis / 139130e
Reports root/ big
OK: 4590 / Failed: 0 / User-skipped: 103 / Auto-skipped: 0


mssql_mnesia_25 / odbc_mssql_mnesia / 139130e
Reports root/ big
OK: 4601 / Failed: 0 / User-skipped: 92 / Auto-skipped: 0


internal_mnesia_25 / internal_mnesia / 139130e
Reports root/ big
OK: 2404 / Failed: 0 / User-skipped: 681 / Auto-skipped: 0

@codecov
Copy link

codecov bot commented Aug 1, 2023

Codecov Report

Patch coverage: 100.00% and no project coverage change.

Comparison is base (6cef853) 83.88% compared to head (139130e) 83.89%.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #4072   +/-   ##
=======================================
  Coverage   83.88%   83.89%           
=======================================
  Files         526      526           
  Lines       33181    33187    +6     
=======================================
+ Hits        27834    27842    +8     
+ Misses       5347     5345    -2     
Files Changed Coverage Δ
src/domain/service_domain_db.erl 87.03% <100.00%> (+3.70%) ⬆️

... and 8 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

true = is_pid(LocalPid),
Nodes = [node(Pid) || Pid <- all_members()],
%% Ping from all nodes in the cluster
[pong = rpc:call(Node, ?MODULE, ping, [LocalPid]) || Node <- Nodes],
Copy link
Collaborator

@NelsonVides NelsonVides Aug 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not erpc:multicall/4 instead maybe? Just a suggestion, otherwise I can merge as is.

Copy link
Contributor Author

@arcusfelis arcusfelis Aug 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just to keep it simple.
It is basically 2 nodes (mim and mim2) in the tests anyway.
And this function is used only in tests :D

And one of the nodes is local :)

@NelsonVides
Copy link
Collaborator

Btw, super good PR description, love it 😄

@NelsonVides NelsonVides merged commit 017a1b5 into master Aug 3, 2023
@NelsonVides NelsonVides deleted the fix-service_domain_db-tests branch August 3, 2023 10:38
@chrzaszcz chrzaszcz added this to the 6.2.0 milestone Dec 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants