Reduce flakiness in test fts_segment_reset #518

jiaqizho · 2024-07-12T11:31:48Z

fix #ISSUE_Number

Change logs

Original commit message

We have seen flakiness in the test fts_segment_reset because FTS would sometimes still promote the mirror if the primary takes a bit longer to restart after getting out of the RESET stage. An example:

- Primary 0 gets out of RESET and is about to be restarted:
  2022-05-23 15:32:53.924540 UTC,,,p105578,th1560833280,,,,0,,,seg0,,,,,"LOG","00000","all server processes terminated; reinitializing",,,,,,,0,,"postmaster.c",4284,
- And it takes primary 0 about 2-3 seconds to do so:
  2022-05-23 15:32:56.184117 UTC,,,p105578,th1560833280,,,,0,,,seg0,,,,,"LOG","00000","database system is ready to accept connections"
- Unfortunately, before primary 0 could restart, FTS makes one last probe and finds that it is in recovery mode and not making progress (which is "correct" because primary 0 has finished recovery):
  2022-05-23 15:32:56.009206 UTC,,,p102591,th2023709952,,,,0,con3,,seg-1,,,,,"LOG","00000","FTS: detected segment is in recovery mode and not making progress (content=0) primary dbid=2, mirror dbid=5",,,,,,,0,,"ftsprobe.c",254,
  2022-05-23 15:32:56.065399 UTC,,,p102591,th2023709952,,,,0,con3,,seg-1,,,,,"LOG","00000","FTS max (5) retries exhausted (content=0, dbid=2) state=9",,,,,,,0,,"ftsprobe.c",788
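To make the race in the log excerpt concrete, here is a minimal sketch that computes the gaps between the four timestamps quoted above. Only the timestamps come from the logs; the variable names and the helper `parse_ts` are made up for illustration and are not part of the test or of FTS.

```python
from datetime import datetime

# Timestamp layout used in the quoted log lines (the trailing " UTC" is dropped).
TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def parse_ts(text):
    """Parse one log timestamp (hypothetical helper, for illustration only)."""
    return datetime.strptime(text, TS_FORMAT)


# Timestamps copied from the log lines in the commit message.
reinitializing = parse_ts("2022-05-23 15:32:53.924540")  # primary leaves RESET
last_probe = parse_ts("2022-05-23 15:32:56.009206")      # FTS: not making progress
retries_done = parse_ts("2022-05-23 15:32:56.065399")    # FTS max retries exhausted
ready = parse_ts("2022-05-23 15:32:56.184117")           # primary accepts connections

# How long the primary needed to come back after leaving RESET.
restart_seconds = (ready - reinitializing).total_seconds()

# How much earlier FTS gave up than the primary became ready.
missed_by_seconds = (ready - retries_done).total_seconds()

print(f"restart took {restart_seconds:.3f} s")
print(f"FTS gave up {missed_by_seconds:.3f} s before the primary was ready")
```

In this run the restart took a little over 2 seconds, and FTS exhausted its retries roughly a tenth of a second before the primary was ready, which is why a slightly slower restart is enough to trigger the promotion.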

Currently, we let the primary stay in the RESET stage for 27 seconds. FTS has a default 5-second retry cycle, at the end of which it makes the promote decision. That leaves about 3 seconds for the primary to start after getting out of RESET, which is probably too short. Now make the retry cycle 15 seconds and the RESET delay 17 seconds. That leaves about 13 seconds for the primary to start after that, which should be enough to reduce the common flakiness.

Why are the changes needed?

Describe why the changes are necessary.

Does this PR introduce any user-facing change?

If yes, please clarify the previous behavior and the change this PR proposes.

How was this patch tested?

Please detail how the changes were tested, including manual tests and any relevant unit or integration tests.

Contributor's Checklist

Here are some reminders and checklists to review before or when submitting your pull request:

Make sure your Pull Request has a clear title and commit message. You can use the git-commit template as a reference.
Sign the Contributor License Agreement as prompted for your first-time contribution (one-time setup).
Learn the coding contribution guide, including our code conventions, workflow and more.
List your communication in the GitHub Issues or Discussions (if any).
Document changes.
Add tests for the change
Pass make installcheck
Pass make -C src/test installcheck-cbdb-parallel
Feel free to request cloudberrydb/dev team for review and approval when your PR is ready🥳

CLAassistant · 2024-07-12T11:31:54Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

avamingli approved these changes Jul 12, 2024

jiaqizho force-pushed the fix-flakiness-fts_segment_reset branch from 2fbb9ac to 9c18559 Compare July 12, 2024 12:25

my-ship-it approved these changes Jul 12, 2024

my-ship-it merged commit cd359e4 into apache:main Jul 12, 2024
10 of 11 checks passed
