Reduce flakiness in test fts_segment_reset #518

Merged 1 commit on Jul 12, 2024

@jiaqizho (Contributor) commented Jul 12, 2024

fix #ISSUE_Number


Change logs

Original commit message

The test fts_segment_reset has been flaky because FTS sometimes still promotes the mirror when the primary takes a little longer to restart after leaving the RESET stage. For example:

- Primary 0 leaves RESET and is about to be restarted:
2022-05-23 15:32:53.924540 UTC,,,p105578,th1560833280,,,,0,,,seg0,,,,,"LOG","00000","all server processes terminated; reinitializing",,,,,,,0,,"postmaster.c",4284,
- The restart takes primary 0 about 2-3 seconds:
2022-05-23 15:32:56.184117 UTC,,,p105578,th1560833280,,,,0,,,seg0,,,,,"LOG","00000","database system is ready to accept connections"
- Unfortunately, before primary 0 finishes restarting, FTS makes one last probe and finds the segment in recovery mode and not making progress (which is "correct", because primary 0 has finished recovery):
2022-05-23 15:32:56.009206 UTC,,,p102591,th2023709952,,,,0,con3,,seg-1,,,,,"LOG","00000","FTS: detected segment is in recovery mode and not making progress (content=0) primary dbid=2, mirror dbid=5",,,,,,,0,,"ftsprobe.c",254,
2022-05-23 15:32:56.065399 UTC,,,p102591,th2023709952,,,,0,con3,,seg-1,,,,,"LOG","00000","FTS max (5) retries exhausted (content=0, dbid=2) state=9",,,,,,,0,,"ftsprobe.c",788
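How narrow the race was can be read off the timestamps in the log lines quoted above. The snippet below is only an illustrative sketch: the variable names are made up for this example, and the three timestamps are copied from the quoted log lines.

```python
from datetime import datetime

# Timestamp layout used at the start of each quoted log line (UTC suffix dropped).
FMT = "%Y-%m-%d %H:%M:%S.%f"

# "all server processes terminated; reinitializing" on seg0
reinit = datetime.strptime("2022-05-23 15:32:53.924540", FMT)
# "database system is ready to accept connections" on seg0
ready = datetime.strptime("2022-05-23 15:32:56.184117", FMT)
# "FTS max (5) retries exhausted" on the coordinator (seg-1)
exhausted = datetime.strptime("2022-05-23 15:32:56.065399", FMT)

# How long the primary needed to come back after leaving RESET.
restart_seconds = (ready - reinit).total_seconds()
# How much earlier FTS gave up than the primary became ready.
margin_seconds = (ready - exhausted).total_seconds()

print(f"primary restart took {restart_seconds:.3f} s")
print(f"FTS exhausted its retries {margin_seconds:.3f} s before the primary was ready")
```

In this run the primary needed a little over 2 seconds to restart, and FTS exhausted its retries roughly a tenth of a second before the primary reported it was ready.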

Currently, we let the primary stay in the RESET stage for 27 seconds. FTS has a default retry cycle of 5 seconds, at the end of which it makes the promote decision. That leaves about 3 seconds for the primary to start after leaving RESET, which is probably too short. This change makes the retry cycle 15 seconds and the RESET delay 17 seconds. That leaves about 13 seconds for the primary to start, which should be enough to reduce the common flakiness.
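The figures in the paragraph above (27 + 3 and 17 + 13) are consistent with a total window of about 30 seconds before FTS makes its promote decision. That 30-second figure is inferred from the arithmetic and is not stated in the commit message, so the sketch below is a rough illustration under that assumption; `ASSUMED_WINDOW_SECONDS` and `restart_slack` are hypothetical names, not settings or functions from the code base.

```python
# ASSUMPTION: total time before FTS decides to promote, inferred from
# 27 + 3 = 17 + 13 = 30 in the commit message; not a documented setting.
ASSUMED_WINDOW_SECONDS = 30


def restart_slack(reset_delay_seconds, window_seconds=ASSUMED_WINDOW_SECONDS):
    """Seconds left for the primary to restart after it leaves RESET."""
    return window_seconds - reset_delay_seconds


old_slack = restart_slack(27)  # RESET delay before this change
new_slack = restart_slack(17)  # RESET delay after this change

# Restart time seen in the quoted logs, rounded (about 2.26 s).
observed_restart_seconds = 2.26

print(f"old slack: {old_slack} s, headroom over observed restart: "
      f"{old_slack - observed_restart_seconds:.2f} s")
print(f"new slack: {new_slack} s, headroom over observed restart: "
      f"{new_slack - observed_restart_seconds:.2f} s")
```

Under this assumption the old settings left well under a second of headroom over the restart time seen in the logs, while the new settings leave more than ten seconds.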


Contributor's Checklist

Here are some reminders and checklists before/when submitting your pull request, please check them:

  • Make sure your Pull Request has a clear title and commit message. You can take git-commit template as a reference.
  • Sign the Contributor License Agreement as prompted for your first-time contribution (one-time setup).
  • Learn the coding contribution guide, including our code conventions, workflow and more.
  • List your communication in the GitHub Issues or Discussions (if any, or if needed).
  • Document changes.
  • Add tests for the change.
  • Pass make installcheck
  • Pass make -C src/test installcheck-cbdb-parallel
  • Feel free to request review and approval from the cloudberrydb/dev team when your PR is ready🥳

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

@jiaqizho force-pushed the fix-flakiness-fts_segment_reset branch from 2fbb9ac to 9c18559 on July 12, 2024 12:25
@my-ship-it merged commit cd359e4 into apache:main on Jul 12, 2024
10 of 11 checks passed
5 participants