Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix clickhouse-copier cleaning-tainting contention between concurrent workers #7816

Conversation

dingxiangfei2009
Copy link
Contributor

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one):

  • Bug Fix

Changelog entry (up to few sentences, required except for Non-significant/Documentation categories):

  • Fix clickhouse-copier contention when target partition is dirty and workers run into race to clean it

...

Detailed description:
When multiple ClusterCopiers discover that the target partition is not empty, they will attempt to clean up this partition before proceeding to copying.
However, consider the following steps done by three workers in an interleaving fashion, which leads to no progress being made by any of the three workers.

Suppose there are three workers: A, B, C.

  1. Worker A discovers a clean partition whose row count is unfortunately non-zero. A sleeps.
  2. Worker C, after carrying out some work, marks the partition dirty. C sleeps.
  3. Worker B wakes up, and optionally aborting the previous work and found the partition dirty. It cleans up the previous work state, drops the partition and remove the is_dirty flag. B sleeps.
  4. Worker A wakes up and counts the number of finished shards. Now this number is zero. A sleeps.
  5. Worker B wakes up and found the partition clean. It starts to copy rows.
  6. Worker C wakes up and found the partition clean, but the row count is unfortunately non-zero. C sleeps.
  7. Worker A wakes up. At its current work cycle, it founds a "clean" partition whose row count is non-zero, yet "no workers were working" at the moment it was checking. Therefore, A marks the partition dirty.
    ...
    From Step 7 onwards, the role of the worker A and C can be swapped and steps from No. 3 can be repeated. Now we have a cleaning-tainting flip-flopping loop between A and C.

This PR will make clickhouse-copier to mitigate this issue with the following modifications.

  1. Instead of purging is_dirty, the history of cleaning work is preserved and partition hygiene is established based on a happens-before relation between the events. This relation is encoded by LogicalClock based on the mzxid of the is_dirty ZNode and is_dirty/cleaned. The fact of the partition hygiene is encoded by CleanStateClock.
  2. Updating the partition hygiene state will then require the worker to hold up-to-date information about this hygiene state and indirectly synchronize with the workers or the cleaner, depending on whether the current work cycle is insertion or cleaning.
  3. Guards are placed to ensure that at most one worker is allowed to compute the row count of a target partition.

...

@dingxiangfei2009
Copy link
Contributor Author

Sorry, I will let the functional tests finish before patching the style problems.

@dingxiangfei2009 dingxiangfei2009 force-pushed the fix-copier-contention-master branch from c66d21d to 6667961 Compare November 19, 2019 05:41
@dingxiangfei2009
Copy link
Contributor Author

Strange to see that multAsync failed, even if zkutil is untouched and up to date with master.

@akuzm
Copy link
Contributor

akuzm commented Nov 19, 2019

Strange to see that multAsync failed, even if zkutil is untouched and up to date with master.

Probably not related to your changes, it's a known flappy test.

@dingxiangfei2009 dingxiangfei2009 force-pushed the fix-copier-contention-master branch from 6667961 to 8330a09 Compare November 22, 2019 01:49
@alexey-milovidov alexey-milovidov merged commit eb7f48a into ClickHouse:master Nov 23, 2019
@dingxiangfei2009 dingxiangfei2009 deleted the fix-copier-contention-master branch December 2, 2019 01:54
@nikitamikhaylov nikitamikhaylov added the pr-bugfix Pull request with bugfix, not backported by default label Dec 2, 2019
nikitamikhaylov pushed a commit that referenced this pull request Dec 2, 2019
…master

Fix clickhouse-copier cleaning-tainting contention between concurrent workers

(cherry picked from commit eb7f48a)
nikitamikhaylov pushed a commit that referenced this pull request Dec 2, 2019
…master

Fix clickhouse-copier cleaning-tainting contention between concurrent workers

(cherry picked from commit eb7f48a)
nikitamikhaylov pushed a commit that referenced this pull request Dec 2, 2019
…master

Fix clickhouse-copier cleaning-tainting contention between concurrent workers

(cherry picked from commit eb7f48a)
vitlibar pushed a commit that referenced this pull request Dec 26, 2019
…master

Fix clickhouse-copier cleaning-tainting contention between concurrent workers

(cherry picked from commit eb7f48a)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr-bugfix Pull request with bugfix, not backported by default
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants