Compact: Offline deduplication #1275

Closed
smalldirector wants to merge 1 commit

smalldirector

Changes

As discussed in #1014, this PR implements a new compactor feature: offline deduplication of data from different replicas.

Offline deduplication follows the same design as query deduplication. The user must specify the replica label via the dedup.replica-label config before enabling it, and the function uses the current query deduplication algorithm (the penalty algorithm) to merge data points that come from different replicas.
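
To make the merge step more concrete, here is a minimal Python sketch of a penalty-style merge of two replica series. It is only loosely modeled on the idea behind query deduplication and is not the code in this PR: the function name dedup_two_replicas, the (timestamp, value) tuple layout, and the 5000 ms initial penalty are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Sample = Tuple[int, float]  # (timestamp in ms, value); layout is an assumption


def dedup_two_replicas(a: List[Sample], b: List[Sample],
                       initial_penalty_ms: int = 5000) -> List[Sample]:
    """Merge two sorted replica series, preferring to stay on one replica.

    After a sample is taken from one replica, the other replica is
    penalized: its samples are skipped until they are far enough past the
    last emitted timestamp. This avoids flapping between replicas whose
    scrapes are only slightly offset from each other.
    """
    out: List[Sample] = []
    ia = ib = 0
    pen_a = pen_b = 0
    last_t: Optional[int] = None

    while True:
        if last_t is not None:
            # Skip samples that fall before last_t + 1 + penalty.
            while ia < len(a) and a[ia][0] < last_t + 1 + pen_a:
                ia += 1
            while ib < len(b) and b[ib][0] < last_t + 1 + pen_b:
                ib += 1

        a_ok, b_ok = ia < len(a), ib < len(b)
        if not a_ok and not b_ok:
            break

        # Pick the replica whose next sample comes first.
        use_a = a_ok and (not b_ok or a[ia][0] <= b[ib][0])
        if use_a:
            t, v = a[ia]
            ia += 1
        else:
            t, v = b[ib]
            ib += 1

        # Penalize the replica that was not chosen; clear the chosen one.
        penalty = initial_penalty_ms if last_t is None else 2 * (t - last_t)
        if use_a:
            pen_a, pen_b = 0, penalty
        else:
            pen_a, pen_b = penalty, 0

        out.append((t, v))
        last_t = t

    return out
```

With two replicas scraping the same target about one second apart, this sketch keeps emitting samples from the first replica and only falls back to the second once the first runs out of samples.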

Offline deduplication operates at the bucket level in remote storage, so the user needs to ensure that all replicas write their data to the same bucket.
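
As a rough illustration of why all replicas need to share a bucket, the sketch below groups blocks found in one bucket by their labels with the replica label removed, so blocks that differ only in the replica label end up in the same group and can be merged. The function name group_blocks_by_source, the (block_id, labels) input shape, and the label names in the usage example are hypothetical and not taken from this PR.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

LabelKey = Tuple[Tuple[str, str], ...]


def group_blocks_by_source(blocks: List[Tuple[str, Dict[str, str]]],
                           replica_label: str) -> Dict[LabelKey, List[str]]:
    """Group block IDs by their labels, ignoring the replica label.

    Blocks in the same group are assumed to hold the same series written
    by different replicas, and are therefore candidates for deduplication.
    """
    groups: Dict[LabelKey, List[str]] = defaultdict(list)
    for block_id, labels in blocks:
        # Build a stable key from every label except the replica label.
        key = tuple(sorted((k, v) for k, v in labels.items()
                           if k != replica_label))
        groups[key].append(block_id)
    return dict(groups)


# Usage with made-up block IDs and labels.
example_blocks = [
    ("block-1", {"cluster": "eu", "replica": "0"}),
    ("block-2", {"cluster": "eu", "replica": "1"}),
    ("block-3", {"cluster": "us", "replica": "0"}),
]
example_groups = group_blocks_by_source(example_blocks, "replica")
```

Blocks stored in a different bucket would never appear in the input list, which is why such replicas could not be merged by a bucket-level pass.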

Verification

  • Compared data quality using the tsdb read API.
  • Compared the graphs in the Thanos query UI.

The figures below compare one sample metric before and after dedup. An internal replica label, _agg_replica_, is defined to represent the merged data.

Before Dedup

After Dedup

  • Tested the dedup function online and monitored its metrics (especially memory usage).

In the figure below, each block is around 3GB and there are two replicas of each block. Because blocks are read and written in a streaming fashion, no OOM occurs.

Metrics

@bwplotka
Member

Why was it closed? Also, can we ensure the PR is split into smaller bits if possible? It will ensure a quicker review for sure (:

@smalldirector
Author

@bwplotka I closed it because I want a clean commit history for the review. I have opened a new PR: #1276. Could you please check the code there?

Agreed. Smaller PRs make review easier. However, all of this code is for one new feature, the dedup function, so IMO keeping everything in one PR gives better context for how it works.
Do you have any suggestions on how to split it?
