Changes
As discussed in #1014, this PR implements a new feature in the compactor: offline deduplication of data from different replicas.
Offline deduplication follows the same design as query-time deduplication. The user must specify the replica label with the `dedup.replica-label` config option before enabling it, and the function uses the current query deduplication algorithm (the penalty algorithm) to merge data points that come from different replicas. Offline deduplication operates at the bucket level in remote storage, so the user must ensure that all replica data is written to the same bucket.
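To illustrate how a replica label can identify blocks that hold the same data, here is a minimal Python sketch. It is not the code in this PR: the block dictionaries, the `group_replicas` helper, and the label value `replica` are all hypothetical, and the sketch assumes that two blocks are replicas of each other when their labels are identical once the replica label is removed.

```python
from collections import defaultdict

# Hypothetical value for dedup.replica-label; the real value is user-defined.
REPLICA_LABEL = "replica"


def group_replicas(blocks, replica_label=REPLICA_LABEL):
    """Group block IDs whose labels match once the replica label is dropped."""
    groups = defaultdict(list)
    for block in blocks:
        # The grouping key is every label except the replica label.
        key = tuple(sorted(
            (name, value)
            for name, value in block["labels"].items()
            if name != replica_label
        ))
        groups[key].append(block["id"])
    return dict(groups)


# Illustrative blocks living in the same bucket (made-up IDs and labels).
blocks = [
    {"id": "block-a", "labels": {"cluster": "eu1", "replica": "0"}},
    {"id": "block-b", "labels": {"cluster": "eu1", "replica": "1"}},
    {"id": "block-c", "labels": {"cluster": "us1", "replica": "0"}},
]
groups = group_replicas(blocks)
```

In this sketch `block-a` and `block-b` land in the same group and become candidates for merging, while `block-c` stays on its own. It also shows why all replicas have to live in the same bucket: blocks in another bucket would never be seen by the grouping step.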
Verification
The figures below compare one sample metric before and after deduplication. The change defines an internal replica label, `_agg_replica_`, to represent the merged data. In the figures below, each block is around 3GB and there are two replicas of each block. Because blocks are read and written in a streaming fashion, no OOM exception occurred.
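The streaming approach mentioned above can be sketched as a lazy merge over time-sorted sample iterators. This is only an illustration under simplifying assumptions, not the penalty algorithm and not the code in this PR: the `merge_replicas` helper and the sample data are hypothetical, and the merge only drops samples whose timestamps are exactly equal.

```python
import heapq


def merge_replicas(*replicas):
    """Lazily merge time-sorted (timestamp, value) iterators.

    Only one pending sample per replica is held in memory at a time.
    When several replicas carry the same timestamp, the sample from
    the replica passed first is kept and the others are dropped.
    """
    last_ts = None
    for ts, value in heapq.merge(*replicas, key=lambda sample: sample[0]):
        if ts != last_ts:
            yield ts, value
            last_ts = ts


# Made-up samples: each replica is missing one scrape the other one has.
replica_a = iter([(0, 1.0), (15, 2.0), (45, 4.0)])
replica_b = iter([(0, 1.0), (15, 2.0), (30, 3.0)])
merged = list(merge_replicas(replica_a, replica_b))
```

Because the merge consumes its inputs as iterators and yields results one sample at a time, memory use does not grow with block size, which is the property that avoids OOM on multi-gigabyte blocks. Replicas usually scrape at slightly different timestamps, so exact-timestamp matching like this would not be enough in practice.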