
Halt compaction if job is lost #4420

Merged
merged 8 commits into grafana:main from stop-compacting-on-job-loss on Dec 11, 2024

Conversation

joe-elliott
Member

@joe-elliott joe-elliott commented Dec 5, 2024

What this PR does:
In a previous PR we added a log message emitted when a compactor finished a job and then discovered it no longer owned that job. We have since seen this message pop up sporadically in production, suggesting that unstable compactors will, on rare occasions, increase the amount of data in object storage by producing duplicate output.

~~This PR extends that work by adding code to abandon compaction in the event that ownership of the job changes. This is not foolproof, but it should make duplication an extremely rare occurrence.~~

~~I also considered an approach where we create a child context in doCompaction and cancel it with a cause in the event that job ownership changes. We could use the cause to confirm that a "context canceled" bubbling up from .compact() was in fact due to an ownership change. The reason I don't like it is that we would have to create a side goroutine that polls .Owns() watching for ownership changes, which felt messier. I am definitely game to go this route if others prefer it.~~

I have since pivoted to the approach described in the strikethrough for the reason mentioned below. The PR now contains the following changes:

  • Cleaned up the compaction loop to use a simple time.Sleep that accomplishes the same intent.
  • Made the start and end compaction cycle log messages "Info" level, like other status updates, for consistency.
  • Removed <-ctx.Done() and just checked ctx.Err() for simplicity in doCompaction.
  • Renamed doCompaction and compact for clarity, and added a compactWhileOwns method that checks job ownership during compaction.

I have also left the first attempt as the first commit in this branch and then reverted it, in case a reviewer would like to see both options.

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@joe-elliott
Member Author

After some thought, I'm leaning towards the cancel-context approach described above. This is not a "real" fix to the problem; it will just prevent compactors from duplicating data most of the time. Because of this, we should opt for the less invasive change while we consider a more holistic improvement to compaction.

Signed-off-by: Joe Elliott <number101010@gmail.com>
Signed-off-by: Joe Elliott <number101010@gmail.com>
@joe-elliott joe-elliott force-pushed the stop-compacting-on-job-loss branch from 0b3d145 to 5bf18f0 Compare December 10, 2024 17:46
Signed-off-by: Joe Elliott <number101010@gmail.com>
Signed-off-by: Joe Elliott <number101010@gmail.com>
Signed-off-by: Joe Elliott <number101010@gmail.com>
Signed-off-by: Joe Elliott <number101010@gmail.com>
Contributor

@zalegrala zalegrala left a comment


This is neat. Just a couple comments.

tempodb/compactor.go (resolved)
tempodb/compactor.go (resolved)
```go
// every second test if we still own the job. if we don't then cancel the context with a cause
// that we can then test for
go func() {
	ticker := time.NewTicker(1 * time.Second)
```
Contributor
I see now that this 1 second appears to be in conflict with the compactionCycle. If the compactionCycle is less than 1 second, we could continue compaction for one more cycle than desired. Is this a concern?

Member Author

@joe-elliott joe-elliott Dec 11, 2024


This doesn't block the return of the function, so it won't stop the compaction. If a compaction takes < 1s, compactWhileOwns will still return successfully, and some short amount of time later this goroutine will exit.

Signed-off-by: Joe Elliott <number101010@gmail.com>
@joe-elliott joe-elliott merged commit c75e9f8 into grafana:main Dec 11, 2024
16 checks passed
@Wessie commented Dec 19, 2024

Hello, I just hit a code path introduced in this patch that made Tempo spin a CPU core.

This is because the for loop on this line has no exit condition. This means that when Tempo is exiting (but slowly, for some reason), the ctx is canceled, which makes f() return instantly with this kind of log message:

```
caller=compactor.go:143 msg="caught context cancelled at the top of the compaction loop. bailing." err="context canceled" cause="context canceled"
```

Then this line also returns instantly inside doForAtLeast because ctx is canceled, producing the busy loop.

@joe-elliott
Member Author

@Wessie Great find, thank you 🙏. Would you like to submit a PR to fix it?

If not I will patch it up today, no worries.

@Wessie commented Dec 19, 2024

Feel free to patch.
