
Halt compaction if job is lost #4420

Merged
merged 8 commits into grafana:main from stop-compacting-on-job-loss on Dec 11, 2024

Conversation

joe-elliott
Member

@joe-elliott joe-elliott commented Dec 5, 2024

What this PR does:
In a previous PR we added a log message emitted when a compactor finished a job and then discovered it no longer owned that job. We have since seen this message pop up sporadically in production, suggesting that unstable compactors will, on rare occasions, increase the amount of data in object storage by producing duplicate output.

~~This PR extends that work by adding code to abandon compaction in the event that ownership of the job changes. This is not foolproof, but it should make duplication an extremely rare occurrence.~~

~~I also considered an approach where we create a child context in doCompaction and cancel it with a cause in the event that job ownership changes. We could use the cause to confirm that a "context canceled" bubbling up from .compact() was in fact due to an ownership change. The reason I don't like it is that we would have to create a side goroutine that polls .Owns() watching for ownership changes, which felt messier. I am definitely game to go this route if others prefer it.~~

I have since pivoted to the approach described in the strikethrough for the reason mentioned below. The PR now contains the following changes:

  • Cleaned up the compaction loop to use a simple time.Sleep that accomplishes the same intent.
  • Made the start and end compaction cycle log messages "Info" level, like other status updates, for consistency.
  • Removed <-ctx.Done() and just checked ctx.Err() for simplicity in doCompaction.
  • Renamed doCompaction and compact for clarity, and added a compactWhileOwns method that checks job ownership during compaction.

I have also left the first attempt as the first commit in this branch and then reverted it, in case a reviewer would like to see both options.

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@joe-elliott
Member Author

After some thought, I'm leaning towards the cancel-context approach described above. This is not a "real" fix to the problem; it will just prevent compactors from duplicating data most of the time. Because of this, we should opt for the less invasive change while we consider a more holistic improvement to compaction.

Signed-off-by: Joe Elliott <number101010@gmail.com>
Signed-off-by: Joe Elliott <number101010@gmail.com>
@joe-elliott joe-elliott force-pushed the stop-compacting-on-job-loss branch from 0b3d145 to 5bf18f0 Compare December 10, 2024 17:46
Signed-off-by: Joe Elliott <number101010@gmail.com>
Signed-off-by: Joe Elliott <number101010@gmail.com>
Signed-off-by: Joe Elliott <number101010@gmail.com>
Signed-off-by: Joe Elliott <number101010@gmail.com>
Contributor

@zalegrala zalegrala left a comment


This is neat. Just a couple comments.

tempodb/compactor.go (resolved)
tempodb/compactor.go (resolved)
```go
// every second test if we still own the job. if we don't then cancel the context with a cause
// that we can then test for
go func() {
	ticker := time.NewTicker(1 * time.Second)
```
Contributor
I see now that this 1 second appears to be in conflict with the compactionCycle. If the compactionCycle is less than 1 second, we could continue compaction for one more cycle than desired. Is this a concern?

Member Author

@joe-elliott joe-elliott Dec 11, 2024


This doesn't block the return of the function, so it won't stop the compaction. If a compaction takes < 1s, compactWhileOwns will still return successfully, and some short amount of time later this goroutine will exit.

Signed-off-by: Joe Elliott <number101010@gmail.com>
@joe-elliott joe-elliott merged commit c75e9f8 into grafana:main Dec 11, 2024
16 checks passed
@Wessie commented Dec 19, 2024

Hello, I just hit a code path introduced in this patch that made Tempo spin a CPU core.

This is because the for loop on this line has no exit condition. This means that when Tempo is exiting (but slowly, for some reason), the ctx is canceled, which makes f() return instantly with this kind of log message:

```
caller=compactor.go:143 msg="caught context cancelled at the top of the compaction loop. bailing." err="context canceled" cause="context canceled"
```

Then this line also returns instantly inside doForAtLeast because ctx is canceled, producing the busy loop.

@joe-elliott
Member Author

@Wessie Great find, thank you 🙏. Would you like to submit a PR to fix it?

If not I will patch it up today, no worries.

@Wessie commented Dec 19, 2024

Feel free to patch.
