Halt compaction if job is lost #4420
Conversation
After some thought I'm leaning towards the cancel context approach described above. This is not a "real" fix to the problem but will just prevent compactors from duplicating data most of the time. Because of this we should opt for a less invasive change while we consider a more holistic improvement on compaction.
This is neat. Just a couple comments.
```go
// every second test if we still own the job. if we don't then cancel the context with a cause
// that we can then test for
go func() {
	ticker := time.NewTicker(1 * time.Second)
```
I see now that this 1 second appears to be in conflict with `compactionCycle`. If we have a `compactionCycle` of less than 1 second, we could continue compaction for one extra cycle than desired. Is this a concern?
This doesn't block the return of the function, so it won't stop the compaction. If a compaction takes < 1s then `compactWhileOwns` will still return successfully, and some short amount of time later this goroutine will exit.
Hello, I just hit a code path introduced in this patch that made Tempo spin a CPU core. This is because the for loop on this line has no exit condition. This means when Tempo is exiting (but slowly for some reason), the
and then this line also returns instantly inside of
@Wessie Great find. Thank you 🙏. Would you like to submit a PR to fix? If not I will patch it up today, no worries.
Feel free to patch.
What this PR does:
In a previous PR we added a log message when compactors finished a job and recognized they no longer owned it. We have since seen this message pop up sporadically in production, suggesting that unstable compactors will, on rare occasions, increase the amount of data in object storage.
This PR extends on that work by adding code to abandon compaction in the event that ownership of the job changes. This is not foolproof but should make this an extremely rare occurrence.
~~I also considered an approach where we created a child context in `doCompaction` and cancelled it with a cause in the event that job ownership changes. We could use the cause to confirm that a "context cancelled" that bubbles up from `compact()` was in fact due to ownership change. The reason I don't like it is b/c we would have to create a side goroutine that polls `Owns()` watching for ownership change, which felt messier. I am definitely game to go this route if others prefer it.~~

I have since pivoted to the approach described in the strikethrough for the reason mentioned below. The PR now contains the following changes:

- `time.Sleep` to accomplish the same intent.
- Renamed `doCompaction` and `compact` for clarity and added a `compactWhileOwns` method that does the job of checking job ownership during compaction.

I have also left the first attempt as the first commit in this branch and then reverted it, in case a reviewer would like to see both options.
Checklist

- [ ] `CHANGELOG.md` updated - the order of entries should be `[CHANGE]`, `[FEATURE]`, `[ENHANCEMENT]`, `[BUGFIX]`