-
Notifications
You must be signed in to change notification settings - Fork 9.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
lease: Add a heap to optimize lease expiration checks #9418
lease: Add a heap to optimize lease expiration checks #9418
Conversation
can you write a benchmark to compare the old/new approach? |
Absolutely. Do you have an example benchmark you would like us to emulate? |
We're working on a reproducible, synthetic benchmark, but here is a chart of 3.2.11, 3.3.1 and 3.3.1 with this change under our production traffic to give you an idea of the problem we are trying to fix: This is our application observed latency for creating a lease and making a transaction (plus some other stuff responsible for the baseline 100ms. These match our other metrics, where average latency is normal, but p99 and higher get very, very large. We should have a reproducible benchmark available soon. Follow up - that spikiness in the last section was from bad compaction settings - it's almost flat with correct settings. |
@mgates Do you mean the tail latency with heap is much higher than the one with map? |
Ideally, benchmark (in Go, |
Great - we'll get those benchmarks set up. |
I think I'm misunderstanding - the tail latency with the heap is much much lower (and doesn't seem to grow unbounded with lease map length). |
Then, sounds good! Can you provide reproducible workloads that can be cross-checked in my side? |
Just to let you all know, we got side-tracked, but we're still planning on getting some benchmarks done and ready. |
lease/lessor.go
Outdated
@@ -151,14 +152,52 @@ type lessor struct { | |||
doneC chan struct{} | |||
} | |||
|
|||
type LeaseWithTime struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move this to a new file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good idea.
@mgates This is missing some methods (e.g. revoke). Let me cherry-pick your commit with more extensive benchmarks, tomorrow. |
So, we don't actually think we need to revoke or update - the idea is that the heap is just candidates - we always check the map for the actual lease. If a lease is revoked, when we get to it in the heap, we find it's not in the map, and we ignore it. If a lease is renewed, an additional marker gets put in the heap, and we ignore the first one, because the lease isn't expired in the map yet. |
Then how are we going to handle false positives in server-side lease revoke routine? If a bunch of manual revoke happened, then expire channel could return leases with stale secondary index, and server errors. Was there any performance bottleneck updating heap on revoke? |
I don't think that we'd get errors - if we get a stale lease ID, we just ignore it. My understanding is that it would be slow to iterate through the leap finding them by lease ID, but if you think it won't be we can try it and benchmark it. |
Oh I see now that your code ignores it. I will take another look. |
@gyuho We've added a set of benchmark tests around lease operations. From the
and with our patch:
We believe this accurately reflects the improvement in performance we've seen with the find expired leases function. |
@mgates Thanks for updates!
I would like to cross-check on this before we merging in. What kind of workloads did you ingest to verify this improvements? I am busy till next week. So will try to merge this by following week latest. |
@gyuho - We cloned our traffic in a production environment to a separate cluster, for about 500 GRPC operations per second (tx, get, put, lease grant etc) with a max of about 250 leases expiring per second and a total number of keys averaging around 5 million. This same workload rendered our unpatched cluster unusable within 30 minutes of operation. |
The benchmark result makes sense to me. |
lease/lessor.go
Outdated
if l == nil { | ||
// lease has expired or been revoked, continue | ||
continue | ||
} else if time.Now().UnixNano() < item.expiration { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need the else statement here. just if time.Now ... is enough.
@jcalvert can you please clean this PR up? Then we can get it merged. |
lease/lessor.go
Outdated
func NewLessor(b backend.Backend, minLeaseTTL int64) Lessor { | ||
return newLessor(b, minLeaseTTL) | ||
} | ||
|
||
func newLessor(b backend.Backend, minLeaseTTL int64) *lessor { | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove this line?
lease/lessor.go
Outdated
// TODO: probably should change to <= 100-500 millisecond to | ||
// make up committing latency. | ||
for { | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove this line?
LGTM after addressing @xiang90's comments. We can merge first. I will do the benchmarks sometime later. Thanks! |
@mgates @cosgroveb Could you squash commits and add license header? I have example branch https://github.com/gyuho/etcd/commits/tmp. |
97b1c02
to
cb5e822
Compare
@gyuho Added the license header to both new files and squashed our commits. |
@cosgroveb One last nit, could you amend commit titles to Thanks a lot! |
This adds a heap acting as a priority queue to keep track of lease exiprations. Previously the whole lease map had to be iterated through each time. The queue allows us to check only those leases which might be expired. When the expiration changes, we add an additional entry. If we check an entry that isn't expired, it means that the lease got extended. If we find a entry in the heap that doesn't have a corresponding entry in the map, we know that the lease has already been expired or revoked.
cb5e822
to
3f85ae7
Compare
@mgates I just rebased from master to trigger CIs. Will add release notes as well. Thanks. |
Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
Test failures aren't related. I also manually ran full test suites and no failures. Merging. |
…xpirations lease: Add a heap to optimize lease expiration checks
This TODO was addressed in etcd-io#9418.
This adds a heap acting as a priority queue to keep track of lease
exiprations. Previously the whole lease map had to be iterated through
each time.
The queue allows us to check only those leases which might be expired.
When the expiration changes, we add an additional entry. If we check an
entry that isn't expired, it means that the lease got extended.
If we find a entry in the heap that doesn't have a corresponding entry in
the map, we know that the lease has already been expired or revoked.
This is our first stab at writing real Go, and we're happy to make any changes if you all think this is a good way of doing things.
/cc @jcalvert