rls: fix a data race involving the LRU cache #5925
Conversation
balancer/rls/balancer.go (outdated)
-	// cacheMu guards access to the data cache and pending requests map.
-	cacheMu sync.RWMutex
+	// cacheMu guards access to the data cache and pending requests map. We
+	// cannot use an RWMutex here since even a operation like
*an operation
Done.
balancer/rls/balancer.go (outdated)
	resizeCache := false
	if newCfg.cacheSizeBytes != b.lbCfg.cacheSizeBytes {
		resizeCache = true
	}
// Resize the cache if the size in the config has changed.
resizeCache := newCfg.cacheSizeBytes != b.lbCfg.cacheSizeBytes
(Comment optional, but explains what could be seen as "more confusing logic", and still uses fewer lines.)
Done. Thanks.
balancer/rls/balancer.go (outdated)
@@ -284,6 +286,14 @@ func (b *rlsBalancer) UpdateClientConnState(ccs balancer.ClientConnState) error
 	b.updateCh.Put(resumePickerUpdates{done: done})
 	b.stateMu.Unlock()
 	<-done
+
+	if resizeCache {
+		// If the new config changes the size of the data cache, we might have to
"If the new config reduces the size"...?
Done.
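For context on what the comment being discussed describes, here is a minimal standalone sketch of a shrink operation; the types and names are hypothetical, not the PR's actual dataCache code. When the configured budget drops, entries are evicted from the least-recently-used end until usage fits.

package main

import "container/list"

// Hypothetical LRU cache: front of `order` is most recent, back is least.
type cache struct {
	maxSize  int64
	usedSize int64
	order    *list.List
	sizes    map[*list.Element]int64
}

// resize updates the budget. Only a reduction can force evictions, which
// proceed from the least-recently-used end of the list.
func (c *cache) resize(newSize int64) {
	c.maxSize = newSize
	for c.usedSize > c.maxSize {
		oldest := c.order.Back()
		if oldest == nil {
			break
		}
		c.usedSize -= c.sizes[oldest]
		delete(c.sizes, oldest)
		c.order.Remove(oldest)
	}
}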
@@ -284,6 +286,14 @@ func (b *rlsBalancer) UpdateClientConnState(ccs balancer.ClientConnState) error
 	b.updateCh.Put(resumePickerUpdates{done: done})
 	b.stateMu.Unlock()
 	<-done
+
+	if resizeCache {
Why not do this operation above, where we compute it, and remove the bool entirely?
cacheMu needs to be grabbed before stateMu, which is why we can't do this operation up there. Added a comment explaining this.
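To illustrate the ordering rule behind that reply (a minimal standalone sketch, not the balancer's code): if one path takes cacheMu then stateMu while another takes them in the opposite order, the two can deadlock, each holding one lock while waiting for the other.

package main

import "sync"

var (
	cacheMu sync.Mutex // documented order: cacheMu before stateMu
	stateMu sync.Mutex
)

// pathA respects the documented order.
func pathA() {
	cacheMu.Lock()
	stateMu.Lock()
	stateMu.Unlock()
	cacheMu.Unlock()
}

// pathB inverts it; run concurrently with pathA, each goroutine can end up
// holding one lock while waiting forever for the other.
func pathB() {
	stateMu.Lock()
	cacheMu.Lock() // BAD: violates the cacheMu-before-stateMu order
	cacheMu.Unlock()
	stateMu.Unlock()
}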
-	// to write to the cache (if we have to send out an RLS request), we will
-	// release the read-lock and acquire a write-lock.
-	p.lb.cacheMu.RLock()
+	p.lb.cacheMu.Lock()
Can we reduce the amount of time we're holding this at all?
The following actions happen under the lock in this method:
- Read the cache entry: this is non-blocking and should be fast.
- Send an RLS request if required: this is the only thing which could block. But the control channel implementation spawns a goroutine to do the actual work of sending the RLS RPC (sketched below), so this should also be fast.
- Queue the pick or delegate to child policies (or the default policy): this should also be non-blocking and fast.

So, I think we are OK to hold the lock for the whole duration of this method. If we are blocking or taking too long in this method, we have other problems anyway, since LB policies are expected to be quick in Pick() and not block in there.

What do you think?
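The non-blocking claim about the control channel boils down to a fire-and-forget pattern along these lines (hypothetical names; a sketch, not the actual implementation):

package main

type lookupRequest struct{ keys map[string]string }
type lookupResponse struct{ targets []string }

type controlChannel struct{}

// sendRLSRequest stands in for the blocking RPC to the RLS server.
func (cc *controlChannel) sendRLSRequest(req *lookupRequest) (*lookupResponse, error) {
	return &lookupResponse{targets: []string{"backend-1"}}, nil
}

// lookup enqueues the RPC on a new goroutine and returns immediately, so a
// caller holding cacheMu never waits on network I/O.
func (cc *controlChannel) lookup(req *lookupRequest, done func(*lookupResponse, error)) {
	go func() {
		done(cc.sendRLSRequest(req))
	}()
}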
Seems fine. I was just wondering because the defer makes it impossible to give the lock up early on one path vs. another, and ideally we hold it for as short a time as possible.
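For readers following along, the trade-off being discussed looks roughly like this (a sketch with hypothetical types, not the actual picker):

package main

import "sync"

type picker struct {
	mu    sync.Mutex
	cache map[string]string
}

// With defer, the lock is held until the method returns on every path; no
// path can give it up early.
func (p *picker) pickDeferred(key string) string {
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.cache[key] // any follow-up work would also run under the lock
}

// With explicit unlocks, a path can drop the lock as soon as the shared
// state is no longer needed, at the cost of unlocking on every return path.
func (p *picker) pickExplicit(key string) string {
	p.mu.Lock()
	v := p.cache[key]
	p.mu.Unlock()
	// follow-up work on v happens outside the critical section
	return v
}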
Thanks for the quick review!!
Summary of changes:
- The data cache and the pending requests map were guarded by a sync.RWMutex. In the RPC path, the cache lookup was done under a read-lock, while most of the other operations were done under a write-lock. This was under the assumption that a cache lookup was a read operation. But as it turns out, the cache lookup modifies the LRU cache because it makes the matched entry the most-recently-used. This change fixes the data race by getting rid of the sync.RWMutex and moving to a regular sync.Mutex.
- Adds an iterateAndRun method on the dataCache type.
- Removes the onEvict field in the cacheEntry type. This was not being used.

This takes care of a P1 internal issue.
RELEASE NOTES: