Optimize tidying up dangling secret-id accessors #6252
Conversation
Previous behavior was to do the following:

1. Get a lock on all secret ids.
2. For each dangling accessor, list all of the secret ids in all of the approles.

During this process there is a write lock on approles which prevents any authentication from happening. For small periods of time this can go unnoticed, however there are situations where this causes a lot of service disruption (full details at the end).

New behavior:

1. Wait for a lock on all of the secret ids to make sure that all secret-ids have been written for the list of accessors we created earlier.
2. Get the full list of secret id hmacs only once and re-use it for all dangling accessors.

The previous comment explained that the locking is only present to avoid race conditions where the accessor has been written but the secret-id has not yet. If there is another reason why a lock is necessary while getting the list of secret ids, I can easily update this so that listing all secret ids happens while the locks are in place. My understanding is that just waiting to get the lock and then releasing it is enough to guarantee that we don't hit the "accessor exists but not yet secret-id" race condition.

Why am I making these changes and what happened?

We have lots of approles, with lots of secret-ids and lots of accessors. An approle was removed which contained ~8000 dangling accessors. After that change we suddenly saw that it wasn't possible to authenticate with approle authentication any more. After much troubleshooting we enabled trace logging and saw that it had to do with dangling accessors, but couldn't tell how many there were or how long they would take to clean up.

After adding in some extra trace logging I discovered that there were 8000 of them, and because of the number of other approles and secret-ids it was taking around 2 minutes per dangling accessor to clean up. With the previous logic this would have taken at least 12 days to clean up while holding a lock on approle authentication the entire time.

Part of the slowness is no doubt caused by the fact that we use GCS as the backend, and I'm sure such a large number of approles and secret ids is not best practice or expected. However, this vault instance is being run as a service for other teams where we don't have full control over enforcing how they use it, and vault also let them create that many approles and secret ids.
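To make the locking difference concrete, here is a minimal sketch of the two patterns (illustrative only, not the actual Vault code; the real code iterates over `b.secretIDLocks`):

```go
package tidysketch

import "sync"

// Previous behavior: every secret ID lock is held for the entire tidy run,
// so secret ID writes (and therefore approle logins) stay blocked until the
// whole cleanup finishes.
func tidyHoldingLocks(locks []*sync.RWMutex, cleanup func()) {
	for _, lock := range locks {
		lock.Lock()
		defer lock.Unlock() // only released when tidy returns
	}
	cleanup() // with many roles and secret ids this can take a very long time
}

// New behavior: acquire and immediately release each lock. This only waits for
// any in-flight secret ID write to finish, then lets writers proceed again
// while the cleanup runs.
func tidyWaitingForLocks(locks []*sync.RWMutex, cleanup func()) {
	for _, lock := range locks {
		lock.Lock()
		lock.Unlock()
	}
	cleanup()
}
```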
```diff
 for _, lock := range b.secretIDLocks {
 	lock.Lock()
-	defer lock.Unlock()
+	lock.Unlock()
```
Is this in the right place? It seems strange to me that we would grab a lock and then unlock it, plus below there's a comment that we're doing stuff now that we waited for the lock. Usually locks are held for the duration of time when we're doing work that might otherwise lead to a race condition.
I can't say for certain if this is the right place. But for the current implementation I believe this does achieve the same goal with way less downtime.
> so if we have the lock we know we're not in a wrote-accessor-but-not-yet-secret case, which can be racy.
So this means we want to make sure that nothing is currently still writing the secret-id between these two calls.
First a list of accessors is created. It is possible that an accessor exists but the secretID has not been written yet.
```go
accessorHashes, err := s.List(ctx, accessorIDPrefixToUse)
```
Then a list of secretIDs is made:
```go
secretIDHMACs, err := s.List(ctx, fmt.Sprintf("%s%s", secretIDPrefixToUse, roleNameHMAC))
```
In the scenario where the secretID does not exist yet this accessor will not be deleted and will be considered dangling.
By waiting for a lock here:
vault/builtin/credential/approle/path_tidy_user_id.go, lines 171 to 186 in 6e96d87:
```go
for _, lock := range b.secretIDLocks {
	lock.Lock()
	lock.Unlock()
}
// Now that we waited for a lock get a fresh list of all of the secretIDHMACs
// so that we can be sure if the accessor is dangling or not
allSecretIDHMACs := make([]string, 0)
for _, roleNameHMAC := range roleNameHMACs {
	secretIDHMACs, err := s.List(ctx, fmt.Sprintf("%s%s", secretIDPrefixToUse, roleNameHMAC))
	if err != nil {
		return err
	}
	for _, v := range secretIDHMACs {
		allSecretIDHMACs = append(allSecretIDHMACs, v)
	}
}
```
We can be sure that all secretIDs have been written since the list of accessors was created, which prevents the race condition from occurring.
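To illustrate why waiting on the lock is enough, here is a small, self-contained sketch (not Vault code; the storage map and key names are made up) of the write ordering and the barrier effect of `lock.Lock(); lock.Unlock()`:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// storage stands in for a backend whose individual operations are safe;
// secretIDLock guards the two-step "write accessor, then write secret ID" sequence.
type storage struct {
	mu      sync.Mutex
	entries map[string]bool
}

func (s *storage) put(key string) {
	s.mu.Lock()
	s.entries[key] = true
	s.mu.Unlock()
}

func (s *storage) exists(key string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.entries[key]
}

func main() {
	st := &storage{entries: map[string]bool{}}
	var secretIDLock sync.Mutex // stands in for one of b.secretIDLocks
	accessorWritten := make(chan struct{})

	// Writer: persists the accessor first, then (later) the secret ID, holding
	// secretIDLock for the whole sequence, just like secret ID creation does.
	go func() {
		secretIDLock.Lock()
		defer secretIDLock.Unlock()
		st.put("accessor/accessor-1")
		close(accessorWritten)
		time.Sleep(50 * time.Millisecond) // simulated slow storage backend
		st.put("secret_id/hmac-1")
	}()
	<-accessorWritten

	// Tidy without waiting on secretIDLock: the accessor is visible but its
	// secret ID is not, so it would be misclassified as dangling.
	if st.exists("accessor/accessor-1") && !st.exists("secret_id/hmac-1") {
		fmt.Println("accessor-1 looks dangling (race window)")
	}

	// Tidy that first acquires and releases secretIDLock waits out the
	// in-flight write, so the accessor is then seen together with its secret ID.
	secretIDLock.Lock()
	secretIDLock.Unlock()
	fmt.Println("secret ID present after waiting for the lock:", st.exists("secret_id/hmac-1"))
}
```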
After stepping through this I think I understand your comment better and realise that all of this locking could occur earlier without any need to do later "cleanups".
Waiting for the lock (and immediately releasing it) here:
vault/builtin/credential/approle/path_tidy_user_id.go, lines 151 to 163 in 6e96d87:
```go
for _, roleNameHMAC := range roleNameHMACs {
	logger.Trace("listing secret ID HMACs", "role_hmac", roleNameHMAC)
	secretIDHMACs, err := s.List(ctx, fmt.Sprintf("%s%s", secretIDPrefixToUse, roleNameHMAC))
	if err != nil {
		return err
	}
	for _, secretIDHMAC := range secretIDHMACs {
		err = secretIDCleanupFunc(secretIDHMAC, roleNameHMAC, secretIDPrefixToUse)
		if err != nil {
			return err
		}
	}
}
```
Would achieve the same thing without needing to do the cleanup for dangling accessors. By doing it here I believe that the accessor could be immediately removed once it is detected as dangling, and that

```go
if len(accessorMap) > 0 {
```
I'll test this locally first and see if it still prevents the race condition.
Yup, looks like it works just fine! I just pushed a new commit that waits for the lock before listing the secretIDHMACs. When commenting out the wait for the lock I can get a race condition within around 30 seconds. With the current code in place I have run it for 40 minutes (4 in parallel for 10 minutes) without hitting a single race condition.
```go
// Ideally, locking on accessors should be performed here too
// but for that, accessors are required in plaintext, which are
// not available. The code above helps but it may still be
// racy.
```
I'm not certain that we should change this code. It sounds like it's already racy. I'm not clear on where we might place the unlock for the mutex, but it seems like anything that holds the lock for a shorter amount of time would result in racier behavior. Perhaps I'm not thinking of it the right way and it'll be clearer once it's pushed.
By doing the wait here we don't need to re-list the secretIDs for each dangling accessor to guard against the race condition.
Previously this race condition test would fail around 10% of the time on my machine. By moving the fakeTidyDelay to after the secretIDHMACs are listed I was able to remove that flakiness: the test now fails 100% of the time when the `lock.Lock(); lock.Unlock()` code added in this pull request is removed.
@tyrannosaurus-becks are you able to take another look at my recent changes?
```go
// where the accessor has been written but not yet the secretIDHMAC
for _, lock := range b.secretIDLocks {
	lock.Lock()
	lock.Unlock()
```
Hmm, this still doesn't seem right to me. I've never seen locks grabbed before without doing some work before releasing them. I suspect that in a racy production environment this would not prevent races, though it would slow the code execution.
> I suspect that in a racy production environment this would not prevent races, though it would slow the code execution.
Are you referring to the race condition I'm describing in #6252 (comment) ? Or is there another one that you can see that could potentially cause issues?
@jefferai are you able to chime in? I don't think either of us is 100% sure whether my proposal here is safe, and you are the original author of the dangling accessor cleanup code.
Hi @Crazybus ! Today I took a closer look at the code and what it's currently doing, and I have some questions.

First of all, there's a lot happening in this code with locks and cleaning things up. I think it would be great if this PR started off with a test that reproduces the long locking behavior that you're seeing in production, so we can also see how this fix changes that test's behavior.

From there, let's suppose we're in a situation where there are 8,000 dangling accessors to clean up and it'll take 12 continuous days to clean up given the current code, which will block authentication. Let's also suppose that the current approach to protecting against raciness is the maximum amount of risk we want on raciness. Would the desired behavior be that instead of blocking auth for 12 days, accessors get cleaned up in smaller batches? Batches of a size that doesn't block authentication? Also, if we were batching these, would we want that to happen for all users, or just for users with this need? How would we differentiate between the two? What I'm leading towards here is the idea that perhaps we should consider adding a

Let me know what you think. Thanks!
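For what it's worth, the batching idea could look roughly like the sketch below. This is purely hypothetical: `cleanupInBatches`, the batch size, and the callback are not part of Vault's code, just an illustration of releasing the lock between batches so logins can interleave with the tidy.

```go
package main

import (
	"fmt"
	"sync"
)

// cleanupInBatches deletes dangling accessors in small batches, holding the
// lock only for the duration of one batch so other work can run in between.
func cleanupInBatches(lock *sync.RWMutex, dangling []string, batchSize int, deleteAccessor func(string)) {
	for start := 0; start < len(dangling); start += batchSize {
		end := start + batchSize
		if end > len(dangling) {
			end = len(dangling)
		}

		lock.Lock() // block secret ID writes only while this batch is processed
		for _, accessor := range dangling[start:end] {
			deleteAccessor(accessor)
		}
		lock.Unlock() // let queued logins run before starting the next batch
	}
}

func main() {
	var lock sync.RWMutex
	dangling := []string{"a1", "a2", "a3", "a4", "a5"}
	cleanupInBatches(&lock, dangling, 2, func(a string) { fmt.Println("deleted", a) })
}
```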
This test adds a delay to simulate the situation where cleaning up lots of dangling accessors would block new secret-ids from being created.
Added a test in 9708c3a. To test that it fails properly with current master the testing delay needs to be added at
Is there another race condition that you can see which is not being covered by the current race condition test? As far as I can tell the race condition that needs to be avoided is the situation where the accessor has been created but not yet the secretIDHMAC. Waiting for all write operations to have finished guarantees that all accessors which existed have now also had their secretIDs created too. I couldn't get the race condition test to fail even when running it in parallel for a long time. If there is another one that you are aware of I'm happy to add in a new test to make sure that it does not get hit.
I don't see how this would help. If it takes more than 60 seconds (the default vault client timeout) to list the secretIDHMACs
Hi @Crazybus ! I chatted with one of the primary authors of this code, and we don't want to proceed with this particular approach for this fix.

As for what approach, it probably doesn't make sense for us to continue discussing the approach when we, the Vault folks, don't really have as sound a grasp of the issue as you do. I think we should get there first before we continue to dive into this code. It does seem strange that it would take 12 days to clean up 8000 dangling accessors, and we're not certain yet from the issue description that Vault is the source of the problem. There could be other contributing factors, like the enclosing environment having gone into swap.

Would you be willing to instead file a bug report with steps to reproduce what you're seeing? If Vault is the problem, we'd like to take a closer look and come up with a fix. Steps to reproduce would include what Vault version you're on, how to recreate an environment like what you're seeing, maybe even pertinent information around what it's running on, etc.

I'm going to go ahead and close this PR for now, until we can get our heads better around this. Thanks again for working on this!
Done! #6710
This is a less aggressive followup to hashicorp#6252 which helps to mitigate the total downtime during the dangling accessor cleanup described in hashicorp#6710. It changes the downtime from `(time_to_list_all_accessors * dangling_accessors_count) + time_to_delete_dangling_accessors` to just `time_to_list_all_accessors + time_to_delete_dangling_accessors`.

In our situation, where we had 8000 dangling accessors and it took 2 minutes to list all secret-ids, this means going from ~12 days of downtime to `2 minutes + (time_to_delete_accessor * 8000)`, which would be around 15 minutes in total instead.

This change gets the list of secret-id HMACs once instead of getting the exact same list for each dangling accessor. It only needs to do this once at the start because the secret-id backends all have write locks on them, making it impossible for them to be changed during the cleanup process.
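As a rough sanity check of those numbers: only the 2 minute listing time and the 8000 dangling accessors come from the description above; the 100ms per-accessor delete time is an assumption picked to line up with the ~15 minute estimate.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	listAllSecretIDs := 2 * time.Minute         // observed listing time
	danglingAccessors := 8000                   // observed dangling accessor count
	deletePerAccessor := 100 * time.Millisecond // assumed

	before := time.Duration(danglingAccessors)*listAllSecretIDs +
		time.Duration(danglingAccessors)*deletePerAccessor
	after := listAllSecretIDs + time.Duration(danglingAccessors)*deletePerAccessor

	fmt.Println("before:", before) // ≈ 267h, i.e. roughly the "~12 days" quoted above
	fmt.Println("after:", after)   // ≈ 15m20s
}
```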
Continuing the discussion with @jefferai from #7052 in here:

My short answer to this is that there is no need to double check if the accessor is dangling or not, because of where the locking is now done (before the dangling accessor cleanup). Here is the slightly longer version comparing the two approaches with some pseudo code (current approach vs. the approach in this PR):
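Roughly, the comparison looks like this (pseudo-Go based on the two PR descriptions; the variable names are illustrative, not the actual code):

```go
// Current approach: for every dangling accessor, list the secret IDs of every
// role again to double check that the accessor is really dangling.
for accessorHash := range danglingAccessorMap {
	for _, roleNameHMAC := range roleNameHMACs {
		secretIDHMACs, _ := s.List(ctx, secretIDPrefixToUse+roleNameHMAC) // expensive, repeated per accessor
		// ... check whether any of these secret IDs still reference accessorHash ...
		_ = secretIDHMACs
	}
	// delete the accessor only if it is still unmatched
}

// Approach in this PR: list the secret IDs of every role once, up front, while
// the secret ID locks guarantee nothing is being written, and re-use that list
// for every dangling accessor.
allSecretIDHMACs := []string{}
for _, roleNameHMAC := range roleNameHMACs {
	secretIDHMACs, _ := s.List(ctx, secretIDPrefixToUse+roleNameHMAC) // expensive, done once
	allSecretIDHMACs = append(allSecretIDHMACs, secretIDHMACs...)
}
for accessorHash := range danglingAccessorMap {
	// compare accessorHash against allSecretIDHMACs and delete it if unmatched
	_ = accessorHash
}
```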
Closing this one out since a much better implementation has been written and merged in #8418 |
Closes #6710