Optimize cleaning up of dangling accessors #7052
Conversation
This is a less aggressive follow-up to hashicorp#6252 which helps to mitigate the total downtime during the dangling-accessor cleanup described in hashicorp#6710. It changes the downtime from `(time_to_list_all_accessors * dangling_accessors_count) + time_to_delete_dangling_accessors` to just `time_to_list_all_accessors + time_to_delete_dangling_accessors`. In our situation, with 8000 dangling accessors and a listing of all secret-ids taking 2 minutes, this goes from ~12 days of downtime to `2 minutes + (time_to_delete_accessor * 8000)`, which is around 15 minutes in total. This change fetches the list of secret-id HMACs once instead of fetching the exact same list for each dangling accessor. Fetching it once at the start is sufficient because the secret-id backends all hold write locks, making it impossible for the list to change during the cleanup process.
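The described change can be sketched roughly as follows. This is an illustrative Go sketch, not Vault's actual code: the type and function names (`storage`, `listSecretIDHMACs`, `cleanupDangling`) are hypothetical. The point is hoisting the expensive LIST out of the per-accessor loop, which is safe here only because the write lock held for the whole cleanup keeps the HMAC set from changing.

```go
package main

import "fmt"

// storage is a stand-in for the secret-id backend's persisted state.
type storage struct {
	hmacs []string // secret-id HMACs stored in the backend
}

// listSecretIDHMACs simulates the expensive storage LIST operation,
// returned as a set for O(1) membership checks.
func (s *storage) listSecretIDHMACs() map[string]bool {
	set := make(map[string]bool, len(s.hmacs))
	for _, h := range s.hmacs {
		set[h] = true
	}
	return set
}

// cleanupDangling returns the accessors whose HMAC no longer exists.
// The HMAC set is built ONCE up front (time_to_list_all_accessors),
// instead of once per accessor, so total cost is
// time_to_list_all_accessors + time_to_delete_dangling_accessors.
func cleanupDangling(s *storage, accessors map[string]string) []string {
	valid := s.listSecretIDHMACs() // single LIST instead of one per accessor
	var dangling []string
	for accessor, hmac := range accessors {
		if !valid[hmac] {
			dangling = append(dangling, accessor) // no matching secret-id: dangling
		}
	}
	return dangling
}

func main() {
	s := &storage{hmacs: []string{"h1", "h2"}}
	// "a2" points at a deleted secret-id, so it is dangling.
	accessors := map[string]string{"a1": "h1", "a2": "h3"}
	fmt.Println(cleanupDangling(s, accessors))
}
```

With 8000 dangling accessors and a 2-minute list, moving `listSecretIDHMACs` out of the loop is what turns `2min * 8000` of listing into a single `2min` pass.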
It sounds like you have some real IO issues somewhere. Taking 2 minutes to perform this sounds well above what I'd expect.
I think this seems like a nice approach but doesn't really address the underlying issue, which is the lock being held during the process. 15 minutes will still cause a lot of client requests to time out. It seems like it should be possible to only grab the lock explicitly when needed and release it when it's not needed, allowing other operations to continue.
It's a combination of IO and the scale. The GCS backend is pretty fast but this runs into issues when you have a lot of approles and secret-ids. We currently have 50 approles. For most approles the listing takes between 50-200ms, for the largest one it takes around 30 seconds. I can get exact numbers for how many secret ids are in the largest approle if you are interested. Here are the logs with the timestamps for how long each list operation takes.
I 100% agree, which is actually the approach I took in #6252. If you think that approach looks good (and is safe) then I would rather re-open that PR and continue working on it. If there are some changes needed then I'm more than willing to work on it, as that would bring our downtime from ~2 minutes to none.
At time of writing we currently have 91055 secret ids that need to be listed. This is what the distribution looks like:
I don't really know why the other one was closed, so my feedback above is in a bubble. I just think that, while this change is (probably) good, unless we actually need the lock the whole time it would also be good to fix the locking behavior so other requests can run.
I believe there are issues with the previous PR, but it seems to address the race condition, so probably that's the right place to start. I'll close this and reopen that; can you true up the code between the two and poke me when you think it's ready for a look? One thing in particular that I didn't follow as I looked at the other PR just now: once you have done the locking/unlocking, you then go through the accessor map and unilaterally delete anything found there. I would think you'd need to revalidate whether or not the entries there are actually dangling.