Optimize cleaning up of dangling accessors #7052
Conversation
This is a less aggressive follow-up to hashicorp#6252 which helps to mitigate the total downtime during the dangling-accessor cleanup described in hashicorp#6710. It changes the downtime from `(time_to_list_all_accessors * dangling_accessors_count) + time_to_delete_dangling_accessors` to just `time_to_list_all_accessors + time_to_delete_dangling_accessors`. In our situation, with 8000 dangling accessors and a listing of all secret-ids taking 2 minutes, this goes from ~12 days of downtime to `2 minutes + (time_to_delete_accessor * 8000)`, which is around 15 minutes in total. This change fetches the list of secret-id HMACs once instead of fetching the exact same list for each dangling accessor. Fetching it once at the start is sufficient because the secret-id backends all hold write locks, making it impossible for the list to change during the cleanup process.
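The described change can be sketched roughly as follows. This is an illustrative Go sketch, not Vault's actual code: the type and function names (`storage`, `listSecretIDHMACs`, `cleanupDangling`) are hypothetical. The point is hoisting the expensive LIST out of the per-accessor loop, which is safe here only because the write lock held for the whole cleanup keeps the HMAC set from changing.

```go
package main

import "fmt"

// storage is a stand-in for the secret-id backend's persisted state.
type storage struct {
	hmacs []string // secret-id HMACs stored in the backend
}

// listSecretIDHMACs simulates the expensive storage LIST operation,
// returned as a set for O(1) membership checks.
func (s *storage) listSecretIDHMACs() map[string]bool {
	set := make(map[string]bool, len(s.hmacs))
	for _, h := range s.hmacs {
		set[h] = true
	}
	return set
}

// cleanupDangling returns the accessors whose HMAC no longer exists.
// The HMAC set is built ONCE up front (time_to_list_all_accessors),
// instead of once per accessor, so total cost is
// time_to_list_all_accessors + time_to_delete_dangling_accessors.
func cleanupDangling(s *storage, accessors map[string]string) []string {
	valid := s.listSecretIDHMACs() // single LIST instead of one per accessor
	var dangling []string
	for accessor, hmac := range accessors {
		if !valid[hmac] {
			dangling = append(dangling, accessor) // no matching secret-id: dangling
		}
	}
	return dangling
}

func main() {
	s := &storage{hmacs: []string{"h1", "h2"}}
	// "a2" points at a deleted secret-id, so it is dangling.
	accessors := map[string]string{"a1": "h1", "a2": "h3"}
	fmt.Println(cleanupDangling(s, accessors))
}
```

With 8000 dangling accessors and a 2-minute list, moving `listSecretIDHMACs` out of the loop is what turns `2min * 8000` of listing into a single `2min` pass.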
It sounds like you have some real IO issues somewhere. Taking 2 minutes to perform this sounds well above what I'd expect.
I think this seems like a nice approach but doesn't really address the underlying issue, which is the lock being held during the process. 15 minutes will still cause a lot of client requests to time out. It seems like it should be possible to only grab the lock explicitly when needed and release it when it's not needed, allowing other operations to continue.
It's a combination of IO and the scale. The GCS backend is pretty fast but this runs into issues when you have a lot of approles and secret-ids. We currently have 50 approles. For most approles the listing takes between 50-200ms, for the largest one it takes around 30 seconds. I can get exact numbers for how many secret ids are in the largest approle if you are interested. Here are the logs with the timestamps for how long each list operation takes.
I 100% agree, which is actually the approach I took in #6252. If you think that approach looks good (and is safe) then I would rather re-open that PR and continue working on it. If there are some changes needed then I'm more than willing to work on it, as that would bring our downtime from ~2 minutes to none.
At time of writing we currently have 91055 secret ids that need to be listed. This is what the distribution looks like:
I don't really know why the other one was closed, so my feedback above is in a bubble. I just think that, while this change is (probably) good, unless we actually need the lock the whole time it would also be good to fix the locking behavior so other requests can run.
I believe there are issues with the previous PR, but it seems to address the race condition, so probably that's the right place to start. I'll close this and reopen that; can you true up the code between the two and poke me when you think it's ready for a look? One thing in particular that I didn't follow as I looked at the other PR just now: once you have done the locking/unlocking, you then go through the accessor map and unilaterally delete anything found there. I would think you'd need to revalidate whether or not the entries there are actually dangling.