
Optimize cleaning up of dangling accessors #7052

Status: Closed (wanted to merge 1 commit)

Conversation

@Crazybus (Contributor) commented Jul 3, 2019

This is a less aggressive followup to #6252 which helps to mitigate the
total downtime during the dangling accessor cleanup described in #6710.
It changes the downtime from `(time_to_list_all_accessors *
dangling_accessors_count) + time_to_delete_dangling_accessors` to just
`time_to_list_all_accessors + time_to_delete_dangling_accessors`. In our
situation, with 8000 dangling accessors and 2 minutes to list all
secret-ids, that means going from ~12 days of downtime to `2 minutes +
(time_to_delete_accessor * 8000)`, which is around 15 minutes in total.

This change fetches the list of secret-id HMACs once instead of fetching
the exact same list for each dangling accessor. It only needs to be done
once at the start because the secret-id backends all have write locks
held on them, making it impossible for them to change during the cleanup
process.

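The shape of the optimization can be illustrated with a simplified sketch. This is Python for brevity, not the actual change (which lives in Vault's Go AppRole backend), and `FakeBackend`, `tidy_old`, and `tidy_new` are hypothetical names:

```python
from dataclasses import dataclass

@dataclass
class Accessor:
    secret_id_hmac: str

@dataclass
class FakeBackend:
    # Minimal in-memory stand-in for the AppRole storage backend.
    secret_id_hmacs: set
    accessors: list
    list_calls: int = 0  # counts the expensive full HMAC listings

    def list_all_secret_id_hmacs(self):
        self.list_calls += 1
        return set(self.secret_id_hmacs)

    def delete_accessor(self, accessor):
        self.accessors.remove(accessor)

def tidy_old(backend):
    # Before: the full secret-id HMAC listing is repeated for every
    # accessor checked, so cost grows with accessor_count * list_time.
    for accessor in list(backend.accessors):
        if accessor.secret_id_hmac not in backend.list_all_secret_id_hmacs():
            backend.delete_accessor(accessor)

def tidy_new(backend):
    # After: list once up front. This is safe because the secret-id
    # backends hold write locks for the whole cleanup, so the set of
    # HMACs cannot change while we iterate.
    hmacs = backend.list_all_secret_id_hmacs()
    for accessor in list(backend.accessors):
        if accessor.secret_id_hmac not in hmacs:
            backend.delete_accessor(accessor)
```

Both versions delete the same dangling accessors; the difference is that the old loop performs one full listing per accessor while the new one performs exactly one.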
@jefferai (Member) commented Jul 3, 2019

> In our situation where we had 8000 dangling accessors taking 2 minutes to list all secret-ids means going from ~12 days of downtime

It sounds like you have some real IO issues somewhere. Taking 2 minutes to perform this sounds well above what I'd expect.

@jefferai (Member) commented Jul 3, 2019

I think this seems like a nice approach but doesn't really address the underlying issue, which is the lock being held during the process. 15 minutes will still cause a lot of client requests to timeout. It seems like it should be possible to only grab the lock explicitly when needed and release it when it's not needed, allowing other operations to continue.
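A rough sketch of this suggestion, in Python for illustration only (Vault's actual lock handling is in the Go AppRole backend; the function names here are hypothetical):

```python
import threading

def tidy_holding_lock(lock, dangling, delete):
    # Current behavior: the lock is held for the entire cleanup, so
    # other client requests block for the whole run.
    with lock:
        for accessor in dangling:
            delete(accessor)

def tidy_lock_per_delete(lock, dangling, delete):
    # Suggested behavior: take the lock only around each individual
    # delete, so other operations can interleave between deletions.
    for accessor in dangling:
        with lock:
            delete(accessor)
```

The total work is the same in both cases; the second variant just bounds how long any other request can be blocked to a single delete rather than the whole cleanup.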

@jefferai modified the milestones: 1.3, 1.2 (Jul 3, 2019)
@Crazybus (Contributor, Author) commented Jul 3, 2019

> It sounds like you have some real IO issues somewhere. Taking 2 minutes to perform this sounds well above what I'd expect.

It's a combination of IO and the scale. The GCS backend is pretty fast, but this runs into issues when you have a lot of approles and secret-ids. We currently have 50 approles. For most approles the listing takes between 50 and 200 ms; for the largest one it takes around 30 seconds. I can get exact numbers for how many secret ids are in the largest approle if you are interested.

Here are the logs with the timestamps for how long each list operation takes.

2019-07-03T19:16:57.054Z [TRACE] auth.approle.auth_approle_3001920a.tidy: listing secret ID HMACs: role_hmac=removed
2019-07-03T19:16:57.101Z [TRACE] auth.approle.auth_approle_3001920a.tidy: listing secret ID HMACs: role_hmac=removed
2019-07-03T19:16:57.379Z [TRACE] auth.approle.auth_approle_3001920a.tidy: listing secret ID HMACs: role_hmac=removed
2019-07-03T19:16:57.427Z [TRACE] auth.approle.auth_approle_3001920a.tidy: listing secret ID HMACs: role_hmac=removed
2019-07-03T19:16:57.657Z [TRACE] auth.approle.auth_approle_3001920a.tidy: listing secret ID HMACs: role_hmac=removed
2019-07-03T19:16:57.710Z [TRACE] auth.approle.auth_approle_3001920a.tidy: listing secret ID HMACs: role_hmac=removed
2019-07-03T19:16:57.821Z [TRACE] auth.approle.auth_approle_3001920a.tidy: listing secret ID HMACs: role_hmac=removed
2019-07-03T19:16:57.872Z [TRACE] auth.approle.auth_approle_3001920a.tidy: listing secret ID HMACs: role_hmac=removed
2019-07-03T19:16:57.909Z [TRACE] auth.approle.auth_approle_3001920a.tidy: listing secret ID HMACs: role_hmac=removed
2019-07-03T19:16:57.950Z [TRACE] auth.approle.auth_approle_3001920a.tidy: listing secret ID HMACs: role_hmac=removed
2019-07-03T19:16:57.996Z [TRACE] auth.approle.auth_approle_3001920a.tidy: listing secret ID HMACs: role_hmac=removed
2019-07-03T19:16:58.040Z [TRACE] auth.approle.auth_approle_3001920a.tidy: listing secret ID HMACs: role_hmac=removed
2019-07-03T19:16:58.066Z [TRACE] auth.approle.auth_approle_3001920a.tidy: listing secret ID HMACs: role_hmac=removed
2019-07-03T19:16:58.109Z [TRACE] auth.approle.auth_approle_3001920a.tidy: listing secret ID HMACs: role_hmac=removed
2019-07-03T19:16:58.147Z [TRACE] auth.approle.auth_approle_3001920a.tidy: listing secret ID HMACs: role_hmac=removed
2019-07-03T19:17:01.389Z [TRACE] auth.approle.auth_approle_3001920a.tidy: listing secret ID HMACs: role_hmac=removed
2019-07-03T19:17:01.426Z [TRACE] auth.approle.auth_approle_3001920a.tidy: listing secret ID HMACs: role_hmac=removed
2019-07-03T19:17:01.477Z [TRACE] auth.approle.auth_approle_3001920a.tidy: listing secret ID HMACs: role_hmac=removed
2019-07-03T19:17:01.511Z [TRACE] auth.approle.auth_approle_3001920a.tidy: listing secret ID HMACs: role_hmac=removed
2019-07-03T19:17:01.867Z [TRACE] auth.approle.auth_approle_3001920a.tidy: listing secret ID HMACs: role_hmac=removed
2019-07-03T19:17:02.356Z [TRACE] auth.approle.auth_approle_3001920a.tidy: listing secret ID HMACs: role_hmac=removed
2019-07-03T19:17:02.404Z [TRACE] auth.approle.auth_approle_3001920a.tidy: listing secret ID HMACs: role_hmac=removed
2019-07-03T19:17:02.507Z [TRACE] auth.approle.auth_approle_3001920a.tidy: listing secret ID HMACs: role_hmac=removed
2019-07-03T19:17:02.546Z [TRACE] auth.approle.auth_approle_3001920a.tidy: listing secret ID HMACs: role_hmac=removed
2019-07-03T19:17:02.606Z [TRACE] auth.approle.auth_approle_3001920a.tidy: listing secret ID HMACs: role_hmac=removed
2019-07-03T19:17:02.659Z [TRACE] auth.approle.auth_approle_3001920a.tidy: listing secret ID HMACs: role_hmac=removed
2019-07-03T19:17:02.694Z [TRACE] auth.approle.auth_approle_3001920a.tidy: listing secret ID HMACs: role_hmac=removed
2019-07-03T19:17:02.743Z [TRACE] auth.approle.auth_approle_3001920a.tidy: listing secret ID HMACs: role_hmac=removed
2019-07-03T19:17:03.654Z [TRACE] auth.approle.auth_approle_3001920a.tidy: listing secret ID HMACs: role_hmac=removed
2019-07-03T19:17:03.693Z [TRACE] auth.approle.auth_approle_3001920a.tidy: listing secret ID HMACs: role_hmac=removed
2019-07-03T19:17:03.750Z [TRACE] auth.approle.auth_approle_3001920a.tidy: listing secret ID HMACs: role_hmac=removed
2019-07-03T19:17:03.849Z [TRACE] auth.approle.auth_approle_3001920a.tidy: listing secret ID HMACs: role_hmac=removed
2019-07-03T19:17:12.007Z [TRACE] auth.approle.auth_approle_3001920a.tidy: listing secret ID HMACs: role_hmac=removed
2019-07-03T19:17:12.078Z [TRACE] auth.approle.auth_approle_3001920a.tidy: listing secret ID HMACs: role_hmac=removed
2019-07-03T19:17:12.115Z [TRACE] auth.approle.auth_approle_3001920a.tidy: listing secret ID HMACs: role_hmac=removed
2019-07-03T19:17:12.375Z [TRACE] auth.approle.auth_approle_3001920a.tidy: listing secret ID HMACs: role_hmac=removed
2019-07-03T19:17:12.412Z [TRACE] auth.approle.auth_approle_3001920a.tidy: listing secret ID HMACs: role_hmac=removed
2019-07-03T19:17:13.217Z [TRACE] auth.approle.auth_approle_3001920a.tidy: listing secret ID HMACs: role_hmac=removed
2019-07-03T19:17:13.258Z [TRACE] auth.approle.auth_approle_3001920a.tidy: listing secret ID HMACs: role_hmac=removed
2019-07-03T19:17:42.782Z [TRACE] auth.approle.auth_approle_3001920a.tidy: listing secret ID HMACs: role_hmac=removed
2019-07-03T19:17:42.826Z [TRACE] auth.approle.auth_approle_3001920a.tidy: listing secret ID HMACs: role_hmac=removed
2019-07-03T19:17:42.880Z [TRACE] auth.approle.auth_approle_3001920a.tidy: listing secret ID HMACs: role_hmac=removed
2019-07-03T19:17:42.926Z [TRACE] auth.approle.auth_approle_3001920a.tidy: listing secret ID HMACs: role_hmac=removed
2019-07-03T19:17:42.963Z [TRACE] auth.approle.auth_approle_3001920a.tidy: listing secret ID HMACs: role_hmac=removed
2019-07-03T19:17:43.013Z [TRACE] auth.approle.auth_approle_3001920a.tidy: listing secret ID HMACs: role_hmac=removed
2019-07-03T19:17:43.059Z [TRACE] auth.approle.auth_approle_3001920a.tidy: listing secret ID HMACs: role_hmac=removed
2019-07-03T19:17:43.105Z [TRACE] auth.approle.auth_approle_3001920a.tidy: listing secret ID HMACs: role_hmac=removed
2019-07-03T19:17:46.513Z [TRACE] auth.approle.auth_approle_3001920a.tidy: listing secret ID HMACs: role_hmac=removed
2019-07-03T19:17:47.502Z [TRACE] auth.approle.auth_approle_3001920a.tidy: listing secret ID HMACs: role_hmac=removed
2019-07-03T19:17:47.620Z [TRACE] auth.approle.auth_approle_3001920a.tidy: listing secret ID HMACs: role_hmac=removed

@Crazybus (Contributor, Author) commented Jul 3, 2019

> I think this seems like a nice approach but doesn't really address the underlying issue, which is the lock being held during the process. 15 minutes will still cause a lot of client requests to timeout. It seems like it should be possible to only grab the lock explicitly when needed and release it when it's not needed, allowing other operations to continue.

I 100% agree, which is actually the approach I took in #6252. If you think that approach looks good (and is safe) then I would rather re-open that PR and continue working on it. If there are changes needed then I'm more than willing to work on them, as that approach would bring our downtime from ~2 minutes to none.

@Crazybus (Contributor, Author) commented Jul 3, 2019

> I can get exact numbers for how many secret ids are in the largest approle if you are interested.

At the time of writing we have 91055 secret ids that need to be listed. This is what the distribution of secret ids per approle looks like:

27207
22662
10023
8320 
7562 
5760 
2307 
1792 
1777 
1366 
 355 
 280 
 260 
 231 
 220 
 184 
 161 
  81 
  71 
  63 
  62 
  53 
  33 
  26 
  14 
  11 
  10 
   9 
   8 
   7 
   7 
   6 
   5 
   5 
   5 
   5 
   4 
   4 
   4 
   4 
   4 
   4 
   4 
   4 
   4 
   4 
   4 
   4 
   2 
   2 
   2 

@jefferai (Member) commented Jul 3, 2019

I don't really know why the other one was closed, so my feedback above is in a bubble. I just think that, while this change is (probably) good, unless we need the lock the whole time it would also be good to fix the locking behavior so other requests can run.

@jefferai (Member) commented Jul 3, 2019

I believe there are issues with the previous PR, but it seems to address the race condition, so that's probably the right place to start. I'll close this and reopen that; can you true up the code between the two and poke when you think it's ready for a look? One thing in particular that I didn't follow as I looked at the other PR just now: once you have done the locking/unlocking, you then go through the accessor map and unilaterally delete anything found there. I would think you'd need to revalidate whether or not the entries there are actually dangling.
