Optimize tidying up dangling secret-id accessors #6252

Closed

Conversation

Contributor

@Crazybus Crazybus commented Feb 18, 2019

Previous behavior was to do the following:

  1. Get a lock on all secret ids
  2. For each dangling accessor list all of the secret ids in all of the
    approles

During this process a write lock is held on the approles, which prevents any
authentication from happening. For short periods of time this can go
unnoticed; however, there are situations where it causes a lot of
service disruption (full details at the end).

New behavior:

  1. Wait for a lock on all of the secret ids to make sure that all
    secret-ids have been written for the list of accessors we created
    earlier.
  2. Get the full list of secret id hmacs only once and re-use it for all
    dangling accessors.

The previous comment explained that the locking is only present to avoid a race
condition between an accessor being written and its secret-id not existing yet.
If there is another reason why a lock is necessary while getting the
list of secret ids, I can easily update this so that listing all secret
ids happens while the locks are held. My understanding is that just
waiting to acquire each lock and then releasing it is enough to guarantee
that we don't hit the accessor-exists-but-no-secret-id-yet race
condition.
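
To make that concrete, here is a minimal, self-contained sketch of the wait-then-release barrier (a plain slice of mutexes stands in for b.secretIDLocks; this is an illustration, not the actual Vault code):

package main

import (
    "fmt"
    "sync"
)

func main() {
    // Stand-in for b.secretIDLocks: mutexes that writers hold while
    // persisting an accessor together with its secret-id.
    secretIDLocks := make([]*sync.Mutex, 256)
    for i := range secretIDLocks {
        secretIDLocks[i] = &sync.Mutex{}
    }

    // Acquiring and immediately releasing every lock acts as a barrier:
    // any write that was already holding one of the locks when we try to
    // acquire it must have completed by the time the loop finishes, so
    // every accessor listed earlier now has its secret-id written too.
    for _, lock := range secretIDLocks {
        lock.Lock()
        lock.Unlock()
    }

    fmt.Println("all in-flight secret-id writes have completed")
}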

Why am I making these changes and what happened?

We have lots of approles, with lots of secret-ids and lots of
accessors. An approle was removed which contained ~8000 dangling
accessors. Once that approle was removed we suddenly saw that it was no
longer possible to authenticate with approle authentication. After
much troubleshooting we enabled trace logging and saw that it had to do
with dangling accessors, but we couldn't tell how many there were or how
long they would take to clean up.

After adding some extra trace logging I discovered that there were
~8000 of them, and that because of the number of other approles and
secret-ids it was taking around 2 minutes per dangling accessor to clean
up. With the previous logic this would have taken at least 12 days while
holding a lock on approle authentication the entire time.

Part of the slowness is no doubt caused by the fact that we use GCS as
the storage backend, and I'm sure such a large number of approles and
secret ids is neither best practice nor expected. However, this Vault
instance is run as a service for other teams, so we don't have full
control over how they use it, and Vault happily let them create that
many approles and secret ids.

Closes #6710


for _, lock := range b.secretIDLocks {
    lock.Lock()
-   defer lock.Unlock()
+   lock.Unlock()
Contributor

Is this in the right place? It seems strange to me that we would grab a lock and then unlock it, plus below there's a comment that we're doing stuff now that we waited for the lock. Usually locks are held for the duration of time when we're doing work that might otherwise lead to a race condition.

Contributor Author

I can't say for certain if this is the right place. But for the current implementation I believe this does achieve the same goal with way less downtime.

> so if we have the lock we know we're not in a wrote-accessor-but-not-yet-secret case, which can be racy.

So this means we want to make sure that nothing is still in the middle of writing a secret-id between these two calls.

First a list of accessors is created. It is possible that an accessor exists but the secretID has not been written yet.

accessorHashes, err := s.List(ctx, accessorIDPrefixToUse)

Then a list of secretIDs is made:

secretIDHMACs, err := s.List(ctx, fmt.Sprintf("%s%s", secretIDPrefixToUse, roleNameHMAC))

In the scenario where the secretID does not exist yet, this accessor will not be removed from the accessor map and will be considered dangling.

By waiting for a lock here:

for _, lock := range b.secretIDLocks {
    lock.Lock()
    lock.Unlock()
}
// Now that we waited for a lock get a fresh list of all of the secretIDHMACs
// so that we can be sure if the accessor is dangling or not
allSecretIDHMACs := make([]string, 0)
for _, roleNameHMAC := range roleNameHMACs {
    secretIDHMACs, err := s.List(ctx, fmt.Sprintf("%s%s", secretIDPrefixToUse, roleNameHMAC))
    if err != nil {
        return err
    }
    for _, v := range secretIDHMACs {
        allSecretIDHMACs = append(allSecretIDHMACs, v)
    }
}

we can be sure that all secretIDs have been written since the list of accessors was created, which prevents the race condition from occurring.

After stepping through this I think I understand your comment better and realise that all of this locking could occur earlier without any need to do later "cleanups".

Waiting for the lock (and immediately releasing it) right before this loop:

for _, roleNameHMAC := range roleNameHMACs {
    logger.Trace("listing secret ID HMACs", "role_hmac", roleNameHMAC)
    secretIDHMACs, err := s.List(ctx, fmt.Sprintf("%s%s", secretIDPrefixToUse, roleNameHMAC))
    if err != nil {
        return err
    }
    for _, secretIDHMAC := range secretIDHMACs {
        err = secretIDCleanupFunc(secretIDHMAC, roleNameHMAC, secretIDPrefixToUse)
        if err != nil {
            return err
        }
    }
}

would achieve the same thing without needing the later cleanup for dangling accessors. By doing it there I believe the accessor could be removed immediately once it is detected as dangling, and that

if len(accessorMap) > 0 {

could be removed altogether, as the situation would no longer be possible.

I'll test this locally first and see if it still prevents the race condition.

Contributor Author

Yup, looks like it works just fine! I just pushed a new commit that waits for the lock before listing the secretIDHMACs. With the lock waiting commented out I can trigger the race condition within around 30 seconds; with the current code in place I have run the test for 40 minutes (4 runs in parallel for 10 minutes each) without hitting a single race condition.

// Ideally, locking on accessors should be performed here too
// but for that, accessors are required in plaintext, which are
// not available. The code above helps but it may still be
// racy.
Contributor

I'm not certain that we should change this code. It sounds like it's already racy. I'm not clear on where we might place the unlock for the mutex, but it seems like anything that holds the lock for a shorter amount of time would result in racier behavior. Perhaps I'm not thinking of it the right way and it'll be clearer once it's pushed.

Crazybus and others added 3 commits March 7, 2019 18:35
By doing the wait here we no longer need to re-list the secretIDs for
dangling accessors in order to guard against the race condition.
Previously this race condition test would fail around 10% of the time on
my own machine. By moving the delay to after listing the secretIDHMACs
I'm able to get it to fail 100% of the time when removing the
`lock.Lock() lock.Unlock()` code that has been added in this pull
request.
@Crazybus
Contributor Author

Crazybus commented Mar 8, 2019

By moving the fakeTidyDelay to somewhere after the secretIDHMACs are listed I was able to remove the flakiness of the race condition test. I'm now able to get it to fail 100% of the time when removing the lock.Lock() / lock.Unlock() code that I added in this pull request. And I'm also able to confirm that it passes every time when running the tests in 10 parallel terminals for a long period of time with:

go test -count=100 -failfast -run TestAppRole_TidyDanglingAccessors_RaceTest github.com/hashicorp/vault/builtin/credential/approle

@Crazybus
Contributor Author

@tyrannosaurus-becks are you able to take another look at my recent changes?

// where the accessor has been written but not yet the secretIDHMAC
for _, lock := range b.secretIDLocks {
    lock.Lock()
    lock.Unlock()
Contributor

Hmm, this still doesn't seem right to me. I've never seen locks grabbed without doing some work before releasing them. I suspect that in a racy production environment this would not prevent races, though it would slow the code execution.

Contributor Author

> I suspect that in a racy production environment this would not prevent races, though it would slow the code execution.

Are you referring to the race condition I described in #6252 (comment)? Or is there another one that you can see that could potentially cause issues?

@jefferai are you able to chime in? I don't think either of us is 100% sure whether my proposal here is safe, and you are the original author of the dangling accessor cleanup code.

@tyrannosaurus-becks
Contributor

Hi @Crazybus! Today I took a closer look at the code and what it's currently doing, and I have some questions.

First of all, there's a lot happening in this code with locks and cleaning things up. I think it would be great if this PR started off with a test that reproduces the long locking behavior that you're seeing in production, so we can also see how this fix changes that test's behavior.

From there, let's suppose we're in a situation where there are 8,000 dangling accessors to clean up and it'll take 12 continuous days to clean up given the current code, which will block authentication. Let's also suppose that the current approach to protecting against raciness is the maximum amount of risk we want on raciness.

Would the desired behavior be that instead of blocking auth for 12 days, accessors get cleaned up in smaller batches? Batches of a size that doesn't block authentication? Also, if we were batching these, would we want that to happen for all users, or just for users with this need? How would we differentiate between the two?

What I'm leading towards here is the idea that perhaps we should consider adding a batch_size parameter to the tidy call that would give pauses for authentication. In just looking at the code, I'm not certain where exactly that should be inserted, but I think with having a test to guide us it would help a lot.
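
For illustration only, a rough sketch of what such a knob might look like — none of these names exist in the tidy code today, they are purely hypothetical placeholders for the idea:

package main

import (
    "fmt"
    "time"
)

// tidyInBatches is a hypothetical helper: clean up accessors in batches of
// batchSize, pausing between batches so authentication requests get a chance
// to acquire the locks that cleanup would otherwise hold continuously.
func tidyInBatches(accessors []string, batchSize int, pause time.Duration, cleanup func(string) error) error {
    for start := 0; start < len(accessors); start += batchSize {
        end := start + batchSize
        if end > len(accessors) {
            end = len(accessors)
        }
        for _, accessor := range accessors[start:end] {
            if err := cleanup(accessor); err != nil {
                return err
            }
        }
        // Pause with no locks held so logins can proceed.
        time.Sleep(pause)
    }
    return nil
}

func main() {
    accessors := []string{"a1", "a2", "a3", "a4", "a5"}
    _ = tidyInBatches(accessors, 2, 10*time.Millisecond, func(a string) error {
        fmt.Println("tidied", a)
        return nil
    })
}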

Let me know what you think. Thanks!

This test adds a delay to simulate the situation where cleaning up lots
of dangling accessors blocks new secret-ids from being created.
@Crazybus
Contributor Author

> First of all, there's a lot happening in this code with locks and cleaning things up. I think it would be great if this PR started off with a test that reproduces the long locking behavior that you're seeing in production, so we can also see how this fix changes that test's behavior.

Added a test in 9708c3a. To test that it fails properly with current master, the testing delay needs to be added at the linked line.

> Let's also suppose that the current approach to protecting against raciness is the maximum amount of risk we want on raciness.

Is there another race condition that you can see which is not covered by the current race condition test? As far as I can tell the race condition that needs to be avoided is the situation where the accessor has been created but not yet the secretIDHMAC. Waiting for all write operations to finish guarantees that all accessors which existed have now also had their secretIDs created. I couldn't get the race condition test to fail even once when running it in parallel for a long time. If there is another one that you are aware of I'm happy to add a new test to make sure that it does not get hit.

> Would the desired behavior be that instead of blocking auth for 12 days, accessors get cleaned up in smaller batches? Batches of a size that doesn't block authentication? Also, if we were batching these, would we want that to happen for all users, or just for users with this need? How would we differentiate between the two?

> What I'm leading towards here is the idea that perhaps we should consider adding a batch_size parameter to the tidy call that would give pauses for authentication. In just looking at the code, I'm not certain where exactly that should be inserted, but I think with having a test to guide us it would help a lot.

I don't see how this would help. If it takes more than 60 seconds (the default vault client timeout) to list the secretIDHMACs in the `for _, roleNameHMAC := range roleNameHMACs` loop, then even with a batch size of 1 this is still going to cause failed logins.

@tyrannosaurus-becks
Contributor

Hi @Crazybus!

I chatted with one of the primary authors of this code, and we don't want to proceed with this particular approach for this fix. As for what approach, it probably doesn't make sense for us to continue discussing the approach when we, the Vault folks, don't really have as sound a grasp of the issue as you do. I think we should get there first before we continue to dive into this code.

It does seem strange that it would take 12 days to clean up 8000 dangling accessors, and we're not certain yet from the issue description that Vault is the source of the problem. There could be other contributing factors, like the enclosing environment having gone into swap.

Would you be willing to instead file a bug report with steps to reproduce what you're seeing? If Vault is the problem, we'd like to take a closer look and come up with a fix. Steps to reproduce would include what Vault version you're on, how to recreate an environment like what you're seeing, maybe even pertinent information around what it's running on, etc.

I'm going to go ahead and close this PR for now, until we can get our heads better around this. Thanks again for working on this!

@Crazybus
Contributor Author

Crazybus commented May 9, 2019

> Would you be willing to instead file a bug report with steps to reproduce what you're seeing?

Done! #6710

Crazybus added a commit to Crazybus/vault that referenced this pull request Jul 3, 2019
This is a less aggressive follow-up to hashicorp#6252 which helps to mitigate the
total downtime during the dangling accessor cleanup described in hashicorp#6710.
It changes the downtime from `(time_to_list_all_accessors *
dangling_accessors_count) + time_to_delete_dangling_accessors` to just
`time_to_list_all_accessors + time_to_delete_dangling_accessors`. In our
situation, where we had 8000 dangling accessors and it took 2 minutes to
list all secret-ids, this means going from ~12 days of downtime to
`2 minutes + (time_to_delete_accessor * 8000)`, which would be around 15
minutes in total.

This change gets the list of secret-id HMACs once instead of getting the
exact same list for each dangling accessor. It only needs to be done once
at the start because the secret-id backends all have write locks on them,
making it impossible for them to be changed during the cleanup process.
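
As a rough back-of-the-envelope check of those numbers (the per-accessor delete time below is an assumption, picked only to match the ~15 minute estimate):

package main

import (
    "fmt"
    "time"
)

func main() {
    const danglingAccessors = 8000
    listAllSecretIDs := 2 * time.Minute         // observed listing cost on this GCS-backed cluster
    deletePerAccessor := 100 * time.Millisecond // assumed per-accessor delete cost

    // Old behaviour: the full secret-id listing is repeated for every dangling accessor.
    oldDowntime := listAllSecretIDs*danglingAccessors + deletePerAccessor*danglingAccessors

    // New behaviour: the listing happens once, then each dangling accessor is deleted.
    newDowntime := listAllSecretIDs + deletePerAccessor*danglingAccessors

    fmt.Println("old downtime:", oldDowntime) // on the order of the ~12 days quoted above
    fmt.Println("new downtime:", newDowntime) // around 15 minutes
}
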
@jefferai jefferai reopened this Jul 3, 2019
@Crazybus
Contributor Author

Crazybus commented Jul 4, 2019

Continuing the discussion with @jefferai from #7052 here:

> I believe there are issues with the previous PR but it seems to address the race condition. Probably that's the right place to start. I'll close this and reopen that, can you true up the code between the two and poke when you think it's ready for a look?

> One thing in particular that I didn't follow as I looked at the other PR right now is that once you have done the locking/unlocking you then go through the accessor map and unilaterally delete anything found there -- I would think you'd need to revalidate whether or not the entries there are actually dangling.

My short answer is that there is no need to double-check whether the accessor is dangling: because the locking is now done before secretIDCleanupFunc is called, we know that the roleNameHMACs and secretIDHMACs already cover all accessors. During secretIDCleanupFunc the accessor is either valid and removed from the accessor map, or an error is encountered which causes the code to return. I could be totally off here, so if there is something I'm missing I would love to add a test for the scenario where these assumptions don't hold.

Here is the slightly longer version comparing the two approaches with some pseudo code.

Current approach:

list roleNameHMACs
list accessors
create accessorMap with all accessors
run secretIDCleanupFunc looping over roleNameHMACs and SecretIDHMACs
loop over dangling accessors
    lock all secretids
    get a fresh list of roleNameHMACs and SecretIDHMACs to avoid the race condition
    delete it if it is still really dangling; if a matching secret-id is found, don't delete it

Approach in this PR:

list roleNameHMACs
list accessors
create accessorMap with all accessors
    wait for a lock and then immediately unlock each secretIDLock. This makes sure that all write operations to the secretID backend have finished, so we don't need to double-check later for dangling accessors.
run secretIDCleanupFunc looping over roleNameHMACs and SecretIDHMACs
loop over dangling accessors
    delete them all since we already know that they are dangling
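
The same flow as a self-contained toy in Go (everything here is a simplified stand-in for illustration, not the real storage or backend types):

package main

import (
    "fmt"
    "sync"
)

func main() {
    // Toy model: accessor -> secret-id HMAC, where "" means the secret-id
    // was never written, i.e. the accessor is dangling.
    store := map[string]string{
        "accessor-1": "hmac-1",
        "accessor-2": "", // dangling
        "accessor-3": "hmac-3",
    }
    secretIDLocks := []*sync.Mutex{{}, {}}

    // 1. Build the accessor map from a listing taken up front.
    accessorMap := make(map[string]struct{})
    for accessor := range store {
        accessorMap[accessor] = struct{}{}
    }

    // 2. Barrier: wait on each secret-id lock and release it immediately,
    //    so any in-flight accessor/secret-id write pair has completed.
    for _, lock := range secretIDLocks {
        lock.Lock()
        lock.Unlock()
    }

    // 3. Walk the secret-ids once; every accessor that has a secret-id is
    //    removed from the map (the stand-in for secretIDCleanupFunc).
    for accessor, secretIDHMAC := range store {
        if secretIDHMAC != "" {
            delete(accessorMap, accessor)
        }
    }

    // 4. Whatever remains is dangling and can be deleted without
    //    re-listing the secret-ids per accessor.
    for accessor := range accessorMap {
        delete(store, accessor)
        fmt.Println("deleted dangling accessor:", accessor)
    }
}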

@Crazybus
Contributor Author

Closing this one out since a much better implementation has been written and merged in #8418

@Crazybus Crazybus closed this May 14, 2020
@Crazybus Crazybus deleted the dangling_dangler branch May 14, 2020 12:39
Successfully merging this pull request may close these issues:

Approle logins timeout while vault is tidying up dangling accessors (#6710)