Optimize tidying up dangling secret-id accessors #6252

Closed

Conversation

Contributor

@Crazybus Crazybus commented Feb 18, 2019

Previous behavior was to do the following:

  1. Get a lock on all secret ids
  2. For each dangling accessor list all of the secret ids in all of the
    approles

During this process a write lock is held on the approles, which prevents any
authentication from happening. For short periods of time this can go
unnoticed; however, there are situations where it causes a lot of
service disruption (full details at the end).

New behavior:

  1. Wait for a lock on all of the secret ids to make sure that all
    secret-ids have been written for the list of accessors we created
    earlier.
  2. Get the full list of secret id hmacs only once and re-use it for all
    dangling accessors.

The previous comment explained that the locking is only present to avoid a race
condition between an accessor being written and its secret-id not existing yet.
If there is another reason why a lock is necessary while getting the
list of secret ids, I can easily update this so that listing all secret
ids happens while the locks are held. My understanding is that just
waiting to acquire each lock and then releasing it is enough to guarantee
that we don't hit the accessor-exists-but-no-secret-id-yet race
condition.
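
To make that concrete, here is a minimal, self-contained sketch of the wait-then-release barrier (a plain slice of mutexes stands in for b.secretIDLocks; this is an illustration, not the actual Vault code):

package main

import (
    "fmt"
    "sync"
)

func main() {
    // Stand-in for b.secretIDLocks: mutexes that writers hold while
    // persisting an accessor together with its secret-id.
    secretIDLocks := make([]*sync.Mutex, 256)
    for i := range secretIDLocks {
        secretIDLocks[i] = &sync.Mutex{}
    }

    // Acquiring and immediately releasing every lock acts as a barrier:
    // any write that was already holding one of the locks when we try to
    // acquire it must have completed by the time the loop finishes, so
    // every accessor listed earlier now has its secret-id written too.
    for _, lock := range secretIDLocks {
        lock.Lock()
        lock.Unlock()
    }

    fmt.Println("all in-flight secret-id writes have completed")
}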

Why am I making these changes and what happened?

We have lots of approles, with lots of secret-ids and lots of
accessors. An approle was removed which contained ~8000 dangling
accessors. Once that approle was removed we suddenly saw that it was no
longer possible to authenticate with approle authentication. After
much troubleshooting we enabled trace logging and saw that it had to do
with dangling accessors, but we couldn't tell how many there were or how
long they would take to clean up.

After adding some extra trace logging I discovered that there were
~8000 of them, and that because of the number of other approles and
secret-ids it was taking around 2 minutes per dangling accessor to clean
up. With the previous logic this would have taken at least 12 days while
holding a lock on approle authentication the entire time.

Part of the slowness is no doubt caused by the fact that we use GCS as
the storage backend, and I'm sure such a large number of approles and
secret ids is neither best practice nor expected. However, this Vault
instance is run as a service for other teams, so we don't have full
control over how they use it, and Vault happily let them create that
many approles and secret ids.

Closes #6710


for _, lock := range b.secretIDLocks {
    lock.Lock()
-   defer lock.Unlock()
+   lock.Unlock()
Contributor

Is this in the right place? It seems strange to me that we would grab a lock and then unlock it, plus below there's a comment that we're doing stuff now that we waited for the lock. Usually locks are held for the duration of time when we're doing work that might otherwise lead to a race condition.

Contributor Author

I can't say for certain if this is the right place. But for the current implementation I believe this does achieve the same goal with way less downtime.

> so if we have the lock we know we're not in a wrote-accessor-but-not-yet-secret case, which can be racy.

So this means we want to make sure that nothing is still in the middle of writing a secret-id between these two calls.

First a list of accessors is created. It is possible that an accessor exists but the secretID has not been written yet.

accessorHashes, err := s.List(ctx, accessorIDPrefixToUse)

Then a list of secretIDs is made:

secretIDHMACs, err := s.List(ctx, fmt.Sprintf("%s%s", secretIDPrefixToUse, roleNameHMAC))

In the scenario where the secretID does not exist yet, this accessor will not be removed from the accessor map and will be considered dangling.

By waiting for a lock here:

for _, lock := range b.secretIDLocks {
    lock.Lock()
    lock.Unlock()
}
// Now that we waited for a lock get a fresh list of all of the secretIDHMACs
// so that we can be sure if the accessor is dangling or not
allSecretIDHMACs := make([]string, 0)
for _, roleNameHMAC := range roleNameHMACs {
    secretIDHMACs, err := s.List(ctx, fmt.Sprintf("%s%s", secretIDPrefixToUse, roleNameHMAC))
    if err != nil {
        return err
    }
    for _, v := range secretIDHMACs {
        allSecretIDHMACs = append(allSecretIDHMACs, v)
    }
}

we can be sure that all secretIDs have been written since the list of accessors was created, which prevents the race condition from occurring.

After stepping through this I think I understand your comment better and realise that all of this locking could occur earlier without any need to do later "cleanups".

Waiting for the lock (and immediately releasing it) right before this loop:

for _, roleNameHMAC := range roleNameHMACs {
    logger.Trace("listing secret ID HMACs", "role_hmac", roleNameHMAC)
    secretIDHMACs, err := s.List(ctx, fmt.Sprintf("%s%s", secretIDPrefixToUse, roleNameHMAC))
    if err != nil {
        return err
    }
    for _, secretIDHMAC := range secretIDHMACs {
        err = secretIDCleanupFunc(secretIDHMAC, roleNameHMAC, secretIDPrefixToUse)
        if err != nil {
            return err
        }
    }
}

would achieve the same thing without needing the later cleanup for dangling accessors. By doing it there I believe the accessor could be removed immediately once it is detected as dangling, and that

if len(accessorMap) > 0 {

could be removed altogether, as the situation would no longer be possible.

I'll test this locally first and see if it still prevents the race condition.

Contributor Author

Yup, looks like it works just fine! I just pushed a new commit that waits for the lock before listing the secretIDHMACs. With the lock waiting commented out I can trigger the race condition within around 30 seconds; with the current code in place I have run the test for 40 minutes (4 runs in parallel for 10 minutes each) without hitting a single race condition.

// Ideally, locking on accessors should be performed here too
// but for that, accessors are required in plaintext, which are
// not available. The code above helps but it may still be
// racy.
Contributor

I'm not certain that we should change this code. It sounds like it's already racy. I'm not clear on where we might place the unlock for the mutex, but it seems like anything that holds the lock for a shorter amount of time would result in racier behavior. Perhaps I'm not thinking of it the right way and it'll be clearer once it's pushed.

Crazybus and others added 3 commits March 7, 2019 18:35
By doing the wait here we no longer need to re-list the secretIDs for
dangling accessors in order to guard against the race condition.
Previously this race condition test would fail around 10% of the time on
my own machine. By moving the delay to after listing the secretIDHMACs
I'm able to get it to fail 100% of the time when removing the
`lock.Lock() lock.Unlock()` code that has been added in this pull
request.
@Crazybus
Contributor Author

Crazybus commented Mar 8, 2019

By moving the fakeTidyDelay to somewhere after the secretIDHMACs are listed I was able to remove the flakiness of the race condition test. I'm now able to get it to fail 100% of the time when removing the lock.Lock() / lock.Unlock() code that I added in this pull request. And I'm also able to confirm that it passes every time when running the tests in 10 parallel terminals for a long period of time with:

go test -count=100 -failfast -run TestAppRole_TidyDanglingAccessors_RaceTest github.com/hashicorp/vault/builtin/credential/approle

@Crazybus
Contributor Author

@tyrannosaurus-becks are you able to take another look at my recent changes?

// where the accessor has been written but not yet the secretIDHMAC
for _, lock := range b.secretIDLocks {
    lock.Lock()
    lock.Unlock()
Contributor

Hmm, this still doesn't seem right to me. I've never seen locks grabbed without doing some work before releasing them. I suspect that in a racy production environment this would not prevent races, though it would slow the code execution.

Contributor Author

> I suspect that in a racy production environment this would not prevent races, though it would slow the code execution.

Are you referring to the race condition I described in #6252 (comment)? Or is there another one that you can see that could potentially cause issues?

@jefferai are you able to chime in? I don't think either of us is 100% sure whether my proposal here is safe, and you are the original author of the dangling accessor cleanup code.

@tyrannosaurus-becks
Contributor

Hi @Crazybus! Today I took a closer look at the code and what it's currently doing, and I have some questions.

First of all, there's a lot happening in this code with locks and cleaning things up. I think it would be great if this PR started off with a test that reproduces the long locking behavior that you're seeing in production, so we can also see how this fix changes that test's behavior.

From there, let's suppose we're in a situation where there are 8,000 dangling accessors to clean up and it'll take 12 continuous days to clean up given the current code, which will block authentication. Let's also suppose that the current approach to protecting against raciness is the maximum amount of risk we want on raciness.

Would the desired behavior be that instead of blocking auth for 12 days, accessors get cleaned up in smaller batches? Batches of a size that doesn't block authentication? Also, if we were batching these, would we want that to happen for all users, or just for users with this need? How would we differentiate between the two?

What I'm leading towards here is the idea that perhaps we should consider adding a batch_size parameter to the tidy call that would give pauses for authentication. In just looking at the code, I'm not certain where exactly that should be inserted, but I think with having a test to guide us it would help a lot.
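
For illustration only, a rough sketch of what such a knob might look like — none of these names exist in the tidy code today, they are purely hypothetical placeholders for the idea:

package main

import (
    "fmt"
    "time"
)

// tidyInBatches is a hypothetical helper: clean up accessors in batches of
// batchSize, pausing between batches so authentication requests get a chance
// to acquire the locks that cleanup would otherwise hold continuously.
func tidyInBatches(accessors []string, batchSize int, pause time.Duration, cleanup func(string) error) error {
    for start := 0; start < len(accessors); start += batchSize {
        end := start + batchSize
        if end > len(accessors) {
            end = len(accessors)
        }
        for _, accessor := range accessors[start:end] {
            if err := cleanup(accessor); err != nil {
                return err
            }
        }
        // Pause with no locks held so logins can proceed.
        time.Sleep(pause)
    }
    return nil
}

func main() {
    accessors := []string{"a1", "a2", "a3", "a4", "a5"}
    _ = tidyInBatches(accessors, 2, 10*time.Millisecond, func(a string) error {
        fmt.Println("tidied", a)
        return nil
    })
}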

Let me know what you think. Thanks!

This test adds a delay to simulate the situation where cleaning up lots
of dangling accessors blocks new secret-ids from being created.
@Crazybus
Contributor Author

> First of all, there's a lot happening in this code with locks and cleaning things up. I think it would be great if this PR started off with a test that reproduces the long locking behavior that you're seeing in production, so we can also see how this fix changes that test's behavior.

Added a test in 9708c3a. To test that it fails properly with current master, the testing delay needs to be added at the linked line.

> Let's also suppose that the current approach to protecting against raciness is the maximum amount of risk we want on raciness.

Is there another race condition that you can see which is not covered by the current race condition test? As far as I can tell the race condition that needs to be avoided is the situation where the accessor has been created but not yet the secretIDHMAC. Waiting for all write operations to finish guarantees that all accessors which existed have now also had their secretIDs created. I couldn't get the race condition test to fail even once when running it in parallel for a long time. If there is another one that you are aware of I'm happy to add a new test to make sure that it does not get hit.

> Would the desired behavior be that instead of blocking auth for 12 days, accessors get cleaned up in smaller batches? Batches of a size that doesn't block authentication? Also, if we were batching these, would we want that to happen for all users, or just for users with this need? How would we differentiate between the two?

> What I'm leading towards here is the idea that perhaps we should consider adding a batch_size parameter to the tidy call that would give pauses for authentication. In just looking at the code, I'm not certain where exactly that should be inserted, but I think with having a test to guide us it would help a lot.

I don't see how this would help. If it takes more than 60 seconds (the default vault client timeout) to list the secretIDHMACs in the `for _, roleNameHMAC := range roleNameHMACs` loop, then even with a batch size of 1 this is still going to cause failed logins.

@tyrannosaurus-becks
Contributor

Hi @Crazybus!

I chatted with one of the primary authors of this code, and we don't want to proceed with this particular approach for this fix. As for what approach, it probably doesn't make sense for us to continue discussing the approach when we, the Vault folks, don't really have as sound a grasp of the issue as you do. I think we should get there first before we continue to dive into this code.

It does seem strange that it would take 12 days to clean up 8000 dangling accessors, and we're not certain yet from the issue description that Vault is the source of the problem. There could be other contributing factors, like the enclosing environment having gone into swap.

Would you be willing to instead file a bug report with steps to reproduce what you're seeing? If Vault is the problem, we'd like to take a closer look and come up with a fix. Steps to reproduce would include what Vault version you're on, how to recreate an environment like what you're seeing, maybe even pertinent information around what it's running on, etc.

I'm going to go ahead and close this PR for now, until we can get our heads better around this. Thanks again for working on this!

@Crazybus
Contributor Author

Crazybus commented May 9, 2019

> Would you be willing to instead file a bug report with steps to reproduce what you're seeing?

Done! #6710

Crazybus added a commit to Crazybus/vault that referenced this pull request Jul 3, 2019
This is a less aggressive follow-up to hashicorp#6252 which helps to mitigate the
total downtime during the dangling accessor cleanup described in hashicorp#6710.
It changes the downtime from `(time_to_list_all_accessors *
dangling_accessors_count) + time_to_delete_dangling_accessors` to just
`time_to_list_all_accessors + time_to_delete_dangling_accessors`. In our
situation, where we had 8000 dangling accessors and it took 2 minutes to
list all secret-ids, this means going from ~12 days of downtime to
`2 minutes + (time_to_delete_accessor * 8000)`, which would be around 15
minutes in total.

This change gets the list of secret-id HMACs once instead of getting the
exact same list for each dangling accessor. It only needs to be done once
at the start because the secret-id backends all have write locks on them,
making it impossible for them to be changed during the cleanup process.
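
As a rough back-of-the-envelope check of those numbers (the per-accessor delete time below is an assumption, picked only to match the ~15 minute estimate):

package main

import (
    "fmt"
    "time"
)

func main() {
    const danglingAccessors = 8000
    listAllSecretIDs := 2 * time.Minute         // observed listing cost on this GCS-backed cluster
    deletePerAccessor := 100 * time.Millisecond // assumed per-accessor delete cost

    // Old behaviour: the full secret-id listing is repeated for every dangling accessor.
    oldDowntime := listAllSecretIDs*danglingAccessors + deletePerAccessor*danglingAccessors

    // New behaviour: the listing happens once, then each dangling accessor is deleted.
    newDowntime := listAllSecretIDs + deletePerAccessor*danglingAccessors

    fmt.Println("old downtime:", oldDowntime) // on the order of the ~12 days quoted above
    fmt.Println("new downtime:", newDowntime) // around 15 minutes
}
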
@jefferai jefferai reopened this Jul 3, 2019
@Crazybus
Contributor Author

Crazybus commented Jul 4, 2019

Continuing the discussion with @jefferai from #7052 here:

> I believe there are issues with the previous PR but it seems to address the race condition. Probably that's the right place to start. I'll close this and reopen that, can you true up the code between the two and poke when you think it's ready for a look?

> One thing in particular that I didn't follow as I looked at the other PR right now is that once you have done the locking/unlocking you then go through the accessor map and unilaterally delete anything found there -- I would think you'd need to revalidate whether or not the entries there are actually dangling.

My short answer is that there is no need to double-check whether the accessor is dangling: because the locking is now done before secretIDCleanupFunc is called, we know that the roleNameHMACs and secretIDHMACs already cover all accessors. During secretIDCleanupFunc the accessor is either valid and removed from the accessor map, or an error is encountered which causes the code to return. I could be totally off here, so if there is something I'm missing I would love to add a test for the scenario where these assumptions don't hold.

Here is the slightly longer version comparing the two approaches with some pseudo code.

Current approach:

list roleNameHMACs
list accessors
create accessorMap with all accessors
run secretIDCleanupFunc looping over roleNameHMACs and SecretIDHMACs
loop over dangling accessors
    lock all secretids
    get a fresh list of roleNameHMACs and SecretIDHMACs to avoid the race condition
    delete it if it is still really dangling; if a matching secret-id is found, don't delete it

Approach in this PR:

list roleNameHMACs
list accessors
create accessorMap with all accessors
    wait for a lock and then immediately unlock each secretIDLock. This makes sure that all write operations to the secretID backend have finished, so we don't need to double-check later for dangling accessors.
run secretIDCleanupFunc looping over roleNameHMACs and SecretIDHMACs
loop over dangling accessors
    delete them all since we already know that they are dangling
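
The same flow as a self-contained toy in Go (everything here is a simplified stand-in for illustration, not the real storage or backend types):

package main

import (
    "fmt"
    "sync"
)

func main() {
    // Toy model: accessor -> secret-id HMAC, where "" means the secret-id
    // was never written, i.e. the accessor is dangling.
    store := map[string]string{
        "accessor-1": "hmac-1",
        "accessor-2": "", // dangling
        "accessor-3": "hmac-3",
    }
    secretIDLocks := []*sync.Mutex{{}, {}}

    // 1. Build the accessor map from a listing taken up front.
    accessorMap := make(map[string]struct{})
    for accessor := range store {
        accessorMap[accessor] = struct{}{}
    }

    // 2. Barrier: wait on each secret-id lock and release it immediately,
    //    so any in-flight accessor/secret-id write pair has completed.
    for _, lock := range secretIDLocks {
        lock.Lock()
        lock.Unlock()
    }

    // 3. Walk the secret-ids once; every accessor that has a secret-id is
    //    removed from the map (the stand-in for secretIDCleanupFunc).
    for accessor, secretIDHMAC := range store {
        if secretIDHMAC != "" {
            delete(accessorMap, accessor)
        }
    }

    // 4. Whatever remains is dangling and can be deleted without
    //    re-listing the secret-ids per accessor.
    for accessor := range accessorMap {
        delete(store, accessor)
        fmt.Println("deleted dangling accessor:", accessor)
    }
}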

@Crazybus
Contributor Author

Closing this one out since a much better implementation has been written and merged in #8418

@Crazybus Crazybus closed this May 14, 2020
@Crazybus Crazybus deleted the dangling_dangler branch May 14, 2020 12:39
Successfully merging this pull request may close these issues:

Approle logins timeout while vault is tidying up dangling accessors (#6710)