Reduce the need to reload roles from index #33205

jaymode · 2018-08-28T14:38:26Z

If the roles.yml file is changed, the Security code takes the heavy handed approach of invalidating all cache entries rather than those that use changed roles. This should be a rare occurrence but in a busy cluster this is problematic since any roles stored in a index will need to be reloaded, which means a search (or get if only one name is requested) needs to be executed and shares the same threadpool/queue as other searches. If the search is rejected, then we use our negative lookup cache to prevent this role from being queried for again. This leads to poor behavior for the user as users do not have the expected permissions.

We should look at how we can improve this behavior through the need to reduce which roles get reloaded from an index. This may mean a change in our caching or how we deal with changes to the roles.yml and reloading.

elasticmachine · 2018-08-28T14:38:30Z

Pinging @elastic/es-security

jaymode · 2018-09-06T08:45:57Z

Another item to consider is to limit role combination loading to one per unique role combination. This would be similar to what was done for concurrent authentications of a user.

In order to optimize the use of the role cache, when the roles.yml file is reloaded we now calculate the names of removed, changed, and added roles so that they may be passed to any listeners. This allows a listener to selectively clear cache for only the roles that have been modified. The CompositeRolesStore has been adapted to do exactly that so that we limit the need to reload roles from sources such as the native roles stores or external role providers. See elastic#33205

The native roles store previously would issue a search if attempting to retrieve more than a single role. This is fine when we are attempting to retrieve all of the roles to list them in the api, but could cause issues when attempting to find roles for a user. The search is not prioritized over other search requests, so heavy aggregations/searches or overloaded nodes could cause roles to be cached as missing if the search is rejected. When attempting to load specific roles, we know the document id for the role that we are trying to load, which allows us to use the multi-get api for loading these roles. This change makes use of the multi-get api when attempting to load more than one role by name. This api is also constrained by a threadpool but the tasks in the GET threadpool should be quicker than searches. See elastic#33205

Today all searches happen on the search threadpool which is the correct behavior in almost any case. Yet, there are exceptions where for instance searches are required to succeed ie. in the case of a .security index. These searches should not be rejected if a node is under load. There are other more specialized usecases were searches should be passed through a single-thread threadpool to reduce impact on a node. Relates to elastic#33205

In order to optimize the use of the role cache, when the roles.yml file is reloaded we now calculate the names of removed, changed, and added roles so that they may be passed to any listeners. This allows a listener to selectively clear cache for only the roles that have been modified. The CompositeRolesStore has been adapted to do exactly that so that we limit the need to reload roles from sources such as the native roles stores or external role providers. See #33205

jasontedor · 2018-09-27T01:05:16Z

I have been thinking about this issue on and off since @s1monw opened the pull request to enable to route these requests to the generic thread pool. That led to a discussion about how we should avoid that because that thread pool is large and unbounded, and we do not want to be shuttling tasks from user-initiated actions to an unbounded thread pool. We reversed that portion of the change and I had proposed that we force put these requests on to the search thread pool, so that they can not be rejected. When @s1monw pinged me this morning that that is basically the same as allowing these to grow unbounded again, that triggered me to take another look at the underlying issue.

If the search is rejected, then we use our negative lookup cache to prevent this role from being queried for again.

I think this is the key point that we need to revisit. I do not think we should be entering role lookups that fail (not fail negatively, but fail exceptionally) into the negative lookup cache. Rather, an exceptional failure should be an indication that we do not know the result of the role lookup. So it's neither a positive hit, nor a negative hit, but we simply do not know and we should not be caching this result. In particular, rejections of role lookups on the search thread pool should cause the initial request to fail, we can notify the client who can retry the request.

Separately, I think that we can consider migrating these role lookups to a separate thread pool but still with a bounded queue so that they are not competing with search requests.

What do you think @jaymode and @s1monw?

jaymode · 2018-09-27T04:00:15Z

I do not think we should be entering role lookups that fail (not fail negatively, but fail exceptionally) into the negative lookup cache.

I am already working on this change, so we are on the same page regarding this aspect. @tvernum and @albertzaharovits also agree #33531 (comment).

Separately, I think that we can consider migrating these role lookups to a separate thread pool but still with a bounded queue so that they are not competing with search requests.

I think that this will be helpful. I did not have in mind that the search on a different threadpool change would be the ultimate solution to the problem but it does help avoid security searches being stuck behind large aggregations. Rejections will still be possible so the handling definitely needs to be improved around that. We could also consider adding a built-in retry with exponential backoff, but I haven't given too much thought to that yet.

jasontedor · 2018-09-27T04:10:35Z

We could also consider adding a built-in retry with exponential backoff, but I haven't given too much thought to that yet.

It is worth considering. Yet, for a first version I do not think that it is necessary, we can see how large of a problem it is in practice first after we make these initial changes before we commit to something like that.

Security caches the result of role lookups and negative lookups are cached indefinitely. In the case of transient failures this leads to a bad experience as the roles could truly exist. The CompositeRolesStore needs to know if a failure occurred in one of the roles stores in order to make the appropriate decision as it relates to caching. In order to provide this information to the CompositeRolesStore, the return type of methods to retrieve roles has changed to a new class, RoleRetrievalResult. This class provides the ability to pass back an exception to the roles store. This exception does not mean that a request should be failed but instead serves as a signal to the roles store that missing roles should not be cached and neither should the combined role if there are missing roles. As part of this, the negative lookup cache was also changed from an unbounded cache to a cache with a configurable limit. Relates elastic#33205

The security native stores follow a pattern where `SecurityIndexManager#prepareIndexIfNeededThenExecute` wraps most calls made for the security index. The reasoning behind this was to check if the security index had been upgraded to the latest version in a consistent manner. However, this has the potential side effect that a read will trigger the creation of the security index or an updating of its mappings, which can lead to issues such as failures due to put mapping requests timing out even though we might have been able to read from the index and get the data necessary. This change introduces a new method, `checkIndexVersionThenExecute`, that provides the consistent checking of the security index to make sure it has been upgraded. That is the only check that this method performs prior to running the passed in operation, which removes the possible triggering of index creation and mapping updates for reads. Additionally, areas where we do reads now check the availability of the security index and can short circuit requests. Availability in this context means that the index exists and all primaries are active. Relates elastic#33205

Security caches the result of role lookups and negative lookups are cached indefinitely. In the case of transient failures this leads to a bad experience as the roles could truly exist. The CompositeRolesStore needs to know if a failure occurred in one of the roles stores in order to make the appropriate decision as it relates to caching. In order to provide this information to the CompositeRolesStore, the return type of methods to retrieve roles has changed to a new class, RoleRetrievalResult. This class provides the ability to pass back an exception to the roles store. This exception does not mean that a request should be failed but instead serves as a signal to the roles store that missing roles should not be cached and neither should the combined role if there are missing roles. As part of this, the negative lookup cache was also changed from an unbounded cache to a cache with a configurable limit. Relates #33205

The security native stores follow a pattern where `SecurityIndexManager#prepareIndexIfNeededThenExecute` wraps most calls made for the security index. The reasoning behind this was to check if the security index had been upgraded to the latest version in a consistent manner. However, this has the potential side effect that a read will trigger the creation of the security index or an updating of its mappings, which can lead to issues such as failures due to put mapping requests timing out even though we might have been able to read from the index and get the data necessary. This change introduces a new method, `checkIndexVersionThenExecute`, that provides the consistent checking of the security index to make sure it has been upgraded. That is the only check that this method performs prior to running the passed in operation, which removes the possible triggering of index creation and mapping updates for reads. Additionally, areas where we do reads now check the availability of the security index and can short circuit requests. Availability in this context means that the index exists and all primaries are active. Relates #33205

The security native stores follow a pattern where `SecurityIndexManager#prepareIndexIfNeededThenExecute` wraps most calls made for the security index. The reasoning behind this was to check if the security index had been upgraded to the latest version in a consistent manner. However, this has the potential side effect that a read will trigger the creation of the security index or an updating of its mappings, which can lead to issues such as failures due to put mapping requests timing out even though we might have been able to read from the index and get the data necessary. This change introduces a new method, `checkIndexVersionThenExecute`, that provides the consistent checking of the security index to make sure it has been upgraded. That is the only check that this method performs prior to running the passed in operation, which removes the possible triggering of index creation and mapping updates for reads. Additionally, areas where we do reads now check the availability of the security index and can short circuit requests. Availability in this context means that the index exists and all primaries are active. This is the fixed version of elastic#34246, which was reverted. Relates elastic#33205

The security native stores follow a pattern where `SecurityIndexManager#prepareIndexIfNeededThenExecute` wraps most calls made for the security index. The reasoning behind this was to check if the security index had been upgraded to the latest version in a consistent manner. However, this has the potential side effect that a read will trigger the creation of the security index or an updating of its mappings, which can lead to issues such as failures due to put mapping requests timing out even though we might have been able to read from the index and get the data necessary. This change introduces a new method, `checkIndexVersionThenExecute`, that provides the consistent checking of the security index to make sure it has been upgraded. That is the only check that this method performs prior to running the passed in operation, which removes the possible triggering of index creation and mapping updates for reads. Additionally, areas where we do reads now check the availability of the security index and can short circuit requests. Availability in this context means that the index exists and all primaries are active. This is the fixed version of #34246, which was reverted. Relates #33205

In order to optimize the use of the role cache, when the roles.yml file is reloaded we now calculate the names of removed, changed, and added roles so that they may be passed to any listeners. This allows a listener to selectively clear cache for only the roles that have been modified. The CompositeRolesStore has been adapted to do exactly that so that we limit the need to reload roles from sources such as the native roles stores or external role providers. See #33205

Security caches the result of role lookups and negative lookups are cached indefinitely. In the case of transient failures this leads to a bad experience as the roles could truly exist. The CompositeRolesStore needs to know if a failure occurred in one of the roles stores in order to make the appropriate decision as it relates to caching. In order to provide this information to the CompositeRolesStore, the return type of methods to retrieve roles has changed to a new class, RoleRetrievalResult. This class provides the ability to pass back an exception to the roles store. This exception does not mean that a request should be failed but instead serves as a signal to the roles store that missing roles should not be cached and neither should the combined role if there are missing roles. As part of this, the negative lookup cache was also changed from an unbounded cache to a cache with a configurable limit. Relates #33205

The security native stores follow a pattern where `SecurityIndexManager#prepareIndexIfNeededThenExecute` wraps most calls made for the security index. The reasoning behind this was to check if the security index had been upgraded to the latest version in a consistent manner. However, this has the potential side effect that a read will trigger the creation of the security index or an updating of its mappings, which can lead to issues such as failures due to put mapping requests timing out even though we might have been able to read from the index and get the data necessary. This change introduces a new method, `checkIndexVersionThenExecute`, that provides the consistent checking of the security index to make sure it has been upgraded. That is the only check that this method performs prior to running the passed in operation, which removes the possible triggering of index creation and mapping updates for reads. Additionally, areas where we do reads now check the availability of the security index and can short circuit requests. Availability in this context means that the index exists and all primaries are active. Relates #33205

The security native stores follow a pattern where `SecurityIndexManager#prepareIndexIfNeededThenExecute` wraps most calls made for the security index. The reasoning behind this was to check if the security index had been upgraded to the latest version in a consistent manner. However, this has the potential side effect that a read will trigger the creation of the security index or an updating of its mappings, which can lead to issues such as failures due to put mapping requests timing out even though we might have been able to read from the index and get the data necessary. This change introduces a new method, `checkIndexVersionThenExecute`, that provides the consistent checking of the security index to make sure it has been upgraded. That is the only check that this method performs prior to running the passed in operation, which removes the possible triggering of index creation and mapping updates for reads. Additionally, areas where we do reads now check the availability of the security index and can short circuit requests. Availability in this context means that the index exists and all primaries are active. This is the fixed version of #34246, which was reverted. Relates #33205

The native roles store previously would issue a search if attempting to retrieve more than a single role. This is fine when we are attempting to retrieve all of the roles to list them in the api, but could cause issues when attempting to find roles for a user. The search is not prioritized over other search requests, so heavy aggregations/searches or overloaded nodes could cause roles to be cached as missing if the search is rejected. When attempting to load specific roles, we know the document id for the role that we are trying to load, which allows us to use the multi-get api for loading these roles. This change makes use of the multi-get api when attempting to load more than one role by name. This api is also constrained by a threadpool but the tasks in the GET threadpool should be quicker than searches. See #33205

jaymode · 2019-01-15T21:47:58Z

I think we've made good progress here with improved logic when we encounter an error (#34197), not attempting to create index or put mapping on reads (#34246), switching to mget for roles instead of search (#33531), and calculating changed roles on file reloads (#33525). I'm going to close this issue and we can open another if there are more issues.

jaymode added >enhancement :Security/Authorization Roles, Privileges, DLS/FLS, RBAC/ABAC labels Aug 28, 2018

jaymode self-assigned this Aug 28, 2018

jaymode mentioned this issue Sep 7, 2018

Calculate changed roles on roles.yml reload #33525

Merged

jaymode mentioned this issue Sep 7, 2018

Native roles store uses mget to retrieve roles #33531

Merged

s1monw self-assigned this Sep 13, 2018

s1monw mentioned this issue Sep 15, 2018

Introduce a search_throttled threadpool #33732

Merged

jaymode mentioned this issue Oct 1, 2018

Use RoleRetrievalResult for better caching #34197

Merged

jaymode mentioned this issue Oct 2, 2018

Security: don't call prepare index for reads #34246

Merged

jaymode mentioned this issue Oct 17, 2018

Security: don't call prepare index for reads #34568

Merged

jaymode unassigned s1monw Oct 25, 2018

jaymode closed this as completed Jan 15, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reduce the need to reload roles from index #33205

Reduce the need to reload roles from index #33205

jaymode commented Aug 28, 2018

elasticmachine commented Aug 28, 2018

jaymode commented Sep 6, 2018

jasontedor commented Sep 27, 2018 •

edited

Loading

jaymode commented Sep 27, 2018

jasontedor commented Sep 27, 2018

jaymode commented Jan 15, 2019

Reduce the need to reload roles from index #33205

Reduce the need to reload roles from index #33205

Comments

jaymode commented Aug 28, 2018

elasticmachine commented Aug 28, 2018

jaymode commented Sep 6, 2018

jasontedor commented Sep 27, 2018 • edited Loading

jaymode commented Sep 27, 2018

jasontedor commented Sep 27, 2018

jaymode commented Jan 15, 2019

jasontedor commented Sep 27, 2018 •

edited

Loading