-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Return 429 status code when there's a read_only cluster block #50166
Return 429 status code when there's a read_only cluster block #50166
Conversation
Pinging @elastic/es-distributed (:Distributed/CRUD) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for picking this up @gaobinlong. In the issue #49393 I did say:
Similarly, the status codes of other cluster blocks should be reexamined in this context.
However, I think this change took it too far. For example, if a write block is placed on an index manually by a user to prevent any further writes to the index, is that really a retryable situation? I would argue not.
Can you take another stab at thinking through which other blocks should return 429?
e93f73d
to
d114f31
Compare
@jasontedor , I have changed the code and only modify the status code to 429 when there is a read_only cluster block:
These cluster blocks have METADATA_WRITE and WRITE block level, I think they can return 429 so user can retry the write operations avoiding data loss. |
I’m on vacation, can someone on @elastic/es-distributed review this while I’m out? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the extra work on this @gaobinlong . First of all, please do not force-push to PRs, it makes it hard to review the changes made since last time and comments loose their meaning when we cannot see what the original version of the code were.
I think that of the 4 blocks modified here, we should only return TOO_MANY_REQUESTS
for the one block where the disk threshold monitor clears the block automatically (i.e., the index level index.blocks.read_only_allow_delete
). I worry that the other cases could have legitimate user-facing use cases (for instance manually marking an index read-only) that would break by this change.
Will you look into amending the PR (no force push, please)?
@henningandersen Thanks for your comment and advise. I have pushed a new commit and only return TOO_MANY_REQUESTS for index level read_only_allow_delete block. Can you help to review the change? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, looks good. I left two smaller things to address. I will start a CI job for this too.
test/framework/src/main/java/org/elasticsearch/test/hamcrest/ElasticsearchAssertions.java
Outdated
Show resolved
Hide resolved
test/framework/src/main/java/org/elasticsearch/test/hamcrest/ElasticsearchAssertions.java
Outdated
Show resolved
Hide resolved
@elasticmachine test this please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @gaobinlong . I added a few more comments, AFAICS, the new logic to prefer 403 over 429 will not work, please have a look.
test/framework/src/main/java/org/elasticsearch/test/hamcrest/ElasticsearchAssertions.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/cluster/block/ClusterBlockException.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/cluster/block/ClusterBlockException.java
Outdated
Show resolved
Hide resolved
Hi @henningandersen, can I take you some time to help to review the new commit I have pushed? Thanks a lot. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @gaobinlong for the extra iteration. I left a few smaller comments, but otherwise it is looking good. I will kick off a test run too.
server/src/main/java/org/elasticsearch/cluster/block/ClusterBlockException.java
Outdated
Show resolved
Hide resolved
test/framework/src/test/java/org/elasticsearch/test/hamcrest/ElasticsearchAssertionsTests.java
Outdated
Show resolved
Hide resolved
test/framework/src/test/java/org/elasticsearch/test/hamcrest/ElasticsearchAssertionsTests.java
Outdated
Show resolved
Hide resolved
@elasticmachine test this please |
@gaobinlong the test failures looks like they could be caused by incompatibilities between this branch and latest changes in master/7.x. Will you merge in master changes into this branch? |
Hi @henningandersen, I have merged master changes into the branch and optimized some code following your advice. |
@elasticmachine test this please |
@gaobinlong , nearly there but unfortunately the branch is now again outdated with master/7.x causing the build to fail. I can merge in the changes unless you prefer to do so yourself? |
@henningandersen, thanks a lot if you can help to merge in master changes. |
@elasticmachine update branch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch, @gaobinlong . I added a couple of comments. Thanks again for your efforts.
test/framework/src/main/java/org/elasticsearch/test/ESIntegTestCase.java
Outdated
Show resolved
Hide resolved
modules/reindex/src/test/java/org/elasticsearch/index/reindex/DeleteByQueryBasicTests.java
Outdated
Show resolved
Hide resolved
modules/reindex/src/test/java/org/elasticsearch/index/reindex/DeleteByQueryBasicTests.java
Show resolved
Hide resolved
@henningandersen, the new commit I have pushed contains some changes following your comment. The difference is that instead of updating the setting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay here. I have one comment left and then I think we are good.
modules/reindex/src/test/java/org/elasticsearch/index/reindex/DeleteByQueryBasicTests.java
Outdated
Show resolved
Hide resolved
@henningandersen,I have pushed a new commit which delay refreshing cluster info using |
@elasticmachine test this please |
@gaobinlong there are a few checkstyle issues reported by the build, see: Notice that you can run:
to check all precommit checks before pushing. Would you mind looking into those? |
@henningandersen, I have done that and merged in the master changes into the branch. |
@elasticmachine test this please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for the additional iterations @gaobinlong .
…#50166) We consider index level read_only_allow_delete blocks temporary since the DiskThresholdMonitor can automatically release those when an index is no longer allocated on nodes above high threshold. The rest status has therefore been changed to 429 when encountering this index block to signal retryability to clients. Related to elastic#49393
We consider index level read_only_allow_delete blocks temporary since the DiskThresholdMonitor can automatically release those when an index is no longer allocated on nodes above high threshold. The rest status has therefore been changed to 429 when encountering this index block to signal retryability to clients. Related to #49393
Users are perennially confused by the message they get when writing to an index is blocked due to excessive disk usage: TOO_MANY_REQUESTS/12/index read-only / allow delete (api) Of course this is technically accurate but it is hard to join the dots from this message to "your disk was too full" without some searching of forums and documentation. Additionally in elastic#50166 we changed the status code to today's `429` from the previous `403` which changed the message from the one that's widely documented elsewhere: FORBIDDEN/12/index read-only / allow delete (api) Since elastic#42559 we've considered this block to be under the sole control of the disk-based shard allocator, and we have seen no evidence to suggest that anyone is applying this block manually. Therefore this commit adjusts this block's message to indicate that it's caused by a lack of disk space.
Users are perennially confused by the message they get when writing to an index is blocked due to excessive disk usage: TOO_MANY_REQUESTS/12/index read-only / allow delete (api) Of course this is technically accurate but it is hard to join the dots from this message to "your disk was too full" without some searching of forums and documentation. Additionally in #50166 we changed the status code to today's `429` from the previous `403` which changed the message from the one that's widely documented elsewhere: FORBIDDEN/12/index read-only / allow delete (api) Since #42559 we've considered this block to be under the sole control of the disk-based shard allocator, and we have seen no evidence to suggest that anyone is applying this block manually. Therefore this commit adjusts this block's message to indicate that it's caused by a lack of disk space.
Users are perennially confused by the message they get when writing to an index is blocked due to excessive disk usage: TOO_MANY_REQUESTS/12/index read-only / allow delete (api) Of course this is technically accurate but it is hard to join the dots from this message to "your disk was too full" without some searching of forums and documentation. Additionally in #50166 we changed the status code to today's `429` from the previous `403` which changed the message from the one that's widely documented elsewhere: FORBIDDEN/12/index read-only / allow delete (api) Since #42559 we've considered this block to be under the sole control of the disk-based shard allocator, and we have seen no evidence to suggest that anyone is applying this block manually. Therefore this commit adjusts this block's message to indicate that it's caused by a lack of disk space.
This PR is related to #49393.
The main point of the change is: