Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replication operation that try to perform the primary phase on a replica should be retried #17358

Merged
merged 1 commit into from
Mar 29, 2016

Conversation

bleskes
Copy link
Contributor

@bleskes bleskes commented Mar 27, 2016

In extreme cases a local primary shard can be replaced with a replica while a replication request is in flight and the primary action is applied to the shard (via acquirePrimaryOperationLock()). #17044 changed the exception used in that method to something that isn't recognized as TransportActions.isShardNotAvailableException, causing the operation to fail immediately instead of retrying. This commit fixes this by check the primary flag before
acquiring the lock. This is safe to do as an IndexShard will never be demoted once a primary.

Marking as non issue as this was never released.

Example failure: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-os-compatibility/os=centos/138

@@ -741,7 +741,11 @@ public String toString() {
QUERY_SHARD_EXCEPTION(org.elasticsearch.index.query.QueryShardException.class,
org.elasticsearch.index.query.QueryShardException::new, 141),
NO_LONGER_PRIMARY_SHARD_EXCEPTION(ShardStateAction.NoLongerPrimaryShardException.class,
ShardStateAction.NoLongerPrimaryShardException::new, 142);
ShardStateAction.NoLongerPrimaryShardException::new, 142),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the last thing we need is another exception. We should get rid of the instanceof checks in order to do retries and use ElasticsearchException#addHeader I think we can just use IllegalIndexShardStateException instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That’s an interesting idea and I agree more tailor made exceptions are a shame. I’ll look at that approach…

On 28 Mar 2016, at 20:48, Simon Willnauer notifications@github.com wrote:

In core/src/main/java/org/elasticsearch/ElasticsearchException.java:

@@ -741,7 +741,11 @@ public String toString() {
QUERY_SHARD_EXCEPTION(org.elasticsearch.index.query.QueryShardException.class,
org.elasticsearch.index.query.QueryShardException::new, 141),
NO_LONGER_PRIMARY_SHARD_EXCEPTION(ShardStateAction.NoLongerPrimaryShardException.class,

  •            ShardStateAction.NoLongerPrimaryShardException::new, 142);
    
  •        ShardStateAction.NoLongerPrimaryShardException::new, 142),
    

I think the last thing we need is another exception. We should get rid of the instanceof checks in order to do retries and use ElasticsearchException#addHeader I think we can just use IllegalIndexShardStateException instead?


You are receiving this because you authored the thread.
Reply to this email directly or view it on GitHub

@bleskes bleskes force-pushed the index_shard_exceptions branch 2 times, most recently from 0cbe6b7 to 0ecf7a4 Compare March 29, 2016 10:10
@bleskes
Copy link
Contributor Author

bleskes commented Mar 29, 2016

@s1monw I pushed another commit, removing the custom exceptions and relying a primary check in TRA, before acquiring an operation lock from the relevant IndexShard instance. This is OK since a primary shard will never be demoted

@@ -799,6 +800,9 @@ void finishBecauseUnavailable(ShardId shardId, String message) {
protected IndexShardReference getIndexShardReferenceOnPrimary(ShardId shardId) {
IndexService indexService = indicesService.indexServiceSafe(shardId.getIndex());
IndexShard indexShard = indexService.getShard(shardId.id());
if (indexShard.routingEntry().primary() == false) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you put a small comment here when and why this could happen?

@s1monw
Copy link
Contributor

s1monw commented Mar 29, 2016

left a minor regarding a comment - LGTM otherwise thanks!

setState(clusterService, stateWithActivePrimary(index, true, randomInt(3)));
logger.debug("--> using initial state:\n{}", clusterService.state().prettyPrint());
Request request = new Request(shardId);
boolean timeout = true; //randomBoolean();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has happened before (:wink:, rescue the randomBoolean invocation from the comment grave).

@jasontedor
Copy link
Member

Just a minor issue (the commented out randomization in the test), and two nits. Otherwise, LGTM.

@bleskes bleskes force-pushed the index_shard_exceptions branch from efb7836 to d03562f Compare March 29, 2016 15:11
…ica should be retried

In extreme cases a local primary shard can be replaced with a replica while a replication request is in flight and the primary action is applied to the shard (via `acquirePrimaryOperationLock()).  elastic#17044 changed the exception used in that method to something that isn't recognized as `TransportActions.isShardNotAvailableException`, causing the operation to fail immediately instead of retrying. This commit fixes this by check the primary flag before
acquiring the lock. This is safe to do as an IndexShard will never be demoted once a primary.

Closes elastic#17358
@bleskes bleskes force-pushed the index_shard_exceptions branch from d03562f to 48b4f08 Compare March 29, 2016 15:22
@bleskes bleskes merged commit 48b4f08 into elastic:master Mar 29, 2016
@bleskes bleskes deleted the index_shard_exceptions branch March 29, 2016 15:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants