Dynamic mapping updates are unboundedly parallel #50670

Closed
DaveCTurner opened this issue Jan 6, 2020 · 7 comments · Fixed by #51038
Labels
>bug :Distributed Indexing/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search.

Comments

@DaveCTurner
Contributor

DaveCTurner commented Jan 6, 2020

Before 7.2.0, dynamic mapping updates would block a write thread while it waited for the master to acknowledge the new mapping. In #39793 we moved to an asynchronous model, freeing up the write thread to carry on with other indexing tasks.

One feature of the pre-7.2.0 blocking approach was that the number of write threads is limited, which in turn limited the number of dynamic mapping updates pending on the master at once. The asynchronous model has no such limit and may send a very large number of dynamic mapping updates in a short time, since shard bulks are now processed much faster. Furthermore, many indexing operations may require the same mapping update, and because these updates are now generated much more quickly, many of those sent to the master may be duplicates.

Related discussion thread.

DaveCTurner added the >bug, discuss, and :Distributed Indexing/CRUD labels Jan 6, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/CRUD)

@DaveCTurner
Contributor Author

DaveCTurner commented Jan 6, 2020

We discussed a couple of options in Slack:

  1. impose an explicit limit on the number of in-flight dynamic mapping updates on each data node, throttling indexing when that limit is reached.

  2. detect the case where the in-flight mapping updates are already sufficient and, if so, wait locally for those updates to complete instead of sending duplicates to the master.

  3. apply backpressure at the network level (on the master) by stopping reading from the wire before reaching breaking point.
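
To make option 2 concrete, here is a minimal sketch of coalescing duplicate in-flight updates on a data node. It is not Elasticsearch code: `MappingUpdateCoalescer`, `send_to_master`, and the callback signatures are hypothetical names chosen for illustration, and it treats "already sufficient" as "exactly the same mapping string", which is a simplification of the idea above.

```python
import threading


class MappingUpdateCoalescer:
    """Sends one request per distinct in-flight update (hypothetical helper)."""

    def __init__(self, send_to_master):
        # Assumption: send_to_master(index, mapping, on_ack) is asynchronous
        # and invokes on_ack() once the master has acknowledged the update.
        self._send_to_master = send_to_master
        self._lock = threading.Lock()
        self._waiters = {}  # (index, mapping) -> callbacks awaiting the ack

    def update(self, index, mapping, on_done):
        key = (index, mapping)
        with self._lock:
            waiters = self._waiters.get(key)
            if waiters is not None:
                # An identical update is already in flight: wait locally.
                waiters.append(on_done)
                return
            self._waiters[key] = [on_done]

        def on_ack():
            with self._lock:
                callbacks = self._waiters.pop(key)
            for callback in callbacks:
                callback()

        self._send_to_master(index, mapping, on_ack)


sent = []       # acknowledgement hooks for requests that reached the fake master
completed = []  # indexing operations released after the ack


def fake_send(index, mapping, on_ack):
    sent.append(on_ack)


coalescer = MappingUpdateCoalescer(fake_send)
mapping = '{"properties": {"user": {"type": "keyword"}}}'
for op in range(5):
    coalescer.update("logs-2020.01.06", mapping, lambda op=op: completed.append(op))

requests_before_ack = len(sent)  # five operations, one request to the master
sent[0]()                        # the fake master acknowledges that request
```

Once the acknowledgement arrives, the entry is removed, so a later update with the same mapping would be sent again rather than waiting on a stale entry.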

ywelsch self-assigned this Jan 8, 2020
@SpencerLN

I believe we are running into a very similar issue after an upgrade from 6.8.0 to 7.5.1. In 6.8.0 we would see a couple of minutes where the cluster would apply the new mappings and then recover, while in 7.5.1 we see the number of pending mapping changes reach 30,000+. This quickly results in the master node becoming unresponsive and data nodes repeatedly leaving and rejoining the cluster because they cannot contact the master node in a timely manner.

Since our upgrade, each night at 12 AM (when the new date-based indices are created) we have had to restart all of our master nodes simultaneously to bring the cluster back to a healthy state.

Is there any timeline on a potential fix for this issue being made available, or a recommended workaround?

@DaveCTurner
Contributor Author

@SpencerLN that sounds like this issue indeed.

As a general rule, dynamic mappings should be used sparingly in production since they cause indexing to bottleneck on the master. It's much more efficient to use an index template to set up most of the mappings when the index is created, and this is particularly important at the kind of scale that would result in tens of thousands of pending tasks. Dynamic mappings are more appropriate for handling an occasional unexpected field.
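
As a rough illustration of the index template approach, the snippet below builds a request body for the 7.x `PUT _template/<name>` API. The pattern `logs-*` and the field names are made-up examples, not taken from this thread; the point is only that fields known in advance are declared up front, so new daily indices do not need a dynamic mapping update for each of them.

```python
import json

# Assumption: daily indices are named like logs-2020.01.06 and these fields
# are known ahead of time. All names here are illustrative only.
template = {
    "index_patterns": ["logs-*"],
    "mappings": {
        "properties": {
            "@timestamp": {"type": "date"},
            "host": {"type": "keyword"},
            "message": {"type": "text"},
            "status_code": {"type": "integer"},
        }
    },
}

# Body to send with: PUT _template/logs
body = json.dumps(template, indent=2)
print(body)
```

Dynamic mapping is left at its default here, so an occasional unexpected field is still accepted.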

@ywelsch
Contributor

ywelsch commented Jan 15, 2020

We discussed various options for solving this issue (and combinations thereof) in the distributed sync:

  • reintroduce blocking the writer thread on the data node once it has a certain number of dynamic mapping updates in-flight.
  • deduplicate the mapping updates that are sent to the master, assuming that a large number of these updates would be similar.
  • track the number of semi-processed, but uncompleted requests and start rejecting new requests once uncompleted requests reach a certain bound.
  • bound the number of in-flight mapping updates on the master node, and reject any new requests coming in. Combine this with a retry/backoff mechanism on the data node side.

To get a fix out quickly, I will look at reintroducing the blocking behavior in a first step.
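
For readers unfamiliar with the idea, the blocking behavior can be sketched with a semaphore that caps the number of in-flight updates per data node. This is only an illustration under stated assumptions, not the code from the fix: `BoundedMappingUpdater`, `send_to_master`, and `max_in_flight` are hypothetical names, and the master is simulated with a short sleep.

```python
import threading
import time


class BoundedMappingUpdater:
    """Caps concurrent mapping updates sent from one node (hypothetical)."""

    def __init__(self, send_to_master, max_in_flight):
        # Assumption: send_to_master(index, mapping) blocks until the master
        # has acknowledged the update.
        self._send_to_master = send_to_master
        self._permits = threading.BoundedSemaphore(max_in_flight)

    def update(self, index, mapping):
        # The calling write thread blocks here while max_in_flight updates
        # are already outstanding, which throttles indexing.
        with self._permits:
            return self._send_to_master(index, mapping)


def demo(num_requests=20, max_in_flight=3):
    lock = threading.Lock()
    state = {"current": 0, "peak": 0, "total": 0}

    def fake_send(index, mapping):
        with lock:
            state["current"] += 1
            state["peak"] = max(state["peak"], state["current"])
            state["total"] += 1
        time.sleep(0.01)  # simulated round trip to the master
        with lock:
            state["current"] -= 1
        return True

    updater = BoundedMappingUpdater(fake_send, max_in_flight)
    threads = [
        threading.Thread(target=updater.update, args=("logs", f"field_{i}"))
        for i in range(num_requests)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return state


if __name__ == "__main__":
    result = demo()
    print(result["total"], "updates sent, peak concurrency", result["peak"])
```

All updates are eventually sent, but the simulated master never sees more than `max_in_flight` of them at once from this node.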

ywelsch added a commit that referenced this issue Jan 15, 2020
Ensures that there are not too many concurrent dynamic mapping updates going out from the
data nodes to the master.

Closes #50670
SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this issue Jan 23, 2020
Ensures that there are not too many concurrent dynamic mapping updates going out from the
data nodes to the master.

Closes elastic#50670
@farin99

farin99 commented Apr 27, 2020

@ywelsch
Contributor

ywelsch commented Apr 27, 2020

Indeed, looks like I missed the backport to the 7.6 branch, and it also missed the 7.5.2 release (it's on the 7.5 branch, but landed just after that release), probably because it was backported at the time when the new branches were cut. It was backported to 7.x (i.e. the future 7.7.0), so it will be released as part of that. I will adapt the labels on the PR.
