eunit replication test fails with {nocatch, {mp_parser_died,noproc}} #574
We actually got a recurrence of this:
I created a really dumb python script to search for jenkins-couchdb-104-2017-07-18T16:11:41.083372
The most recent of these failures is ...
Summary below:
couch.log looks similar to before:
There is a race condition here in how a 413 (request body is too large) error ...
This test posts a 70k document followed by a 70k attachment in a single MP PUT ...
Here is the rough sequence of events:
Source fetches the document revision using the open_revs GET request.
Then document data (body and attachments) is sent to the target using a PUT ...
https://github.com/apache/couchdb/blob/master/src/couch/src/couch_httpd_multipart.erl#L29
The PUT attachment streamer gets bytes from the parser and sends them to the ...
Target notices that the request body is too large so it sends back a 413 ...
This is where the race condition happens:
After the retry, request tries to send data to the target again but this ...
Possible fixes:
In some cases, such as when the replicator flushes a document received from an open_revs response, it explicitly sets the number of retries to 0, because the context for that request might not be restartable and the retry should happen at a higher level. Issue apache#574
Signal mochiweb to force close a connection if the error was a 413 (request body too large). Closing the socket flushes the data back to the client sooner, so the client can parse the 413 error before the connection is closed instead of only getting a connection-closed event. Issue apache#574
Connections which received a 413 response might be dirty, so to be safe, clean them up by killing them, waiting for them to die, and then returning them to the connection pool. Issue apache#574
It is possible that a multipart/related PUT with a doc and an attachment sometimes fails because the connection is unexpectedly closed before the client (ibrowse) gets to parse the 413 error response. That makes the test flaky, so it is disabled for now. Issue apache#574
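The two client-side ideas in the commit messages above (zero retries at the request level, and not reusing a connection that saw a 413) can be sketched roughly as follows. This is an illustrative Python model only, not CouchDB, ibrowse, or mochiweb code: `RequestTooLarge`, `FakeConnection`, `ConnectionPool`, and `send` are hypothetical names, and the pool simply discards a dirty connection instead of killing a worker process and waiting for it to die.

```python
class RequestTooLarge(Exception):
    """Hypothetical stand-in for receiving an HTTP 413 response."""


class FakeConnection:
    """Hypothetical connection object; only tracks an id and a closed flag."""

    def __init__(self, conn_id):
        self.conn_id = conn_id
        self.closed = False

    def close(self):
        self.closed = True


class ConnectionPool:
    """Toy pool: hands out idle connections, never reuses dirty ones."""

    def __init__(self):
        self._next_id = 0
        self._idle = []

    def acquire(self):
        if self._idle:
            return self._idle.pop()
        self._next_id += 1
        return FakeConnection(self._next_id)

    def release(self, conn, dirty=False):
        if dirty:
            # Assumption: a connection that saw a 413 may still have unread
            # or unsent bytes on it, so close it rather than reuse it.
            conn.close()
            return
        self._idle.append(conn)


def send(pool, do_request, retries):
    """Run do_request(conn) with at most `retries` extra attempts.

    With retries=0 the first failure propagates to the caller, which is
    then responsible for retrying at a higher level.
    """
    attempts_left = retries + 1
    while True:
        conn = pool.acquire()
        try:
            result = do_request(conn)
        except RequestTooLarge:
            pool.release(conn, dirty=True)
            attempts_left -= 1
            if attempts_left == 0:
                raise
        else:
            pool.release(conn)
            return result
```

In this sketch a caller that cannot restart its request context would call `send(pool, do_request, retries=0)` and handle `RequestTooLarge` itself, while any later request gets a fresh connection because the one that saw the 413 was closed on release.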
This should have been closed out with @nickva's PR. Closing.
Expected & Current Behaviour
Usually, the eunit couch_replicator_small_max_request_size_target sub-test with should_populate_source_one_large_attachment passes. Occasionally, the Makefile shows the test timing out. This appears to be due to an actual crash in couch_att.
When it passes, the couch.log looks like:
When it fails, the couch.log looks like the following. Notice how the attempt to convert the attachment to a multipart is failing, killing the replicator PUT connection. A backoff and a retry occur; the code assumes the error is remote. The test fails after 60s of retrying:
Possible Solution