Replication with attachments never completes, {mp_parser_died,noproc} error #745
Comments
Seeing this now in multiple production environments. In one case, it is potentially completely freezing a node participating in continuous replication with large attachments. In the other, it's a one-time replication that must be restarted many times before it runs to completion. Discussion on IRC today with @nickva follows.
|
Hi everyone! I think I have some more information on this issue in the form of a side effect. My setup is a small cluster with just 3 nodes continuously replicating a few databases from another, bigger one. Only 3 of the replicated databases hold attachments and, by chance, the same node is responsible for replicating all 3 of them. That node throws the described error quite often (a few thousand times per hour), depending on the speed at which documents are received. That particular node also shows a continuously growing number of processes. After connecting the Observer to that node to see which processes are there, I could see a lot of them stuck, and this node has quite a lot of 'erlang:apply/2' processes sitting in 'couch_httpd_multipart:mp_parse_atts/2'. I think there may be something preventing these processes from exiting, and that's why they pile up until the node freezes. Hope this helps. |
Hi Carlos, I was wondering roughly how many attachments you have and their approximate size distribution. How about large documents, or document ID lengths larger than 4KB? And this is still CouchDB 2.1.0 as mentioned above? Also, what are the values of the request/document size limit configuration parameters?
Note the default value of the request size limit in particular. Basically I'm trying to see whether this is an issue of the target cluster rejecting requests because of one of those limits, or whether something else is going on. |
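As a quick reference, these limits can be read back from the node-local config port (a sketch assuming the default port 5986; adjust host/port for your install):
# request size limit (httpd section) and document size limit (couchdb section)
curl -s http://127.0.0.1:5986/_config/httpd/max_http_request_size
curl -s http://127.0.0.1:5986/_config/couchdb/max_document_size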
Hi Nick, I've been reviewing some of the affected documents. Their sizes are not big, at least the ones I've looked at: they range from 4 to 60 KB, mostly PDF and XLS attachments. I haven't iterated through all of them to compute the distribution you suggest. Is there an easier CouchDB way to get that overview of attachment sizes? I'm using CouchDB 2.1.0 here, and the configs should both be the defaults, as I haven't specified either of them; I'm checking the config values on the nodes to confirm. |
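There's no single built-in report of attachment sizes, but each doc's _attachments stub carries a length field, so a rough overview can be scraped per database (a sketch assuming a db called mydb, the default port 5984, and jq):
# print "docid/attname: length" for every attachment in the database
curl -s 'http://127.0.0.1:5984/mydb/_all_docs?include_docs=true' | \
  jq -r '.rows[] | .id as $id | (.doc._attachments // {}) | to_entries[] | "\($id)/\(.key): \(.value.length)"'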
Hi Carlos, Thanks for the additional info! So it seems like with retries they eventually finish. We'd still rather not have these errors in the first place... It doesn't seem like request/document/attachment size limits are involved in this case. Now I am thinking it could perhaps be unreliable network connections or a large number of replications. In your setup, how reliable is the network? Any chance of intermittent connectivity issues, high latency, or maybe running out of sockets? Another question is how many replications are running at the same time; would there be more than 500 per cluster node? That's currently the max jobs value for the scheduler, and if there are more than that, the scheduling replicator will stop some jobs and start others as it cycles through them. Wondering if that is an issue. |
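To rule the 500-job limit in or out, the scheduler can be queried directly (a sketch assuming default ports and jq):
# number of replication jobs the scheduler currently knows about
curl -s http://127.0.0.1:5984/_scheduler/jobs | jq '.total_rows'
# the configured scheduler limit (node-local config port)
curl -s http://127.0.0.1:5986/_config/replicator/max_jobs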
Hi Nick, Although I've definitely seen some replication errors pointing to a closed connection on the source from time to time, they are very sparse, and I don't think we're affected by an unreliable network either: the source is a cluster hosted on Softlayer (central US, I think) and the target is a cluster in the Europe-West region of Google Compute Engine. Both platforms, while far apart in terms of distance, have very reliable and strong network links. As for the number of replications, we're nowhere near that figure: I'm replicating on a three-node cluster, with each node running 9 replications. Regards |
We believe that we're seeing this in internal testing on single-node hosts doing local replication too. Our attachment sizes can be in the gigabytes. |
Hi @elistevens, Thanks for your report. Would you be able to make a short script to reproduce the issue, or at least describe the steps in more detail? For example: 1: clone couch at version X, Erlang version Y, OS version Z, etc.; 2: build; 3: set up with these config parameters; 4: create 2 dbs; 5: populate with attachments of this size; ... |
@calonso, Sorry for the delayed response. There is a minor fix in that code in 2.1.1; would you be able to retry with that latest version to see if it results in the same error? If you do upgrade, take a look at the release notes regarding the vm.args file and localhost vs 127.0.0.1 node names. |
Hi @nickva. We updated to 2.1.1 a while ago and unfortunately we keep seeing the same error... :( Thanks! |
@calonso, thanks for checking, it helps to know that. |
Bah, my earlier draft response got eaten by a browser shutdown. I don't have an easy repro script, sadly. We're seeing the issue under load during our test runs, but any single test seems to work fine when run in isolation. Our largest attachments are in the range of 100MB to 1GB. I know that's against recommended practices, but that wasn't clear when the bones of our architecture were laid down in ~2011. We are running 2.1.1 on Ubuntu, using the official .debs. |
Hey @nickva I spent time today on getting a repro for this, as it's affecting more and more people. Bear with me on the setup, it's a little involved. Set up a VM with the following parameters (I used VMWare Workstation):
Now you're ready to set up the test:
Repeat the above a few times to get the DB to 1GB or larger. You can increase the 10 (the number of docs created per run), but at some point you'll run out of RAM, so be careful. The script creates sample docs with a few fields and a 50MB attachment full of random bytes (see the sketch after this comment). Now to run the test:
If the above succeeds, repeat it. This produces a failure for me within 10 minutes. The command line returns:
and the logfile has errors identical to those in the original post above, and in #574. |
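A rough sketch of that kind of generator (made-up db name and port; 10 docs, each with a few fields and a 50MB attachment of random bytes):
# create a source db and fill it with sample docs carrying 50MB random attachments
DB=http://127.0.0.1:5984/source_db
curl -sX PUT $DB
for i in $(seq 1 10); do
  dd if=/dev/urandom of=/tmp/att.bin bs=1M count=50 2>/dev/null
  REV=$(curl -sX PUT "$DB/doc-$i" -H content-type:application/json \
        -d "{\"seq\": $i, \"type\": \"sample\"}" | sed 's/.*"rev":"\([^"]*\)".*/\1/')
  curl -sX PUT "$DB/doc-$i/att.bin?rev=$REV" \
       -H content-type:application/octet-stream --data-binary @/tmp/att.bin
done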
@wohali thanks for that test! I'll take a look at it when I get a chance. |
Leaving this here as it might be relevant: a similar issue was noticed by someone using attachments in the 10MB range, with one larger attachment around 50MB. Investigation on the #couchdb-dev IRC channel pointed at the same area of code. Possibly also relevant in that case: there were intermittent network problems, with nodes being connected and disconnected. |
FYI the intermittent network problems are not a prerequisite for this problem to surface. However, I think we are going in the right direction thinking this is related to incorrect attachment length calculation and/or incomplete network transfers. |
One other thing: there was a previous attempt at changing some of this behaviour that never landed and that references some old JIRA tickets: |
@wohali thanks, it does help! |
(From discussion on IRC) This might be related to setting a lower max http request limit. Before, it defaulted to 4GB (which is what the code has), but the default.ini file set it to 64MB, so that became the value being used. The max request limit will prevent larger attachments from replicating. Also, the 413 error is not always raised cleanly (see #574, also referenced in the top description). To confirm whether this is the cause or is affecting this issue at all, you can try bumping `httpd/max_http_request_size`
to a higher value, perhaps one that's two or three times larger than the largest attachment or document (to account for some overhead). |
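Concretely, using the node-local config port the same way the repro script further down does (a sketch assuming port 5986; 4294967296 is the old 4GB default):
# bump the request size limit back to 4GB on the node-local config port
curl -X PUT http://127.0.0.1:5986/_config/httpd/max_http_request_size -d '"4294967296"'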
Related to this, we also started to enforce http request limits more strictly. |
Moar data from affected nodes. I've listed processes grouped and counted by current function. Script:
io:format("~p", [
lists:keysort(2,
maps:to_list(lists:foldl(
fun(Elm, Acc) ->
case Elm of
{M, F, A} ->
N = maps:get({M, F, A}, Acc, 0),
maps:put({M, F, A}, N + 1, Acc);
Else ->
Acc
end
end,
#{},
lists:map(
fun(Pid) ->
case process_info(Pid) of
undefined -> [];
Info -> proplists:get_value(current_function, Info)
end
end,
processes()
)
))
)
])
Output from an affected node:
[{{code_server,loop,1},1},
{{couch_replicator_scheduler,stats_updater_loop,1},1},
{{cpu_sup,measurement_server_loop,1},1},
{{cpu_sup,port_server_loop,2},1},
{{erl_eval,do_apply,6},1},
{{erl_prim_loader,loop,3},1},
{{erlang,hibernate,3},1},
{{gen,do_call,4},1},
{{global,loop_the_locker,1},1},
{{global,loop_the_registrar,0},1},
{{inet_gethost_native,main_loop,1},1},
{{init,loop,1},1},
{{mem3_shards,'-start_changes_listener/1-fun-0-',1},1},
{{memsup,port_idle,1},1},
{{net_kernel,ticker_loop,2},1},
{{shell,shell_rep,4},1},
{{standard_error,server_loop,1},1},
{{user,server_loop,2},1},
{{couch_changes,wait_updated,3},2},
{{prim_inet,recv0,3},2},
{{dist_util,con_loop,9},3},
{{gen_event,fetch_msg,5},6},
{{couch_os_process,'-init/1-fun-0-',2},23},
{{application_master,loop_it,4},25},
{{application_master,main_loop,2},25},
{{prim_inet,accept0,2},29},
{{couch_httpd_multipart,mp_parse_atts,2},31},
{{fabric_db_update_listener,cleanup_monitor,3},210},
{{fabric_db_update_listener,wait_db_updated,1},210},
{{rexi_monitor,wait_monitors,1},210},
{{rexi_utils,process_message,6},212},
{{couch_event_listener,loop,2},412},
{{couch_httpd_multipart,maybe_send_data,1},881},
{{couch_doc,'-doc_from_multi_part_stream/4-fun-1-',1},912},
{{gen_server,loop,6},2816}]
Unaffected node in the same cluster:
[{{code_server,loop,1},1},
{{couch_ejson_compare,less,2},1},
{{couch_index_server,get_index,3},1},
{{couch_replicator_scheduler,stats_updater_loop,1},1},
{{cpu_sup,measurement_server_loop,1},1},
{{cpu_sup,port_server_loop,2},1},
{{erl_eval,do_apply,6},1},
{{erl_prim_loader,loop,3},1},
{{fabric_util,get_shard,4},1},
{{global,loop_the_locker,1},1},
{{global,loop_the_registrar,0},1},
{{inet_gethost_native,main_loop,1},1},
{{init,loop,1},1},
{{mem3_shards,'-start_changes_listener/1-fun-0-',1},1},
{{memsup,port_idle,1},1},
{{net_kernel,ticker_loop,2},1},
{{prim_inet,recv0,3},1},
{{shell,shell_rep,4},1},
{{standard_error,server_loop,1},1},
{{user,server_loop,2},1},
{{couch_changes,wait_updated,3},2},
{{dist_util,con_loop,9},3},
{{erlang,hibernate,3},3},
{{gen_event,fetch_msg,5},6},
{{rexi,wait_for_ack,2},8},
{{couch_os_process,'-init/1-fun-0-',2},16},
{{application_master,loop_it,4},25},
{{application_master,main_loop,2},25},
{{prim_inet,accept0,2},28},
{{couch_httpd_multipart,mp_parse_atts,2},37},
{{fabric_db_update_listener,wait_db_updated,1},112},
{{fabric_db_update_listener,cleanup_monitor,3},113},
{{rexi_monitor,wait_monitors,1},114},
{{rexi_utils,process_message,6},114},
{{couch_event_listener,loop,2},344},
{{couch_httpd_multipart,maybe_send_data,1},361},
{{couch_doc,'-doc_from_multi_part_stream/4-fun-1-',1},398},
{{gen_server,loop,6},10270}]
Output from a cluster that doesn’t have attachments:
[{{code_server,loop,1},1},
{{cpu_sup,measurement_server_loop,1},1},
{{cpu_sup,port_server_loop,2},1},
{{erl_eval,do_apply,6},1},
{{erl_prim_loader,loop,3},1},
{{erts_code_purger,loop,0},1},
{{fabric_db_update_listener,cleanup_monitor,3},1},
{{fabric_db_update_listener,wait_db_updated,1},1},
{{global,loop_the_locker,1},1},
{{global,loop_the_registrar,0},1},
{{init,loop,1},1},
{{net_kernel,ticker_loop,2},1},
{{rexi_monitor,wait_monitors,1},1},
{{rexi_utils,process_message,6},1},
{{shell,shell_rep,4},1},
{{standard_error,server_loop,1},1},
{{timer,sleep,1},1},
{{user,server_loop,2},1},
{{dist_util,con_loop,2},3},
{{gen_event,fetch_msg,5},6},
{{couch_changes,wait_updated,3},10},
{{couch_event_listener,loop,2},19},
{{application_master,loop_it,4},24},
{{application_master,main_loop,2},24},
{{couch_os_process,'-init/1-fun-0-',2},32},
{{prim_inet,accept0,2},33},
{{erlang,hibernate,3},75},
{{gen_server,loop,6},2521}] |
Possible solution from davisp discussed on IRC: https://gist.github.com/davisp/27cd7ab54cdffeaa6e96590df4f988f9 |
In some cases the higher level code in `couch_replicator_api_wrap` needs to handle retries explicitly and cannot cope with retries happening in the lower level http client. In such cases it sets `retries = 0`. For example: https://github.com/apache/couchdb/blob/master/src/couch_replicator/src/couch_replicator_api_wrap.erl#L271-L275 The http client should then avoid unconditional retries and instead consult the `retries` value. If `retries = 0`, it shouldn't retry and should instead bubble the exception up to the caller. This bug was discovered when attachments were replicated to a target cluster whose resources were constrained. Since attachment `PUT` requests were made from the context of an open_revs `GET` request, the `PUT` requests timed out and were retried. However, because the retry didn't bubble up to the `open_revs` code, the second `PUT` request would die with a `noproc` error, since the old parser had exited by then. See issue #745 for more.
Occasionally it's possible to lose track of our RPC workers in the main multipart parsing code. This change monitors each worker process and then exits if all workers have exited before the parser considers itself finished. Fixes part of #745
So in testing with a client, this no longer hangs/crashes/eats all the RAM, but it does still cause an issue where a too-large request body fails to transmit a document. The replicator thinks it HAS successfully transferred the document, and declares replication successful. A subsequent attempt to GET the document results in a 404. Here is a censored excerpt from the logfile of the situation:
Note that in extensive testing, this has only happened four times, so I'm not sure I can provide an easy reproducer here, but we'll keep at it. Couch was running at the info log level for this test, so I'm going to bump it up to debug level and try the test again, hoping for a duplicate. |
If some requests fail with a 413 it's not surprising that the replication completes; it should just bump the doc_write_failures count. The question is why that one request fails with a 413 to start with. Good call on debug logs. Also, what are the doc sizes involved, and how many revisions per document? Any attachments? And what are the relevant size-limit config values? |
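For a one-shot _replicate call that counter is visible in the response history; for running jobs it shows up in _active_tasks (a sketch assuming the default port and jq):
# doc_write_failures (and friends) for running replication jobs
curl -s http://127.0.0.1:5984/_active_tasks | \
  jq '.[] | select(.type=="replication") | {source, target, docs_written, doc_write_failures}'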
Yes, attachments. You can see in the log that revisions per doc are low. Everything else is default. (Working around this by bumping the defaults is just sweeping the error under the rug...) Of course we should be replicating the document. Why is it even getting a 413 in the first place? It's replicating a document from the same server to the same server, no settings have changed; surely it should be able to PUT a document it just did a GET of from itself. I believe this test (I didn't write it) runs in a loop, so the data is being replicated over and over from one database to the next on the same server. Finally, even with a bumped limit... |
We're now hard-rejecting attachments greater than the configured request size limit. @nickva I have an excellent reproducible case:
/cc @janl |
Great repro Joan. I played with it and came up with this: the Python script uses the standalone attachment API, while the twist is that the replicator uses multipart requests, not standalone attachment requests. Multipart requests are subject to the `max_http_request_size` limit. This leads to the observed behaviour that you can create an attachment in one db and can NOT replicate that attachment to another db on the same CouchDB node (or another node with the same `max_http_request_size` setting). Applying the request size limit to the standalone attachment API as well (see the diff below) only gets us part of the way. Say you create a doc with two attachments, each with a length just under `max_http_request_size`: the doc is accepted, but the multipart request the replicator sends for it exceeds the limit and is rejected. I haven't checked this, but a conflicting doc with one attachment just under `max_http_request_size` on each conflict branch might hit the same problem. This leads us to having to decide:
References:
--- a/src/chttpd/src/chttpd_db.erl
+++ b/src/chttpd/src/chttpd_db.erl
@@ -1218,6 +1218,7 @@ db_attachment_req(#httpd{method=Method, user_ctx=Ctx}=Req, Db, DocId, FileNamePa
undefined -> <<"application/octet-stream">>;
CType -> list_to_binary(CType)
end,
+ couch_httpd:check_max_request_length(Req),
Data = fabric:att_receiver(Req, chttpd:body_length(Req)),
ContentLen = case couch_httpd:header_value(Req,"Content-Length") of
undefined -> undefined; |
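With a check like that in place, an oversized standalone attachment PUT should come back as a 413 up front instead of being accepted and then failing to replicate. A hypothetical spot-check, using the dev-cluster ports from the repro below:
# set the limit low, then try a standalone attachment PUT that exceeds it;
# with the patch applied this should return a 413 rather than succeeding
curl -X PUT http://127.0.0.1:15986/_config/httpd/max_http_request_size -d '"1500"'
dd if=/dev/zero bs=3000 count=1 2>/dev/null | \
  curl -i -X PUT http://127.0.0.1:15984/db/bigdoc/att \
       -H content-type:application/octet-stream --data-binary @-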
Shorter repro that runs quickly; it tests both the single attachment > `max_http_request_size` case and the two attachments each < `max_http_request_size` case. Look for the two instances of `doc_write_failure` in the replication responses. #!/bin/sh
COUCH=http://127.0.0.1:15984
INT=http://127.0.0.1:15986
DBA=$COUCH/db
DBB=$COUCH/dbb
# cleanup
curl -X DELETE $DBA
curl -X DELETE $DBB
# setup
curl -X PUT $DBA
curl -X PUT $DBB
# config
curl -X PUT $INT/_config/httpd/max_http_request_size -d '"1500"'
curl -X PUT $INT/_config/replicator/retries_per_request -d '"1"'
# create an att > max_http_request_size, should succeed
# 3000 here as not to run into _local checkpoint size limits
BODY3000=11111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
curl -X PUT http://127.0.0.1:15984/db/doc/att --data-binary "$BODY3000" -Hcontent-type:application/octet-stream
# replicate, should succeed, but with one doc_write_failure
curl -X POST $COUCH/_replicate -d "{\"source\": \"$DBA\", \"target\": \"$DBB\"}" -H content-type:application/json
# create two atts, each < max_http_request_size, but att1+att2 > max_http_request_size
# cleanup
curl -X DELETE $DBA
curl -X DELETE $DBB
# setup
curl -X PUT $DBA
curl -X PUT $DBB
BODY1500=11111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
REV=`curl -sX PUT http://127.0.0.1:15984/db/doc1/att --data-binary "$BODY1500" -Hcontent-type:application/octet-stream | cut -b 31-64`
curl -X PUT http://127.0.0.1:15984/db/doc1/att2?rev=$REV --data-binary "$BODY1500" -Hcontent-type:application/octet-stream
# replicate, should succeed, but with one doc_write_failure
curl -X POST $COUCH/_replicate -d "{\"source\": \"$DBA\", \"target\": \"$DBB\"}" -H content-type:application/json |
This reverts commit 4a73d03. Latest Mochiweb 2.17 might have helped a bit, but running `soak-eunit suites=couch_replicator_small_max_request_size_target` still makes it fail after 10-15 runs locally for me.
This reverts commit ba624ea.
Expected Behavior
Replication of a DB with attachments into 2.1.0 should be successful.
Current Behavior
Replication crashes after a while with the following stack trace:
Replication restarts, the error repeats and replication never finishes.
Feels like an instance of #574, which we thought had been resolved.
Your Environment