Correct abortion of MR input-sending logic #408

beerriot · 2012-10-20T19:33:13Z

As of basho/riak_pipe@276e90c (released in Riak 1.1) using riak_pipe:destroy/1 causes the riak_pipe_builder process to exit with reason normal instead of something abnormal. This means that the asynchronous MapReduce input sender spawned in riak_kv_mrc_pipe:send_inputs_async/3 no longer receives an exit signal, despite being linked to the builder. So, an input-sender might continue running long after its target pipe has disappeared. This will fill riak logs with messages of the form:

riak_pipe_vnode:new_worker:766 Pipe worker startup failed:fitting was gone before startup

This affects MR input types of explicit bucket-keys, and certain kinds of modfun inputs.

Changing riak_pipe_builder:destroy/1 to make the builder exit abnormally does not completely solve this case, because there are other points at which the builder decides to exit, but does so with reason 'normal' to avoid log spam from its supervisor.

In addition, not all MapReduce users use riak_kv_mrc_pipe:send_inputs_async/3; some use :send_inputs/2,3 (non-async). These senders also need to notice that the pipe has vanished.

This PR takes a two-pronged attack:

Clauses of riak_kv_mrc_pipe:send_inputs/3 now check the return value of their calls to riak_pipe_vnode:queue_work/2. When the pipe has closed, instead of returning ok, :queue_work will return {error, [worker_startup_failed,...]}. This error is raised for the endpoint to handle.
The riak_kv_wm_mapred and riak_kv_pb_mapred modules, handling HTTP and PB MapReduce requests, now explicitly kill the async sender process.

Messages about "fitting was gone before startup" may still appear in the log, but they should be limited to one input's N-value worth (since each of the primary vnodes in the preflist must fail before the error is raised). In addition to being protection against holes in this strategy, the explicit kill added to the HTTP and PB endpoints should help reduce the number of spam log messages further.

@evanmcc and @jonmeredith should both be interested in this PR.

Without this, a process might continue to send inputs to a MapReduce pipe for a long time after the pipe has disappeared. This change should clear up the majority of cases where a log is spammed with messages about "fitting was gone before startup". There may still be N (one from each of the N primaries of a preflist) as the call must fail once before triggering the exit, but it will not continue through a large input list.

To help the process of shutting down the asynchronus mapred input sender, this patch explicitly kills it whenever it knows that it should die. The previous patch made the sender better able to determine that time on its own, but this should improve the situation even more.

evanmcc · 2012-10-22T19:14:49Z

For reproducing:

https://gist.github.com/a36229c12102d38f4555

evanmcc · 2012-10-22T19:55:22Z

I'll have to leave the code review to someone a bit more familiar the riak_kv (although it looks fine to me, I am not sure if there are spots that might have been missed. I'll spend some more time with it tomorrow), even making the repro stress test worse and worse doesn't seem to have any affect (in a good way). My dev cluster never gets quite as loaded, and while there are errors (similar to the old ones), they are much less common and cease as soon as the bad MR spam ceases.

I'd like to spend some more time testing it tomorrow, but consider this a provisional testing +1.

kellymclaughlin · 2012-10-24T18:15:55Z

Changes look good. Verified that this change fixes the problem of the cleanup not happening in a timely manner resulting in a massive amount of Pipe worker startup failed:fitting was gone before startup messages being spewed to the logs and does not seem to impact subsequent MR job execution. +1 to merge.

beerriot added 2 commits October 20, 2012 13:51

ghost assigned kellymclaughlin and beerriot Oct 24, 2012

beerriot merged commit 37cd964 into 1.2 Oct 25, 2012

beerriot mentioned this pull request Nov 12, 2012

Keylisting continues after pipe death #293

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Correct abortion of MR input-sending logic #408

Correct abortion of MR input-sending logic #408

beerriot commented Oct 20, 2012

evanmcc commented Oct 22, 2012

evanmcc commented Oct 22, 2012

kellymclaughlin commented Oct 24, 2012

Correct abortion of MR input-sending logic #408

Correct abortion of MR input-sending logic #408

Conversation

beerriot commented Oct 20, 2012

evanmcc commented Oct 22, 2012

evanmcc commented Oct 22, 2012

kellymclaughlin commented Oct 24, 2012