Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correct abortion of MR input-sending logic #408

Merged
merged 2 commits into from
Oct 25, 2012
Merged

Conversation

beerriot
Copy link

As of basho/riak_pipe@276e90c (released in Riak 1.1) using riak_pipe:destroy/1 causes the riak_pipe_builder process to exit with reason normal instead of something abnormal. This means that the asynchronous MapReduce input sender spawned in riak_kv_mrc_pipe:send_inputs_async/3 no longer receives an exit signal, despite being linked to the builder. So, an input-sender might continue running long after its target pipe has disappeared. This will fill riak logs with messages of the form:

riak_pipe_vnode:new_worker:766 Pipe worker startup failed:fitting was gone before startup

This affects MR input types of explicit bucket-keys, and certain kinds of modfun inputs.

Changing riak_pipe_builder:destroy/1 to make the builder exit abnormally does not completely solve this case, because there are other points at which the builder decides to exit, but does so with reason 'normal' to avoid log spam from its supervisor.

In addition, not all MapReduce users use riak_kv_mrc_pipe:send_inputs_async/3; some use :send_inputs/2,3 (non-async). These senders also need to notice that the pipe has vanished.

This PR takes a two-pronged attack:

  1. Clauses of riak_kv_mrc_pipe:send_inputs/3 now check the return value of their calls to riak_pipe_vnode:queue_work/2. When the pipe has closed, instead of returning ok, :queue_work will return {error, [worker_startup_failed,...]}. This error is raised for the endpoint to handle.
  2. The riak_kv_wm_mapred and riak_kv_pb_mapred modules, handling HTTP and PB MapReduce requests, now explicitly kill the async sender process.

Messages about "fitting was gone before startup" may still appear in the log, but they should be limited to one input's N-value worth (since each of the primary vnodes in the preflist must fail before the error is raised). In addition to being protection against holes in this strategy, the explicit kill added to the HTTP and PB endpoints should help reduce the number of spam log messages further.

@evanmcc and @jonmeredith should both be interested in this PR.

Without this, a process might continue to send inputs to a MapReduce
pipe for a long time after the pipe has disappeared.

This change should clear up the majority of cases where a log is spammed
with messages about "fitting was gone before startup". There may still
be N (one from each of the N primaries of a preflist) as the call must
fail once before triggering the exit, but it will not continue through a
large input list.
To help the process of shutting down the asynchronus mapred input
sender, this patch explicitly kills it whenever it knows that it should
die. The previous patch made the sender better able to determine that
time on its own, but this should improve the situation even more.
@evanmcc
Copy link
Contributor

evanmcc commented Oct 22, 2012

For reproducing:

https://gist.github.com/a36229c12102d38f4555

@evanmcc
Copy link
Contributor

evanmcc commented Oct 22, 2012

I'll have to leave the code review to someone a bit more familiar the riak_kv (although it looks fine to me, I am not sure if there are spots that might have been missed. I'll spend some more time with it tomorrow), even making the repro stress test worse and worse doesn't seem to have any affect (in a good way). My dev cluster never gets quite as loaded, and while there are errors (similar to the old ones), they are much less common and cease as soon as the bad MR spam ceases.

I'd like to spend some more time testing it tomorrow, but consider this a provisional testing +1.

@kellymclaughlin
Copy link
Contributor

Changes look good. Verified that this change fixes the problem of the cleanup not happening in a timely manner resulting in a massive amount of Pipe worker startup failed:fitting was gone before startup messages being spewed to the logs and does not seem to impact subsequent MR job execution. +1 to merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants