Work around a deadlock in hts_tpool_process_flush.
The tpool_worker threads have a check on

    && q->qsize - q->n_output > p->tsize - p->nwaiting

This is to ensure that every waiting thread has room to write its
result to the output queue: if they all start up at once, there is
somewhere for each result to go.
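
For illustration, a minimal self-contained sketch of that gate (the
struct and function names below are hypothetical, showing only the
fields the check uses):

    /* Hypothetical cut-down structs; the real hts_tpool and
     * hts_tpool_process carry many more fields. */
    struct pool_sketch    { int tsize, nwaiting; }; /* threads, idle threads */
    struct process_sketch { int qsize, n_output; }; /* capacity, queued results */

    /* A worker only dequeues a job when the output queue has more free
     * slots than there are currently busy threads, so every job already
     * in flight, plus this one, still has somewhere to put its result. */
    static int worker_has_room(struct pool_sketch *p,
                               struct process_sketch *q)
    {
        return q->qsize - q->n_output > p->tsize - p->nwaiting;
    }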

However, when flushing we need to leave a bit more leeway.  It's
possible that the results haven't been consumed yet, so the output
queue is full, meaning the flush isn't something we can simply call
and wait on unless a separate thread is able to drain the queue.
(SAM and BAM have such a thread, but in CRAM the result consumer is
the same thread that calls the flush.)

Hts_tpool_process_flush attempts to fix this, albeit with an ironic
comment of "Shouldn't be possible to get here, but just in case".  It
can, and it matters, but it's not sufficient.  The condition was:

    if (q->qsize < q->n_output + q->n_input + q->n_processing)

However, as the tpool_worker check above shows, the gate also
involves "p->tsize - p->nwaiting", so it's not just a case of qsize
vs n_output.  Adding "p->tsize" to our qsize increase avoids this
potential deadlock.
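
To make the failure concrete, consider a hypothetical but
representative scenario: tsize = 4 threads, one of them busy on a
different process queue attached to the same pool (nwaiting = 3), the
output queue full because its consumer is the very thread now calling
the flush, and one job left on the input queue (n_input = 1,
n_processing = 0).  Then:

    old bump:  q->qsize - q->n_output = n_input + n_processing            = 1
    new bump:  q->qsize - q->n_output = n_input + n_processing + p->tsize = 5
    worker gate requires:             > p->tsize - p->nwaiting            = 1

With the old bump the gate is 1 > 1, which fails, so no worker ever
picks up the last job and the flush waits forever.  Adding p->tsize
guarantees the gate passes whenever there is still work left to do.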

(Potentially we may want to make this a temporary increase, since a
caller may flush mid-stream and not only when it's about to tear
everything down, but a permanently larger queue isn't a major
concern as it's simply a little bit more memory.)

As for how to trigger this, it failed roughly 1 run in 750 with the
htslib test:

    ./test_view -@4 -o seqs_per_slice=100 -o no_ref=1 -o multi_seq_per_slice=-1 -S -C multi_ref.tmp.sam

The deadlock occurs in cram_close, which first calls
cram_flush_container_mt and then moves on to hts_tpool_process_flush.
In between these lies the race condition that can leave many worker
threads waiting and hence the qsize being too small.  We can trigger
this bug 100% of the time by inserting a "sleep(1)" prior to the
"if (fd->pool &&..." conditional.  With the fix, it has survived
20,000 tests.

Arguably the bug is in cram_io.c and not the thread pool, but it's
best to make the pool robust to poor usage, especially given that
hts_tpool_process_flush has no variant with a timeout.
jkbonfield committed Jul 15, 2021
1 parent d16bed5 commit 502c22b
4 changes: 2 additions & 2 deletions thread_pool.c
@@ -950,8 +950,8 @@ int hts_tpool_process_flush(hts_tpool_process *q) {
 
     // Ensure there is room for the final sprint.
     // Shouldn't be possible to get here, but just in case.
-    if (q->qsize < q->n_output + q->n_input + q->n_processing)
-        q->qsize = q->n_output + q->n_input + q->n_processing;
+    if (q->qsize < q->n_output + q->n_input + q->n_processing + p->tsize)
+        q->qsize = q->n_output + q->n_input + q->n_processing + p->tsize;
 
     // When shutdown, we won't be launching more, but we can still
     // wait for any processing jobs complete.
