Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Work around a deadlock in hts_tpool_process_flush.
The tpool_worker threads have a check on && q->qsize - q->n_output > p->tsize - p->nwaiting This is to ensure that all waiting threads have sufficient room to write their results to the output queue, so if they all start up, there's room. However, when flushing, we need to leave a bit more leeway. It's possible that we haven't yet consumed the results, so the output queue is full, meaning the flush isn't something that we can call and just wait on unless we have a separate thread that is able to drain the queue. (This is valid for SAM and BAM, but not CRAM where the result consumer is also the same thread that calls the flush). Hts_tpool_process_flush attempts to fix this, albeit with an ironic comment of "Shouldn't be possible to get here, but just in case". It can, and it matters, but it's not sufficient. The condition was: if (q->qsize < q->n_output + q->n_input + q->n_processing) We can see from tpool_worker above however that it checks "p->tsize - p->nwaiting", so it's not just a case of qsize vs n_output. Adding "p->tsize" to our qsize increase avoids this potential deadlock. (Potentially we may want to do this as a temporary measure because a caller may do a flush mid-way and not at a time where it's about to tear everything down, but an increased queue size isn't a major concern as it's simply a little bit more memory.) As to how to trigger this, it was a 1 in 750 or so case from the htslib test: ./test_view -@4 -o seqs_per_slice=100 -o no_ref=1 -o multi_seq_per_slice=-1 -S -C multi_ref.tmp.sam The deadlock occurs in cram_close, which initially calls cram_flush_container_mt and then moves on to hts_tpool_process_flush. Inbetween these is the race condition that can lead to lots of waiting working threads and hence the qsize being too small. We can trigger this bug 100% of the time by inserting a "sleep(1)" prior to the "if (fd->pool &&..." conditional. With the fix, it's survived 20,000 tests. Arguably the bug is in cram_io.c and not the thread pool, but it's best to make the pool robust to poor usage, especially given there is no hts_tpool_process_flush call that has a timeout.
- Loading branch information