[SDTEST-409] fix deadlock issue: don't flush_events in main thread when calling stop, override #work_pending? instead #3743
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
What does this PR do?
Fix deadlock issue that was happening in telemetry component when stopping the Datadog context right after start.
The reason for the deadlock was this overridden
#stop
method:This method is called by the main thread when
Component#shutdown!
is called: it tries to acquire TELEMETRY_STARTED_ONCE lock possibly already acquired by a telemetry worker, which with some unfortunate timing could cause deadlock. I added this#flush_events
call to ensure that pending events in the queue are flushed before worker is stopped (it was not happening for a reason unknown to me).It turned out that Worker with
Core::Workers::Queue
andCore::Workers::Polling
indeed does not flush pending events by default because of#work_pending?
method being overridden several times:In
Core::Workers::IntervalLoop
(included together withCore::Workers::Polling
):In
Core::Workers::Queue
:In telemetry case as Polling is included after Queue, Polling wins and work is considered pending only if loop is currently running. This logic completely disregards the fact that there are some events waiting to be flushed.
What we actually want in this case is the combination of 2 approaches: work is pending when loop is running or when there are some events in buffer. This can be solved by overriding this method in Worker class:
It allows us to remove
flush_events
call from the main thread eliminating (or decreasing?) deadlock probability.How to test the change?
Confirm that deadlock issue is gone using reproducer by @ivoanjo:
Run it with:
... 100 times in different Ruby versions to confirm that deadlock exception does not appear.
Confirm that app-closing event is still being sent when stopping the app using empty rails app with DD_TRACE_DEBUG and telemetry enabled: