[SDTEST-409] fix deadlock issue: don't flush_events in main thread when calling stop, override #work_pending? instead #3743

anmarchenko · 2024-06-28T09:32:22Z

What does this PR do?
Fix deadlock issue that was happening in telemetry component when stopping the Datadog context right after start.

The reason for the deadlock was this overridden #stop method:

        def stop(force_stop = false, timeout = @shutdown_timeout)
          buffer.close if running?

          flush_events(dequeue) if work_pending?

          super
        end

This method is called by the main thread when Component#shutdown! is called: it tries to acquire TELEMETRY_STARTED_ONCE lock possibly already acquired by a telemetry worker, which with some unfortunate timing could cause deadlock. I added this #flush_events call to ensure that pending events in the queue are flushed before worker is stopped (it was not happening for a reason unknown to me).

It turned out that Worker with Core::Workers::Queue and Core::Workers::Polling indeed does not flush pending events by default because of #work_pending? method being overridden several times:

In Core::Workers::IntervalLoop (included together with Core::Workers::Polling):

        def work_pending?
          run_loop?
        end

In Core::Workers::Queue:

        def work_pending?
          !buffer.empty?
        end

In telemetry case as Polling is included after Queue, Polling wins and work is considered pending only if loop is currently running. This logic completely disregards the fact that there are some events waiting to be flushed.

What we actually want in this case is the combination of 2 approaches: work is pending when loop is running or when there are some events in buffer. This can be solved by overriding this method in Worker class:

        def work_pending?
          run_loop? || !buffer.empty?
        end

It allows us to remove flush_events call from the main thread eliminating (or decreasing?) deadlock probability.

How to test the change?
Confirm that deadlock issue is gone using reproducer by @ivoanjo:

require 'socket'
require 'datadog'
server = TCPServer.new(8126) # fake agent which never replies
puts Datadog::Tracing.enabled?

Run it with:

$ time DD_INSTRUMENTATION_TELEMETRY_ENABLED=true bundle exec ruby repro-telemetry.rb 

I, [2024-06-28T09:31:01.995300 #308]  INFO -- datadog: [datadog] DATADOG CONFIGURATION - CORE - {"date":"2024-06-28T09:31:01Z","os_name":"aarch64-unknown-linux-gnu","version":"2.1.0","lang":"ruby","lang_version":"3.1.2","env":null,"service":"repro","dd_version":null,"debug":false,"tags":null,"runtime_metrics_enabled":false,"vm":"ruby-3.1.2","health_metrics_enabled":false,"profiling_enabled":false}
true

real    0m1.161s
user    0m0.595s
sys     0m0.096s

... 100 times in different Ruby versions to confirm that deadlock exception does not appear.

Confirm that app-closing event is still being sent when stopping the app using empty rails app with DD_TRACE_DEBUG and telemetry enabled:

D, [2024-06-28T11:39:05.341113 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/configuration/settings.rb:712:in `block (3 levels) in <class:Settings>') Development environment detected, disabling Remote Configuration. You can enable it with DD_REMOTE_CONFIGURATION_ENABLED=true.
D, [2024-06-28T11:39:05.341532 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/workers/async.rb:136:in `start_worker') Starting thread for: #<Datadog::Core::Telemetry::Worker:0x00000001215f0650>
I, [2024-06-28T11:39:05.341773 #48226]  INFO -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/diagnostics/environment_logger.rb:13:in `log_configuration!') DATADOG CONFIGURATION - CORE - {"date":"2024-06-28T09:39:05Z","os_name":"arm64-apple-darwin23","version":"2.1.0","lang":"ruby","lang_version":"3.3.3","env":null,"service":"rails","dd_version":null,"debug":true,"tags":null,"runtime_metrics_enabled":false,"vm":"ruby-3.3.3","health_metrics_enabled":false,"profiling_enabled":false}
D, [2024-06-28T11:39:05.367022 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/http/adapters/net.rb:47:in `rescue in post') Unable to send telemetry event to agent: Failed to open TCP connection to 127.0.0.1:8126 (Connection refused - connect(2) for "127.0.0.1" port 8126)
D, [2024-06-28T11:39:05.367681 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/emitter.rb:29:in `request') Telemetry sent for event `app-started` (code: nil)
D, [2024-06-28T11:39:05.367922 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/worker.rb:119:in `block in started!') Error sending telemetry app-started event, retry after heartbeat interval...
=> Booting Puma
=> Rails 7.1.3.4 application starting in development 
=> Run `bin/rails server --help` for more startup options
D, [2024-06-28T11:39:05.484789 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/workers/async.rb:136:in `start_worker') Starting thread for: #<Datadog::Core::Telemetry::Worker:0x0000000123475be8>
I, [2024-06-28T11:39:05.485100 #48226]  INFO -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/diagnostics/environment_logger.rb:13:in `log_configuration!') DATADOG CONFIGURATION - CORE - {"date":"2024-06-28T09:39:05Z","os_name":"arm64-apple-darwin23","version":"2.1.0","lang":"ruby","lang_version":"3.3.3","env":null,"service":"rails","dd_version":null,"debug":true,"tags":null,"runtime_metrics_enabled":false,"vm":"ruby-3.3.3","health_metrics_enabled":false,"profiling_enabled":false}
D, [2024-06-28T11:39:05.488417 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/http/adapters/net.rb:47:in `rescue in post') Unable to send telemetry event to agent: Failed to open TCP connection to 127.0.0.1:8126 (Connection refused - connect(2) for "127.0.0.1" port 8126)
D, [2024-06-28T11:39:05.488472 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/emitter.rb:29:in `request') Telemetry sent for event `app-started` (code: nil)
D, [2024-06-28T11:39:05.488518 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/worker.rb:119:in `block in started!') Error sending telemetry app-started event, retry after heartbeat interval...
D, [2024-06-28T11:39:05.502629 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/workers/async.rb:136:in `start_worker') Starting thread for: #<Datadog::Core::Telemetry::Worker:0x00000001240d4420>
D, [2024-06-28T11:39:05.505530 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/http/adapters/net.rb:47:in `rescue in post') Unable to send telemetry event to agent: Failed to open TCP connection to 127.0.0.1:8126 (Connection refused - connect(2) for "127.0.0.1" port 8126)
D, [2024-06-28T11:39:05.505556 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/emitter.rb:29:in `request') Telemetry sent for event `app-started` (code: nil)
D, [2024-06-28T11:39:05.505568 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/worker.rb:119:in `block in started!') Error sending telemetry app-started event, retry after heartbeat interval...
I, [2024-06-28T11:39:05.505705 #48226]  INFO -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/diagnostics/environment_logger.rb:13:in `log_configuration!') DATADOG CONFIGURATION - CORE - {"date":"2024-06-28T09:39:05Z","os_name":"arm64-apple-darwin23","version":"2.1.0","lang":"ruby","lang_version":"3.3.3","env":"local","service":"quotes-rails","dd_version":null,"debug":true,"tags":null,"runtime_metrics_enabled":false,"vm":"ruby-3.3.3","health_metrics_enabled":false,"profiling_enabled":false}
D, [2024-06-28T11:39:05.564050 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/workers/async.rb:136:in `start_worker') Starting thread for: #<Datadog::Core::Telemetry::Worker:0x00000001238d4f68>
D, [2024-06-28T11:39:05.599066 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/emitter.rb:29:in `request') Telemetry sent for event `app-started` (code: 202)
D, [2024-06-28T11:39:05.599174 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/worker.rb:113:in `block in started!') Telemetry app-started event is successfully sent
D, [2024-06-28T11:39:05.721944 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/emitter.rb:29:in `request') Telemetry sent for event `app-dependencies-loaded` (code: 202)
D, [2024-06-28T11:39:05.813240 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/emitter.rb:29:in `request') Telemetry sent for event `app-heartbeat` (code: 202)
D, [2024-06-28T11:39:05.813287 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/worker.rb:88:in `flush_events') Sending 0 telemetry events
D, [2024-06-28T11:39:05.813732 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/emitter.rb:29:in `request') Telemetry sent for event `app-heartbeat` (code: 202)
D, [2024-06-28T11:39:05.813755 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/worker.rb:88:in `flush_events') Sending 0 telemetry events
D, [2024-06-28T11:39:06.139217 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/emitter.rb:29:in `request') Telemetry sent for event `app-heartbeat` (code: 202)
D, [2024-06-28T11:39:06.139277 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/worker.rb:88:in `flush_events') Sending 1 telemetry events
D, [2024-06-28T11:39:06.254879 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/emitter.rb:29:in `request') Telemetry sent for event `app-integrations-change` (code: 202)
I, [2024-06-28T11:39:06.255109 #48226]  INFO -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/diagnostics/environment_logger.rb:13:in `log_configuration!') DATADOG CONFIGURATION - CORE - {"date":"2024-06-28T09:39:06Z","os_name":"arm64-apple-darwin23","version":"2.1.0","lang":"ruby","lang_version":"3.3.3","env":"local","service":"quotes-rails","dd_version":null,"debug":true,"tags":null,"runtime_metrics_enabled":false,"vm":"ruby-3.3.3","health_metrics_enabled":false,"profiling_enabled":false}
Puma starting in single mode...
* Puma version: 6.4.2 (ruby 3.3.3-p89) ("The Eagle of Durango")
*  Min threads: 5
*  Max threads: 5
*  Environment: development
*          PID: 48226
* Listening on http://127.0.0.1:3000
* Listening on http://[::1]:3000
Use Ctrl-C to stop



D, [2024-06-28T11:40:05.930896 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/emitter.rb:29:in `request') Telemetry sent for event `app-heartbeat` (code: 202)
D, [2024-06-28T11:40:05.931120 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/worker.rb:88:in `flush_events') Sending 1 telemetry events
D, [2024-06-28T11:40:06.031016 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/emitter.rb:29:in `request') Telemetry sent for event `app-integrations-change` (code: 202)


^C- Gracefully stopping, waiting for requests to finish
=== puma shutdown: 2024-06-28 11:40:26 +0200 ===
- Goodbye!
Exiting
D, [2024-06-28T11:40:27.062213 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/emitter.rb:29:in `request') Telemetry sent for event `app-heartbeat` (code: 202)
D, [2024-06-28T11:40:27.062340 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/worker.rb:88:in `flush_events') Sending 1 telemetry events

# ---------------> here is the app-closing event that has to be sent on shutdown
D, [2024-06-28T11:40:27.177922 #48226] DEBUG -- datadog: [datadog] (/Users/andrey.marchenko/p/dd-trace-rb/lib/datadog/core/telemetry/emitter.rb:29:in `request') Telemetry sent for event `app-closing` (code: 202)

…op, override #work_pending? instead

codecov-commenter · 2024-06-28T11:09:04Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.11%. Comparing base (9a6e1d5) to head (5e8360f).

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #3743   +/-   ##
=======================================
  Coverage   98.10%   98.11%           
=======================================
  Files        1235     1235           
  Lines       73747    73749    +2     
  Branches     3561     3560    -1     
=======================================
+ Hits        72353    72357    +4     
+ Misses       1394     1392    -2

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

github-actions bot added the core Involves Datadog core libraries label Jun 28, 2024

anmarchenko marked this pull request as ready for review June 28, 2024 09:59

anmarchenko requested a review from a team as a code owner June 28, 2024 09:59

anmarchenko changed the title ~~fix deadlock issue: don't flush_events in main thread when calling stop, override #work_pending? instead~~ [SDTEST-409] fix deadlock issue: don't flush_events in main thread when calling stop, override #work_pending? instead Jun 28, 2024

anmarchenko added 3 commits June 28, 2024 12:54

fix deadlock issue: don't flush_events in main thread when calling st…

ea82972

…op, override #work_pending? instead

fix flaky test

88ff3a0

use method instead of instance variable

5e8360f

anmarchenko force-pushed the anmarchenko/fix_telemetry_threads_deadlock branch from 1fc79e4 to 5e8360f Compare June 28, 2024 10:54

TonyCTHsu approved these changes Jun 28, 2024

View reviewed changes

anmarchenko mentioned this pull request Jun 28, 2024

fix telemetry threads deadlock backport #3745

Merged

anmarchenko merged commit 4cd328d into master Jun 28, 2024
166 checks passed

anmarchenko deleted the anmarchenko/fix_telemetry_threads_deadlock branch June 28, 2024 13:49

github-actions bot added this to the 2.2.0 milestone Jun 28, 2024

TonyCTHsu mentioned this pull request Jul 11, 2024

Bump to version 2.2.0 #3778

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SDTEST-409] fix deadlock issue: don't flush_events in main thread when calling stop, override #work_pending? instead #3743

[SDTEST-409] fix deadlock issue: don't flush_events in main thread when calling stop, override #work_pending? instead #3743

anmarchenko commented Jun 28, 2024 •

edited

Loading

codecov-commenter commented Jun 28, 2024

[SDTEST-409] fix deadlock issue: don't flush_events in main thread when calling stop, override #work_pending? instead #3743

[SDTEST-409] fix deadlock issue: don't flush_events in main thread when calling stop, override #work_pending? instead #3743

Conversation

anmarchenko commented Jun 28, 2024 • edited Loading

codecov-commenter commented Jun 28, 2024

Codecov Report

anmarchenko commented Jun 28, 2024 •

edited

Loading