-
Notifications
You must be signed in to change notification settings - Fork 377
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[PROF-7360] Enable CPU Profiling 2.0 by default #2702
Conversation
This check is a relic of older times when the profiling native extension did not exist. Nowadays, the checks for JRuby and Windows and non-linux platforms have moved to the `Datadog::Profiling.unsupported_reason`, and they will come from `native_library_compilation_skipped?` (via the `NativeExtensionHelpers` module). Thus, on those platforms, the profiling code flow doesn't even try to run the `Setup` task (so the warning would not be shown). For the macOS setup, we're working on moving the indication that macOS has no cpu-time to the Datadog UX itself. But I think it's OK to drop this now, because running on macOS is anyway something that involves manual work (such as compiling your own libdatadog), so it's not something we currently expect customers to go through.
As we're preparing to default CPU Profiling 2.0 to on, let's document that the `max_events` feature is deprecated, like the legacy profiler will be soon.
This setting (also available as an environment variable) will be wired up so that customers can opt-out of CPU Profiling 2.0.
This method will be used to decide when to (and when not to) enable CPU Profiling 2.0. It is part of the change to enable CPU Profiling 2.0 by default.
We've built enough trust on the feature that we don't need to put up the banner when it's enabled.
**What does this PR do?**: This PR graduates the new CPU Profiling 2.0 profiler from being in beta to being GA (general availability). For reference on the advantages brought by the new profiler, check the [dd-trace-rb 1.10.0 release notes](https://github.com/DataDog/dd-trace-rb/releases/tag/v1.10.0). Whenever profiling is enabled, the CPU Profiling 2.0 profiler will now be used by default unless: 1. The `mysql2` gem is installed. Older versions of libmysqlclient (the C library used by the mysql2 gem) have a bug in their signal handling code that the profiler can trigger. This bug (https://bugs.mysql.com/bug.php?id=83109) is fixed in libmysqlclient versions 8.0.0 and above. Because currently we don't (yet?) have a way to detect which version of libmysqlclient is in use by the `mysql2` gem, we opt for disabling CPU Profiling 2.0 in these situations, and emit a warning message telling why this has happened. 2. The profiler is running on Ruby 2.5 or below. These Ruby versions are missing an API that allows the profiler to detect the currently-active Ruby thread. We have deployed a workaround, but suspect that it may lead to crashes in extremely rare situations. (We've never seen the crashes in practice, but we can't be sure given how the VM behaves on those versions.) This PR also adds a (temporary, expected to be deprecated) setting that can be used to force the legacy profiler to be used instead of the new CPU Profiling 2.0 profiler. We do not recommend the use of this setting unless instructed by support. This setting can be set via code (`c.profiling.advanced.force_enable_legacy_profiler`) or via an environment variable (`DD_PROFILING_FORCE_ENABLE_LEGACY`). **Motivation**: Release CPU Profiling 2.0 to all customers. **Additional Notes**: I plan to explore an idea to optionally make CPU Profiling 2.0 sample without signals (by instead behaving like the legacy profiler). This should allow us to fully replace the legacy profiler with CPU Profiling 2.0 even for the customers affected by the caveats above (with some trade-offs, to be explored). **How to test the change?**: Validate that when running the profiler, the new CPU Profiling 2.0 profiler gets used. The easiest way to do so is by looking at the profiler UX, and seeing if selecting "CPU Time by Thread Name" in the right sidebar shows thread names. (Alternatively -- inspect the profile and check if `_native_sampling_loop` is shown).
… compiler warning
The funny thing about this one is that this one is caused by `#pragma GCC diagnostic ignored "-Wexpansion-to-defined"` on GCC versions that don't the above warning. So as far as I understood it, this is what happened: 1. On a modern GCC, we had a warning caused by the Ruby headers (the "expansion-to-defined" warning). We couldn't do anything about it, so we ignored it. 2. On an older GCC, the warning from #1 is not known, so it emits a different warning: a warning that it doesn't know "expansion-to-defined". 3. So we now need to ignore the unknown pragma warning... ... lol computers.
Codecov Report
@@ Coverage Diff @@
## master #2702 +/- ##
=======================================
Coverage 98.04% 98.04%
=======================================
Files 1199 1199
Lines 65754 65757 +3
Branches 2927 2925 -2
=======================================
+ Hits 64467 64472 +5
+ Misses 1287 1285 -2
... and 4 files with indirect coverage changes 📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
#pragma GCC diagnostic ignored "-Wpragmas" | ||
#pragma GCC diagnostic ignored "-Wexpansion-to-defined" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are these being ignored?
Would it make sense to write a short comment about each of these GCC diagnostic ignored
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The diff is "hiding" the explanation, which is on lines 15-18. All of these are warnings that show up in the internal Ruby headers and that unfortunately we can't do a lot about.
I considered adding more specific details, but to be honest, adding something like
/*
Silences
.rvm/gems/ruby-2.4.10/gems/debase-ruby_core_source-3.2.0/lib/debase/ruby_core_source/ruby-2.4.3-p205/internal.h:991:1: warning: ‘const’ attribute on function returning ‘void’ [-Wattributes]
991 | CONSTFUNC(void rb_gc_mark_encodings(void));
| ^~~~~~~~~
doesn't add a lot beyond "yeah, there's some specific pattern in the internal headers that some specific version of gcc generates a warning about, and we can't do anything about it >_>"
Any suggestions? Do you think it's still worth including these details, even though it's not something we can act on?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work, @ivoanjo!
Thanks @marcotc for the suggestion.
…ofiler **What does this PR do?**: As part of #2702 (more specifically, in a7d3b73) we added a check for the `mysql2` gem which avoided turning on the new profiler when we detected it. This check was a bit heavy-handed because the known incompatibility between the `mysql2` gem and the profiler only happens if the `mysql2` gem is using a legacy version of the `libmysqlclient` C library. (Specifically, a version prior to 8.0.0) In this PR, we improve the check to _actually_ check the `libmysqlclient` version. Thus, we will only fall back to using the legacy profiler as needed, not always as before. **Motivation**: The `mysql2` gem is a commonly-used database driver for Ruby (and Ruby on Rails) applications, and we don't want these customers to get stuck using the old profiler. **Additional Notes**: As part of the check, we need to `require 'mysql2'` during profiler initialization (since we need to call a method provided by the gem). In the rare just-in-case situation that this causes issues and customers don't want this to happen, this PR also introduces a new option to disable this behavior: * `DD_PROFILING_SKIP_MYSQL2_CHECK` via environment variable * `c.profiling.advanced.skip_mysql2_check` via code **How to test the change?**: Test coverage is included. To test this with the actual `mysql2` gem, our CI images have a modern version of `libmysqlclient` (specifically, the mariadb variant): ``` $ DD_PROFILING_ENABLED=true DD_TRACE_DEBUG=true bundle exec ddtracerb exec ruby -e "sleep 1" DEBUG -- ddtrace: [ddtrace] (/app/lib/datadog/profiling/component.rb:205:in `compatible_libmysqlclient_version?') Requiring `mysql2` to check if the `libmysqlclient` version it uses is compatible with profiling DEBUG -- ddtrace: [ddtrace] (/app/lib/datadog/profiling/component.rb:218:in `compatible_libmysqlclient_version?') The `mysql2` gem is using a compatible version of the `libmysqlclient` library (10.5.15) DEBUG -- ddtrace: [ddtrace] (/app/lib/datadog/core/configuration/components.rb:92:in `startup!') Profiling started DEBUG -- ddtrace: [ddtrace] (/app/lib/datadog/profiling/collectors/cpu_and_wall_time_worker.rb:58:in `block in start') Starting thread for: #<Datadog::Profiling::Collectors::CpuAndWallTimeWorker:0x00005e979dd5e498> ``` ...and you can use an older Ubuntu 18.04 image to see the failing case: ``` $ DD_PROFILING_ENABLED=true DD_TRACE_DEBUG=true bundle exec ddtracerb exec ruby -e "sleep 1" DEBUG -- ddtrace: [ddtrace] (/working/lib/datadog/profiling/component.rb:205:in `compatible_libmysqlclient_version?') Requiring `mysql2` to check if the `libmysqlclient` version it uses is compatible with profiling DEBUG -- ddtrace: [ddtrace] (/working/lib/datadog/profiling/component.rb:218:in `compatible_libmysqlclient_version?') The `mysql2` gem is using an incompatible version of the `libmysqlclient` library (5.7.41) WARN -- ddtrace: [ddtrace] (/working/lib/datadog/profiling/component.rb:170:in `enable_new_profiler?') Falling back to legacy profiler because an incompatible version of the mysql2 gem is installed. Older versions of libmysqlclient (the C library used by the mysql2 gem) have a bug in their signal handling code that the new profiler can trigger. This bug (https://bugs.mysql.com/bug.php?id=83109) is fixed in libmysqlclient versions 8.0.0 and above. DEBUG -- ddtrace: [ddtrace] (/working/lib/datadog/core/configuration/components.rb:92:in `startup!') Profiling started DEBUG -- ddtrace: [ddtrace] (/working/lib/datadog/core/workers/async.rb:130:in `start_worker') Starting thread for: #<Datadog::Profiling::Collectors::OldStack:0x00005604f7ac46b0> ```
…ofiler **What does this PR do?**: As part of #2702 (more specifically, in a7d3b73) we added a check for the `mysql2` gem which avoided turning on the new profiler when we detected it. This check was a bit heavy-handed because the known incompatibility between the `mysql2` gem and the profiler only happens if the `mysql2` gem is using a legacy version of the `libmysqlclient` C library. (Specifically, a version prior to 8.0.0) In this PR, we improve the check to _actually_ check the `libmysqlclient` version. Thus, we will only fall back to using the legacy profiler as needed, not always as before. **Motivation**: The `mysql2` gem is a commonly-used database driver for Ruby (and Ruby on Rails) applications, and we don't want these customers to get stuck using the old profiler. **Additional Notes**: As part of the check, we need to `require 'mysql2'` during profiler initialization (since we need to call a method provided by the gem). In the rare just-in-case situation that this causes issues and customers don't want this to happen, this PR also introduces a new option to disable this behavior: * `DD_PROFILING_SKIP_MYSQL2_CHECK` via environment variable * `c.profiling.advanced.skip_mysql2_check` via code **How to test the change?**: Test coverage is included. To test this with the actual `mysql2` gem, our CI images have a modern version of `libmysqlclient` (specifically, the mariadb variant): ``` $ DD_PROFILING_ENABLED=true DD_TRACE_DEBUG=true bundle exec ddtracerb exec ruby -e "sleep 1" DEBUG -- ddtrace: [ddtrace] (/app/lib/datadog/profiling/component.rb:205:in `compatible_libmysqlclient_version?') Requiring `mysql2` to check if the `libmysqlclient` version it uses is compatible with profiling DEBUG -- ddtrace: [ddtrace] (/app/lib/datadog/profiling/component.rb:218:in `compatible_libmysqlclient_version?') The `mysql2` gem is using a compatible version of the `libmysqlclient` library (10.5.15) DEBUG -- ddtrace: [ddtrace] (/app/lib/datadog/core/configuration/components.rb:92:in `startup!') Profiling started DEBUG -- ddtrace: [ddtrace] (/app/lib/datadog/profiling/collectors/cpu_and_wall_time_worker.rb:58:in `block in start') Starting thread for: #<Datadog::Profiling::Collectors::CpuAndWallTimeWorker:0x00005e979dd5e498> ``` ...and you can use an older Ubuntu 18.04 image to see the failing case: ``` $ DD_PROFILING_ENABLED=true DD_TRACE_DEBUG=true bundle exec ddtracerb exec ruby -e "sleep 1" DEBUG -- ddtrace: [ddtrace] (/working/lib/datadog/profiling/component.rb:205:in `compatible_libmysqlclient_version?') Requiring `mysql2` to check if the `libmysqlclient` version it uses is compatible with profiling DEBUG -- ddtrace: [ddtrace] (/working/lib/datadog/profiling/component.rb:218:in `compatible_libmysqlclient_version?') The `mysql2` gem is using an incompatible version of the `libmysqlclient` library (5.7.41) WARN -- ddtrace: [ddtrace] (/working/lib/datadog/profiling/component.rb:170:in `enable_new_profiler?') Falling back to legacy profiler because an incompatible version of the mysql2 gem is installed. Older versions of libmysqlclient (the C library used by the mysql2 gem) have a bug in their signal handling code that the new profiler can trigger. This bug (https://bugs.mysql.com/bug.php?id=83109) is fixed in libmysqlclient versions 8.0.0 and above. DEBUG -- ddtrace: [ddtrace] (/working/lib/datadog/core/configuration/components.rb:92:in `startup!') Profiling started DEBUG -- ddtrace: [ddtrace] (/working/lib/datadog/core/workers/async.rb:130:in `start_worker') Starting thread for: #<Datadog::Profiling::Collectors::OldStack:0x00005604f7ac46b0> ```
…ng 2.0 without signals **What does this PR do?**: This PR adds a new mode to the profiler, which we're calling the "no signals workaround". **Motivation**: This mode is a workaround for a few incompatibility issues with the new CPU Profiling 2.0 profiler. In #2702, when we made CPU Profiling 2.0 the default mode, we listed two cases where we were falling back to the old profiler: * [Some gems use native libraries that don't handle the `EINTR` error code that can be caused by a signal handler interrupting a system call](https://docs.datadoghq.com/profiler/profiler_troubleshooting/ruby/#unexpected-run-time-failures-and-errors-from-ruby-gems-that-use-native-extensions-in-dd-trace-rb-1110) * Ruby 2.5 or below are missing an API that allows the profiler to detect if the current thread is holding on to the Global VM Lock in the signal handler, which we suspect may lead to very rare crashes The objective of the "no signals workaround" is to allow the new CPU Profiling 2.0 to be used by customers in the above situations, both of which are related to signal handling. Thus, we're able to migrate every customer to the new codebase, allowing them to access the new features provided by the profiler, and allowing us to completely retire the old profiler. (I'll still keep the old profiler alive for a couple more releases, but the plan is to delete it soon). **Additional Notes**: The "no signals workaround" mode works by making the CPU Profiling 2.0 profiler behave in a similar way to the legacy profiler when deciding to take samples. That is, the CPU Profiling 2.0 profiler by default runs in a background thread that **does not hold the global VM lock** and thus is running in parallel with the Ruby application. When it decides it wants to take a sample, it sends a unix SIGPROF signal to the currently-active Ruby thread, temporarily interrupting what that thread is doing to ask the VM to take a sample. This allows us to get higher quality data, because it's the profiler that decides when it wants to take a sample and interrupts the code to do so. The "no signals workaround" instead modifies the above to have the profiler **grab the global VM lock** when it wants to take a sample, instead of interrupting the currently-active Ruby thread. This means that instead of sampling immediately, the profiler needs to wait for a Ruby VM scheduling switch point, and for the Ruby VM scheduler to actually schedule our thread to execute. This means that the profiler may need to wait dozens or hundreds of milliseconds to sample. But... it works! And the above is effectively how the old profiler worked -- since it was written in pure Ruby code, it always needed to wait for its turn to run. Implementing the new behavior is not a lot of work because we already had the scaffolding to do it as part of being able to sample in periods of idleness (see #2468). So most of the work on this was validating and adding the needed configuration. **How to test the change?**: Validate that in the situations described above where the old profiler would be used no longer happen -- instead, the new profiler gets used, and the "no signals workaround" gets enabled.
…ng 2.0 without signals **What does this PR do?**: This PR adds a new mode to the profiler, which we're calling the "no signals" workaround. **Motivation**: This mode is a workaround for a few incompatibility issues with the new CPU Profiling 2.0 profiler. In #2702, when we made CPU Profiling 2.0 the default mode, we listed two cases where we were falling back to the legacy profiler: * [Some gems use native libraries that don't handle the `EINTR` error code that can be caused by a signal handler interrupting a system call](https://docs.datadoghq.com/profiler/profiler_troubleshooting/ruby/#unexpected-run-time-failures-and-errors-from-ruby-gems-that-use-native-extensions-in-dd-trace-rb-1110) * Ruby 2.5 or below are missing an API that allows the profiler to detect if the current thread is holding on to the Global VM Lock in the signal handler, which we suspect may lead to very rare crashes The objective of the "no signals" workaround is to allow the new CPU Profiling 2.0 to be used by customers in the above situations, both of which are related to signal handling. Thus, we're able to migrate every customer to the new codebase, allowing them to access the new features provided by the profiler, and allowing us to completely retire the legacy profiler. (I'll still keep the legacy profiler alive for a couple more releases, but the plan is to delete it soon). **Additional Notes**: The "no signals workaround" mode works by making the CPU Profiling 2.0 profiler behave in a similar way to the legacy profiler when deciding to take samples. That is, the CPU Profiling 2.0 profiler by default runs in a background thread that **does not hold the global VM lock** and thus is running in parallel with the Ruby application. When it decides it wants to take a sample, it sends a unix SIGPROF signal to the currently-active Ruby thread, temporarily interrupting what that thread is doing to ask the VM to take a sample. This allows us to get higher quality data, because it's the profiler that decides when it wants to take a sample and interrupts the code to do so. The "no signals workaround" instead modifies the above to have the profiler **grab the global VM lock** when it wants to take a sample, instead of interrupting the currently-active Ruby thread. This means that instead of sampling immediately, the profiler needs to wait for a Ruby VM scheduling switch point, and for the Ruby VM scheduler to actually schedule our thread to execute. This means that the profiler may need to wait dozens or hundreds of milliseconds to sample. But... it works! And the above is effectively how the legacy profiler worked -- since it was written in pure Ruby code, it always needed to wait for its turn to run. Implementing the new behavior is not a lot of work because we already had the scaffolding to do it as part of being able to sample in periods of idleness (see #2468). So most of the work on this was validating and adding the needed configuration. **How to test the change?**: Validate that in the situations described above where the legacy profiler would be used no longer happen -- instead, the new profiler gets used, and the "no signals workaround" gets enabled.
What does this PR do?:
This PR graduates the new CPU Profiling 2.0 profiler from being in beta to being GA (general availability).
For reference on the advantages brought by the new profiler, check the dd-trace-rb 1.10.0 release notes.
Whenever profiling is enabled, the CPU Profiling 2.0 profiler will now be used by default unless:
The
mysql2
gem is installed. Older versions of libmysqlclient (the C library used by the mysql2 gem) have a bug in their signal handling code that the profiler can trigger. This bug (https://bugs.mysql.com/bug.php?id=83109) is fixed in libmysqlclient versions 8.0.0 and above.Because currently we don't (yet?) have a way to detect which version of libmysqlclient is in use by the
mysql2
gem, we opt for disabling CPU Profiling 2.0 in these situations, and emit a warning message telling why this has happened.The profiler is running on Ruby 2.5 or below. These Ruby versions are missing an API that allows the profiler to detect the currently-active Ruby thread. We have deployed a workaround, but suspect that it may lead to crashes in extremely rare situations. (We've never seen the crashes in practice, but we can't be sure given how the VM behaves on those versions.)
This PR also adds a (temporary, expected to be deprecated) setting that can be used to force the legacy profiler to be used instead of the new CPU Profiling 2.0 profiler.
We do not recommend the use of this setting unless instructed by support. This setting can be set via code (
c.profiling.advanced.force_enable_legacy_profiler
) or via an environment variable (DD_PROFILING_FORCE_ENABLE_LEGACY
).Motivation:
Release CPU Profiling 2.0 to all customers.
Additional Notes:
I plan to explore an idea to optionally make CPU Profiling 2.0 sample without signals (by instead behaving like the legacy profiler). This should allow us to fully replace the legacy profiler with CPU Profiling 2.0 even for the customers affected by the caveats above (with some trade-offs, to be explored).
This PR has been opened on top of #2700 to avoid merge conflicts, but is otherwise completely independent from it.
How to test the change?:
Validate that when running the profiler, the new CPU Profiling 2.0 profiler gets used. The easiest way to do so is by looking at the profiler UX, and seeing if selecting "CPU Time by Thread Name" in the right sidebar shows thread names. (Alternatively -- inspect the profile and check if
_native_sampling_loop
is shown).