Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PROF-7360] Enable CPU Profiling 2.0 by default #2702

Merged
merged 15 commits into from
Mar 21, 2023

Conversation

ivoanjo
Copy link
Member

@ivoanjo ivoanjo commented Mar 20, 2023

What does this PR do?:

This PR graduates the new CPU Profiling 2.0 profiler from being in beta to being GA (general availability).

For reference on the advantages brought by the new profiler, check the dd-trace-rb 1.10.0 release notes.

Whenever profiling is enabled, the CPU Profiling 2.0 profiler will now be used by default unless:

  1. The mysql2 gem is installed. Older versions of libmysqlclient (the C library used by the mysql2 gem) have a bug in their signal handling code that the profiler can trigger. This bug (https://bugs.mysql.com/bug.php?id=83109) is fixed in libmysqlclient versions 8.0.0 and above.

    Because currently we don't (yet?) have a way to detect which version of libmysqlclient is in use by the mysql2 gem, we opt for disabling CPU Profiling 2.0 in these situations, and emit a warning message telling why this has happened.

  2. The profiler is running on Ruby 2.5 or below. These Ruby versions are missing an API that allows the profiler to detect the currently-active Ruby thread. We have deployed a workaround, but suspect that it may lead to crashes in extremely rare situations. (We've never seen the crashes in practice, but we can't be sure given how the VM behaves on those versions.)

This PR also adds a (temporary, expected to be deprecated) setting that can be used to force the legacy profiler to be used instead of the new CPU Profiling 2.0 profiler.

We do not recommend the use of this setting unless instructed by support. This setting can be set via code (c.profiling.advanced.force_enable_legacy_profiler) or via an environment variable (DD_PROFILING_FORCE_ENABLE_LEGACY).

Motivation:

Release CPU Profiling 2.0 to all customers.

Additional Notes:

I plan to explore an idea to optionally make CPU Profiling 2.0 sample without signals (by instead behaving like the legacy profiler). This should allow us to fully replace the legacy profiler with CPU Profiling 2.0 even for the customers affected by the caveats above (with some trade-offs, to be explored).

This PR has been opened on top of #2700 to avoid merge conflicts, but is otherwise completely independent from it.

How to test the change?:

Validate that when running the profiler, the new CPU Profiling 2.0 profiler gets used. The easiest way to do so is by looking at the profiler UX, and seeing if selecting "CPU Time by Thread Name" in the right sidebar shows thread names. (Alternatively -- inspect the profile and check if _native_sampling_loop is shown).

ivoanjo added 14 commits March 20, 2023 12:26
This check is a relic of older times when the profiling native extension
did not exist.

Nowadays, the checks for JRuby and Windows and non-linux platforms have
moved to the `Datadog::Profiling.unsupported_reason`, and they will
come from `native_library_compilation_skipped?` (via the
`NativeExtensionHelpers` module).

Thus, on those platforms, the profiling code flow doesn't even try to
run the `Setup` task (so the warning would not be shown).

For the macOS setup, we're working on moving the indication that
macOS has no cpu-time to the Datadog UX itself. But I think it's OK
to drop this now, because running on macOS is anyway something that
involves manual work (such as compiling your own libdatadog), so
it's not something we currently expect customers to go through.
As we're preparing to default CPU Profiling 2.0 to on, let's document
that the `max_events` feature is deprecated, like the legacy profiler
will be soon.
This setting (also available as an environment variable) will be wired
up so that customers can opt-out of CPU Profiling 2.0.
This method will be used to decide when to (and when not to) enable
CPU Profiling 2.0. It is part of the change to enable CPU Profiling 2.0
by default.
We've built enough trust on the feature that we don't need to put up
the banner when it's enabled.
**What does this PR do?**:

This PR graduates the new CPU Profiling 2.0 profiler from being in beta
to being GA (general availability).

For reference on the advantages brought by the new profiler, check
the [dd-trace-rb 1.10.0 release
notes](https://github.com/DataDog/dd-trace-rb/releases/tag/v1.10.0).

Whenever profiling is enabled, the CPU Profiling 2.0 profiler will now
be used by default unless:

1. The `mysql2` gem is installed. Older versions of libmysqlclient (the
   C library used by the mysql2 gem) have a bug in their signal handling
   code that the profiler can trigger. This bug
   (https://bugs.mysql.com/bug.php?id=83109) is fixed in libmysqlclient
   versions 8.0.0 and above.

   Because currently we don't (yet?) have a way to detect which version
   of libmysqlclient is in use by the `mysql2` gem, we opt for disabling
   CPU Profiling 2.0 in these situations, and emit a warning message
   telling why this has happened.

2. The profiler is running on Ruby 2.5 or below. These Ruby versions
   are missing an API that allows the profiler to detect the
   currently-active Ruby thread. We have deployed a workaround, but
   suspect that it may lead to crashes in extremely rare situations.
   (We've never seen the crashes in practice, but we can't be sure
   given how the VM behaves on those versions.)

This PR also adds a (temporary, expected to be deprecated) setting
that can be used to force the legacy profiler to be used instead of
the new CPU Profiling 2.0 profiler.

We do not recommend the use of this setting unless instructed by
support. This setting can be set via code
(`c.profiling.advanced.force_enable_legacy_profiler`) or via
an environment variable (`DD_PROFILING_FORCE_ENABLE_LEGACY`).

**Motivation**:

Release CPU Profiling 2.0 to all customers.

**Additional Notes**:

I plan to explore an idea to optionally make CPU Profiling 2.0
sample without signals (by instead behaving like the legacy
profiler). This should allow us to fully replace the legacy
profiler with CPU Profiling 2.0 even for the customers affected
by the caveats above (with some trade-offs, to be explored).

**How to test the change?**:

Validate that when running the profiler, the new CPU Profiling 2.0
profiler gets used. The easiest way to do so is by looking at the
profiler UX, and seeing if selecting "CPU Time by Thread Name" in
the right sidebar shows thread names. (Alternatively -- inspect
the profile and check if `_native_sampling_loop` is shown).
The funny thing about this one is that this one is caused by
`#pragma GCC diagnostic ignored "-Wexpansion-to-defined"` on GCC
versions that don't the above warning.

So as far as I understood it, this is what happened:
1. On a modern GCC, we had a warning caused by the Ruby headers (the
   "expansion-to-defined" warning). We couldn't do anything about it,
   so we ignored it.
2. On an older GCC, the warning from #1 is not known, so it emits
   a different warning: a warning that it doesn't know
   "expansion-to-defined".
3. So we now need to ignore the unknown pragma warning...
   ... lol computers.
@ivoanjo ivoanjo requested a review from a team March 20, 2023 16:17
@github-actions github-actions bot added core Involves Datadog core libraries profiling Involves Datadog profiling labels Mar 20, 2023
@codecov-commenter
Copy link

Codecov Report

Merging #2702 (147f8b6) into master (c0b33ac) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master    #2702   +/-   ##
=======================================
  Coverage   98.04%   98.04%           
=======================================
  Files        1199     1199           
  Lines       65754    65757    +3     
  Branches     2927     2925    -2     
=======================================
+ Hits        64467    64472    +5     
+ Misses       1287     1285    -2     
Impacted Files Coverage Δ
lib/datadog/profiling/tasks/setup.rb 93.54% <ø> (-2.20%) ⬇️
spec/datadog/profiling/tasks/setup_spec.rb 100.00% <ø> (ø)
lib/datadog/core/configuration/settings.rb 99.21% <100.00%> (+0.01%) ⬆️
lib/datadog/profiling/component.rb 96.15% <100.00%> (+0.80%) ⬆️
spec/datadog/core/configuration/settings_spec.rb 100.00% <100.00%> (ø)
spec/datadog/profiling/component_spec.rb 99.46% <100.00%> (+0.10%) ⬆️

... and 4 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Comment on lines +22 to +23
#pragma GCC diagnostic ignored "-Wpragmas"
#pragma GCC diagnostic ignored "-Wexpansion-to-defined"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are these being ignored?
Would it make sense to write a short comment about each of these GCC diagnostic ignored?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The diff is "hiding" the explanation, which is on lines 15-18. All of these are warnings that show up in the internal Ruby headers and that unfortunately we can't do a lot about.

I considered adding more specific details, but to be honest, adding something like

/*
Silences
.rvm/gems/ruby-2.4.10/gems/debase-ruby_core_source-3.2.0/lib/debase/ruby_core_source/ruby-2.4.3-p205/internal.h:991:1: warning: ‘constattribute on function returningvoid’ [-Wattributes]
  991 | CONSTFUNC(void rb_gc_mark_encodings(void));
      | ^~~~~~~~~

doesn't add a lot beyond "yeah, there's some specific pattern in the internal headers that some specific version of gcc generates a warning about, and we can't do anything about it >_>"

Any suggestions? Do you think it's still worth including these details, even though it's not something we can act on?

Copy link
Member

@marcotc marcotc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work, @ivoanjo!
:shipit:

@ivoanjo ivoanjo merged commit cb6f3a4 into master Mar 21, 2023
@ivoanjo ivoanjo deleted the ivoanjo/prof-7360-cpu-profiling-2.0-ga branch March 21, 2023 10:47
@github-actions github-actions bot added this to the 1.11.0 milestone Mar 21, 2023
ivoanjo added a commit that referenced this pull request Apr 11, 2023
…ofiler

**What does this PR do?**:

As part of #2702 (more specifically, in a7d3b73) we added a check
for the `mysql2` gem which avoided turning on the new profiler when we
detected it.

This check was a bit heavy-handed because the known incompatibility
between the `mysql2` gem and the profiler only happens if the
`mysql2` gem is using a legacy version of the `libmysqlclient` C
library. (Specifically, a version prior to 8.0.0)

In this PR, we improve the check to _actually_ check the
`libmysqlclient` version. Thus, we will only fall back to using the
legacy profiler as needed, not always as before.

**Motivation**:

The `mysql2` gem is a commonly-used database driver for Ruby
(and Ruby on Rails) applications, and we don't want these customers to
get stuck using the old profiler.

**Additional Notes**:

As part of the check, we need to `require 'mysql2'` during profiler
initialization (since we need to call a method provided by the gem).

In the rare just-in-case situation that this causes issues
and customers don't want this to happen, this PR also introduces a
new option to disable this behavior:

* `DD_PROFILING_SKIP_MYSQL2_CHECK` via environment variable
* `c.profiling.advanced.skip_mysql2_check` via code

**How to test the change?**:

Test coverage is included.

To test this with the actual `mysql2` gem, our CI images have a modern
version of `libmysqlclient` (specifically, the mariadb variant):

```
$ DD_PROFILING_ENABLED=true DD_TRACE_DEBUG=true bundle exec ddtracerb exec ruby -e "sleep 1"
DEBUG -- ddtrace: [ddtrace] (/app/lib/datadog/profiling/component.rb:205:in `compatible_libmysqlclient_version?') Requiring `mysql2` to check if the `libmysqlclient` version it uses is compatible with profiling
DEBUG -- ddtrace: [ddtrace] (/app/lib/datadog/profiling/component.rb:218:in `compatible_libmysqlclient_version?') The `mysql2` gem is using a compatible version of the `libmysqlclient` library (10.5.15)
DEBUG -- ddtrace: [ddtrace] (/app/lib/datadog/core/configuration/components.rb:92:in `startup!') Profiling started
DEBUG -- ddtrace: [ddtrace] (/app/lib/datadog/profiling/collectors/cpu_and_wall_time_worker.rb:58:in `block in start') Starting thread for: #<Datadog::Profiling::Collectors::CpuAndWallTimeWorker:0x00005e979dd5e498>
```

...and you can use an older Ubuntu 18.04 image to see the failing case:

```
$ DD_PROFILING_ENABLED=true DD_TRACE_DEBUG=true bundle exec ddtracerb exec ruby -e "sleep 1"
DEBUG -- ddtrace: [ddtrace] (/working/lib/datadog/profiling/component.rb:205:in `compatible_libmysqlclient_version?') Requiring `mysql2` to check if the `libmysqlclient` version it uses is compatible with profiling
DEBUG -- ddtrace: [ddtrace] (/working/lib/datadog/profiling/component.rb:218:in `compatible_libmysqlclient_version?') The `mysql2` gem is using an incompatible version of the `libmysqlclient` library (5.7.41)
 WARN -- ddtrace: [ddtrace] (/working/lib/datadog/profiling/component.rb:170:in `enable_new_profiler?') Falling back to legacy profiler because an incompatible version of the mysql2 gem is installed. Older versions of libmysqlclient (the C library used by the mysql2 gem) have a bug in their signal handling code that the new profiler can trigger. This bug (https://bugs.mysql.com/bug.php?id=83109) is fixed in libmysqlclient versions 8.0.0 and above.
DEBUG -- ddtrace: [ddtrace] (/working/lib/datadog/core/configuration/components.rb:92:in `startup!') Profiling started
DEBUG -- ddtrace: [ddtrace] (/working/lib/datadog/core/workers/async.rb:130:in `start_worker') Starting thread for: #<Datadog::Profiling::Collectors::OldStack:0x00005604f7ac46b0>
```
ivoanjo added a commit that referenced this pull request Apr 11, 2023
…ofiler

**What does this PR do?**:

As part of #2702 (more specifically, in a7d3b73) we added a check
for the `mysql2` gem which avoided turning on the new profiler when we
detected it.

This check was a bit heavy-handed because the known incompatibility
between the `mysql2` gem and the profiler only happens if the
`mysql2` gem is using a legacy version of the `libmysqlclient` C
library. (Specifically, a version prior to 8.0.0)

In this PR, we improve the check to _actually_ check the
`libmysqlclient` version. Thus, we will only fall back to using the
legacy profiler as needed, not always as before.

**Motivation**:

The `mysql2` gem is a commonly-used database driver for Ruby
(and Ruby on Rails) applications, and we don't want these customers to
get stuck using the old profiler.

**Additional Notes**:

As part of the check, we need to `require 'mysql2'` during profiler
initialization (since we need to call a method provided by the gem).

In the rare just-in-case situation that this causes issues
and customers don't want this to happen, this PR also introduces a
new option to disable this behavior:

* `DD_PROFILING_SKIP_MYSQL2_CHECK` via environment variable
* `c.profiling.advanced.skip_mysql2_check` via code

**How to test the change?**:

Test coverage is included.

To test this with the actual `mysql2` gem, our CI images have a modern
version of `libmysqlclient` (specifically, the mariadb variant):

```
$ DD_PROFILING_ENABLED=true DD_TRACE_DEBUG=true bundle exec ddtracerb exec ruby -e "sleep 1"
DEBUG -- ddtrace: [ddtrace] (/app/lib/datadog/profiling/component.rb:205:in `compatible_libmysqlclient_version?') Requiring `mysql2` to check if the `libmysqlclient` version it uses is compatible with profiling
DEBUG -- ddtrace: [ddtrace] (/app/lib/datadog/profiling/component.rb:218:in `compatible_libmysqlclient_version?') The `mysql2` gem is using a compatible version of the `libmysqlclient` library (10.5.15)
DEBUG -- ddtrace: [ddtrace] (/app/lib/datadog/core/configuration/components.rb:92:in `startup!') Profiling started
DEBUG -- ddtrace: [ddtrace] (/app/lib/datadog/profiling/collectors/cpu_and_wall_time_worker.rb:58:in `block in start') Starting thread for: #<Datadog::Profiling::Collectors::CpuAndWallTimeWorker:0x00005e979dd5e498>
```

...and you can use an older Ubuntu 18.04 image to see the failing case:

```
$ DD_PROFILING_ENABLED=true DD_TRACE_DEBUG=true bundle exec ddtracerb exec ruby -e "sleep 1"
DEBUG -- ddtrace: [ddtrace] (/working/lib/datadog/profiling/component.rb:205:in `compatible_libmysqlclient_version?') Requiring `mysql2` to check if the `libmysqlclient` version it uses is compatible with profiling
DEBUG -- ddtrace: [ddtrace] (/working/lib/datadog/profiling/component.rb:218:in `compatible_libmysqlclient_version?') The `mysql2` gem is using an incompatible version of the `libmysqlclient` library (5.7.41)
 WARN -- ddtrace: [ddtrace] (/working/lib/datadog/profiling/component.rb:170:in `enable_new_profiler?') Falling back to legacy profiler because an incompatible version of the mysql2 gem is installed. Older versions of libmysqlclient (the C library used by the mysql2 gem) have a bug in their signal handling code that the new profiler can trigger. This bug (https://bugs.mysql.com/bug.php?id=83109) is fixed in libmysqlclient versions 8.0.0 and above.
DEBUG -- ddtrace: [ddtrace] (/working/lib/datadog/core/configuration/components.rb:92:in `startup!') Profiling started
DEBUG -- ddtrace: [ddtrace] (/working/lib/datadog/core/workers/async.rb:130:in `start_worker') Starting thread for: #<Datadog::Profiling::Collectors::OldStack:0x00005604f7ac46b0>
```
@lloeki lloeki modified the milestones: 1.11.0, 1.11.0.beta1 Apr 14, 2023
ivoanjo added a commit that referenced this pull request May 24, 2023
…ng 2.0 without signals

**What does this PR do?**:

This PR adds a new mode to the profiler, which we're calling the
"no signals workaround".

**Motivation**:

This mode is a workaround for a few incompatibility issues with the new
CPU Profiling 2.0 profiler. In #2702, when we made CPU Profiling 2.0
the default mode, we listed two cases where we were falling back to the
old profiler:

* [Some gems use native libraries that don't handle the `EINTR` error
  code that can be caused by a signal handler interrupting a system
  call](https://docs.datadoghq.com/profiler/profiler_troubleshooting/ruby/#unexpected-run-time-failures-and-errors-from-ruby-gems-that-use-native-extensions-in-dd-trace-rb-1110)
* Ruby 2.5 or below are missing an API that allows the profiler to
  detect if the current thread is holding on to the Global VM Lock
  in the signal handler, which we suspect may lead to very rare crashes

The objective of the "no signals workaround" is to allow the new
CPU Profiling 2.0 to be used by customers in the above situations,
both of which are related to signal handling.

Thus, we're able to migrate every customer to the new codebase,
allowing them to access the new features provided by the profiler,
and allowing us to completely retire the old profiler.

(I'll still keep the old profiler alive for a couple more releases,
but the plan is to delete it soon).

**Additional Notes**:

The "no signals workaround" mode works by making the CPU Profiling 2.0
profiler behave in a similar way to the legacy profiler when deciding
to take samples.

That is, the CPU Profiling 2.0 profiler by default runs in a
background thread that **does not hold the global VM lock** and thus
is running in parallel with the Ruby application.

When it decides it wants to take a sample, it sends a unix SIGPROF
signal to the currently-active Ruby thread, temporarily interrupting what
that thread is doing to ask the VM to take a sample.

This allows us to get higher quality data, because it's the profiler
that decides when it wants to take a sample and interrupts the code
to do so.

The "no signals workaround" instead modifies the above to have the
profiler **grab the global VM lock** when it wants to take a sample,
instead of interrupting the currently-active Ruby thread.

This means that instead of sampling immediately, the profiler needs
to wait for a Ruby VM scheduling switch point, and for the Ruby VM
scheduler to actually schedule our thread to execute.

This means that the profiler may need to wait dozens or hundreds
of milliseconds to sample. But... it works!

And the above is effectively how the old profiler worked -- since
it was written in pure Ruby code, it always needed to wait for its
turn to run.

Implementing the new behavior is not a lot of work because we already
had the scaffolding to do it as part of being able to sample in
periods of idleness (see #2468). So most of the work on this was
validating and adding the needed configuration.

**How to test the change?**:

Validate that in the situations described above where the old
profiler would be used no longer happen -- instead, the new
profiler gets used, and the "no signals workaround" gets enabled.
ivoanjo added a commit that referenced this pull request May 24, 2023
…ng 2.0 without signals

**What does this PR do?**:

This PR adds a new mode to the profiler, which we're calling the
"no signals" workaround.

**Motivation**:

This mode is a workaround for a few incompatibility issues with the new
CPU Profiling 2.0 profiler. In #2702, when we made CPU Profiling 2.0
the default mode, we listed two cases where we were falling back to the
legacy profiler:

* [Some gems use native libraries that don't handle the `EINTR` error
  code that can be caused by a signal handler interrupting a system
  call](https://docs.datadoghq.com/profiler/profiler_troubleshooting/ruby/#unexpected-run-time-failures-and-errors-from-ruby-gems-that-use-native-extensions-in-dd-trace-rb-1110)
* Ruby 2.5 or below are missing an API that allows the profiler to
  detect if the current thread is holding on to the Global VM Lock
  in the signal handler, which we suspect may lead to very rare crashes

The objective of the "no signals" workaround is to allow the new
CPU Profiling 2.0 to be used by customers in the above situations,
both of which are related to signal handling.

Thus, we're able to migrate every customer to the new codebase,
allowing them to access the new features provided by the profiler,
and allowing us to completely retire the legacy profiler.

(I'll still keep the legacy profiler alive for a couple more releases,
but the plan is to delete it soon).

**Additional Notes**:

The "no signals workaround" mode works by making the CPU Profiling 2.0
profiler behave in a similar way to the legacy profiler when deciding
to take samples.

That is, the CPU Profiling 2.0 profiler by default runs in a
background thread that **does not hold the global VM lock** and thus
is running in parallel with the Ruby application.

When it decides it wants to take a sample, it sends a unix SIGPROF
signal to the currently-active Ruby thread, temporarily interrupting what
that thread is doing to ask the VM to take a sample.

This allows us to get higher quality data, because it's the profiler
that decides when it wants to take a sample and interrupts the code
to do so.

The "no signals workaround" instead modifies the above to have the
profiler **grab the global VM lock** when it wants to take a sample,
instead of interrupting the currently-active Ruby thread.

This means that instead of sampling immediately, the profiler needs
to wait for a Ruby VM scheduling switch point, and for the Ruby VM
scheduler to actually schedule our thread to execute.

This means that the profiler may need to wait dozens or hundreds
of milliseconds to sample. But... it works!

And the above is effectively how the legacy profiler worked -- since
it was written in pure Ruby code, it always needed to wait for its
turn to run.

Implementing the new behavior is not a lot of work because we already
had the scaffolding to do it as part of being able to sample in
periods of idleness (see #2468). So most of the work on this was
validating and adding the needed configuration.

**How to test the change?**:

Validate that in the situations described above where the legacy
profiler would be used no longer happen -- instead, the new
profiler gets used, and the "no signals workaround" gets enabled.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Involves Datadog core libraries profiling Involves Datadog profiling
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants