
[PROF-9470] Align heap recorder cleanup with GC activity (second try) #4020

Merged · 33 commits · Oct 24, 2024

Conversation

ivoanjo
Member

@ivoanjo ivoanjo commented Oct 23, 2024

Change log entry

Add setting to lower heap profiling memory use/latency by cleaning up young objects after Ruby GC

What does this PR do?

This PR adds a new mechanism to lower heap profiling overhead (memory and latency): the ability to clean young objects being tracked by the heap profiler after a Ruby GC cycle runs, rather than waiting for serialization time.

This mechanism is currently off by default and controlled by the setting c.profiling.advanced.heap_clean_after_gc_enabled / DD_PROFILING_HEAP_CLEAN_AFTER_GC_ENABLED
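As a sketch, enabling the setting programmatically would look something like this (the setting name comes from this PR; the surrounding `Datadog.configure` block is the gem's usual configuration API, and the environment variable `DD_PROFILING_HEAP_CLEAN_AFTER_GC_ENABLED` is the equivalent knob):

```ruby
# Sketch: opt in to the experimental heap-clean-after-GC mechanism.
# Assumes the datadog gem is installed and profiling is in use.
Datadog.configure do |c|
  c.profiling.enabled = true
  c.profiling.advanced.heap_clean_after_gc_enabled = true
end
```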

We plan to run a bit more validation before enabling it by default.

Motivation:

By doing these cleanups, we lower memory usage (we don't need to wait until the next serialization to notice an object was already collected) and latency, because there's less work to be done at serialization time.

Additional Notes:

This PR started from #3906; the main change from that earlier PR is that the heap recorder clean after GC behavior has been changed to be a best-effort mechanism (i.e. it's optional and not required for correctness).

How to test the change?

This change includes test coverage. I've also added a new benchmarking configuration to evaluate this new setting using our usual harness.

AlexJF and others added 30 commits October 23, 2024 11:19
[PROF-9470] Remove unneeded include

[PROF-9470] Cleanup after GC, major/minor based
To simplify the heap cleanup after GC PR, I'm changing the mechanism
to become best-effort, i.e. something that exists to reduce memory and
(hopefully) serialization latency, but that is not required for
correctness.

As the mechanism becomes best-effort, we no longer need to track
that we missed an update.
This commit changes the "heap cleanup after GC" to be best-effort:

* The stack recorder always triggers a full heap update before
  serialization (thus I removed the logic we were using to make a
  decision here)

* There's a new `heap_recorder_update_young_objects` method that
  is used to trigger the best-effort pass on young objects.

I realize we're at a weird mid-point: making the mechanism
best-effort also means that if it stops working, we won't notice.

I plan to add tests to catch this once I finish refactoring this
functionality.
This was broken at some point during our experiments (it's always
`false`) and is no longer needed now that the heap cleanup after GC
is best-effort.
… comes from outside

This allows us to simplify some of our logic and metrics, since we're
no longer deciding whether to do a full pass or not in this method.
This method becomes an internal detail that gets used via either:

* `heap_recorder_update_young_objects` => best-effort cleanup of young
  objects
* `heap_recorder_prepare_iteration` => full pass before serialization
Now that we know there's going to be a full update before serialization,
let's skip doing object size measurements during young object updates.

Doing this speeds up young object updates (less work), and shouldn't
particularly slow down full updates, because we redo object size
measurements there anyway.
When the heap recorder is tracking internal VM objects, those objects
cannot be inspected using regular APIs, because they have no klass.

Trying to do that results in a segfault as Ruby does not check before
trying to access that info.

To work around the issue, make sure to use the VM special object
printing API for debugging (`rb_obj_info`) which handles these
cases fine.

(Note that this issue was only for tests, because we only attempt
to print object information in this situation.)
Because heap cleanup after GC is a best-effort mechanism, we need to
have some kind of test that ensures it runs, otherwise we could
accidentally disable or break it and never realize.
This ensures we have coverage for this behavior; otherwise, if our age
filtering were incorrect, we wouldn't be able to spot it.
This mirrors what we've done for the `Collectors::ThreadContext`: by
introducing a `.for_testing` helper with a bunch of defaults, it's
easier to add new arguments to the class without having to fix a bunch
of tests to pass in the new extra argument; while at the same time
forcing production code to still pass in that argument.
We've done this conversion for other classes, and it seems a much nicer
way of allowing new arguments to be added without massive boilerplate.
This new setting, off by default, will allow us to easily test the
feature and decide on a rollout strategy.
We're no longer regularly looking at this data, so let's trim down
the number of variants we're using a bit.
This will allow us to validate the expected performance improvements
provided by this feature.
@ivoanjo ivoanjo added performance Involves performance (e.g. CPU, memory, etc) profiling Involves Datadog profiling labels Oct 23, 2024
@ivoanjo ivoanjo requested review from a team as code owners October 23, 2024 10:31
@ivoanjo ivoanjo requested a review from AlexJF October 23, 2024 10:31
@codecov-commenter

codecov-commenter commented Oct 23, 2024

Codecov Report

Attention: Patch coverage is 97.82609% with 3 lines in your changes missing coverage. Please review.

Project coverage is 97.85%. Comparing base (38f6cc1) to head (dd653ce).
Report is 473 commits behind head on master.

Files with missing lines Patch % Lines
...filing/collectors/cpu_and_wall_time_worker_spec.rb 92.85% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4020      +/-   ##
==========================================
- Coverage   97.86%   97.85%   -0.01%     
==========================================
  Files        1319     1319              
  Lines       79144    79271     +127     
  Branches     3927     3929       +2     
==========================================
+ Hits        77451    77572     +121     
- Misses       1693     1699       +6     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Contributor

@AlexJF AlexJF left a comment


LGTM! Nice test suite 👍

Comment on lines 494 to 496
// TODO: Discuss with Alex -- should we separate these out between young objects only and full updates?
// Tracking them together in this way seems to be muddying the waters -- young object updates will have more objects
// skipped, and different mixes of alive/dead
Contributor


Yes, good point.

Member Author


Split the stats in dd653ce

@@ -330,6 +327,8 @@ static VALUE _native_new(VALUE klass) {
// Note: Any exceptions raised from this note until the TypedData_Wrap_Struct call will lead to the state memory
// being leaked.

state->heap_clean_after_gc_enabled = true;
Contributor


Given that the current status of this is default false, why are we setting it to true here? I know it'll get overridden by _native_initialize but it seems unnecessary?

Member Author


Yeah, you're right, in hindsight this may be a bit misleading. Changed in 6494eee .

Comment on lines +936 to +944
# Let's replace the test_object reference with another object, so that the original one can be GC'd
test_object = Object.new # rubocop:disable Lint/UselessAssignment

# Force an update to happen on the next GC
Datadog::Profiling::StackRecorder::Testing._native_heap_recorder_reset_last_update(recorder)

GC.start

test_object_id
Contributor


From previous flakes we may need to redo this a bunch of times just in case we have a ref in a register somewhere keeping it alive?

Member Author


Yeah, I wondered about that as well.

In this case I'm explicitly allocating a new object, changing the local variable, and doing all of this in a separate method, which was somewhat deliberate as "a bunch of random actions to hopefully flush any hidden/leftover references".

I guess we'll see if we get lucky or not?
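The "flush hidden references" idea discussed above can be sketched as a toy test helper (illustrative only, not code from this PR): allocate the object in a separate method, clobber the local, and retry GC a few times in case a stale reference in a register or on the C stack keeps the object alive past one cycle.

```ruby
require 'weakref'

# Hypothetical helper (not from this PR): allocate the test object in a
# separate method so the caller's frame never holds a strong reference.
def allocate_and_drop
  obj = Object.new
  ref = WeakRef.new(obj)
  obj = nil # rubocop:disable Lint/UselessAssignment -- clobber the local on purpose
  ref
end

ref = allocate_and_drop
# Retry a few GC cycles: conservative stack scanning can keep the object
# alive past a single GC.start, which is exactly the flake risk above.
5.times do
  break unless ref.weakref_alive?
  GC.start
end
```

Even with the retries, collection is not guaranteed by the Ruby spec, which is why this class of test stays inherently flake-prone.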

Comment on lines +831 to +833
# Let's replace the test_object reference with another object, so that the original one can be GC'd
test_object = Object.new # rubocop:disable Lint/UselessAssignment
GC.start
Contributor


Similar flake potential as above?

Member Author


(See my note on the other one 😄 )

@ivoanjo
Member Author

ivoanjo commented Oct 23, 2024

Note: The dd-gitlab/benchmarks step is failing because I've refactored the benchmarks to use StackRecorder.for_testing and thus running the benchmarks from this PR against the master branch does not work.

Once this PR is merged, the new for_testing method will also be available in the master branch so the benchmarks will start working again.

This avoids "muddying" the waters for our stats, as young object updates
will have more objects skipped, and different mixes of alive/dead.
Contributor

@AlexJF AlexJF left a comment


LEBTBTM (Looks Even Better Than Before To Me)

@ivoanjo ivoanjo merged commit 1a6e255 into master Oct 24, 2024
267 of 269 checks passed
@ivoanjo ivoanjo deleted the ivoanjo/prof-9470-heap-cleanup-after-gc branch October 24, 2024 09:12
@github-actions github-actions bot added this to the 2.5.0 milestone Oct 24, 2024
@Strech Strech mentioned this pull request Nov 5, 2024
ivoanjo added a commit that referenced this pull request Nov 7, 2024
…ault

**What does this PR do?**

This PR changes the optimization added in #4020 to be enabled by
default.

I've collected a fresh set of benchmarking results for this feature in
[this google doc](https://docs.google.com/document/d/143jmyzB7rMJ9W2hKN0JoDbjo2m3oCVCzvPToHVjLRAM/edit?tab=t.0#heading=h.f00wz5x8kwg6).

The TL;DR is that the results seem to be very close: sometimes we
slightly improve things, but often the numbers are too close to tell.

But on the other hand this also means that there are no regressions,
and thus no reason not to enable the feature by default.

**Motivation:**

As a recap, without this optimization, the Ruby heap profiler works
by sampling allocated objects, collecting and keeping metadata about
these objects (stack trace, etc). Then, at serialization time
(every 60 seconds), the profiler checks which objects are still alive;
any objects still alive get included in the heap profile; any objects
that have since been garbage collected get their metadata dropped.

The above scheme has a weak-point: some objects are allocated and
almost immediately become garbage collected. Because the profiler
only checks for object liveness at serialization time, this can mean
that in the extreme, an object born and collected at the beginning
of the profiling period can still be tracked for almost 60 seconds
until the profiler finally figures out that the object is no longer
alive.

This has two consequences:
1. The profiler uses more memory, since it’s collecting metadata
   for already-dead objects
2. The profiler has more work to do at the end of the 60-second
    period – it needs to check an entire 60 seconds of sampled objects

The heap profiling clean after GC optimization adds an extra
mechanism that, based on Ruby GC activity, triggers periodic checking
of young objects (i.e. objects that have been alive for only a few GC
generations). Thus:

a. The profiler identifies and clears garbage objects faster,
   thus overall needing less memory
b. The profiler has less work to do at the end of the 60-second
   period, trading it off against a smaller periodic pass
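The scheme above can be sketched as a toy model (all names and the age threshold here are illustrative, not the profiler's actual API): tracked entries remember the GC generation at which they were sampled, the best-effort pass after a GC re-checks only young entries, and serialization still performs a full pass.

```ruby
require 'set'

MAX_AGE = 2 # made-up threshold: how many GC generations count as "young"

# Best-effort pass after a GC: drop only entries that are young AND dead.
# Old dead entries are left for the full pass at serialization time.
def clean_young(tracked, live_ids, current_gen)
  tracked.reject { |_id, gen| (current_gen - gen) <= MAX_AGE }
         .merge(tracked.select { |id, gen| (current_gen - gen) <= MAX_AGE && live_ids.include?(id) })
end

# Full pass before serialization: drop every dead entry.
def full_pass(tracked, live_ids)
  tracked.select { |id, _gen| live_ids.include?(id) }
end

tracked = { 1 => 10, 2 => 10, 3 => 5 } # object id => GC generation when sampled
live    = Set[2]                       # only object 2 is still alive

after_gc = clean_young(tracked, live, 11)
# Object 1 is young and dead, so it's dropped right away; object 3 is
# dead but old, so it survives until the full pass.
final = full_pass(after_gc, live)
```

The key property being modeled: the best-effort pass never has to be correct about old objects, because the full pass at serialization time still removes every dead entry.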

**Additional Notes:**

I've also removed the separate benchmarking configuration, to
avoid having too many long-running benchmarking variants.

**How to test the change?**

I've updated the specs for the setting, and the optimization itself
has existing test coverage that was added back in #4020.
ivoanjo added a commit that referenced this pull request Nov 26, 2024
… found

**What does this PR do?**

This PR fixes a bug introduced in #4020, specifically in f581076 .

We started using the `rb_obj_info` to print debug information about
objects in some cases, BUT I failed to notice that this API is not
actually available on Ruby 2.5 and 3.3 (it is on all others, which
is why it tripped me up).

This manifested in the following error reported by a customer:

> WARN -- datadog: [datadog] Profiling was requested but is not
> supported, profiling disabled: There was an error loading the
> profiling native extension due to 'RuntimeError Failure to load
> datadog_profiling_native_extension.3.3.5_x86_64-linux-musl due
> to Error relocating
> /app/vendor/bundle/ruby/3.3.0/gems/datadog-2.6.0/lib/datadog/profiling/../../datadog_profiling_native_extension.3.3.5_x86_64-linux-musl.so:
> rb_obj_info: symbol not found' at
> '/app/vendor/bundle/ruby/3.3.0/gems/datadog-2.6.0/lib/datadog/profiling/load_native_extension.rb:41:in `<main>''

This PR fixes this issue by never referencing `rb_obj_info` on
those Rubies. Since this API is only used for printing information
during errors, this should be fine (and is better than the alternative
of not printing info on any Rubies).

**Motivation:**

Fix profiling not loading in certain situations on Ruby 2.5 and 3.3.

**Additional Notes:**

Interestingly, this issue did not show up on glibc systems. I guess
musl libc is being a bit more eager about trying to resolve symbols?

**How to test the change?**

This change includes test coverage. Disabling the added check
in `extconf.rb` will produce a failing test.
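A check of the kind described would typically use mkmf's `have_func`, which compiles a probe program and defines a `HAVE_RB_OBJ_INFO` macro for the C code to `#ifdef` on. This is a hypothetical sketch of such a guard; the actual `extconf.rb` change may look different:

```ruby
# extconf.rb fragment (sketch): only reference rb_obj_info when the
# running Ruby actually exports it, avoiding the "symbol not found"
# load error described above.
require 'mkmf'

have_func('rb_obj_info') # defines HAVE_RB_OBJ_INFO when the symbol exists
```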