[PROF-9470] Align heap recorder cleanup with GC activity (second try) #4020
Conversation
[PROF-9470] Remove unneeded include
[PROF-9470] Cleanup after GC, major/minor based
To simplify the heap cleanup after GC PR, I'm changing the mechanism to be best-effort, i.e. something that exists to reduce memory and (hopefully) serialization latency, but that is not required for correctness. Since the mechanism becomes best-effort, we no longer need to track that we missed an update.
This commit changes the "heap cleanup after GC" to be best-effort:
* The stack recorder always triggers a full heap update before serialization (so I removed the logic we were using to make a decision here)
* There's a new `heap_recorder_update_young_objects` method that is used to trigger the best-effort pass on young objects

I realize we're at a weird mid-point: by making the mechanism best-effort, it also means that if it stops working, we won't notice. I plan to add tests to catch this once I finish refactoring this functionality.
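To make the shape of this more concrete, here's a minimal toy sketch in Ruby of the same idea (not the actual C heap recorder; `ToyHeapRecorder` and its method names are made up for illustration), showing why the after-GC pass can be best-effort while the pass before serialization stays authoritative:

```ruby
require "weakref"

# Toy model: sampled objects are tracked via weak references, so liveness can
# be checked without the recorder itself keeping the objects alive.
class ToyHeapRecorder
  def initialize
    @tracked = [] # pairs of [weak reference, sample metadata]
  end

  def track(object, metadata)
    @tracked << [WeakRef.new(object), metadata]
  end

  # Best-effort pass (same spirit as heap_recorder_update_young_objects):
  # may run after a GC; if it never runs, correctness is unaffected.
  def update_young_objects
    @tracked.reject! { |ref, _| !ref.weakref_alive? }
  end

  # Full pass before serialization: only still-alive objects get reported.
  def prepare_iteration
    @tracked.reject! { |ref, _| !ref.weakref_alive? }
    @tracked.map { |_, metadata| metadata }
  end
end
```

(The real recorder filters by object age rather than scanning every entry; the toy version only shows why skipping the after-GC pass can never affect the serialized result.)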
This was broken at some point during our experiments (it's always `false`) and is no longer needed now that the heap cleanup after GC is best-effort.
… comes from outside

This allows us to simplify some of our logic and metrics, since we're no longer deciding whether to do a full pass or not in this method.
This method becomes an internal detail that gets used either via:
* `heap_recorder_update_young_objects` => best-effort cleanup of young objects
* `heap_recorder_prepare_iteration` => full pass before serialization
Now that we know there's going to be a full update before serialization, let's skip doing object size measurements during young object updates. Doing this speeds up young object updates (less work), and shouldn't particularly slow down full updates, since those redo object size measurements anyway.
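Continuing the toy sketch from above (and using `ObjectSpace.memsize_of` purely as a stand-in for the real size measurement, which lives in the native extension), the split would look roughly like this:

```ruby
require "objspace"
require "weakref"

# Hypothetical helper: young-object updates skip the comparatively expensive
# size measurement; the full pass before serialization re-measures every
# still-alive object anyway, so nothing is lost by skipping it earlier.
def refresh_tracked(entries, measure_sizes:)
  entries.each do |ref, metadata|
    next unless ref.weakref_alive?
    metadata[:size] = ObjectSpace.memsize_of(ref.__getobj__) if measure_sizes
  end
end

# refresh_tracked(tracked, measure_sizes: false) # after GC: cheap, best-effort
# refresh_tracked(tracked, measure_sizes: true)  # before serialization: full
```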
When the heap recorder was tracking internal VM objects, because such objects have no klass, they cannot be inspected using the regular APIs. Trying to do so results in a segfault, as Ruby does not check before trying to access that info. To work around the issue, make sure to use the VM's special object printing API for debugging (`rb_obj_info`), which handles these cases fine. (Note that this issue only affected tests, because that's the only situation where we attempt to print object information.)
Because heap cleanup after GC is a best-effort mechanism, we need to have some kind of test that ensures it runs, otherwise we could accidentally disable or break it and never realize.
This ensures we have coverage for this behavior; otherwise, if our age filtering were incorrect, we wouldn't be able to spot it.
This mirrors what we've done for the `Collectors::ThreadContext`: by introducing a `.for_testing` helper with a bunch of defaults, it's easier to add new arguments to the class without having to fix a bunch of tests to pass in the new extra argument, while at the same time still forcing production code to pass in that argument.
We've done this conversion for other classes, and it seems like a much nicer way of allowing new arguments to be added without massive boilerplate.
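For context, the pattern looks roughly like this (a sketch with made-up class and argument names, not the actual `StackRecorder` signature):

```ruby
class SomeProfilingComponent
  # Production code must spell out every argument explicitly...
  def initialize(cpu_time_enabled:, heap_samples_enabled:, heap_clean_after_gc_enabled:)
    @cpu_time_enabled = cpu_time_enabled
    @heap_samples_enabled = heap_samples_enabled
    @heap_clean_after_gc_enabled = heap_clean_after_gc_enabled
  end

  # ...while specs go through this helper, so adding a new required keyword
  # argument only needs a default added here instead of touching every test.
  def self.for_testing(cpu_time_enabled: true, heap_samples_enabled: false, heap_clean_after_gc_enabled: false)
    new(
      cpu_time_enabled: cpu_time_enabled,
      heap_samples_enabled: heap_samples_enabled,
      heap_clean_after_gc_enabled: heap_clean_after_gc_enabled,
    )
  end
end
```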
This new setting, off by default, will allow us to easily test the feature and decide on a rollout strategy.
We're no longer regularly looking at this data, so let's trim down the number of variants we're using a bit.
This will allow us to validate the expected performance improvements provided by this feature.
Codecov Report

@@            Coverage Diff             @@
##           master    #4020      +/-   ##
==========================================
- Coverage   97.86%   97.85%    -0.01%
==========================================
  Files        1319     1319
  Lines       79144    79271      +127
  Branches     3927     3929        +2
==========================================
+ Hits        77451    77572      +121
- Misses       1693     1699        +6
LGTM! Nice test suite 👍
// TODO: Discuss with Alex -- should we separate these out between young objects only and full updates?
// Tracking them together in this way seems to be muddying the waters -- young object updates will have more objects
// skipped, and different mixes of alive/dead
Yes, good point.
Split the stats in dd653ce
@@ -330,6 +327,8 @@ static VALUE _native_new(VALUE klass) {
  // Note: Any exceptions raised from this note until the TypedData_Wrap_Struct call will lead to the state memory
  // being leaked.

  state->heap_clean_after_gc_enabled = true;
Given that the current default for this is false, why are we setting it to true here? I know it'll get overridden by `_native_initialize`, but it seems unnecessary?
Yeah, you're right, in hindsight this may be a bit misleading. Changed in 6494eee .
# Let's replace the test_object reference with another object, so that the original one can be GC'd
test_object = Object.new # rubocop:disable Lint/UselessAssignment

# Force an update to happen on the next GC
Datadog::Profiling::StackRecorder::Testing._native_heap_recorder_reset_last_update(recorder)

GC.start

test_object_id
From previous flakes, we may need to redo this a bunch of times, just in case we have a ref in a register somewhere keeping it alive?
Yeah, I wondered about that as well.
In this case I'm explicitly allocating a new object, changing the local variable, and doing all of this in a separate method, which was somewhat deliberate as "a bunch of random actions to hopefully flush any hidden/leftover references".
I guess we'll see if we get lucky or not?
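For what it's worth, one way to harden this kind of spec against such flakes (a sketch only, not what the spec currently does; it relies on `ObjectSpace._id2ref` raising `RangeError` for collected ids) would be to retry the GC a few times before giving up:

```ruby
# Returns true once the object behind object_id has actually been collected,
# retrying GC a few times in case a stray reference (e.g. in a register or on
# the stack) keeps the object alive for an extra cycle.
def wait_until_collected(object_id, attempts: 10)
  attempts.times do
    GC.start
    begin
      ObjectSpace._id2ref(object_id)
    rescue RangeError
      return true # the id no longer maps to a live object
    end
  end
  false
end
```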
# Let's replace the test_object reference with another object, so that the original one can be GC'd
test_object = Object.new # rubocop:disable Lint/UselessAssignment
GC.start
Similar flake potential as above?
(See my note on the other one 😄 )
Note: Once this PR is merged, the new …
This avoids "muddying" the waters for our stats, as young object updates will have more objects skipped, and different mixes of alive/dead.
LEBTBTM (Looks Even Better Than Before To Me)
…ault

**What does this PR do?**

This PR changes the optimization added in #4020 to be enabled by default.

I've collected a fresh set of benchmarking results for this feature in [this google doc](https://docs.google.com/document/d/143jmyzB7rMJ9W2hKN0JoDbjo2m3oCVCzvPToHVjLRAM/edit?tab=t.0#heading=h.f00wz5x8kwg6). The TL;DR is that the results are very close: sometimes we slightly improve things, but often the numbers are too close to tell. On the other hand, this also means there are no regressions, and thus no reason not to enable the feature by default.

**Motivation:**

As a recap, without this optimization, the Ruby heap profiler works by sampling allocated objects, collecting and keeping metadata about these objects (stack trace, etc.). Then, at serialization time (every 60 seconds), the profiler checks which objects are still alive; any objects still alive get included in the heap profile, and any objects that have since been garbage collected get their metadata dropped.

The above scheme has a weak point: some objects are allocated and almost immediately garbage collected. Because the profiler only checks for object liveness at serialization time, in the extreme an object born and collected at the beginning of the profiling period can still be tracked for almost 60 seconds until the profiler finally figures out that the object is no longer alive. This has two consequences:

1. The profiler uses more memory, since it's collecting metadata for already-dead objects
2. The profiler has more work to do at the end of the 60-second period, as it needs to check an entire 60 seconds of sampled objects

The heap profiling clean after GC optimization adds an extra mechanism that, based on Ruby GC activity, triggers periodic checking of young objects (i.e. objects that have been alive for few GC generations). Thus:

a. The profiler identifies and clears garbage objects faster, and so overall needs less memory
b. The profiler has less work to do at the end of the 60-second period, trading it off for a smaller periodic pass

**Additional Notes:**

I've also removed the separate benchmarking configuration, to avoid having too many long-running benchmarking variants.

**How to test the change?**

I've updated the specs for the setting, and the optimization itself has existing test coverage that was added back in #4020.
… found

**What does this PR do?**

This PR fixes a bug introduced in #4020, specifically in f581076. We started using `rb_obj_info` to print debug information about objects in some cases, BUT I failed to notice that this API is not really available on Ruby 2.5 and 3.3 (but is on all others, which is why it tripped me).

This manifested in the following error reported by a customer:

> WARN -- datadog: [datadog] Profiling was requested but is not supported, profiling disabled: There was an error loading the profiling native extension due to 'RuntimeError Failure to load datadog_profiling_native_extension.3.3.5_x86_64-linux-musl due to Error relocating /app/vendor/bundle/ruby/3.3.0/gems/datadog-2.6.0/lib/datadog/profiling/../../datadog_profiling_native_extension.3.3.5_x86_64-linux-musl.so: rb_obj_info: symbol not found' at '/app/vendor/bundle/ruby/3.3.0/gems/datadog-2.6.0/lib/datadog/profiling/load_native_extension.rb:41:in `<main>''

This PR fixes the issue by never referencing `rb_obj_info` on those Rubies. Since this API is only used for printing information during errors, this should be fine (and is better than the alternative of not printing info on any Rubies).

**Motivation:**

Fix profiling not loading in certain situations on Ruby 2.5 and 3.3.

**Additional Notes:**

Interestingly, this issue did not show up on glibc systems. I guess musl libc is a bit more eager about trying to resolve symbols?

**How to test the change?**

This change includes test coverage. Disabling the added check in `extconf.rb` will produce a failing test.
Change log entry
Add setting to lower heap profiling memory use/latency by cleaning up young objects after Ruby GC
What does this PR do?
This PR adds a new mechanism to lower heap profiling overhead (memory and latency): the ability to clean young objects being tracked by the heap profiler after a Ruby GC cycle runs, rather than waiting for serialization time.
This mechanism is currently off by default and is controlled by the setting `c.profiling.advanced.heap_clean_after_gc_enabled` (environment variable `DD_PROFILING_HEAP_CLEAN_AFTER_GC_ENABLED`).
We plan to run a bit more validation before enabling it by default.
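For reference, opting in from code looks like this (assuming profiling is otherwise enabled; setting the `DD_PROFILING_HEAP_CLEAN_AFTER_GC_ENABLED` environment variable to `true` is the equivalent):

```ruby
Datadog.configure do |c|
  c.profiling.enabled = true
  # Opt into the best-effort heap cleanup after GC (currently off by default)
  c.profiling.advanced.heap_clean_after_gc_enabled = true
end
```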
Motivation:
By doing these cleanups, we lower memory usage (we don't need to wait until the next serialization to notice an object was already collected) and latency, because there's less work to be done at serialization time.
Additional Notes:
This PR started from #3906; the main change from that earlier PR is that the heap recorder clean-after-GC behavior has been changed to be a best-effort mechanism (i.e. it's optional and not required for correctness).
How to test the change?
This change includes test coverage. I've also added a new benchmarking configuration to evaluate this new setting using our usual harness.