[PROF-9470] Align heap recorder cleanup with GC activity (second try) #4020
Conversation
[PROF-9470] Remove unneeded include
[PROF-9470] Cleanup after GC, major/minor based
To simplify the heap cleanup after GC PR, I'm changing the mechanism to be best-effort, i.e. something that exists to reduce memory and (hopefully) serialization latency, but that is not required for correctness. Since the mechanism becomes best-effort, we no longer need to track that we missed an update.
This commit changes the "heap cleanup after GC" to be best-effort:
* The stack recorder always triggers a full heap update before serialization (so I removed the logic we were using to make a decision here)
* There's a new `heap_recorder_update_young_objects` method that is used to trigger the best-effort pass on young objects

I realize we're at a weird mid-point: by making the mechanism best-effort, it also means that if it stops working, we won't notice. I plan to add tests to catch this once I finish refactoring this functionality.
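To make the shape of this more concrete, here's a minimal toy sketch in Ruby of the same idea (not the actual C heap recorder; `ToyHeapRecorder` and its method names are made up for illustration), showing why the after-GC pass can be best-effort while the pass before serialization stays authoritative:

```ruby
require "weakref"

# Toy model: sampled objects are tracked via weak references, so liveness can
# be checked without the recorder itself keeping the objects alive.
class ToyHeapRecorder
  def initialize
    @tracked = [] # pairs of [weak reference, sample metadata]
  end

  def track(object, metadata)
    @tracked << [WeakRef.new(object), metadata]
  end

  # Best-effort pass (same spirit as heap_recorder_update_young_objects):
  # may run after a GC; if it never runs, correctness is unaffected.
  def update_young_objects
    @tracked.reject! { |ref, _| !ref.weakref_alive? }
  end

  # Full pass before serialization: only still-alive objects get reported.
  def prepare_iteration
    @tracked.reject! { |ref, _| !ref.weakref_alive? }
    @tracked.map { |_, metadata| metadata }
  end
end
```

(The real recorder filters by object age rather than scanning every entry; the toy version only shows why skipping the after-GC pass can never affect the serialized result.)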
This was broken at some point during our experiments (it's always `false`) and is no longer needed now that the heap cleanup after GC is best-effort.
… comes from outside

This allows us to simplify some of our logic and metrics, since we're no longer deciding whether to do a full pass or not in this method.
This method becomes an internal detail that gets used either via:
* `heap_recorder_update_young_objects` => best-effort cleanup of young objects
* `heap_recorder_prepare_iteration` => full pass before serialization
Now that we know there's going to be a full update before serialization, let's skip doing object size measurements during young object updates. Doing this speeds up young object updates (less work), and shouldn't particularly slow down full updates, since those redo object size measurements anyway.
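Continuing the toy sketch from above (and using `ObjectSpace.memsize_of` purely as a stand-in for the real size measurement, which lives in the native extension), the split would look roughly like this:

```ruby
require "objspace"
require "weakref"

# Hypothetical helper: young-object updates skip the comparatively expensive
# size measurement; the full pass before serialization re-measures every
# still-alive object anyway, so nothing is lost by skipping it earlier.
def refresh_tracked(entries, measure_sizes:)
  entries.each do |ref, metadata|
    next unless ref.weakref_alive?
    metadata[:size] = ObjectSpace.memsize_of(ref.__getobj__) if measure_sizes
  end
end

# refresh_tracked(tracked, measure_sizes: false) # after GC: cheap, best-effort
# refresh_tracked(tracked, measure_sizes: true)  # before serialization: full
```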
When the heap recorder was tracking internal VM objects, because such objects have no klass, they cannot be inspected using the regular APIs. Trying to do so results in a segfault, as Ruby does not check before trying to access that info. To work around the issue, make sure to use the VM's special object printing API for debugging (`rb_obj_info`), which handles these cases fine. (Note that this issue only affected tests, because that's the only situation where we attempt to print object information.)
Because heap cleanup after GC is a best-effort mechanism, we need to have some kind of test that ensures it runs, otherwise we could accidentally disable or break it and never realize.
This ensures we have coverage for this behavior; otherwise, if our age filtering were incorrect, we wouldn't be able to spot it.
This mirrors what we've done for the `Collectors::ThreadContext`: by introducing a `.for_testing` helper with a bunch of defaults, it's easier to add new arguments to the class without having to fix a bunch of tests to pass in the new extra argument, while at the same time still forcing production code to pass in that argument.
We've done this conversion for other classes, and it seems like a much nicer way of allowing new arguments to be added without massive boilerplate.
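For context, the pattern looks roughly like this (a sketch with made-up class and argument names, not the actual `StackRecorder` signature):

```ruby
class SomeProfilingComponent
  # Production code must spell out every argument explicitly...
  def initialize(cpu_time_enabled:, heap_samples_enabled:, heap_clean_after_gc_enabled:)
    @cpu_time_enabled = cpu_time_enabled
    @heap_samples_enabled = heap_samples_enabled
    @heap_clean_after_gc_enabled = heap_clean_after_gc_enabled
  end

  # ...while specs go through this helper, so adding a new required keyword
  # argument only needs a default added here instead of touching every test.
  def self.for_testing(cpu_time_enabled: true, heap_samples_enabled: false, heap_clean_after_gc_enabled: false)
    new(
      cpu_time_enabled: cpu_time_enabled,
      heap_samples_enabled: heap_samples_enabled,
      heap_clean_after_gc_enabled: heap_clean_after_gc_enabled,
    )
  end
end
```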
This new setting, off by default, will allow us to easily test the feature and decide on a rollout strategy.
We're no longer regularly looking at this data, so let's trim down the number of variants we're using a bit.
This will allow us to validate the expected performance improvements provided by this feature.
Codecov Report

@@            Coverage Diff             @@
##           master    #4020      +/-   ##
==========================================
- Coverage   97.86%   97.85%    -0.01%
==========================================
  Files        1319     1319
  Lines       79144    79271      +127
  Branches     3927     3929        +2
==========================================
+ Hits        77451    77572      +121
- Misses       1693     1699        +6
LGTM! Nice test suite 👍
// TODO: Discuss with Alex -- should we separate these out between young objects only and full updates?
// Tracking them together in this way seems to be muddying the waters -- young object updates will have more objects
// skipped, and different mixes of alive/dead
Yes, good point.
Split the stats in dd653ce
@@ -330,6 +327,8 @@ static VALUE _native_new(VALUE klass) {
  // Note: Any exceptions raised from this note until the TypedData_Wrap_Struct call will lead to the state memory
  // being leaked.

  state->heap_clean_after_gc_enabled = true;
Given that the current default for this is false, why are we setting it to true here? I know it'll get overridden by `_native_initialize`, but it seems unnecessary?
Yeah, you're right, in hindsight this may be a bit misleading. Changed in 6494eee .
# Let's replace the test_object reference with another object, so that the original one can be GC'd
test_object = Object.new # rubocop:disable Lint/UselessAssignment

# Force an update to happen on the next GC
Datadog::Profiling::StackRecorder::Testing._native_heap_recorder_reset_last_update(recorder)

GC.start

test_object_id
From previous flakes, we may need to redo this a bunch of times, just in case we have a ref in a register somewhere keeping it alive?
Yeah, I wondered about that as well.
In this case I'm explicitly allocating a new object, changing the local variable, and doing all of this in a separate method, which was somewhat deliberate as "a bunch of random actions to hopefully flush any hidden/leftover references".
I guess we'll see if we get lucky or not?
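For what it's worth, one way to harden this kind of spec against such flakes (a sketch only, not what the spec currently does; it relies on `ObjectSpace._id2ref` raising `RangeError` for collected ids) would be to retry the GC a few times before giving up:

```ruby
# Returns true once the object behind object_id has actually been collected,
# retrying GC a few times in case a stray reference (e.g. in a register or on
# the stack) keeps the object alive for an extra cycle.
def wait_until_collected(object_id, attempts: 10)
  attempts.times do
    GC.start
    begin
      ObjectSpace._id2ref(object_id)
    rescue RangeError
      return true # the id no longer maps to a live object
    end
  end
  false
end
```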
# Let's replace the test_object reference with another object, so that the original one can be GC'd
test_object = Object.new # rubocop:disable Lint/UselessAssignment
GC.start
Similar flake potential as above?
(See my note on the other one 😄 )
Note: Once this PR is merged, the new …
This avoids "muddying" the waters for our stats, as young object updates will have more objects skipped, and different mixes of alive/dead.
LEBTBTM (Looks Even Better Than Before To Me)
…ault

**What does this PR do?**

This PR changes the optimization added in #4020 to be enabled by default.

I've collected a fresh set of benchmarking results for this feature in [this google doc](https://docs.google.com/document/d/143jmyzB7rMJ9W2hKN0JoDbjo2m3oCVCzvPToHVjLRAM/edit?tab=t.0#heading=h.f00wz5x8kwg6). The TL;DR is that the results are very close: sometimes we slightly improve things, but often the numbers are too close to tell. On the other hand, this also means there are no regressions, and thus no reason not to enable the feature by default.

**Motivation:**

As a recap, without this optimization, the Ruby heap profiler works by sampling allocated objects, collecting and keeping metadata about these objects (stack trace, etc.). Then, at serialization time (every 60 seconds), the profiler checks which objects are still alive; any objects still alive get included in the heap profile, and any objects that have since been garbage collected get their metadata dropped.

The above scheme has a weak point: some objects are allocated and almost immediately garbage collected. Because the profiler only checks for object liveness at serialization time, in the extreme an object born and collected at the beginning of the profiling period can still be tracked for almost 60 seconds until the profiler finally figures out that the object is no longer alive. This has two consequences:

1. The profiler uses more memory, since it's collecting metadata for already-dead objects
2. The profiler has more work to do at the end of the 60-second period, as it needs to check an entire 60 seconds of sampled objects

The heap profiling clean after GC optimization adds an extra mechanism that, based on Ruby GC activity, triggers periodic checking of young objects (i.e. objects that have been alive for few GC generations). Thus:

a. The profiler identifies and clears garbage objects faster, and so overall needs less memory
b. The profiler has less work to do at the end of the 60-second period, trading it off for a smaller periodic pass

**Additional Notes:**

I've also removed the separate benchmarking configuration, to avoid having too many long-running benchmarking variants.

**How to test the change?**

I've updated the specs for the setting, and the optimization itself has existing test coverage that was added back in #4020.
… found

**What does this PR do?**

This PR fixes a bug introduced in #4020, specifically in f581076. We started using `rb_obj_info` to print debug information about objects in some cases, BUT I failed to notice that this API is not really available on Ruby 2.5 and 3.3 (but is on all others, which is why it tripped me).

This manifested in the following error reported by a customer:

> WARN -- datadog: [datadog] Profiling was requested but is not supported, profiling disabled: There was an error loading the profiling native extension due to 'RuntimeError Failure to load datadog_profiling_native_extension.3.3.5_x86_64-linux-musl due to Error relocating /app/vendor/bundle/ruby/3.3.0/gems/datadog-2.6.0/lib/datadog/profiling/../../datadog_profiling_native_extension.3.3.5_x86_64-linux-musl.so: rb_obj_info: symbol not found' at '/app/vendor/bundle/ruby/3.3.0/gems/datadog-2.6.0/lib/datadog/profiling/load_native_extension.rb:41:in `<main>''

This PR fixes the issue by never referencing `rb_obj_info` on those Rubies. Since this API is only used for printing information during errors, this should be fine (and is better than the alternative of not printing info on any Rubies).

**Motivation:**

Fix profiling not loading in certain situations on Ruby 2.5 and 3.3.

**Additional Notes:**

Interestingly, this issue did not show up on glibc systems. I guess musl libc is a bit more eager about trying to resolve symbols?

**How to test the change?**

This change includes test coverage. Disabling the added check in `extconf.rb` will produce a failing test.
Change log entry
Add setting to lower heap profiling memory use/latency by cleaning up young objects after Ruby GC
What does this PR do?
This PR adds a new mechanism to lower heap profiling overhead (memory and latency): the ability to clean young objects being tracked by the heap profiler after a Ruby GC cycle runs, rather than waiting for serialization time.
This mechanism is currently off by default and is controlled by the setting `c.profiling.advanced.heap_clean_after_gc_enabled` (environment variable `DD_PROFILING_HEAP_CLEAN_AFTER_GC_ENABLED`).
We plan to run a bit more validation before enabling it by default.
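For reference, opting in from code looks like this (assuming profiling is otherwise enabled; setting the `DD_PROFILING_HEAP_CLEAN_AFTER_GC_ENABLED` environment variable to `true` is the equivalent):

```ruby
Datadog.configure do |c|
  c.profiling.enabled = true
  # Opt into the best-effort heap cleanup after GC (currently off by default)
  c.profiling.advanced.heap_clean_after_gc_enabled = true
end
```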
Motivation:
By doing these cleanups, we lower memory usage (we don't need to wait until the next serialization to notice an object was already collected) and latency, because there's less work to be done at serialization time.
Additional Notes:
This PR started from #3906; the main change from that earlier PR is that the heap recorder clean-after-GC behavior has been changed to be a best-effort mechanism (i.e. it's optional and not required for correctness).
How to test the change?
This change includes test coverage. I've also added a new benchmarking configuration to evaluate this new setting using our usual harness.