-
Notifications
You must be signed in to change notification settings - Fork 381
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[PROF-9476] Reduce heap profiling overhead by using libdatadog's managed string storage #4331
[PROF-9476] Reduce heap profiling overhead by using libdatadog's managed string storage #4331
Conversation
Still reviewing what to do with them
At some point in the past, class name gathering was optional. This is now removed, but we still had this vestigial `optional_class_name` variable + passing a pointer to it, rather than passing the `class_name` as a value. This commit cleans this up and makes all APIs require/assume there's a `class_name` for an allocated object. (This enables a few other cleanups I want to make next.)
… a pointer Having a pointer back to the stack recorder memory holding the string storage seems a bit weird, although this is a small detail.
**What does this PR do?** This PR updates the crasktracker C code to build with the latest libdatadog changes (in main) that will become part of libdatadog 15. **Motivation:** I'm working on a libdatadog branch and had to do this to unblock my work, so I decided I'll create a PR with it so nobody needs to repeat this work. **Additional Notes:** I'm opening this PR as draft as we shouldn't merge this until libdatadog 15 is out. **How to test the change?** Existing test coverage is enough to validate this.
This is shared by both the stack recorder as well as the heap recorder, so there's no reason for it to live in either and should instead live in a more common location. (I also expect usage of this API may spread to more components in the future.)
A lot of these comments were leftovers from when the last thing we did in `_native_new` was the call to `TypedData_Wrap_Struct`. Refactoring work done over time, as well as evolution of the libdatadog APIs meant we could allocate the Ruby object sooner, and thus a lot of these comments became not relevant.
We don't actually need to keep re-interning these strings in the current version of the string storage; I suspect this was a leftover of the time when the string storage automatically cleaned up strings that were not used in a certain generation, but this is no longer the case. In fact, now that libdatadog relies on counting intern/unintern calls to know which strings to keep, calling `intern` forever on these strings would eventually lead to an overflow, and libdatadog would start returning errors on the calls to `intern`. (In turn, on our side, we'd raise the exception and thus stop profiling which is not amazing)
…ly ids now With the move to interned ids in libdatadog, the same stacks are represented by the same ids, which means the same underlying bytes. Thus, rather than having per-item comparisons and whatnot (which we needed because before we had pointers to strings, and strings were not interned/unique), we can now compare and hash the whole stack at once as a binary blob. This is crucially missing tests! I'll come back to add them, we definitely want some test coverage on this key element.
…ide heap_stack In the current state of the heap profiler, a heap_record is 1:1 with a heap_stack so let's consolidate them.
…to heap_record Refactoring complete, now only `heap_record` is left.
This test no longer makes sense, since it was testing that the two different approaches we had to stack hashing were correct and in sync, and all this was removed now that we use libdatadog's managed string storage.
Since these stacks go through a very different code path, these tests validate that the round trip goes fine.
**What does this PR do?** This PR includes the changes documented in the "Releasing a new version to rubygems.org" part of the README: https://github.com/datadog/libdatadog/tree/main/ruby#releasing-a-new-version-to-rubygemsorg **Motivation:** Enable Ruby to use libdatadog v16.0.1. Of particular interest, this includes improvements to crashtracking and the managed string table needed by DataDog/dd-trace-rb#4331 . **Additional Notes:** N/A **How to test the change?** I've tested this release locally using the changes in DataDog/dd-trace-rb#4353 . As a reminder, new libdatadog releases don't get automatically picked up by dd-trace-rb, so the PR that bumps the Ruby profiler will also test this release against all supported Ruby versions.
**What does this PR do?** This PR includes the changes documented in the "Releasing a new version to rubygems.org" part of the README: https://github.com/datadog/libdatadog/tree/main/ruby#releasing-a-new-version-to-rubygemsorg **Motivation:** Enable Ruby to use libdatadog v16.0.1. Of particular interest, this includes improvements to crashtracking and the managed string table needed by DataDog/dd-trace-rb#4331 . **Additional Notes:** N/A **How to test the change?** I've tested this release locally using the changes in DataDog/dd-trace-rb#4353 . As a reminder, new libdatadog releases don't get automatically picked up by dd-trace-rb, so the PR that bumps the Ruby profiler will also test this release against all supported Ruby versions.
**What does this PR do?** This PR upgrades the datadog gem to use libdatadog 16.0.1. It includes a few changes to match breaking API updates in crashtracking. **Motivation:** Libdatadog 16 is needed to unblock #4331 . This version also brings a few crashtracking improvements. **Change log entry** Yes. Upgrade libdatadog dependency to 16.0.1 **Additional Notes:** As usual, I'm opening this PR as a draft as libdatadog 16.0.1 is not yet available on rubygems.org, and I'll come back to re-trigger CI and mark this as non-draft once it is. **How to test the change?** Our existing test coverage includes libdatadog testing, so a green C is good here :)
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #4331 +/- ##
==========================================
+ Coverage 97.75% 97.77% +0.01%
==========================================
Files 1351 1362 +11
Lines 82684 84824 +2140
Branches 4197 4411 +214
==========================================
+ Hits 80827 82933 +2106
- Misses 1857 1891 +34 ☔ View full report in Codecov by Sentry. |
Rather than comparing intern with intern_all (less useful now that DataDog/libdatadog#844 is merged), let's keep this around as a more generic measurement of managed string storage performance.
👍 Updated PR description with results! |
**What does this PR do?** This reverts commit 569bf56. **Motivation:** We temporarily disabled this benchmark in #4331 for the reasons explained in the comment. Once that PR is in master we can re-enable the benchmark. **Additional Notes:** This sharp edge of our benchmarks setup is annoying, but I'll fight it another day... **How to test the change?** Validate this benchmark is now running in the benchmarking platform results page.
BenchmarksBenchmark execution time: 2025-02-18 11:32:43 Comparing candidate commit de97855 in PR branch Found 0 performance improvements and 0 performance regressions! Performance is the same for 33 metrics, 2 unstable metrics. |
Given our current benchmark results, we've decided to use some of the headroom we gained with the managed string storage branch to improve heap sampling results. Thus, we'll take a heap sample for every object that the allocation profiler samples, rather than every 10th object. We still expose this control -- via both code and environment variable, so it can be reset back to 10 on a case-by-case basis. BUT if you're considering raising this value, please let us know about it -- we'd love to know your use-case.
Update: As discussed, I've pushed a commit to set |
What does this PR do?
This PR changes the heap profiler to use the new libdatadog managed string storage API added in DataDog/libdatadog#725 . (And thanks to Alex for starting this work in #3628 )
Prior to this PR, the heap profiler was doing all of the work of keeping stack traces around for the objects being tracked. This involved keeping copies of all strings for path / function names, and also having a clever deduplication mechanism that allowed us to dedup stack traces without first copying all of the strings.
The libdatadog managed string storage allows us to simplify this work by giving us a string table we can use to store strings and trade them for long-lived ids. Then these ids can be used then adding samples to libdatadog, instead of direct pointers to the strings.
This has a bunch of advantages:
We were not previously deduplicating strings in the heap profiler, only whole stacks.
Thus if we had
/path/a.rb:method_a
->/path/b.rb:method_b
->/path/c.rb:method_c
multiple times, we would deduplicate them. But if we also had/path/a.rb:method_a
->/path/b.rb:method_b
, then we'd store yet another copy of these strings. This is no longer the case when using the new API.Representing stacks as arrays of ids means deduplication, hashing and comparison is much faster -- we're only comparing ids and not the contents of strings.
When feeding stacks again to libdatadog, it's cheaper to use ids since libdatadog doesn't need to keep re-hashing the strings to de-duplicate them; it can use the ids to do that.
This does mean that this PR is a bit... on the noisy side in terms of changes to the heap profiler. Specifically, it takes advantage of most of the simplification opportunities that managed string stable allows:
heap_record
andheap_stack
structures as they no longer need to exist independently, and then rips out as much code as possible based on that mergeUpdate: As discussed, I've pushed a commit to set
experimental_heap_sample_rate
to 1, raising the heap sampling rate from its previous default of 10.Motivation:
The key objective of this work is allowing us to reduce the memory and cpu overhead of heap profiling.
Earlier versions of this PR already showed reduced memory usage. I decided to open this PR already and do benchmarking in parallel, so I don't yet have updated numbers yet.See below.
Change log entry
Yes. Reduce overhead of heap profiling
Additional Notes:
Currently CI is broken for this PR because it depends on DataDog/libdatadog#844 which has not yet been merged into libdatadog.
Obviously I plan to make sure that PR is in libdatadog, and that master is using the latest libdatadog before merging this PR, but I don't think it makes sense to hold off on review until that's the case.
How to test the change?
The existing test coverage was already quite reasonable. I added a few more tests to make sure that our stack recording/retrieval/deduping was in good shape, as well as a benchmark for the efficiency of the interning operations.
Benchmarks:
I wrote up my results in this doc (Datadog-internal). Here's a quick recap of what I saw.
High load gitlab:
Here, we’re comparing 4 configurations:
Rails app:
Here, we’re comparing 5 configurations: