Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PROF-9476] Reduce heap profiling overhead by using libdatadog's managed string storage #4331

Merged
merged 39 commits into from
Feb 18, 2025

Conversation

ivoanjo
Copy link
Member

@ivoanjo ivoanjo commented Jan 30, 2025

What does this PR do?

This PR changes the heap profiler to use the new libdatadog managed string storage API added in DataDog/libdatadog#725 . (And thanks to Alex for starting this work in #3628 )

Prior to this PR, the heap profiler was doing all of the work of keeping stack traces around for the objects being tracked. This involved keeping copies of all strings for path / function names, and also having a clever deduplication mechanism that allowed us to dedup stack traces without first copying all of the strings.

The libdatadog managed string storage allows us to simplify this work by giving us a string table we can use to store strings and trade them for long-lived ids. Then these ids can be used then adding samples to libdatadog, instead of direct pointers to the strings.

This has a bunch of advantages:

  1. We were not previously deduplicating strings in the heap profiler, only whole stacks.

    Thus if we had /path/a.rb:method_a -> /path/b.rb:method_b -> /path/c.rb:method_c multiple times, we would deduplicate them. But if we also had /path/a.rb:method_a -> /path/b.rb:method_b, then we'd store yet another copy of these strings. This is no longer the case when using the new API.

  2. Representing stacks as arrays of ids means deduplication, hashing and comparison is much faster -- we're only comparing ids and not the contents of strings.

  3. When feeding stacks again to libdatadog, it's cheaper to use ids since libdatadog doesn't need to keep re-hashing the strings to de-duplicate them; it can use the ids to do that.

This does mean that this PR is a bit... on the noisy side in terms of changes to the heap profiler. Specifically, it takes advantage of most of the simplification opportunities that managed string stable allows:

  • It merges heap_record and heap_stack structures as they no longer need to exist independently, and then rips out as much code as possible based on that merge
  • It simplifies stack comparison and hashing

Update: As discussed, I've pushed a commit to set experimental_heap_sample_rate to 1, raising the heap sampling rate from its previous default of 10.

Motivation:

The key objective of this work is allowing us to reduce the memory and cpu overhead of heap profiling.

Earlier versions of this PR already showed reduced memory usage. I decided to open this PR already and do benchmarking in parallel, so I don't yet have updated numbers yet.

See below.

Change log entry

Yes. Reduce overhead of heap profiling

Additional Notes:

Currently CI is broken for this PR because it depends on DataDog/libdatadog#844 which has not yet been merged into libdatadog.

Obviously I plan to make sure that PR is in libdatadog, and that master is using the latest libdatadog before merging this PR, but I don't think it makes sense to hold off on review until that's the case.

How to test the change?

The existing test coverage was already quite reasonable. I added a few more tests to make sure that our stack recording/retrieval/deduping was in good shape, as well as a benchmark for the efficiency of the interning operations.


Benchmarks:

I wrote up my results in this doc (Datadog-internal). Here's a quick recap of what I saw.

High load gitlab:

Here, we’re comparing 4 configurations:

  • baseline: Unmodified application, no profiling at all
  • only-profiling-heap: Application running cpu + wall-time + allocation + heap profiling, without managed string storage; using heap sampling defaults
  • only-profiling-heap-mss: Same as above, but using managed string storage branch
  • only-profiling-heap-mss-sr1: Same as above, but with heap profiling sampling 1:1 with allocation profiler

Configuration Min memory RSS (MiB) Max memory RSS (MiB) Avg memory RSS (MiB) AVG virtual memory (GiB)
baseline worker 0 893 908 900 1.4
baseline worker 1 894 911 902 1.4
only-profiling-heap worker 0 959 985 972 1.7
only-profiling-heap worker 1 959 980 970 1.7
only-profiling-heap-mss worker 0 937 964 955 1.6
only-profiling-heap-mss worker 1 934 962 951 1.6
only-profiling-heap-mss-sr1 worker 0 956 980 968 1.6
only-profiling-heap-mss-sr1 worker 1 946 979 965 1.6

image

Rails app:

Here, we’re comparing 5 configurations:

  • baseline: Unmodified application, no profiling at all
  • candidate-heap-profiling: Application running cpu + wall-time + allocation + heap profiling, without managed string storage; using heap sampling defaults
  • candidate-heap-profiler-mss: Same as above, but using managed string storage branch
  • candidate-heap-profiler-mss-sr1: Same as above, but with heap profiling sampling 1:1 with allocation profiler
  • candidate-heap-profiling-sr1: Additional data point: what if we enabled 1:1 sampling the allocation profiler without the managed string storage?

image

image

AlexJF and others added 29 commits January 27, 2025 09:35
Still reviewing what to do with them
At some point in the past, class name gathering was optional. This is
now removed, but we still had this vestigial `optional_class_name`
variable + passing a pointer to it, rather than passing the `class_name`
as a value.

This commit cleans this up and makes all APIs require/assume there's a
`class_name` for an allocated object.

(This enables a few other cleanups I want to make next.)
… a pointer

Having a pointer back to the stack recorder memory holding the string
storage seems a bit weird, although this is a small detail.
**What does this PR do?**

This PR updates the crasktracker C code to build with the latest
libdatadog changes (in main) that will become part of libdatadog 15.

**Motivation:**

I'm working on a libdatadog branch and had to do this to unblock
my work, so I decided I'll create a PR with it so nobody needs to
repeat this work.

**Additional Notes:**

I'm opening this PR as draft as we shouldn't merge this until
libdatadog 15 is out.

**How to test the change?**

Existing test coverage is enough to validate this.
This is shared by both the stack recorder as well as the heap recorder,
so there's no reason for it to live in either and should instead live
in a more common location.

(I also expect usage of this API may spread to more components in
the future.)
A lot of these comments were leftovers from when the last thing we did
in `_native_new` was the call to `TypedData_Wrap_Struct`.

Refactoring work done over time, as well as evolution of the libdatadog
APIs meant we could allocate the Ruby object sooner, and thus a lot
of these comments became not relevant.
We don't actually need to keep re-interning these strings in the
current version of the string storage; I suspect this was a leftover
of the time when the string storage automatically cleaned up strings
that were not used in a certain generation, but this is no longer
the case.

In fact, now that libdatadog relies on counting intern/unintern
calls to know which strings to keep, calling `intern` forever on
these strings would eventually lead to an overflow, and libdatadog
would start returning errors on the calls to `intern`.

(In turn, on our side, we'd raise the exception and thus stop
profiling which is not amazing)
…ly ids now

With the move to interned ids in libdatadog, the same stacks are
represented by the same ids, which means the same underlying bytes.

Thus, rather than having per-item comparisons and whatnot (which we
needed because before we had pointers to strings, and strings were not
interned/unique), we can now compare and hash the whole stack at once
as a binary blob.

This is crucially missing tests! I'll come back to add them, we
definitely want some test coverage on this key element.
…ide heap_stack

In the current state of the heap profiler, a heap_record is 1:1 with
a heap_stack so let's consolidate them.
…to heap_record

Refactoring complete, now only `heap_record` is left.
This test no longer makes sense, since it was testing that the
two different approaches we had to stack hashing were correct and in
sync, and all this was removed now that we use libdatadog's
managed string storage.
Since these stacks go through a very different code path, these tests
validate that the round trip goes fine.
@ivoanjo ivoanjo requested a review from a team as a code owner January 30, 2025 13:46
ivoanjo added a commit to DataDog/libdatadog that referenced this pull request Feb 7, 2025
**What does this PR do?**

This PR includes the changes documented in the "Releasing a new version
to rubygems.org" part of the README:
https://github.com/datadog/libdatadog/tree/main/ruby#releasing-a-new-version-to-rubygemsorg

**Motivation:**

Enable Ruby to use libdatadog v16.0.1. Of particular interest, this
includes improvements to crashtracking and the managed string table
needed by DataDog/dd-trace-rb#4331 .

**Additional Notes:**

N/A

**How to test the change?**

I've tested this release locally using the changes in
DataDog/dd-trace-rb#4353 .

As a reminder, new libdatadog releases don't get automatically picked up
by dd-trace-rb, so the PR that bumps the Ruby profiler will also test
this release against all supported Ruby versions.
ivoanjo added a commit to DataDog/libdatadog that referenced this pull request Feb 10, 2025
**What does this PR do?**

This PR includes the changes documented in the "Releasing a new version
to rubygems.org" part of the README:
https://github.com/datadog/libdatadog/tree/main/ruby#releasing-a-new-version-to-rubygemsorg

**Motivation:**

Enable Ruby to use libdatadog v16.0.1. Of particular interest, this
includes improvements to crashtracking and the managed string table
needed by DataDog/dd-trace-rb#4331 .

**Additional Notes:**

N/A

**How to test the change?**

I've tested this release locally using the changes in
DataDog/dd-trace-rb#4353 .

As a reminder, new libdatadog releases don't get automatically picked up
by dd-trace-rb, so the PR that bumps the Ruby profiler will also test
this release against all supported Ruby versions.
ivoanjo added a commit that referenced this pull request Feb 10, 2025
**What does this PR do?**

This PR upgrades the datadog gem to use libdatadog 16.0.1.

It includes a few changes to match breaking API updates in
crashtracking.

**Motivation:**

Libdatadog 16 is needed to unblock
#4331 .

This version also brings a few crashtracking improvements.

**Change log entry**

Yes. Upgrade libdatadog dependency to 16.0.1

**Additional Notes:**

As usual, I'm opening this PR as a draft as libdatadog 16.0.1 is not
yet available on rubygems.org, and I'll come back to re-trigger CI
and mark this as non-draft once it is.

**How to test the change?**

Our existing test coverage includes libdatadog testing, so a green C
is good here :)
@codecov-commenter
Copy link

codecov-commenter commented Feb 12, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.77%. Comparing base (6021b3d) to head (de97855).
Report is 89 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4331      +/-   ##
==========================================
+ Coverage   97.75%   97.77%   +0.01%     
==========================================
  Files        1351     1362      +11     
  Lines       82684    84824    +2140     
  Branches     4197     4411     +214     
==========================================
+ Hits        80827    82933    +2106     
- Misses       1857     1891      +34     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Rather than comparing intern with intern_all (less useful now that
DataDog/libdatadog#844 is merged), let's keep
this around as a more generic measurement of managed string storage
performance.
@ivoanjo
Copy link
Member Author

ivoanjo commented Feb 13, 2025

Don't forget to update the changelog once benchmarking results are ready. (e.g. claim both memory and cpu improvements)

👍 Updated PR description with results!

ivoanjo added a commit that referenced this pull request Feb 13, 2025
**What does this PR do?**

This reverts commit 569bf56.

**Motivation:**

We temporarily disabled this benchmark in
#4331
for the reasons explained in the comment.

Once that PR is in master we can re-enable the benchmark.

**Additional Notes:**

This sharp edge of our benchmarks setup is annoying, but I'll fight
it another day...

**How to test the change?**

Validate this benchmark is now running in the benchmarking
platform results page.
@pr-commenter
Copy link

pr-commenter bot commented Feb 13, 2025

Benchmarks

Benchmark execution time: 2025-02-18 11:32:43

Comparing candidate commit de97855 in PR branch ivoanjo/prof-9476-managed-string-storage-try2 with baseline commit 6021b3d in branch master.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 33 metrics, 2 unstable metrics.

Given our current benchmark results, we've decided to use some of the
headroom we gained with the managed string storage branch to improve
heap sampling results.

Thus, we'll take a heap sample for every object that the allocation
profiler samples, rather than every 10th object.

We still expose this control -- via both code and environment variable,
so it can be reset back to 10 on a case-by-case basis.

BUT if you're considering raising this value, please let us know about
it -- we'd love to know your use-case.
@github-actions github-actions bot added the core Involves Datadog core libraries label Feb 18, 2025
@ivoanjo
Copy link
Member Author

ivoanjo commented Feb 18, 2025

Update: As discussed, I've pushed a commit to set experimental_heap_sample_rate to 1, raising the heap sampling rate from its previous default of 10.

@ivoanjo ivoanjo merged commit 744c421 into master Feb 18, 2025
524 checks passed
@ivoanjo ivoanjo deleted the ivoanjo/prof-9476-managed-string-storage-try2 branch February 18, 2025 13:22
@github-actions github-actions bot added this to the 2.11.0 milestone Feb 18, 2025
ivoanjo added a commit that referenced this pull request Feb 19, 2025
…ed-string-storage-try2"

This reverts commit 744c421, reversing
changes made to e9adecb.
ivoanjo added a commit that referenced this pull request Feb 19, 2025
…profiling

[PROF-11394] Revert changes to heap profiling in PRs #4401, #4376, #4331
ivoanjo added a commit that referenced this pull request Mar 5, 2025
…aged-string-storage-try2"

This reverts commit ded9fcb,
thus re-applying the changes from #4331.
@ivoanjo ivoanjo removed the core Involves Datadog core libraries label Mar 5, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
profiling Involves Datadog profiling
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants