Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

profiles/follow up: location references in sample #4307

Open
florianl opened this issue Mar 6, 2024 · 18 comments
Open

profiles/follow up: location references in sample #4307

florianl opened this issue Mar 6, 2024 · 18 comments
Labels
sig-issue A specific SIG should look into this before discussing at the spec spec:profiling Related to the specification/profiling directory

Comments

@florianl
Copy link
Contributor

florianl commented Mar 6, 2024

This is a follow up for open-telemetry/oteps#239 (comment) around message Sample and its use of location_index, locations_start_index and locations_length:

https://github.com/open-telemetry/oteps/blob/dc619dfc70f174ef31caf90f14e8b00600da4049/text/profiles/0239-profiles-data-model.md?plain=1#L518-L527

As an example, consider the following stack in a folded format:

foo;bar;baz 100
abc;def 200
foo;bar 300
abc;ghi 400
foo;bar;qux 500

Like in most stack traces, the base frames are similar, but there is a variation in the leaf frames. To reflect this, the last two traces use different leaf frames, ghi and qux.

Should the resulting sample look like the following?

sample:
  - locations_start_index: 0
    locations_length: 3
    value:
      - 100
  - locations_start_index: 3
    locations_length: 2
    value:
      - 200
  - locations_start_index: 0
    locations_length: 2
    value:
      - 300
  - locations_start_index: 5
    locations_length: 2
    value:
      - 400
  - locations_start_index: 7
    locations_length: 3
    value:
      - 500
location_indices:
  - 0 # foo
  - 1 # bar
  - 2 # baz
  - 3 # abc
  - 4 # def
  - 3 # abc 
  - 5 # ghi
  - 0 # foo
  - 1 # bar
  - 6 # qux
location:
  - line:
      - function_index: 0 # foo
  - line:
      - function_index: 1 # bar
  - line:
      - function_index: 2 # baz
  - line:
      - function_index: 3 # abc
  - line:
      - function_index: 4 # def
  - line:
      - function_index: 5 # ghi
   - line:
      - function_index: 6 # qux
function:
  - name: 1 # foo
  - name: 2 # bar
  - name: 3 # baz
  - name: 4 # abc
  - name: 5 # def
  - name: 6 # ghi
  - name: 7 # qux

In particular for deep stack traces with a high number of similar frames and where only leaf frames are different, the use of locations_start_index, locations_length with location_indices will get more complex than the (deprecated) location_index which just holds a list of IDs into the location table.

The original pprof message Sample does also not use the _start_index / _length approach. From my understanding all messages of type Sample within the same Profile groups stack traces from the same origin/with the same attributes.
For a different set of attributes, I think, a dedicated Profile should be preferred with its own attributes.

An alternative, to allow sharing Mapping, Location and Function information between stack traces with different attributes would be to move these three tables one layer up into ProfileContainer, so that they can be referenced from each Profile.

While the variety of leaf frames is usually high and attributes are often more static, can we remove the deprecated label from location_index in message Sample and let the user either set location_index or location_start_index with locations_length?

@arminru
Copy link
Member

arminru commented Mar 25, 2024

cc @open-telemetry/profiling-maintainers @open-telemetry/profiling-approvers

@mtwo
Copy link
Member

mtwo commented Mar 25, 2024

Comment from the OTel maintainer meeting: could / should this be moved to a comment on the current Profiling PR in the OTLP repository?

@florianl
Copy link
Contributor Author

This issue is linked in https://github.com/open-telemetry/opentelemetry-proto/pull/534/files#r1561128746. As this particular issue is relevant to the specification, I did open the issue in this repository.

@felixge
Copy link
Member

felixge commented Apr 11, 2024

Thanks for raising this.

In particular for deep stack traces with a high number of similar frames and where only leaf frames are different,

That should be most CPU profiles, right? @petethepig IIRC you had some benchmarks that showed the efficiency of this new encoding of stack traces. Did you use realistic CPU profiling data?

If this new approach is not a clear win in the majority of situations, we should remove it.

@florianl
Copy link
Contributor Author

That should be most CPU profiles, right?

Most of my experiments are with CPU profiles.
For profiles that focus on lock contention or memory allocation, the situation might be slightly different. I can imagine that for such profiles the leaf frame is similar more often. But the described problem should also in this case be the same, if memory allocations or lock acquisition happens in a stack with a high number of frames.

@athre0z
Copy link
Member

athre0z commented Apr 15, 2024

We could simply make both locations_start_index and locations_length repeated fields: this would allow implementations to de-duplicate prefixes and should be even more efficient than just listing all indices all the time.

For example if you had two traces that only vary in the leaf:

trace_foo:
  0) libc_entry_point
  1) main
  2) run_my_app
  3) do_fancy_stuff
  4) do_boring_stuff
  5) strcpy
  6) strlen

trace_bar:
  0) libc_entry_point
  1) main
  2) run_my_app
  3) do_fancy_stuff
  4) do_boring_stuff
  5) strcpy
  6) memcpy

Then you could create locations like so:

locations:
  0) libc_entry_point
  1) main
  2) run_my_app
  3) do_fancy_stuff
  4) do_boring_stuff
  5) strcpy
  6) strlen
  7) memcpy

And then encode the reference like this:

trace_foo:
  locations_start_index: [0]
  locations_length: [7]

trace_bar:
  locations_start_index: [0, 7]
  locations_length: [6, 1]

@felixge
Copy link
Member

felixge commented Apr 22, 2024

@athre0z interesting idea! Do you have an algorithm in mind for encoding the data in this way?

A bit of a meta comment: I think it's difficult to evaluate different stack trace encoding schemas without some alignment on how we value encoding vs decoding efficiency, compression, as well as overall complexity. Additionally I suspect that we're reinventing well-known tree encoding formats here (the above looks trie-ish?), and that there is a lot more prior art that we could explore.

@athre0z
Copy link
Member

athre0z commented Apr 22, 2024

Yeah, this is definitely tree-ish: we're essentially trying to encode a flamegraph tree efficiently. For optimal density we'd probably want some sort of prefix tree structure. That being said, I'm not sure whether we're willing to pay the compute price of maintaining one in the profiler.

The algorithm that I had in mind for use with the repeated fields falls more into the "simple and hopefully good enough" category: define some chunk size, split traces by that size and then keep a hash-LRU of chunks that we've seen previously. Should provide a good amount of dedup at very little compute / memory overhead. Implementations that wish to roll something fancier can do that as well.

@felixge
Copy link
Member

felixge commented Apr 22, 2024

Your algorithm sounds like it could work nicely. That being said, I see two paths forward:

  1. Try out new ideas like yours and make sure we get the evaluation right this time (it seems like we didn't the first time around).
  2. Go back to pprof's encoding, which is not ideal, but is simpler to encode/decode and keeps us more compatible with pprof.

What do you think?

@athre0z
Copy link
Member

athre0z commented Apr 26, 2024

I don't really have a strong opinion on this. Intuitively I'd guess that this location list may make up a significant portion of message size, but these things tend to be hard to guess. Makes me wish for a protobuf message size profiler that attributes size consumed to message fields. Bonus points if it could also do it for compressed message size!

Whether "keeping more compatible with pprof" is a priority, IMHO, depends on whether Google decides to donate the format to OTel or not. If pprof keeps evolving independently, then we'll find ourselves in a C/C++ kind of situation where newer C versions gained features that C++ doesn't have, and it'll just be pure pain to somehow keep things compatible. In that case I'd prefer to intentionally break compatibility to avoid the misconception of interoperability without transpiling.

@aalexand
Copy link
Member

Additionally I suspect that we're reinventing well-known tree encoding formats here (the above looks trie-ish?), and that there is a lot more prior art that we could explore.

One approach I used a couple of times for encoding a tree is with two arrays where one array represents index into the parent node and another array contains the reference to the actual payload. In case of pprof profile this would look something like

message Profile {
  ...
  // ID of the location corresponding to this calltree node.
  // Note that len(calltree_location_ids) == len(calltree_parent_indices).
  repeated uint64 calltree_location_ids = <id>;
  // Index of the parent calltree node or -1 if root.
  repeated int64 calltree_parent_indices = <id+1>;
}
...
message Sample {
  // Index of the calltree node as defined by
  // Profile.{calltree_location_ids,calltree_parent_indices} or -1 if no call stack.
  int64 calltree_index = <id>;
}

Note that the structure-of-arrays approach is used here to minimize allocations at protobuf serialization / deserialization level. Alternative array-of-structures approach would contain a single array of a special Calltree message type in the Profile message, but that would mean allocating a fair amount of small objects which is what the proposed approach tries to avoid.

One additional nice thing about this representation is that many data processing algorithms for profiling data operate on call trees and storing the data in this encoding even in memory is convenient enough. For example, something like focus filter in pprof which filters the flame graph to stacks where any frame matches a specified regexp can be implemented as an efficient two-pass algorithm where first proper calltree nodes are selected and then tree is traversed to drop nodes that are filtered out at the first pass or are a descendent of such node.

Just as a thought.

@rockdaboot
Copy link

Your algorithm sounds like it could work nicely. That being said, I see two paths forward:

1. Try out new ideas like yours and make sure we get the evaluation right this time (it seems like we didn't the first time around).

2. Go back to pprof's encoding, which is not ideal, but is simpler to encode/decode and keeps us more compatible with pprof.

What do you think?

As the protocol is mostly used for transport (and possibly storing), can we assume that compression like gzip or zstd is applied on top? Evaluation results should also contain message size with compression applied to see whether complexity of in-protocol optimization is worth it or not. Just saying because additional compression hasn't been mentioned explicitly in the comments (but maybe this is too obvious 😅).

@felixge
Copy link
Member

felixge commented Aug 7, 2024

@aalexand thanks, this definitely sounds interesting and worth exploring 👍 .

@rockdaboot I haven't heard anybody argue against applying gzip. OTLP receiver MUST support it according to the spec. For profiling clients we can, and I think SHOULD, encourage it.

So yeah, we should consider compression in our measurements. While doing this we should also agree on some measure of utility that takes into consideration throughput and compression ratio. For some prior art see this research from my colleagues @jbachorik and @thegreystone. (They evaluated only compression algorithms, but I think we could use a similar approach to think about the utility of in-protocol optimizations).

@rockdaboot
Copy link

rockdaboot commented Aug 8, 2024

@felixge I started creating a list of requirements to compare changes in the protocol including different compression types. We can use the ebpf profiler for recording and replaying traces in order to get somewhat reproducible results (hard when GC kicks in and/or exact timing plays a role). But I don't want to "capture" this issue, I'll open an issue against the ebpf profiler in a while.

@felixge
Copy link
Member

felixge commented Aug 12, 2024

@rockdaboot: @petethepig had a repo with benchmarks at some point that used a collection of pprof recordings to evaluate different protocol ideas. While I'm not sure how representative the data was, I generally like the idea of building a collection of profiling data files that we can use for comparison purposes. I can probably get some real data from our prod systems and anonymize if as well.

@rockdaboot
Copy link

@felixge @petethepig The thoughts collected at open-telemetry/opentelemetry-ebpf-profiler#110 and the existing work should well combine. Maybe we can use the issue to discuss, collect further thoughts and track progress.

@athre0z
Copy link
Member

athre0z commented Aug 12, 2024

As the protocol is mostly used for transport (and possibly storing), can we assume that compression like gzip or zstd is applied on top? Evaluation results should also contain message size with compression applied to see whether complexity of in-protocol optimization is worth it or not. Just saying because additional compression hasn't been mentioned explicitly in the comments (but maybe this is too obvious 😅).

Yeah: we'd expect the impact of such a deduplication schema on the compressed message size to be a lot smaller than on the uncompressed size. However, the uncompressed message size is also important because:

  • the maximum message size accepted by the gRPC server library is checked based on it (default: 4MiB)
  • the memory required to deserialize the message scales with it
  • splitting a message that exceeds the 4MiB into multiple ones requires starting with a fresh string table

@trask trask transferred this issue from open-telemetry/oteps Nov 19, 2024
@trask trask added sig-issue A specific SIG should look into this before discussing at the spec spec:profiling Related to the specification/profiling directory labels Nov 19, 2024
@trask
Copy link
Member

trask commented Nov 19, 2024

(transferred to specification repository as part of #4284)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sig-issue A specific SIG should look into this before discussing at the spec spec:profiling Related to the specification/profiling directory
Projects
None yet
Development

No branches or pull requests

8 participants