-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor tail-sampling processor - Decision cache #31583
Comments
Pinging code owners for processor/tailsampling: @jpkrohling. See Adding Labels via Comments if you do not have permissions to add labels yourself. |
Drop decisions only need a single bit -- "we dropped this", so there's benefit to splitting the decision cache into a Bloom or Cuckoo filter for drops and a separate cache for keep (attaching metadata to spans like sampling probability may require the keep cache to be more than a filter). Cache size should be configurable. |
@kentquirk can you elaborate a bit more on why we'd need to capture sampling probability per-trace? My understanding is that by entering a trace id into the "sampled" cache we're stating that all spans for this trace should be sampled. |
@kentquirk, do you want to work on this one? I can assign this to you. I'd like to work on converting the metrics to use OTel, which could be helpful in adding observability to the cache. |
@djluck You're correct that the probability applies to all spans in a trace, but sampling probability / adjusted count is of interest when analyzing the results. When I see a trace I want to know "how many traces like this occurred?" For a simple example, if I normally sample at 10% but keep all errors, I need to be able to multiply non-error traces by 10 in order to calculate the appropriate fraction of errors. To do that, I need to know what the sampling probability was. That data needs to be attached to the outgoing spans. @jpkrohling You can assign it to me. It might end up getting split into a couple of pieces. |
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself. |
Hi @kentquirk @jpkrohling this is certainly something I think this processor could use, and would help with early sampling decisions and preventing many spans from being stored in memory for longer than needed. The discussions around bloom/cuckoo filter are definitely on the right track. Looking through Refinery's code, it uses a dual cuckoo-filter mechanism which seems to be quite memory efficient and relatively low rate of false-positives. Another possible implementation is using an LRU cache. To make it twice as memory efficient you could also just store the right half of the trace id as a As my team is in the midst of implementing tail sampling at quite a large scale, I'd be interested in taking a look at this in the coming weeks. |
I would prefer if someone from Honeycomb could implement the cuckoo filter, to prevent any misunderstandings on the code ending up similar to Refinery's, and while I thought LRU would be the optimal one except for the size, your suggestion of using only the right-half of the IDs is wonderful! |
For now, I'll implement the cache as a circular buffer, and we can have alternative implementations right after that. |
**Description:** Adds simple LRU decision cache for sampled trace IDs. The design makes it easy to add another cache for non-sampled IDs. It does not save any other information other than the trace ID that is sampled. It only holds the right half of the trace ID (as a uint64) in the cache. By default the cache remains no-op. Only when the user configures the cache size will the cache become active. **Link to tracking Issue:** #31583 **Testing:** * unit tests on new code * test in `processor_decision_test.go` to test that a trace that was sampled, cached, but the span data was dropped persists a "keep" decision. **Documentation:** Added description to README
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself. |
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself. |
…6040) Re-opening #33722 so it can be finished. I've addressed @jpkrohling's open comments from that PR. Description: Adds a decision cache for non sampled IDs. This is off-by-default. It's the same as the sampled IDs cache but is separate and has a separate size. **Link to tracking Issue: #31583 Testing: Unit test added. Cache type itself already tested. Documentation: Updated the readme with config options and descriptions. --------- Co-authored-by: Juraci Paixão Kröhling <juraci@kroehling.de>
I believe this has been implemented by @jamesmoessis. |
…en-telemetry#36040) Re-opening open-telemetry#33722 so it can be finished. I've addressed @jpkrohling's open comments from that PR. Description: Adds a decision cache for non sampled IDs. This is off-by-default. It's the same as the sampled IDs cache but is separate and has a separate size. **Link to tracking Issue: open-telemetry#31583 Testing: Unit test added. Cache type itself already tested. Documentation: Updated the readme with config options and descriptions. --------- Co-authored-by: Juraci Paixão Kröhling <juraci@kroehling.de>
…en-telemetry#36040) Re-opening open-telemetry#33722 so it can be finished. I've addressed @jpkrohling's open comments from that PR. Description: Adds a decision cache for non sampled IDs. This is off-by-default. It's the same as the sampled IDs cache but is separate and has a separate size. **Link to tracking Issue: open-telemetry#31583 Testing: Unit test added. Cache type itself already tested. Documentation: Updated the readme with config options and descriptions. --------- Co-authored-by: Juraci Paixão Kröhling <juraci@kroehling.de>
…en-telemetry#36040) Re-opening open-telemetry#33722 so it can be finished. I've addressed @jpkrohling's open comments from that PR. Description: Adds a decision cache for non sampled IDs. This is off-by-default. It's the same as the sampled IDs cache but is separate and has a separate size. **Link to tracking Issue: open-telemetry#31583 Testing: Unit test added. Cache type itself already tested. Documentation: Updated the readme with config options and descriptions. --------- Co-authored-by: Juraci Paixão Kröhling <juraci@kroehling.de>
…en-telemetry#36040) Re-opening open-telemetry#33722 so it can be finished. I've addressed @jpkrohling's open comments from that PR. Description: Adds a decision cache for non sampled IDs. This is off-by-default. It's the same as the sampled IDs cache but is separate and has a separate size. **Link to tracking Issue: open-telemetry#31583 Testing: Unit test added. Cache type itself already tested. Documentation: Updated the readme with config options and descriptions. --------- Co-authored-by: Juraci Paixão Kröhling <juraci@kroehling.de>
The decision cache can be a regular cached, where the key is the trace ID and the value is the decision. We can use a ring buffer to limit the number of entries in this cache, but this can be an implementation detail. Whenever a new span arrives, we'd run it through the cache and immediately release/drop based on a previous decision.
The text was updated successfully, but these errors were encountered: