-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Decision cache for "keep" trace IDs #33533
Decision cache for "keep" trace IDs #33533
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great start, thank you!
@@ -0,0 +1,14 @@ | |||
package decisioncache |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have two suggestions about the interface:
- perhaps this can be a more generic cache, receiving a key and value types as parameters?
- a delete operation, which might be noop for the lru implementation
It would then look like this:
package cache
type Cache[T comparable, V any] interface {
// Get returns the value for the given key if it exists in the cache.
Get(key T) (V, bool)
// Put sets the value for the given key.
Put(key T, value V)
// Delete deletes the value for the given key.
Delete(key T)
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For completeness, I think we could use this to evenetually replace the idToTrace map as well, if we see a better performance with the other implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Further idea, perhaps for a follow-up PR: make the caches observable. We probably want to know the following metrics:
- cache hits
- cache misses
- cache size
This could either be a type CacheMetrics struct
, make the cache interfaced return those counters/gauge, or make the individual cache implementations accept the TelemetryBuilder instance we create in the new processor function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All good ideas, working on them now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the interface suggestions make sense, but after playing around with the code I think it's easier to use just [V any]
as the only generic, since the trace ID will always be the key, and we can make optimisations around that in the implementation layer if it's structured like this (like using an uint64 instead of pcommon.TraceID internally). This would work for replacing idToTrace as well.
I like the idea of the observable caches as well. Would be happy to do a follow up PR for that.
Edit: I've pushed it up
processor/tailsamplingprocessor/internal/decisioncache/lru_cache.go
Outdated
Show resolved
Hide resolved
processor/tailsamplingprocessor/internal/decisioncache/lru_cache_test.go
Outdated
Show resolved
Hide resolved
@@ -45,6 +45,11 @@ The following configuration options can also be modified: | |||
- `decision_wait` (default = 30s): Wait time since the first span of a trace before making a sampling decision | |||
- `num_traces` (default = 50000): Number of traces kept in memory. | |||
- `expected_new_traces_per_sec` (default = 0): Expected number of new traces (helps in allocating data structures) | |||
- `decision_cache` (default = `sampled_cache_size: 0`): Configures amount of trace IDs to be kept in an LRU cache, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you mentioned it before, but why not a decision cache where the decision is the value get back from the cache? We have two states already, sampled and not sampled. The boolean we have already being added to the cache could serve this purpose. Or is the point to allow users to keep a bigger cache for sampled than for not sampled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea was to have separate caches for sampled/not sampled. This means you can have different sizes and performance characteristics for each. I thought this comment mentioned it well: #31583 (comment)
For example the "do not keep" cache can be different, like a cuckoo filter. I also probably want to keep the different decisions for different lengths of times.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This explanation definitely belongs to the readme!
processor/tailsamplingprocessor/internal/decisioncache/lru_cache.go
Outdated
Show resolved
Hide resolved
1b3463c
to
f3bcac9
Compare
Thanks @jpkrohling for the review and top suggestions. I've addressed the suggestions (some with slight alterations). |
add dep nop cache impl
fix comment typo
fix metric
generation, imports, formatting ok
f3bcac9
to
e19917f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll merge this as is, as I think it's a great feature to have already. The last comments can be addressed in a follow-up PR.
@@ -2,6 +2,8 @@ tail_sampling: | |||
decision_wait: 10s | |||
num_traces: 100 | |||
expected_new_traces_per_sec: 10 | |||
decision_cache: | |||
sampled_cache_size: 500 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If our recommendation is to keep the number of items an order of magnitude higher, we should make this way bigger than the num traces in this example.
- `decision_cache` (default = `sampled_cache_size: 0`): Configures amount of trace IDs to be kept in an LRU cache, | ||
persisting the "keep" decisions for traces that may have already been released from memory. | ||
By default, the size is 0 and the cache is inactive. | ||
If using, configure this as much higher than `num_traces` so decisions for trace IDs are kept |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The beauty of this cache is that it's memory hit is predictable: it's 8 bytes for the uint64 plus one byte for the boolean, per entry (plus the overhead from the cache implementation itself). So, we could find what that overhead is per entry, and add some guidance here. Like:
if you want to allocate 100MB to this cache, you can set this value to 10_000_000
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, can add this as a follow up
Description:
Adds simple LRU decision cache for sampled trace IDs.
The design makes it easy to add another cache for non-sampled IDs.
It does not save any other information other than the trace ID that is sampled. It only holds the right half of the trace ID (as a uint64) in the cache.
By default the cache remains no-op. Only when the user configures the cache size will the cache become active.
Link to tracking Issue: #31583
Testing:
processor_decision_test.go
to test that a trace that was sampled, cached, but the span data was dropped persists a "keep" decision.Documentation: Added description to README