Add spanmetricsprocessor logic #2012
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2012 +/- ##
==========================================
+ Coverage 90.39% 90.45% +0.06%
==========================================
Files 394 394
Lines 19347 19459 +112
==========================================
+ Hits 17489 17602 +113
+ Misses 1397 1396 -1
Partials 461 461
The "setup" step failed in CI; I suspect it's flaky. Any chance somebody can rerun this please? Or I'm happy to do it myself if given the permissions to do so; currently the rerun button is disabled for me. |
availableMetricsExporters = append(availableMetricsExporters, k.Name())

p.logger.Info("Looking for spanmetrics exporter from available exporters",
Nit: log.Debug is likely more suitable here.
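A minimal sketch of that suggestion, assuming the `*zap.Logger` that collector components typically carry; the structured field shown is illustrative, not from the PR:

```go
// Demote the exporter-lookup log from Info to Debug: it is mainly
// useful when diagnosing configuration, not in normal operation.
// The "available-exporters" field is an assumed example.
p.logger.Debug("Looking for spanmetrics exporter from available exporters",
	zap.Strings("available-exporters", availableMetricsExporters))
```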
p.updateCallMetrics(key)
p.lock.Unlock()

latency := float64(span.EndTime()-span.StartTime()) / float64(time.Millisecond.Nanoseconds())
Suggested change:
- latency := float64(span.EndTime()-span.StartTime()) / float64(time.Millisecond.Nanoseconds())
+ latencyInMilliseconds := float64(span.EndTime()-span.StartTime()) / float64(time.Millisecond.Nanoseconds())
// CacheNode represents a node in the Cache Trie.
type CacheNode struct {
	children  map[string]*CacheNode
	metricKey *string
Why is this a pointer and not just a string value?
The default nil pointer is used as a sentinel value to indicate whether a dimension node in the trie cache contains the cached metricKey. Do you have any concerns around this approach?
Does anything prevent using the empty string as the sentinel value? That would avoid unnecessary indirection: with a pointer, the string header is allocated on the heap instead of being passed by value, which increases the number of heap allocations and the amount of garbage to be collected.
Interesting, I've learnt something new :). Nope, I think using an empty string would be an acceptable sentinel value too; it wouldn't make sense to have an empty metric key.
As above, this may not even be a concern if we no longer need the cache. :)
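A sketch of the two variants under discussion, using the field names from the snippet above; the `cachedKey` helper is hypothetical, added only to show how each sentinel would be tested:

```go
package spanmetrics // illustrative package name

// Pointer sentinel (as in the PR): nil means "no cached key on this
// node". The indirection means the string header itself must be
// heap-allocated.
type CacheNode struct {
	children  map[string]*CacheNode
	metricKey *string
}

func (n *CacheNode) cachedKey() (string, bool) {
	if n.metricKey == nil {
		return "", false
	}
	return *n.metricKey, true
}

// Value sentinel (as proposed): the empty string marks "no cached
// key", which is safe because an empty metric key is never valid.
// The string header lives inline in the node, avoiding the extra
// heap allocation.
type cacheNodeByValue struct {
	children  map[string]*cacheNodeByValue
	metricKey string
}

func (n *cacheNodeByValue) cachedKey() (string, bool) {
	return n.metricKey, n.metricKey != ""
}
```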
func (p *processorImp) buildKey(serviceName string, span pdata.Span) (string, error) {
	d := p.getDimensions(serviceName, span)
	p.lock.Lock()
	lastDimension := p.metricsKeyCache.InsertDimensions(d...)
The purpose of metricsKeyCache is not clear to me. Is it done to avoid serializing to JSON on every buildKey call? Is there any evidence this is less expensive? Also, why JSON? Can we serialize trivially by concatenating dimensions and values sequentially? Would that be less expensive than JSON and, most importantly, less expensive than the trie-based cache?
OK, I can see that benchmarking of the cache was done, which is good. It is still not clear why a much simpler serialization isn't used instead of JSON; it may be faster than the cache itself and eliminate the need for the cache.
Good point. The initial reason for marshalling to JSON is that it's convenient to unmarshal back to the map representation and then use this map to build the metric labels with dpLatency.LabelsMap().InitFromMap(kmap).

I agree that concatenating dimensions and values sequentially would be a much simpler approach and would do away with the complex caching mechanism. I avoided this option initially because I was focusing on the challenges of un/marshalling this concatenated string, but I realized that we could simply use this key to look up the pre-built map of labels, which is probably what you were thinking of. :)

I'll try it out and benchmark the implementation to understand the performance implication.
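A sketch of that direction: build the key by concatenating dimension values with a separator, then use the key to index a pre-built label map so nothing has to be unmarshalled. The separator choice and the simplified signature are assumptions, not the PR's final design:

```go
package spanmetrics // illustrative package name

import "strings"

// A NUL byte is assumed to be a safe separator because it cannot
// appear in dimension values; the real separator choice is up to
// the implementation.
const keySeparator = "\u0000"

// buildKey concatenates dimension values in a fixed order instead of
// marshalling a map to JSON. Signature simplified for this sketch.
func buildKey(serviceName, operationName string, dimensionValues []string) string {
	var b strings.Builder
	b.WriteString(serviceName)
	b.WriteString(keySeparator)
	b.WriteString(operationName)
	for _, v := range dimensionValues {
		b.WriteString(keySeparator)
		b.WriteString(v)
	}
	return b.String()
}
```

The key would then index a cache of pre-built label maps (a hypothetical key-to-dimensions map), so dpLatency.LabelsMap().InitFromMap(...) can be fed directly without ever parsing the key back into its parts.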
@tigrannajaryan I believe I've addressed your comments. Good news also: with your proposed approach, there was a 60% performance improvement based on my benchmarks, compared with the original (before any caching) implementation. It's a 45% performance improvement over my more complicated trie-based cache as well :). Thanks again for the great suggestion!
Force-pushed from b3c501f to bfee7b8
@tigrannajaryan please let me know if there are any other issues that I have not addressed.
LGTM. Merging as is, feel free to address the nits in a future PR.
func concatDimensionValue(metricKeyBuilder *strings.Builder, value string, prefixSep bool) {
	// It's worth noting that from pprof benchmarks, WriteString is the most expensive operation of this processor.
	// Specifically, the need to grow the underlying []byte slice to make room for the appended string.
Nit: we probably can easily have a fairly good idea about the expected size of the final string and can preallocate it. Can be a future improvement.
I did try stringBuilder.Grow(n) to initialize the capacity before concatenating strings, but it seemed to increase CPU time. It seems the larger proportion of time taken in Grow is the creation of a new slice, whereas growing the existing slice seemed to be cheaper; I don't completely understand why, to be honest. I've attached screenshots of the flamegraphs in case you were interested.
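A sketch of the kind of comparison being described, written as a standard Go benchmark pair; the dimension values are made up for illustration:

```go
package spanmetrics_test

import (
	"strings"
	"testing"
)

var dims = []string{"service-a", "GET /api/orders", "SPAN_KIND_SERVER", "STATUS_CODE_OK"}

// Incremental growth: let WriteString grow the buffer as needed.
func BenchmarkBuildKeyIncremental(b *testing.B) {
	for i := 0; i < b.N; i++ {
		var sb strings.Builder
		for _, d := range dims {
			sb.WriteString(d)
		}
		_ = sb.String()
	}
}

// Preallocated: compute the expected final size and Grow once up front.
func BenchmarkBuildKeyPrealloc(b *testing.B) {
	n := 0
	for _, d := range dims {
		n += len(d)
	}
	for i := 0; i < b.N; i++ {
		var sb strings.Builder
		sb.Grow(n)
		for _, d := range dims {
			sb.WriteString(d)
		}
		_ = sb.String()
	}
}
```

Running both with `go test -bench=BuildKey -benchmem` would show the allocation counts alongside CPU time, which is what the flamegraph discussion above is getting at.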
index := sort.SearchFloat64s(p.latencyBounds, latencyInMilliseconds)

p.lock.Lock()
key := buildKey(serviceName, span, p.dimensions)
Nit: I think this line can be moved before the lock; I do not see it touching any shared data.
@tigrannajaryan thank you for the thorough review. It has definitely made the quality of this component a lot better, and I've learnt a lot along the way. 🙏 I'll address the nit relating to the lock. As for the string builder, I'll leave that for later, as I'm not sure it'll be a quick change.
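A sketch of that lock nit, using the names from the snippet above. It assumes p.dimensions is immutable configuration after start-up, so reading it without the lock is safe:

```go
// Before (simplified): the key is built while holding the lock.
//
//	p.lock.Lock()
//	key := buildKey(serviceName, span, p.dimensions)
//	p.updateCallMetrics(key)
//	p.lock.Unlock()
//
// After: buildKey only reads its arguments, and p.dimensions is
// assumed immutable after construction, so it can run outside the
// critical section; only the shared metric state needs the lock.
key := buildKey(serviceName, span, p.dimensions)
p.lock.Lock()
p.updateCallMetrics(key)
p.lock.Unlock()
```

Shrinking the critical section this way reduces contention when many spans are processed concurrently.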
Follow-up PR — **Description:** Enables the spanmetricsprocessor for use in otel collector configs as a follow-up to #2012. **Link to tracking Issue:** Fixes #403. **Testing:** Integration tested locally against a prebuilt Grafana dashboard to confirm metrics look correct.
**Description:** Adds spanmetricsprocessor logic as a follow-up to #1917.
**Link to tracking Issue:** #403
**Testing:**
- Unit and integration tested locally (by adding to `cmd/otelcontribcol/components.go`).
- Performance tested (added metrics key cache to improve performance) and added benchmark function.
**Documentation:** Add relevant documentation to exported structs and functions.