Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add spanmetricsprocessor logic #2012

Merged
merged 12 commits into from
Jan 28, 2021

Conversation

albertteoh
Copy link
Contributor

Description: Adds spanmetricsprocessor logic as a follow-up to #1917.

Link to tracking Issue: #403

Testing:

  • Unit and integration tested locally (by adding to cmd/otelcontribcol/components.go).
  • Performance tested (added metrics key cache to improve performance) and added benchmark function.

Documentation: Add relevant documentation to exported structs and functions.

albertteoh added 4 commits January 12, 2021 16:20
Signed-off-by: albertteoh <albert.teoh@logz.io>
Signed-off-by: albertteoh <albert.teoh@logz.io>
Signed-off-by: albertteoh <albert.teoh@logz.io>
Signed-off-by: albertteoh <albert.teoh@logz.io>
@albertteoh albertteoh requested a review from a team January 13, 2021 12:10
@codecov
Copy link

codecov bot commented Jan 13, 2021

Codecov Report

Merging #2012 (bfee7b8) into master (5010cd7) will increase coverage by 0.06%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2012      +/-   ##
==========================================
+ Coverage   90.39%   90.45%   +0.06%     
==========================================
  Files         394      394              
  Lines       19347    19459     +112     
==========================================
+ Hits        17489    17602     +113     
+ Misses       1397     1396       -1     
  Partials      461      461              
Flag Coverage Δ
integration 69.84% <ø> (+0.06%) ⬆️
unit 89.24% <100.00%> (+0.06%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
processor/spanmetricsprocessor/processor.go 100.00% <100.00%> (ø)
processor/groupbytraceprocessor/event.go 96.80% <0.00%> (+0.79%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5010cd7...bfee7b8. Read the comment docs.

Signed-off-by: albertteoh <albert.teoh@logz.io>
@albertteoh
Copy link
Contributor Author

The "setup" step failed in CI; I suspect it's flaky. Any chance somebody can rerun this please? Or I'm happy to do it myself if given the permissions to do so; currently the rerun button is disabled for me.

albertteoh added 3 commits January 14, 2021 14:51
Signed-off-by: albertteoh <albert.teoh@logz.io>
Signed-off-by: albertteoh <albert.teoh@logz.io>
Signed-off-by: albertteoh <albert.teoh@logz.io>
processor/spanmetricsprocessor/metricsdimensions/cache.go Outdated Show resolved Hide resolved

availableMetricsExporters = append(availableMetricsExporters, k.Name())

p.logger.Info("Looking for spanmetrics exporter from available exporters",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: log.Debug is likely more suitable here.

p.updateCallMetrics(key)
p.lock.Unlock()

latency := float64(span.EndTime()-span.StartTime()) / float64(time.Millisecond.Nanoseconds())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
latency := float64(span.EndTime()-span.StartTime()) / float64(time.Millisecond.Nanoseconds())
latencyInMilliseconds := float64(span.EndTime()-span.StartTime()) / float64(time.Millisecond.Nanoseconds())

processor/spanmetricsprocessor/processor.go Show resolved Hide resolved
// CacheNode represents a node in the Cache Trie.
type CacheNode struct {
children map[string]*CacheNode
metricKey *string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this a pointer and not just a string value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default nil pointer is used as a sentinel value to indicate if a dimension node in the trie cache contains the cached metricKey; example use:

https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/2012/files#diff-b91e685e0c165315d86cef17341af406f7fc6cff3e7b7a9e67f006f0f4e63764R96

Do you have any concerns around this approach?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anything prevents to use the empty string as the sentinel value? That will avoid unnecessary indirection, which results in the string slice header being allocated on the heap instead of passed by value, thus increasing the number of heap allocations and garbage to be collected.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, I've learnt something new :). Nope, I think using an empty string would be an acceptable sentinel value too; it wouldn't make sense to have an empty metric key.

As above, this may not even be a concern if we no longer need the cache. :)

processor/spanmetricsprocessor/metricsdimensions/cache.go Outdated Show resolved Hide resolved
processor/spanmetricsprocessor/metricsdimensions/cache.go Outdated Show resolved Hide resolved
func (p *processorImp) buildKey(serviceName string, span pdata.Span) (string, error) {
d := p.getDimensions(serviceName, span)
p.lock.Lock()
lastDimension := p.metricsKeyCache.InsertDimensions(d...)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The purpose of metricsKeyCache is not clear to me. Is it done to avoid serializing to json on every buildKey call? Is there any evidence this is less expensive? Also why json? Can we serialize trivially by concatenating dimensions and values sequentially? Would that be less expensive than json and most importantly less expensive than the trie-based cache?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I can see benchmarking of the cache was done, which is good. It is not clear why not use a much simpler serialization instead of json. It may be potentially faster than the cache itself and eliminate the need for the cache.

Copy link
Contributor Author

@albertteoh albertteoh Jan 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. The initial reason for marshalling to JSON is because it's convenient to unmarshal to the map representation again as in this line of code then use this map to build the metric labels with dpLatency.LabelsMap().InitFromMap(kmap).

I agree that concatenating dimensions and values sequentially would be a much simpler approach and do away with the complex caching mechanism. I avoided this option initially because I was focusing on the challenges of un/marshalling this concatenated string, but I realized that we could simply use this key to lookup the pre-built map of labels, which is probably what you were thinking of. :)

I'll try it out and benchmark the implementation to understand the performance implication.

processor/spanmetricsprocessor/processor.go Show resolved Hide resolved
processor/spanmetricsprocessor/processor.go Show resolved Hide resolved
@tigrannajaryan tigrannajaryan self-assigned this Jan 19, 2021
Signed-off-by: albertteoh <albert.teoh@logz.io>
@albertteoh
Copy link
Contributor Author

albertteoh commented Jan 22, 2021

@tigrannajaryan I believe I've addressed your comments. Good news also is with your proposed approach, there was a 60% performance improvement based on my benchmarks, compared with the original (before any caching) implementation. It's a 45% performance improvement from my more complicated trie-based cache as well :). Thanks again for the great suggestion!

Signed-off-by: albertteoh <albert.teoh@logz.io>
processor/spanmetricsprocessor/processor.go Outdated Show resolved Hide resolved
processor/spanmetricsprocessor/processor.go Outdated Show resolved Hide resolved
processor/spanmetricsprocessor/processor.go Outdated Show resolved Hide resolved
processor/spanmetricsprocessor/processor.go Outdated Show resolved Hide resolved
processor/spanmetricsprocessor/processor.go Outdated Show resolved Hide resolved
albertteoh added 2 commits January 26, 2021 17:41
Signed-off-by: albertteoh <albert.teoh@logz.io>
Signed-off-by: albertteoh <albert.teoh@logz.io>
@albertteoh
Copy link
Contributor Author

@tigrannajaryan please let me know if there are any other issues that I have not addressed.

Copy link
Member

@tigrannajaryan tigrannajaryan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Merging as is, feel free to address the nits in a future PR.


func concatDimensionValue(metricKeyBuilder *strings.Builder, value string, prefixSep bool) {
// It's worth noting that from pprof benchmarks, WriteString is the most expensive operation of this processor.
// Specifically, the need to grow the underlying []byte slice to make room for the appended string.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: we probably can easily have a fairly good idea about the expected size of the final string and can preallocate it. Can be a future improvement.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did try stringBuilder.Grow(n) to initialize the capacity before concatenating strings, but it seemed to increase CPU time. It seems the larger proportion of time taken in Grow is the creation of a new slice, whereas growing the existing slice seemed to be cheaper; I don't completely understand why tbh.

I've attached screenshots of the flamegraphs in case you were interested.

Screen Shot 2021-01-29 at 12 15 33 pm
Screen Shot 2021-01-29 at 12 14 57 pm

index := sort.SearchFloat64s(p.latencyBounds, latencyInMilliseconds)

p.lock.Lock()
key := buildKey(serviceName, span, p.dimensions)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I think this line can be before the lock, I do not see it touching any shared data.

@tigrannajaryan tigrannajaryan merged commit 8ca2e9b into open-telemetry:main Jan 28, 2021
@albertteoh albertteoh deleted the atm-part-1 branch January 29, 2021 00:14
@albertteoh
Copy link
Contributor Author

@tigrannajaryan thank you for the thorough review. It has definitely made the quality of this component a lot better, and I've learnt a lot along the way. 🙏

I'll address the nit relating to the lock. As for the string builder, I'll leave that for later as I'm not sure if that'll be a quick change.

tigrannajaryan pushed a commit that referenced this pull request Jan 29, 2021
Signed-off-by: albertteoh <albert.teoh@logz.io>

**Description:** Enables the spanmetricsprocessor for use in otel collector configs as a follow-up to #2012.

**Link to tracking Issue:** Fixes #403

**Testing:** Integration tested locally against a prebuilt Grafana dashboard to confirm metrics look correct.
dyladan referenced this pull request in dynatrace-oss-contrib/opentelemetry-collector-contrib Jan 29, 2021
**Description:** as discussed in the past, we are moving the tail-based sampling processor to the contrib repository.

IMPORTANT: there shouldn't be a release of the collector/contrib until the contrib counterpart is merged!

**Link to tracking Issue:** Partially addresses #2020.

**Testing:** regular test.

**Documentation:** Removed processor from readme.

Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>
pmatyjasek-sumo referenced this pull request in pmatyjasek-sumo/opentelemetry-collector-contrib Apr 28, 2021
**Description:** Adds spanmetricsprocessor logic as a follow-up to #1917.

**Link to tracking Issue:** #403

**Testing:**
- Unit and integration tested locally (by adding to `cmd/otelcontribcol/components.go`).
- Performance tested (added metrics key cache to improve performance) and added benchmark function.

**Documentation:** Add relevant documentation to exported structs and functions.
pmatyjasek-sumo referenced this pull request in pmatyjasek-sumo/opentelemetry-collector-contrib Apr 28, 2021
Signed-off-by: albertteoh <albert.teoh@logz.io>

**Description:** Enables the spanmetricsprocessor for use in otel collector configs as a follow-up to #2012.

**Link to tracking Issue:** Fixes #403

**Testing:** Integration tested locally against a prebuilt Grafana dashboard to confirm metrics look correct.
ljmsc referenced this pull request in ljmsc/opentelemetry-collector-contrib Feb 21, 2022
* Interface stability documentation

* Update versioning policy

* Update CONTRIBUTING

* Document how to extend immutable interfaces

* Markdown format VERSIONING changes
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants