SLO - http request external duration and error source #989
If you think this will work, that's fine by me.
I've never subtracted one histogram from another, so that's why I'm curious.
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var duration = promauto.NewHistogramVec(prometheus.HistogramOpts{
Since this is a histogram, don't you need to specify buckets? Have you verified that subtracting one histogram from another works as you imagine?
But wasn't he thinking of taking the full latency of a trace, subtracting the downstream span latencies, and then updating the latency histogram? In that case it's not Prometheus doing the subtraction, and you get one metric.
So my question is more: given that you have two different latency histogram metrics, how easy is it to reliably subtract one from the other, and how does that scale/perform?
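To make the two options in this thread concrete, here is a minimal sketch that uses only the Go standard library. The names `histogram`, `observe`, and `meanDiff` are hypothetical and are not taken from the PR or from the Prometheus client. Option one subtracts per request and records the result in a single histogram. Option two keeps two histograms and subtracts afterwards; in that case only the sums and counts subtract cleanly (which gives a mean), because the bucket counts of two distributions do not determine the distribution of the per-request difference.

```go
package main

import "fmt"

// histogram is a hypothetical, simplified cumulative histogram: "le" style
// buckets plus a running sum and count, the same idea as a Prometheus histogram.
type histogram struct {
	bounds []float64 // bucket upper bounds in seconds, ascending
	counts []uint64  // cumulative observation count per bound
	sum    float64
	count  uint64
}

func newHistogram(bounds ...float64) *histogram {
	return &histogram{bounds: bounds, counts: make([]uint64, len(bounds))}
}

// observe records v in every bucket whose upper bound is >= v.
func (h *histogram) observe(v float64) {
	for i, b := range h.bounds {
		if v <= b {
			h.counts[i]++
		}
	}
	h.sum += v
	h.count++
}

// meanDiff subtracts two histograms after the fact (option two).
// Assumption: every external call recorded in `external` belongs to a
// request recorded in `total`. The result is a mean, not a distribution.
func meanDiff(total, external *histogram) float64 {
	if total.count == 0 {
		return 0
	}
	return (total.sum - external.sum) / float64(total.count)
}

func main() {
	total := newHistogram(0.1, 0.5, 1)
	external := newHistogram(0.1, 0.5, 1)
	plugin := newHistogram(0.1, 0.5, 1) // option one: a single derived metric

	// Made-up (total, external) durations in seconds for three requests.
	requests := [][2]float64{{0.30, 0.25}, {0.80, 0.20}, {0.40, 0.35}}
	for _, r := range requests {
		total.observe(r[0])
		external.observe(r[1])
		plugin.observe(r[0] - r[1]) // subtract per request, then observe
	}

	fmt.Println("plugin buckets (per-request subtraction):", plugin.counts)
	fmt.Println("mean plugin duration from two histograms:", meanDiff(total, external))
}
```

With per-request subtraction the `plugin` histogram supports percentile queries directly. With two separate histograms, a query can recover the average plugin duration from the sums and counts, but not its percentiles.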
I haven't tested whether subtracting histograms works reliably, but the way this histogram is set up should replicate what Tempo does with the metrics generator. I think @svennergr mentioned he was doing something similar with histograms in the ElasticSearch datasource as well, so he may have some guidance on whether this will be accurate or not.
LGTM, feel free to evaluate. Suggested some minor changes.
Awesome work! I have some minor suggestions but the code LGTM! 🚀
Co-authored-by: Giuseppe Guerra <giuseppe.guerra@grafana.com>
What this PR does / why we need it:
Provides middleware for HTTP clients to capture external duration as a custom metric. Since we already capture total duration, we can subtract external duration from it to get plugin duration.
In discussion with @wbrowne, the API Server will expose this new metric, which gets scraped into Prometheus, and we can create SLOs based on duration and error source.
This also captures error source as a label, so we don't have to pass error source around.
We will take the same approach in SQLDS, so minimal changes will be required per plugin.
Ref https://github.com/grafana/enterprise-datasources/issues/730
Ref https://github.com/grafana/enterprise-datasources/issues/731