-
Notifications
You must be signed in to change notification settings - Fork 235
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove stale metrics #164
Remove stale metrics #164
Conversation
Tests are broken for now because of changes in Elements fields. Will fix it at the daylight. |
Awesome, thank you for contributing! It will take me a little while to go through the code changes in detail, but I wanted to say that I think the general approach is good. |
This is a great idea. Really hoping to see it implemented soon. I work in an environment where our central infrastructure is in the cloud, but our clients are behind corporate firewalls. We're developing an application-specific plugin that will send metrics to Prometheus - but we keep running into this problem. The pushgateway - and now, statsd - currently "cache" metrics indefinitely. This makes it difficult to determine if a client has gone down - because their metrics are being persisted indefinitely. Here's hoping we'll see a TTL feature added soon! |
One workaround for this specific question and situation is to push a metric
that is the current (unix) time from all clients, then you can see when
that stops updating / gets too far away from the server time().
…On Tue, Nov 27, 2018, 09:56 R0quef0rt ***@***.***> wrote:
This is a great idea. Really hoping to see it implemented soon.
I work in an environment where our central infrastructure is in the cloud,
but our clients are behind corporate firewalls. We're developing an
application-specific plugin that will send metrics to Prometheus - but we
keep running into this problem.
The pushgateway - and now, statsd - currently "cache" metrics
indefinitely. This makes it difficult to determine if a client has gone
down - because their metrics are being persisted indefinitely.
Here's hoping we'll see a TTL feature added soon!
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#164 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAICBgNOO_rwtYaDB8NI7LUUsT3OKz-sks5uzP4ngaJpZM4YcZUb>
.
|
Tests are good now. The problem was in
|
@matthiasr Any comments? statsd_exporter with this patch now lives on 5 production clusters and works well. |
sorry, I got distracted and then forgot. the title still says WIP? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks pretty good! Only one concern: The exporter allows live-reloading of the configuration, so the label set or TTL of a given metric can change during the lifetime of the exporter. What happens in that case?
Could you please add tests for the added mapping fields, and tests that demonstrate/verify the expiry behaviour?
Please update the documentation (README) as well.
What happens when there are two mappings with different TTLs mapping to the same metric name / label? That is a perfectly valid situation, and we need to at least document it. |
Done
Done
Modified ttl for a metric will not be in use until call to handleEvent with a new value for metric. In theory one would want to expire a stale metric with
|
@matthiasr some comments about tests. Previous algorithm with getting a Metric instance and call a Write works good for simple cases such as test for 1 value or test values in one routine. So I use a Gatherer which is thread safe and to be more close to what a scraper get. Method |
I think for now it's okay to document this and leave it be. If someone really needs instant expiration, they can restart or implement that signalling. |
The FSM finds exactly one mapping for a given statsd metric name. The Prometheus metric name and labels are an output of that. I think what will happen in this case is that whichever mapping last matched will win. Again, I think that's okay as long as we document it. |
Nevermind my last comment, I was confusing myself about what level TTLs are applied at. I don't expect anyone to have multiple mappings to the same Prometheus metric name, labels and label values, so we don't need to cover that case. If you're really mapping multiple event streams to the same metric, all bets are off, but last-event-wins is what you get either way. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Soooo cloooose …
The go.mod merge conflict is from #171 which fixed the build after Go 1.11.4 was released. Could you rebase once more please?
The README looks almost good, I proposed some edits for language, I would like to strike the mention of Kubernetes (in that case, TTLs are the wrong solution), and I wrote up #164 (comment)
Do you think you can change the tests to not be wall-clock-dependent? That tends to make them flaky in the long run, which I would really like to avoid because it will make future contributor's lives hard.
I'd like to add more detail, but I'm not sure I understand the implications just yet, and they will probably change with #164. Signed-off-by: Matthias Rampke <mr@soundcloud.com>
- use MetricVec family instead of Metric - dynamic label values instead of ConstLabels - use dto.Metric to gain histrogram value in exporter_test - remove hash calculations Signed-off-by: Ivan Mikheykin <ivan.mikheykin@flant.com>
- ttl is hardcoded — should be in mapping.yaml - works with metrics without labels Signed-off-by: Ivan Mikheykin <ivan.mikheykin@flant.com>
Signed-off-by: Ivan Mikheykin <ivan.mikheykin@flant.com>
Signed-off-by: Ivan Mikheykin <ivan.mikheykin@flant.com>
Signed-off-by: Ivan Mikheykin <ivan.mikheykin@flant.com>
Signed-off-by: Ivan Mikheykin <ivan.mikheykin@flant.com>
Signed-off-by: Ivan Mikheykin <ivan.mikheykin@flant.com>
Signed-off-by: Ivan Mikheykin <ivan.mikheykin@flant.com>
The branch is rebased. New prometheus client brings an error "Histogram is not Observer". All changes in README are applied, paragraph about k8s is deleted, your suggestion is added. Tests are not depended on time.Sleep:
|
Thank you very much! |
Signed-off-by: Matthias Rampke <mr@soundcloud.com>
This patch implements clearing of metrics with variable labels with per-mapping configurable timeout (ttl). Our monitoring installation is suffered from the behavior desribed in #129. There are metrics from Kubernetes pods with constantly changing values for labels. For example, metric
ingress_nginx_upstream_retries_count
with labelpod
and changing pod name as a value for this label. There is no sense to store metrics with old pod names as they are never repeated. Removing metrics with this stale values will not cause any damage to subsequent aggregations or multiple prometheus servers.The patch is divided by 3 parts to simplify a review:
ttl
todefault
section and tomapping
sections.ttl
is a timeout in seconds for stale metrics. Exporter saves each labels values set and updates last time it receives metric with these labels values. If no metric with the same labels values received forttl
seconds, then statsd_exporter stops reporting metric with this labels values set.Mark this as WIP. It would be great to hear your thoughts on this.