Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove stale metrics #164

Merged
merged 8 commits into from
Dec 20, 2018
Merged

Remove stale metrics #164

merged 8 commits into from
Dec 20, 2018

Conversation

diafour
Copy link
Contributor

@diafour diafour commented Nov 13, 2018

This patch implements clearing of metrics with variable labels with per-mapping configurable timeout (ttl). Our monitoring installation is suffered from the behavior desribed in #129. There are metrics from Kubernetes pods with constantly changing values for labels. For example, metric ingress_nginx_upstream_retries_count with label pod and changing pod name as a value for this label. There is no sense to store metrics with old pod names as they are never repeated. Removing metrics with this stale values will not cause any damage to subsequent aggregations or multiple prometheus servers.

The patch is divided by 3 parts to simplify a review:

  • replace Metrics with Collectors to gain ability of deletion metrics. So Elements in CounterContainer stores CounterVecs instead of Counters.
  • implement labels values storage in Exporter to detect metrics staleness
  • implement configurable timeout (ttl). Add key ttl to default section and to mapping sections.

ttl is a timeout in seconds for stale metrics. Exporter saves each labels values set and updates last time it receives metric with these labels values. If no metric with the same labels values received for ttl seconds, then statsd_exporter stops reporting metric with this labels values set.

Mark this as WIP. It would be great to hear your thoughts on this.

@diafour
Copy link
Contributor Author

diafour commented Nov 13, 2018

Tests are broken for now because of changes in Elements fields. Will fix it at the daylight.

@matthiasr
Copy link
Contributor

Awesome, thank you for contributing! It will take me a little while to go through the code changes in detail, but I wanted to say that I think the general approach is good.

exporter.go Outdated Show resolved Hide resolved
exporter.go Outdated Show resolved Hide resolved
exporter.go Outdated Show resolved Hide resolved
exporter.go Show resolved Hide resolved
exporter.go Outdated Show resolved Hide resolved
@R0quef0rt
Copy link

This is a great idea. Really hoping to see it implemented soon.

I work in an environment where our central infrastructure is in the cloud, but our clients are behind corporate firewalls. We're developing an application-specific plugin that will send metrics to Prometheus - but we keep running into this problem.

The pushgateway - and now, statsd - currently "cache" metrics indefinitely. This makes it difficult to determine if a client has gone down - because their metrics are being persisted indefinitely.

Here's hoping we'll see a TTL feature added soon!

@matthiasr
Copy link
Contributor

matthiasr commented Nov 27, 2018 via email

@diafour
Copy link
Contributor Author

diafour commented Nov 27, 2018

Tests are good now. The problem was in TestHistogramUnits test with mocking of Histogram metric. Resolve it with Write a dto.Metric as in a client_golang https://github.com/prometheus/client_golang/blob/master/prometheus/examples_test.go#L497-L498

hashNameAndLabels is used to create a key for label values in intermediate storage (Exporter.labelValues)

labelNames is got proper name

@diafour
Copy link
Contributor Author

diafour commented Dec 7, 2018

@matthiasr Any comments? statsd_exporter with this patch now lives on 5 production clusters and works well.

@matthiasr
Copy link
Contributor

sorry, I got distracted and then forgot. the title still says WIP?

Copy link
Contributor

@matthiasr matthiasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good! Only one concern: The exporter allows live-reloading of the configuration, so the label set or TTL of a given metric can change during the lifetime of the exporter. What happens in that case?

Could you please add tests for the added mapping fields, and tests that demonstrate/verify the expiry behaviour?

Please update the documentation (README) as well.

exporter.go Outdated Show resolved Hide resolved
exporter_test.go Outdated Show resolved Hide resolved
@matthiasr
Copy link
Contributor

What happens when there are two mappings with different TTLs mapping to the same metric name / label? That is a perfectly valid situation, and we need to at least document it.

@diafour
Copy link
Contributor Author

diafour commented Dec 13, 2018

Could you please add tests for the added mapping fields, and tests that demonstrate/verify the expiry behaviour?

Done

Please update the documentation (README) as well.

Done

The exporter allows live-reloading of the configuration, so the label set or TTL of a given metric can change during the lifetime of the exporter. What happens in that case?

Modified ttl for a metric will not be in use until call to handleEvent with a new value for metric. In theory one would want to expire a stale metric with ttl: 30m without reload by decreasing ttl to 1s — this is not possible now. Should we handle such situation? I think it requires some kind of signalization about config reloads from main to exporter.

What happens when there are two mappings with different TTLs mapping to the same metric name / label? That is a perfectly valid situation, and we need to at least document it.

handleEvent use mapper.GetMapping which use FSM.GetMapping to get a *MetricMapping by metric name and metric type. This mapping defines ttl to use. As I can see labels are not used to find a
mapping. Can you tell what is an algorithm to find the mapping in FSM? May be you have an example of config to run some tests?

@diafour diafour changed the title WIP: Remove stale metrics Remove stale metrics Dec 13, 2018
@diafour
Copy link
Contributor Author

diafour commented Dec 17, 2018

@matthiasr some comments about tests. Previous algorithm with getting a Metric instance and call a Write works good for simple cases such as test for 1 value or test values in one routine. So I use a Gatherer which is thread safe and to be more close to what a scraper get. Method Gather returns many additional metrics but for test purposes it is not important.

@matthiasr
Copy link
Contributor

Should we handle such situation? I think it requires some kind of signalization about config reloads from main to exporter.

I think for now it's okay to document this and leave it be. If someone really needs instant expiration, they can restart or implement that signalling.

@matthiasr
Copy link
Contributor

Can you tell what is an algorithm to find the mapping in FSM

The FSM finds exactly one mapping for a given statsd metric name. The Prometheus metric name and labels are an output of that. I think what will happen in this case is that whichever mapping last matched will win. Again, I think that's okay as long as we document it.

@matthiasr
Copy link
Contributor

Nevermind my last comment, I was confusing myself about what level TTLs are applied at. I don't expect anyone to have multiple mappings to the same Prometheus metric name, labels and label values, so we don't need to cover that case. If you're really mapping multiple event streams to the same metric, all bets are off, but last-event-wins is what you get either way.

Copy link
Contributor

@matthiasr matthiasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Soooo cloooose …

The go.mod merge conflict is from #171 which fixed the build after Go 1.11.4 was released. Could you rebase once more please?

The README looks almost good, I proposed some edits for language, I would like to strike the mention of Kubernetes (in that case, TTLs are the wrong solution), and I wrote up #164 (comment)

Do you think you can change the tests to not be wall-clock-dependent? That tends to make them flaky in the long run, which I would really like to avoid because it will make future contributor's lives hard.

README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
exporter.go Outdated Show resolved Hide resolved
exporter_test.go Outdated Show resolved Hide resolved
README.md Show resolved Hide resolved
matthiasr pushed a commit that referenced this pull request Dec 17, 2018
I'd like to add more detail, but I'm not sure I understand the
implications just yet, and they will probably change with #164.

Signed-off-by: Matthias Rampke <mr@soundcloud.com>
- use MetricVec family instead of Metric
- dynamic label values instead of ConstLabels
- use dto.Metric to gain histrogram value in exporter_test
- remove hash calculations

Signed-off-by: Ivan Mikheykin <ivan.mikheykin@flant.com>
- ttl is hardcoded — should be in mapping.yaml
- works with metrics without labels

Signed-off-by: Ivan Mikheykin <ivan.mikheykin@flant.com>
Signed-off-by: Ivan Mikheykin <ivan.mikheykin@flant.com>
Signed-off-by: Ivan Mikheykin <ivan.mikheykin@flant.com>
Signed-off-by: Ivan Mikheykin <ivan.mikheykin@flant.com>
Signed-off-by: Ivan Mikheykin <ivan.mikheykin@flant.com>
Signed-off-by: Ivan Mikheykin <ivan.mikheykin@flant.com>
Signed-off-by: Ivan Mikheykin <ivan.mikheykin@flant.com>
@diafour
Copy link
Contributor Author

diafour commented Dec 19, 2018

The branch is rebased. New prometheus client brings an error "Histogram is not Observer". Get methods for Histogram and Summary now return prometheus.Observer to fix this issue.

All changes in README are applied, paragraph about k8s is deleted, your suggestion is added.

Tests are not depended on time.Sleep:

  • time.Now and time.NewTracker are wrapped within clock package to be overridden in tests
  • time.Sleep was used to wait for handleEvent to finish. Making events channel synchronous works well too.

@matthiasr
Copy link
Contributor

Thank you very much!

@matthiasr matthiasr merged commit 7364c6f into prometheus:master Dec 20, 2018
matthiasr pushed a commit that referenced this pull request Dec 20, 2018
Signed-off-by: Matthias Rampke <mr@soundcloud.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants