Add default prometheus metrics to clients #652

krav · 2022-03-14T15:55:52Z

This adds Prometheus client metrics equivalent to the existing statsd ones.

Caveat

I've added an identifying name to each client class that produces metrics. The name has a default in every case except the thrift client, where it can be inferred from the class. Without it, if two clients of the same class were instantiated, only the last one would report metrics. However, if clients are created dynamically at runtime, this can cause labels to disappear and reappear in the Prometheus server on restart.
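To make the caveat concrete, here is a minimal sketch using prometheus_client. The class `PoolMetrics`, the metric name `example_pool_max_size`, and the private registry are assumptions made for illustration and are not the PR's actual code. It shows that two instances sharing a label value end up sharing one series, where the later callback replaces the earlier one, while distinct names keep separate series.

```python
from prometheus_client import CollectorRegistry, Gauge

# A private registry keeps this sketch independent of the process-wide default.
registry = CollectorRegistry()


class PoolMetrics:
    """Hypothetical stand-in for a client class that owns a connection pool."""

    # Declared once on the class, so every instance shares the same Gauge.
    size_gauge = Gauge(
        "example_pool_max_size",
        "Maximum number of connections in this pool",
        ["pool"],
        registry=registry,
    )

    def __init__(self, max_size: int, name: str = "default"):
        self.max_size = max_size
        # One child (time series) exists per distinct label value.
        self.size_gauge.labels(name).set_function(lambda: self.max_size)


def reported_sizes() -> dict:
    """Map each `pool` label value to the value currently reported."""
    (family,) = PoolMetrics.size_gauge.collect()
    return {sample.labels["pool"]: sample.value for sample in family.samples}


# Two instances sharing the default name: the second replaces the first's callback.
PoolMetrics(5)
PoolMetrics(10)
same_name = reported_sizes()

# Distinct names give each instance its own series.
PoolMetrics(20, name="stalecache")
PoolMetrics(30, name="thing")
distinct_names = reported_sizes()
```

In this sketch the pool of size 5 is never reported once the second default-named instance exists, which is the behavior the identifying name is meant to avoid.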

fishy · 2022-03-14T18:58:11Z

baseplate/clients/thrift.py

+    PROM_PREFIX = "bp_thrift_pool"
+    PROM_LABELS = ["client_cls"]
+
+    promTotalConnections = Gauge(
+        f"{PROM_PREFIX}_size",
+        "Number of connections in this thrift pool",
+        PROM_LABELS,
+    )
+
+    promUsedConnections = Gauge(
+        f"{PROM_PREFIX}_in_use",
+        "Number of connections currently in use in this thrift pool",
+        PROM_LABELS,
+    )


🔕 In Baseplate.go, the rough equivalents of these 2 gauges (the Go and Python code do not map 100% onto each other) are thriftbp_client_pool_allocated_clients and thriftbp_client_pool_active_connections, with the thrift_slug label.

These are not core metrics defined in the baseplate spec, so they do not have to be the same between Go and Python (the implementation details differ). This is just for your reference, as it is likely still beneficial to make them as consistent as possible.

Yes, it would be useful to determine whether these metrics are equivalent enough to the Go versions. Having them use the same prefix would be nice, as it would make comparisons, alerting, and dashboards easier.

I looked at the redis implementation, and unfortunately the two are very different. I haven't looked at any of the others. Thrift is a likely candidate for being similar.

JessicaGreben

It might be nice to add some tests for these prom metrics.

JessicaGreben · 2022-03-17T17:36:24Z

baseplate/clients/memcache/__init__.py

+        pool = self.pooled_client.client_pool
+        self.promTotalConnections.labels(name).set_function(lambda: pool.max_size)
+        self.promFreeConnections.labels(name).set_function(lambda: len(pool.free))
+        self.promUsedConnections.labels(name).set_function(lambda: len(pool.used))


Why are the metrics set in the init func instead of in the report_memcache_runtime_metrics method?

Do we need the label "pool=memcache" when the metric name is "bp_memcached_pool"?

I think the pool label is there in case a service is connected to multiple memcached services.

For example, a service could be connected to pool="stalecache" and pool="thing".

Yes, that's what I tried to explain in the caveat section above. If you were to add the metric to the registry when creating the class, that is, if you called '.labels' there, then it would be shared by every instance of the class regardless of what it connects to.

This is seemingly a problem with the statsd metric too.

"Why are the metrics set in the init func instead of in the report_memcache_runtime_metrics method?"

Because that function is called periodically to push statsd data, while the Prometheus metrics are made on demand.

For prom, does "made on demand" mean a new instance of the class is initialized?

No. When you GET /metrics, the function passed to set_function is called to produce the number.

I see. Yes, that's great.
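The on-demand behavior described in this exchange can be sketched as follows. The metric name, the `used` list standing in for `pool.used`, and the `scrape` helper are assumptions for illustration. `generate_latest` renders the same text a /metrics handler would serve.

```python
from prometheus_client import CollectorRegistry, Gauge, generate_latest

# A private registry keeps the sketch self-contained.
registry = CollectorRegistry()

in_use = Gauge(
    "example_pool_active_connections",
    "Connections currently checked out (illustrative metric name)",
    ["pool"],
    registry=registry,
)

used = []  # hypothetical stand-in for pool.used

# Registered once, at construction time. The lambda is not evaluated here.
in_use.labels("memcache").set_function(lambda: len(used))


def scrape() -> str:
    """Render the exposition text, as a GET /metrics handler would."""
    return generate_latest(registry).decode()


before = scrape()
used.extend(["conn-1", "conn-2"])  # the pool changes; no gauge call is made
after = scrape()
```

Each call to `scrape` re-evaluates the callback, so no periodic reporting step is needed for these gauges.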

JessicaGreben · 2022-03-17T17:54:06Z

baseplate/clients/memcache/__init__.py

+        self.promTotalConnections.labels(name).set_function(lambda: pool.max_size)
+        self.promFreeConnections.labels(name).set_function(lambda: len(pool.free))
+        self.promUsedConnections.labels(name).set_function(lambda: len(pool.used))
+
    def report_memcache_runtime_metrics(self, batch: metrics.Client) -> None:


I wonder why this memcache method is named report_memcache_runtime_metrics when the equivalent method in all the other clients is report_runtime_metrics.

Hm, this might be a bug unless it's called from outside baseplate. It's not referenced anywhere, while the methods with the latter name are called from

baseplate.py/baseplate/__init__.py

Line 525 in 48bcb0b

elif hasattr(value, "report_runtime_metrics"):

Heh https://github.snooguts.net/reddit/reddit-service-post/blob/b0dbc37074a4299fdb916eaddd6b6cb1fe1fd568/post/__init__.py#L132

lol. i see. thanks for the explanation
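A rough sketch of why the differently named method would never be called by a dispatcher that looks hooks up by name. Only the `hasattr(value, "report_runtime_metrics")` check comes from the line quoted above. The classes, the `report_all` loop, and the use of a list as the batch are hypothetical.

```python
class MemcacheLikeClient:
    """Hypothetical client whose reporting hook uses a non-standard name."""

    def report_memcache_runtime_metrics(self, batch: list) -> None:
        batch.append("memcache")


class RedisLikeClient:
    """Hypothetical client that uses the conventional hook name."""

    def report_runtime_metrics(self, batch: list) -> None:
        batch.append("redis")


def report_all(clients: dict, batch: list) -> None:
    """Call the hook on every client that exposes it under the expected name."""
    for value in clients.values():
        if hasattr(value, "report_runtime_metrics"):
            value.report_runtime_metrics(batch)


batch = []
report_all({"cache": MemcacheLikeClient(), "redis": RedisLikeClient()}, batch)
# Only the client with the conventional method name has reported.
```

The memcache-like client is skipped silently here, so a hook with a one-off name only runs if some caller invokes it explicitly.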

JessicaGreben

This PR looks good. Optionally we can add a few tests if you want. Also we can make the thrift metrics more similar to baseplate.go. Thanks for this!

JessicaGreben · 2022-03-22T19:02:32Z

tests/unit/clients/memcache_tests.py

+            )
+        )
+        metric = ctx.promTotalConnections.collect()
+        self.assertEqual(metric[0].samples[0].value, float(max_pool_size))


would it be useful to assert the labels are set correctly?
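One way such a label assertion could look, as a sketch rather than the test that was merged. The gauge name and the private registry are assumptions. The `pool` label name and the `samples[0]` access follow the snippets in this thread.

```python
import unittest

from prometheus_client import CollectorRegistry, Gauge


class PoolGaugeLabelTests(unittest.TestCase):
    def test_sample_carries_pool_label(self):
        # A fresh registry avoids clashes with gauges registered elsewhere.
        registry = CollectorRegistry()
        gauge = Gauge(
            "example_memcached_pool_max_size",
            "Maximum size of this pool (illustrative metric name)",
            ["pool"],
            registry=registry,
        )
        max_pool_size = 10
        gauge.labels("memcache").set_function(lambda: max_pool_size)

        metric = gauge.collect()
        sample = metric[0].samples[0]

        # Check the value and the label set on the same sample.
        self.assertEqual(sample.value, float(max_pool_size))
        self.assertEqual(sample.labels, {"pool": "memcache"})
```

Comparing the whole `labels` dict also catches an unexpected extra label, which a lookup of a single key would miss.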


MelissaCole · 2022-03-30T18:18:50Z

baseplate/clients/memcache/__init__.py

+        PROM_LABELS,
+    )
+
+    def __init__(self, pooled_client: PooledClient, name: str = "memcache"):


🔕 Would None or "" work as a default since the metric name already has "memcache" in it?

Leaving the label empty removes it from the series, which can make the queries a bit more difficult. We can rename it "default" maybe?

MelissaCole · 2022-03-30T18:27:22Z

baseplate/clients/redis.py

+    PROM_LABELS = ["pool"]
+
+    totalConnections = Gauge(
+        f"{PROM_PREFIX}_connections",


Thoughts on making this (and clustered redis) f"{PROM_PREFIX}_size" for consistency with memcached and sqlalchemy implementations?

I wonder if we shouldn't do it the other way around, since it describes the contents of the pool.

The redis client in baseplate.go seems to have connections_total and connections_idle. Having something end in _total signals that it's a Counter instead of a Gauge, so I'd keep that one out, but I'm switching the other two.
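The `_total` convention mentioned here is visible in prometheus_client itself, which appends `_total` to the sample name of a Counter. A small sketch, with illustrative metric names that are not taken from the PR:

```python
from prometheus_client import CollectorRegistry, Counter, Gauge

registry = CollectorRegistry()

# A Counter only ever goes up; the client exposes it with a `_total` suffix.
opened = Counter(
    "example_pool_connections_opened",
    "Connections opened since process start (illustrative)",
    registry=registry,
)

# A Gauge can go up and down and keeps exactly the name it was given.
idle = Gauge(
    "example_pool_idle_connections",
    "Connections currently idle (illustrative)",
    registry=registry,
)

opened.inc(3)
idle.set(2)

counter_sample_names = [s.name for s in opened.collect()[0].samples]
gauge_sample_names = [s.name for s in idle.collect()[0].samples]
```

A gauge named with a trailing `_total` would therefore read like a counter to anyone writing queries, which is the reason given for keeping that suffix out of the pool gauges.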

MelissaCole · 2022-03-30T18:33:54Z

baseplate/clients/sqlalchemy.py

+    )
+
+    promCheckedOutConnections = Gauge(
+        f"{PROM_PREFIX}_checked_out",


Could this be in_use, to be consistent with thrift and memcached, with free used for checked_in?

We usually say active for connection pools.

Yeah, named it like this to be in line with the library vocabulary, but it does not make intuitive sense to me either. Changed it to active/idle.

MelissaCole

lgtm with some questions about naming conventions and consistency

nsheaps

Few updates that you've already discussed still need to be made. Thanks for tackling this!


She's out on vacation, I'm taking over the review

krav requested a review from a team as a code owner March 14, 2022 15:55

krav requested review from bradengroom and MelissaCole March 14, 2022 15:55

fishy reviewed Mar 14, 2022


JessicaGreben self-requested a review March 15, 2022 18:14

JessicaGreben suggested changes Mar 17, 2022


JessicaGreben reviewed Mar 17, 2022


JessicaGreben approved these changes Mar 17, 2022


JessicaGreben reviewed Mar 22, 2022


MelissaCole reviewed Mar 30, 2022


MelissaCole reviewed Mar 30, 2022


MelissaCole previously requested changes Mar 30, 2022


krav requested a review from MelissaCole April 19, 2022 13:28

nsheaps self-requested a review April 21, 2022 18:06

nsheaps suggested changes Apr 21, 2022


krav added 6 commits April 22, 2022 13:49

add prom metrics to redis client

b1abd9a

add prometheus metrics to clients

add library prefix

736fa84

add tests for redis, memcached

0752e43

rename a few things

f594785

rename

6153f07

rename

2518a57

krav force-pushed the k/metrics branch from 995d2bf to 2518a57 April 22, 2022 12:10

krav requested a review from nsheaps April 22, 2022 12:29

krav added 5 commits April 22, 2022 20:09

redis max_connections refers to max allowed in pool

47fb50d

use idle/active, unit last

5660143

thrift pool.size is max before blocking

d7b398e

sqlalchemy max connections

9cabcbd

redis max_connections refers to max allowed in pool

ec5d571

krav added 2 commits April 22, 2022 20:21

max_size not just size

6b4eae9

redis max not total

112744e

nsheaps approved these changes Apr 25, 2022


update tests

429e065

krav requested a review from nsheaps April 26, 2022 11:53

nsheaps approved these changes Apr 26, 2022


krav added 2 commits April 26, 2022 21:03

redis max not total

72997c5

select right sample for test

4662d49

krav merged commit f6da834 into reddit:develop Apr 27, 2022


