-
Notifications
You must be signed in to change notification settings - Fork 175
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add default prometheus metrics to clients #652
Conversation
baseplate/clients/thrift.py
Outdated
PROM_PREFIX = "bp_thrift_pool" | ||
PROM_LABELS = ["client_cls"] | ||
|
||
promTotalConnections = Gauge( | ||
f"{PROM_PREFIX}_size", | ||
"Number of connections in this thrift pool", | ||
PROM_LABELS, | ||
) | ||
|
||
promUsedConnections = Gauge( | ||
f"{PROM_PREFIX}_in_use", | ||
"Number of connections currently in use in this thrift pool", | ||
PROM_LABELS, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🔕 In Baseplate.go the equivalent (more or less, this is not really 100% mapped between go and py code) of these 2 gauges are: thriftbp_client_pool_allocated_clients
and thriftbp_client_pool_active_connections
with thrift_slug
label.
these are not core metrics defined in baseplate spec and do not have to be the same between go and py (as the implementation details differ), but just for your reference as it's likely still beneficial to make them as consistent as possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it would be useful to determine if these metrics are equivalent enough to the go versions. Having them use the same prefix would be nice, as it would make comparing / alerting / dashboards easier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did look at the redis implementation and they're unfortunately very different, but none of the others. Thrift is a likely candidate to be similar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be nice to add some tests for these prom metrics.
pool = self.pooled_client.client_pool | ||
self.promTotalConnections.labels(name).set_function(lambda: pool.max_size) | ||
self.promFreeConnections.labels(name).set_function(lambda: len(pool.free)) | ||
self.promUsedConnections.labels(name).set_function(lambda: len(pool.used)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are the metrics set in the init func instead of in the report_memcache_runtime_metrics method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need the label "pool=memcache" when the metric name is "bp_memcached_pool"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the pool
label is there for if a service is connected to multiple memcached services.
For example a service could be connected to pool="stalecache"
and pool="thing"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that’s what I tried to explain in the caveat section above. If you were to add the metric to the registry when creating the class, that is you call ‘.labels’ there, then that will be shared for each instance of the class regardless of what it connects to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is seemingly a problem with the statsd metric too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are the metrics set in the init func instead of in the report_memcache_runtime_metrics method?
Because that function is called periodically to push statsd data, while the prometheus metrics are made on demand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for prom "made on demand" meaning a new instance of the class is initialized?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, when you get /metrics
the function in set_function
is called to create the number.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i see. yes thats great.
self.promTotalConnections.labels(name).set_function(lambda: pool.max_size) | ||
self.promFreeConnections.labels(name).set_function(lambda: len(pool.free)) | ||
self.promUsedConnections.labels(name).set_function(lambda: len(pool.used)) | ||
|
||
def report_memcache_runtime_metrics(self, batch: metrics.Client) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i wonder why this memcache method is report_memcache_runtime_metrics
but all the other clients the equivalent method is report_runtime_metrics
. I wonder why the difference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, this might be a bug unless it's called outside of the baseplate. It's not referenced anywhere, while those with the latter name are called from
baseplate.py/baseplate/__init__.py
Line 525 in 48bcb0b
elif hasattr(value, "report_runtime_metrics"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lol. i see. thanks for the explanation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR looks good. Optionally we can add a few tests if you want. Also we can make the thrift metrics more similar to baseplate.go. Thanks for this!
tests/unit/clients/memcache_tests.py
Outdated
) | ||
) | ||
metric = ctx.promTotalConnections.collect() | ||
self.assertEqual(metric[0].samples[0].value, float(max_pool_size)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would it be useful to assert the labels are set correctly?
PROM_LABELS, | ||
) | ||
|
||
def __init__(self, pooled_client: PooledClient, name: str = "memcache"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🔕 Would None
or ""
work as a default since the metric name already has "memcache" in it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leaving the label empty removes which can make the queries a bit more difficult. We can rename it "default" maybe?
baseplate/clients/redis.py
Outdated
PROM_LABELS = ["pool"] | ||
|
||
totalConnections = Gauge( | ||
f"{PROM_PREFIX}_connections", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thoughts on making this (and clustered redis) f"{PROM_PREFIX}_size"
for consistency with memcached and sqlalchemy implementations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we shouldn't do it the other way around, since it describes the contents of the pool.
The redis client in baseplate.go seems to have connections_total
, connections_idle
. Having something end in _total
signals that it's a Counter instead of a Gauge, so I'd keep that out - but switching the other two.
baseplate/clients/sqlalchemy.py
Outdated
) | ||
|
||
promCheckedOutConnections = Gauge( | ||
f"{PROM_PREFIX}_checked_out", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this be in_use
to be consistent with thrift and memcached and free
for checked_in
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We usually say active
for connection pools.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, named it like this to be in line with the library vocabulary, but it does not make intuitive sense to me either. Changed it to active/idle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm with some questions about naming conventions and consistency
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few updates that you've already discussed still need to be made. Thanks for tackling this!
add prometheus metrics to clients
She's out on vacation, I'm taking over the review
This adds Prometheus client metrics equivalent to the existing statsd ones.
Caveat
I've added an identifying name to each of the client classes that produce metrics, defaulted in each case except for the thrift client where it can be inferred from the class. In the case of instantiating two clients of the same class the last to be instantiated would otherwise be the only one to report metrics. However if clients are dynamically created at runtime this can cause labels to disappear and reappear in the prometheus server on restart.