Add default prometheus metrics to clients #652

Merged
merged 16 commits into from
Apr 27, 2022

Collaborator

@krav krav commented Mar 14, 2022

This adds Prometheus client metrics equivalent to the existing statsd ones.

Caveat

I've added an identifying name to each of the client classes that produce metrics. It has a default in every case except the thrift client, where it can be inferred from the class. Without it, when two clients of the same class are instantiated, only the last one instantiated would report metrics. However, if clients are created dynamically at runtime, this can cause labels to disappear and reappear in the Prometheus server on restart.
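For reference, a minimal sketch of the collision described in this caveat. The `ExampleClient` class and the `bp_example_pool_size` metric name are hypothetical and not taken from the PR; only the `prometheus_client` calls are real. It shows two instances that share the default name, so the second callback replaces the first.

```python
from prometheus_client import CollectorRegistry, Gauge

# Private registry so this sketch does not touch the process-wide default one.
registry = CollectorRegistry()

# Hypothetical metric name, used only for illustration.
pool_size = Gauge(
    "bp_example_pool_size",
    "Maximum number of connections in the pool",
    ["pool"],
    registry=registry,
)


class ExampleClient:
    """Hypothetical stand-in for a client class that owns a connection pool."""

    def __init__(self, max_size: int, name: str = "default"):
        self.max_size = max_size
        # labels(name) returns the single child for this label value, so a
        # second instance using the same name replaces the earlier callback.
        pool_size.labels(name).set_function(lambda: self.max_size)


first = ExampleClient(max_size=10)
second = ExampleClient(max_size=25)  # same default name as `first`

# Only one series exists, and it reports the second client's pool size.
series = pool_size.collect()[0].samples
shadowed = registry.get_sample_value("bp_example_pool_size", {"pool": "default"})
print(len(series), shadowed)
```

Giving each instance its own name avoids the collision, at the cost of one series per name.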

@krav krav requested a review from a team as a code owner March 14, 2022 15:55
@krav krav requested review from bradengroom and MelissaCole March 14, 2022 15:55
Comment on lines 73 to 87
PROM_PREFIX = "bp_thrift_pool"
PROM_LABELS = ["client_cls"]

promTotalConnections = Gauge(
    f"{PROM_PREFIX}_size",
    "Number of connections in this thrift pool",
    PROM_LABELS,
)

promUsedConnections = Gauge(
    f"{PROM_PREFIX}_in_use",
    "Number of connections currently in use in this thrift pool",
    PROM_LABELS,
)
Member

🔕 In Baseplate.go the equivalent (more or less, this is not really 100% mapped between go and py code) of these 2 gauges are: thriftbp_client_pool_allocated_clients and thriftbp_client_pool_active_connections with thrift_slug label.

these are not core metrics defined in baseplate spec and do not have to be the same between go and py (as the implementation details differ), but just for your reference as it's likely still beneficial to make them as consistent as possible.

Contributor

Yes, it would be useful to determine if these metrics are equivalent enough to the go versions. Having them use the same prefix would be nice, as it would make comparing / alerting / dashboards easier.

Collaborator Author

I did look at the redis implementation, and unfortunately the two are very different. I have not looked at any of the others. Thrift is a likely candidate to be similar.

@JessicaGreben JessicaGreben self-requested a review March 15, 2022 18:14
Contributor

@JessicaGreben JessicaGreben left a comment

It might be nice to add some tests for these prom metrics.
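One possible shape for such a test, as a sketch only. `FakePool` and the metric name are hypothetical stand-ins, and the gauge is registered on a private `CollectorRegistry` so the test does not depend on global state.

```python
import unittest

from prometheus_client import CollectorRegistry, Gauge


class FakePool:
    """Hypothetical pool exposing the `used` list the gauge callback reads."""

    def __init__(self) -> None:
        self.used: list = []


class PoolGaugeTests(unittest.TestCase):
    def setUp(self) -> None:
        self.registry = CollectorRegistry()
        self.pool = FakePool()
        in_use = Gauge(
            "bp_example_pool_in_use",  # hypothetical metric name
            "Connections currently in use",
            ["pool"],
            registry=self.registry,
        )
        in_use.labels("test").set_function(lambda: len(self.pool.used))

    def read_gauge(self):
        # get_sample_value triggers a collection, which runs the callback.
        return self.registry.get_sample_value(
            "bp_example_pool_in_use", {"pool": "test"}
        )

    def test_gauge_follows_pool_state(self) -> None:
        self.assertEqual(self.read_gauge(), 0.0)
        self.pool.used.append(object())
        self.assertEqual(self.read_gauge(), 1.0)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(PoolGaugeTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```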

pool = self.pooled_client.client_pool
self.promTotalConnections.labels(name).set_function(lambda: pool.max_size)
self.promFreeConnections.labels(name).set_function(lambda: len(pool.free))
self.promUsedConnections.labels(name).set_function(lambda: len(pool.used))
Contributor

why are the metrics set in the init func instead of in the report_memcache_runtime_metrics method?

Contributor

do we need the label "pool=memcache" when the metric name is "bp_memcached_pool"?

Contributor

@bjk-reddit bjk-reddit Mar 17, 2022

I think the pool label is there for if a service is connected to multiple memcached services.

For example a service could be connected to pool="stalecache" and pool="thing"
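A small sketch of that situation, assuming a hypothetical metric name and modelling each pool as a plain list of free connections. Each pool gets its own series, distinguished by the `pool` label.

```python
from prometheus_client import CollectorRegistry, Gauge

registry = CollectorRegistry()

free_connections = Gauge(
    "bp_example_memcached_pool_free_connections",  # hypothetical name
    "Number of free connections in the pool",
    ["pool"],
    registry=registry,
)

# Hypothetical pools: each value is the list of currently free connections.
pools = {"stalecache": ["c1", "c2", "c3"], "thing": ["c1"]}

for pool_name, free in pools.items():
    # Bind `free` as a default argument so each callback keeps its own pool
    # instead of all of them closing over the loop variable.
    free_connections.labels(pool_name).set_function(lambda free=free: len(free))

stalecache_free = registry.get_sample_value(
    "bp_example_memcached_pool_free_connections", {"pool": "stalecache"}
)
thing_free = registry.get_sample_value(
    "bp_example_memcached_pool_free_connections", {"pool": "thing"}
)
print(stalecache_free, thing_free)
```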

Collaborator Author

Yes, that's what I tried to explain in the caveat section above. If you were to add the metric to the registry when creating the class (that is, call '.labels' there), it would be shared by every instance of the class regardless of what it connects to.

Collaborator Author

This is seemingly a problem with the statsd metric too.

Collaborator Author

"why are the metrics set in the init func instead of in the report_memcache_runtime_metrics method?"

Because that function is called periodically to push statsd data, while the prometheus metrics are made on demand.

Contributor

For prom, does "made on demand" mean that a new instance of the class is initialized?

Collaborator Author

No. When /metrics is requested, the function passed to set_function is called to produce the current value.
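To illustrate the on-demand behaviour, here is a sketch with a hypothetical metric name and a module-level call counter. `generate_latest` renders the text that a /metrics handler would return, so it stands in for a scrape here.

```python
from prometheus_client import CollectorRegistry, Gauge, generate_latest

registry = CollectorRegistry()
free_gauge = Gauge(
    "bp_example_pool_free",  # hypothetical metric name
    "Free connections in the pool",
    ["pool"],
    registry=registry,
)

free_connections = ["a", "b", "c"]
calls = 0


def count_free() -> int:
    """Callback the gauge runs each time the registry is collected."""
    global calls
    calls += 1
    return len(free_connections)


free_gauge.labels("demo").set_function(count_free)
calls_before_scrape = calls  # registering the callback does not invoke it

# Stand-in for a scrape of /metrics: the callback runs during rendering.
exposition = generate_latest(registry).decode()
calls_after_scrape = calls

# The next collection sees the pool's new state without any explicit set().
free_connections.pop()
value_after_change = registry.get_sample_value(
    "bp_example_pool_free", {"pool": "demo"}
)
```

This is why the statsd reporting method, which is called periodically to push values, is not the right place for the Prometheus gauges.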

Contributor

I see. Yes, that's great.

self.promTotalConnections.labels(name).set_function(lambda: pool.max_size)
self.promFreeConnections.labels(name).set_function(lambda: len(pool.free))
self.promUsedConnections.labels(name).set_function(lambda: len(pool.used))

def report_memcache_runtime_metrics(self, batch: metrics.Client) -> None:
Contributor

I wonder why this memcache method is named report_memcache_runtime_metrics, while the equivalent method on all the other clients is report_runtime_metrics.

Collaborator Author

Hm, this might be a bug unless it's called from outside of baseplate. It's not referenced anywhere, while the methods with the latter name are called from:

elif hasattr(value, "report_runtime_metrics"):
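A simplified sketch of why the differently named method would never run under this kind of duck-typed check. The two client classes and the `report_all` helper are hypothetical; only the `hasattr` test mirrors the line quoted above.

```python
from typing import List


class RedisLikeClient:
    """Hypothetical client that uses the conventional method name."""

    def report_runtime_metrics(self, batch: List[str]) -> None:
        batch.append("redis")


class MemcacheLikeClient:
    """Hypothetical client whose method name differs from the convention."""

    def report_memcache_runtime_metrics(self, batch: List[str]) -> None:
        batch.append("memcache")


def report_all(clients: list, batch: List[str]) -> None:
    """Call the reporting hook on every object that exposes it by name."""
    for value in clients:
        if hasattr(value, "report_runtime_metrics"):
            value.report_runtime_metrics(batch)


reported: List[str] = []
report_all([RedisLikeClient(), MemcacheLikeClient()], reported)
print(reported)  # ['redis'], the memcache-like client is silently skipped
```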

Contributor

lol. i see. thanks for the explanation

Contributor

@JessicaGreben JessicaGreben left a comment

This PR looks good. Optionally we can add a few tests if you want. Also we can make the thrift metrics more similar to baseplate.go. Thanks for this!

)
)
metric = ctx.promTotalConnections.collect()
self.assertEqual(metric[0].samples[0].value, float(max_pool_size))
Contributor

would it be useful to assert the labels are set correctly?
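A sketch of what such a label assertion could look like, using the same `collect()` access pattern as the test above. The metric name, label value, and pool size are hypothetical.

```python
from prometheus_client import CollectorRegistry, Gauge

registry = CollectorRegistry()
total_connections = Gauge(
    "bp_example_pool_size",  # hypothetical metric name
    "Maximum number of connections in the pool",
    ["pool"],
    registry=registry,
)

max_pool_size = 5
total_connections.labels("stalecache").set_function(lambda: max_pool_size)

# collect() returns a list of metric families; each sample carries its
# name, its labels as a dict, and its value.
metric = total_connections.collect()
sample = metric[0].samples[0]

assert sample.name == "bp_example_pool_size"
assert sample.labels == {"pool": "stalecache"}
assert sample.value == float(max_pool_size)
```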

PROM_LABELS,
)

def __init__(self, pooled_client: PooledClient, name: str = "memcache"):

🔕 Would None or "" work as a default since the metric name already has "memcache" in it?

Collaborator Author

Leaving the label empty removes it, which can make the queries a bit more difficult. We can rename it "default" maybe?

PROM_LABELS = ["pool"]

totalConnections = Gauge(
f"{PROM_PREFIX}_connections",

Thoughts on making this (and clustered redis) f"{PROM_PREFIX}_size" for consistency with memcached and sqlalchemy implementations?

Collaborator Author

I wonder if we shouldn't do it the other way around, since it describes the contents of the pool.

The redis client in baseplate.go seems to have connections_total and connections_idle. A name ending in _total signals a Counter rather than a Gauge, so I'd keep that out, but switch the other two.
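As a small illustration of the `_total` convention, the Python client itself exposes a Counter's value under a sample name ending in `_total`, while a Gauge is exposed under its name unchanged. The metric names below are hypothetical.

```python
from prometheus_client import CollectorRegistry, Counter, Gauge

registry = CollectorRegistry()

requests = Counter(
    "bp_example_requests",  # hypothetical name; the client appends _total
    "Requests handled",
    registry=registry,
)
connections = Gauge(
    "bp_example_pool_connections",  # hypothetical name, exposed as-is
    "Connections in the pool",
    registry=registry,
)

requests.inc()
connections.set(8)

counter_sample_names = [s.name for s in requests.collect()[0].samples]
gauge_sample_names = [s.name for s in connections.collect()[0].samples]
print(counter_sample_names, gauge_sample_names)
```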

)

promCheckedOutConnections = Gauge(
f"{PROM_PREFIX}_checked_out",

Could this be in_use, to be consistent with thrift and memcached, and free for checked_in?

Contributor

We usually say active for connection pools.

Collaborator Author

Yeah, named it like this to be in line with the library vocabulary, but it does not make intuitive sense to me either. Changed it to active/idle.

@MelissaCole MelissaCole left a comment

lgtm with some questions about naming conventions and consistency

@krav krav requested a review from MelissaCole April 19, 2022 13:28
@nsheaps nsheaps self-requested a review April 21, 2022 18:06
Contributor

@nsheaps nsheaps left a comment

Few updates that you've already discussed still need to be made. Thanks for tackling this!

baseplate/clients/redis_cluster.py Outdated
baseplate/clients/thrift.py Outdated
baseplate/clients/thrift.py Outdated
@krav krav requested a review from nsheaps April 22, 2022 12:29
@krav krav requested a review from nsheaps April 26, 2022 11:53
@nsheaps nsheaps dismissed MelissaCole’s stale review April 26, 2022 21:19

She's out on vacation, I'm taking over the review

@krav krav merged commit f6da834 into reddit:develop Apr 27, 2022
Labels
None yet
6 participants