dex: 🪣 add two larger dex buckets #4489
fixes #4464. this adds two larger buckets to the dex component's histograms. when our dashboards calculated quantiles, we saw signs that some operations take longer than 100ms. to get more accurate performance data, this adds a 1 second and a 10 second bucket.
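a minimal sketch of how cumulative, Prometheus-style histogram buckets record durations. the bucket bounds, the sample durations, and the names `OLD_BUCKETS`, `NEW_BUCKETS`, and `cumulative_counts` are hypothetical (the real bounds live in the PR diff and are not reproduced here). the point is that when the top finite bound is 100ms, a 0.4s and a 2.5s observation are indistinguishable and land only in `+Inf`, while the added 1s and 10s bounds separate them.

```rust
/// Hypothetical bucket upper bounds, in seconds. Not the PR's actual values.
const OLD_BUCKETS: [f64; 5] = [0.001, 0.005, 0.01, 0.05, 0.1];
/// The same hypothetical layout with a 1 second and a 10 second bucket appended.
const NEW_BUCKETS: [f64; 7] = [0.001, 0.005, 0.01, 0.05, 0.1, 1.0, 10.0];

/// Returns cumulative counts: one entry per finite bound, plus a trailing `+Inf`
/// entry. Entry `i` counts observations `<= bounds[i]`, as in the Prometheus
/// exposition format.
fn cumulative_counts(bounds: &[f64], observations: &[f64]) -> Vec<u64> {
    let mut counts = vec![0u64; bounds.len() + 1];
    for &v in observations {
        for (i, &b) in bounds.iter().enumerate() {
            if v <= b {
                counts[i] += 1;
            }
        }
        // The +Inf bucket counts every observation.
        counts[bounds.len()] += 1;
    }
    counts
}

fn main() {
    // Hypothetical durations in seconds: mostly fast, with a slow tail.
    let durations = [0.002, 0.004, 0.03, 0.08, 0.4, 2.5];
    // prints "old: [0, 2, 2, 3, 4, 6]"
    println!("old: {:?}", cumulative_counts(&OLD_BUCKETS, &durations));
    // prints "new: [0, 2, 2, 3, 4, 5, 6, 6]"
    println!("new: {:?}", cumulative_counts(&NEW_BUCKETS, &durations));
}
```

with the old layout, the two slow samples only show up as the gap between the last finite count (4) and the `+Inf` count (6).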
Why are we manually configuring the buckets?
@hdevalence because we do not want to use summaries. see the comparison here; importantly, it points out that observing values (i.e. updating the summary) is expensive, because quantiles are calculated eagerly. we want to use a histogram, so we have to provide the buckets explicitly. as it stands, however, some durations can't be observed well with the existing buckets: anything > 100ms is recorded only as "larger than 100ms", which is what produces the p99 pictured in #4464. the other problem with summaries is that each quantile must be chosen in advance; there isn't a way to later query for e.g.
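to see why the p99 in #4464 sits at 100ms, here is a simplified sketch of the linear interpolation that Prometheus's `histogram_quantile` performs over cumulative buckets. it omits edge cases the real function handles (e.g. non-monotonic buckets). when the target rank falls in `+Inf`, Prometheus returns the highest finite bound, and the lowest bucket is assumed to start at 0. `estimate_quantile` is a hypothetical name, and the bounds and counts are illustrative, not taken from penumbra.

```rust
/// Simplified sketch of Prometheus-style `histogram_quantile`.
/// `bounds` are finite upper bounds; `cumulative` has one extra trailing `+Inf` count.
fn estimate_quantile(q: f64, bounds: &[f64], cumulative: &[u64]) -> f64 {
    assert!(q > 0.0 && q <= 1.0, "q must be in (0, 1]");
    assert!(!bounds.is_empty(), "need at least one finite bucket");
    assert_eq!(cumulative.len(), bounds.len() + 1, "expected one extra +Inf count");

    let total = cumulative[bounds.len()] as f64;
    if total == 0.0 {
        return f64::NAN; // no observations, no quantile
    }
    let rank = q * total;
    // First bucket whose cumulative count reaches the target rank.
    let idx = cumulative
        .iter()
        .position(|&c| c as f64 >= rank)
        .expect("the +Inf bucket always reaches the rank");

    if idx == bounds.len() {
        // Rank falls in +Inf: all we know is "larger than the last bound",
        // so report the highest finite bound.
        return bounds[bounds.len() - 1];
    }
    // Interpolate linearly inside the bucket (lower, upper].
    let (lower, below) = if idx == 0 {
        (0.0, 0.0)
    } else {
        (bounds[idx - 1], cumulative[idx - 1] as f64)
    };
    let upper = bounds[idx];
    let in_bucket = cumulative[idx] as f64 - below;
    lower + (upper - lower) * (rank - below) / in_bucket
}

fn main() {
    // Illustrative cumulative counts for six samples, two of which exceed 100ms.
    let old_bounds = [0.001, 0.005, 0.01, 0.05, 0.1];
    let old_counts = [0u64, 2, 2, 3, 4, 6];
    let new_bounds = [0.001, 0.005, 0.01, 0.05, 0.1, 1.0, 10.0];
    let new_counts = [0u64, 2, 2, 3, 4, 5, 6, 6];
    println!("p99 old: {}", estimate_quantile(0.99, &old_bounds, &old_counts));
    println!("p99 new: {}", estimate_quantile(0.99, &new_bounds, &new_counts));
}
```

with these counts the old layout reports a p99 of exactly 0.1 no matter how slow the tail really is. with the 1s and 10s buckets the estimate comes out around 9.46s even though the slowest sample here was 2.5s, because interpolation spreads the tail linearly across the whole 1s to 10s bucket.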
Hmm, but we only run the DEX every five seconds, and we have never collected any real performance information from it. How do we know that "greater than 100ms" is meaningful information? Wouldn't it be more appropriate to start with the more expensive and more precise method, collect information in a principled way, and only later move to a lossy bucketing, once we have some understanding of the actual signal? The current bucket sizes are definitely wrong. How can we possibly choose better ones without actually collecting real data?
we use these buckets for each of these five metrics: https://github.com/penumbra-zone/penumbra/pull/4489/files#diff-f9527d0bd37d6370be90f6b20108b5e89bfd7be9f4af080a743e183bd97e6baeR51-R56. this is a bit of a "one size fits all" approach, which makes it difficult to tune tightly to any particular histogram. i am opening this in the spirit of a short-term, easy fix that gives us more visibility into the durations we currently can't see. i agree that we should revisit our metrics in general and come up with better configurations that closely match the durations we observe in the wild. that sounds like a larger undertaking, though, which we can work toward using the data we collect after this lands.
But it doesn't fix the problem. The original issue was me noticing that our current metrics are giving us false information, because of badly configured buckets. False information is much worse than none. We should either
#4502 adjusts the buckets to a logarithmically spaced group of buckets. closing this.
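#4502's exact bucket values aren't reproduced here, but a common way to build a logarithmically spaced layout is to multiply a starting bound by a constant factor. the sketch below assumes that approach; `exponential_buckets` and the parameters are hypothetical, not taken from #4502.

```rust
/// Generates `count` upper bounds starting at `start`, each `factor` times the previous.
/// Hypothetical helper; parameters are illustrative only.
fn exponential_buckets(start: f64, factor: f64, count: usize) -> Vec<f64> {
    assert!(start > 0.0, "start must be positive");
    assert!(factor > 1.0, "factor must be greater than 1");
    assert!(count > 0, "need at least one bucket");
    (0..count).map(|i| start * factor.powi(i as i32)).collect()
}

fn main() {
    // One bucket per decade, from 1ms to 10s.
    println!("{:?}", exponential_buckets(0.001, 10.0, 5));
    // Finer resolution: doubling from 1ms, 14 bounds.
    println!("{:?}", exponential_buckets(0.001, 2.0, 14));
}
```

`exponential_buckets(0.001, 2.0, 14)` yields 14 bounds from 1ms up to about 8.19s, so the relative resolution is the same at every scale instead of being fine below 100ms and absent above it.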
✔️ checklist before requesting a review
if this code contains consensus-breaking changes, i have added the "consensus-breaking" label. otherwise, i declare my belief that there are not consensus-breaking changes, for the following reason: