dex: 🪣 add two larger dex buckets #4489

cratelyn · 2024-05-28T13:03:03Z

this adds two larger buckets to the dex component's histograms.

when our dashboards calculate quantiles, we observed signals that some operations were taking longer than 100ms. to help obtain more accurate performance data, we add a 1 second and 10 second bucket.

✔️ checklist before requesting a review

if this code contains consensus-breaking changes, i have added the "consensus-breaking" label. otherwise, i declare my belief that there are not consensus-breaking changes, for the following reason:

this only changes telemetry

fixes #4464. this adds two larger buckets to the dex component's histograms. when our dashboards calculate quantiles, we observed signals that some operations were taking longer than 100ms. to help obtain more accurate performance data, we add a 1 second and 10 second bucket.

hdevalence · 2024-05-28T14:43:29Z

Why are we manually configuring the buckets?

cratelyn · 2024-05-28T14:58:02Z

Why are we manually configuring the buckets?

@hdevalence because we do not want to use summaries. see the comparison here, importantly it points out that observing values (e.g. updating the summary) is expensive, because quantiles are calculated eagerly.

we want to use a histogram, so we should provide the buckets explicitly. as is however, we have some durations that can't be observed well by the existing buckets; anything > 100ms is observed as "larger than 100ms" which is what causes the p99 pictured in #4464.

the other problem with summaries is that each quantile must be decided in advance, there isn't a way to later query for e.g. p50, if that wasn't already included. additionally, summaries aren't aggregable, so they are overall a poorer choice for this use-case.

hdevalence · 2024-05-28T15:03:39Z

Hmm, but we only run the DEX every five seconds, and we have never collected any real performance information from it. How do we know that "greater than 100ms" is meaningful information? Wouldn't it be more appropriate to start with the more expensive and more precise method, collect information in a principled way, and only later move to a lossy bucketing, once we have some understanding of the actual signal?

The current bucket sizes are definitely wrong. How can we possibly choose better ones without actually collecting real data?

cratelyn · 2024-05-28T15:10:42Z

we use these buckets for https://github.com/penumbra-zone/penumbra/pull/4489/files#diff-f9527d0bd37d6370be90f6b20108b5e89bfd7be9f4af080a743e183bd97e6baeR51-R56 each of these five metrics. these are a bit of a "one size fits all" style approach, which will make it difficult to tune tightly to any particular histogram.

i am opening this in the spirit of providing a short-term, easy fix that gives us some more visibility into the durations that right now we can't see.

i agree that we certainly should revisit our metrics (in general) and come up with better configurations / tuning that conform closely to durations we observe in the wild. that sounds like a larger undertaking though, which we can move towards using data we collect after this lands.

hdevalence · 2024-05-28T15:39:34Z

But it doesn't fix the problem. The original issue was me noticing that our current metrics are giving us false information, because of badly configured buckets. False information is much worse than none.

We should either

Use a collection of buckets that are appropriate for all the metrics they are used for
Avoid needing to specify hardcoded buckets in advance.

cratelyn · 2024-05-30T04:35:19Z

#4502 adjusts the buckets to a logarithmically spaced group of buckets. closing this.

cratelyn added A-telemetry Area: Metrics, logging, and other observability-related features A-dex Area: Relates to the dex labels May 28, 2024

cratelyn added this to the Sprint 7 milestone May 28, 2024

cratelyn self-assigned this May 28, 2024

cratelyn had a problem deploying to smoke-test May 28, 2024 13:03 — with GitHub Actions Error

cratelyn force-pushed the kate/add-larger-dex-buckets branch from 807efbd to c1b07ca Compare May 28, 2024 13:54

cratelyn temporarily deployed to smoke-test May 28, 2024 13:54 — with GitHub Actions Inactive

cratelyn force-pushed the kate/add-larger-dex-buckets branch from c1b07ca to 5e3bb25 Compare May 28, 2024 14:06

cratelyn temporarily deployed to smoke-test May 28, 2024 14:06 — with GitHub Actions Inactive

cratelyn marked this pull request as ready for review May 28, 2024 14:42

cratelyn marked this pull request as draft May 29, 2024 07:53

cratelyn closed this May 30, 2024

cratelyn mentioned this pull request Jun 10, 2024

dex: 🎩 investigate search path duration of 100ms #4464

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

dex: 🪣 add two larger dex buckets #4489

dex: 🪣 add two larger dex buckets #4489

cratelyn commented May 28, 2024

hdevalence commented May 28, 2024

cratelyn commented May 28, 2024

hdevalence commented May 28, 2024

cratelyn commented May 28, 2024 •

edited

Loading

hdevalence commented May 28, 2024

cratelyn commented May 30, 2024

dex: 🪣 add two larger dex buckets #4489

dex: 🪣 add two larger dex buckets #4489

Conversation

cratelyn commented May 28, 2024

✔️ checklist before requesting a review

hdevalence commented May 28, 2024

cratelyn commented May 28, 2024

hdevalence commented May 28, 2024

cratelyn commented May 28, 2024 • edited Loading

hdevalence commented May 28, 2024

cratelyn commented May 30, 2024

cratelyn commented May 28, 2024 •

edited

Loading