Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dex: 🪣 add two larger dex buckets #4489

Closed
wants to merge 1 commit into from

Conversation

cratelyn
Copy link
Contributor

fixes #4464.

this adds two larger buckets to the dex component's histograms.

when our dashboards calculate quantiles, we observed signals that some operations were taking longer than 100ms. to help obtain more accurate performance data, we add a 1 second and 10 second bucket.

✔️ checklist before requesting a review

  • if this code contains consensus-breaking changes, i have added the "consensus-breaking" label. otherwise, i declare my belief that there are not consensus-breaking changes, for the following reason:

    this only changes telemetry

@cratelyn cratelyn added A-telemetry Area: Metrics, logging, and other observability-related features A-dex Area: Relates to the dex labels May 28, 2024
@cratelyn cratelyn added this to the Sprint 7 milestone May 28, 2024
@cratelyn cratelyn self-assigned this May 28, 2024
@cratelyn cratelyn force-pushed the kate/add-larger-dex-buckets branch from 807efbd to c1b07ca Compare May 28, 2024 13:54
fixes #4464.

this adds two larger buckets to the dex component's histograms.

when our dashboards calculate quantiles, we observed signals that some
operations were taking longer than 100ms. to help obtain more accurate
performance data, we add a 1 second and 10 second bucket.
@cratelyn cratelyn force-pushed the kate/add-larger-dex-buckets branch from c1b07ca to 5e3bb25 Compare May 28, 2024 14:06
@cratelyn cratelyn marked this pull request as ready for review May 28, 2024 14:42
@hdevalence
Copy link
Member

Why are we manually configuring the buckets?

@cratelyn
Copy link
Contributor Author

Why are we manually configuring the buckets?

@hdevalence because we do not want to use summaries. see the comparison here, importantly it points out that observing values (e.g. updating the summary) is expensive, because quantiles are calculated eagerly.

we want to use a histogram, so we should provide the buckets explicitly. as is however, we have some durations that can't be observed well by the existing buckets; anything > 100ms is observed as "larger than 100ms" which is what causes the p99 pictured in #4464.

the other problem with summaries is that each quantile must be decided in advance, there isn't a way to later query for e.g. p50, if that wasn't already included. additionally, summaries aren't aggregable, so they are overall a poorer choice for this use-case.

@hdevalence
Copy link
Member

Hmm, but we only run the DEX every five seconds, and we have never collected any real performance information from it. How do we know that "greater than 100ms" is meaningful information? Wouldn't it be more appropriate to start with the more expensive and more precise method, collect information in a principled way, and only later move to a lossy bucketing, once we have some understanding of the actual signal?

The current bucket sizes are definitely wrong. How can we possibly choose better ones without actually collecting real data?

@cratelyn
Copy link
Contributor Author

cratelyn commented May 28, 2024

we use these buckets for https://github.com/penumbra-zone/penumbra/pull/4489/files#diff-f9527d0bd37d6370be90f6b20108b5e89bfd7be9f4af080a743e183bd97e6baeR51-R56 each of these five metrics. these are a bit of a "one size fits all" style approach, which will make it difficult to tune tightly to any particular histogram.

i am opening this in the spirit of providing a short-term, easy fix that gives us some more visibility into the durations that right now we can't see.

i agree that we certainly should revisit our metrics (in general) and come up with better configurations / tuning that conform closely to durations we observe in the wild. that sounds like a larger undertaking though, which we can move towards using data we collect after this lands.

@hdevalence
Copy link
Member

But it doesn't fix the problem. The original issue was me noticing that our current metrics are giving us false information, because of badly configured buckets. False information is much worse than none.

We should either

  1. Use a collection of buckets that are appropriate for all the metrics they are used for
  2. Avoid needing to specify hardcoded buckets in advance.

@cratelyn cratelyn marked this pull request as draft May 29, 2024 07:53
@cratelyn
Copy link
Contributor Author

#4502 adjusts the buckets to a logarithmically spaced group of buckets. closing this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-dex Area: Relates to the dex A-telemetry Area: Metrics, logging, and other observability-related features
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

dex: 🎩 investigate search path duration of 100ms
2 participants