[Uptime] Improve snapshot timespan handling #58078
Closed
Conversation
This patch improves the handling of timespans with snapshot counts. The feature originally worked, but regressed when we increased the default timespan in the query context to 5m. Without this patch the counts you get are the maximum total number of monitors that were down over the past 5m, which is not very useful.

This patch improves the situation in two ways:

1. In low-cardinality situations, for instance when fewer than 1000 monitors are either up or down, we deliver an *exact* count of up/down monitors, using the monitor iterator to count the monitors in one state and subtracting that from the total. Even in slow situations this shouldn't take longer than ~1s.
2. In high-cardinality situations we follow the efficient snapshot count path, but filter it to the past 30s instead of the past 5m, which is more useful.

I've modified the API to report which method was used. We may need to add a note to the UI explaining that counts can be more stale in the 30s case, but I'm wondering if that may in fact be overkill.

There are no tests yet; I'd like us to review the general approach before adding them, since they will be non-trivial.
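The two-path selection described above can be sketched roughly as follows. The threshold constant, names, and return shape are illustrative assumptions, not the actual Kibana implementation:

```typescript
// Rough sketch of the two-path strategy (hypothetical names and shapes).
interface SnapshotCount {
  up: number;
  down: number;
  total: number;
  method: 'exact' | 'snapshot_30s'; // reported by the API per the PR
}

// Assumed cutoff for "low cardinality"; the PR mentions < 1000 monitors.
const LOW_CARDINALITY_THRESHOLD = 1000;

function countSnapshot(totalMonitors: number, downMonitorIds: string[]): SnapshotCount {
  if (totalMonitors < LOW_CARDINALITY_THRESHOLD) {
    // Exact path: count down monitors via the iterator, derive up by subtraction.
    const down = new Set(downMonitorIds).size;
    return { up: totalMonitors - down, down, total: totalMonitors, method: 'exact' };
  }
  // High-cardinality path: the efficient snapshot count, filtered to the
  // past 30s instead of 5m (the actual query is not modeled here).
  return { up: 0, down: 0, total: totalMonitors, method: 'snapshot_30s' };
}
```

Reporting `method` in the response is what lets the UI decide whether a staleness note is warranted.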
andrewvc added the bug (Fixes for quality problems that affect the customer experience), discuss, Team:Uptime - DEPRECATED (Synthetics & RUM sub-team of Application Observability), and v7.6.0 labels on Feb 20, 2020
Pinging @elastic/uptime (Team:uptime)
💔 Build Failed
Test Failures: Kibana Pipeline / kibana-xpack-agent / X-Pack API Integration Tests — x-pack/test/api_integration/apis/uptime/rest/snapshot.ts: "apis uptime uptime REST endpoints with generated data snapshot count when data is present with low cardinality with timespans included will count all statuses correctly"
Closing in favor of #58247.
andrewvc added a commit to andrewvc/kibana that referenced this pull request on Feb 21, 2020
When generating test data we refresh excessively; this can fill up the ES queues and break the tests when we run massive tests. I originally ran into this with elastic#58078, which I closed after finding a better approach. While none of our current tests have the scale to expose this problem, we certainly will add tests that do later, so we should keep this change.
andrewvc added a commit that referenced this pull request on Feb 24, 2020
Fixes #58079. This is an improved version of #58078.

Note: this is a bugfix targeting 7.6.1. I've decided to open this PR directly against 7.6 in the interest of time; we can forward-port it to 7.x / master later.

This patch improves the handling of timespans with snapshot counts. The feature originally worked, but regressed when we increased the default timespan in the query context to 5m. Without this patch the counts you get are the maximum total number of monitors that were down over the past 5m, which is not very useful.

We now use a scripted metric to always count precisely the number of up/down monitors. On my box this could process 400k summary docs in ~600ms, and it should scale as shards are added. I attempted to keep memory usage relatively low by using simple maps of strings.
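The scripted-metric approach can be simulated in plain TypeScript: each shard's map/combine phase keeps the latest status string per monitor id, and the reduce phase merges the per-shard maps into exact up/down counts. The names and document shape here are illustrative assumptions, not the shipped Painless scripts:

```typescript
// Simulation of the scripted-metric counting strategy (illustrative only).
interface SummaryDoc {
  monitorId: string;
  timestamp: number;
  status: 'up' | 'down';
}

type ShardState = Map<string, { ts: number; status: 'up' | 'down' }>;

// map/combine phase: keep the latest status seen per monitor on one shard.
function mapShard(docs: SummaryDoc[]): ShardState {
  const state: ShardState = new Map();
  for (const d of docs) {
    const prev = state.get(d.monitorId);
    if (!prev || d.timestamp > prev.ts) {
      state.set(d.monitorId, { ts: d.timestamp, status: d.status });
    }
  }
  return state;
}

// reduce phase: merge the per-shard maps (newest doc wins), then tally.
function reduceShards(shards: ShardState[]): { up: number; down: number; total: number } {
  const merged: ShardState = new Map();
  for (const shard of shards) {
    for (const [id, v] of shard) {
      const prev = merged.get(id);
      if (!prev || v.ts > prev.ts) merged.set(id, v);
    }
  }
  let up = 0;
  let down = 0;
  for (const v of merged.values()) {
    if (v.status === 'down') down++;
    else up++;
  }
  return { up, down, total: merged.size };
}
```

Because the per-shard state is just a small map of strings keyed by monitor id, memory stays bounded by monitor cardinality rather than document count, which matches the "simple maps of strings" note above.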
andrewvc added a commit to andrewvc/kibana that referenced this pull request on Feb 24, 2020
Fixes elastic#58079. This is an improved version of elastic#58078.
andrewvc added a commit that referenced this pull request on Feb 24, 2020
Fixes #58079. This is an improved version of #58078.
andrewvc added a commit to andrewvc/kibana that referenced this pull request on Feb 24, 2020
…elastic#58389) Fixes elastic#58079. This is an improved version of elastic#58078.
andrewvc added a commit that referenced this pull request on Feb 25, 2020
When generating test data we refresh excessively; this can fill up the ES queues and break the tests. Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
andrewvc added a commit to andrewvc/kibana that referenced this pull request on Feb 25, 2020
…58285) When generating test data we refresh excessively; this can fill up the ES queues and break the tests. Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
andrewvc added a commit that referenced this pull request on Feb 25, 2020
…58468) When generating test data we refresh excessively; this can fill up the ES queues and break the tests. Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
elasticmachine added a commit to dhurley14/kibana that referenced this pull request on Feb 25, 2020
…elastic#58389) (elastic#58415) Fixes elastic#58079. This is an improved version of elastic#58078. Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
Labels
bug (Fixes for quality problems that affect the customer experience)
discuss
release_note:fix
Team:Uptime - DEPRECATED (Synthetics & RUM sub-team of Application Observability)
v7.6.1
Summary
Fixes #58079
Note: this is a bugfix targeting 7.6.1. I've decided to open this PR directly against 7.6 in the interest of time; we can forward-port it to 7.x / master later.
This patch improves the handling of timespans with snapshot counts. The feature originally worked, but regressed when we increased the default timespan in the query context to 5m. Without this patch the counts you get are the maximum total number of monitors that were down over the past 5m, which is not very useful.
This patch improves the situation by gathering stats from the index and changing the query strategy based on those stats. In low-cardinality situations we deliver an exact up/down count; even in slow situations this shouldn't take longer than ~1s.
I've modified the API to report which method was used. We may need to add a note to the UI explaining that counts can be more stale in the 30s case, but I'm wondering if that may in fact be overkill.
This whole approach may change when we add support for the 'stale' state.
This PR also improves our doc generators in tests to refresh less often. Without this improvement we'd back up the Elasticsearch queues with needless refreshes; now we only refresh the index once we're done indexing.
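The indexing change can be sketched like this; `indexTestDocs` is a hypothetical name, and `client` stands in for an Elasticsearch JS client whose `bulk` and `indices.refresh` methods mirror its common shape:

```typescript
// Hypothetical sketch: bulk-index batches without per-request refresh,
// then refresh the index once at the end.
async function indexTestDocs(
  client: {
    bulk: (req: any) => Promise<any>;
    indices: { refresh: (req: any) => Promise<any> };
  },
  index: string,
  batches: object[][]
): Promise<void> {
  for (const batch of batches) {
    // No `refresh: true` here: per-batch refreshes are what backed up the queues.
    const body = batch.flatMap((doc) => [{ index: { _index: index } }, doc]);
    await client.bulk({ body });
  }
  // A single refresh once indexing is done makes all the docs searchable.
  await client.indices.refresh({ index });
}
```

Deferring the refresh trades immediate searchability during indexing for far less pressure on the refresh queues, which only matters once tests index data at scale.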