
fix: Return correct data in get_*_for_project methods #29037

Merged
loewenheim merged 21 commits into master from fix/lpq/get-projects on Oct 8, 2021

Conversation

loewenheim (Contributor)

This changes the methods get_counts_for_project and get_durations_for_project on RedisRealtimeMetricsStore to also return entries for time intervals that have no data stored in Redis. Such gaps occur whenever no events are recorded in an interval, since data is only written to Redis when something happens. To fix this, these methods now compute the keys they expect to find ahead of time, fetch what they can from Redis, and fill in the rest with default values.
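A minimal sketch of that approach, with illustrative names and key format (not the store's exact schema):

    import time

    def get_counts_for_project(cluster, project_id, bucket_size=10, time_window=120):
        # Compute every bucket key we expect within the window ahead of time,
        # not just the keys that happen to exist in Redis.
        now = int(time.time())
        first_bucket = (now - time_window) // bucket_size * bucket_size
        timestamps = list(range(first_bucket, now + 1, bucket_size))
        keys = [
            f"symbolicate_event_low_priority:counter:{bucket_size}:{project_id}:{ts}"
            for ts in timestamps
        ]
        # MGET returns None for keys that were never written (no events in
        # that interval); fill those gaps with a default count of 0.
        values = cluster.mget(keys)
        return {ts: int(v) if v is not None else 0 for ts, v in zip(timestamps, values)}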

@loewenheim loewenheim requested a review from a team October 4, 2021 13:43
@loewenheim (Contributor, Author)

Type checking fails because I'm not sure what the return type of mock_time should be.

@relaxolotl (Contributor)

overall this looks great, just left a few comments most of which are nitpicky. thanks for catching this!

@loewenheim (Contributor, Author)

Arpad's comments about the TTLs prompted me to think about the expiry logic a bit more, and I realized that it's currently unsound.

Example: Let counter_ttl = 30 (in seconds). Assume that at time 3, we record an event. Now what happens when we call get_counts_for_project($project, 35)? As it stands, the first bucket it returns is [0, 10) (35 - counter_ttl rounded down to the nearest multiple of 10), but that bucket expired at time 33!

This isn't especially hard to fix, but we need to decide what the correct behavior should be. In the example above, what should be the first bucket we get back? You can make a case for [0, 10) (it contains events that happened less than counter_ttl seconds ago) and [10, 20) (it's the first bucket that's entirely within counter_ttl seconds of now).
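Spelling out the arithmetic from the example (a sketch; the rounding expressions are illustrative):

    counter_ttl = 30
    bucket_size = 10
    now = 35

    # Current behavior: floor the window start to a bucket boundary.
    first_bucket = (now - counter_ttl) // bucket_size * bucket_size  # 5 -> 0

    # But the event recorded at time 3 lives in bucket [0, 10), and its key
    # expired at 3 + counter_ttl == 33, i.e. before now == 35.

    # The two candidate fixes round in opposite directions:
    floor_start = (now - counter_ttl) // bucket_size * bucket_size    # 0
    ceil_start = -((counter_ttl - now) // bucket_size) * bucket_size  # 10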

@loewenheim (Contributor, Author)

In my most recent commit, I did some renaming of options/fields/function parameters:

  • histogram → duration, because "histogram" doesn't really tell you what this is for
  • ttl → time_window, because as we discussed yesterday, that expresses the intent more clearly.

I also realized that it's totally fine for *_time_window to be less than *_bucket_size or even equal to 0, although bucket_size > time_window > 0 probably isn't too useful. But a value of 0 results in the perfectly sensible behavior that only the most recent bucket is returned.
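For a hypothetical bucket size of 10, the time_window = 0 case works out like this:

    bucket_size = 10
    time_window = 0
    now = 35

    first_bucket = (now - time_window) // bucket_size * bucket_size  # 30
    last_bucket = now // bucket_size * bucket_size                   # 30
    # first_bucket == last_bucket, so only the most recent bucket,
    # [30, 40), is returned.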

@flub (Contributor) left a comment

generally really like this!

Comment on lines +79 to +80
counts = realtime_metrics.get_counts_for_project(project_id, cutoff)
durations = realtime_metrics.get_durations_for_project(project_id, cutoff)
Contributor

I have rather mixed feelings about plugging the cutoff through instead of implicitly getting the current time in the methods, as it was before this PR. I think this is entirely for testing: you could make the parameter Optional, but you could also mock time.time() during testing and things would work nicely. Anyway, I don't mind if you'd prefer to keep it this way.
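A minimal sketch of the Optional-cutoff alternative being suggested here (the signature is hypothetical, not the PR's actual code):

    import time
    from typing import Optional

    class RedisRealtimeMetricsStore:
        def get_counts_for_project(self, project_id: int, cutoff: Optional[float] = None) -> dict:
            # Fall back to "now" when no explicit cutoff is supplied;
            # tests can pass a fixed cutoff or mock time.time() instead.
            if cutoff is None:
                cutoff = time.time()
            ...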

Contributor

i'll reply to this since i was the author of this code - i wanted to snapshot a single point in time defined by the scanning task, and pass that down to all of the tasks it triggers. this is reflected by the fact that this PR passes cutoff from scan_for_suspect_projects down to update_lpq_eligibility, further down to get_x_for_project as you've noted here.

my understanding was that we had two options:

  • let the innermost invocations determine the cutoff themselves, meaning that a single scan_... may trigger multiple update_...s with drifting timestamps. an update_... for project 9 might grab metrics from a time period that's slightly different from project 110's update_...
  • pin timestamps for all update_...s to some time determined by their parent scan_..., so that every update_... with a common parent scan_... makes its decision based on metrics from the same period of time. an update_... for project 9 will grab metrics from timestamp 42, and an update_... for project 110 will also grab metrics from timestamp 42 if the same scan_... task triggered them.

i went for the latter option in this case since i find it's a little easier to reason about timestamps and when tasks execute this way. tests are also easier to write without needing to resort to freezing time. i could be overlooking something here though. thoughts?
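A schematic of the latter option (the task names follow the ones mentioned above; candidate_project_ids and realtime_metrics are stand-ins for the real lookup and store):

    import time

    def scan_for_suspect_projects():
        # Snapshot a single cutoff in the parent task...
        cutoff = int(time.time())
        for project_id in candidate_project_ids():  # stand-in for the real lookup
            # ...and pass it to every child, so all children with a common
            # parent read metrics for exactly the same window, regardless of
            # when they actually execute.
            update_lpq_eligibility(project_id, cutoff)

    def update_lpq_eligibility(project_id, cutoff):
        # realtime_metrics: the RedisRealtimeMetricsStore instance
        counts = realtime_metrics.get_counts_for_project(project_id, cutoff)
        durations = realtime_metrics.get_durations_for_project(project_id, cutoff)
        ...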

Contributor

Arguably, freezing the time for all projects is not that desirable. If the workers for these tasks somehow get severely backlogged, they'll start their computations based on the wrong time and make wrong decisions. Since the decision they make is applied now, they should also base it on the most recent data, not some date from the past.

@loewenheim loewenheim enabled auto-merge (squash) October 8, 2021 14:04
@loewenheim loewenheim merged commit a906594 into master Oct 8, 2021
@loewenheim loewenheim deleted the fix/lpq/get-projects branch October 8, 2021 15:27
@github-actions github-actions bot locked and limited conversation to collaborators Oct 24, 2021