
fix: Return correct data in get_*_for_project methods #29037

Merged
loewenheim merged 21 commits into master from fix/lpq/get-projects on Oct 8, 2021

Conversation

loewenheim (Contributor)

This changes the methods get_counts_for_project and get_durations_for_project on RedisRealtimeMetricsStore to also return entries for time intervals that have no data stored in Redis. Such gaps occur whenever no events are recorded in an interval, since data is only written to Redis when something happens. To fix this, these methods now compute the keys they expect to find ahead of time, fetch what they can from Redis, and fill in the rest with default values.
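A minimal sketch of that approach, with illustrative names and key format (not the store's exact schema):

    import time

    def get_counts_for_project(cluster, project_id, bucket_size=10, time_window=120):
        # Compute every bucket key we expect within the window ahead of time,
        # not just the keys that happen to exist in Redis.
        now = int(time.time())
        first_bucket = (now - time_window) // bucket_size * bucket_size
        timestamps = list(range(first_bucket, now + 1, bucket_size))
        keys = [
            f"symbolicate_event_low_priority:counter:{bucket_size}:{project_id}:{ts}"
            for ts in timestamps
        ]
        # MGET returns None for keys that were never written (no events in
        # that interval); fill those gaps with a default count of 0.
        values = cluster.mget(keys)
        return {ts: int(v) if v is not None else 0 for ts, v in zip(timestamps, values)}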

@loewenheim loewenheim requested a review from a team October 4, 2021 13:43
@loewenheim (Contributor, Author)

Type checking fails because I'm not sure what the return type of mock_time should be.

@relaxolotl (Contributor)

overall this looks great, just left a few comments most of which are nitpicky. thanks for catching this!

@loewenheim (Contributor, Author)

Arpad's comments about the TTLs prompted me to think about the expiry logic a bit more, and I realized that it's currently unsound.

Example: Let counter_ttl = 30 (in seconds). Assume that at time 3, we record an event. Now what happens when we call get_counts_for_project($project, 35)? As it stands, the first bucket it returns is [0, 10) (35 - counter_ttl rounded down to the nearest multiple of 10), but that bucket expired at time 33!

This isn't especially hard to fix, but we need to decide what the correct behavior should be. In the example above, what should be the first bucket we get back? You can make a case for [0, 10) (it contains events that happened less than counter_ttl seconds ago) and [10, 20) (it's the first bucket that's entirely within counter_ttl seconds of now).
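Spelling out the arithmetic from the example (a sketch; the rounding expressions are illustrative):

    counter_ttl = 30
    bucket_size = 10
    now = 35

    # Current behavior: floor the window start to a bucket boundary.
    first_bucket = (now - counter_ttl) // bucket_size * bucket_size  # 5 -> 0

    # But the event recorded at time 3 lives in bucket [0, 10), and its key
    # expired at 3 + counter_ttl == 33, i.e. before now == 35.

    # The two candidate fixes round in opposite directions:
    floor_start = (now - counter_ttl) // bucket_size * bucket_size    # 0
    ceil_start = -((counter_ttl - now) // bucket_size) * bucket_size  # 10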

@loewenheim (Contributor, Author)

In my most recent commit, I did some renaming of options/fields/function parameters:

  • histogram → duration, because "histogram" doesn't really tell you what this is for
  • ttl → time_window, because as we discussed yesterday, that expresses the intent more clearly.

I also realized that it's totally fine for *_time_window to be less than *_bucket_size or even equal to 0, although bucket_size > time_window > 0 probably isn't too useful. But a value of 0 results in the perfectly sensible behavior that only the most recent bucket is returned.
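For a hypothetical bucket size of 10, the time_window = 0 case works out like this:

    bucket_size = 10
    time_window = 0
    now = 35

    first_bucket = (now - time_window) // bucket_size * bucket_size  # 30
    last_bucket = now // bucket_size * bucket_size                   # 30
    # first_bucket == last_bucket, so only the most recent bucket,
    # [30, 40), is returned.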

@flub (Contributor) left a comment

generally really like this!

Comment on lines +79 to +80
counts = realtime_metrics.get_counts_for_project(project_id, cutoff)
durations = realtime_metrics.get_durations_for_project(project_id, cutoff)
Contributor

I have rather mixed feelings about plugging the cutoff through instead of implicitly getting the current time in the methods, as it was before this PR. I think this is entirely for testing: you could make the parameter Optional, but you could also mock time.time() during testing and things would work nicely. Anyway, I don't mind if you'd prefer to keep it this way.
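A minimal sketch of the Optional-cutoff alternative being suggested here (the signature is hypothetical, not the PR's actual code):

    import time
    from typing import Optional

    class RedisRealtimeMetricsStore:
        def get_counts_for_project(self, project_id: int, cutoff: Optional[float] = None) -> dict:
            # Fall back to "now" when no explicit cutoff is supplied;
            # tests can pass a fixed cutoff or mock time.time() instead.
            if cutoff is None:
                cutoff = time.time()
            ...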

Contributor

i'll reply to this since i was the author of this code - i wanted to snapshot a single point in time defined by the scanning task, and pass that down to all of the tasks it triggers. this is reflected by the fact that this PR passes cutoff from scan_for_suspect_projects down to update_lpq_eligibility, further down to get_x_for_project as you've noted here.

my understanding was that we had two options:

  • let the innermost invocations determine the cutoff themselves, meaning that a single scan_... may trigger multiple update_...s with drifting timestamps. an update_... for project 9 might grab metrics from a time period that's slightly different from project 110's update_...
  • pin timestamps for all update_...s to some time determined by their parent scan_..., so that every update_... with a common parent scan_... makes its decision based on metrics from the same period of time. an update_... for project 9 will grab metrics from timestamp 42, and an update_... for project 110 will also grab metrics from timestamp 42 if the same scan_... task triggered them.

i went for the latter option in this case since i find it's a little easier to reason about timestamps and when tasks execute this way. tests are also easier to write without needing to resort to freezing time. i could be overlooking something here though. thoughts?
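A schematic of the latter option (the task names follow the ones mentioned above; candidate_project_ids and realtime_metrics are stand-ins for the real lookup and store):

    import time

    def scan_for_suspect_projects():
        # Snapshot a single cutoff in the parent task...
        cutoff = int(time.time())
        for project_id in candidate_project_ids():  # stand-in for the real lookup
            # ...and pass it to every child, so all children with a common
            # parent read metrics for exactly the same window, regardless of
            # when they actually execute.
            update_lpq_eligibility(project_id, cutoff)

    def update_lpq_eligibility(project_id, cutoff):
        # realtime_metrics: the RedisRealtimeMetricsStore instance
        counts = realtime_metrics.get_counts_for_project(project_id, cutoff)
        durations = realtime_metrics.get_durations_for_project(project_id, cutoff)
        ...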

Contributor

Arguably, freezing the time for all projects is not that desirable. If the workers for these tasks somehow get severely backlogged, they'll start their computations based on the wrong time and make wrong decisions. Since the decision they make is applied now, they should also base it on the most recent data, not some date from the past.

@loewenheim loewenheim enabled auto-merge (squash) October 8, 2021 14:04
@loewenheim loewenheim merged commit a906594 into master Oct 8, 2021
@loewenheim loewenheim deleted the fix/lpq/get-projects branch October 8, 2021 15:27
@github-actions github-actions bot locked and limited conversation to collaborators Oct 24, 2021