feat(symbolicator): Add a backoff timer when projects move in/out of the LPQ #29100

Merged
merged 12 commits into master from feat/lpq/backoff
Oct 11, 2021

Conversation

@relaxolotl (Contributor, author) commented Oct 6, 2021:

This adds a customizable backoff timer to the LPQ; it defaults to 5 minutes.

The timer is not meant to block all LPQ-related actions, such as metric gathering, after a project's eligibility has recently changed. It is only meant to prevent projects from rapidly entering and leaving the queue in a short period of time.

NATIVE-211
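
For context, a minimal sketch of how such a backoff can be kept in Redis, assuming a per-project key with a TTL. The helper name _register_backoffs comes from the diff below; the key layout, the constant, and the _is_backing_off helper are illustrative assumptions, not the PR's exact code:

```python
class RedisRealtimeMetricsStore:  # illustrative excerpt only
    BACKOFF_TTL_SECONDS = 5 * 60  # default backoff timer from the description above

    def _register_backoffs(self, project_ids):
        # Mark each project as having recently moved in/out of the LPQ.
        # The key expires on its own once the backoff window has passed.
        for project_id in project_ids:
            self.cluster.set(
                f"{self._prefix}:backoff:{project_id}", "1", ex=self.BACKOFF_TTL_SECONDS
            )

    def _is_backing_off(self, project_id):
        # Hypothetical helper: true while the project's backoff key still exists.
        return self.cluster.exists(f"{self._prefix}:backoff:{project_id}") > 0
```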

@relaxolotl requested review from flub, Swatinem, loewenheim and a team October 6, 2021 01:49
Comment on lines 253 to 262
```python
# If this successfully completes then the project is expected to be in the set.
was_added = int(self.cluster.sadd(LPQ_MEMBERS_KEY, project_id)) > 0
self._register_backoffs({project_id})
return was_added
```
@relaxolotl (Contributor, author) commented Oct 6, 2021:

something i'm still mulling over right now is whether it should be the responsibility of add_project_to_lpq and remove_project_from_lpq to determine the appropriate action to take based on backoffs, or whether that should be the responsibility of their callers (scan_for_suspect_projects and update_lpq_eligibility).

one may argue that the backoff timer is something the store should abstract away from its callers, and that all callers should care about is adding and removing projects when metrics and LPQ membership make it appropriate to do so.

the cost of abstracting this away, however, is that extra tasks will be spawned only to end up as no-ops, because the callers have zero awareness of the backoff timer.

the tradeoff is either losing simplicity at the task level in exchange for the efficiency gained from being backoff-aware, or accepting the inefficiency of needlessly spawning celery tasks that do nothing.
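
to make the caller-side option concrete, a rough sketch (the selection helper and task wiring are assumptions; recently_moved_projects is the store method exercised in the tests below):

```python
# Hypothetical sketch: the scan consults the backoff state up front and
# never spawns an eligibility task for a project that just moved.
def scan_for_suspect_projects(store):
    recently_moved = store.recently_moved_projects()
    for project_id in suspect_project_ids(store):  # assumed selection helper
        if project_id in recently_moved:
            continue  # skip work the backoff would turn into a no-op anyway
        update_lpq_eligibility.delay(project_id)  # celery task
```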

A reviewer (Member) commented:

I would maybe recommend making it a concern of the caller because:

  • less magic and surprises. the call does exactly what it says it does.
  • the backoff should be a business requirement, so it should be exposed prominently?
  • perf concerns that you mentioned.

@flub (Contributor) commented:

I'm afraid I'd argue the opposite: it makes more sense to put this in the abstraction instead of pushing all that complexity around and adding even more methods to this class. I think the trick is in naming things (isn't it always 😉). Currently these methods are named to directly take an action; instead they could be named to indicate the result of the caller's business logic: mark_low_priority() (and mark_normal_priority()).

Saving the computation really doesn't matter: this is a fixed cost which we basically always pay; avoiding it in a few exceptional moments is really not important imho.

It also means you can avoid creating yet more public methods on this class. It already has too many of them.
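
A sketch of what that renaming could look like as a method on the store, with the backoff folded in (the _is_backing_off guard is an assumed internal detail; sadd and _register_backoffs are from the diff above):

```python
# Hypothetical sketch: named for the caller's business decision rather than
# the store action; the store silently enforces the backoff window itself.
def mark_low_priority(self, project_id: int) -> bool:
    if self._is_backing_off(project_id):  # assumed internal guard
        return False  # still inside the backoff window; nothing changes
    was_added = int(self.cluster.sadd(LPQ_MEMBERS_KEY, project_id)) > 0
    self._register_backoffs({project_id})
    return was_added
```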

@relaxolotl (Contributor, author) commented:

i thought about it a little and i think floris is right that the number of tasks saved by pre-emptively checking the backoff timer during scanning might be pretty small in the grand scheme of things: assuming a default task frequency of 10s and a backoff timer of 5 minutes, you skip 30 tasks in total whenever a project has recently moved in or out of the LPQ.

we don't expect there to be much traffic when it comes to LPQ membership, which means that in theory we don't cut those 30 tasks too often. the number of projects being scanned easily dwarfs this number, and we anticipate there to be a steady stream of these task-specific projects being created at all times. this conclusion might be different for on-premise setups with a very small number of projects: moving 1 project into the LPQ could mean halving the number of tasks being created every 10s. it would also be neat if we could apply this task-saving logic when a massive number of projects enter the LPQ in an attempt to save on resources in a dire-looking situation, but that's probably not worth the effort to do automatically.
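
for concreteness, the arithmetic behind those figures (values are the defaults mentioned above, purely illustrative):

```python
backoff_seconds = 5 * 60         # default backoff timer
scan_interval_seconds = 10       # default task frequency
tasks_skipped_per_move = backoff_seconds // scan_interval_seconds  # == 30
```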

in any case, this axes the perf concern. settings such as the bucket size and timeout window feel like requirements similar in nature to the backoff timer; the backoff timer is exposed in the same way as the bucket settings, so i think we could make a case for hiding most of this in the store.

i'm unsure if the magic/surprise point can be completely addressed via naming; the methods to assign a project to the LPQ are doing two things, and i'm skeptical that any amount of naming can hide that. it might not be that big of a concern if this is called out in the docstrings, though? it's not as if we're obfuscating the code either.


```python
def test_recently_moved_projects_noop(store: RedisRealtimeMetricsStore) -> None:
    store.remove_projects_from_lpq({42})
    assert store.recently_moved_projects() == {42}
```
@Swatinem (Member) commented:

maybe a bit related to your Q about mocking/fixtures:

do all of the tests get a completely empty redis store? As in: are they completely independent? Or is it possible that the code is broken, but the project ends up in "recently moved" as a side effect of another test?

If so, it might be good to use unique IDs for each test and update the assertions accordingly.

@loewenheim (Contributor) commented:

Another way to do it would be to change store._prefix to something unique in each test.

@loewenheim (Contributor) commented Oct 6, 2021:

Like this:

```python
import sys

def test_name() -> str:
    # Name of the calling test function, used as a unique Redis key prefix.
    return sys._getframe(1).f_code.co_name

def test_increment_project_event_counter_same_bucket(
    store: RedisRealtimeMetricsStore, redis_cluster: redis._RedisCluster
) -> None:
    store._prefix = test_name()
    store.increment_project_event_counter(17, 1147)
    store.increment_project_event_counter(17, 1149)

    assert redis_cluster.get(f"{store._prefix}:counter:10:17:1140") == "2"
```

@relaxolotl (Contributor, author) commented:

from @Swatinem:

> do all of the tests get a completely empty redis store? As in: are they completely independent?

that's a good question; the fact that all of these tests pass despite using common project IDs and timestamps suggests to me that they all get fresh redis stores, as several assertions would break otherwise.

unfortunately i can't really explain how this works under the hood: if the fixture is invoked once per test, constructing a new RedisRealtimeMetricsStore with the specified config, then all it should be doing is connecting to the same test redis cluster the config points to. if that's not what's happening, how is this feature working in production? are redis test clusters automagically being torn down and brought back up between every test somewhere in the test framework?

if this reasoning isn't enough to provide a reasonable amount of confidence about false positives in tests, @loewenheim's prefix-setting strategy is a great quick fix for this problem. ideally, i think figuring out how to set the prefix automatically in test setup would make our lives easier in the long run, if we were to commit to this strategy.
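
a sketch of that idea using a pytest autouse fixture instead of setUp, assuming the existing store fixture (request.node.name is the standard pytest handle for the current test's name):

```python
import pytest

@pytest.fixture(autouse=True)
def unique_prefix(request, store):
    # Give every test its own Redis key namespace so tests can't observe
    # each other's leftover state, even against a shared test cluster.
    store._prefix = request.node.name
```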

@loewenheim (Contributor) commented:

@relaxolotl What do you mean by setUp here?

@relaxolotl (Contributor, author) commented:

sorry about that, i skipped a few steps in that sentence: i was referring to test setup, e.g. defining setUp on a test class, which then gets run for every test case.


@flub flub merged commit d414c5f into master Oct 11, 2021
@flub flub deleted the feat/lpq/backoff branch October 11, 2021 09:47
@github-actions github-actions bot locked and limited conversation to collaborators Oct 26, 2021