feat(symbolicator): Add a backoff timer when projects move in/out of the LPQ #29100

Merged
merged 12 commits into master from feat/lpq/backoff
Oct 11, 2021

Conversation

@relaxolotl (Contributor, author) commented Oct 6, 2021:

This adds a customizable backoff timer to the LPQ; it defaults to 5 minutes.

The timer is not meant to block all LPQ-related actions, such as metric gathering, after a project's eligibility has recently changed. It is only meant to prevent projects from rapidly entering and leaving the queue in a short period of time.

NATIVE-211
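
For context, a minimal sketch of how such a backoff can be kept in Redis, assuming a per-project key with a TTL. The helper name _register_backoffs comes from the diff below; the key layout, the constant, and the _is_backing_off helper are illustrative assumptions, not the PR's exact code:

```python
class RedisRealtimeMetricsStore:  # illustrative excerpt only
    BACKOFF_TTL_SECONDS = 5 * 60  # default backoff timer from the description above

    def _register_backoffs(self, project_ids):
        # Mark each project as having recently moved in/out of the LPQ.
        # The key expires on its own once the backoff window has passed.
        for project_id in project_ids:
            self.cluster.set(
                f"{self._prefix}:backoff:{project_id}", "1", ex=self.BACKOFF_TTL_SECONDS
            )

    def _is_backing_off(self, project_id):
        # Hypothetical helper: true while the project's backoff key still exists.
        return self.cluster.exists(f"{self._prefix}:backoff:{project_id}") > 0
```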

@relaxolotl requested review from flub, Swatinem, loewenheim and a team October 6, 2021 01:49
Comment on lines 253 to 262
```python
# If this successfully completes then the project is expected to be in the set.
was_added = int(self.cluster.sadd(LPQ_MEMBERS_KEY, project_id)) > 0
self._register_backoffs({project_id})
return was_added
```
@relaxolotl (Contributor, author) commented Oct 6, 2021:

something i'm still mulling over right now is whether it should be the responsibility of add_project_to_lpq and remove_project_from_lpq to determine the appropriate action to take based on backoffs, or whether that should be the responsibility of their callers (scan_for_suspect_projects and update_lpq_eligibility).

one may argue that the backoff timer is something the store should abstract away from its callers, and that all callers should care about is adding and removing projects when metrics and LPQ membership make it appropriate to do so.

the cost of abstracting this away, however, is that extra tasks will be spawned only to end up as no-ops, because the callers have zero awareness of the backoff timer.

the tradeoff is either losing simplicity at the task level in exchange for the efficiency gained from being backoff-aware, or accepting the inefficiency of needlessly spawning celery tasks that do nothing.
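
to make the caller-side option concrete, a rough sketch (the selection helper and task wiring are assumptions; recently_moved_projects is the store method exercised in the tests below):

```python
# Hypothetical sketch: the scan consults the backoff state up front and
# never spawns an eligibility task for a project that just moved.
def scan_for_suspect_projects(store):
    recently_moved = store.recently_moved_projects()
    for project_id in suspect_project_ids(store):  # assumed selection helper
        if project_id in recently_moved:
            continue  # skip work the backoff would turn into a no-op anyway
        update_lpq_eligibility.delay(project_id)  # celery task
```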

A reviewer (Member) commented:

I would maybe recommend making it a concern of the caller because:

  • less magic and surprises. the call does exactly what it says it does.
  • the backoff should be a business requirement, so it should be exposed prominently?
  • perf concerns that you mentioned.

@flub (Contributor) commented:

I'm afraid I'd argue the opposite: it makes more sense to put this in the abstraction instead of pushing all that complexity around and adding even more methods to this class. I think the trick is in naming things (isn't it always 😉). Currently these methods are named to directly take an action; instead they could be named to indicate the result of the caller's business logic: mark_low_priority() (and mark_normal_priority()).

Saving the computation really doesn't matter: this is a fixed cost which we basically always pay; avoiding it in a few exceptional moments is really not important imho.

It also means you can avoid creating yet more public methods on this class. It already has too many of them.
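
A sketch of what that renaming could look like as a method on the store, with the backoff folded in (the _is_backing_off guard is an assumed internal detail; sadd and _register_backoffs are from the diff above):

```python
# Hypothetical sketch: named for the caller's business decision rather than
# the store action; the store silently enforces the backoff window itself.
def mark_low_priority(self, project_id: int) -> bool:
    if self._is_backing_off(project_id):  # assumed internal guard
        return False  # still inside the backoff window; nothing changes
    was_added = int(self.cluster.sadd(LPQ_MEMBERS_KEY, project_id)) > 0
    self._register_backoffs({project_id})
    return was_added
```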

@relaxolotl (Contributor, author) commented:

i thought about it a little and i think floris is right that the number of tasks saved by pre-emptively checking the backoff timer during scanning might be pretty small in the grand scheme of things: assuming a default task frequency of 10s and a backoff timer of 5 minutes, you skip 30 tasks in total whenever a project has recently moved in or out of the LPQ.

we don't expect there to be much traffic when it comes to LPQ membership, which means that in theory we don't cut those 30 tasks too often. the number of projects being scanned easily dwarfs this number, and we anticipate there to be a steady stream of these task-specific projects being created at all times. this conclusion might be different for on-premise setups with a very small number of projects: moving 1 project into the LPQ could mean halving the number of tasks being created every 10s. it would also be neat if we could apply this task-saving logic when a massive number of projects enter the LPQ in an attempt to save on resources in a dire-looking situation, but that's probably not worth the effort to do automatically.
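
for concreteness, the arithmetic behind those figures (values are the defaults mentioned above, purely illustrative):

```python
backoff_seconds = 5 * 60         # default backoff timer
scan_interval_seconds = 10       # default task frequency
tasks_skipped_per_move = backoff_seconds // scan_interval_seconds  # == 30
```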

in any case, this axes the perf concern. settings such as the bucket size and timeout window feel like requirements similar in nature to the backoff timer; the backoff timer is exposed in the same way as the bucket settings, so i think we could make a case for hiding most of this in the store.

i'm unsure if the magic/surprise point can be completely addressed via naming; the methods to assign a project to the LPQ are doing two things, and i'm skeptical that any amount of naming can hide that. it might not be that big of a concern if this is called out in the docstrings, though? it's not as if we're obfuscating the code either.


```python
def test_recently_moved_projects_noop(store: RedisRealtimeMetricsStore) -> None:
    store.remove_projects_from_lpq({42})
    assert store.recently_moved_projects() == {42}
```
@Swatinem (Member) commented:

maybe a bit related to your Q about mocking/fixtures:

do all of the tests get a completely empty redis store? As in: are they completely independent? Or is it possible that the code is broken, but the project ends up in "recently moved" as a side effect of another test?

If so, it might be good to use unique IDs for each test and update the assertions accordingly.

@loewenheim (Contributor) commented:

Another way to do it would be to change store._prefix to something unique in each test.

@loewenheim (Contributor) commented Oct 6, 2021:

Like this:

```python
import sys

def test_name() -> str:
    # Name of the calling test function, used as a unique Redis key prefix.
    return sys._getframe(1).f_code.co_name

def test_increment_project_event_counter_same_bucket(
    store: RedisRealtimeMetricsStore, redis_cluster: redis._RedisCluster
) -> None:
    store._prefix = test_name()
    store.increment_project_event_counter(17, 1147)
    store.increment_project_event_counter(17, 1149)

    assert redis_cluster.get(f"{store._prefix}:counter:10:17:1140") == "2"
```

@relaxolotl (Contributor, author) commented:

from @Swatinem:

> do all of the tests get a completely empty redis store? As in: are they completely independent?

that's a good question; the fact that all of these tests pass despite using common project IDs and timestamps suggests to me that they all get fresh redis stores, as several assertions would break otherwise.

unfortunately i can't really explain how this works under the hood: if the fixture is invoked once per test, constructing a new RedisRealtimeMetricsStore with the specified config, then all it should be doing is connecting to the same test redis cluster the config points to. if that's not what's happening, how is this feature working in production? are redis test clusters automagically being torn down and brought back up between every test somewhere in the test framework?

if this reasoning isn't enough to provide a reasonable amount of confidence about false positives in tests, @loewenheim's prefix-setting strategy is a great quick fix for this problem. ideally, i think figuring out how to set the prefix automatically in test setup would make our lives easier in the long run, if we were to commit to this strategy.
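
a sketch of that idea using a pytest autouse fixture instead of setUp, assuming the existing store fixture (request.node.name is the standard pytest handle for the current test's name):

```python
import pytest

@pytest.fixture(autouse=True)
def unique_prefix(request, store):
    # Give every test its own Redis key namespace so tests can't observe
    # each other's leftover state, even against a shared test cluster.
    store._prefix = request.node.name
```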

@loewenheim (Contributor) commented:

@relaxolotl What do you mean by setUp here?

@relaxolotl (Contributor, author) commented:

sorry about that, i skipped a few steps in that sentence: i was referring to test setup, e.g. defining setUp on a test class, which then gets run for every test case.


@flub flub merged commit d414c5f into master Oct 11, 2021
@flub flub deleted the feat/lpq/backoff branch October 11, 2021 09:47
@github-actions github-actions bot locked and limited conversation to collaborators Oct 26, 2021