
feat(symbolicator): Record metrics for projects being assigned and unassigned to the LPQ #29229

Merged: relaxolotl merged 12 commits into master from feat/lpq/project-metrics on Oct 14, 2021

Conversation

@relaxolotl (Contributor) commented Oct 9, 2021:

Adds two major changes:

  1. Reports project-related metrics that might be of interest to the LPQ, as requested by ops:
    • the number of projects selected for the LPQ
    • the number of projects manually forced into the LPQ via the killswitch
    • the number of projects manually excluded from the LPQ via the killswitch
  2. Sends more detailed warnings to Sentry when a project is moved in or out of the LPQ
    • adds project id
    • includes reason for change

Change 2's implementation is heavily inspired by existing code in getsentry. Change 1 is a best-effort attempt given the tools known to me.

One drawback to change 1's implementation is that it can take up to 10 seconds for a change to the manual kill switch to be reflected in our metrics. However, this seems preferable to the alternative of reporting whenever an event is processed, since there is no guarantee that events will always be hitting Sentry. Reporting in both places is an option, though.
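For illustration, a minimal sketch of what the periodic recording could look like; the metric names, the store accessors, and the gauge helper are assumptions made for this example, not the exact code in this PR:

from sentry.utils import metrics  # gauge() is assumed to exist here for this sketch

def _record_metrics(store, manually_included, manually_excluded):
    # Hypothetical helper: names and store accessors are illustrative only.
    # Number of projects the automatic scan currently has selected for the LPQ.
    metrics.gauge("symbolication.lpq.selected_projects", len(store.get_lpq_projects()))
    # Projects manually forced into or out of the LPQ via the killswitches.
    metrics.gauge("symbolication.lpq.manually_included", len(manually_included))
    metrics.gauge("symbolication.lpq.manually_excluded", len(manually_excluded))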

@flub (Contributor) commented Oct 13, 2021:

I'm very confused about metrics and Datadog right now. The acceptance criteria of NATIVE-203 only say this:

Datadog dashboards show reliable symbolication durations

NATIVE-211 says:

There are metrics showing selected projects
There are alerts and sentry errors showing selected projects

I'm not sure which of these this PR is trying to address?

@relaxolotl (Contributor, Author) commented:

@flub

There are metrics showing selected projects

This doesn't show which projects are selected, just the number. I couldn't recall whether we were fine with tagging these metrics with the selected projects due to cardinality concerns.

@relaxolotl (Contributor, Author) commented:

There was a discussion outside of this PR that raised a point of contention about the exact value that should be reported under this metric. It was decided that this would be left up to ops to determine what they would like to see recorded; the PR will be updated accordingly based on their preferences.

cc @flub @Swatinem @loewenheim

@relaxolotl (Contributor, Author) commented:

Reworked the implementation and updated the description to reflect those changes.

cc @oioki: if you have time, take a look at the metrics bits and sanity-check that I'm logging the values you're expecting to be logged.

Comment on lines +107 to +108
if not reason:
    reason = "unknown"
Contributor:

Why did you make reason optional? AFAIK all callers provide a reason.

Contributor (Author):

I don't really have any good reason 🙃 Fixed up the types, thanks for catching this.

Contributor (Author):

I'm keeping this check in case an empty string does get passed in, though.

try:
    _update_lpq_eligibility(project_id, cutoff)
finally:
    _record_metrics()
Contributor:

These same metrics are now emitted a lot:

  1. Once after the scan of all projects is completed; some computation tasks are probably still running at this time, so the result is not "definitive" (insofar as we can ever get a definitive result).
  2. Once after each project's calculation is finished (that's what this line here does). This is... a bit overkill and still does not give a "definitive" metric.

I think there are two approaches to this:

  1. Don't care about "definitive", the next metric is emitted in 10s time and will include the results of the computations that are still running. This is obviously the simplest.
  2. Spawn a separate task to record the metrics after all the computation tasks have finished. This is something like https://docs.celeryproject.org/en/stable/userguide/canvas.html#chords

After reading option 2 I really don't feel so bad about choosing option 1 anymore.
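For reference, a rough sketch of what option 2 could look like with a Celery chord; the task names and bodies below are made up for illustration and are not the actual tasks in this PR:

from celery import chord, shared_task

@shared_task
def update_lpq_eligibility(project_id, cutoff):
    # Per-project eligibility calculation (stubbed out in this sketch).
    ...

@shared_task
def record_metrics(_results):
    # Chord callback: receives the list of results from all header tasks,
    # so the metrics are emitted exactly once, after every calculation has finished.
    ...

def scan_for_suspect_projects(project_ids, cutoff):
    # Kick off one task per project, then record metrics once they have all completed.
    header = [update_lpq_eligibility.s(project_id, cutoff) for project_id in project_ids]
    chord(header)(record_metrics.s())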

Contributor (Author):

Yeah, as you've noted, one of the metrics becomes "eventually" correct and needs a bit of time to update itself.

It's unfortunate that we're leaking this implementation detail out into our metrics, but it looks like the best compromise, especially since the scanning doesn't require a final synchronization step to do its job properly. As you've probably noticed, that makes option 2 hard to commit to just to ensure our metrics are less "jiggly" after a scan.

Contributor:

I was kind of arguing for removing this line so we don't emit so many duplicate metrics... could we still do that?

Contributor (Author):

Do you have an example of an emitted metric that would be considered a duplicate? Something to consider: there is no other place where all of these metrics are emitted, which means that when a manual kill switch's value is updated, this is the only place that would catch it. If these metrics were only written whenever some project is added to or removed from the manual list, then given the frequency at which we expect projects to enter that list, it's highly likely that manual changes would not be reflected in the metrics until a significant amount of time has passed since they were made.

Contributor:

This same function is also called after the scan of all projects, which happens once every 10s. Here it is called again, but once for each project, every 10s. The function does exactly the same thing in both cases, which means these metrics are emitted many, many times every 10s when they would probably be just fine being emitted once every 10s.

That is at least what I understand; do I have something wrong?

Contributor (Author):

_record_metrics is called for every project that is eligible for the LPQ, so it isn't necessarily being called for every project.

It is being invoked in places where the automatic list is being mutated:

  • During the initial scan, which removes projects from the automatic list
  • When specific projects are being assessed for their eligibility, which may either add or remove an item from the automatic list

The recorded metrics will lag behind by 10 seconds if we don't update them in the latter scenario. The former (the scan's removals) only performs what is essentially pruning work, and the bulk of the actual updates are performed in the individual project assessments.

Again, if we skip recording after those are completed, all metrics will lag behind by 10s. Changes to manual kill switches will also potentially take up to 10s to be reflected in metrics. Do we want to make this tradeoff for the sake of avoiding the work of emitting duplicate metrics?
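To make the two call sites concrete, here is a minimal sketch of where emission happens under this approach; the names mirror the snippet above, and the bodies are stubs rather than the PR's actual code:

def _record_metrics():
    ...  # emits the LPQ gauges described in the PR description

def _update_lpq_eligibility(project_id, cutoff):
    ...  # decides whether this project belongs in the automatic LPQ list

def scan_for_suspect_projects():
    # Runs once every 10s; the real scan prunes stale entries from the automatic
    # list before recording (pruning omitted in this stub).
    _record_metrics()

def update_lpq_eligibility(project_id, cutoff):
    # Runs once for each project being assessed; may add it to or remove it from the list.
    try:
        _update_lpq_eligibility(project_id, cutoff)
    finally:
        # Emit again here so kill-switch changes and per-project flips show up
        # without waiting for the next scan.
        _record_metrics()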

Contributor:

  • OK, it is called once for every project actively sending in events, not for every project.
  • Yes, it would lag by 10s if we didn't do this. I think that is the lesser of the two evils here, and I'm assuming that a 10s delay in creating or clearing the alert won't have a measurable impact on our operations.

@relaxolotl relaxolotl merged commit cb676f6 into master Oct 14, 2021
@relaxolotl relaxolotl deleted the feat/lpq/project-metrics branch October 14, 2021 18:46