
feat(symbolicator): Record metrics for projects being assigned and unassigned to the LPQ #29229

Merged: relaxolotl merged 12 commits into master from feat/lpq/project-metrics on Oct 14, 2021

Conversation

@relaxolotl (Contributor) commented Oct 9, 2021:

Adds two major changes:

  1. Reports project-related metrics that might be of interest to the LPQ, as requested by ops:
    • the number of projects selected for the LPQ
    • the number of projects manually forced into the LPQ via the killswitch
    • the number of projects manually excluded from the LPQ via the killswitch
  2. Sends more detailed warnings to Sentry when a project is moved in or out of the LPQ
    • adds project id
    • includes reason for change

Change 2's implementation is heavily inspired by existing code in getsentry. Change 1 is a best-effort attempt given the tools known to me.

One drawback to change 1's implementation is that it can take up to 10 seconds for a change to the manual kill switch to be reflected in our metrics. However, this seems preferable to the alternative of reporting whenever an event is processed, since there is no guarantee that events will always be hitting Sentry. Reporting in both places is an option, though.
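For illustration, a minimal sketch of what the periodic recording could look like; the metric names, the store accessors, and the gauge helper are assumptions made for this example, not the exact code in this PR:

from sentry.utils import metrics  # gauge() is assumed to exist here for this sketch

def _record_metrics(store, manually_included, manually_excluded):
    # Hypothetical helper: names and store accessors are illustrative only.
    # Number of projects the automatic scan currently has selected for the LPQ.
    metrics.gauge("symbolication.lpq.selected_projects", len(store.get_lpq_projects()))
    # Projects manually forced into or out of the LPQ via the killswitches.
    metrics.gauge("symbolication.lpq.manually_included", len(manually_included))
    metrics.gauge("symbolication.lpq.manually_excluded", len(manually_excluded))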

@flub (Contributor) commented Oct 13, 2021:

I'm very confused about metrics and Datadog right now. The acceptance criteria of NATIVE-203 only say this:

Datadog dashboards show reliable symbolication durations

NATIVE-211 says:

There are metrics showing selected projects
There are alerts and sentry errors showing selected projects

I'm not sure which of these this PR is trying to address?

@relaxolotl (Contributor, Author) commented:

@flub

There are metrics showing selected projects

This doesn't show which projects are selected, just the number. I couldn't recall whether we were fine with tagging these metrics with the selected projects due to cardinality concerns.

@relaxolotl (Contributor, Author) commented:

There was a discussion outside of this PR that raised a point of contention about the exact value that should be reported under this metric. It was decided that this would be left up to ops to determine what they would like to see recorded; the PR will be updated accordingly based on their preferences.

cc @flub @Swatinem @loewenheim

@relaxolotl (Contributor, Author) commented:

Reworked the implementation and updated the description to reflect those changes.

cc @oioki: if you have time, take a look at the metrics bits and sanity-check that I'm logging the values you're expecting to be logged.

Comment on lines +107 to +108
if not reason:
    reason = "unknown"
Contributor:

Why did you make reason optional? AFAIK all callers provide a reason.

Contributor (Author):

I don't really have any good reason 🙃 Fixed up the types, thanks for catching this.

Contributor (Author):

I'm keeping this check in case an empty string does get passed in, though.

try:
    _update_lpq_eligibility(project_id, cutoff)
finally:
    _record_metrics()
Contributor:

These same metrics are now emitted a lot:

  1. Once after the scan of all projects is completed; some computation tasks are probably still running at this time, so the result is not "definitive" (insofar as we can ever get a definitive result).
  2. Once after each project's calculation is finished (that's what this line here does). This is... a bit overkill and still does not give a "definitive" metric.

I think there are two approaches to this:

  1. Don't care about "definitive", the next metric is emitted in 10s time and will include the results of the computations that are still running. This is obviously the simplest.
  2. Spawn a separate task to record the metrics after all the computation tasks have finished. This is something like https://docs.celeryproject.org/en/stable/userguide/canvas.html#chords

After reading option 2 I really don't feel so bad about choosing option 1 anymore.
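For reference, a rough sketch of what option 2 could look like with a Celery chord; the task names and bodies below are made up for illustration and are not the actual tasks in this PR:

from celery import chord, shared_task

@shared_task
def update_lpq_eligibility(project_id, cutoff):
    # Per-project eligibility calculation (stubbed out in this sketch).
    ...

@shared_task
def record_metrics(_results):
    # Chord callback: receives the list of results from all header tasks,
    # so the metrics are emitted exactly once, after every calculation has finished.
    ...

def scan_for_suspect_projects(project_ids, cutoff):
    # Kick off one task per project, then record metrics once they have all completed.
    header = [update_lpq_eligibility.s(project_id, cutoff) for project_id in project_ids]
    chord(header)(record_metrics.s())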

Contributor (Author):

Yeah, as you've noted, one of the metrics becomes "eventually" correct and needs a bit of time to update itself.

It's unfortunate that we're leaking this implementation detail out into our metrics, but it looks like the best compromise, especially since the scanning doesn't require a final synchronization step to do its job properly. As you've probably noticed, that makes option 2 hard to commit to just to ensure our metrics are less "jiggly" after a scan.

Contributor:

I was kind of arguing for removing this line so we don't emit so many duplicate metrics... could we still do that?

Contributor (Author):

Do you have an example of an emitted metric that would be considered a duplicate? Something to consider: there is no other place where all of these metrics are emitted, which means that when a manual kill switch's value is updated, this is the only place that would catch it. If these metrics were only written whenever some project is added to or removed from the manual list, then given the frequency at which we expect projects to enter that list, it's highly likely that manual changes would not be reflected in the metrics until a significant amount of time has passed since they were made.

Contributor:

This same function is also called after the scan of all projects, which happens once every 10s. Here it is called again, but once for each project, every 10s. The function does exactly the same thing in both cases, which means these metrics are emitted many, many times every 10s when they would probably be just fine being emitted once every 10s.

That is at least what I understand; do I have something wrong?

Contributor (Author):

_record_metrics is called for every project that is eligible for the LPQ, so it isn't necessarily being called for every project.

It is being invoked in places where the automatic list is being mutated:

  • During the initial scan, which removes projects from the automatic list
  • When specific projects are being assessed for their eligibility, which may either add or remove an item from the automatic list

The recorded metrics will lag behind by 10 seconds if we don't update them in the latter scenario. The former (the scan's removals) only performs what is essentially pruning work, and the bulk of the actual updates are performed in the individual project assessments.

Again, if we skip recording after those are completed, all metrics will lag behind by 10s. Changes to manual kill switches will also potentially take up to 10s to be reflected in metrics. Do we want to make this tradeoff for the sake of avoiding the work of emitting duplicate metrics?
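To make the two call sites concrete, here is a minimal sketch of where emission happens under this approach; the names mirror the snippet above, and the bodies are stubs rather than the PR's actual code:

def _record_metrics():
    ...  # emits the LPQ gauges described in the PR description

def _update_lpq_eligibility(project_id, cutoff):
    ...  # decides whether this project belongs in the automatic LPQ list

def scan_for_suspect_projects():
    # Runs once every 10s; the real scan prunes stale entries from the automatic
    # list before recording (pruning omitted in this stub).
    _record_metrics()

def update_lpq_eligibility(project_id, cutoff):
    # Runs once for each project being assessed; may add it to or remove it from the list.
    try:
        _update_lpq_eligibility(project_id, cutoff)
    finally:
        # Emit again here so kill-switch changes and per-project flips show up
        # without waiting for the next scan.
        _record_metrics()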

Contributor:

  • OK, it is called once for every project actively sending in events, not for every project.
  • Yes, it would lag by 10s if we didn't do this. I think that is the lesser of the two evils here, and I'm assuming that a 10s delay in creating or clearing the alert won't have a measurable impact on our operations.

@relaxolotl relaxolotl merged commit cb676f6 into master Oct 14, 2021
@relaxolotl relaxolotl deleted the feat/lpq/project-metrics branch October 14, 2021 18:46