feat(symbolicator): Record metrics for projects being assigned and unassigned to the LPQ #29229
Conversation
I'm very confused about metrics and datadog right now. The acceptance criteria of NATIVE-203 only say this:
NATIVE-211 says:
I'm not sure which ones this is trying to address?
this doesn't show which projects are selected, just the number. i couldn't recall whether we were fine with tagging these metrics with the selected projects due to cardinality concerns.
there was a discussion outside of this PR that has raised a point of contention about the exact value that should be reported under this metric. it was decided that this would be left up to ops to determine what they would like to see recorded; PR will be updated accordingly based on their preferences.
2bfc601 to 99e55f0
reworked the implementation and updated the description to reflect those changes. cc @oioki if you have time to take a look at the metrics bits and sanity check that i'm logging the values you're expecting to be logged
if not reason:
    reason = "unknown"
why did you make `reason` optional? afaik all callers provide a reason.
i don't really have any good reason 🙃 fixed up the types, thanks for catching this
i'm keeping this check in case an empty string does get passed in though
try:
    _update_lpq_eligibility(project_id, cutoff)
finally:
    _record_metrics()
these same metrics are now emitted a lot:
- Once after the scan of all projects is completed; some computation tasks are probably still running at this time, so the result is not "definitive" (insofar as we can ever get a definitive result).
- Once after each project's calculation is finished (that's what this line here does). This is... a bit overkill and still does not give a "definitive" metric.
I think there are two approaches to this:
- Don't care about "definitive": the next metric is emitted in 10 seconds' time and will include the results of the computations that are still running. This is obviously the simplest.
- Spawn a separate task to record the metrics after all the computation tasks have finished. This is something like https://docs.celeryproject.org/en/stable/userguide/canvas.html#chords (a rough sketch follows below).
After reading option 2 I really don't feel so bad about choosing option 1 anymore.
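For reference, a minimal sketch of what option 2 could look like with a Celery chord. The `app` instance, the task names, and the way `scan_for_suspect_projects` is wired up here are assumptions for illustration only; the sketch reuses the PR's `_update_lpq_eligibility` and `_record_metrics` helpers.

```python
from celery import Celery, chord

app = Celery("lpq")  # placeholder app; sentry registers tasks its own way

@app.task
def update_lpq_eligibility_task(project_id, cutoff):
    # Per-project eligibility computation (wraps the helper from this PR).
    _update_lpq_eligibility(project_id, cutoff)

@app.task
def record_lpq_metrics_task(_results):
    # Chord callback: runs exactly once, after every header task has finished,
    # giving a single "definitive" metrics emission per scan.
    _record_metrics()

def scan_for_suspect_projects(project_ids, cutoff):
    # chord(header)(callback): the callback fires only when all header tasks complete.
    chord(update_lpq_eligibility_task.s(pid, cutoff) for pid in project_ids)(
        record_lpq_metrics_task.s()
    )
```

Note that chords need a result backend and add a synchronization point just to tighten up the metrics, which is part of why option 1 looks like the simpler tradeoff.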
yeah, as you've noted, one of the metrics becomes "eventually" correct and needs a bit of time to update itself.
it's unfortunate that we're leaking this implementation detail out into our metrics, but it looks like the best compromise, especially since the scanning doesn't require a final synchronization step to do its job properly. as you've probably noticed, that makes option 2 hard to commit to just to ensure our metrics are less "jiggly" after a scan.
i was kind of arguing for removing this line to not emit so many duplicate metrics... could we still do that?
do you have an example of an instance where an emitted metric is considered to be a duplicate? something to consider is that there is no other place where all of these metrics are being emitted, which means that when a manual kill switch's value is updated, this is the only place that would catch it. if these metrics are only written whenever some project is added to or removed from the manual list, then given how frequently we expect projects to enter that list, it's highly likely that manual changes would not be reflected in the metrics until a significant amount of time has passed since they were made.
This same function is called at
    _record_metrics()
That is at least what I understand; do I have something wrong?
`_record_metrics` is called for every project that is eligible for the LPQ, so it isn't necessarily being called for every project.
It is being invoked in places where the automatic list is being mutated:
- During the initial scan, which removes projects from the automatic list
- When specific projects are being assessed for their eligibility, which may either add or remove an item from the automatic list
The recorded metrics will lag behind by 10 seconds if we don't update them in the latter scenario. The former (scan removal) only performs what is essentially pruning work, and the bulk of the actual updates are performed in the individual project assessments.
Again, if we skip recording after those are completed, all metrics will lag behind by 10s. Changes to manual kill switches will also potentially take up to 10s to be reflected in metrics. Do we want to make this tradeoff for the sake of avoiding the work of emitting duplicate metrics?
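Roughly, the shape being described is the following: both mutation paths end in a metrics snapshot. Only `_update_lpq_eligibility` and `_record_metrics` come from this PR; the other names are made up for the sketch.

```python
def scan_for_suspect_projects(cutoff):
    # Pruning pass over the automatic list: drop projects that are no longer candidates.
    for project_id in stale_automatic_projects(cutoff):  # illustrative helper
        remove_from_automatic_lpq(project_id)             # illustrative helper
    _record_metrics()  # snapshot after the prune

def assess_project(project_id, cutoff):
    try:
        # May add the project to, or remove it from, the automatic list.
        _update_lpq_eligibility(project_id, cutoff)
    finally:
        # Snapshot after every per-project assessment; this is also what picks up
        # manual kill switch changes without waiting for the next scan.
        _record_metrics()
```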
- Ok, it is called once for every project actively sending in events, not every project.
- Yes, it would lag by 10s if we don't do this. I think that is the lesser of the two evils here. And I'm assuming that a 10s delay in creating or clearing the alert won't have a measurable impact on our operations.
Adds two major changes:
2's implementation is heavily inspired by existing code in getsentry. 1 is a best attempt given the tools known to me.
One drawback to 1's implementation is that it can take up to 10 seconds for a change to the manual kill switch to be reflected in our metrics. However, this seems preferable to the alternative of reporting whenever an event is processed, as there is no guarantee that events will always be hitting sentry. Reporting in both places is an option though.
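As a rough illustration of the kind of reporting described above: the metric names, the plain `statsd` client, and the membership helpers below are assumptions for the sketch, not what this PR actually uses.

```python
import time

import statsd  # stand-in for sentry's internal metrics wrapper

client = statsd.StatsClient("localhost", 8125)

def record_lpq_metrics():
    # Hypothetical helpers returning the current LPQ membership.
    automatic = get_automatically_assigned_projects()
    manual = get_manually_assigned_projects()
    # Report counts rather than project IDs to keep metric cardinality low.
    client.gauge("symbolicate.lpq.projects.auto", len(automatic))
    client.gauge("symbolicate.lpq.projects.manual", len(manual))

def run_reporter():
    # On a 10-second cadence, a change to the manual kill switch shows up
    # in the metrics within ~10s, independent of event traffic.
    while True:
        record_lpq_metrics()
        time.sleep(10)
```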