Add warn/error level logging in telemetry to appropriate events #166
This comes in response to some problems we had in production. Ideally,
we should have warning/error level logs for certain events.
This adds level-specific logging for the following events:
- `NOTUNIQUE`
- `ACK` with `:ok` status
- `ACK` with `:error` status

It maintains the same formatting as the `info` level helper already set up, so as not to break any saved searches folks may have, but gives this granularity for easier inspection in situations like mass job failure.
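To illustrate the idea of varying only the level while keeping the message text stable, here is a minimal, language-agnostic sketch in Python. It is not this library's code: the names `LEVELS`, `format_event`, and `log_event`, the message layout, and the particular event-to-level mapping are all assumptions made for illustration.

```python
import logging

# Hypothetical mapping: which level each (event, status) pair receives is an
# assumption for this sketch, not taken from the library.
LEVELS = {
    ("NOTUNIQUE", None): logging.WARNING,
    ("ACK", "ok"): logging.INFO,
    ("ACK", "error"): logging.ERROR,
}


def format_event(event, status, job_id):
    """Build one message format shared by every level, so saved searches still match."""
    suffix = f" status={status}" if status else ""
    return f"event={event}{suffix} jid={job_id}"


def log_event(logger, event, status=None, job_id="unknown"):
    """Log at a level chosen from the outcome; the message text is level-independent."""
    level = LEVELS.get((event, status), logging.INFO)  # unknown events stay at info
    message = format_event(event, status, job_id)
    logger.log(level, message)
    return level, message
```

Because the text comes from a single formatter, a search on `event=ACK` would keep matching, while filtering by level isolates the failures.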
Aside
The problems that prompted this largely hinge on failures when `ACK`-ing jobs upstream; with those logs at the `info` level, it was harder to dig into things.

As a next step, we may want to consider adding backpressure handling to the sending of these messages when putting jobs onto the queue. If we kick off a couple thousand jobs, the failures pile up, and in one instance this led to a partial outage for us. It's reasonable to assume this can happen when flooding the connection. #157 made it so that we don't raise and cause a ton of noise/failures in parent applications, but ideally a user shoving thousands of jobs onto the queue would be able to do so without writing preventative code or a band-aid atop this library's API.
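One possible shape for that backpressure, sketched in Python purely as an illustration: cap the number of in-flight pushes and make the caller wait once the cap is reached. `BoundedPusher`, its parameters, and the `send` callable are hypothetical and not part of this library's API.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class BoundedPusher:
    """Caps in-flight pushes; push() blocks the caller once the cap is reached."""

    def __init__(self, send, max_in_flight=50, workers=8):
        # `send` is a hypothetical function that delivers one job to the queue server.
        self._send = send
        self._slots = threading.BoundedSemaphore(max_in_flight)
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def push(self, job):
        self._slots.acquire()  # backpressure: wait here until a slot frees up
        try:
            future = self._pool.submit(self._send, job)
        except BaseException:
            self._slots.release()  # submission failed, so give the slot back
            raise
        # Free the slot when the send finishes, whether it succeeded or raised.
        future.add_done_callback(lambda _f: self._slots.release())
        return future

    def close(self):
        self._pool.shutdown(wait=True)
```

With something along these lines, a caller enqueueing thousands of jobs is slowed to the rate the connection can absorb instead of piling up failures, and needs no throttling code of its own.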