Add warn/error level logging in telemetry to appropriate events #166

@notactuallytreyanastasio commented Aug 24, 2022

This comes in response to some problems we had in production. Ideally,
we should have warning/error level logs for certain events.

This does the following:

- log warn when `NOTUNIQUE`
- log warn on heartbeat failure
- log error on job failure
- log warn on failed `ACK` with `:ok` status
- log error on failed `ACK` with `:error` status

It maintains the same formatting as the `info` level helper already set
up, so as not to break any saved searches folks may have, but adds the
granularity needed for easier inspection in situations like mass job failure.
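
A minimal sketch of the mapping described above, not the code in this PR. The module name `LogLevelSketch`, the event names (`:push`, `:beat`, `:job`, `:failed_ack`), and the outcome map shapes are all assumptions made for illustration. The point it shows is that every level goes through one shared formatter, so only the level changes and the message text stays searchable.

```elixir
defmodule LogLevelSketch do
  # Hypothetical module: event names and outcome shapes are assumptions,
  # not the library's actual telemetry events.
  require Logger

  # Pick a log level from the event and its outcome.
  # :warning is the current spelling of the "warn" level (Elixir 1.11+).
  def level_for(:push, %{status: {:error, :not_unique}}), do: :warning
  def level_for(:beat, %{status: :error}), do: :warning
  def level_for(:job, %{status: :error}), do: :error
  def level_for(:failed_ack, %{status: :ok}), do: :warning
  def level_for(:failed_ack, %{status: :error}), do: :error
  def level_for(_event, _outcome), do: :info

  # One formatter shared by every level, so message text is identical
  # whether it is emitted at info, warning, or error.
  def format(message), do: "#{message}"

  # Log the formatted message at the level chosen for this event.
  def log(event, outcome, message) do
    Logger.log(level_for(event, outcome), format(message))
  end
end
```

For example, `LogLevelSketch.log(:job, %{status: :error}, "job failed")` would emit at the error level, while any event without a matching clause falls back to info.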

Aside

The problems we were seeing largely stem from failures to ACK jobs upstream. Because those events were logged at the info level, they were harder to dig into.

As a next step, we may want to consider adding backpressure handling for sending these messages when putting jobs onto the queue. If we kick off a couple thousand, the failures pile up, and in one instance this led to a partial outage for us. It's reasonable to assume this can happen when flooding the connection. #157 made it so that we don't raise and cause a ton of noise/failures in parent applications, but ideally a user shoving thousands of jobs onto the queue would be able to do so without writing preventative code or a band-aid atop this library's API.
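
One possible shape for that backpressure, sketched on the caller side with the standard library's `Task.async_stream/3`. This is an illustration, not a proposal for this library's API: `BoundedPushSketch`, `push_all/3`, and the `push_fun` argument are hypothetical names, and the sketch assumes `push_fun` returns `{:ok, _}` or `{:error, _}` rather than raising.

```elixir
defmodule BoundedPushSketch do
  # Hypothetical helper: pushes jobs with at most `max_in_flight`
  # concurrent calls, so a burst of thousands cannot flood the connection.
  # Assumes `push_fun` returns {:ok, _} or {:error, _} and does not raise.
  def push_all(jobs, push_fun, max_in_flight \\ 10) do
    jobs
    |> Task.async_stream(push_fun,
      max_concurrency: max_in_flight,
      timeout: 5_000,
      on_timeout: :kill_task
    )
    |> Enum.reduce(%{ok: 0, failed: 0}, fn
      # The task finished and the push itself succeeded.
      {:ok, {:ok, _result}}, acc -> %{acc | ok: acc.ok + 1}
      # Either the push returned an error or the task timed out.
      _other, acc -> %{acc | failed: acc.failed + 1}
    end)
  end
end
```

The returned counts give the caller a single place to decide whether to retry or alert, instead of discovering failures one log line at a time.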

@Ch4s3 merged commit 0c12718 into opt-elixir:master Aug 24, 2022