Add warn/error level logging in telemetry to appropriate events #166

notactuallytreyanastasio · 2022-08-24T14:35:47Z

This comes in response to some problems we had in production. Ideally,
we should have warning/error level logs for certain events.

This does the following:

- log warn when `NOTUNIQUE`
- log warn on heartbeat failure
- log error on job failure
- log warn on failed `ACK` with `:ok` status
- log error on failed `ACK` with `:error` status
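
As a rough illustration of the mapping above (not the code in this PR), the level selection could be sketched like this. The module name, event atoms, outcome shapes, and the message prefix are all assumptions made up for the example; only `Logger.log/2` is a real call.

```elixir
defmodule LogLevelSketch do
  # Hypothetical module: event names, outcome shapes, and the prefix below
  # are illustrative assumptions, not the library's telemetry events.
  require Logger

  # Pick a Logger level for an (assumed) event and outcome.
  def level(:push, :not_unique), do: :warning
  def level(:beat, :failed), do: :warning
  def level(:job, :failed), do: :error
  def level(:ack, {:failed, :ok}), do: :warning
  def level(:ack, {:failed, :error}), do: :error
  def level(_event, _outcome), do: :info

  # Every level shares one message format, so searches saved against the
  # info-level output keep matching.
  def log(event, outcome, message) do
    Logger.log(level(event, outcome), "[assumed-prefix] #{message}")
  end
end
```

Keeping the level decision in one pure function means the formatting path stays identical across levels, which is the property the description relies on.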

It keeps the same formatting as the info level helper already set up,
so as not to break any saved searches folks may have, but adds the
granularity needed for easier inspection in situations like mass job failure.

Aside

The problems that prompted this largely hinge on failures to ACK jobs upstream. Because those events were logged at info level, it was harder to dig into things.

As a next step, we may want to consider adding backpressure handling to the sending of all these messages when putting jobs onto the queue. If we kick off a couple thousand, the failures pile up, and in one instance this led to a partial outage for us. It's reasonable to assume this can happen whenever the connection is flooded. #157 made it so that we don't raise and cause a ton of noise and failures in parent applications, but ideally a user shoving thousands of jobs onto the queue would be able to do so without writing preventative code or a band-aid atop this library's API.
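
One possible shape for that, purely as a sketch: cap the number of in-flight pushes with `Task.async_stream/3`, which is a simple form of bounded concurrency rather than full backpressure. `BoundedPushSketch` and `push_fun` are hypothetical; `push_fun` stands in for whatever enqueues a single job and is assumed to return `{:ok, _}` or `{:error, _}` instead of raising.

```elixir
defmodule BoundedPushSketch do
  # Hypothetical helper, not part of the library's API.
  # `push_fun` is assumed to return {:ok, term} or {:error, term};
  # a raise inside it would crash the linked caller.
  def push_all(jobs, push_fun, max_in_flight \\ 10) do
    jobs
    |> Task.async_stream(push_fun,
      max_concurrency: max_in_flight,
      timeout: 5_000,
      on_timeout: :kill_task
    )
    |> Enum.reduce(%{ok: 0, failed: 0}, fn
      # Task finished and the push itself reported success.
      {:ok, {:ok, _result}}, acc -> Map.update!(acc, :ok, &(&1 + 1))
      # Push returned an error, or the task timed out and was killed.
      _other, acc -> Map.update!(acc, :failed, &(&1 + 1))
    end)
  end
end
```

With this, a caller enqueueing thousands of jobs never has more than `max_in_flight` pushes open at once, and gets a count of failures back instead of a pile of simultaneous errors.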


Ch4s3 merged commit 0c12718 into opt-elixir:master Aug 24, 2022

notactuallytreyanastasio mentioned this pull request Aug 24, 2022

Bump version to 1.9.1, add changelog entry #167

Merged
