
AWS Cloudwatch Receiver stops/errors after a log group gets removed #35361

Open

elburnetto-intapp opened this issue Sep 23, 2024 · 3 comments

@elburnetto-intapp

Component(s)

receiver/awscloudwatch

What happened?

Description

We have the AWS CloudWatch Receiver set up to auto-discover and poll log groups from our AWS account, which are then exported to Kafka. The idea behind using auto-discovery was that log groups could be added/removed automatically by the receiver, without requiring manual intervention.

However, we've noticed that when a log group is removed from AWS, the receiver panics and completely stops, as it's unable to find the log group (instead of ignoring it and continuing to poll the other log groups). It's as if the functionality that updates the log group list isn't removing deleted ones.

Steps to Reproduce

Set up the OpenTelemetry Collector to use the receiver with auto-discovery for log groups, let it run for 5-10 minutes, then remove a log group from the AWS console.

Expected Result

The receiver stops polling for logs in the group that no longer exists and continues polling the groups that are still active.

Actual Result

The receiver stops collecting and errors continuously. The only way to recover is to delete the pod and wait for the receiver to restart.

Collector version

0.101.0

Environment information

Environment

Kubernetes (EKS)

OpenTelemetry Collector configuration

receivers:
  awscloudwatch/rds:
    logs:
      groups:
        autodiscover:
          limit: 500
          prefix: /aws/rds/instance/
      max_events_per_request: 300
      poll_interval: 5m
    region: us-east-1
exporters:
  kafka/logs:
    auth:
      tls:
        ca_file: <path-to-ca-crt>
    brokers: <kafka-broker-url>
    encoding: otlp_json
    producer:
      max_message_bytes: 2000000
    protocol_version: 2.8.0
    retry_on_failure:
      enabled: true
      max_elapsed_time: 600s
      max_interval: 60s
    topic: processed-logs
service:
  extensions:
  - health_check
  pipelines:
    logs:
      exporters:
      - kafka/logs
      receivers:
      - awscloudwatch/rds

Log output

2024-09-23T13:35:52.472Z	error	awscloudwatchreceiver@v0.101.0/logs.go:213	unable to retrieve logs from cloudwatch	{"kind": "receiver", "name": "awscloudwatch/rds", "data_type": "logs", "log group": "/aws/rds/instance/test-log-group-instance/sql", "error": "ResourceNotFoundException: The specified log group does not exist."}
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/awscloudwatchreceiver.(*logsReceiver).pollForLogs
	github.com/open-telemetry/opentelemetry-collector-contrib/receiver/awscloudwatchreceiver@v0.101.0/logs.go:213
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/awscloudwatchreceiver.(*logsReceiver).poll
	github.com/open-telemetry/opentelemetry-collector-contrib/receiver/awscloudwatchreceiver@v0.101.0/logs.go:187
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/awscloudwatchreceiver.(*logsReceiver).startPolling
	github.com/open-telemetry/opentelemetry-collector-contrib/receiver/awscloudwatchreceiver@v0.101.0/logs.go:174

Additional context

No response

elburnetto-intapp added the bug (Something isn't working) and needs triage (New item requiring triage) labels on Sep 23, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@schmikei

Hmm, it was my understanding that we rediscover log groups on each poll interval, so I imagined this would fit your use case...

It's not panicking from that error based on the code; it's behaving correctly for that request. Looking at the code, any other groups should still be getting collected.

l.logger.Error("unable to retrieve logs from cloudwatch", zap.String("log group", pc.groupName()), zap.Error(err))
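(Not the receiver's actual code — a minimal Go sketch, with hypothetical pollContext, groupName, and fetchLogs names, of how a per-group loop can log the failure and keep collecting the remaining groups:)

package sketch

import (
    "context"

    "go.uber.org/zap"
)

// pollContext is a hypothetical stand-in for the receiver's per-group state.
type pollContext struct{ group string }

func (pc pollContext) groupName() string { return pc.group }

type logsReceiver struct {
    logger *zap.Logger
    // fetchLogs stands in for the per-group CloudWatch Logs request.
    fetchLogs func(ctx context.Context, pc pollContext) error
}

// poll logs a failure for one group and moves on to the next group
// instead of stopping the whole receiver.
func (r *logsReceiver) poll(ctx context.Context, groups []pollContext) {
    for _, pc := range groups {
        if err := r.fetchLogs(ctx, pc); err != nil {
            r.logger.Error("unable to retrieve logs from cloudwatch",
                zap.String("log group", pc.groupName()), zap.Error(err))
            continue
        }
    }
}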

The only reason I can think of is that the AWS CloudWatch Logs API is still returning the deleted log group on subsequent poll intervals. Would you be up for enabling debug logs by adding this service snippet to your config?

service:
  telemetry:
    logs:
      level: debug

I would like to see whether the deleted group is still getting rediscovered after deletion, and after 2 polls.

I'm expecting a debug-level log message with the text "discovered log group" naming the deleted log group. We may need to add special handling for that error, but I'd rather avoid that if possible; a rough sketch of what such handling could look like is below.
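As a rough idea of that handling (a minimal sketch only, assuming the receiver uses the AWS SDK for Go v2 CloudWatch Logs client; isGroupGone is a hypothetical helper): when this returns true for a group's poll error, the group could be dropped from the discovered set so autodiscovery only re-adds it if the group reappears.

package sketch

import (
    "errors"

    "github.com/aws/aws-sdk-go-v2/service/cloudwatchlogs/types"
)

// isGroupGone reports whether a poll error means the log group was deleted.
// (With the v1 SDK the equivalent check would compare the awserr code against
// cloudwatchlogs.ErrCodeResourceNotFoundException instead.)
func isGroupGone(err error) bool {
    var rnf *types.ResourceNotFoundException
    return errors.As(err, &rnf)
}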


This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions bot added the Stale label on Dec 12, 2024