
AWS Cloudwatch Receiver stops/errors after a log group gets removed #35361

Open

elburnetto-intapp opened this issue Sep 23, 2024 · 3 comments

@elburnetto-intapp

Component(s)

receiver/awscloudwatch

What happened?

Description

We have the AWS CloudWatch Receiver set up to auto-discover and poll log groups from our AWS account, which are then exported to Kafka. The idea behind using auto-discovery was that log groups could be added/removed automatically by the receiver, without requiring manual intervention.

However, we've noticed that when a log group is removed from AWS, the receiver panics and completely stops, as it's unable to find the log group (instead of ignoring it and continuing to poll the other log groups). It's as if the functionality that updates the log group list isn't removing deleted ones.

Steps to Reproduce

Set up the OpenTelemetry Collector to use the receiver with auto-discovery for log groups, let it run for 5-10 minutes, then remove a log group from the AWS console.

Expected Result

The receiver stops polling for logs in the group that no longer exists and continues polling the groups that are still active.

Actual Result

The receiver stops collecting and errors continuously. The only way to recover is to delete the pod and wait for the receiver to restart.

Collector version

0.101.0

Environment information

Environment

Kubernetes (EKS)

OpenTelemetry Collector configuration

receivers:
  awscloudwatch/rds:
    logs:
      groups:
        autodiscover:
          limit: 500
          prefix: /aws/rds/instance/
      max_events_per_request: 300
      poll_interval: 5m
    region: us-east-1
exporters:
  kafka/logs:
    auth:
      tls:
        ca_file: <path-to-ca-crt>
    brokers: <kafka-broker-url>
    encoding: otlp_json
    producer:
      max_message_bytes: 2000000
    protocol_version: 2.8.0
    retry_on_failure:
      enabled: true
      max_elapsed_time: 600s
      max_interval: 60s
    topic: processed-logs
service:
  extensions:
  - health_check
  pipelines:
    logs:
      exporters:
      - kafka/logs
      receivers:
      - awscloudwatch/rds

Log output

2024-09-23T13:35:52.472Z	error	awscloudwatchreceiver@v0.101.0/logs.go:213	unable to retrieve logs from cloudwatch	{"kind": "receiver", "name": "awscloudwatch/rds", "data_type": "logs", "log group": "/aws/rds/instance/test-log-group-instance/sql", "error": "ResourceNotFoundException: The specified log group does not exist."}
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/awscloudwatchreceiver.(*logsReceiver).pollForLogs
	github.com/open-telemetry/opentelemetry-collector-contrib/receiver/awscloudwatchreceiver@v0.101.0/logs.go:213
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/awscloudwatchreceiver.(*logsReceiver).poll
	github.com/open-telemetry/opentelemetry-collector-contrib/receiver/awscloudwatchreceiver@v0.101.0/logs.go:187
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/awscloudwatchreceiver.(*logsReceiver).startPolling
	github.com/open-telemetry/opentelemetry-collector-contrib/receiver/awscloudwatchreceiver@v0.101.0/logs.go:174

Additional context

No response

elburnetto-intapp added the bug (Something isn't working) and needs triage (New item requiring triage) labels on Sep 23, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@schmikei

Hmm, it was my understanding that we rediscover log groups on each poll interval, so I imagined this would fit your use case...

It's not panicking from that error based on the code; it's behaving correctly for that request. Looking at the code, any other groups should still be getting collected.

l.logger.Error("unable to retrieve logs from cloudwatch", zap.String("log group", pc.groupName()), zap.Error(err))
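(Not the receiver's actual code — a minimal Go sketch, with hypothetical pollContext, groupName, and fetchLogs names, of how a per-group loop can log the failure and keep collecting the remaining groups:)

package sketch

import (
    "context"

    "go.uber.org/zap"
)

// pollContext is a hypothetical stand-in for the receiver's per-group state.
type pollContext struct{ group string }

func (pc pollContext) groupName() string { return pc.group }

type logsReceiver struct {
    logger *zap.Logger
    // fetchLogs stands in for the per-group CloudWatch Logs request.
    fetchLogs func(ctx context.Context, pc pollContext) error
}

// poll logs a failure for one group and moves on to the next group
// instead of stopping the whole receiver.
func (r *logsReceiver) poll(ctx context.Context, groups []pollContext) {
    for _, pc := range groups {
        if err := r.fetchLogs(ctx, pc); err != nil {
            r.logger.Error("unable to retrieve logs from cloudwatch",
                zap.String("log group", pc.groupName()), zap.Error(err))
            continue
        }
    }
}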

The only reason I can think of is that the AWS CloudWatch Logs API is still returning the deleted log group on subsequent poll intervals. Would you be up for enabling debug logs by adding this service snippet to your config?

service:
  telemetry:
    logs:
      level: debug

I would like to see whether the deleted group is still getting rediscovered after deletion, and after 2 polls.

I'm expecting a debug-level log message with the text "discovered log group" naming the deleted log group. We may need to add special handling for that error, but I'd rather avoid that if possible; a rough sketch of what such handling could look like is below.
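As a rough idea of that handling (a minimal sketch only, assuming the receiver uses the AWS SDK for Go v2 CloudWatch Logs client; isGroupGone is a hypothetical helper): when this returns true for a group's poll error, the group could be dropped from the discovered set so autodiscovery only re-adds it if the group reappears.

package sketch

import (
    "errors"

    "github.com/aws/aws-sdk-go-v2/service/cloudwatchlogs/types"
)

// isGroupGone reports whether a poll error means the log group was deleted.
// (With the v1 SDK the equivalent check would compare the awserr code against
// cloudwatchlogs.ErrCodeResourceNotFoundException instead.)
func isGroupGone(err error) bool {
    var rnf *types.ResourceNotFoundException
    return errors.As(err, &rnf)
}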


This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions bot added the Stale label on Dec 12, 2024