Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Global sampling service telemetry logging configuration options #4554

Closed
rmfitzpatrick opened this issue Dec 14, 2021 · 3 comments · Fixed by #6404
Closed

Global sampling service telemetry logging configuration options #4554

rmfitzpatrick opened this issue Dec 14, 2021 · 3 comments · Fixed by #6404

Comments

@rmfitzpatrick
Copy link
Contributor

Is your feature request related to a problem? Please describe.
Some components use warn and info level statements for paths that can be hit frequently, like the batch processor when configured with a small sending queue:

bp.logger.Warn("Sender failed", zap.Error(err))
. This can lead to collectors logging hundreds or thousands of messages a second, much of which is redundant information.

Describe the solution you'd like
The ServiceTelemetryLogs config should provide zap sampling initial and thereafter fields that are used when instantiating all component loggers, with defaults equivalent to disabling:

    sampling_initial: 1
    sampling_thereafter: 1

Describe alternatives you've considered
Given the nature of zap logger usage in components I'm not sure of an alternative sampling approach within the collector process, or another course other than setting an arbitrarily high log level and missing important statements, or making valid warning/info scenarios have a lower log level (to same effect or no change with low enough configured log level).

Additional context
Would potentially resolve #1061

@tigrannajaryan
Copy link
Member

I think this is a reasonable request but I would like to understand better why we are logging too many messages per second.

This can lead to collectors logging hundreds or thousands of messages a second, much of which is redundant information.

This a bit unexpected to me. We use exponential backoff in senders so presumably after a few failures to send the rate of failures should slow down a lot. Is this not happening? What message is logged thousands times a second?

Also queued_retry component itself uses a sampled logger, precisely to avoid logging too much. Is batch processor the guilty component?

@rmfitzpatrick
Copy link
Contributor Author

On a quick pass I'm not seeing where the batch processor uses a sampling config, and right it is flooding the logs when extended export failures occur:

2021-12-14T15:26:49.497Z	warn	batchprocessor/batch_processor.go:185	Sender failed	{"kind": "processor", "name": "batch", "error": "sending_queue is full"}
2021-12-14T15:26:49.508Z	warn	batchprocessor/batch_processor.go:185	Sender failed	{"kind": "processor", "name": "batch", "error": "sending_queue is full"}
2021-12-14T15:26:49.525Z	warn	batchprocessor/batch_processor.go:185	Sender failed	{"kind": "processor", "name": "batch", "error": "sending_queue is full"}
2021-12-14T15:26:49.588Z	warn	batchprocessor/batch_processor.go:185	Sender failed	{"kind": "processor", "name": "batch", "error": "sending_queue is full"}
2021-12-14T15:26:49.597Z	warn	batchprocessor/batch_processor.go:185	Sender failed	{"kind": "processor", "name": "batch", "error": "sending_queue is full"}
2021-12-14T15:26:49.614Z	warn	batchprocessor/batch_processor.go:185	Sender failed	{"kind": "processor", "name": "batch", "error": "sending_queue is full"}
2021-12-14T15:26:49.640Z	warn	batchprocessor/batch_processor.go:185	Sender failed	{"kind": "processor", "name": "batch", "error": "sending_queue is full"}
2021-12-14T15:26:49.649Z	warn	batchprocessor/batch_processor.go:185	Sender failed	{"kind": "processor", "name": "batch", "error": "sending_queue is full"}
2021-12-14T15:26:49.673Z	warn	batchprocessor/batch_processor.go:185	Sender failed	{"kind": "processor", "name": "batch", "error": "sending_queue is full"}
2021-12-14T15:26:49.689Z	warn	batchprocessor/batch_processor.go:185	Sender failed	{"kind": "processor", "name": "batch", "error": "sending_queue is full"}
2021-12-14T15:26:49.707Z	warn	batchprocessor/batch_processor.go:185	Sender failed	{"kind": "processor", "name": "batch", "error": "sending_queue is full"}
2021-12-14T15:26:49.771Z	warn	batchprocessor/batch_processor.go:185	Sender failed	{"kind": "processor", "name": "batch", "error": "sending_queue is full"}
2021-12-14T15:26:49.809Z	warn	batchprocessor/batch_processor.go:185	Sender failed	{"kind": "processor", "name": "batch", "error": "sending_queue is full"}
2021-12-14T15:26:49.854Z	warn	batchprocessor/batch_processor.go:185	Sender failed	{"kind": "processor", "name": "batch", "error": "sending_queue is full"}
2021-12-14T15:26:49.872Z	warn	batchprocessor/batch_processor.go:185	Sender failed	{"kind": "processor", "name": "batch", "error": "sending_queue is full"}
2021-12-14T15:26:49.883Z	warn	batchprocessor/batch_processor.go:185	Sender failed	{"kind": "processor", "name": "batch", "error": "sending_queue is full"}
2021-12-14T15:26:49.886Z	warn	batchprocessor/batch_processor.go:185	Sender failed	{"kind": "processor", "name": "batch", "error": "sending_queue is full"}
2021-12-14T15:26:49.925Z	warn	batchprocessor/batch_processor.go:185	Sender failed	{"kind": "processor", "name": "batch", "error": "sending_queue is full"}
2021-12-14T15:26:49.935Z	warn	batchprocessor/batch_processor.go:185	Sender failed	{"kind": "processor", "name": "batch", "error": "sending_queue is full"}
2021-12-14T15:26:49.943Z	warn	batchprocessor/batch_processor.go:185	Sender failed	{"kind": "processor", "name": "batch", "error": "sending_queue is full"}
2021-12-14T15:26:49.974Z	warn	batchprocessor/batch_processor.go:185	Sender failed	{"kind": "processor", "name": "batch", "error": "sending_queue is full"}

There may be other issues with the exporter helpers as well and I will open another issue if I determine what it is.

@tigrannajaryan
Copy link
Member

On a quick pass I'm not seeing where the batch processor uses a sampling config

AFAIK, it doesn't. queued_retry does.

I think it is a good idea to have a general sampling for all logs (at least configurable).

Ideally I would prefer that components are given 2 loggers: one to use during startup which will not be sampled (or will have higher sampling thresholds) to make sure all critical messages during startup are printed, and another logger to be used after startup to make sure long-running processes don't generate huge volumes of logs.

bogdandrutu added a commit to bogdandrutu/opentelemetry-collector that referenced this issue Oct 25, 2022
Fixes open-telemetry#4554

Signed-off-by: Bogdan <bogdandrutu@gmail.com>
bogdandrutu added a commit to bogdandrutu/opentelemetry-collector that referenced this issue Nov 7, 2022
Fixes open-telemetry#4554

Signed-off-by: Bogdan <bogdandrutu@gmail.com>
bogdandrutu added a commit that referenced this issue Nov 7, 2022
Fixes #4554

Signed-off-by: Bogdan <bogdandrutu@gmail.com>

Signed-off-by: Bogdan <bogdandrutu@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants