Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

obsreport's TracesDropped/MetricsDropped/LogsDropped are not actually used #5056

Open
pmm-sumo opened this issue Mar 21, 2022 · 8 comments
Open
Assignees
Labels

Comments

@pmm-sumo
Copy link
Contributor

pmm-sumo commented Mar 21, 2022

Is your feature request related to a problem? Please describe.

I was asked in which circumstances *_dropped_spans metric can be observed and while searching for the answer I noticed there are no references in code to obsreport/obsreport_processor.go's TracesDropped/MetricsDropped/LogsDropped functions (except for unit tests). They can be used to generate processor's _dropped_spans/_dropped_metric_points/_dropped_log_records metrics respectively.

We use the "Refused" counterparts but not the "Dropped" ones. Perhaps we should remove those?

Also, TracesAccepted and TracesRefused refer to spans rather than traces and MetricsAccepted/MetricsRefused refer to metric points.

Describe the solution you'd like

  • SpansDropped/MetricsDropped/LogsDropped removed
  • SpansAccepted/SpansRefused in obsreport rather than TracesAccepted/TracesRefused
  • (Perhaps) MetricsDataPointsAccepted/MetricsDataPointsRefused rather than MetricsAccepted/MetricsRefused
@jpkrohling
Copy link
Member

There used to be, we might have lost it. Seems related to #5038

@codeboten
Copy link
Contributor

As discussed on Mar 23 - we should keep them and fix inconsistencies in them. It would be good to understand where/when their usage changed.

@jpkrohling jpkrohling self-assigned this Mar 24, 2022
@dmitryax dmitryax added the priority:p2 Medium label Mar 25, 2022
@devrimdemiroz
Copy link

Clarification on this would be excellent as we do not observe and presume according to #5038 that they will appear once such a situation happens.

@jmacd
Copy link
Contributor

jmacd commented Oct 6, 2023

I came to this issue, realizing a related problem exists. When the exporterhelper code has queuing enabled and encounters a full queue, it will drop the data and log a message about it. It returns nil to the obsreport logic that counts send-failures, and consequently we are counting drops as successful sends.

Conceptually, however, it is an exporter, not a processor, and somehow it crosses a logical barrier for exporterhelper to be counting drops. The only way exporters currently have for counting is success/failure, dropping is done in processors.

I'm aware of ongoing work to migrate batchprocessor into exporterhelper, but note that batchprocessor never drops data.

Having recently deployed a trace sampler processor, this has become a serious matter. When a processor drops data, it can be bad or it can be good--it's not always a failure. When the sampler processor drops a span, it's a normal situation. When a memorylimiter processor drops a span, it's a failure. I am beginning think that given how non-functional the existing metrics are, we have a good opportunity to redesign them now.

Please bring your attention to open-telemetry/semantic-conventions#184, which is a discussion of how OTel will count success/failure/drops consistently across the entire telemetry pipeline. One thing that collector brings to this (larger) discussion worth mentioning is the notion of metrics level. My current thinking is that we should view "drops because of queue overflow" as a detailed form of "send failed", and we should only be able to distinguish local-queue-failure from remote-timeout and other retryable errors via metrics when level=detailed is set.

@codeboten
Copy link
Contributor

codeboten commented Oct 6, 2023

This was the last change to reference the Traces* in a processor: ccee4eb#diff-e88817c5bceb5f48f1d38dfdc37163fd2a1745d496b1597bce17ef27e47fa183

@jmacd
Copy link
Contributor

jmacd commented Oct 12, 2023

I was confused, in the comment above, by the issue identified here: #8674 (review). Exporters that use exporterhelper are counting enqueue_failed metrics.

@danielgblanco
Copy link

Am I right in thinking that this is still relevant? Looking at https://github.com/open-telemetry/opentelemetry-collector/blob/main/processor/processorhelper/obsreport.go it seems *Accepted and *Refused methods are all used in the memory limiter (and only there) but I don't see usages of *Dropped. Although I can see places in contrib that refer to dropped telemetry in different ways (none using those functions).

Probably related to #7460 as there may be places where telemetry is currently dropped (e.g. by the exporter after retries are exhausted) but it shouldn't.

Also related to the docs that mention these dropped metrics as a way to monitor data loss https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/monitoring.md#data-loss

If the collector aims to keep retrying (after #7460) can we assume then that the metric to monitor for data loss should be enqueue_failed (which already is a recommended) and refused (although the client will retry and it may not be actually data lost)? Should these methods still exist if telemetry is not meant to be dropped but rather "filtered" or unprocessable by certain processors?

@TylerHelmuth
Copy link
Member

It feels like we should be recording a dropped metric here and/or here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

8 participants