
[processor/tailsamplingprocessor] Improve accuracy of count_traces_sampled metric #26724

Merged

Conversation

jmsnll (Contributor) commented Sep 18, 2023

Description:

Previously, the calculation of `count_traces_sampled` was incorrect, resulting in all per-policy metric counts being the same.

```
count_traces_sampled{policy="policy-a", sampled="false"} 50
count_traces_sampled{policy="policy-a", sampled="true"} 50
count_traces_sampled{policy="policy-b", sampled="false"} 50
count_traces_sampled{policy="policy-b", sampled="true"} 50
```

This issue stemmed from the metrics being incremented based on the final decision of trace sampling, rather than being attributed to the policies causing the sampling.

With the fix implemented, the total sampling count can no longer be inferred from the metrics alone. To address this, a new metric, `global_count_traces_sampled`, was introduced to accurately capture the global count of sampled traces.

```
count_traces_sampled{policy="policy-a", sampled="false"} 50
count_traces_sampled{policy="policy-a", sampled="true"} 50
count_traces_sampled{policy="policy-b", sampled="false"} 24
count_traces_sampled{policy="policy-b", sampled="true"} 76
global_count_traces_sampled{sampled="false"} 24
global_count_traces_sampled{sampled="true"} 76
```

Reusing the `count_traces_sampled` policy metric for this purpose would have either meant:

  1. Leaving the policy field empty, which didn't feel very explicit.
  2. Setting a global placeholder policy name (such as `final-decision`), which could then potentially collide with existing user-configured policy names without validation.

Link to tracking Issue: Fixes #25882

Testing:

Tested with various combinations of `probabilistic`, `latency`, `status_code` and attribute policies (including combinations of `and` and `composite` policies).

No unit tests were added; I'm open to suggestions for adding tests for metrics, but I couldn't find any examples of this being done elsewhere.

Documentation: No documentation changes.

James Neill added 2 commits September 18, 2023 09:05
…h policy decision

- previously metric counts for each policy were based on the final decision and not on the outcome of each specific policy
- captures the global count of traces sampled by at least one policy
- trying to capture the count in the existing `statCountTracesSampled` would have resulted in either an empty policy key or have a placeholder value (causing potential collision with user named policies from the configuration)
@jmsnll jmsnll requested a review from jpkrohling as a code owner September 18, 2023 09:23
@jmsnll jmsnll requested a review from a team September 18, 2023 09:23
linux-foundation-easycla bot commented Sep 18, 2023

CLA Signed: the committers listed above are authorized under a signed CLA.

@github-actions github-actions bot added the processor/tailsampling Tail sampling processor label Sep 18, 2023
@jpkrohling jpkrohling merged commit 66cbd50 into open-telemetry:main Sep 27, 2023
@github-actions github-actions bot added this to the next release milestone Sep 27, 2023
mx-psi (Member) commented Sep 27, 2023

This broke main:

```
Error: /home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/processor/tailsamplingprocessor/metrics.go:93:16: undefined: obsreport
```

@jmsnll @jpkrohling would you be able to do a quick fix? Or should I revert the PR and we can address this more calmly?

@mx-psi mx-psi mentioned this pull request Sep 27, 2023
jmsnll (Contributor, Author) commented Sep 27, 2023

@mx-psi Why `obsreport` isn't defined isn't immediately obvious to me, so it may be best to revert while I take a closer look.

jmsnll (Contributor, Author) commented Sep 27, 2023

Actually, a fix has been submitted at #27246.

mx-psi (Member) commented Sep 27, 2023

Indeed :) that should fix it, thanks for taking a look in any case!

jmsnll added a commit to jmsnll/opentelemetry-collector-contrib that referenced this pull request Nov 12, 2023
…ampled` metric (open-telemetry#26724)
Labels: processor/tailsampling (Tail sampling processor)

Successfully merging this pull request may close: Tail sampling processor "count_traces_sampled" metric is inaccurate (#25882).