
[processor/tailsamplingprocessor] Improve accuracy of count_traces_sampled metric #26724

Merged

Conversation

jmsnll (Contributor) commented Sep 18, 2023

Description:

Previously, the calculation of `count_traces_sampled` was incorrect, resulting in all per-policy metric counts being the same.

```
count_traces_sampled{policy="policy-a", sampled="false"} 50
count_traces_sampled{policy="policy-a", sampled="true"} 50
count_traces_sampled{policy="policy-b", sampled="false"} 50
count_traces_sampled{policy="policy-b", sampled="true"} 50
```

This issue stemmed from the metrics being incremented based on the final decision of trace sampling, rather than being attributed to the policies causing the sampling.

With the fix implemented, the total sampling count can no longer be inferred from the metrics alone. To address this, a new metric, `global_count_traces_sampled`, was introduced to accurately capture the global count of sampled traces.

```
count_traces_sampled{policy="policy-a", sampled="false"} 50
count_traces_sampled{policy="policy-a", sampled="true"} 50
count_traces_sampled{policy="policy-b", sampled="false"} 24
count_traces_sampled{policy="policy-b", sampled="true"} 76
global_count_traces_sampled{sampled="false"} 24
global_count_traces_sampled{sampled="true"} 76
```

Reusing the `count_traces_sampled` policy metric for this purpose would have either meant:

  1. Leaving the policy field empty, which didn't feel very explicit.
  2. Setting a global placeholder policy name (such as `final-decision`), which could then potentially collide with existing user-configured policy names without validation.

Link to tracking Issue: Fixes #25882

Testing:

Tested with various combinations of `probabilistic`, `latency`, `status_code` and attribute policies (including combinations of `and` and `composite` policies).

No unit tests were added; I'm open to suggestions for adding tests for metrics, but I couldn't find any examples of this being done elsewhere.

Documentation: No documentation changes.

James Neill added 2 commits September 18, 2023 09:05
…h policy decision

- previously metric counts for each policy were based on the final decision and not on the outcome of each specific policy
- captures the global count of traces sampled by at least one policy
- trying to capture the count in the existing `statCountTracesSampled` would have resulted in either an empty policy key or have a placeholder value (causing potential collision with user named policies from the configuration)
@jmsnll jmsnll requested a review from jpkrohling as a code owner September 18, 2023 09:23
@jmsnll jmsnll requested a review from a team September 18, 2023 09:23
linux-foundation-easycla bot commented Sep 18, 2023

CLA Signed: the committers listed above are authorized under a signed CLA.

@github-actions github-actions bot added the processor/tailsampling Tail sampling processor label Sep 18, 2023
@jpkrohling jpkrohling merged commit 66cbd50 into open-telemetry:main Sep 27, 2023
@github-actions github-actions bot added this to the next release milestone Sep 27, 2023
mx-psi (Member) commented Sep 27, 2023

This broke main:

```
Error: /home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/processor/tailsamplingprocessor/metrics.go:93:16: undefined: obsreport
```

@jmsnll @jpkrohling would you be able to do a quick fix? Or should I revert the PR and we can address this more calmly?

@mx-psi mx-psi mentioned this pull request Sep 27, 2023
jmsnll (Contributor, Author) commented Sep 27, 2023

@mx-psi Why `obsreport` isn't defined isn't immediately obvious to me, so it may be best to revert while I take a closer look.

jmsnll (Contributor, Author) commented Sep 27, 2023

Actually, a fix has been submitted at #27246.

mx-psi (Member) commented Sep 27, 2023

Indeed :) that should fix it, thanks for taking a look in any case!

jmsnll added a commit to jmsnll/opentelemetry-collector-contrib that referenced this pull request Nov 12, 2023
…ampled` metric (open-telemetry#26724)
Labels: processor/tailsampling (Tail sampling processor)

Successfully merging this pull request may close: Tail sampling processor "count_traces_sampled" metric is inaccurate (#25882).