Report to the general tracing community the problems and solutions that have been discussed.
The Sampling SIG has had two meetings and themes are emerging:
- The TraceIDRatio sampler is our preferred default mechanism because it supports independent span sampling without requiring propagation of the effective sampling probability.
- To complete the TraceIDRatio specification for its intended use, we need to agree on and specify a hashing mechanism, as we do not expect TraceID values to be uniformly distributed.
- Any Sampler or export/processing stage that knows the output inclusion probability for a span SHOULD output that value in the form of an adjusted count (i.e., the inverse of the inclusion probability) using an attribute named sampling.adjusted_count (proposed).
- The name of the effective sampling policy should be attached as a resource attribute to indicate further information about what sampling is taking place, e.g., sampling.policy=xyz. If the sampling policy is dynamic on a per-span basis, the attribute MAY be a span attribute instead of a resource attribute.
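As a rough illustration of the adjusted-count idea above, the sketch below estimates how many spans existed before sampling by summing the adjusted counts of the spans that survived. The attribute name sampling.adjusted_count is the proposed name from this issue; the dictionary-based span representation, the helper names, and the choice to count unannotated spans as 1 are assumptions made only for this example.

```python
ADJUSTED_COUNT_KEY = "sampling.adjusted_count"  # proposed attribute name from this issue


def adjusted_count(inclusion_probability: float) -> float:
    """Return the adjusted count, i.e. the inverse of the inclusion probability."""
    if not 0.0 < inclusion_probability <= 1.0:
        raise ValueError("inclusion probability must be in (0, 1]")
    return 1.0 / inclusion_probability


def estimate_total_spans(sampled_spans) -> float:
    """Estimate how many spans existed before sampling.

    Each sampled span stands in for `adjusted_count` original spans, so the
    estimate is the sum of the adjusted counts. Spans without the attribute
    are counted as 1 (an assumption of this sketch).
    """
    return sum(span.get(ADJUSTED_COUNT_KEY, 1.0) for span in sampled_spans)


# Hypothetical sampled spans: two kept with probability 1/4, one with 1/2.
spans = [
    {"name": "GET /a", ADJUSTED_COUNT_KEY: adjusted_count(0.25)},
    {"name": "GET /b", ADJUSTED_COUNT_KEY: adjusted_count(0.25)},
    {"name": "GET /c", ADJUSTED_COUNT_KEY: adjusted_count(0.5)},
]
print(estimate_total_spans(spans))  # 4 + 4 + 2 = 10.0
```

Because each stage records the adjusted count on the span itself, a consumer can compute this estimate without knowing which sampler produced the data.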
The problem with incomplete traces
When collecting spans that may or may not have been sampled, there is a well-known problem with identifying when a trace is complete. This problem exists with or without sampling, but it becomes substantially worse with sampling. Without sampling, a trace can only be incomplete because spans were dropped, so you need to suspect incomplete traces only when spans are being dropped. With sampling, you have to suspect that traces may be incomplete even under normal operation.
We are searching for mechanisms to detect incomplete traces, and the ones we know of are:
- Ensure that collection is perfect (i.e., no dropped spans) and use a Parent Sampler for sampling decisions.
- Ensure that collection is perfect (i.e., no dropped spans), use the TraceIDRatio Sampler, and record the per-process minimum TraceIDRatio threshold that is in effect. Assume traces are complete when they fall below the global threshold.
- The "missing parent" heuristic: a collected span whose parent is absent reveals that the parent is missing (this doesn't work for missing leaves).
- The approach OpenCensus took, which is to count the expected number of child spans and check that they are all present.
None of these approaches is perfect. When sampling is present we may not know that traces are complete without some or all of these techniques being applied. The last of these approaches, the OpenCensus child count technique, was rejected from OpenTelemetry in the early days because, following the OpenTracing API specification, there are separate APIs to inject and extract context that do not directly imply the creation of a child. We could specify that spans maintain a "number of child contexts spawned", record it in SpanData, and encourage tracing systems to display a "probably incomplete" marker when the number of child contexts spawned disagrees with the actual child count. It would be a signal for the system or the user to investigate all known causes of trace incompleteness.
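A minimal sketch of how a tracing backend might combine the child-count idea with the "missing parent" heuristic is shown below. The SpanRecord type and its child_contexts_spawned field are hypothetical, standing in for the "number of child contexts spawned" value floated above; nothing here reflects an existing OpenTelemetry API.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class SpanRecord:
    # Hypothetical, simplified stand-in for SpanData.
    span_id: str
    parent_id: Optional[str]
    child_contexts_spawned: int  # hypothetical field proposed in this discussion


def child_count_mismatches(spans: Iterable[SpanRecord]) -> List[str]:
    """Return IDs of spans whose observed child count differs from the spawned count."""
    spans = list(spans)
    observed = Counter(s.parent_id for s in spans if s.parent_id is not None)
    return [s.span_id for s in spans if observed[s.span_id] != s.child_contexts_spawned]


def missing_parents(spans: Iterable[SpanRecord]) -> List[str]:
    """Return IDs of spans that name a parent which was never collected."""
    spans = list(spans)
    present = {s.span_id for s in spans}
    return [s.span_id for s in spans if s.parent_id is not None and s.parent_id not in present]


def probably_incomplete(spans: Iterable[SpanRecord]) -> bool:
    """Flag a trace as "probably incomplete" if either heuristic fires."""
    spans = list(spans)
    return bool(child_count_mismatches(spans) or missing_parents(spans))


# Span "a" spawned two child contexts, but only one child ("b") was collected.
trace = [SpanRecord("a", None, 2), SpanRecord("b", "a", 0)]
print(child_count_mismatches(trace))  # ['a']
print(probably_incomplete(trace))     # True
```

Since injecting a context does not guarantee that a child span is ever created, a mismatch is only a hint to investigate, not proof that a span was lost.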
Action items for the community: discuss how you would like to report about trace incompleteness.
Regarding the Jaeger remote sampler configuration as an OTel spec: #1791
Regarding trace incompleteness, @oertl has provided a technical report on ways to use partial trace information: https://arxiv.org/pdf/2107.07703.pdf. It argues for using power-of-two sampling rates for head sampling because doing so enables a novel analysis technique covered in the paper.
Regarding the steps to complete the TraceIDRatio sampler specification, three options were discussed:
- Use sufficiently random trace IDs and specify how to evaluate the ratio test directly from TraceID bits.
- Propagate a new uniform random number in the range (0, 1] in the W3C tracestate to facilitate the ratio test.
- Specify that TraceIDs need not be "very random" but dictate the use of a standard hashing algorithm.
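To make the difference between the first and third options concrete, here is a sketch of both ratio tests. The choice of the low 64 bits, the use of SHA-256 as the "standard hashing algorithm", and the function names are all assumptions for illustration; none of them has been agreed on in this discussion.

```python
import hashlib

BITS = 64
SPACE = 1 << BITS  # number of distinct values in the tested bit range


def sample_from_bits(trace_id: int, ratio: float) -> bool:
    """Option 1: trust the low 64 bits of the 128-bit trace ID to be uniformly random."""
    return (trace_id & (SPACE - 1)) < int(ratio * SPACE)


def sample_from_hash(trace_id: int, ratio: float) -> bool:
    """Option 3: hash the trace ID first, so that non-random IDs still spread evenly."""
    digest = hashlib.sha256(trace_id.to_bytes(16, "big")).digest()
    return int.from_bytes(digest[:8], "big") < int(ratio * SPACE)


# Sequential (non-random) trace IDs show why option 1 needs "sufficiently random" IDs:
# every small ID falls below the 50% threshold, so all of them are sampled.
ids = range(1, 1001)
print(sum(sample_from_bits(i, 0.5) for i in ids))  # 1000
print(sum(sample_from_hash(i, 0.5) for i in ids))  # roughly half of 1000
```

Both tests compare a per-trace value against a threshold, so a trace kept at a given ratio is also kept at any higher ratio, which is what lets independently configured processes make consistent decisions.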
> discuss how you would like to report about trace incompleteness.
Was the "OpenCensus child count technique" used as an incomplete-trace marker for sampling, or as a trace buffering optimization? With the latter, a system doing retroactive sampling or just indexing traces can tell that a trace is complete and stop waiting for more spans to arrive. Perhaps these are the same thing. I'm curious what other scenarios exist where a trace incompleteness marker is critical.
Additional context.
See OTEP 148 for more detail and background.