Sampling SIG progress report #1819

jmacd · 2021-07-16T15:35:08Z

What are you trying to achieve?

Report to the general tracing community the problems and solutions that have been discussed.

The Sampling SIG has had two meetings and themes are emerging:

The TraceIDRatio sampler is our preferred default mechanism because it supports independent span sampling without requiring propagation of the effective sampling probability.
To complete the TraceIDRatio specification for its intended use, we need to agree on and specify a hashing mechanism, as we do not expect TraceID values to be uniformly distributed.
Any Sampler or export/processing stage that knows the inclusion probability of a span it outputs SHOULD record that value as an adjusted count (i.e., the inverse of the inclusion probability) in an attribute named sampling.adjusted_count (proposed).
The name of the effective sampling policy should be attached as a resource attribute to indicate further information about what sampling is taking place, e.g., sampling.policy=xyz. If the sampling policy is dynamic on a per-span basis, the attribute MAY be a span attribute instead of a resource attribute.
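
As a rough illustration of the adjusted-count idea above, the sketch below computes the inverse of a span's inclusion probability and sums adjusted counts to estimate how many spans existed before sampling. The SampledSpan type and the helper names are hypothetical and not part of any OpenTelemetry API; only the attribute name sampling.adjusted_count comes from the proposal.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class SampledSpan:
    """Hypothetical stand-in for a span that survived sampling."""
    name: str
    inclusion_probability: float  # probability with which this span was kept


def adjusted_count(inclusion_probability: float) -> float:
    """Return the inverse of the inclusion probability."""
    if not 0.0 < inclusion_probability <= 1.0:
        raise ValueError("inclusion probability must be in (0, 1]")
    return 1.0 / inclusion_probability


def estimate_total(spans: Iterable[SampledSpan]) -> float:
    """Estimate the pre-sampling span count by summing adjusted counts."""
    return sum(adjusted_count(s.inclusion_probability) for s in spans)


spans = [
    SampledSpan("GET /a", 0.25),  # each of these stands in for 4 spans
    SampledSpan("GET /a", 0.25),
    SampledSpan("GET /b", 1.0),   # unsampled path, stands in for itself
]

# The proposed attribute, as it might be attached to the first span.
attributes = {"sampling.adjusted_count": adjusted_count(spans[0].inclusion_probability)}
estimated_spans = estimate_total(spans)
```

Recording the adjusted count rather than the raw probability means a consumer can estimate totals by simple summation, even when spans were sampled at different rates.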

The problem with incomplete traces

When collecting spans that may or may not have been sampled, there is a well-known problem with identifying when a Trace is complete. This problem exists with or without sampling, but it becomes substantially worse with sampling. Without sampling, a trace can be incomplete only because spans are being dropped, so you need to suspect incomplete traces only when spans are being dropped. With sampling, you have to suspect that traces may be incomplete even under normal operation.

We are searching for mechanisms to detect incomplete traces, and the ones we know of are:

Ensure that collection is perfect (i.e., no dropped spans) and use a Parent Sampler for sampling decisions.
Ensure that collection is perfect (i.e., no dropped spans), use the TraceIDRatio Sampler, and record the per-process minimum TraceIDRatio threshold that is in effect. Assume traces are complete when they fall below the global threshold.
The "missing parent" heuristic: when a parent is missing you can tell because there is a child without a parent (this doesn't work for missing leaves).
The approach OpenCensus took, which is to count the expected number of child spans and check that they are all present.

None of these approaches is perfect. When sampling is present we may not know that traces are complete without some or all of these techniques being applied. The last of these approaches, the OpenCensus child count technique, was rejected from OpenTelemetry in the early days because, following the OpenTracing API specification, there are separate APIs to inject and extract context that do not directly imply the creation of a child. We could specify that spans maintain a "number of child contexts spawned", record it in SpanData, and encourage tracing systems to display a "probably incomplete" marker when the number of child contexts spawned disagrees with the actual child count. It would be a signal for the system or the user to investigate all known causes of trace incompleteness.
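
A minimal sketch of how a tracing backend might combine the "missing parent" heuristic with the proposed child-context count is shown below. The SpanData shape and the child_contexts_spawned field are assumptions for illustration; no such field exists in the specification today.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SpanData:
    """Hypothetical span record; child_contexts_spawned is the proposed counter."""
    span_id: str
    parent_id: Optional[str]
    child_contexts_spawned: int


def probably_incomplete(spans: List[SpanData]) -> List[str]:
    """Return human-readable reasons to suspect the trace is incomplete."""
    present = {s.span_id for s in spans}
    # Number of children actually collected for each parent span ID.
    observed = Counter(s.parent_id for s in spans if s.parent_id is not None)
    reasons = []
    for s in spans:
        # "Missing parent" heuristic: a child whose parent was never collected.
        if s.parent_id is not None and s.parent_id not in present:
            reasons.append(f"span {s.span_id}: parent {s.parent_id} is missing")
        # Child-count check: spawned contexts disagree with collected children.
        if observed[s.span_id] != s.child_contexts_spawned:
            reasons.append(
                f"span {s.span_id}: spawned {s.child_contexts_spawned} "
                f"child contexts, found {observed[s.span_id]} children"
            )
    return reasons


trace = [
    SpanData("a", None, 2),  # root that injected context twice
    SpanData("b", "a", 0),
    # the second child of "a" was never collected
]
reasons = probably_incomplete(trace)
```

Because injecting a context does not guarantee that a child span is created, a mismatch is only a hint to investigate, which is why the function returns reasons rather than a definitive verdict.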

Action items for the community: discuss how you would like to report about trace incompleteness.

Additional context.

See OTEP 148 for more detail and background.

jmacd · 2021-07-22T20:38:59Z

Following on these topics from today's Sampling SIG:

Regarding the Jaeger remote sampler configuration as an OTel spec: #1791

Regarding trace incompleteness, @oertl has provided a technical report on ways to use partial trace information, here: https://arxiv.org/pdf/2107.07703.pdf. This argues for using power-of-two sampling rates for head sampling because it enables a novel analysis technique covered in the paper.

Regarding the steps to complete the TraceIDRatio sampler specification, three options were discussed:

Use sufficiently random trace IDs, specify how to evaluate the ratio test directly from TraceID bits
Propagate a new uniform random number in the range (0, 1] in the W3C tracestate to facilitate the ratio test
Specify that TraceIDs need not be "very random" but dictate the use of a standard hashing algorithm.
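
To make the first and third options concrete, here is a hedged sketch of both ratio tests. It assumes a 16-byte TraceID, assumes the low 8 bytes are the random portion for the bit-based test, and uses SHA-256 purely as a placeholder hash; the SIG had not chosen a byte range or a hashing algorithm at this point, and the function names are hypothetical.

```python
import hashlib

_RANGE = 2 ** 64  # size of the 64-bit value space used by both tests


def _threshold(ratio: float) -> int:
    """Map a sampling ratio in [0, 1] onto the 64-bit value space."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1]")
    return int(ratio * _RANGE)


def sample_from_bits(trace_id: bytes, ratio: float) -> bool:
    """Option 1 sketch: trust the TraceID bits to be sufficiently random."""
    if len(trace_id) != 16:
        raise ValueError("trace_id must be 16 bytes")
    # Assumption: the low 8 bytes are uniformly distributed.
    value = int.from_bytes(trace_id[8:], "big")
    return value < _threshold(ratio)


def sample_from_hash(trace_id: bytes, ratio: float) -> bool:
    """Option 3 sketch: hash the TraceID first, so its bits need not be random."""
    if len(trace_id) != 16:
        raise ValueError("trace_id must be 16 bytes")
    # Assumption: SHA-256 stands in for whatever standard hash is specified.
    digest = hashlib.sha256(trace_id).digest()
    value = int.from_bytes(digest[:8], "big")
    return value < _threshold(ratio)


trace_id = bytes.fromhex("4bf92f3577b34da6a3ce929d0e0e4736")
decisions = {
    "bits": sample_from_bits(trace_id, 0.5),
    "hash": sample_from_hash(trace_id, 0.5),
}
```

In both variants every process derives the same value from the same TraceID, so a trace kept at a lower ratio is also kept at any higher ratio. The second option would instead carry the random value in tracestate, which avoids any dependence on TraceID contents.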

SergeyKanzhelev · 2021-07-22T22:59:02Z

discuss how you would like to report about trace incompleteness.

Was the "OpenCensus child count technique" used as an incomplete-sampling marker or as a trace buffering optimization, so that someone doing retroactive sampling or just indexing traces can tell whether a trace is complete and need not wait for more spans to arrive? Perhaps it's the same thing; I'm curious what other scenarios exist where a trace incompleteness marker is critical.

jmacd added the spec:trace Related to the specification/trace directory label Jul 16, 2021

github-actions bot assigned SergeyKanzhelev Jul 16, 2021

jmacd mentioned this issue Jul 21, 2021

Probability sampling basics for telemetry events open-telemetry/oteps#148

Closed

carlosalberto mentioned this issue Jul 23, 2021

Complete the TraceIdRatio specification #1826

Open

jmacd closed this as completed Jul 27, 2021

jmacd mentioned this issue Aug 30, 2021

Probability sampling specification #1899

Closed
