Sampling SIG progress report #1819

Closed
jmacd opened this issue Jul 16, 2021 · 2 comments
Labels
spec:trace Related to the specification/trace directory

Comments

@jmacd

jmacd commented Jul 16, 2021

What are you trying to achieve?

Report to the general tracing community the problems and solutions that have been discussed.

The Sampling SIG has had two meetings and themes are emerging:

  1. The TraceIDRatio sampler is our preferred default mechanism because it supports independent span sampling without requiring propagation of the effective sampling probability.
  2. To complete the TraceIDRatio specification for its intended use, we need to agree on and specify a hashing mechanism, as we do not expect TraceID values to be uniformly distributed.
  3. Any Sampler or export/processing stage that knows the output inclusion probability for a span SHOULD record that value as an adjusted count (i.e., the inverse of the inclusion probability) in an attribute named sampling.adjusted_count (proposed).
  4. The name of the effective sampling policy should be attached as a resource attribute to indicate further information about what sampling is taking place, e.g., sampling.policy=xyz. If the sampling policy is dynamic on a per-span basis, the attribute MAY be a span attribute instead of a resource attribute.
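
As a rough illustration of item 3, the sketch below computes an adjusted count from an inclusion probability and sums adjusted counts to estimate how many spans existed before sampling. The attribute name sampling.adjusted_count is the proposed one from the list above; the dictionary-based span shape, the function names, and the default of 1.0 for spans without the attribute are assumptions made only for this example.

```python
def adjusted_count(inclusion_probability: float) -> float:
    """Return the inverse of the inclusion probability."""
    if not 0.0 < inclusion_probability <= 1.0:
        raise ValueError("inclusion probability must be in (0, 1]")
    return 1.0 / inclusion_probability


def estimate_total_spans(sampled_spans) -> float:
    """Estimate the pre-sampling span count by summing adjusted counts.

    Assumption: a span without the attribute was not sampled down,
    so it counts as exactly one span.
    """
    return sum(
        span["attributes"].get("sampling.adjusted_count", 1.0)
        for span in sampled_spans
    )


# Hypothetical spans: two kept at probability 0.25, one unsampled.
spans = [
    {"attributes": {"sampling.adjusted_count": adjusted_count(0.25)}},
    {"attributes": {"sampling.adjusted_count": adjusted_count(0.25)}},
    {"attributes": {}},
]
estimated = estimate_total_spans(spans)
```

Each span kept at probability 0.25 stands in for four spans, so the three collected spans represent an estimated nine.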

The problem with incomplete traces

When collecting spans that may or may not have been sampled, there is a well-known problem with identifying when a trace is complete. This problem exists with or without sampling, but it becomes substantially worse with sampling. Without sampling, a trace could be incomplete because spans are being dropped, so you would suspect incomplete traces only when spans are being dropped. With sampling, you have to suspect that traces may be incomplete even under normal operation.

We are searching for mechanisms to detect incomplete traces, and the ones we know of are:

  1. Ensure that collection is perfect (i.e., no dropped spans) and use a Parent Sampler for sampling decisions.
  2. Ensure that collection is perfect (i.e., no dropped spans), use the TraceIDRatio Sampler, and record the per-process minimum TraceIDRatio threshold that is in effect. Assume traces are complete when they fall below the global threshold.
  3. The "missing parent" heuristic: a missing parent is detectable because it leaves behind a child span whose parent is absent (this does not detect missing leaves).
  4. The approach OpenCensus took, which is to count the expected number of child spans and check that they are all present.
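
The "missing parent" heuristic from the list above can be sketched as follows. The span representation (dictionaries with span_id and parent_id keys) and the function name find_orphans are hypothetical and chosen only for illustration.

```python
def find_orphans(spans):
    """Return IDs of spans whose parent was never collected.

    Root spans have parent_id None and are not orphans.
    """
    collected_ids = {span["span_id"] for span in spans}
    return [
        span["span_id"]
        for span in spans
        if span["parent_id"] is not None
        and span["parent_id"] not in collected_ids
    ]


# Span "c" was never collected, so its child "d" is an orphan.
missing_interior = [
    {"span_id": "a", "parent_id": None},
    {"span_id": "b", "parent_id": "a"},
    {"span_id": "d", "parent_id": "c"},
]

# Leaf "b" was never collected; nothing that remains points at it.
missing_leaf = [
    {"span_id": "a", "parent_id": None},
]

orphans_interior = find_orphans(missing_interior)
orphans_leaf = find_orphans(missing_leaf)
```

The second case shows the limitation noted in the list: a missing leaf leaves no orphan behind, so the heuristic reports nothing.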

None of these approaches is perfect. When sampling is present we may not know that traces are complete without some or all of these techniques being applied. The last of these approaches, the OpenCensus child count technique, was rejected from OpenTelemetry in the early days because, following the OpenTracing API specification, there are separate APIs to inject and extract context that do not directly imply the creation of a child. We could specify that spans maintain a "number of child contexts spawned", record it in SpanData, and encourage tracing systems to display a "probably incomplete" marker when the number of child contexts spawned disagrees with the actual child count. It would be a signal for the system or the user to investigate all known causes of trace incompleteness.
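
A minimal sketch of the "probably incomplete" marker described above, assuming each span records how many child contexts it spawned. The field name child_contexts_spawned and the dictionary span shape are hypothetical; no such field exists in SpanData today.

```python
from collections import Counter


def probably_incomplete(spans) -> bool:
    """Flag a trace when any span's recorded number of spawned child
    contexts disagrees with the number of children actually collected.
    """
    # Count the collected children per parent span ID.
    observed_children = Counter(
        span["parent_id"] for span in spans if span["parent_id"] is not None
    )
    return any(
        observed_children[span["span_id"]] != span["child_contexts_spawned"]
        for span in spans
    )


complete_trace = [
    {"span_id": "a", "parent_id": None, "child_contexts_spawned": 2},
    {"span_id": "b", "parent_id": "a", "child_contexts_spawned": 0},
    {"span_id": "c", "parent_id": "a", "child_contexts_spawned": 0},
]
# The same trace with leaf "c" missing.
trace_missing_leaf = complete_trace[:2]

complete_flag = probably_incomplete(complete_trace)
missing_flag = probably_incomplete(trace_missing_leaf)
```

Unlike the missing-parent heuristic, this check also catches a missing leaf. The result is only a hint, because an injected context does not necessarily lead to a child span being created.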

Action items for the community: discuss how you would like to report about trace incompleteness.

Additional context.

See OTEP 148 for more detail and background.

@jmacd

jmacd commented Jul 22, 2021

Following on these topics from today's Sampling SIG:

Regarding the Jaeger remote sampler configuration as an OTel spec: #1791

Regarding trace incompleteness, @oertl has provided a technical report on ways to use partial trace information, here: https://arxiv.org/pdf/2107.07703.pdf. This argues for using power-of-two sampling rates for head sampling because it enables a novel analysis technique covered in the paper.

Regarding the steps to complete the TraceIDRatio sampler specification, three options were discussed:

  1. Use sufficiently random trace IDs and specify how to evaluate the ratio test directly from TraceID bits.
  2. Propagate a new uniform random number in the range (0, 1] in the W3C tracestate to facilitate the ratio test.
  3. Specify that TraceIDs need not be "very random" but dictate the use of a standard hashing algorithm.
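
Options 1 and 3 can be compared with a short sketch. Both reduce to the same threshold test; they differ only in where the random value comes from. The choice of the low 56 bits of the trace ID and the use of SHA-256 as the hash are assumptions made for this example, since the discussion above does not settle either detail. The function names are likewise hypothetical.

```python
import hashlib

RANDOM_BITS = 56  # assumption: number of bits used for the ratio test
MAX_VALUE = 1 << RANDOM_BITS


def _below_threshold(value: int, ratio: float) -> bool:
    """Sample when value, uniform in [0, MAX_VALUE), is under ratio * MAX_VALUE."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1]")
    return value < int(ratio * MAX_VALUE)


def should_sample_from_bits(trace_id: int, ratio: float) -> bool:
    """Option 1: trust the low bits of the trace ID to be uniformly random."""
    return _below_threshold(trace_id & (MAX_VALUE - 1), ratio)


def should_sample_from_hash(trace_id: int, ratio: float) -> bool:
    """Option 3: hash the 128-bit trace ID first (SHA-256 is a stand-in)."""
    digest = hashlib.sha256(trace_id.to_bytes(16, "big")).digest()
    return _below_threshold(int.from_bytes(digest[:7], "big"), ratio)
```

Because every process applies the same deterministic test to the same trace ID, a trace that is sampled at some ratio is also sampled at every higher ratio. That property is what allows independent per-process decisions without propagating the sampling probability.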

@SergeyKanzhelev

"discuss how you would like to report about trace incompleteness."

Was the "OpenCensus child count technique" used as an incomplete-sampling marker or for the trace buffering optimization? With the latter, someone doing retroactive sampling or just indexing traces can tell when a trace is complete and stop waiting for more spans to arrive. Perhaps the two are the same. I'm curious what other scenarios exist where a trace incompleteness marker is critical.
