Decision on new encoding for sampling "selectivity" #3602
Comments
Seems to me that this comes down to a couple questions:
I would argue compactness of representation is the most important attribute, followed by parsing/processing efficiency, followed by human readability. To achieve those goals, I would support a base-2 representation such as F or A1, favoring F. Options B and E appear to be quite similar IMO, and prioritize human readability over processing efficiency and simplicity. E in particular introduces what I think is unnecessary processing overhead while still failing to achieve a compact representation for power-of-2 sample rates. Options C and D seem most likely to invite implementation issues. They require all participants in the trace to parse with the same precision, and the representation is still not as compact as some other options.
The main factor to consider for human readability for me is ease of debugging when inspecting data on the wire. The questions I'd have for that are:
This may already be understood by folks here, but for any casual observer, we should also make it clear that the main human interaction points - configuration and reading a value from an observability backend - should prioritize human readability.
@dyladan F is not really a "base 2 representation" - it simply doesn't allow for many common sample rates. While one can synthesize a sample rate of 10 by alternating between 8 and 16, it's not at all obvious what's going on when you look at the resulting telemetry.
I'd guess fairly unlikely unless that person is developing the SDK itself.
As an SDK developer myself I can say that I personally don't find human readability to be an important part of that process. It is "nice to have" in some cases, but I really just want to see that the sample rate sent in the header is the one I configured, and that the data is correct in the exported OTLP telemetry (which is also not human readable).
Completely agree that configuration should prioritize human simplicity. For representation in the UI of an observability backend, that is up to each backend to figure out.
Sorry I was a little cavalier with mixing "base 2" and "power of 2". F is a power of 2 strategy. Indeed it is restrictive in the values it allows, trading some flexibility for efficiency. While, as you said, it is possible to synthesize additional sampling rates, I wouldn't expect that to be common. I think it's far more likely people would just use 8 rather than introducing that much additional complexity to force 10.
Regarding debuggability, there are actually two things you would like to check, and you cannot have direct access to both:
@dyladan I see sampling configurations designed by users all the time, and I'm not sure I've ever seen 8, but I've seen 10 a lot, as well as 3, 50, 100, 1000, 10000. If we were to choose F, then we could still allow users to specify 10, but then the result they see after it flows through the pipeline is a mix of 8 and 16. I think this is explainable but both confusing and completely unnecessary.
I think if we selected F we would most likely encourage users to use the rates naturally provided. Of course a sampler could use more complex strategies to mimic additional rates, but that would be an option only for advanced users. From a configuration standpoint, I think most likely we would call it something like "sample factor" and each increment would halve the sample rate.
The sampling SIG met and, having received this feedback, has a unanimous agreement on option A1.
We consider this issue resolved. Reviewers, please endorse open-telemetry/oteps#235 with your approvals!
OTEP 235 has merged. 🎉
…rt OTEP 235) (#31894)
**Description:** Creates new sampler modes named "equalizing" and "proportional". Preserves existing functionality under the mode named "hash_seed".
Fixes #31918
This is the final step in a sequence, the whole of this work was factored into 3+ PRs, including the new `pkg/sampling` and the previous step, #31946. The two new Sampler modes enable mixing OTel sampling SDKs with Collectors in a consistent way.
The existing hash_seed mode is also a consistent sampling mode, which makes it possible to have a 1:1 mapping between its decisions and the OTEP 235 randomness and threshold values. Specifically, the 14-bit hash value and sampling probability are mapped into 56-bit R-value and T-value encodings, so that all sampling decisions in all modes include threshold information.
This implements the semantic conventions of open-telemetry/semantic-conventions#793, namely the `sampling.randomness` and `sampling.threshold` attributes used for logs where there is no tracestate.
The default sampling mode remains HashSeed. We consider a future change of default to Proportional to be desirable, because:
1. Sampling probability is the same, only the hashing algorithm changes
2. Proportional respects and preserves information about earlier sampling decisions, which HashSeed can't do, so it has greater interoperability with OTel SDKs which may also adopt OTEP 235 samplers.
**Link to tracking Issue:** Draft for open-telemetry/opentelemetry-specification#3602. Previously #24811, see also open-telemetry/oteps#235
Part of #29738
**Testing:** New testing has been added.
**Documentation:** ✅
---------
Co-authored-by: Juraci Paixão Kröhling <juraci.github@kroehling.de>
Summary
The Sampling SIG has been working on a proposal to follow the W3C tracecontext group, which has added a flag to convey definite information about randomness in the TraceID.
In particular, we aim to address the TODO about the TraceIDRatioBased Sampler:
We are looking for community input on a choice which will impact implementation complexity for OpenTelemetry Samplers as well as human-interpretability of the raw data.
Note this proposal was co-authored by @kentquirk @oertl @PeterF778 and @jmacd.
ACTION ITEM: Please review and vote for your preferred encoding strategy in the comments below. OpenTelemetry Tracing SDK authors as well as OpenTelemetry Collector trace processors will be asked to implement the encoding and decoding strategies here in order to communicate about sampling selectivity, and we need your input!
Background
We propose to add information to the W3C Trace Context specification to allow consistent sampling decisions to be made across the entire lifetime of a trace.
The expectation is that trace IDs should contain at least 56 bits of randomness in a known portion of the ID. This value is known as r, and there is a bit in the trace header that indicates its presence.
In probabilistic sampling, the sampling decision is a binary choice to keep (store) or drop (discard) a trace. Because traces are composed of multiple spans, we want to be sure that the same decision is made for all elements in the trace. Therefore, we don’t make a truly random decision at each stage. We instead wish to use the randomness embedded in the trace ID so that all stages can make consistent decisions.
In order to make consistent decisions, we need to propagate not only the randomness (the r value), but also the sampling selectivity used. In other words, in a trace that travels between services A, B, and C, any decision made by B should use the same information as a decision made by A, and B could potentially modify the selectivity so that C could also make an appropriate decision.
As noted, the r value expresses a 56-bit random value that can be used as the source of randomness for a probabilistic sampling decision. The intent of this proposal is to express the sampling selectivity that was used to make the decision, and to do it in the trace state.
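As a rough illustration of the consistency argument above, the sketch below derives r from a trace ID and compares it with a threshold. Two details are assumptions made for the example and are not stated in this issue: that r occupies the low 56 bits of the trace ID, and that a trace is kept when r is strictly below the threshold (which matches a sampling probability of threshold × 2^(-56)). The function names are hypothetical.

```python
TMAX = 1 << 56  # number of distinct 56-bit random values


def randomness_from_trace_id(trace_id_hex: str) -> int:
    """Return r, assumed here to be the low 56 bits of a 32-hex-digit trace ID."""
    return int(trace_id_hex, 16) & (TMAX - 1)


def keep(r: int, threshold: int) -> bool:
    """Keep the trace when r is below the threshold (assumed comparison direction)."""
    return r < threshold


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
r = randomness_from_trace_id(trace_id)

# Services A, B and C all see the same r, so with the same threshold
# they reach the same decision without coordinating with each other.
decisions = [keep(r, TMAX // 2) for _service in ("A", "B", "C")]
```

Because the decision depends only on r and the threshold, a downstream service that knows the threshold used upstream can also tell whether a lower threshold of its own would still keep the trace.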
Sampling selectivity can be described in several different ways:
Minimum requirements
1. Given the sampling information on the trace state, it MUST be specified, for every possible representation on any platform, how this translates to the applied sampling threshold (the value that was compared against the random bits). Only this makes it possible to reproduce the sampling decision together with the 56 random bits, and it gives 100% consistency.
2. Based on that, it can be derived which of the 2^56+1 sampling thresholds that are meaningful with 56 random bits can be expressed by the sampling information on the trace state. The proposals should therefore be clear about which thresholds are actually supported. The set of supported thresholds also defines the set of possible sampling probabilities. The sampling probability is just the threshold multiplied by 2^(-56).
3. When picking one of the supported thresholds, there should be a lossless way to map it to the sampling information that is written to the trace state. Lossless in the sense that the reverse mapping as described in 1. yields exactly the chosen threshold again. The mapping from thresholds to the sampling information is important for adaptive sampling, where the threshold is automatically chosen.
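The relation between thresholds and probabilities can be sketched with exact rational arithmetic. The helper names are hypothetical, and truncating toward zero when a probability falls between two thresholds is an assumption made for this example (a proposal could equally specify rounding).

```python
import math
from fractions import Fraction

TMAX = 1 << 56  # thresholds range over 0..TMAX inclusive, i.e. 2^56+1 values


def probability(threshold: int) -> Fraction:
    """Sampling probability is the threshold multiplied by 2^(-56)."""
    return Fraction(threshold, TMAX)


def threshold_for(p: Fraction) -> int:
    """Largest threshold whose probability does not exceed p (assumed truncation)."""
    return math.floor(p * TMAX)


one_in_ten = threshold_for(Fraction(1, 10))
two_thirds = threshold_for(Fraction(2, 3))
```

With these definitions, `probability(threshold_for(Fraction(1, 8)))` is exactly 1/8, while 1/10 and 2/3 can only be approximated, since neither is a multiple of 2^(-56).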
Objective
We would like to express this sampling probability/rate/threshold in a reasonably compact way in the trace state. We would like that expression to be easy and efficient to generate and parse in any programming language. Another requirement is that the used notation should be able to describe cases of non-probabilistic sampling (corresponding to the zero adjusted count or the old p=63 case). We have been referring to this value as t.
The sampling SIG has been discussing this issue for some time, and we have examined several proposals. Each proposal has its strengths and weaknesses and we are looking for input from the larger community.
Note that this document includes just a summary of the proposals, below; all of them have been specified in sufficient detail to resolve most implementation issues. We are hoping for input from the community to help make the big decision about the implementation direction.
Request for input
The major difference in these proposals that we wish to seek input on is whether it is more important to optimize for threshold calculation (option A) at the expense of human readability, or whether to choose one of the other options which are readable and accessible, but make threshold calculations harder to work with.
List of options
When we refer to Tmax, we mean 2^56 (0x100000000000000 or 72057594037927936)
Option A: Hex Threshold
Keep 1 in 10: t=19999999999999
Keep 1 in 8: t=20000000000000
Keep half: t=80000000000000
Keep 2/3: t=aaaaaaaaaaaaaa
Keep 1 in 1000: t=004189374bc6a7
If t is absent the threshold is 2^56
If t is 2^56, corresponding to 100% sampling probability, the t-value is not set
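A minimal sketch of how Option A might be encoded and decoded, assuming the t-value is always exactly 14 lowercase hex digits and that an absent t is modeled as `None`. The function names are hypothetical.

```python
from typing import Optional

TMAX = 1 << 56


def encode_a(threshold: int) -> Optional[str]:
    """Return the 14-hex-digit t-value, or None when t is omitted (100% sampling)."""
    if not 0 <= threshold <= TMAX:
        raise ValueError("threshold out of range")
    if threshold == TMAX:
        return None
    return format(threshold, "014x")


def decode_a(t: Optional[str]) -> int:
    """Inverse of encode_a; an absent t means the threshold is 2^56."""
    if t is None:
        return TMAX
    if len(t) != 14:
        raise ValueError("expected exactly 14 hex digits")
    return int(t, 16)
```

For example, `encode_a(TMAX // 8)` gives `20000000000000`, and decoding it returns the original threshold, so the mapping is lossless in the sense of the minimum requirements.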
Option A1: Hex Threshold with omission of trailing zeros
Keep 1 in 10: t=19999999999999
Keep 1 in 8: t=2
Keep half: t=8
Keep 2/3: t=aaaaaaaaaaaaaa
Keep 1 in 1000: t=004189374bc6a7
If t is absent the threshold is 2^56
t is padded with trailing zeros if it has fewer than 14 hex digits
If t is 2^56, corresponding to 100% sampling probability, the t-value is not set
Trailing zeros may be omitted
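Option A1 differs from A only in dropping trailing zeros when writing and restoring them when reading. A sketch under the same assumptions as before (absent t modeled as `None`, hypothetical function names); writing a zero threshold as `0` is an additional assumption, since the list above does not cover that case.

```python
from typing import Optional

TMAX = 1 << 56


def encode_a1(threshold: int) -> Optional[str]:
    """Hex threshold with trailing zeros omitted; None when t is not set."""
    if not 0 <= threshold <= TMAX:
        raise ValueError("threshold out of range")
    if threshold == TMAX:
        return None
    digits = format(threshold, "014x").rstrip("0")
    return digits or "0"  # assumed spelling for a zero threshold


def decode_a1(t: Optional[str]) -> int:
    """Pad with trailing zeros to 14 hex digits, then parse as base 16."""
    if t is None:
        return TMAX
    if not 1 <= len(t) <= 14:
        raise ValueError("expected 1 to 14 hex digits")
    return int(t.ljust(14, "0"), 16)
```

Power-of-two rates collapse to a single digit (`encode_a1(TMAX // 8)` is `2`), while thresholds such as 1 in 10 keep all 14 digits.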
Option B: Integer Sampling Rate
Keep 1 in 8: t=8
Keep 1 in 10: t=10
Keep half: t=2
Keep 2/3: not expressible in this format
Keep 1 in 1000: t=1000
Keep none: not expressible in this format
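For comparison, a sketch of how an integer sampling rate could be turned into a threshold. The issue does not spell out this mapping, so the truncating division below is an assumption, and `threshold_from_rate` is a hypothetical name.

```python
TMAX = 1 << 56


def threshold_from_rate(rate: int) -> int:
    """Map 'keep 1 in rate' to a threshold by truncating division (assumed)."""
    if rate < 1:
        raise ValueError("rate must be a positive integer")
    return TMAX // rate


# Only thresholds of the form TMAX // n are reachable, which is why
# 2/3 (threshold 0xaaaaaaaaaaaaaa) and 'keep none' (threshold 0) are not expressible.
reachable = {threshold_from_rate(n) for n in range(1, 1001)}
```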
Option C: Sampling probability
Keep 1 in 8: t=.125
Keep 1 in 10: t=.1
Keep half: t=.5
Keep 2/3 ieee precision: t=.6666666666667
Keep 2/3 precision 4: t=.6667
Keep 2/3 precision 2: t=.67
Keep 1 in 1000: t=.001
Note rounding is performed in parseDecimalFloat.
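The precision concern with decimal probabilities can be illustrated by converting each spelling of 2/3 into a threshold. This sketch parses the decimal exactly and truncates; it does not reproduce the `parseDecimalFloat` rounding mentioned in the note, and `threshold_from_decimal` is a hypothetical helper.

```python
import math
from decimal import Decimal
from fractions import Fraction

TMAX = 1 << 56


def threshold_from_decimal(t: str) -> int:
    """Parse a decimal probability exactly and truncate to a threshold (assumed)."""
    p = Fraction(Decimal(t))
    if not 0 <= p <= 1:
        raise ValueError("probability out of range")
    return math.floor(p * TMAX)


spellings = [".6666666666667", ".6667", ".67"]
thresholds = [threshold_from_decimal(s) for s in spellings]
```

Each spelling yields a different threshold, so any r that falls between two of them would be kept by one participant and dropped by another unless everyone agrees on the precision.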
Option C1: Sampling probability with hex floating point
Keep 1 in 8: as in C or t=0x1p-3
Keep 1 in 10: as in C or 0x1.5p-3
Keep half: as in C or t=0x1p-1
Keep 2/3 ieee precision: as in C or 0x1.5555555555555p-1
Keep 2/3 precision 4: as in C or 0x1.5555p-1
Keep 2/3 precision 2: as in C or 0x1.55p-1
Keep 1 in 1000 ieee precision: as in C or 0x1.0624dd2f1a9fcp-10
Keep 1 in 1000 precision 4: as in C or 0x1.0625p-10
Keep 1 in 1000 precision 2: as in C or 0x1.06p-10
Note there is no rounding performed.
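Hex floating point has standard-library support in several languages. In Python, for instance, `float.fromhex` and `float.hex` convert in both directions, and scaling by 2^56 with `math.ldexp` is exact because it only changes the exponent. The helper name below is hypothetical, and truncating to an integer threshold is an assumption.

```python
import math

TMAX = 1 << 56


def threshold_from_hexfloat(t: str) -> int:
    """Parse a hex float probability and scale it to a 56-bit threshold."""
    p = float.fromhex(t)
    if not 0.0 <= p <= 1.0:
        raise ValueError("probability out of range")
    return int(math.ldexp(p, 56))  # p * 2^56, truncated toward zero


one_in_eight = threshold_from_hexfloat("0x1p-3")
two_thirds = threshold_from_hexfloat("0x1.5555555555555p-1")
```

Note that the IEEE double for 2/3 carries 53 significant bits, so the resulting threshold ends in `a8` and is not the full-precision `aaaaaaaaaaaaaa` of Option A.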
Option C2: Sampling Probability with unnormalized hex floating point
Keep 1 in 8: as in C1 or t=0x2p-04 (threshold = 0x20000000000000)
Keep 1 in 10: as in C1 or t=0x2ap-8 (threshold = 0x2a000000000000)
Keep half: as in C1 or t=0x8p-04 (threshold 0x80000000000000)
Keep 2/3 full precision: as in C1 or 0xaaaaaaaaaaaaaap-56
Keep 2/3 ieee precision: as in C1 or 0xaaaaaaaaaaaaap-52
Keep 2/3 precision 4: as in C1 or 0xaaabp-16
Keep 2/3 precision 2: as in C1 or 0xabp-8
Keep 1 in 1000 ieee precision: as in C1 or 0x4189374bc6a7p-56
Keep 1 in 1000 precision 4: as in C1 or 0x4189p-16
Keep 1 in 1000 precision 2: as in C1 or 0x42p-8
Note there is no rounding performed.
Option D: Combination of B and C
Keep 1 in 8: t=8 or t=.125
Keep 1 in 10: t=.1 or t=10
Keep half: t=.5 or t=2
Keep 2/3: t=.6667
Keep 1 in 1000: t=.001 or t=1000
Keep arbitrary hex-digit threshold HHHH with custom code: t=0xHHHHp-(len(HHHH)*4)
Keep arbitrary hex-digit threshold HHHH with standard library: t=0x1.JJJJJp-DD, where JJJJJ and DD correspond to the normalized hex floating point value for HHHH. Note that JJJJJ is one digit longer than HHHH, due to shifting hex digits by one bit.
Option E: Ratio
Keep 1 in 8: t=8 or t=1/8
Keep 1 in 10: t=10 or t=1/10
Keep half: t=50/100 or t=1/2 or t=2
Keep 2/3: t=2/3 or t=6667/10000
Keep 1 in 1000: t=1/1000 or t=1000
Option F: Powers-of-two
Keep 1 in 8: t=3
Keep half: t=1
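Under Option F the t-value is read as an exponent, so the threshold is 2^56 shifted right by t. A sketch, assuming t ranges over 0 to 56 (the upper bound is not stated above) and using a hypothetical function name:

```python
TMAX = 1 << 56


def threshold_from_exponent(t: int) -> int:
    """Keep 1 in 2^t: each increment of t halves the sampling probability."""
    if not 0 <= t <= 56:
        raise ValueError("exponent out of range")
    return TMAX >> t


# A rate such as 1 in 10 has no exact t; it falls between t=3 (1 in 8)
# and t=4 (1 in 16).
lower, upper = threshold_from_exponent(4), threshold_from_exponent(3)
```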