Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zipking sampling percentage configuration #30

Closed
golonzovsky opened this issue Mar 9, 2018 · 36 comments
Closed

zipking sampling percentage configuration #30

golonzovsky opened this issue Mar 9, 2018 · 36 comments

Comments

@golonzovsky
Copy link

Currently there is no way (please correct me if I wrong) to specify sampling percentage for Zipkin tracer. This renders zipkin support not so usable on high load prod environments, as amount of traces generated is more than actual traffic.

Would it be possible to provide sampling param same as for eg. jaeger?

@rnburn
Copy link
Collaborator

rnburn commented Mar 9, 2018

That's right. The zipkin library doesn't provide sampling in the way jaeger does.

But I'd be happy to take a PR for it if it's something you'd like to work on.

@pingles
Copy link
Contributor

pingles commented May 13, 2018

I'd really like to have support for this too- @rnburn I can try and take a look, do you have any suggestions about where you'd like to see support/how it would be implemented? I.e. perhaps some kind of opentracing_sampler_probability 0.01; type option?

Am I right in thinking that this should also set X-B3-Sampled: 0 in the downstream tracing headers?

@rnburn
Copy link
Collaborator

rnburn commented May 13, 2018

@pingles - yes that's correct it should set that sampled header.

Additionally, it should look at the standard opentracing sampling.priority tag -- described here.

You might check out

And I think adding a sampling_rate field to this structure would be a good place to control the behavior. It would also need to be added to the JSON configuration.

Also, sampling is tracer-specific functionality, so there wouldn't be any opentracing interface to specify how the tracer should do it. Probabilistic sampling is only one particular strategy -- though probably the best one to start with. You can see Jaeger's documentation for examples of other ways to do it.

@pingles
Copy link
Contributor

pingles commented May 14, 2018 via email

@pingles
Copy link
Contributor

pingles commented May 16, 2018

I've started taking a look at this and trying to get my head around the interplay between the different modules.

Looking at the Jaeger module, the determination of whether to propagate the sampling decision, or determine whether to sample is done inside ::StartSpanWithOptions. Would I be correct in thinking that the right place to inject the same logic for the Zipkin lib would be it's equivalent: https://github.com/rnburn/zipkin-cpp-opentracing/blob/master/zipkin_opentracing/src/opentracing.cc#L277

Thanks @rnburn for your patience in helping me figure this out :)

@rnburn
Copy link
Collaborator

rnburn commented May 16, 2018

@pingles - yes, that would be a good place.

@pingles
Copy link
Contributor

pingles commented May 16, 2018

Ok cool, I found https://github.com/opentracing/specification/blob/master/specification.md and wasn't sure whether sampling was something that's tracer specific (and on a SpanContext like in the Jaeger lib, or associated to a Span). I've added the code to the place above, now trying to figure out why that's not propagating to the forwarded request's HTTP headers :)

@pingles
Copy link
Contributor

pingles commented May 16, 2018

I've started working on this and I've managed to (brutally) get it to do the right thing (at least from observing the downstream request headers):

pingles/zipkin-cpp-opentracing@67d68c6

I'm going to try and carry on tomorrow- I need to fix everything that's referenced in the commit message, and additionally make sure that spans are only reported if they've been sampled (missed that out of my message).

I'm certain that I've misunderstood the relationship between the classes so if anyone more familiar could have a scan and point out where I've made a mistake/should change something I'd really appreciate it. I'm by no means an expert C++ engineer either so please correct me on that also.

Thanks again for supporting me to add this.

@rnburn
Copy link
Collaborator

rnburn commented May 16, 2018

Thanks @pingles - I went through and made a few suggestions.

@pingles
Copy link
Contributor

pingles commented May 16, 2018

Awesome, thanks @rnburn- I'll make those changes tomorrow, appreciate the feedback.

@pingles
Copy link
Contributor

pingles commented May 17, 2018

I've made a few more improvements- is there a recommended way to determine whether a span is the root or not? Looking at https://github.com/opentracing/opentracing-cpp/blob/master/include/opentracing/tracer.h#L39 suggests that if references was empty then it could be considered the root span and we could then consult a sampler to determine whether to sample?

@pingles
Copy link
Contributor

pingles commented May 17, 2018

@rnburn I've copied the Jaeger implementation for determining whether a span is the root (checking whether references contains a ChildOf relationship. pingles/zipkin-cpp-opentracing@d7628b2

Unfortunately, it doesn't seem to work when I test it inside the nginx module- passing sample headers with curl:

curl -H "X-B3-Sampled: 1" -H "X-B3-SpanId: c4adf4696a1334" -H "X-B3-TraceId: c4adf4696a1334" http://localhost

I have a dummy app running that's fronted by nginx and prints the incoming request headers and it shows:

Sampled: 0

So... close, but not quite enough :)

I'll keep debugging to try and figure out where the missing chain is but if anyone else spots it before me please let me know! All work is in https://github.com/pingles/zipkin-cpp-opentracing/tree/sampling currently.

@pingles
Copy link
Contributor

pingles commented May 17, 2018

Related to the above- the other B3 headers are propagated correctly so it's either an issue with the sampled header specifically, or, something to do with my implementation of:

bool
hasParent(const ot::StartSpanOptions &options) const {
    for (auto ref : options.references) {
      if (ref.first == ot::SpanReferenceType::ChildOfRef) {
        return true;
      }
    }
    return false;
  }

@rnburn
Copy link
Collaborator

rnburn commented May 17, 2018

@pingles - I think the parseBool function for reading the sampling header that I wrote might be the problem.
https://github.com/rnburn/zipkin-cpp-opentracing/blob/master/zipkin_opentracing/src/propagation.cc#L34-L46
it always returns true even if sampling is suppose to be off.

(Sorry)

@rnburn
Copy link
Collaborator

rnburn commented May 17, 2018

Also, does

curl -H "X-B3-Sampled: 1" -H "X-B3-SpanId: c4adf4696a1334" -H "X-B3-TraceId: c4adf4696a1334" http://localhost

have a space in front of the header values? that might not work for propagation.

@pingles
Copy link
Contributor

pingles commented May 17, 2018

@rnburn Ah... so parseBool returns whether the value could be parsed?

bool sampled;
if (!parseBool(value, sampled)) {
  return ot::make_unexpected(ot::span_context_corrupted_error);
}
flags |= sampled_flag;

So that should change to:

if sampled {
  flags |= sampled_flag;
}

@rnburn
Copy link
Collaborator

rnburn commented May 17, 2018

Yes, that looks right. I think it's also hard-coded for inject: https://github.com/rnburn/zipkin-cpp-opentracing/blob/master/zipkin_opentracing/src/propagation.cc#L67

@pingles
Copy link
Contributor

pingles commented May 17, 2018 via email

@pingles
Copy link
Contributor

pingles commented May 17, 2018

Ok, made both of those changes but still no cigar...

I tried:

$ curl -H "x-b3-sampled: 0" http://localhost
# from the server:
Sampled: 1

I've pushed the changes again to my repository. rnburn/zipkin-cpp-opentracing@master...pingles:sampling shows current progress ;)

I suspect now the issue is the behaviour in OtTracer::StartSpanWithOptions (rnburn/zipkin-cpp-opentracing@master...pingles:sampling#diff-67ac5aa296d675567a7ada133765ddf9R280)- it determines if its a new span but doesn't do anything to pass sampling state from the references (if there are any). It feels like it needs to be something like:

if (hasParent) {
  span->setSampled(samplingState(references));
} else {
  span->setSampled(sampler.ShouldSample());
}

At least that's pretty close to the Jaeger module's equivalent. @rnburn does that sound right- that the job of the tracer when creating a span would be to check the referenced context and extract information from there? Am I also right to assume that, assuming the nginx module is using zipkin, any SpanContext objects passed via references to OtTracer::StartSpanWithOptions would be zipkin::SpanContext?

@pingles
Copy link
Contributor

pingles commented May 17, 2018

I've also just seen that OtSpan has nothing relating to sampling (but does copy other items like trace IDs etc.). It doesn't look like that has anything related to passing context along but does seem to be involved in sending the trace data?

@pingles
Copy link
Contributor

pingles commented May 17, 2018

I've done a few more tests and it appears that it's nondeterministic when incoming context information is missing some fields. When I was using curl and just setting x-b3-sampled it didn't appear to reliably propagate. However, once I added span and trace IDs it appears to reliably forward information.

I'll try and dig more into that before finally verifying that the sample decision is used to determine whether to send the span data to zipkin.

@pingles
Copy link
Contributor

pingles commented May 17, 2018

I'm an idiot- just seen there's some nice tests to build on, will start there to figure this out.

@pingles
Copy link
Contributor

pingles commented May 18, 2018

Writing some tests was super helpful (although they're a bit ugly- was tricky to control the behaviour of the sampler once the tracer had been constructed).

@rnburn would you be able to have another review through the changes please- any style changes etc. are obviously welcomed also rnburn/zipkin-cpp-opentracing@master...pingles:sampling

I'm still having issues getting it to behave correctly via this nginx module but the tests make me feel happier that I'm close :)

@rnburn
Copy link
Collaborator

rnburn commented May 18, 2018

For the souldSample function, you don't want to be doing initializing every time like this -- that's going to be slow (see the notes in the example code here).

Instead could you add a function to utility.cc and reuse this rand_source? (you can move the variable to be static to the translation unit instead of the function).

@pingles
Copy link
Contributor

pingles commented May 18, 2018

@rnburn thanks. Is it fair to store them as members instead- and then it's only initialised when the sampler is contructed (when the tracer is)?

public:
ProbabilisticSampler(double sample_rate) : dist_(sample_rate) {
  std::random_device device;
  rng_.seed(device());
};
private:
  std::mt19937 rng_;
  std::bernoulli_distribution dist_;

Or is it substantially better to use static? Also- is there any concern around thread-safe concurrent requests for numbers from the distribution, does it need to be guarded with some kind of mutex?

@pingles
Copy link
Contributor

pingles commented May 18, 2018

Having read around it seems like this may be preferable:

ProbabilisticSampler::ShouldSample() {
  static thread_local std::mt19937 rng(std::random_device{}()); 
  std::bernoulli_distribution dist(sample_rate_);
  return dist(rng);
}

@rnburn
Copy link
Collaborator

rnburn commented May 18, 2018

Yeah, that's what's done here: https://github.com/rnburn/zipkin-cpp-opentracing/blob/master/zipkin/src/utility.cc.

I was thinking add something like randomBool(double p) so that you can reuse that same rand_source variable that's already initialized.

But I think what you have is ok too.

@pingles
Copy link
Contributor

pingles commented May 18, 2018

Ok cool, I'll go with that for now :-) I think there's two things left to do: 1) make sure that spans aren't reported unless they're sampled, 2) figure out why the randomisation doesn't seem to work from the nginx module (pingles@e7b586c).

Current behaviour is that whatever sampling decision is forwarded from the client is propagated correctly. However, without any existing context headers the tracer always sets spans to be unsampled.

So very nearly there :-)

@pingles
Copy link
Contributor

pingles commented May 18, 2018

Oh, and check the sampling.priority tag that you referenced in a comment above, so 3 things.

@pingles
Copy link
Contributor

pingles commented May 21, 2018

I've updated the lib to only report sampled spans and am now debugging the issue with all spans being unsampled when used by the nginx module.

I've turned on the debug log and can see the following (with opentracing_trace_locations off):

2018/05/21 12:28:54 [debug] 23125#0: *1 http proxy header: "x-b3-traceid: 0948372d22ebf708"
2018/05/21 12:28:54 [debug] 23125#0: *1 http proxy header: "x-b3-spanid: 3bcb59253435d447"
2018/05/21 12:28:54 [debug] 23125#0: *1 http proxy header: "x-b3-sampled: 0"
2018/05/21 12:28:54 [debug] 23125#0: *1 http proxy header: "x-b3-parentspanid: 0000000000000000"
2018/05/21 12:28:54 [debug] 23125#0: *1 http proxy header: "x-b3-flags: 0"

It feels like there's something I've misunderstood in the way that spans are constructed around the request- that either the sampler is never called, or that it's not randomly drawing from the distribution. I don't suppose you have any good suggestions on where to look @rnburn please before I start debug logging everything

@pingles
Copy link
Contributor

pingles commented May 21, 2018

Note to myself: it feels like it might be a case of checking whether a parent span is valid, not just that it exists: https://github.com/jaegertracing/jaeger-client-cpp/blob/master/src/jaegertracing/SpanContext.h#L150

@pingles
Copy link
Contributor

pingles commented May 21, 2018

I think everything in the underlying zipkin opentracing lib is now updated to support probabilistic sampling from within the nginx module (aside from propagating information through the tags also).

However, I'm still having issues using it within the nginx module: currently it looks like the sampling rate is incorrectly parsed- so ~1/10 requests are sampled when the sampling probability ought to be 1/1.

I think I'll try and get the tag stuff completed on the zipkin lib and then open a PR as I'm pretty sure that is good to go- it'd be good to get some more feedback on the code to make sure I've not missed/misunderstood anything. After that I'll get back on trying to figure out what's happening within the nginx module.

Current code is in: https://github.com/pingles/zipkin-cpp-opentracing/tree/sampling

@rnburn
Copy link
Collaborator

rnburn commented May 21, 2018

thanks @pingles - let me know when you're ready, and I can take another look at it.

you might open up the PR, even if you're not finished - it will make it easier to review.

@pingles
Copy link
Contributor

pingles commented May 21, 2018

Perfect- I'll just do that then

@pingles
Copy link
Contributor

pingles commented May 21, 2018

I've just created the PR: rnburn/zipkin-cpp-opentracing#20

@rnburn
Copy link
Collaborator

rnburn commented May 30, 2018

Fixed with #38 (thank you @pingles )

@rnburn rnburn closed this as completed May 30, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants