Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sampling #20

Merged
merged 22 commits into from
May 29, 2018
Merged

Sampling #20

merged 22 commits into from
May 29, 2018

Conversation

pingles
Copy link
Contributor

@pingles pingles commented May 21, 2018

The work in this PR adds a basic probabilistic sampler. Sampling decisions are propagated between contexts and only sampled spans are reported.

ZipkinOtTracerOptions defaults a sample_rate to 1.0 (always sample) but can be changed between 0 and 1.

pingles added 14 commits May 16, 2018 18:28
* hardcoded sampling rate of 50%
* sampling decision should only be performed at root
  or honour parent span decision
* allow more than probabilistic strategy for sampling
* pretty sure i've not understood the model, code tidying to come
* change sample_rate to double
* use bernoulli distribution to decide whether to sample
* inspired by the jaeger module we check whether any of the references are
  ChildOf to identify a parent span.
* propagation works but seems to be non-deterministic when incoming
  context is incomplete (i.e. just sampling header but no trace id)
* currently hardcoded to prob sample with fixed rate but will
  be replaced next
* parent context may be not-null but empty- not sufficient to just check
  if there's a parent span, must also check whether it's valid.
* probabilistic sampler bounds p(sampled) between 0 and 1
@@ -566,6 +569,7 @@ class Span : public ZipkinBase {
std::string name_;
uint64_t id_;
Optional<TraceId> parent_id_;
bool sampled_;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you default initialize sampled_ to true

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely- just to check, do you mean via the default constructor of Span or do you mean in the definition of sampled_?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in-class initialization is preferred for something like that: https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#c48-prefer-in-class-initializers-to-member-initializers-in-constructors-for-constant-initializers

A lot of the code was copied in from envoy, so not all of it is following some of the conventions I'd use.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the link- super helpful for me. I'll make the initialization change now.

@@ -271,17 +282,27 @@ class OtSpan : public ot::Span {
class OtTracer : public ot::Tracer,
public std::enable_shared_from_this<OtTracer> {
public:
explicit OtTracer(TracerPtr &&tracer) : tracer_{std::move(tracer)} {}
explicit OtTracer(TracerPtr &&tracer) : tracer_{std::move(tracer)}, sampler_{std::move(new ProbabilisticSampler(1.0))} {}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::move(new ProbabilisticSampler(1.0)) can be changed to new ProbabilisticSampler(1.0)

@rnburn
Copy link
Owner

rnburn commented May 21, 2018

Could you run scripts/run_clang_format.sh scripts/run_clang_format.sh?

@rnburn
Copy link
Owner

rnburn commented May 21, 2018

The tsan failure means you might be missing some locking somewhere.

You can run the test locally and look at the log produced by cmake. Here's how it's run:
https://github.com/rnburn/zipkin-cpp-opentracing/blob/master/ci/do_ci.sh#L24

Or maybe set up circleci to upload the log as an artifact like I was doing here with Lightstep:
https://github.com/lightstep/lightstep-tracer-cpp/blob/master/.circleci/config.yml#L24-L26
so you can view it.

@pingles
Copy link
Contributor Author

pingles commented May 21, 2018

Cool- I'll do some more debugging. I can't replicate it locally (executing the tsan target doesn't yield anything) but that could be a quirk of my machine (MBP) and compiler I guess. I'll try and dig into it tomorrow. Thanks for your patience @rnburn.

pingles added 2 commits May 22, 2018 08:34
- normally this looks like it would be configured with the circleci
  config but this should work without repo write access
@pingles
Copy link
Contributor Author

pingles commented May 22, 2018

Finally got the output, 1 test fails:

https://369-95814681-gh.circle-artifacts.com/0/Test.log

-------------------------------------------------------------------------------
ot_tracer
  Propagates sampling decision to child span
-------------------------------------------------------------------------------
/root/project/zipkin_opentracing/test/ot_tracer_test.cc:73
...............................................................................

/root/project/zipkin_opentracing/test/ot_tracer_test.cc:95: FAILED:
  CHECK( r2->spans().empty() )
with expansion:
  false

So appears to be something related to propagation.

@pingles
Copy link
Contributor Author

pingles commented May 22, 2018

@rnburn not sure how to debug this- if I run the ./ci/do_ci.sh cmake.tsan on my MBP or in an ubuntu vm I don't see the same test failure. Likewise, ./ci/do_ci.sh cmake.asan also passes fine. However, if I run the following in the ubuntu vm then i do see the same test failure as above:

mkdir -p .build
cd .build && cmake .. && make && make test

But I don't see that error when I compile and run the tests on my main development MBP

@rnburn
Copy link
Owner

rnburn commented May 22, 2018

It might mean you're reading from uninitialized memory somewhere -- that could explain the unpredicatable behavior.

You could try adding some print statements to trace what's going on.

Or run in the debugger and set breakpoints at key locations.

@pingles
Copy link
Contributor Author

pingles commented May 23, 2018

I think i've reduced it to a problem with how the Sampler is constructed for use with the OtTracer. I changed the OtTracer constructor to always use new ProbabilisticSampler(1.0) and it propagates and samples correctly. Will keep probing.

@pingles
Copy link
Contributor Author

pingles commented May 23, 2018

I think I reduced the uninitialized memory to the flags in the span context (when constructed from a span). At this point I'm reasonably confident the work in this PR for the underlying lib has the necessary implementation for probabilistic sampling.

I'm still trying to debug the issue when it's consumed by the nginx module- it seems that the ZipkinOtTracerOptions::sample_rate is always 0 causing it to reliably fail.

return std::shared_ptr<ot::Tracer>{new OtTracer{std::move(tracer)}};
SamplerPtr sampler{new ProbabilisticSampler{options.sample_rate}};
return std::shared_ptr<ot::Tracer>{
new OtTracer{std::move(tracer), std::move(sampler)}};
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks, will do.

Can/should SamplerPtr and TracerPtr be replaced with std::make_unique also? https://github.com/isocpp/CppCoreGuidelines/blob/master/CppCoreGuidelines.md#c150-use-make_unique-to-construct-objects-owned-by-unique_ptrs

Thanks again for patiently sending me to the C++ idioms :)


class ProbabilisticSampler : public Sampler {
public:
ProbabilisticSampler(double sample_rate)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rnburn
Copy link
Owner

rnburn commented May 25, 2018

@pingles - is this ready to merge? it's looking pretty good.

@pingles
Copy link
Contributor Author

pingles commented May 25, 2018 via email

@pingles
Copy link
Contributor Author

pingles commented May 29, 2018

@rnburn pretty sure this is something inside the way I'm using the lib within the nginx module. am in the middle of poking around with lldb to debug and noticed this:

* thread #1, queue = 'com.apple.main-thread', stop reason = step in
    frame #0: 0x000000010844b7ee libzipkin_opentracing.0.dylib`zipkin::ProbabilisticSampler::ShouldSample(this=0x00007fa300f079a0) at sampling.cc:8
   5   	bool ProbabilisticSampler::ShouldSample() {
   6   	  static thread_local std::mt19937 rng(std::random_device{}());
   7   	  std::bernoulli_distribution dist(sample_rate_);
-> 8   	  return dist(rng);
   9   	}
   10  	} // namespace zipkin
Target 0: (nginx) stopped.

(lldb) print sample_rate_
(double) $0 = 6.5422003934259188E-310

@pingles
Copy link
Contributor Author

pingles commented May 29, 2018

I changed all the nginx config so it didn't daemonize or fork a worker process to make it easier to debug the initialisation. Looks like zipking::ZipkinOtTracerOptions is wrong when the factory function is called:

    frame #0: 0x0000000102b35225 libzipkin_opentracing.0.dylib`zipkin::makeZipkinOtTracer(options=0x00007ffeefbff680) at opentracing.cc:400
   397 	std::shared_ptr<ot::Tracer>
   398 	makeZipkinOtTracer(const ZipkinOtTracerOptions &options) {
   399 	  auto reporter =
-> 400 	      makeHttpReporter(options.collector_host.c_str(), options.collector_port,
   401 	                       options.reporting_period, options.max_buffered_spans);
   402 	  return makeZipkinOtTracer(options, std::move(reporter));
   403 	}
Target 0: (nginx) stopped.
(lldb) print options
(const zipkin::ZipkinOtTracerOptions) $0 = {
  collector_host = "localhost"
  collector_port = 9411
  ...
  sample_rate = 6.5422003934259188E-310

@pingles
Copy link
Contributor Author

pingles commented May 29, 2018

Ok figured it out- the nginx dynamic load was picking up the wrong object, but the headers included the sample_rate causing the value to always be garbage.

Pointing it at the right location causes it to initialise correctly:

* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.2
    frame #0: 0x0000000102035215 libzipkin_opentracing.0.dylib`zipkin::makeZipkinOtTracer(options=0x00007ffeefbff360) at opentracing.cc:400
   397 	std::shared_ptr<ot::Tracer>
   398 	makeZipkinOtTracer(const ZipkinOtTracerOptions &options) {
   399 	  auto reporter =
-> 400 	      makeHttpReporter(options.collector_host.c_str(), options.collector_port,
   401 	                       options.reporting_period, options.max_buffered_spans);
   402 	  return makeZipkinOtTracer(options, std::move(reporter));
   403 	}
Target 0: (nginx) stopped.
(lldb) print options
(const zipkin::ZipkinOtTracerOptions) $0 = {
  collector_host = "localhost"
  collector_port = 9411
  reporting_period = (__rep_ = 500000000)
  max_buffered_spans = 1000
  sample_rate = 1

At this point I'd say I'm happy for the above to be merged as long as you're happy there's no other style/correctness tweaks you'd like @rnburn :)

I'll create a separate PR for the nginx module that consumes these changes.

@rnburn rnburn merged commit 6194d20 into rnburn:master May 29, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants