This repository has been archived by the owner on May 23, 2024. It is now read-only.

Fix the span allocation in the pool #381

Merged · 2 commits merged into jaegertracing:master on Apr 23, 2019

Conversation

@demdxx (Contributor) commented Mar 16, 2019

Which problem is this PR solving?

Span allocation from the pool did not work properly: the tracer reused the span object before the reporter finished processing it, which caused errors and overwrote data before the spans were sent.

func TestRemoteReporterAppendWithPollAllocator(t *testing.T) {
	s := makeReporterSuiteWithSender(t, &fakeSender{bufferSize: 10}, ReporterOptions.BufferFlushInterval(time.Millisecond*10))
	TracerOptions.PoolSpans(true)(s.tracer.(*Tracer))
	for i := 0; i < 1000; i++ {
		s.tracer.StartSpan("sp").Finish()
	}
	time.Sleep(time.Second)
	s.sender.assertFlushedSpans(t, 1000)
	s.close() // causes explicit flush that also fails with the same error
}
go test -timeout 30s github.com/uber/jaeger-client-go -run ^(TestRemoteReporterAppendWithPollAllocator)$ -v -race
=== RUN   TestRemoteReporterAppendWithPollAllocator
==================
WARNING: DATA RACE
Write at 0x00c4202223d8 by goroutine 18:
  github.com/uber/jaeger-client-go.(*Tracer).newSpan()
      github.com/uber/jaeger-client-go/tracer.go:357 +0xff
  github.com/uber/jaeger-client-go.(*Tracer).startSpanWithOptions()
      github.com/uber/jaeger-client-go/tracer.go:289 +0xbdd
  github.com/uber/jaeger-client-go.(*Tracer).StartSpan()
      github.com/uber/jaeger-client-go/tracer.go:200 +0x17d
  github.com/uber/jaeger-client-go.TestRemoteReporterAppendWithPollAllocator()
      github.com/uber/jaeger-client-go/reporter_test.go:162 +0x1fc
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:777 +0x16d

Previous read at 0x00c4202223d8 by goroutine 19:
  github.com/uber/jaeger-client-go.(*fakeSender).Append()
      github.com/uber/jaeger-client-go/reporter_test.go:328 +0xb7
  github.com/uber/jaeger-client-go.(*remoteReporter).processQueue()
      github.com/uber/jaeger-client-go/reporter.go:273 +0x2b8

Goroutine 18 (running) created at:
  testing.(*T).Run()
      /usr/local/go/src/testing/testing.go:824 +0x564
  testing.runTests.func1()
      /usr/local/go/src/testing/testing.go:1063 +0xa4
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:777 +0x16d
  testing.runTests()
      /usr/local/go/src/testing/testing.go:1061 +0x4e1
  testing.(*M).Run()
      /usr/local/go/src/testing/testing.go:978 +0x2cd
  main.main()
      _testmain.go:212 +0x22a

Goroutine 19 (running) created at:
  github.com/uber/jaeger-client-go.NewRemoteReporter()
      github.com/uber/jaeger-client-go/reporter.go:209 +0x2ec
  github.com/uber/jaeger-client-go.makeReporterSuiteWithSender()
      github.com/uber/jaeger-client-go/reporter_test.go:65 +0x4f3
  github.com/uber/jaeger-client-go.TestRemoteReporterAppendWithPollAllocator()
      github.com/uber/jaeger-client-go/reporter_test.go:159 +0x154
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:777 +0x16d
==================
--- FAIL: TestRemoteReporterAppendWithPollAllocator (2.35s)
	github.com/uber/jaeger-client-go/assertions.go:256:
			Error Trace:	reporter_test.go:386
			            				reporter_test.go:165
			Error:      	"[61bdfe3abf262a6a:61bdfe3abf262a6a:0:1 ...
			Test:       	TestRemoteReporterAppendWithPollAllocator
	github.com/uber/jaeger-client-go/testing.go:730: race detected during execution of test

Short description of the changes

  • Added control over the lifecycle of the Span object via the .Retain() and .Release() methods.
  • Rewrote Span object allocation to go through object factories; see span_allocator.go (sketched below).
  • Fixed remoteReporter in reporter.go, which was the cause of the race in span allocation.
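
For orientation, a minimal package-internal sketch of that factory split, assuming a reset() helper that clears span state; the real span_allocator.go may differ in details:

```go
import "sync"

// SpanAllocator abstracts how Span objects are obtained and returned.
type SpanAllocator interface {
	Get() *Span
	Put(*Span)
}

// syncPollSpanAllocator recycles Span objects through a sync.Pool.
type syncPollSpanAllocator struct {
	spanPool sync.Pool
}

func newSyncPollSpanAllocator() SpanAllocator {
	return &syncPollSpanAllocator{
		spanPool: sync.Pool{New: func() interface{} {
			return &Span{}
		}},
	}
}

func (a *syncPollSpanAllocator) Get() *Span {
	return a.spanPool.Get().(*Span)
}

func (a *syncPollSpanAllocator) Put(span *Span) {
	span.reset() // assumed helper: clears span state before reuse
	a.spanPool.Put(span)
}

// simpleSpanAllocator is the default: plain allocation, nothing pooled.
type simpleSpanAllocator struct{}

func (simpleSpanAllocator) Get() *Span {
	return &Span{}
}

func (simpleSpanAllocator) Put(span *Span) {
	// finished spans are not reused, so there is nothing to reset
}
```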


Signed-off-by: Dmitry Ponomarev <demdxx@gmail.com>
codecov bot commented Mar 16, 2019

Codecov Report

Merging #381 into master will increase coverage by 0.47%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master     #381      +/-   ##
==========================================
+ Coverage    87.8%   88.27%   +0.47%     
==========================================
  Files          54       55       +1     
  Lines        3033     3061      +28     
==========================================
+ Hits         2663     2702      +39     
+ Misses        264      255       -9     
+ Partials      106      104       -2
Impacted Files     Coverage        Δ
tracer.go          96.75% <100%>   (+4.72%) ⬆️
tracer_options.go  77.04% <100%>   (+1.18%) ⬆️
span.go            100%   <100%>   (ø) ⬆️
reporter.go        100%   <100%>   (ø) ⬆️
span_allocator.go  100%   <100%>   (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 4a1cede...07e2d7e

@yurishkuro (Member) left a comment

Thanks for the PR. I think it's a good direction, but I am bothered by the many Retain/Release changes to the tests: I don't understand why they are necessary (given that the default allocator is not pooling), and they make the tests harder to read.

	tracer.options.poolSpans = poolSpans
	if poolSpans {
		tracer.spanAllocator = newSpanSyncPool()
	} else {
Member:

else not needed, constructor defaults to this anyway
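
Sketched, written as a plain function for brevity (the real option lives on the TracerOptions helper): keep the if, drop the else, and let the constructor's default simple allocator stand.

```go
// PoolSpans switches the tracer to pooled span allocation; when false,
// the allocator set by the constructor is left untouched.
func PoolSpans(poolSpans bool) TracerOption {
	return func(tracer *Tracer) {
		tracer.options.poolSpans = poolSpans
		if poolSpans {
			tracer.spanAllocator = newSpanSyncPool()
		}
	}
}
```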

Contributor (Author):

I tried to preserve the current behavior of PoolSpans; otherwise the option would need to check whether the allocators are already the same, which is more complicated. I believe the current overhead is easier to understand, and if PoolSpans() is called several times it won't be a big problem.

@demdxx (Contributor, Author) commented Mar 18, 2019

@yurishkuro I tried not to break the logic you have right now, specifically that the Finish() method sends and releases the Span object. The problem is that pooled allocation didn't work at all with InMemoryReporter and remoteReporter, because they keep references to span objects for asynchronous processing.

There were two options for fixing it:
1. Add move semantics to the reporters.
2. Extend the span with additional memory-management logic.
I chose the second one because move semantics require extra memory for a copy of the Span object's body, and after the move the original object becomes useless, which makes the Span unusable with a compositeReporter.

After adding the Retain() and Release() feature, the lifecycle of the object changes, and one needs to understand that Finish() only finalizes the collected logs and tags. It does not make the model significantly harder to understand, but it makes more important features possible. For us memory consumption is a very important metric; without this it is hard to use the client in high-load services.

I will add another test covering the Span lifecycle to span_test.go. I'm also open to suggestions on how else this could be solved.
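
For illustration, a hypothetical sketch of that lifecycle from the caller's side; the Retain/Release names come from this PR, while the exact release semantics (that the object is recycled once all retains are released) are an assumption:

```go
package main

import (
	opentracing "github.com/opentracing/opentracing-go"
	jaeger "github.com/uber/jaeger-client-go"
)

// tracePooled keeps a parent span alive past Finish() because another
// goroutine still reads it: Retain/Release extend the pooled object's
// lifetime, while Finish only finalizes the collected logs and tags.
func tracePooled(tracer opentracing.Tracer, done chan struct{}) {
	span := tracer.StartSpan("parent")
	span.(*jaeger.Span).Retain() // co-own the object before handing it off
	go func() {
		defer close(done)
		defer span.(*jaeger.Span).Release() // return ownership to the pool
		child := tracer.StartSpan("child", opentracing.ChildOf(span.Context()))
		child.Finish()
	}()
	span.Finish() // the object can be recycled only after the Release above
}
```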

@yurishkuro (Member):
I understand that introducing lifecycle management for spans will require changes in the reporters. When a reporter receives a span, it needs to call Retain() and then Release() at the end of async processing. In the case of RemoteReporter, the span can be released after it is converted to Thrift. In the case of the in-memory reporter, the span can be released after it is purged from the cache. But the latter should not have any effect on the unit tests, since they rely on the fact that the span is retained by the in-memory reporter.
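
Sketched inside the reporter; the queue handling and field names here are assumptions, not the actual remoteReporter code:

```go
// Report hands the span to the background goroutine. Retain marks the
// reporter as a co-owner, so the tracer cannot recycle the object yet.
func (r *remoteReporter) Report(span *Span) {
	span.Retain()
	select {
	case r.queue <- span:
	default:
		span.Release() // queue full: drop the span and free it immediately
	}
}

// processQueue drains the queue; once the sender has consumed the span
// (for RemoteReporter, converted it to Thrift), the reference is released.
func (r *remoteReporter) processQueue() {
	for span := range r.queue {
		r.sender.Append(span)
		span.Release() // safe: the reporter no longer touches the object
	}
}
```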

@yurishkuro (Member) commented Mar 18, 2019

BTW I am extremely interested in this change. Could you elaborate on your performance goals?

  • how many spans per second per core are you trying to achieve?
  • what is the server desired QPS per core?
  • what are the server latency percentiles?
  • what level of sampling are you targeting?

@demdxx (Contributor, Author) commented Mar 18, 2019

But the latter should not have any effect on the unit tests, since they rely on the fact that the span is retained by the in-memory reporter.

OK then, I can use InMemoryReporter instead of the current nullReporter as temporary storage for tests; in that case I can revert almost all test changes. What do you think?

  • how many spans per second per core are you trying to achieve?
    We have just started and count on at least 20-30 spans per request.
  • what is the server desired QPS per core?
    Approximately 400 QPS per core.
  • what are the server latency percentiles?
    avg(99%): 8.43ms
    avg: 3.73ms
    min: 11.73 microseconds
    max: 482ms
    But keep in mind that we are an ad network and have many integrations with other ad networks' APIs.
  • what level of sampling are you targeting?
    We'll see.

Of course, there is no big overhead, especially because we will use a custom StartSpanFromContext method so that unused spans are not allocated on each request at all.

BTW, we need to integrate sampling at the OpenTracing level instead of inside Jaeger.

demdxx added a commit to demdxx/jaeger-client-go that referenced this pull request Mar 19, 2019
See the pull comments jaegertracing#381 (comment)

Signed-off-by: Dmitry Ponomarev <demdxx@gmail.com>
@demdxx (Contributor, Author) commented Mar 19, 2019

@yurishkuro check this out: demdxx@6b2d873
I have reverted almost all test changes by adding InMemoryReporter. It works, but I think it's not completely fair, because it hides the fact that after Finish() the span becomes invalid.

@yurishkuro (Member):
@demdxx the reason I am concerned with Retain/Release in the tests is that the tests often represent usage by an external user, who interacts with the Jaeger library via the OpenTracing API, which does not expose Retain/Release.

It works, but I think it's not completely fair, because it hides the fact that after Finish() the span becomes invalid.

It only becomes invalid if you pool the span, which most of the tests don't do. The default mode is not to pool spans.

demdxx added a commit to demdxx/jaeger-client-go that referenced this pull request Mar 24, 2019
TestRemoteReporterAppendWithPollAllocator => TestRemoteReporterAppendWithPoolAllocator

Revert tests related to span allocation Retain/Release functions

See the pull comments jaegertracing#381 (comment)

Revert the tests to make them work by original logic

Revert transport and tracer tests

Signed-off-by: Dmitry Ponomarev <demdxx@gmail.com>
reporter.go Outdated
@@ -122,7 +123,14 @@ func (r *InMemoryReporter) GetSpans() []opentracing.Span {
 func (r *InMemoryReporter) Reset() {
 	r.lock.Lock()
 	defer r.lock.Unlock()
-	r.spans = nil
+
+	if len(r.spans) > 0 {
Member:

unnecessary if(), just run the loop
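
Sketched, with the Release() method from this PR (the exact body is illustrative, not the merged code):

```go
// Reset drops all stored spans, handing each pooled object back to its
// allocator; ranging over a nil or empty slice is a no-op, so no length
// check is needed.
func (r *InMemoryReporter) Reset() {
	r.lock.Lock()
	defer r.lock.Unlock()

	for _, span := range r.spans {
		span.(*Span).Release()
	}
	r.spans = nil
}
```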

reporter_test.go Outdated
@@ -301,6 +312,10 @@ func (s *fakeSender) Append(span *Span) (int, error) {
	s.mutex.Lock()
	defer s.mutex.Unlock()

	// Validation of the span
	if span.tracer == nil {
Member:

when can this happen? Maybe better to panic?

span.go Outdated

	// retainCounter is used to extend the lifetime of
	// the object before it is returned to the pool.
	retainCounter int32
Member:

atomic vars must be placed at the top of the struct to ensure word alignment
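
For context, the sync/atomic documentation guarantees 64-bit alignment only for the first word of an allocated struct, hence the convention; a sketch:

```go
type Span struct {
	// retainCounter is accessed via sync/atomic, so it sits first: Go
	// guarantees 64-bit alignment only for the first word of an allocated
	// struct. For an int32 that is a convention rather than a hard
	// requirement, but it keeps the field safe if ever widened to int64.
	retainCounter int32

	// ...the remaining span fields follow...
}
```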

span.go Outdated
@@ -225,18 +234,66 @@ func (s *Span) OperationName() string {
	return s.operationName
}

// Retain increments the reference counter to extend the lifetime of the object
func (s *Span) Retain(count ...int32) *Span {
Member:

is the vararg argument really necessary?

Contributor (Author):

No, it was only needed for the first commit's implementation.
Thank you, I will remove it.

span.go Outdated
	}

	if atomic.AddInt32(&s.retainCounter, counter) < 0 {
		if tr := s.tracer; tr != nil {
Member:

This looks like a race condition to me: the counter can go up after the check.

Also, why do we need to be checking for s.tracer? It seems like it's trying to compensate for some other issue. I think the returning to the pool should be based purely on the ref counting.
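
A sketch of a purely counter-based Release, acting only on the value returned by the atomic add so there is no separate check that a concurrent Retain could interleave with (the -1 threshold assumes the counter starts at 0):

```go
// Release decrements the counter with sync/atomic; only the call that
// takes it below its base value (0 here) returns the span to the pool.
// The decision uses the atomic result itself, so no later increment can
// race with a separate check.
func (s *Span) Release() {
	if atomic.AddInt32(&s.retainCounter, -1) == -1 {
		s.tracer.spanAllocator.Put(s)
	}
}
```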

span.go Outdated
@@ -174,6 +179,9 @@ func (s *Span) BaggageItem(key string) string {
}

// Finish implements opentracing.Span API
// After finishing, the Span object is returned back to the allocator
Member:

unless the reporter retains it again

	IsPool() bool
}

type spanSyncPool struct {
Member:

syncPollSpanAllocator

	return true
}

type spanSimpleAllocator struct{}
Member:

simpleSpanAllocator


func (pool spanSimpleAllocator) Put(span *Span) {
	// https://github.com/jaegertracing/jaeger-client-go/pull/381#issuecomment-475904351
	// span.reset()
Member:

not sure what the link is trying to say. I would add a comment // since finished spans are not reused, no need to reset them

type SpanAllocator interface {
	Get() *Span
	Put(*Span)
	IsPool() bool
Member:

why is this needed? It's not used in any biz logic, only in the test, which can just check for type.

Contributor (Author):

Agreed.

* TestRemoteReporterAppendWithPollAllocator => TestRemoteReporterAppendWithPoolAllocator
* Revert tests related to span allocation Retain/Release functions
* See the pull comments jaegertracing#381 (comment)
* Revert the tests to make them work by original logic
* Revert transport and tracer tests
* Fixes after code review
  jaegertracing#381 (review)

Signed-off-by: Dmitry Ponomarev <demdxx@gmail.com>
@demdxx (Contributor, Author) commented Apr 1, 2019

@yurishkuro can we merge the PR?
Without it, the following benchmark fails:

func BenchmarkSpanCommon(b *testing.B) {
	service := "DOOP"
	tracer, closer := NewTracer(service, NewConstSampler(true), NewNullReporter(), TracerOptions.PoolSpans(true))
	defer closer.Close()

	sub := func(name string, span opentracing.Span, cb func(span opentracing.Span)) {
		subSpan := tracer.StartSpan(name, opentracing.ChildOf(span.Context()))
		if cb != nil {
			cb(subSpan)
		}
		subSpan.Finish()
	}

	b.ReportAllocs()
	b.ResetTimer()
	b.RunParallel(func(bp *testing.PB) {
		for bp.Next() {
			var wg sync.WaitGroup
			span := tracer.StartSpan("base")

			wg.Add(1)
			go sub("sub1", span, func(span1 opentracing.Span) {
				span1.SetTag("level", 1)
				span1.LogEvent("enter.sub1")
				span1.SetOperationName("sub1")
				// span1.(*Span).Retain()
				go sub("sub1.1", span1, func(span11 opentracing.Span) {
					span11.SetTag("level", 2)
					// span1.(*Span).Release()
					wg.Done()
				})
			})

			wg.Add(2)
			go sub("sub2", span, func(span2 opentracing.Span) {
				span2.LogEvent("enter.sub2")
				span2.SetBaggageItem("baggage2", "v1")
				sub("sub2.1", span2, func(span21 opentracing.Span) {
					span21.SetTag("level", 2)
					wg.Done()
				})
				sub("sub2.2", span2, func(span22 opentracing.Span) {
					span22.SetTag("level", 2)
					wg.Done()
				})
			})

			wg.Add(3)
			go sub("sub3", span, func(span3 opentracing.Span) {
				span3.LogEvent("enter.sub3")
				span3.SetBaggageItem("baggage3", "v3")
				// span3.(*Span).Retain()
				go sub("sub3.1", span3, func(span31 opentracing.Span) {
					// span3.(*Span).Release()
					span31.SetTag("level", 2)
					sub("sub3.1.1", span31, func(span311 opentracing.Span) {
						span311.SetTag("level", 3)
						wg.Done()
					})
				})
				sub("sub3.2", span3, func(span32 opentracing.Span) {
					span32.LogEvent("enter.sub3.2")
					wg.Done()
				})
				sub("sub3.3", span3, func(span33 opentracing.Span) {
					span33.LogEvent("enter.sub3.3")
					wg.Done()
				})
			})
			wg.Wait()
			span.Finish()
		}
	})
}
go test -benchmem -run=^$ github.com/uber/jaeger-client-go -bench ^(BenchmarkSpanCommon)$ -v -race

goos: darwin
goarch: amd64
pkg: github.com/uber/jaeger-client-go
BenchmarkSpanCommon-4   	==================
WARNING: DATA RACE
Write at 0x00c000136110 by goroutine 33:
  github.com/uber/jaeger-client-go.(*Tracer).newSpan()
      .../github.com/uber/jaeger-client-go/tracer.go:356 +0xc0
  github.com/uber/jaeger-client-go.(*Tracer).startSpanWithOptions()
      .../github.com/uber/jaeger-client-go/tracer.go:289 +0x9da
  github.com/uber/jaeger-client-go.(*Tracer).StartSpan()
      .../github.com/uber/jaeger-client-go/tracer.go:200 +0x1d0
  github.com/uber/jaeger-client-go.BenchmarkSpanCommon.func1()
      .../github.com/uber/jaeger-client-go/span_test.go:156 +0x157

Previous read at 0x00c000136110 by goroutine 41:
  runtime.convT2I()
      /usr/local/Cellar/go/1.12/libexec/src/runtime/iface.go:386 +0x0
  github.com/uber/jaeger-client-go.(*Span).Context()
      .../github.com/uber/jaeger-client-go/span.go:204 +0x8a
  github.com/uber/jaeger-client-go.BenchmarkSpanCommon.func1()
      .../github.com/uber/jaeger-client-go/span_test.go:156 +0x75

Goroutine 33 (running) created at:
  github.com/uber/jaeger-client-go.BenchmarkSpanCommon.func2()
      .../github.com/uber/jaeger-client-go/span_test.go:184 +0x380
  testing.(*B).RunParallel.func1()
      /usr/local/Cellar/go/1.12/libexec/src/testing/benchmark.go:665 +0x160

Goroutine 41 (running) created at:
  github.com/uber/jaeger-client-go.BenchmarkSpanCommon.func2.3()
      .../github.com/uber/jaeger-client-go/span_test.go:202 +0x1aa
  github.com/uber/jaeger-client-go.BenchmarkSpanCommon.func1()
      .../github.com/uber/jaeger-client-go/span_test.go:158 +0x1a5

@yurishkuro (Member) left a comment

LGTM. Sorry for the delay.

@yurishkuro yurishkuro merged commit 896f2ab into jaegertracing:master Apr 23, 2019