Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tracing support #227

Merged
merged 25 commits into from
Nov 19, 2019
Merged

tracing support #227

merged 25 commits into from
Nov 19, 2019

Conversation

vyzo
Copy link
Collaborator

@vyzo vyzo commented Nov 4, 2019

Currently only the tracing scaffolding, but this should be enough for initial review pass.

TBD:

  • implement tracer methods with particular event types
  • implement tracing to file
  • implement tracing to peer

@vyzo vyzo requested review from raulk and Stebalien November 4, 2019 17:23
Copy link
Member

@Stebalien Stebalien left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems too "not invented here".

@Stebalien
Copy link
Member

(we discussed this on a call but I misunderstood some important context)

If we were trying to trace across machines, I'd understand maybe doing this ourselves. Given that we're just tracing events locally, this seems way over the top. Why not just use Zap?

If the answer is "performance", have you tested that? I really doubt logging is going to be a bottleneck given encrypted transports, signing, verifying, etc.

@Stebalien
Copy link
Member

Unless the concern is bandwidth. That's the one thing this buys us over a general-purpose logger like zap.

@Stebalien Stebalien dismissed their stale review November 13, 2019 00:33

(not going to block this)

trace.go Outdated

// Generic event tracer interface
type EventTracer interface {
Trace(evt interface{})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pb.Message?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we could do that to avoid the cast.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, it takes a *pb.TraceEvent now.

@Stebalien
Copy link
Member

Ok, I've discussed this with raul and looked at it a bit more. This still feels like a lot of code for something so simple but I understand why we're using protobufs (gRPC & bandwidth).

@vyzo
Copy link
Collaborator Author

vyzo commented Nov 13, 2019

Yes, we are trying to trace across machines, it's just that only the local tracer has been written so far.
Tracing to peer (and gRPC based tracing) are forthcoming.
Also note, bandwidth is a real concern here.

@Stebalien
Copy link
Member

Yes, we are trying to trace across machines

I'm talking about including trace IDs in messages. However, message IDs should give us that anyways.

@vyzo
Copy link
Collaborator Author

vyzo commented Nov 13, 2019

I'm talking about including trace IDs in messages.

The peer ID of the originating peer is included in the message.

@vyzo
Copy link
Collaborator Author

vyzo commented Nov 15, 2019

Rebased on master.

Copy link
Member

@Stebalien Stebalien left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After my initial "why are we doing this" reaction, this LGTM (modulo a few questions/nits).

}

func (fs *FloodSubRouter) AddPeer(peer.ID, protocol.ID) {}
func (fs *FloodSubRouter) AddPeer(p peer.ID, proto protocol.ID) {
fs.tracer.AddPeer(p, proto)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be in the router or in the part that calls AddPeer?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could have it in the control substrate, it just felt more natural to do it in the router.
Other than that there is no particular reason for the choice.


func (fs *FloodSubRouter) RemovePeer(peer.ID) {}
func (fs *FloodSubRouter) RemovePeer(p peer.ID) {
fs.tracer.RemovePeer(p)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same, it could be in either place, just felt more natural to do it in the router.

@@ -0,0 +1,290 @@
package pubsub
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Double checking, this is moving to a new repo?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can keep this file here, so that we have a one-stop construction of the pubsub system without requiring an external dependency.

Don't accumulate memory if the tracer is being slow or unavailable, just drop the trace and log.
@vyzo vyzo changed the title [WIP] tracing support tracing support Nov 18, 2019
tracer.go Show resolved Hide resolved
tracer.go Outdated
}

// wait a bit to accumulate a batch
time.Sleep(time.Second)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could end up with a pretty large buffer this way and/or drop messages. Is there a better way to do this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could check the size of the buffer and sleep only if the batch is smaller than a threshold.
We can also reduce the sleep time and poll a few times, say up to a second.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a check for the size of the batch, and polling every 100ms (for up to 1s) in order to accumulate buffer in safer way.

tracer.go Outdated Show resolved Hide resolved
tracer.go Outdated Show resolved Hide resolved
vyzo added 4 commits November 19, 2019 00:07
this avoids holding on memory while we are waiting.
Instead check the batch size and poll every 100ms (up to 1s) until the
minimum batch size is accumulated.
that way we don't have to connect every time we open the stream.
@vyzo vyzo requested a review from Stebalien November 18, 2019 22:31
Copy link
Member

@Stebalien Stebalien left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nits but LGTM

tracer.go Outdated
tr := &RemoteTracer{ctx: ctx, host: host, pi: pi, basicTracer: basicTracer{ch: make(chan struct{}, 1), lossy: true}}
tr := &RemoteTracer{ctx: ctx, host: host, peer: pi.ID, basicTracer: basicTracer{ch: make(chan struct{}, 1), lossy: true}}
for _, addr := range pi.Addrs {
host.Peerstore().AddAddr(pi.ID, addr, peerstore.PermanentAddrTTL)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

host.Peerstore().AddAddrs(pi.ID, pi.Addrs, ...)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(we can do them all at once)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, good point; will do.

t.mx.Unlock()
time.Sleep(100 * time.Millisecond)
goto again
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for loop?

t.mx.Lock()
for len(t.buf) < MinTraceBatchSize && time.Now().Before(deadline) { 
  t.mx.Unlock()
  time.Sleep(100 * time.Millisecond)
  t.mx.Lock()
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the alergy to gotos strikes again! Ok, I'll write it as a for loop.

@raulk
Copy link
Member

raulk commented Nov 19, 2019

Hey, would’ve liked to review this properly before it was merged to master :-( Didn’t get a chance due to other things, but I’d have appreciated a mention and a heads-up warning that the merge train was leaving.

@vyzo
Copy link
Collaborator Author

vyzo commented Nov 19, 2019

@raulk you were extremely busy, so didn't want to burden you.
Feel free to make comments about potential improvements, and we can address them.
Note that the grpc trace client will live out of tree, as we don't want a dependency on grpc for mainline.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants