Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

swarm: minimal set of metrics #1910

Closed
marten-seemann opened this issue Nov 21, 2022 · 3 comments · Fixed by #1973
Closed

swarm: minimal set of metrics #1910

marten-seemann opened this issue Nov 21, 2022 · 3 comments · Fixed by #1973
Assignees
Labels
effort/days Estimated to take multiple days, but less than a week exp/intermediate Prior experience is likely helpful P0 Critical: Tackled by core team ASAP

Comments

@marten-seemann
Copy link
Contributor

The objective is to define a minimal set of metrics that the swarm should expose. There are a lot of interesting things we could measure (and we should in the long term, see #1356), but we need to start somewhere.

I propose to define the minimal set as those metrics that allow us to measure the effect of our smarter dialing logic (#1785).

Suggested metrics:

  • conns opened (grouped by: direction, transport / security / muxer)
  • conns closed (grouped by: direction, transport / security / muxer)
  • conn duration histogram: duration between handshake completion and closing (grouped by: transport / security / muxer)
  • dial duration histogram
  • dial errors (group by: error type (timeout, cancellation, other), transport / security / muxer)

When implementing smarter dialing, we expect:

  • the dial errors to decrease dramatically (for cancelations)
  • the dial duration to just increase slightly (otherwise this dialing logic wouldn't be very smart)
  • the number of very short-lived incoming connection (as other nodes upgrade)
@marten-seemann marten-seemann added P0 Critical: Tackled by core team ASAP exp/intermediate Prior experience is likely helpful effort/days Estimated to take multiple days, but less than a week labels Nov 21, 2022
@marten-seemann marten-seemann self-assigned this Nov 21, 2022
@marten-seemann
Copy link
Contributor Author

Adding to the list of suggested metrics:

  • when closing: amount of stream data exchanged, total number of streams opened / accepted
  • canceled handshakes (on the listening side)

The last one is a tricky one, for multiple reasons:

  1. It’s not something we can learn about in the swarm package, as the transport implementation perform the entire libp2p handshake before handing a connection to the swarm. We therefore need to instrument every transport with a handshake_started event. The number of canceled handshakes is then handshake_completed - handshake_started.
  2. A connection might finish the handshake and then be canceled (there’s a race condition on the sender side here). As long as we don’t have proper close reasons (Proposal: provide error codes when closing connections and resetting streams specs#479), we have no way of definitely telling, but we can look at connections that only existed for a very short time, or even better, didn’t accept any streams

@GlenDC
Copy link

GlenDC commented Dec 8, 2022

Hi I can pick this issue up with the aim to ship it within the next 10 days.
I do have some questions that I would like to align on before putting potentially wasted dev time into this:

  1. Reporter is exposed as an interface (https://github.com/libp2p/go-libp2p/blob/master/core/metrics/reporter.go#L20-L21) but normally with Go you expose from your package concrete types, and it are the users that can (do not have to) use it as an interface.
  • that and given the fact that we're the only users and do not have multiple reporter production use cases, we might as well keep it at the single reporter use case?
  1. Right now we have
    type Stats struct {
    TotalIn int64
    TotalOut int64
    RateIn float64
    RateOut float64
    }
    , but it's very focussed on Bandwidth, so it might make sense to rename that to BandwidthStats and create a secondary ConnectionStats, which can contain the metrics requested by this issue;
  2. Combining (1) and (2) it would mean that we could rename bandwidth*.go to swarm*.go and turn that into a SwamCounter concrete type which has functions to expose BandwidthStats snapshots but also ConnectionStats.

An additional question to confirm. I suppose you mean incoming / outgoing with direction?

@marten-seemann
Copy link
Contributor Author

@GlenDC I've already started working on this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
effort/days Estimated to take multiple days, but less than a week exp/intermediate Prior experience is likely helpful P0 Critical: Tackled by core team ASAP
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants