Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(telemetry): add blackbox instrumentation to the header module + share module + p2p bandwidth metrics #1376

Closed
wants to merge 70 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
70 commits
Select commit Hold shift + click to select a range
cf06f35
feat(wip): add blackbox instrumentation
derrandz Nov 15, 2022
20a6e05
modify(fx): use decorate instead of invoke
derrandz Nov 15, 2022
c7f219e
fix: correct metric name and add random uniq id as label
derrandz Nov 15, 2022
77c3ef4
fix: make linter happy
derrandz Nov 16, 2022
9e2f8e7
feat: add blackbox instrumentation to the share service
derrandz Nov 17, 2022
4873431
refactor(share.Module): refactor service.ShareService relationship wi…
derrandz Nov 21, 2022
b7a9ce5
feat: provide service directly
derrandz Nov 21, 2022
50b3f0b
feat: add SyncerHead for latest validator network height + update das…
derrandz Nov 22, 2022
ea53b3c
refactor: extract RandString to celestiaorg/utils
derrandz Dec 12, 2022
731fad1
feat: add square size metric and fix unnecessary return
derrandz Dec 13, 2022
7acbd90
feat(wip): add blackbox instrumentation
derrandz Nov 15, 2022
e12b1b1
modify(fx): use decorate instead of invoke
derrandz Nov 15, 2022
9e43517
fix: correct metric name and add random uniq id as label
derrandz Nov 15, 2022
c49070b
fix: make linter happy
derrandz Nov 16, 2022
c355187
feat: add blackbox instrumentation to the share service
derrandz Nov 17, 2022
36bee9f
refactor(share.Module): refactor service.ShareService relationship wi…
derrandz Nov 21, 2022
2a1214e
feat: provide service directly
derrandz Nov 21, 2022
6df5d21
feat: add SyncerHead for latest validator network height + update das…
derrandz Nov 22, 2022
c626255
refactor: extract RandString to celestiaorg/utils
derrandz Dec 12, 2022
669dd25
feat: add square size metric and fix unnecessary return
derrandz Dec 13, 2022
37b4aba
refactor: adapt the blackbox proxy pattern to proxy the getter interface
derrandz Jan 10, 2023
900773f
Merge branch 'blackbox-getter' into blackbox-metrics
derrandz Jan 10, 2023
fc0eaaf
fix: remove relics from the past
derrandz Jan 10, 2023
677c260
test: add test doubles to test the proxying pattern
derrandz Jan 10, 2023
b436e5f
Merge branch 'main' into blackbox-metrics
derrandz Jan 11, 2023
3b59ccf
wip: adding swamp test; blocked by keyring problem on mac
derrandz Jan 11, 2023
c45a60a
test: skipped failing test cases (on mac)
derrandz Jan 15, 2023
78135ec
feat: add p2p bandwidth metrics using opentelemetry primitives
derrandz Jan 15, 2023
802fc8b
ci: make linter happy
derrandz Jan 16, 2023
6014c3d
Merge branch 'main' into blackbox-metrics
derrandz Jan 16, 2023
326baae
refactor: simplify getter proxying
derrandz Jan 16, 2023
6a46c5d
refactor: switch eds to become a histogram
derrandz Jan 16, 2023
aa54481
feat: use labels for square size metric
derrandz Jan 17, 2023
f78e787
chore: cleanup and add docs
derrandz Jan 17, 2023
7959dc5
Merge branch 'main' into blackbox-metrics
derrandz Jan 17, 2023
1a936dd
feat: add metrics for bandwidth per peer
derrandz Jan 19, 2023
08518dd
chore: add SyncerHeader method to header interface to retrieve more t…
derrandz Jan 19, 2023
88ba11c
Merge branch 'main' into blackbox-metrics
derrandz Jan 20, 2023
9025cee
ci: make linter happy
derrandz Jan 20, 2023
ca92182
doc: update nodebuilder/header/header.go from @Wondertan
derrandz Feb 1, 2023
1446352
nit: rename metrics struct field lastHeadTS (from @distractedm1nd)
derrandz Feb 1, 2023
4e4a334
nit: rename metric var nane (@distractedm1nd)
derrandz Feb 1, 2023
33e18af
pr(first pass): address review comments
derrandz Feb 2, 2023
1af715c
pr: address ryan's comments
derrandz Feb 8, 2023
008ee2e
fix(core): Fix modified grpc testing util to not be flakey/fail -- th…
renaynay Feb 2, 2023
e8e5e09
fix(core/testing): revert change to grpc server, and instead wait til…
renaynay Feb 2, 2023
0f66046
fix(nodebuilder/tests/swamp): Sleep for 50 ms before starting to fill…
renaynay Feb 2, 2023
afa705d
Merge branch 'main' into blackbox-metrics
derrandz Feb 8, 2023
9664666
refactor: remove unnecessary by peer metrics, and add peerCount metric
derrandz Feb 8, 2023
df68ff2
pr: address ryan's comments
derrandz Feb 8, 2023
08e40b1
fix(core): Fix modified grpc testing util to not be flakey/fail -- th…
renaynay Feb 2, 2023
98ff6fd
fix(core/testing): revert change to grpc server, and instead wait til…
renaynay Feb 2, 2023
9513399
fix(nodebuilder/tests/swamp): Sleep for 50 ms before starting to fill…
renaynay Feb 2, 2023
2b647e5
Merge branch 'main' into blackbox-metrics
derrandz Feb 8, 2023
b75ebf2
fix(core): Fix modified grpc testing util to not be flakey/fail -- th…
renaynay Feb 2, 2023
e959de7
fix(core/testing): revert change to grpc server, and instead wait til…
renaynay Feb 2, 2023
8446602
fix(nodebuilder/tests/swamp): Sleep for 50 ms before starting to fill…
renaynay Feb 2, 2023
81ee795
Merge branch 'main' into blackbox-metrics
derrandz Feb 8, 2023
d368579
fix: minor import issue
derrandz Feb 8, 2023
f996fce
Merge branch 'main' into blackbox-metrics
derrandz Feb 14, 2023
1409d53
feat: add get_headers requests metrics
derrandz Feb 14, 2023
c10a17b
dep: use latest celestiaorg/utils instead of commit hash
derrandz Feb 14, 2023
e0ee3b9
Merge branch 'main' into blackbox-metrics
derrandz Feb 17, 2023
0adaeff
git: rebase and disable header metrics for now
derrandz Feb 17, 2023
d57e7b2
Revert "feat: add get_headers requests metrics"
derrandz Feb 20, 2023
0993628
feat: implement metrics for exchange using simpler approach
derrandz Feb 20, 2023
b6fb9b6
Merge branch 'main' into blackbox-metrics
derrandz Feb 20, 2023
8f94b6f
feat: implement metrics for exchange using simpler approach
derrandz Feb 20, 2023
4c1fa3b
Merge branch 'main' into blackbox-metrics
derrandz Feb 20, 2023
531abfb
fix: remove bestHead metric
derrandz Feb 20, 2023
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -20,3 +20,4 @@ vendor
/cel-key
coverage.txt
go.work
nodebuilder/keyring-test/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it happens, it means some test broke and writes keyring data on disk. It does not hurt to have it though to prevent accidental push of this test data.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New line

1 change: 0 additions & 1 deletion core/testing_grpc.go
Original file line number Diff line number Diff line change
Expand Up @@ -109,6 +109,5 @@ func startGRPCServer(
log.Error("serving GRPC: ", err)
}
}()

return grpcSrv, nil
}
2 changes: 2 additions & 0 deletions das/metrics.go
Original file line number Diff line number Diff line change
Expand Up @@ -141,9 +141,11 @@ func (m *metrics) observeSample(
if m == nil {
return
}

m.sampleTime.Record(ctx, sampleTime.Seconds(),
attribute.Bool("failed", err != nil),
attribute.Int("header_width", len(h.DAH.RowsRoots)),
attribute.Int("header", int(h.RawHeader.Height)),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If header high used as attribute, it will be evaluated as vector dimension in the prometheus. Header hight will have very high cardinality and will totally kill prometheus performance. Thats the primary reason I've not added it as an attribute.

Please consider using it as a metric value(Gauge) if it is a must for your observation or use an aggregated value for observation like das_sampled_chain_head.
Please

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, @rootulp was advocating long ago to keep our metrics with low cardinality and not to overwhelm our incentivized testnet OTEL collector and Prometheus behind it.

)

m.sampled.Add(ctx, 1,
Expand Down
1 change: 1 addition & 0 deletions go.mod
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,7 @@ require (
github.com/celestiaorg/go-libp2p-messenger v0.1.0
github.com/celestiaorg/nmt v0.14.0
github.com/celestiaorg/rsmt2d v0.8.0
github.com/celestiaorg/utils v0.1.0
github.com/cosmos/cosmos-sdk v0.46.7
github.com/cosmos/cosmos-sdk/api v0.1.0
github.com/cristalhq/jwt v1.2.0
Expand Down
2 changes: 2 additions & 0 deletions go.sum
Original file line number Diff line number Diff line change
Expand Up @@ -221,6 +221,8 @@ github.com/celestiaorg/quantum-gravity-bridge v1.3.0 h1:9zPIp7w1FWfkPnn16y3S4FpF
github.com/celestiaorg/quantum-gravity-bridge v1.3.0/go.mod h1:6WOajINTDEUXpSj5UZzod16UZ96ZVB/rFNKyM+Mt1gI=
github.com/celestiaorg/rsmt2d v0.8.0 h1:ZUxTCELZCM9zMGKNF3cT+rUqMddXMeiuyleSJPZ3Wn4=
github.com/celestiaorg/rsmt2d v0.8.0/go.mod h1:hhlsTi6G3+X5jOP/8Lb/d7i5y2XNFmnyMddYbFSmrgo=
github.com/celestiaorg/utils v0.1.0 h1:WsP3O8jF7jKRgLNFmlDCwdThwOFMFxg0MnqhkLFVxPo=
github.com/celestiaorg/utils v0.1.0/go.mod h1:vQTh7MHnvpIeCQZ2/Ph+w7K1R2UerDheZbgJEJD2hSU=
github.com/cenkalti/backoff v2.2.1+incompatible h1:tNowT99t7UNflLxfYYSlKYsBpXdEet03Pg2g16Swow4=
github.com/cenkalti/backoff v2.2.1+incompatible/go.mod h1:90ReRw6GdpyfrHakVjL/QHaoyV4aDUVVkXQJJJ3NXXM=
github.com/cenkalti/backoff/v4 v4.1.1/go.mod h1:scbssz8iZGpm3xbr14ovlUdkxfGXNInqkPWOWmG2CLw=
Expand Down
6 changes: 5 additions & 1 deletion libs/header/p2p/exchange.go
Original file line number Diff line number Diff line change
Expand Up @@ -33,6 +33,8 @@ type Exchange[H header.Header] struct {
peerTracker *peerTracker

Params ClientParameters

metrics *metrics
}

func NewExchange[H header.Header](
Expand Down Expand Up @@ -251,7 +253,9 @@ func (ex *Exchange[H]) request(
req *p2p_pb.HeaderRequest,
) ([]H, error) {
log.Debugw("requesting peer", "peer", to)
responses, _, _, err := sendMessage(ctx, ex.host, to, ex.protocolID, req)
responses, size, duration, err := sendMessage(ctx, ex.host, to, ex.protocolID, req)
ex.metrics.ObserveRequest(ctx, size, duration)

if err != nil {
log.Debugw("err sending request", "peer", to, "err", err)
return nil, err
Expand Down
57 changes: 57 additions & 0 deletions libs/header/p2p/metrics.go
Original file line number Diff line number Diff line change
@@ -0,0 +1,57 @@
package p2p

import (
"context"

"go.opentelemetry.io/otel/metric/global"
"go.opentelemetry.io/otel/metric/instrument"
"go.opentelemetry.io/otel/metric/instrument/syncfloat64"
)

type metrics struct {
responseSize syncfloat64.Histogram
responseDuration syncfloat64.Histogram
}

var (
meter = global.MeterProvider().Meter("libs/header/p2p")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The meter naming is inconsistent.

)

func (ex *Exchange[H]) RegisterMetrics() error {
responseSize, err := meter.
SyncFloat64().
Histogram(
"libhead_get_headers_response_size",
instrument.WithDescription("Size of get headers response in bytes"),
)
if err != nil {
panic(err)
}

responseDuration, err := meter.
SyncFloat64().
Histogram(
"libhead_get_headers_request_duration",
instrument.WithDescription("Duration of get headers request in seconds"),
)
if err != nil {
return err
}

m := &metrics{
responseSize: responseSize,
responseDuration: responseDuration,
}

ex.metrics = m

return nil
}

func (m *metrics) ObserveRequest(ctx context.Context, size uint64, duration uint64) {
if m == nil {
return
}
m.responseSize.Record(ctx, float64(size))
m.responseDuration.Record(ctx, float64(duration))
}
3 changes: 3 additions & 0 deletions nodebuilder/header/header.go
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,9 @@ type Module interface {
Head(context.Context) (*header.ExtendedHeader, error)
// IsSyncing returns the status of sync
IsSyncing(context.Context) bool

// SyncerHead returns the highest known ExtendedHeader of the network.
SyncerHead(context.Context) (*header.ExtendedHeader, error)
}

// API is a wrapper around Module for the RPC.
Expand Down
20 changes: 20 additions & 0 deletions nodebuilder/header/opts.go
Original file line number Diff line number Diff line change
@@ -0,0 +1,20 @@
package header

import (
header "github.com/celestiaorg/celestia-node/header"
libhead "github.com/celestiaorg/celestia-node/libs/header"
p2p "github.com/celestiaorg/celestia-node/libs/header/p2p"
)

// WithMetrics provides sets `MetricsEnabled` to true on ClientParameters for the header exchange
func WithMetrics(ex libhead.Exchange[*header.ExtendedHeader]) error {
exchange, ok := (ex).(*p2p.Exchange[*header.ExtendedHeader])
if !ok {
// not all implementations of libhead.Exchange[*header.ExtendedHeader]
// are p2p.Exchange[*header.ExtendedHeader
// thus we need to avoid panicking here for when
// ex is of another base type
return nil
}
return exchange.RegisterMetrics()
}
9 changes: 7 additions & 2 deletions nodebuilder/header/service.go
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,7 @@ import (
"github.com/celestiaorg/celestia-node/libs/header/sync"
)

// Service represents the header Service that can be started / stopped on a node.
// Service represents the header service that can be started / stopped on a node.
// Service's main function is to manage its sub-services. Service can contain several
// sub-services, such as Exchange, ExchangeServer, Syncer, and so forth.
type Service struct {
Expand Down Expand Up @@ -45,6 +45,11 @@ func (s *Service) Head(ctx context.Context) (*header.ExtendedHeader, error) {
return s.store.Head(ctx)
}

func (s *Service) IsSyncing(context.Context) bool {
func (s *Service) IsSyncing(ctx context.Context) bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unneeded change in the ctx

return !s.syncer.State().Finished()
}

// SyncerHead returns the ExtendedHeader of the chain head from the validator network.
func (s *Service) SyncerHead(ctx context.Context) (*header.ExtendedHeader, error) {
return s.syncer.Head(ctx)
}
82 changes: 82 additions & 0 deletions nodebuilder/p2p/metrics.go
Original file line number Diff line number Diff line change
@@ -0,0 +1,82 @@
package p2p

import (
"context"

"github.com/libp2p/go-libp2p/core/metrics"
"go.opentelemetry.io/otel/metric/global"
"go.opentelemetry.io/otel/metric/instrument"
"go.opentelemetry.io/otel/metric/unit"
)

// global meter provider (see opentelemetry docs)
var (
meter = global.MeterProvider().Meter("p2p")
)

// WithMetrics option sets up metrics for p2p networking.
func WithMetrics(bc *metrics.BandwidthCounter) error {
bandwidthTotalInbound, err := meter.
SyncInt64().
Histogram(
"p2p_bandwidth_total_inbound",
instrument.WithUnit(unit.Bytes),
instrument.WithDescription("total number of bytes received by the host"),
)
if err != nil {
return err
}
bandwidthTotalOutbound, _ := meter.
SyncInt64().
Histogram(
"p2p_bandwidth_total_outbound",
instrument.WithUnit(unit.Bytes),
instrument.WithDescription("total number of bytes sent by the host"),
)
if err != nil {
return err
}
bandwidthRateInbound, _ := meter.
SyncFloat64().
Histogram(
"p2p_bandwidth_rate_inbound",
instrument.WithDescription("total number of bytes sent by the host"),
)
if err != nil {
return err
}
bandwidthRateOutbound, _ := meter.
SyncFloat64().
Histogram(
"p2p_bandwidth_rate_outbound",
instrument.WithDescription("total number of bytes sent by the host"),
)
if err != nil {
return err
}
p2pPeerCount, _ := meter.
AsyncFloat64().
Gauge(
"p2p_peer_count",
instrument.WithDescription("number of peers connected to the host"),
)
if err != nil {
return err
}

return meter.RegisterCallback(
[]instrument.Asynchronous{
p2pPeerCount,
}, func(ctx context.Context) {
bcStats := bc.GetBandwidthTotals()
bcByPeerStats := bc.GetBandwidthByPeer()

bandwidthTotalInbound.Record(ctx, bcStats.TotalIn)
bandwidthTotalOutbound.Record(ctx, bcStats.TotalOut)
bandwidthRateInbound.Record(ctx, bcStats.RateIn)
bandwidthRateOutbound.Record(ctx, bcStats.RateOut)

p2pPeerCount.Observe(ctx, float64(len(bcByPeerStats)))
},
)
}
16 changes: 15 additions & 1 deletion nodebuilder/settings.go
Original file line number Diff line number Diff line change
Expand Up @@ -15,10 +15,12 @@ import (

fraudPkg "github.com/celestiaorg/celestia-node/fraud"
headerPkg "github.com/celestiaorg/celestia-node/header"
header "github.com/celestiaorg/celestia-node/nodebuilder/header"

"github.com/celestiaorg/celestia-node/nodebuilder/das"
"github.com/celestiaorg/celestia-node/nodebuilder/node"
"github.com/celestiaorg/celestia-node/nodebuilder/p2p"
sharePkg "github.com/celestiaorg/celestia-node/nodebuilder/share"
"github.com/celestiaorg/celestia-node/state"
)

Expand All @@ -39,10 +41,12 @@ func WithMetrics(metricOpts []otlpmetrichttp.Option, nodeType node.Type) fx.Opti
baseComponents := fx.Options(
fx.Supply(metricOpts),
fx.Invoke(initializeMetrics),
fx.Invoke(headerPkg.WithMetrics),
fx.Invoke(state.WithMetrics),
fx.Invoke(p2p.WithMetrics),
fx.Invoke(fraudPkg.WithMetrics),
fx.Invoke(node.WithMetrics),
fx.Invoke(headerPkg.WithMetrics),
fx.Invoke(header.WithMetrics),
)

var opts fx.Option
Expand All @@ -64,6 +68,16 @@ func WithMetrics(metricOpts []otlpmetrichttp.Option, nodeType node.Type) fx.Opti
return opts
}

// WithBlackboxMetrics enables blackbox metrics for the node.
// Blackbox metrics are metrics that are recorded for the node's components
// through a proxy that records metrics for the node's components
// on each method call.
func WithBlackboxMetrics() fx.Option {
return fx.Options(
fx.Decorate(sharePkg.WithBlackBoxMetrics),
)
}

// initializeMetrics initializes the global meter provider.
func initializeMetrics(
ctx context.Context,
Expand Down
Loading