
feat(telemetry): add blackbox instrumentation to the header module + share module + p2p bandwidth metrics #1376

Closed
derrandz wants to merge 70 commits into main from the blackbox-metrics branch

Conversation


@derrandz derrandz commented Nov 16, 2022

Overview

To enable blackbox instrumentation for our benchmarking tests, we need to enrich the celestia-node codebase with blackbox metrics: measurements that the benchmark will rely on to compute the final results.

In this PR, we ship blackbox instrumentation for the HeaderService + ShareService in a well-defined pattern that relies on:

  1. Proxying the existing service
  2. Decorating the injected service with the proxied service (fx)

This pattern allows blackbox instrumentation to be:

  1. Optional
  2. Enabled by using WithBlackBoxMetrics (see the sketch below)
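
For illustration, here is a minimal, hypothetical sketch of the pattern. The interface, type, and metric names below are placeholders rather than the exact ones shipped in this PR: the real service is provided as usual, and WithBlackBoxMetrics wraps it with a metering proxy via fx.Decorate, so instrumentation stays opt-in.

```go
// blackbox_sketch.go — illustrative proxy + fx.Decorate pattern (placeholder names).
package main

import (
	"context"
	"fmt"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
	"go.uber.org/fx"
)

// Getter stands in for a service interface such as the header module's getter.
type Getter interface {
	GetByHeight(ctx context.Context, height uint64) (string, error)
}

// plainGetter is the "real" service that fx provides by default.
type plainGetter struct{}

func (plainGetter) GetByHeight(_ context.Context, height uint64) (string, error) {
	return fmt.Sprintf("header-%d", height), nil
}

// instrumentedGetter proxies Getter and records a blackbox metric around each call.
type instrumentedGetter struct {
	next     Getter
	requests metric.Int64Counter
}

func newInstrumentedGetter(next Getter) (Getter, error) {
	meter := otel.Meter("nodebuilder/header")
	requests, err := meter.Int64Counter("hdr_get_by_height_total")
	if err != nil {
		return nil, err
	}
	return &instrumentedGetter{next: next, requests: requests}, nil
}

func (g *instrumentedGetter) GetByHeight(ctx context.Context, height uint64) (string, error) {
	h, err := g.next.GetByHeight(ctx, height)
	g.requests.Add(ctx, 1, metric.WithAttributes(attribute.Bool("failed", err != nil)))
	return h, err
}

// WithBlackBoxMetrics decorates the already-provided Getter with the proxy.
// Omitting this option leaves the plain, uninstrumented service in place.
func WithBlackBoxMetrics() fx.Option {
	return fx.Decorate(newInstrumentedGetter)
}

func main() {
	app := fx.New(
		fx.Provide(func() Getter { return plainGetter{} }),
		WithBlackBoxMetrics(), // optional
		fx.Invoke(func(g Getter) {
			h, _ := g.GetByHeight(context.Background(), 42)
			fmt.Println(h)
		}),
	)
	if err := app.Err(); err != nil {
		panic(err)
	}
}
```

With this shape, blackboxInstrument for the HeaderService/ShareService is just another decorator registered by the option, and the production wiring stays untouched when the option is not supplied.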

Changes

  • Implement blackboxInstrument for the HeaderService
  • Implement blackboxInstrument for the ShareService
  • Add libp2p bandwidth metrics using OpenTelemetry (see the sketch below)
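
On the bandwidth side, here is a rough sketch of what wiring libp2p's bandwidth counter into OpenTelemetry can look like, assuming a recent go-libp2p and otel-go; the metric names and the registration point are illustrative and not necessarily what this PR does.

```go
// bandwidth_sketch.go — illustrative export of libp2p bandwidth totals via OTEL gauges.
package main

import (
	"context"

	"github.com/libp2p/go-libp2p"
	libp2pmetrics "github.com/libp2p/go-libp2p/core/metrics"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/metric"
)

func main() {
	// The bandwidth counter is handed to the host; libp2p updates it on every read/write.
	bwc := libp2pmetrics.NewBandwidthCounter()
	host, err := libp2p.New(libp2p.BandwidthReporter(bwc))
	if err != nil {
		panic(err)
	}
	defer host.Close()

	// Assumes an OTEL metrics SDK/exporter has been installed as the global MeterProvider.
	meter := otel.Meter("p2p/bandwidth")
	totalIn, _ := meter.Int64ObservableGauge("p2p_bandwidth_total_in_bytes")
	totalOut, _ := meter.Int64ObservableGauge("p2p_bandwidth_total_out_bytes")

	// The callback runs on each collection cycle, so the counter is sampled lazily
	// and no extra goroutine is needed.
	if _, err := meter.RegisterCallback(func(_ context.Context, o metric.Observer) error {
		stats := bwc.GetBandwidthTotals()
		o.ObserveInt64(totalIn, stats.TotalIn)
		o.ObserveInt64(totalOut, stats.TotalOut)
		return nil
	}, totalIn, totalOut); err != nil {
		panic(err)
	}

	select {} // keep the node alive so the collector can keep scraping
}
```

The counter also exposes per-protocol and per-peer stats (GetBandwidthByProtocol / GetBandwidthForPeer) if finer-grained metrics are wanted, at the cost of higher label cardinality.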

Checklist

  • New and updated code has appropriate documentation
  • New and updated code has new and/or updated testing
  • Required CI checks are passing
  • Visual proof for any user facing features like CLI or documentation updates
  • Linked issues closed with keywords

@derrandz derrandz self-assigned this Nov 16, 2022
@derrandz derrandz added the area:header (Extended header) and area:metrics (Related to measuring/collecting node metrics) labels Nov 16, 2022

codecov-commenter commented Nov 17, 2022

Codecov Report

Merging #1376 (0adaeff) into main (dd44463) will decrease coverage by 0.56%.
The diff coverage is 37.06%.

@@            Coverage Diff             @@
##             main    #1376      +/-   ##
==========================================
- Coverage   57.27%   56.72%   -0.56%     
==========================================
  Files         239      246       +7     
  Lines       15760    16158     +398     
==========================================
+ Hits         9027     9166     +139     
- Misses       5802     6049     +247     
- Partials      931      943      +12     
Impacted Files Coverage Δ
core/testing_grpc.go 65.07% <ø> (ø)
das/metrics.go 46.66% <0.00%> (-0.40%) ⬇️
libs/header/p2p/exchange_metrics.go 0.00% <0.00%> (ø)
libs/header/p2p/helpers.go 64.58% <ø> (ø)
libs/header/p2p/options.go 63.41% <ø> (ø)
nodebuilder/header/header.go 33.33% <ø> (ø)
nodebuilder/header/metrics.go 0.00% <0.00%> (ø)
nodebuilder/header/opts.go 0.00% <0.00%> (ø)
nodebuilder/header/service.go 50.00% <0.00%> (-7.15%) ⬇️
nodebuilder/share/module.go 91.11% <ø> (ø)
... and 14 more


@derrandz derrandz force-pushed the blackbox-metrics branch 2 times, most recently from 3b717a5 to 5569d14 on November 22, 2022 at 17:26
 m.sampleTime.Record(ctx, sampleTime.Seconds(),
 	attribute.Bool("failed", err != nil),
-	attribute.Int("header_width", len(h.DAH.RowsRoots)))
+	attribute.Int("header_width", len(h.DAH.RowsRoots)),
+	attribute.Int("header", int(h.RawHeader.Height)),
Member commented:

If header height is used as an attribute, it will be evaluated as a vector dimension in Prometheus. Header height has very high cardinality and will completely kill Prometheus performance. That's the primary reason I did not add it as an attribute.

Please consider recording it as a metric value (Gauge) if it is a must for your observation, or use an aggregated value such as das_sampled_chain_head.
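
For illustration, a minimal sketch of the gauge approach: an observable gauge fed from the last sampled height. The struct and wiring below are hypothetical; only the das_sampled_chain_head name comes from the suggestion above.

```go
// Illustrative sketch, not the actual das/metrics.go: report the sampled head
// as a gauge value instead of an attribute.
package das

import (
	"context"
	"sync/atomic"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/metric"
)

type sampleMetrics struct {
	lastSampledHeight atomic.Int64
}

func newSampleMetrics() (*sampleMetrics, error) {
	m := &sampleMetrics{}
	meter := otel.Meter("das")

	head, err := meter.Int64ObservableGauge("das_sampled_chain_head")
	if err != nil {
		return nil, err
	}
	// The callback reports the latest height on every collection cycle, so the
	// height is a value rather than a label and the series cardinality stays at 1.
	_, err = meter.RegisterCallback(func(_ context.Context, o metric.Observer) error {
		o.ObserveInt64(head, m.lastSampledHeight.Load())
		return nil
	}, head)
	return m, err
}

// observeSample is called after each sampling operation instead of attaching
// the header height as an attribute on sampleTime.
func (m *sampleMetrics) observeSample(height uint64) {
	m.lastSampledHeight.Store(int64(height))
}
```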

Member commented:

Also, @rootulp was advocating long ago for keeping our metrics low-cardinality so that we don't overwhelm our incentivized testnet OTEL collector and the Prometheus instance behind it.

@Wondertan Wondertan (Member) left a comment:

General Q:

A proxy for each module to meter each method is great. I am wondering if we can automate this proxy creation somehow, or do we always have some manual things to track? For example, the filecoin lotus team has an interesting reflection-based proxy, which automagically meters the time a call takes and counts how many times the call was made. It's also technically possible to get the size of a method's returned values if we need to. Q: Is this enough for the goal, or do we still need some manual meters?
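
For reference, a minimal sketch of that kind of reflection-based proxy, assuming (as in lotus) that the API surface is a struct of function-typed fields. The struct and method here are made up, and a real proxy would feed OTEL instruments rather than printing.

```go
// reflectproxy_sketch.go — illustrative lotus-style metering proxy (made-up API struct).
package main

import (
	"context"
	"fmt"
	"reflect"
	"time"
)

// api stands in for an RPC/API struct whose fields are functions.
type api struct {
	GetByHeight func(ctx context.Context, height uint64) (string, error)
}

// meterProxy replaces every func field of *target with a wrapper that times the
// call and then delegates to the original function.
func meterProxy(target interface{}) {
	v := reflect.ValueOf(target).Elem()
	t := v.Type()
	for i := 0; i < v.NumField(); i++ {
		field := v.Field(i)
		if field.Kind() != reflect.Func || field.IsNil() {
			continue
		}
		name := t.Field(i).Name
		orig := reflect.ValueOf(field.Interface()) // copy of the original func value
		wrapped := reflect.MakeFunc(field.Type(), func(args []reflect.Value) []reflect.Value {
			start := time.Now()
			out := orig.Call(args)
			// A real proxy would record a counter and a duration histogram here instead.
			fmt.Printf("call=%s duration=%s\n", name, time.Since(start))
			return out
		})
		field.Set(wrapped)
	}
}

func main() {
	a := &api{
		GetByHeight: func(_ context.Context, h uint64) (string, error) {
			return fmt.Sprintf("header-%d", h), nil
		},
	}
	meterProxy(a)

	hdr, _ := a.GetByHeight(context.Background(), 42)
	fmt.Println(hdr)
}
```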

Review

  • It seems the protobuf files were regenerated unintentionally, so we should revert this.
  • This PR is based on a branch where, by mistake, the node uses a real Keyring that generates real keys, and they ended up in the diff. Please rebase and remove those.
  • I need to look deeper into the motivation for the refactorings in the share service, so this is definitely not the last review.

nodebuilder/header/metrics.go (outdated; resolved)
)

// retrieve the binary format to get the size of the header
// TODO(@team): is ExtendedHeader.MarshalBinary() == ResponseSize? I am making this assumption for now
Member commented:

Unfortunately not. The ExtendedHeader size varies, and we can only calculate its maximum size by knowing the maximum validator set size and the maximum block size.


@derrandz derrandz changed the title from "[Benchmarking Telemetry]: Adding Blackbox Instrumentation to the HeaderService + ShareService" to "feat(telemetry): add blackbox instrumentation to the HeaderService + ShareService" Nov 28, 2022
@derrandz derrandz modified the milestone: 2023 Q1 Onsite Dec 5, 2022

derrandz commented Dec 7, 2022

@Wondertan thank you for the wonderful review! Sorry it took me forever to get to it; I've been sick.

Q: Do you think it's worth it to automate this proxy creation?

The lotus team's example is great, but in our use case we don't want to meter call counts. We may want to meter call times, but only in specific contexts relevant to our use case, for example: DASing time, GetShare time, or GetByHeight time.

So I think it is more valuable to stick with manual metering for now, to stay conscious of exactly what is being tracked, until a use case for lotus-style metering shows up.

Let me know if you have any thoughts.

@derrandz derrandz force-pushed the blackbox-metrics branch 4 times, most recently from 4d59a1c to 7f496ce on December 7, 2022 at 15:00
@derrandz (Contributor, Author) commented:

Consensus has been reached through offline discussion: we will add blackbox instrumentation only to the parts we require as we go; there is no need to automate its creation at the moment.

go.mod (outdated; resolved)
@derrandz derrandz requested a review from Wondertan February 14, 2023 16:51
@Wondertan Wondertan (Member) left a comment:

One big ask: we should split this PR into multiple smaller PRs and start with the Exchange metrics, which are mostly LGTM.

@@ -37,6 +37,7 @@ Steps:
 5. Check that a FN can retrieve shares from 1 to 20 blocks
 */
 func TestFullReconstructFromBridge(t *testing.T) {
+	t.Skip("Skipping TestFullReconstructFromBridge until acc not found issue is resolved on Mac")
Member commented:

It's resolved now; remove the skips.

@@ -20,3 +20,4 @@ vendor
 /cel-key
 coverage.txt
 go.work
+nodebuilder/keyring-test/
Member commented:

Add a trailing newline.

@@ -45,6 +45,11 @@ func (s *Service) Head(ctx context.Context) (*header.ExtendedHeader, error) {
 	return s.store.Head(ctx)
 }

-func (s *Service) IsSyncing(context.Context) bool {
+func (s *Service) IsSyncing(ctx context.Context) bool {
Member commented:

Unneeded change to the ctx parameter.

}

var (
	meter = global.MeterProvider().Meter("libs/header/p2p")
Member commented:

The meter naming is inconsistent.


@derrandz derrandz closed this Feb 21, 2023
Labels
area:header (Extended header), area:metrics (Related to measuring/collecting node metrics), kind:feat (Attached to feature PRs)
7 participants