
feat(telemetry): add blackbox instrumentation to the header module + share module + p2p bandwidth metrics #1376

Closed
derrandz wants to merge 70 commits into main from the blackbox-metrics branch

Conversation


@derrandz derrandz commented Nov 16, 2022

Overview

To enable blackbox instrumentation for our benchmarking tests, we need to enrich the celestia-node codebase with blackbox metrics: measurements that the benchmark will rely on to compute the final results.

In this PR, we ship blackbox instrumentation for the HeaderService + ShareService in a well-defined pattern that relies on:

  1. Proxying the existing service
  2. Decorating the injected service with the proxied service (fx)

This pattern allows blackbox instrumentation to be:

  1. Optional
  2. Enabled by using WithBlackBoxMetrics (see the sketch below)
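
For illustration, here is a minimal, hypothetical sketch of the pattern. The interface, type, and metric names below are placeholders rather than the exact ones shipped in this PR: the real service is provided as usual, and WithBlackBoxMetrics wraps it with a metering proxy via fx.Decorate, so instrumentation stays opt-in.

```go
// blackbox_sketch.go — illustrative proxy + fx.Decorate pattern (placeholder names).
package main

import (
	"context"
	"fmt"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
	"go.uber.org/fx"
)

// Getter stands in for a service interface such as the header module's getter.
type Getter interface {
	GetByHeight(ctx context.Context, height uint64) (string, error)
}

// plainGetter is the "real" service that fx provides by default.
type plainGetter struct{}

func (plainGetter) GetByHeight(_ context.Context, height uint64) (string, error) {
	return fmt.Sprintf("header-%d", height), nil
}

// instrumentedGetter proxies Getter and records a blackbox metric around each call.
type instrumentedGetter struct {
	next     Getter
	requests metric.Int64Counter
}

func newInstrumentedGetter(next Getter) (Getter, error) {
	meter := otel.Meter("nodebuilder/header")
	requests, err := meter.Int64Counter("hdr_get_by_height_total")
	if err != nil {
		return nil, err
	}
	return &instrumentedGetter{next: next, requests: requests}, nil
}

func (g *instrumentedGetter) GetByHeight(ctx context.Context, height uint64) (string, error) {
	h, err := g.next.GetByHeight(ctx, height)
	g.requests.Add(ctx, 1, metric.WithAttributes(attribute.Bool("failed", err != nil)))
	return h, err
}

// WithBlackBoxMetrics decorates the already-provided Getter with the proxy.
// Omitting this option leaves the plain, uninstrumented service in place.
func WithBlackBoxMetrics() fx.Option {
	return fx.Decorate(newInstrumentedGetter)
}

func main() {
	app := fx.New(
		fx.Provide(func() Getter { return plainGetter{} }),
		WithBlackBoxMetrics(), // optional
		fx.Invoke(func(g Getter) {
			h, _ := g.GetByHeight(context.Background(), 42)
			fmt.Println(h)
		}),
	)
	if err := app.Err(); err != nil {
		panic(err)
	}
}
```

With this shape, blackboxInstrument for the HeaderService/ShareService is just another decorator registered by the option, and the production wiring stays untouched when the option is not supplied.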

Changes

  • Implement blackboxInstrument for the HeaderService
  • Implement blackboxInstrument for the ShareService
  • Add libp2p bandwidth metrics using OpenTelemetry (see the sketch below)
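
On the bandwidth side, here is a rough sketch of what wiring libp2p's bandwidth counter into OpenTelemetry can look like, assuming a recent go-libp2p and otel-go; the metric names and the registration point are illustrative and not necessarily what this PR does.

```go
// bandwidth_sketch.go — illustrative export of libp2p bandwidth totals via OTEL gauges.
package main

import (
	"context"

	"github.com/libp2p/go-libp2p"
	libp2pmetrics "github.com/libp2p/go-libp2p/core/metrics"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/metric"
)

func main() {
	// The bandwidth counter is handed to the host; libp2p updates it on every read/write.
	bwc := libp2pmetrics.NewBandwidthCounter()
	host, err := libp2p.New(libp2p.BandwidthReporter(bwc))
	if err != nil {
		panic(err)
	}
	defer host.Close()

	// Assumes an OTEL metrics SDK/exporter has been installed as the global MeterProvider.
	meter := otel.Meter("p2p/bandwidth")
	totalIn, _ := meter.Int64ObservableGauge("p2p_bandwidth_total_in_bytes")
	totalOut, _ := meter.Int64ObservableGauge("p2p_bandwidth_total_out_bytes")

	// The callback runs on each collection cycle, so the counter is sampled lazily
	// and no extra goroutine is needed.
	if _, err := meter.RegisterCallback(func(_ context.Context, o metric.Observer) error {
		stats := bwc.GetBandwidthTotals()
		o.ObserveInt64(totalIn, stats.TotalIn)
		o.ObserveInt64(totalOut, stats.TotalOut)
		return nil
	}, totalIn, totalOut); err != nil {
		panic(err)
	}

	select {} // keep the node alive so the collector can keep scraping
}
```

The counter also exposes per-protocol and per-peer stats (GetBandwidthByProtocol / GetBandwidthForPeer) if finer-grained metrics are wanted, at the cost of higher label cardinality.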

Checklist

  • New and updated code has appropriate documentation
  • New and updated code has new and/or updated testing
  • Required CI checks are passing
  • Visual proof for any user facing features like CLI or documentation updates
  • Linked issues closed with keywords

@derrandz derrandz self-assigned this Nov 16, 2022
@derrandz derrandz added the area:header (Extended header) and area:metrics (Related to measuring/collecting node metrics) labels Nov 16, 2022

codecov-commenter commented Nov 17, 2022

Codecov Report

Merging #1376 (0adaeff) into main (dd44463) will decrease coverage by 0.56%.
The diff coverage is 37.06%.

@@            Coverage Diff             @@
##             main    #1376      +/-   ##
==========================================
- Coverage   57.27%   56.72%   -0.56%     
==========================================
  Files         239      246       +7     
  Lines       15760    16158     +398     
==========================================
+ Hits         9027     9166     +139     
- Misses       5802     6049     +247     
- Partials      931      943      +12     
Impacted Files Coverage Δ
core/testing_grpc.go 65.07% <ø> (ø)
das/metrics.go 46.66% <0.00%> (-0.40%) ⬇️
libs/header/p2p/exchange_metrics.go 0.00% <0.00%> (ø)
libs/header/p2p/helpers.go 64.58% <ø> (ø)
libs/header/p2p/options.go 63.41% <ø> (ø)
nodebuilder/header/header.go 33.33% <ø> (ø)
nodebuilder/header/metrics.go 0.00% <0.00%> (ø)
nodebuilder/header/opts.go 0.00% <0.00%> (ø)
nodebuilder/header/service.go 50.00% <0.00%> (-7.15%) ⬇️
nodebuilder/share/module.go 91.11% <ø> (ø)
... and 14 more


@derrandz derrandz force-pushed the blackbox-metrics branch 2 times, most recently from 3b717a5 to 5569d14 on November 22, 2022 at 17:26
 m.sampleTime.Record(ctx, sampleTime.Seconds(),
 	attribute.Bool("failed", err != nil),
-	attribute.Int("header_width", len(h.DAH.RowsRoots)))
+	attribute.Int("header_width", len(h.DAH.RowsRoots)),
+	attribute.Int("header", int(h.RawHeader.Height)),
Member commented:

If header height is used as an attribute, it will be evaluated as a vector dimension in Prometheus. Header height has very high cardinality and will completely kill Prometheus performance. That's the primary reason I did not add it as an attribute.

Please consider recording it as a metric value (Gauge) if it is a must for your observation, or use an aggregated value such as das_sampled_chain_head.
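
For illustration, a minimal sketch of the gauge approach: an observable gauge fed from the last sampled height. The struct and wiring below are hypothetical; only the das_sampled_chain_head name comes from the suggestion above.

```go
// Illustrative sketch, not the actual das/metrics.go: report the sampled head
// as a gauge value instead of an attribute.
package das

import (
	"context"
	"sync/atomic"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/metric"
)

type sampleMetrics struct {
	lastSampledHeight atomic.Int64
}

func newSampleMetrics() (*sampleMetrics, error) {
	m := &sampleMetrics{}
	meter := otel.Meter("das")

	head, err := meter.Int64ObservableGauge("das_sampled_chain_head")
	if err != nil {
		return nil, err
	}
	// The callback reports the latest height on every collection cycle, so the
	// height is a value rather than a label and the series cardinality stays at 1.
	_, err = meter.RegisterCallback(func(_ context.Context, o metric.Observer) error {
		o.ObserveInt64(head, m.lastSampledHeight.Load())
		return nil
	}, head)
	return m, err
}

// observeSample is called after each sampling operation instead of attaching
// the header height as an attribute on sampleTime.
func (m *sampleMetrics) observeSample(height uint64) {
	m.lastSampledHeight.Store(int64(height))
}
```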

Member commented:

Also, @rootulp was advocating long ago for keeping our metrics low-cardinality so that we don't overwhelm our incentivized testnet OTEL collector and the Prometheus instance behind it.

@Wondertan Wondertan (Member) left a comment:

General Q:

A proxy for each module to meter each method is great. I am wondering if we can automate this proxy creation somehow, or do we always have some manual things to track? For example, the filecoin lotus team has an interesting reflection-based proxy, which automagically meters the time a call takes and counts how many times the call was made. It's also technically possible to get the size of a method's returned values if we need to. Q: Is this enough for the goal, or do we still need some manual meters?
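
For reference, a minimal sketch of that kind of reflection-based proxy, assuming (as in lotus) that the API surface is a struct of function-typed fields. The struct and method here are made up, and a real proxy would feed OTEL instruments rather than printing.

```go
// reflectproxy_sketch.go — illustrative lotus-style metering proxy (made-up API struct).
package main

import (
	"context"
	"fmt"
	"reflect"
	"time"
)

// api stands in for an RPC/API struct whose fields are functions.
type api struct {
	GetByHeight func(ctx context.Context, height uint64) (string, error)
}

// meterProxy replaces every func field of *target with a wrapper that times the
// call and then delegates to the original function.
func meterProxy(target interface{}) {
	v := reflect.ValueOf(target).Elem()
	t := v.Type()
	for i := 0; i < v.NumField(); i++ {
		field := v.Field(i)
		if field.Kind() != reflect.Func || field.IsNil() {
			continue
		}
		name := t.Field(i).Name
		orig := reflect.ValueOf(field.Interface()) // copy of the original func value
		wrapped := reflect.MakeFunc(field.Type(), func(args []reflect.Value) []reflect.Value {
			start := time.Now()
			out := orig.Call(args)
			// A real proxy would record a counter and a duration histogram here instead.
			fmt.Printf("call=%s duration=%s\n", name, time.Since(start))
			return out
		})
		field.Set(wrapped)
	}
}

func main() {
	a := &api{
		GetByHeight: func(_ context.Context, h uint64) (string, error) {
			return fmt.Sprintf("header-%d", h), nil
		},
	}
	meterProxy(a)

	hdr, _ := a.GetByHeight(context.Background(), 42)
	fmt.Println(hdr)
}
```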

Review

  • It seems the protobuf files were regenerated unintentionally, so we should revert this.
  • This PR is based on a branch where, by mistake, the node uses a real Keyring that generates real keys, and they ended up in the diff. Please rebase and remove those.
  • I need to look deeper into the motivation for the refactorings in the share service, so this is definitely not the last review.

nodebuilder/header/metrics.go (outdated; resolved)
)

// retrieve the binary format to get the size of the header
// TODO(@team): is ExtendedHeader.MarshalBinary() == ResponseSize? I am making this assumption for now
Member commented:

Unfortunately not. The ExtendedHeader size varies, and we can only calculate its maximum size by knowing the maximum validator set size and the maximum block size.


@derrandz derrandz changed the title from "[Benchmarking Telemetry]: Adding Blackbox Instrumentation to the HeaderService + ShareService" to "feat(telemetry): add blackbox instrumentation to the HeaderService + ShareService" Nov 28, 2022
@derrandz derrandz modified the milestone: 2023 Q1 Onsite Dec 5, 2022

derrandz commented Dec 7, 2022

@Wondertan thank you for the wonderful review! Sorry it took me forever to get to it; I've been sick.

Q: Do you think it's worth it to automate this proxy creation?

The lotus team's example is great, but in our use case we don't want to meter call counts. We may want to meter call times, but only in specific contexts relevant to our use case, for example: DASing time, GetShare time, or GetByHeight time.

So I think it is more valuable to stick with manual metering for now, to stay conscious of exactly what is being tracked, until a use case for lotus-style metering shows up.

Let me know if you have any thoughts.

@derrandz derrandz force-pushed the blackbox-metrics branch 4 times, most recently from 4d59a1c to 7f496ce on December 7, 2022 at 15:00
@derrandz (Contributor, Author) commented:

Consensus has been reached through offline discussion: we will add blackbox instrumentation only to the parts we require as we go; there is no need to automate its creation at the moment.

go.mod (outdated; resolved)
@derrandz derrandz requested a review from Wondertan February 14, 2023 16:51
@Wondertan Wondertan (Member) left a comment:

One big ask: we should split this PR into multiple smaller PRs and start with the Exchange metrics, which are mostly LGTM.

@@ -37,6 +37,7 @@ Steps:
 5. Check that a FN can retrieve shares from 1 to 20 blocks
 */
 func TestFullReconstructFromBridge(t *testing.T) {
+	t.Skip("Skipping TestFullReconstructFromBridge until acc not found issue is resolved on Mac")
Member commented:

It's resolved now; remove the skips.

@@ -20,3 +20,4 @@ vendor
 /cel-key
 coverage.txt
 go.work
+nodebuilder/keyring-test/
Member commented:

Add a trailing newline.

@@ -45,6 +45,11 @@ func (s *Service) Head(ctx context.Context) (*header.ExtendedHeader, error) {
 	return s.store.Head(ctx)
 }

-func (s *Service) IsSyncing(context.Context) bool {
+func (s *Service) IsSyncing(ctx context.Context) bool {
Member commented:

Unneeded change to the ctx parameter.

}

var (
	meter = global.MeterProvider().Meter("libs/header/p2p")
Member commented:

The meter naming is inconsistent.


@derrandz derrandz closed this Feb 21, 2023
Labels
area:header (Extended header), area:metrics (Related to measuring/collecting node metrics), kind:feat (Attached to feature PRs)
7 participants