feat(telemetry): add blackbox instrumentation to the header module + share module + p2p bandwidth metrics #1376
Conversation
Codecov Report
@@ Coverage Diff @@
## main #1376 +/- ##
==========================================
- Coverage 57.27% 56.72% -0.56%
==========================================
Files 239 246 +7
Lines 15760 16158 +398
==========================================
+ Hits 9027 9166 +139
- Misses 5802 6049 +247
- Partials 931 943 +12
Force-pushed from 3b717a5 to 5569d14.
 m.sampleTime.Record(ctx, sampleTime.Seconds(),
 	attribute.Bool("failed", err != nil),
-	attribute.Int("header_width", len(h.DAH.RowsRoots)))
+	attribute.Int("header_width", len(h.DAH.RowsRoots)),
+	attribute.Int("header", int(h.RawHeader.Height)),
If the header height is used as an attribute, it will be treated as a vector dimension in Prometheus. Header height has very high cardinality and would completely kill Prometheus performance; that's the primary reason I did not add it as an attribute.
Please consider recording it as a metric value (Gauge) if it is a must for your observation, or use an aggregated value such as das_sampled_chain_head.
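For illustration only (not code from this PR), a minimal sketch of the suggestion above: report the sampled head as a single gauge value such as das_sampled_chain_head instead of attaching the height as an attribute. It assumes a recent go.opentelemetry.io/otel metric API; the meter name, instrument name, and field names are illustrative.

```go
package das

import (
	"context"
	"sync/atomic"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/metric"
)

// metrics keeps the latest sampled height in memory and exposes it as an
// observable gauge, so the height is a metric *value*, never a label.
type metrics struct {
	sampledChainHead atomic.Int64
}

func newMetrics() (*metrics, error) {
	m := &metrics{}
	meter := otel.Meter("das") // meter name is illustrative

	head, err := meter.Int64ObservableGauge("das_sampled_chain_head",
		metric.WithDescription("height of the latest sampled header"))
	if err != nil {
		return nil, err
	}
	// The callback reports one time series regardless of chain height,
	// so Prometheus cardinality stays constant.
	_, err = meter.RegisterCallback(func(_ context.Context, o metric.Observer) error {
		o.ObserveInt64(head, m.sampledChainHead.Load())
		return nil
	}, head)
	return m, err
}

// observeSample would be called after each sampling round (sketch).
func (m *metrics) observeSample(height int64) {
	m.sampledChainHead.Store(height)
}
```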
Also, @rootulp has long been advocating for keeping our metrics low-cardinality so that we don't overwhelm our incentivized testnet OTEL collector and the Prometheus behind it.
General Q:
A proxy for each module to meter each method is great. I am wondering if we can automate this proxy creation somehow, or do we always have some manual things to track? For example, the Filecoin Lotus team has an interesting reflection-based proxy that automagically meters how long a call takes and counts how many times it was made. Technically it could also capture the size of the method's returned values if we need that. Q: Is this enough for the goal, or do we still need some manual meters?
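For reference, a rough sketch of how such a reflection-based proxy can work in Go. This is not the Lotus implementation, only an illustration of the idea; MeterProxy, HeaderAPI, and GetByHeight here are hypothetical names.

```go
package main

import (
	"fmt"
	"reflect"
	"time"
)

// MeterProxy replaces every exported func field of the struct pointed to by api
// with a wrapper that times the call and reports it via record.
func MeterProxy(api interface{}, record func(method string, took time.Duration)) {
	v := reflect.ValueOf(api).Elem()
	t := v.Type()

	for i := 0; i < v.NumField(); i++ {
		field := v.Field(i)
		if field.Kind() != reflect.Func || field.IsNil() || !field.CanSet() {
			continue
		}
		method := t.Field(i).Name
		// copy the original func before overwriting the field,
		// otherwise the wrapper would call itself
		orig := reflect.ValueOf(field.Interface())

		field.Set(reflect.MakeFunc(field.Type(), func(args []reflect.Value) []reflect.Value {
			start := time.Now()
			out := orig.Call(args)
			record(method, time.Since(start)) // one place to count calls and record latency
			return out
		}))
	}
}

// Example with a hypothetical API struct made of func fields.
type HeaderAPI struct {
	GetByHeight func(height uint64) (string, error)
}

func main() {
	api := &HeaderAPI{GetByHeight: func(h uint64) (string, error) {
		return fmt.Sprintf("header %d", h), nil
	}}
	MeterProxy(api, func(method string, took time.Duration) {
		fmt.Printf("%s took %s\n", method, took)
	})
	_, _ = api.GetByHeight(42) // now automatically timed and counted
}
```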
Review
- It seems like the protobuf files were regenerated. That looks unintentional, so we should revert it.
- This PR is based on a branch where, by mistake, the node uses a real Keyring that generates real keys, and those got into the diff. Please rebase and remove them.
- I need to look deeper into the motivation for the refactorings in the share service, so this is definitely not the last review.
nodebuilder/header/metrics.go (outdated)
)

// retrieve the binary format to get the size of the header
// TODO(@team): is ExtendedHeader.MarshalBinary() == ResponseSize? I am making this assumption for now
Unfortunately not. The ExtendedHeader size varies, and we can only calculate its maximum size by knowing the maximum valset size and the maximum block size.
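If the real size ever matters, one option (again only a sketch, assuming a recent otel-go metric API; the instrument name and helper names are illustrative) is to record the marshaled size per response rather than assuming a fixed one:

```go
package header

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/metric"
)

// marshaler is satisfied by ExtendedHeader, which already implements MarshalBinary.
type marshaler interface {
	MarshalBinary() ([]byte, error)
}

func newResponseSizeHist() (metric.Int64Histogram, error) {
	return otel.Meter("header").Int64Histogram("hdr_response_size",
		metric.WithDescription("marshaled ExtendedHeader size in bytes"),
		metric.WithUnit("By"))
}

// recordResponseSize records the actual wire size, which varies with the
// validator set and the DAH width, instead of a guessed constant.
func recordResponseSize(ctx context.Context, h marshaler, hist metric.Int64Histogram) {
	bin, err := h.MarshalBinary()
	if err != nil {
		return // skip the observation rather than record a wrong size
	}
	hist.Record(ctx, int64(len(bin)))
}
```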
@Wondertan thank you for the wonderful review! Sorry it took me forever to get to it, I've been sick. Q: Do you think it's worth it to automate this proxy creation? The Lotus team's example is great, but in our use case we wouldn't want to meter call counts; we want to meter call times, and only in specific contexts relevant to our use case, for example DASing time, GetShare time, or GetByHeight time. So I think it would be of greater value to stick to manual metering for now, to stay conscious of what is being tracked individually, until a use case for Lotus-style metering shows up. Let me know if you have any thoughts.
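To illustrate the manual approach described above (a sketch only, not code from this PR; it assumes a recent otel-go API, and the wrapper and field names are made up), metering a single call such as GetByHeight could look like:

```go
package header

import (
	"context"
	"time"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"

	"github.com/celestiaorg/celestia-node/header"
)

// Getter stands in for the part of the module we choose to meter.
type Getter interface {
	GetByHeight(ctx context.Context, height uint64) (*header.ExtendedHeader, error)
}

// meteredGetter wraps only the calls we care about, with low-cardinality
// attributes (a bool, never the height itself).
type meteredGetter struct {
	next        Getter
	getByHeight metric.Float64Histogram
}

func (m *meteredGetter) GetByHeight(ctx context.Context, height uint64) (*header.ExtendedHeader, error) {
	start := time.Now()
	h, err := m.next.GetByHeight(ctx, height)
	m.getByHeight.Record(ctx, time.Since(start).Seconds(),
		metric.WithAttributes(attribute.Bool("failed", err != nil)))
	return h, err
}
```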
Force-pushed from 4d59a1c to 7f496ce.
Consensus has been reached through offline discussion: we will add black-box instrumentation only to the parts we require as we go, so there is no need to automate its creation at the moment.
Force-pushed from 7f496ce to d6a342c.
Force-pushed from ae94706 to 2a96aa2.
…th share.Availability to enforce using service.ShareService for all Get operations
…is issue was observed mainly on macos
… height 1 in swamp tests
This reverts commit 1409d53.
One big ask. We should split this PR into multiple smaller PRs and start with Exchange metrics, which is mostly LGTM
@@ -37,6 +37,7 @@ Steps:
 5. Check that a FN can retrieve shares from 1 to 20 blocks
 */
 func TestFullReconstructFromBridge(t *testing.T) {
+	t.Skip("Skipping TestFullReconstructFromBridge until acc not found issue is resolved on Mac")
It's resolved now; remove the skips.
@@ -20,3 +20,4 @@ vendor
 /cel-key
 coverage.txt
 go.work
+nodebuilder/keyring-test/
Nit: add a trailing newline at the end of the file.
@@ -45,6 +45,11 @@ func (s *Service) Head(ctx context.Context) (*header.ExtendedHeader, error) {
 	return s.store.Head(ctx)
 }

-func (s *Service) IsSyncing(context.Context) bool {
+func (s *Service) IsSyncing(ctx context.Context) bool {
Unneeded change to the ctx parameter.
}

var (
	meter = global.MeterProvider().Meter("libs/header/p2p")
The meter naming is inconsistent.
Overview
To enable blackbox instrumentation for our benchmarking tests, we need to enrich the celestia-node codebase with blackbox metrics: the measurements that the benchmark will rely on to compute its final results.
In this PR, we ship blackbox instrumentation for the HeaderService + ShareService in a well-defined pattern that relies on:
- fx

This pattern will allow blackbox instrumentation to be enabled via:
- WithBlackBoxMetrics (see the sketch below)
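Purely to illustrate the pattern (this is not the PR's code: apart from WithBlackBoxMetrics and blackboxInstrument, which are named in this PR, the constructor and option names below are invented, and the real wiring in celestia-node may differ), an fx-based module could decorate its service with the instrumented proxy only when the option is set:

```go
package header

import "go.uber.org/fx"

// Module is a stand-in for the header module's public interface (sketch).
type Module interface{ /* GetByHeight, Head, ... */ }

// Option follows the functional-option style; settings only tracks whether
// blackbox metrics were requested.
type Option func(*settings)

type settings struct{ blackboxMetrics bool }

// WithBlackBoxMetrics enables blackbox instrumentation for this module.
func WithBlackBoxMetrics() Option {
	return func(s *settings) { s.blackboxMetrics = true }
}

// ConstructModule assembles the fx options; when metrics are on, the Module is
// decorated with the metering proxy, so callers keep using the same interface.
func ConstructModule(opts ...Option) fx.Option {
	s := &settings{}
	for _, opt := range opts {
		opt(s)
	}

	base := fx.Options(
	// fx.Provide(newService), // real constructors elided in this sketch
	)
	if !s.blackboxMetrics {
		return base
	}
	return fx.Options(
		base,
		fx.Decorate(func(m Module) (Module, error) {
			return newBlackBoxInstrument(m) // wrap every metered call
		}),
	)
}

// newBlackBoxInstrument returns a Module proxy that records metrics around the
// calls we care about (body elided in this sketch).
func newBlackBoxInstrument(next Module) (Module, error) { return next, nil }
```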
Changes
- blackboxInstrument for the HeaderService
- blackboxInstrument for the ShareService
Checklist