Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FNS polish - metrics, max msg size, default flag values #2517

Merged
merged 3 commits into from
Oct 18, 2024

Conversation

jonfung-dydx
Copy link
Contributor

@jonfung-dydx jonfung-dydx commented Oct 18, 2024

Porting over changes from https://github.com/dydxprotocol/v4-chain/commits/feat/full-node-streaming/ feature branch

  1. metrics, remove dead code, tag some metrics by subscription id
  2. max message size upped
  3. default flag values upped since more traffic from subaccounts + taker orders

Summary by CodeRabbit

  • New Features

    • Increased default gRPC streaming batch size and channel buffer size to enhance performance.
    • Introduced new metrics for tracking subscription updates and removed obsolete metrics.
  • Improvements

    • Enhanced subscription management with better validation and logging.
    • Improved handling of streaming updates and metrics emission for better reliability.
  • Bug Fixes

    • Adjusted metrics handling to ensure accurate tracking of streaming updates.
  • Chores

    • Updated test cases to align with new configuration standards.

@jonfung-dydx jonfung-dydx requested a review from a team as a code owner October 18, 2024 16:29
@jonfung-dydx jonfung-dydx requested a review from jayy04 October 18, 2024 16:29
Copy link
Contributor

coderabbitai bot commented Oct 18, 2024

Walkthrough

This pull request introduces several changes across multiple files related to gRPC streaming configuration and metrics management. Key modifications include updating default values for gRPC streaming constants, enhancing the FullNodeStreamingManagerImpl class with improved metric handling and subscription management, and adjusting test cases to reflect these changes. Additionally, a new maximum message size is set for WebSocket connections. These updates aim to refine the functionality and maintainability of the streaming and metrics components.

Changes

File Path Change Summary
protocol/app/flags/flags.go, Updated constants DefaultGrpcStreamingMaxBatchSize and DefaultGrpcStreamingMaxChannelBufferSize from 2000 to 10000.
protocol/app/flags/flags_test.go Modified test cases in TestValidate and TestGetFlagValuesFromOptions to reflect updated gRPC streaming constants.
protocol/lib/metrics/metric_keys.go Removed GrpcSendSubaccountSnapshotLatency, added GrpcSendSubaccountUpdateCount and SubscriptionId.
protocol/streaming/full_node_streaming_manager.go Enhanced EmitMetrics, updated Subscribe method for better logging, improved update handling, and refined internal logic for metrics emission.
protocol/streaming/ws/websocket_server.go Set maximum message size for WebSocket connections to 10 megabytes.

Possibly related PRs

Suggested labels

protocol, indexer, proto

Suggested reviewers

  • jayy04
  • dydxwill

🐇 In the meadow, changes bloom,
Constants rise, dispelling gloom.
Metrics dance, subscriptions sing,
WebSocket limits, a new spring.
With every hop, our code does thrive,
In this patch, our dreams arrive! 🌼


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🧹 Outside diff range and nitpick comments (5)
protocol/streaming/ws/websocket_server.go (1)

59-61: Approve with suggestions: Consider making the message size limit configurable

The addition of a maximum message size limit for WebSocket connections is a good security practice. It helps prevent potential denial-of-service attacks and improves overall system stability.

However, consider the following improvements:

  1. Make the limit configurable, perhaps through an environment variable or configuration file. This would allow for easier adjustments without code changes.

  2. Add a comment explaining the rationale behind the 10 MB limit.

Example implementation:

const defaultMaxMessageSize = 10 * 1024 * 1024 // 10 MB

// MaxMessageSize can be set through an environment variable
var MaxMessageSize = getEnvInt("WS_MAX_MESSAGE_SIZE", defaultMaxMessageSize)

// In the Handler method:
// Set maximum message size to prevent potential DoS attacks
conn.SetReadLimit(int64(MaxMessageSize))
protocol/lib/metrics/metric_keys.go (1)

88-88: Approved: New SubscriptionId metric key

The addition of SubscriptionId as a metric key is a valuable improvement. It aligns perfectly with the PR objective of tagging metrics by subscription ID, which will enable more granular analysis of system behavior.

Consider renaming the constant to SubscriptionID (with uppercase "ID") to be consistent with common Go naming conventions for acronyms.

protocol/app/flags/flags_test.go (1)

122-124: LGTM: Consistent update of gRPC streaming parameters with optimistic execution.

The increase in GrpcStreamingMaxBatchSize and GrpcStreamingMaxChannelBufferSize from 2000 to 10000 is consistently applied here, demonstrating compatibility with optimistic execution.

There's a minor typo in the test case name: "canbe" should be "can be". Consider updating it to improve readability:

-		"success - optimistic execution canbe  enabled with gRPC streaming": {
+		"success - optimistic execution can be enabled with gRPC streaming": {
protocol/streaming/full_node_streaming_manager.go (2)

723-723: Monitor buffer state with appropriate metric emission frequency

While emitting metrics after adding updates to the cache is useful for monitoring, ensure that the frequency of sm.EmitMetrics() calls does not introduce performance overhead, especially if updates are added to the cache very frequently.


761-761: Prevent excessive metric emissions during error handling

In RemoveSubscriptionsAndClearBufferIfFull, calling sm.EmitMetrics() immediately after clearing the buffer and subscriptions may not be necessary, as the buffer is now empty. Evaluate whether this metric emission provides additional value or if it can be omitted to reduce overhead.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 66d7a5d and 030f952.

📒 Files selected for processing (5)
  • protocol/app/flags/flags.go (1 hunks)
  • protocol/app/flags/flags_test.go (4 hunks)
  • protocol/lib/metrics/metric_keys.go (2 hunks)
  • protocol/streaming/full_node_streaming_manager.go (9 hunks)
  • protocol/streaming/ws/websocket_server.go (1 hunks)
🧰 Additional context used
🔇 Additional comments (12)
protocol/lib/metrics/metric_keys.go (2)

76-76: Approved: Improved metric for subaccount updates

The change from GrpcSendSubaccountSnapshotLatency to GrpcSendSubaccountUpdateCount is a good improvement. This new metric will provide more actionable data on the frequency of subaccount updates, which aligns well with the PR objective of enhancing metrics. It will help in monitoring system activity more effectively, especially in light of the mentioned increase in traffic from subaccounts.


Line range hint 1-94: Summary: Effective metrics enhancements

The changes to this file effectively support the PR objectives by:

  1. Shifting focus from latency to count for subaccount updates, providing more actionable data.
  2. Introducing a subscription ID metric, enabling more granular analysis.

These modifications will help in better monitoring and understanding system behavior, especially in light of the increased traffic from subaccounts mentioned in the PR description.

protocol/app/flags/flags.go (3)

75-76: Summary: Increased gRPC streaming capacity

The changes to DefaultGrpcStreamingMaxBatchSize and DefaultGrpcStreamingMaxChannelBufferSize are consistent and align with the PR objectives to handle increased traffic. These modifications should improve the system's ability to handle higher loads, particularly from subaccounts and taker orders.

Key points:

  1. Both values increased from 2000 to 10000, providing a 5x capacity increase.
  2. The changes are isolated to default values and don't require modifications to other parts of the code.
  3. The Validate function remains valid as it only checks for positive values.

Recommendations:

  1. Conduct thorough performance testing to ensure the system can handle the increased capacity effectively.
  2. Monitor system resource usage, particularly memory consumption and CPU utilization, after deployment.
  3. Consider implementing adaptive scaling mechanisms for these values in future iterations if traffic patterns are highly variable.

76-76: Approved: Increased DefaultGrpcStreamingMaxChannelBufferSize

The increase from 2000 to 10000 for DefaultGrpcStreamingMaxChannelBufferSize is in line with the PR objectives to handle increased traffic. This change should improve the system's ability to handle bursts of messages without dropping them.

However, please consider the following:

  1. Monitor memory usage, as larger channel buffers will increase memory consumption.
  2. Keep an eye on message processing delays, as larger queues might lead to increased latency.
  3. Ensure that the system can effectively process messages at this increased rate to prevent buffer overflow.

To verify the impact of this change and ensure system stability, consider running the following commands:


75-75: Approved: Increased DefaultGrpcStreamingMaxBatchSize

The increase from 2000 to 10000 for DefaultGrpcStreamingMaxBatchSize aligns with the PR objectives to handle increased traffic. This change should improve throughput for gRPC streaming.

However, please consider the following:

  1. Monitor memory usage, as larger batches may increase memory consumption.
  2. Keep an eye on latency for processing individual messages within a batch.
  3. Ensure that clients can handle larger batch sizes effectively.

To verify the impact of this change, consider running the following commands:

protocol/app/flags/flags_test.go (4)

92-93: LGTM: Updated gRPC streaming parameters align with PR objectives.

The increase in GrpcStreamingMaxBatchSize and GrpcStreamingMaxChannelBufferSize from 2000 to 10000 is consistent with the PR's goal of updating default flag values to handle increased traffic from subaccounts and taker orders. This change should allow for processing larger batches and maintaining larger buffer sizes in gRPC streaming.


103-104: LGTM: Consistent update of gRPC streaming parameters.

The increase in GrpcStreamingMaxBatchSize and GrpcStreamingMaxChannelBufferSize from 2000 to 10000 is consistently applied here, maintaining parity with the gRPC-only configuration. This ensures that the updated values are used regardless of whether WebSocket streaming is enabled alongside gRPC streaming.


260-261: LGTM: Updated default values for gRPC streaming parameters.

The increase in the expected default values for GrpcStreamingMaxBatchSize and GrpcStreamingMaxChannelBufferSize from 2000 to 10000 ensures consistency with the actual implementation. This change aligns with the PR's objective of updating default flag values to handle increased traffic.


Line range hint 1-380: Summary: Consistent updates to gRPC streaming parameters in test cases.

The changes in this file consistently update the GrpcStreamingMaxBatchSize and GrpcStreamingMaxChannelBufferSize values from 2000 to 10000 across various test cases. These updates align well with the PR's objective of adjusting default flag values to handle increased traffic from subaccounts and taker orders. The changes are applied consistently in different scenarios, including gRPC-only streaming, combined gRPC and WebSocket streaming, and when optimistic execution is enabled.

A minor typo was found in one test case name, which has been noted for correction. Overall, these changes should improve the system's ability to handle larger batches and maintain larger buffer sizes in gRPC streaming, contributing to better performance under increased load.

protocol/streaming/full_node_streaming_manager.go (3)

377-381: Ensure metric increments align with monitoring objectives

In the sendStreamUpdates method, you're incrementing metrics.GrpcAddToSubscriptionChannelCount each time updates are attempted to be sent to a subscription's channel. Verify that this metric provides meaningful insights and aligns with your monitoring goals, without introducing significant overhead.


745-745: Check buffer overflow handling logic

After adding subaccount updates to the cache, you're calling sm.RemoveSubscriptionsAndClearBufferIfFull(). Verify that the conditions for buffer overflow are correctly set and that this function effectively prevents buffer overflows without unnecessarily disconnecting subscribers.


238-241: Confirm label correctness for metrics

When incrementing metrics.GrpcSendResponseToSubscriberCount, ensure that the labels applied correctly represent the subscription ID. Accurate labeling is crucial for effective metric aggregation and monitoring.

Run the following script to verify that metric labels are correctly applied:

Comment on lines +386 to +387
// Buffer is full. Emit metric and drop subscription.
sm.EmitMetrics()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Avoid emitting metrics inside tight loops or error paths

Calling sm.EmitMetrics() inside the default case when the subscription channel is full could lead to performance issues, especially under high load. Emitting metrics is an I/O operation and might be resource-intensive if called frequently. Consider removing this call or ensuring it's throttled appropriately.

Comment on lines +797 to +800
metrics.IncrCounterWithLabels(
metrics.GrpcAddToSubscriptionChannelCount,
1,
metrics.GetLabelForIntValue(metrics.SubscriptionId, int(id)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Avoid duplicate metric increments to prevent misleading data

You're incrementing metrics.GrpcAddToSubscriptionChannelCount in both sendStreamUpdates and FlushStreamUpdatesWithLock. This could result in counting the same event twice, leading to inaccurate metrics. Decide on a single place to increment this counter to ensure the metric reflects the actual number of events accurately.

Comment on lines +805 to +806
// Buffer is full. Emit metric and drop subscription.
sm.EmitMetrics()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Limit metric emissions in high-frequency error scenarios

Similar to previous comments, calling sm.EmitMetrics() when a subscription's channel is full within a select default case can lead to performance degradation under high load. Consider removing this call or implementing a rate-limiting mechanism to prevent excessive metric emissions.

Comment on lines +412 to +416
metrics.IncrCounter(
metrics.GrpcSendSubaccountUpdateCount,
1,
)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Add labels to GrpcSendSubaccountUpdateCount for better granularity

Currently, metrics.GrpcSendSubaccountUpdateCount is incremented without any labels. Adding labels such as the subaccount ID can provide more granular insights into metric data, which can be valuable for debugging and monitoring.

Apply this diff to include the subaccount ID as a label:

metrics.IncrCounter(
-   metrics.GrpcSendSubaccountUpdateCount,
-   1,
+   metrics.GrpcSendSubaccountUpdateCount,
+   1,
+   metrics.GetLabelForStringValue(metrics.SubaccountId, subaccountUpdate.SubaccountId.String()),
)

Committable suggestion was skipped due to low confidence.

Comment on lines +156 to 158
metrics.AddSample(
metrics.GrpcStreamNumUpdatesBuffered,
float32(len(sm.streamUpdateCache)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Use SetGauge for gauge metrics instead of AddSample

In the EmitMetrics method, you replaced metrics.SetGauge with metrics.AddSample for the GrpcStreamNumUpdatesBuffered metric. Since this metric represents the current number of updates buffered, which is a gauge (a value that can go up or down), it would be more appropriate to continue using metrics.SetGauge to accurately reflect the current state rather than accumulating samples.

Apply this diff to revert to using SetGauge:

- metrics.AddSample(
+ metrics.SetGauge(
    metrics.GrpcStreamNumUpdatesBuffered,
    float32(len(sm.streamUpdateCache)),
)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
metrics.AddSample(
metrics.GrpcStreamNumUpdatesBuffered,
float32(len(sm.streamUpdateCache)),
metrics.SetGauge(
metrics.GrpcStreamNumUpdatesBuffered,
float32(len(sm.streamUpdateCache)),

Comment on lines +165 to +168
metrics.AddSampleWithLabels(
metrics.GrpcSubscriptionChannelLength,
float32(len(subscription.updatesChannel)),
metrics.GetLabelForIntValue(metrics.SubscriptionId, int(subscription.subscriptionId)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Use SetGaugeWithLabels for current values

For the GrpcSubscriptionChannelLength metric, you're now using metrics.AddSampleWithLabels. If the intention is to record the current length of each subscription's update channel (which is a gauge), consider using metrics.SetGaugeWithLabels to set the value directly. This provides an accurate snapshot of each channel's length at the time of metric emission.

Apply this diff to use SetGaugeWithLabels:

- metrics.AddSampleWithLabels(
+ metrics.SetGaugeWithLabels(
    metrics.GrpcSubscriptionChannelLength,
    float32(len(subscription.updatesChannel)),
    metrics.GetLabelForIntValue(metrics.SubscriptionId, int(subscription.subscriptionId)),
)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
metrics.AddSampleWithLabels(
metrics.GrpcSubscriptionChannelLength,
float32(len(subscription.updatesChannel)),
metrics.GetLabelForIntValue(metrics.SubscriptionId, int(subscription.subscriptionId)),
metrics.SetGaugeWithLabels(
metrics.GrpcSubscriptionChannelLength,
float32(len(subscription.updatesChannel)),
metrics.GetLabelForIntValue(metrics.SubscriptionId, int(subscription.subscriptionId)),

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

Successfully merging this pull request may close these issues.

2 participants