[Das Benchmarks]: Stress test 1 Full or Bridge Node against x Light Nodes #89

Merged
merged 44 commits into from
Feb 8, 2023


derrandz
Contributor

@derrandz derrandz commented Oct 10, 2022

Important

Merge when celestiaorg/celestia-node#1376 is merged and go.mod is corrected to point to celestia-node main

Overview

We want to benchmark the bridge node against groups of light nodes of increasing size, starting at (say) 100 and going up to 100K.

To do so, this PR adds test cases and local telemetry, along with metrics collection so the benchmark results can be visualized.

More details in #79

Changes

  • Adds documentation for the new test plan 002

  • Adds first test case

  • Adds local compositions

  • Display how many light clients are currently connected to the bridge node (concurrency level)

  • And finally, since the charts are rough, we will work on smoothing them out (cleaning up labels, improving readability, and so on)

  • Adds k8s compositions?

  • Add matrix for 64/128 square sizes (Check comment here)

    • 64
    • 128
  • Add resource consumption metrics for the bridge node to analyze how # of light nodes impacts resource consumption

  • Add comparative charts for results across multiple concurrency levels:
    example: DASing time for (100 light clients, 500 light clients, 1000 light clients) all in the same chart for comparison.
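
A minimal sketch of how the matrix above could be enumerated, assuming one benchmark run per (square size, light node count) pair. The square sizes and light counts come from the list above; the `benchmark_matrix` function name and the `square=… lights=…` label format are hypothetical and do not correspond to composition names in this repository.

```shell
# Hypothetical helper: print one line per cell of the benchmark matrix.
# Square sizes (64, 128) and light node counts (100, 500, 1000) are taken
# from the TODO list; the label format is an assumption for illustration.
benchmark_matrix() {
  for square in 64 128; do
    for lights in 100 500 1000; do
      echo "square=${square} lights=${lights}"
    done
  done
}

benchmark_matrix
```

Each printed line would map to one composition, so results for the same square size can be placed side by side in a single comparative chart.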

How to run:

  1. Start testground:
$ make tg-start
  2. In another terminal, start the telemetry infrastructure:
$ make telemetry-infra-up
  3. Run the test case:
$ make tg-run-composition RUNNER=local-docker TESTPLAN=das-benchmarks COMPOSITION=001-lights-dasing-latest-from-bridge-16-50-28
  4. Go to http://localhost:3000 to access Grafana

  5. Add prometheus as a data source
    5.1 For the URL, if you are running in a droplet, use your instance's IP instead of localhost

  6. Import the dashboard under ./build/grafana/dashboards/benchmarks.json
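
For repeated runs, the test-case step can be wrapped in a small helper. This is a sketch only: the `run_composition` function and the `DRY_RUN` variable are hypothetical and not part of the Makefile; the `make` target and its `RUNNER`, `TESTPLAN` and `COMPOSITION` variables are the ones shown in the steps above. It defaults to a dry run that only prints the command.

```shell
# Hypothetical wrapper around the tg-run-composition step.
# Usage: run_composition <composition> [runner]
# With DRY_RUN=1 (the default) the command is printed instead of executed.
run_composition() {
  composition="$1"
  runner="${2:-local-docker}"
  # Rebuild the positional parameters as the full make command line.
  set -- make tg-run-composition "RUNNER=$runner" "TESTPLAN=das-benchmarks" "COMPOSITION=$composition"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

run_composition 001-lights-dasing-latest-from-bridge-16-50-28
```

Setting `DRY_RUN=0` runs the command for real, which assumes testground and the telemetry infrastructure are already up.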

Dependencies

This PR depends on #132

Owners: @derrandz


1. The Full Node has the latest head
2. All light nodes are network-bootstrapped and connected to the full node (no discovery required)
3. Share size is 32
Member


It would be awesome to have a 32/64/128 matrix, since we are already assuming in #96 that it works 😅

@derrandz derrandz force-pushed the tp002/das-benchmark branch 2 times, most recently from 9fd0724 to d2023b2 Compare October 20, 2022 15:23
@derrandz derrandz self-assigned this Oct 24, 2022
@derrandz derrandz added the experiment, test, and testground labels Oct 24, 2022
@derrandz derrandz force-pushed the tp002/das-benchmark branch from d2023b2 to d182c15 Compare October 25, 2022 15:51
@derrandz
Contributor Author

To configure the celestia node with the benchmark parameters, these two PRs have to go in;

@derrandz derrandz force-pushed the tp002/das-benchmark branch 3 times, most recently from fcaea7e to 27ab5f7 Compare November 9, 2022 17:10
@derrandz
Contributor Author

derrandz commented Nov 9, 2022

The DASer PR is no longer required to configure the light node to start DASing from a specific height. We changed the test logic to unblock this PR.

The other one that's mentioned in the comment might be required down the line for different sample amounts.

@derrandz
Contributor Author

derrandz commented Nov 9, 2022

Define the Non Functional Requirements that have to be met for this test plan alongside the metrics to collect and their thresholds.

Referencing #108

@derrandz
Contributor Author

derrandz commented Nov 11, 2022

Update

The scope of this PR is changing to include isolating the tests from the test setup, collecting metrics, and the infrastructure needed to support metrics collection.

Although hacky, it's convenient to take this route. After getting a fully working version, we will rewrite the history of this PR to clean this up.

More context in here

The required infrastructure efforts are documented in #109

@derrandz derrandz force-pushed the tp002/das-benchmark branch from 39bdd5f to 03662cb Compare November 16, 2022 13:04
@derrandz
Contributor Author

derrandz commented Nov 16, 2022

Ongoing work to enable blackbox telemetry is in:

@derrandz
Contributor Author

derrandz commented Nov 22, 2022

Progress update regarding the blackbox telemetry efforts:

We now have a few charts that measure how well the bridge node serves the DASing process. Since we are benchmarking one bridge node against many light nodes, for now we display charts per light node instance (you can pick any instance from the dropdown shown in the screenshot).

Aggregate charts that display the overall state of the DASing process across all light nodes are the next step (see the PR's TODOs).
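
To illustrate what an aggregate view could summarize, here is a rough sketch that reduces per-light-node DASing times to a count, mean and maximum. The input format (one `node-id seconds` pair per line) and the `aggregate_das_times` name are assumptions made for this example; the actual dashboards would more likely compute this with queries inside Grafana.

```shell
# Hypothetical aggregation over per-node samples read from stdin.
# Each input line is assumed to be: <light-node-id> <dasing-time-seconds>
aggregate_das_times() {
  awk '
    {
      n++
      v = $2 + 0                       # force numeric interpretation
      sum += v
      if (n == 1 || v > max) max = v   # track the slowest light node
    }
    END {
      if (n == 0) { print "count=0"; exit }
      printf "count=%d mean=%.2f max=%.2f\n", n, sum / n, max
    }
  '
}

printf 'light-0 1.0\nlight-1 2.0\nlight-2 3.0\n' | aggregate_das_times
```

Collapsing all light nodes into a few numbers per run is what would make runs at different concurrency levels directly comparable.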

Screenshots from a local run with 1 validator, 1 bridge node, and 28 light nodes:
[Image: Screen Shot 2022-11-23 at 00 27 18]

Selecting a light node from the dropdown:
[Image: Screen Shot 2022-11-23 at 00 27 24]

@derrandz
Contributor Author

Some improvements to charting:

  • Added the native influx-db metrics for testground to track # of alive light nodes (a way to experiment with testground’s influxdb)

  • Switched the histogram charts to a table-like layout (the DASing time and block time charts now read more clearly per height)

Screen Shot 2022-11-24 at 00 39 38

@derrandz derrandz force-pushed the tp002/das-benchmark branch 2 times, most recently from 5b3778f to c4c03b9 Compare November 29, 2022 01:34
@derrandz derrandz requested a review from Bidon15 February 8, 2023 12:05
@derrandz derrandz changed the title TP002: Stress test 1 Full or Bridge Node against x Light Nodes [Das Benchmarks]: Stress test 1 Full or Bridge Node against x Light Nodes Feb 8, 2023
Makefile (resolved)
build/docker-compose.yml (resolved)
testkit/nodekit/node.go (outdated, resolved)
go.mod (outdated, resolved)
@derrandz derrandz requested a review from Bidon15 February 8, 2023 13:28
Member

@Bidon15 Bidon15 left a comment


let's get this home run and figure out the nits later

@derrandz
Contributor Author

derrandz commented Feb 8, 2023

Final nits are tracked in #167 for future resolution

Labels
experiment: Experiments to find out either the tech is suitable for our needs
test: Request for creating a test-case
testground: related to testground
Projects
Archived in project
Status: Done
Development

Successfully merging this pull request may close these issues.

Add telemetry infrastructure to collect metrics for local docker setups
2 participants