Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update resharding.md #10264

Merged
merged 2 commits into from
Dec 1, 2023
Merged

Update resharding.md #10264

merged 2 commits into from
Dec 1, 2023

Conversation

wacban
Copy link
Contributor

@wacban wacban commented Nov 29, 2023

A pretty heavy handed rewrite of the resharding documentation. I'm open to some heavy handed suggestions ;)

I'm not sure what's the best trade off between duplicating some content from the NEPs here and keeping in minimal with just links.

There is some new content here, mainly the rollout section, I would love to hear some thoughts about this in particular from @gmilescu and @posvyatokum.

docs/architecture/how/resharding.md Show resolved Hide resolved
* 1 - Building - resharding is running. Only one shard at a time can be in that state while the rest will be either finished or waiting in the Scheduled state.
* 2 - Finished - resharding is finished.
* -1 - Failed - resharding failed and manual recovery action is required. The node will operate as usual until the end of the epoch but will then stop being able to process blocks.
* near_resharding_batch_size and near_resharding_batch_count - those two metrics show how much data has been resharded. Both should be gradually increasing while the near_resharding_status.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can split near_resharding_batch_size and near_resharding_batch_count across two points

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Both should be gradually increasing while the near_resharding_status -> Both metrics should progress with the near_resharding_status as follows

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can include graphics to show what this might look like. I've taken screenshots from grafana dashboard

Screenshot 2023-11-29 at 6 52 12 PM
Screenshot 2023-11-29 at 6 52 04 PM

Copy link

codecov bot commented Nov 29, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (dd80b4c) 71.91% compared to head (f1c12be) 71.79%.
Report is 6 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10264      +/-   ##
==========================================
- Coverage   71.91%   71.79%   -0.12%     
==========================================
  Files         707      711       +4     
  Lines      142268   142936     +668     
  Branches   142268   142936     +668     
==========================================
+ Hits       102306   102620     +314     
- Misses      35237    35583     +346     
- Partials     4725     4733       +8     
Flag Coverage Δ
backward-compatibility 0.08% <ø> (-0.01%) ⬇️
db-migration 0.08% <ø> (-0.01%) ⬇️
genesis-check 1.26% <ø> (-0.01%) ⬇️
integration-tests 36.20% <ø> (-0.09%) ⬇️
linux 71.66% <ø> (-0.11%) ⬇️
linux-nightly 71.39% <ø> (-0.11%) ⬇️
macos 53.92% <ø> (-1.17%) ⬇️
pytests 1.48% <ø> (-0.01%) ⬇️
sanity-checks 1.28% <ø> (-0.01%) ⬇️
unittests 68.13% <ø> (-0.14%) ⬇️
upgradability 0.13% <ø> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@posvyatokum posvyatokum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I can combine some info from this into shorter release notes. The only concern is about fine tuning, but it is not related to documentation. It is more like a late feature request.
You can say that you don't see more than one iteration of fine tuning happening at the very beginning of the resharding epoch, and that will be enough for me.
But otherwise I would really like an option to update batch size without loosing resharding progress.


Most of the keys in our tries, are in the form of ``${some short prefix}${account_id}${long suffix}``, we could use this fact, and move the whole 'trie subtrees' around.
A node needs to be restarted for the new config to take effect. This should be done only when absolutely necessary as restarting during resharding will interrupt it and resharding will need to start from beginning.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you considered making these config params updatable through SIGHUP?

if sig == "SIGHUP" {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not considered it but it sounds real good. I added it to the resharding tracking issue but until it's done I'll leave the docs as is.

* While in the Scheduled state both metrics should remain 0.
* While in the Building state both metrics should be gradually increasing.
* While in the Finished state both metrics should remain at the same value.
* near_resharding_batch_prepare_time_bucket, near_resharding_batch_apply_time_bucket and near_resharding_batch_commit_time_bucket - those three metrics can be used to track the performance of resharding and fine tune throttling if needed. As a rule of thumb the combined time of prepare, apply and commit for a batch should remain at the 100ms-200ms level on average. Higher batch processing time may lead to disruptions in block processing, missing chunks and blocks.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like some instructions on how to understand from metrics that resharding is going to finish before the start of the next epoch. In case we tuned too much into block processing optimisation.
Maybe some approximations for near_resharding_batch_size or near_resharding_batch_count for testnet, mainnet, or something based on specific size of some columns (that is also a metric).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure thing, I'll follow up in a separate PR.

@wacban wacban enabled auto-merge December 1, 2023 10:20
@wacban wacban added this pull request to the merge queue Dec 1, 2023
Merged via the queue into master with commit 52b3287 Dec 1, 2023
17 of 18 checks passed
@wacban wacban deleted the waclaw-resharding-docs branch December 1, 2023 11:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants