Update resharding.md #10264

wacban · 2023-11-29T13:03:12Z

A pretty heavy handed rewrite of the resharding documentation. I'm open to some heavy handed suggestions ;)

I'm not sure what's the best trade off between duplicating some content from the NEPs here and keeping in minimal with just links.

There is some new content here, mainly the rollout section, I would love to hear some thoughts about this in particular from @gmilescu and @posvyatokum.

docs/architecture/how/resharding.md

shreyan-gupta · 2023-11-29T13:15:10Z

docs/architecture/how/resharding.md

+  * 1 - Building - resharding is running. Only one shard at a time can be in that state while the rest will be either finished or waiting in the Scheduled state. 
+  * 2 - Finished - resharding is finished. 
+  * -1 - Failed - resharding failed and manual recovery action is required. The node will operate as usual until the end of the epoch but will then stop being able to process blocks. 
+* near_resharding_batch_size and near_resharding_batch_count - those two metrics show how much data has been resharded. Both should be gradually increasing while the near_resharding_status. 


Can split near_resharding_batch_size and near_resharding_batch_count across two points

nit: Both should be gradually increasing while the near_resharding_status -> Both metrics should progress with the near_resharding_status as follows

Can include graphics to show what this might look like. I've taken screenshots from grafana dashboard

codecov · 2023-11-29T13:25:41Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (dd80b4c) 71.91% compared to head (f1c12be) 71.79%.
Report is 6 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master   #10264      +/-   ##
==========================================
- Coverage   71.91%   71.79%   -0.12%     
==========================================
  Files         707      711       +4     
  Lines      142268   142936     +668     
  Branches   142268   142936     +668     
==========================================
+ Hits       102306   102620     +314     
- Misses      35237    35583     +346     
- Partials     4725     4733       +8

Flag	Coverage Δ
backward-compatibility	`0.08% <ø> (-0.01%)`	⬇️
db-migration	`0.08% <ø> (-0.01%)`	⬇️
genesis-check	`1.26% <ø> (-0.01%)`	⬇️
integration-tests	`36.20% <ø> (-0.09%)`	⬇️
linux	`71.66% <ø> (-0.11%)`	⬇️
linux-nightly	`71.39% <ø> (-0.11%)`	⬇️
macos	`53.92% <ø> (-1.17%)`	⬇️
pytests	`1.48% <ø> (-0.01%)`	⬇️
sanity-checks	`1.28% <ø> (-0.01%)`	⬇️
unittests	`68.13% <ø> (-0.14%)`	⬇️
upgradability	`0.13% <ø> (-0.01%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

posvyatokum

Looks good. I can combine some info from this into shorter release notes. The only concern is about fine tuning, but it is not related to documentation. It is more like a late feature request.
You can say that you don't see more than one iteration of fine tuning happening at the very beginning of the resharding epoch, and that will be enough for me.
But otherwise I would really like an option to update batch size without loosing resharding progress.

posvyatokum · 2023-11-29T13:39:16Z

docs/architecture/how/resharding.md


-Most of the keys in our tries, are in the form of ``${some short prefix}${account_id}${long suffix}``, we could use this fact, and move the whole 'trie subtrees' around.
+A node needs to be restarted for the new config to take effect. This should be done only when absolutely necessary as restarting during resharding will interrupt it and resharding will need to start from beginning. 


Have you considered making these config params updatable through SIGHUP?

nearcore/neard/src/cli.rs

Line 570 in dd80b4c

if sig == "SIGHUP" {

I have not considered it but it sounds real good. I added it to the resharding tracking issue but until it's done I'll leave the docs as is.

posvyatokum · 2023-11-29T13:42:00Z

docs/architecture/how/resharding.md

+  * While in the Scheduled state both metrics should remain 0. 
+  * While in the Building state both metrics should be gradually increasing. 
+  * While in the Finished state both metrics should remain at the same value. 
+* near_resharding_batch_prepare_time_bucket, near_resharding_batch_apply_time_bucket and near_resharding_batch_commit_time_bucket - those three metrics can be used to track the performance of resharding and fine tune throttling if needed. As a rule of thumb the combined time of prepare, apply and commit for a batch should remain at the 100ms-200ms level on average. Higher batch processing time may lead to disruptions in block processing, missing chunks and blocks. 


I would like some instructions on how to understand from metrics that resharding is going to finish before the start of the next epoch. In case we tuned too much into block processing optimisation.
Maybe some approximations for near_resharding_batch_size or near_resharding_batch_count for testnet, mainnet, or something based on specific size of some columns (that is also a metric).

Sure thing, I'll follow up in a separate PR.

Update resharding.md

3597d03

wacban requested review from gmilescu, posvyatokum, shreyan-gupta and walnut-the-cat November 29, 2023 13:03

wacban requested a review from a team as a code owner November 29, 2023 13:03

shreyan-gupta approved these changes Nov 29, 2023

View reviewed changes

posvyatokum approved these changes Nov 29, 2023

View reviewed changes

per review

f1c12be

wacban enabled auto-merge December 1, 2023 10:20

wacban added this pull request to the merge queue Dec 1, 2023

Merged via the queue into master with commit 52b3287 Dec 1, 2023
17 of 18 checks passed

wacban deleted the waclaw-resharding-docs branch December 1, 2023 11:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update resharding.md #10264

Update resharding.md #10264

wacban commented Nov 29, 2023 •

edited

Loading

shreyan-gupta Nov 29, 2023

shreyan-gupta Nov 29, 2023

shreyan-gupta Nov 29, 2023

codecov bot commented Nov 29, 2023 •

edited

Loading

posvyatokum left a comment

posvyatokum Nov 29, 2023

wacban Dec 1, 2023

posvyatokum Nov 29, 2023

wacban Dec 1, 2023


		Most of the keys in our tries, are in the form of ``${some short prefix}${account_id}${long suffix}``, we could use this fact, and move the whole 'trie subtrees' around.
		A node needs to be restarted for the new config to take effect. This should be done only when absolutely necessary as restarting during resharding will interrupt it and resharding will need to start from beginning.

Update resharding.md #10264

Update resharding.md #10264

Conversation

wacban commented Nov 29, 2023 • edited Loading

shreyan-gupta Nov 29, 2023

Choose a reason for hiding this comment

shreyan-gupta Nov 29, 2023

Choose a reason for hiding this comment

shreyan-gupta Nov 29, 2023

Choose a reason for hiding this comment

codecov bot commented Nov 29, 2023 • edited Loading

Codecov Report

posvyatokum left a comment

Choose a reason for hiding this comment

posvyatokum Nov 29, 2023

Choose a reason for hiding this comment

wacban Dec 1, 2023

Choose a reason for hiding this comment

posvyatokum Nov 29, 2023

Choose a reason for hiding this comment

wacban Dec 1, 2023

Choose a reason for hiding this comment

wacban commented Nov 29, 2023 •

edited

Loading

codecov bot commented Nov 29, 2023 •

edited

Loading