Update resharding.md #10264
docs/architecture/how/resharding.md
Outdated
* 1 - Building - resharding is running. Only one shard at a time can be in that state while the rest will be either finished or waiting in the Scheduled state.
* 2 - Finished - resharding is finished.
* -1 - Failed - resharding failed and manual recovery action is required. The node will operate as usual until the end of the epoch but will then stop being able to process blocks.
* near_resharding_batch_size and near_resharding_batch_count - those two metrics show how much data has been resharded. Both should be gradually increasing while the near_resharding_status.
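To illustrate how an operator might interpret these status values, here is a minimal Python sketch. It assumes the metric has already been scraped into a `{shard_id: value}` mapping. The helper name `check_statuses` is hypothetical, and the value `0` for the Scheduled state is an assumption, since only `1`, `2` and `-1` are listed above.

```python
from enum import IntEnum


class ReshardingStatus(IntEnum):
    FAILED = -1
    SCHEDULED = 0  # assumption: this value is not listed in the excerpt above
    BUILDING = 1
    FINISHED = 2


def check_statuses(statuses):
    """Return a list of warnings for a {shard_id: status_value} mapping."""
    warnings = []
    building = sorted(
        shard for shard, value in statuses.items()
        if value == ReshardingStatus.BUILDING
    )
    # Only one shard at a time is expected to be in the Building state.
    if len(building) > 1:
        warnings.append(f"more than one shard is Building: {building}")
    for shard, value in sorted(statuses.items()):
        if value == ReshardingStatus.FAILED:
            warnings.append(f"shard {shard} Failed: manual recovery is required")
    return warnings
```

An empty list means the sampled values are consistent with the descriptions above. It does not prove that resharding is healthy.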
Can split `near_resharding_batch_size` and `near_resharding_batch_count` across two points
nit: "Both should be gradually increasing while the near_resharding_status" -> "Both metrics should progress with the `near_resharding_status` as follows"
Codecov Report
All modified and coverable lines are covered by tests ✅

Coverage diff:

| | master | #10264 | +/- |
|---|---|---|---|
| Coverage | 71.91% | 71.79% | -0.12% |
| Files | 707 | 711 | +4 |
| Lines | 142268 | 142936 | +668 |
| Branches | 142268 | 142936 | +668 |
| Hits | 102306 | 102620 | +314 |
| Misses | 35237 | 35583 | +346 |
| Partials | 4725 | 4733 | +8 |

Flags with carried forward coverage won't be shown.
Looks good. I can combine some info from this into shorter release notes. My only concern is about fine-tuning, but that is not related to the documentation; it is more of a late feature request.
You can say that you don't see more than one iteration of fine-tuning happening at the very beginning of the resharding epoch, and that will be enough for me.
Otherwise, I would really like an option to update the batch size without losing resharding progress.
Most of the keys in our tries, are in the form of ``${some short prefix}${account_id}${long suffix}``, we could use this fact, and move the whole 'trie subtrees' around.
A node needs to be restarted for the new config to take effect. This should be done only when absolutely necessary as restarting during resharding will interrupt it and resharding will need to start from beginning.
Have you considered making these config params updatable through SIGHUP?
Line 570 in dd80b4c
if sig == "SIGHUP" {
I have not considered it but it sounds real good. I added it to the resharding tracking issue but until it's done I'll leave the docs as is.
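For readers unfamiliar with the pattern suggested in this thread, here is a minimal Python sketch of reloading settings on SIGHUP without restarting the process. It is not nearcore's mechanism, and the config keys `batch_size` and `batch_delay_ms` as well as both function names are hypothetical. A real implementation would also need the running resharding job to pick up the new values between batches.

```python
import json
import signal


def parse_resharding_config(text):
    """Parse the hypothetical resharding settings from JSON text."""
    raw = json.loads(text)
    return {
        "batch_size": int(raw["batch_size"]),          # hypothetical key
        "batch_delay_ms": int(raw["batch_delay_ms"]),  # hypothetical key
    }


def install_sighup_reload(path, live_config):
    """Update `live_config` in place whenever the process receives SIGHUP."""
    def handler(signum, frame):
        with open(path) as f:
            live_config.update(parse_resharding_config(f.read()))

    # Unix only; signal handlers must be installed from the main thread.
    signal.signal(signal.SIGHUP, handler)
```

With this in place, `kill -HUP <pid>` would refresh the dictionary while the process keeps running, which is the property the comment above asks for.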
* While in the Scheduled state both metrics should remain 0.
* While in the Building state both metrics should be gradually increasing.
* While in the Finished state both metrics should remain at the same value.
* near_resharding_batch_prepare_time_bucket, near_resharding_batch_apply_time_bucket and near_resharding_batch_commit_time_bucket - those three metrics can be used to track the performance of resharding and fine tune throttling if needed. As a rule of thumb the combined time of prepare, apply and commit for a batch should remain at the 100ms-200ms level on average. Higher batch processing time may lead to disruptions in block processing, missing chunks and blocks.
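The rules above can be sketched as two small checks in Python. This is an illustration under assumptions, not tooling from the repository: the function names are hypothetical, the value `0` for Scheduled is assumed, and the timing metrics are assumed to be standard Prometheus histograms that also expose `_sum` and `_count` series measured in seconds.

```python
SCHEDULED, BUILDING, FINISHED = 0, 1, 2  # 0 for Scheduled is an assumption


def progress_is_consistent(status, previous, current):
    """Check two consecutive samples of a batch metric against the status."""
    if status == SCHEDULED:
        return previous == 0 and current == 0
    if status == BUILDING:
        # "Gradually increasing": the value must never go down between samples.
        return current >= previous
    if status == FINISHED:
        return current == previous
    return False  # Failed or unknown status


def mean_batch_seconds(sums, counts):
    """Average prepare + apply + commit time per batch.

    `sums` and `counts` map each phase name to the `_sum` and `_count`
    series of its histogram (assumed to be in seconds).
    Returns None when a phase has no observations yet.
    """
    total = 0.0
    for phase in ("prepare", "apply", "commit"):
        if counts[phase] == 0:
            return None
        total += sums[phase] / counts[phase]
    return total
```

A result from `mean_batch_seconds` above roughly 0.2 would exceed the 100ms-200ms rule of thumb and suggest stronger throttling.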
I would like some instructions on how to tell from the metrics that resharding is going to finish before the start of the next epoch, in case we tuned too much towards block processing optimisation.
Maybe some approximations for `near_resharding_batch_size` or `near_resharding_batch_count` for testnet, mainnet, or something based on the specific size of some columns (that is also a metric).
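One way such an estimate could work is a linear extrapolation from two samples of a cumulative metric, sketched in Python below. This is not guidance from the maintainers. The function names are hypothetical, `total_size` (the amount of data to reshard, in the same unit as the metric) must be supplied by the operator because the thread gives no reference values, and a constant processing rate is assumed.

```python
def seconds_until_done(size_before, size_after, interval_seconds, total_size):
    """Linear estimate of the remaining time from two metric samples."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    rate = (size_after - size_before) / interval_seconds
    if rate <= 0:
        return None  # no progress observed, so no estimate is possible
    return max(total_size - size_after, 0) / rate


def finishes_before_epoch_end(eta_seconds, seconds_left_in_epoch):
    """True only when an estimate exists and fits in the remaining epoch time."""
    return eta_seconds is not None and eta_seconds <= seconds_left_in_epoch
```

For example, samples of 1000 and 2000 taken 10 seconds apart against a total of 10000 give a rate of 100 per second and an estimate of 80 seconds remaining.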
Sure thing, I'll follow up in a separate PR.
A pretty heavy-handed rewrite of the resharding documentation. I'm open to some heavy-handed suggestions ;)
I'm not sure what the best trade-off is between duplicating some content from the NEPs here and keeping it minimal with just links.
There is some new content here, mainly the rollout section. I would love to hear some thoughts about this, in particular from @gmilescu and @posvyatokum.