feat: state: Fast migration for v15 #7933
Conversation
Codecov Report
@@ Coverage Diff @@
## master #7933 +/- ##
==========================================
+ Coverage 39.12% 39.33% +0.21%
==========================================
Files 656 658 +2
Lines 70924 71112 +188
==========================================
+ Hits 27746 27971 +225
+ Misses 38382 38321 -61
- Partials 4796 4820 +24
Continue to review full report at Codecov.
Force-pushed from bc5337d to 7559e43
blockstore/autobatch.go
Outdated
func (bs *AutobatchBlockstore) View(ctx context.Context, cid cid.Cid, callback func([]byte) error) error {
	return xerrors.New("unsupported")
}
I'd implement this, even if we just call get under the covers.
I might just drop the methods -- we don't actually need this type to implement Lotus's blockstore interface
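For illustration, a minimal sketch of the "call get under the covers" option (hypothetical -- per the reply above, the method may simply be dropped instead):

// Hypothetical sketch, not the PR's code: satisfy View by delegating to Get
// and handing the raw block bytes to the callback.
func (bs *AutobatchBlockstore) View(ctx context.Context, c cid.Cid, callback func([]byte) error) error {
	blk, err := bs.Get(ctx, c)
	if err != nil {
		return err
	}
	return callback(blk.RawData())
}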
blockstore/autobatch_test.go
Outdated
require.NoError(t, ab.Put(ctx, b1))
require.NoError(t, ab.Put(ctx, b2))

ab.Flush(ctx)
Maybe don't implement Flush in autobatch?
Yeah, this will get updated. Thanks!
default:
	autolog.Errorf("FLUSH ERRORED: %w, retrying in %v", putErr, bs.flushRetryDelay)
	time.Sleep(bs.flushRetryDelay)
	putErr = bs.doFlush(bs.flushCtx)
It seems weird to me to retry putting it -- what are the possible causes of an error that goes away?
@Stebalien thinks that could happen if the system is under stress -- with some backoff it may succeed?
Maybe it's not an issue with badger? But my thinking was:
- Transient error: retry will help.
- Non-transient error: we can't write anything else anyways so we might as well retry.
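A rough sketch of that retry loop, using the field names from the snippet above (the exact structure is an assumption, not the merged code):

// Sketch: retry the flush after a fixed delay. A transient error (e.g. the
// backing store briefly under load) should eventually clear; on a permanent
// error nothing else can be written anyway, so retrying costs little.
putErr := bs.doFlush(bs.flushCtx)
for putErr != nil {
	autolog.Errorf("FLUSH ERRORED: %v, retrying in %v", putErr, bs.flushRetryDelay)
	time.Sleep(bs.flushRetryDelay)
	putErr = bs.doFlush(bs.flushCtx)
}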
_, ok := bs.addedCids[blk.Cid()]
if !ok {
	bs.addedCids[blk.Cid()] = struct{}{}
What's the hit rate on this?
Not measured, but I expect it to be quite high -- think of the number of times we'll try to Put
the empty deadline object.
Yeah, I'm just wondering if it's worth the added memory use (tho if it's insignificant, this should be ok to keep even if it doesn't help that much).
Sounds good -- I do have a TODO to drop it if the memory use is a problem; we'll see in experiments.
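For context, a sketch of the dedup pattern under discussion -- the buffer and lock field names here are assumptions, not the PR's actual fields:

// Sketch: skip blocks already buffered or flushed. The addedCids map trades
// memory for fewer duplicate writes (e.g. the empty deadline object), and can
// be dropped if the memory cost turns out to matter.
func (bs *AutobatchBlockstore) Put(ctx context.Context, blk blocks.Block) error {
	bs.bufferLk.Lock() // hypothetical lock guarding the buffer
	defer bs.bufferLk.Unlock()
	if _, ok := bs.addedCids[blk.Cid()]; ok {
		return nil // already seen, nothing to do
	}
	bs.addedCids[blk.Cid()] = struct{}{}
	bs.buffered = append(bs.buffered, blk) // hypothetical buffer slice
	return nil
}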
// may seem backward to check the backingBs first, but that is the likeliest case
blk, err := bs.backingBs.Get(ctx, c)
if err == nil {
	return blk, nil
}
If we check backingBs isn't this potentially racy with the way we do locks here? If we really want to avoid locking, we probably want to check backingBs once more at the end.
That's a good point -- I'm not super concerned about it because, for the migration, we should actually never try to Get anything we've Put.
...but a second check of the bs at the end makes sense.
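A sketch of that "second check at the end" idea, staying lock-free (the buffered lookup is a hypothetical helper, not an existing function):

// Sketch: backing store first (the likeliest hit), then the in-flight buffer,
// then the backing store once more in case a concurrent flush moved the block
// between the first check and the buffer check.
func (bs *AutobatchBlockstore) Get(ctx context.Context, c cid.Cid) (blocks.Block, error) {
	blk, err := bs.backingBs.Get(ctx, c)
	if err == nil {
		return blk, nil
	}
	if blk, ok := bs.lookupBuffered(c); ok { // hypothetical helper over the buffered batch
		return blk, nil
	}
	// re-check: a flush may have landed the block in backingBs in the meantime
	return bs.backingBs.Get(ctx, c)
}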
Avoiding taking the lock likely isn't worth it. If it's a problem, we could always use a read/write lock.
Will try and report back with perf
Looks like this slowed us down -- trying again without it to confirm.
Confirmed locking in there slows us down from 870 migrations per sec to 350
You kind of need those locks, but they can be rwlocks.
And you mean the locking in this function? Shouldn't we never hit those locks?
if we want to lock to avoid raciness, it has to be at the very top of the method (so always gets hit)
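And a sketch of the alternative described here: take a read lock at the very top of the method so it is always hit, using an RWMutex so concurrent readers don't serialize (field names assumed, not the final code):

// Sketch: guard the whole read path with a read lock held from the top of Get;
// readers only block while a writer (Put/flush) holds the write lock.
func (bs *AutobatchBlockstore) Get(ctx context.Context, c cid.Cid) (blocks.Block, error) {
	bs.stateLock.RLock() // hypothetical sync.RWMutex field
	defer bs.stateLock.RUnlock()

	blk, err := bs.backingBs.Get(ctx, c)
	if err == nil {
		return blk, nil
	}
	if blk, ok := bs.lookupBuffered(c); ok { // hypothetical helper, as above
		return blk, nil
	}
	return nil, err // not in the backing store or the buffer
}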
blockstore/autobatch.go
Outdated
func (bs *AutobatchBlockstore) Shutdown(ctx context.Context) error {
	// shutdown the flush worker
	bs.shutdownCh <- struct{}{}
Hm. It's unfortunate that this will block indefinitely if we shutdown twice.
We can iterate on master.
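One possible follow-up for the double-shutdown concern (a sketch only, not what was merged): guard the signal with sync.Once so a second call returns instead of blocking.

// Sketch: close the shutdown channel exactly once; later calls are no-ops
// rather than blocking on a second channel send.
func (bs *AutobatchBlockstore) Shutdown(ctx context.Context) error {
	bs.shutdownOnce.Do(func() { // hypothetical sync.Once field
		close(bs.shutdownCh) // closing also wakes the flush worker exactly once
	})
	return nil
}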
Related Issues
fixes #7870
Proposed Changes
The chief problem is that we can't hold the entire migrated state in memory (too much of it changes), but we also need the migration to be fast enough that the premigration doesn't take too long and the migration itself (ideally) doesn't lead to null blocks.
The newly added autobatch store achieves this with a big caveat: it can take a very long time to Get things that have not already been Flushed. Deletion is not supported at all.
Additional Info
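For illustration, a rough usage sketch based on the description above (constructor and method names are assumptions, not necessarily the final API):

// Sketch: wrap the node's blockstore for the duration of the migration,
// write freely, then flush once at the end. Gets of unflushed blocks may be
// slow until the flush completes, and Delete is not supported.
ab := blockstore.NewAutobatch(ctx, backingBs, bufferCapacity) // name/signature assumed
// ... run the migration, calling ab.Put(ctx, blk) for each new state object ...
if err := ab.Flush(ctx); err != nil {
	return xerrors.Errorf("flushing autobatch store: %w", err)
}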
Checklist
Before you mark the PR ready for review, please make sure that:
<PR type>: <area>: <change being made>
Example: fix: mempool: Introduce a cache for valid signatures
PR type: fix, feat, INTERFACE BREAKING CHANGE, CONSENSUS BREAKING, build, chore, ci, docs, misc, perf, refactor, revert, style, test
area: api, chain, state, vm, data transfer, market, mempool, message, block production, multisig, networking, paychan, proving, sealing, wallet