
feat: state: Fast migration for v15 #7933

Merged: 10 commits merged into master from asr/migration-autobatch on Jan 12, 2022
Conversation

@arajasek (Contributor) commented Jan 11, 2022

Related Issues

fixes #7870

Proposed Changes

The chief problem is that we can't hold the entire migrated state in memory (too much of it changes), but the migration also needs to be fast enough that the premigration doesn't take too long and the migration itself (ideally) doesn't lead to null blocks.

The newly added autobatch store achieves this, with a big caveat: it can take a very long time to Get things that have not yet been Flushed, and deletion is not supported at all.
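
For context, here is a minimal sketch of the autobatch idea. It is illustrative only; the type, field, and interface names below are assumptions, not the merged blockstore/autobatch.go:

```go
package blockstore

import (
	"context"
	"sync"

	blocks "github.com/ipfs/go-block-format"
	cid "github.com/ipfs/go-cid"
)

// Backing abstracts the on-disk store (e.g. badger); hypothetical minimal interface.
type Backing interface {
	Get(ctx context.Context, c cid.Cid) (blocks.Block, error)
	PutMany(ctx context.Context, blks []blocks.Block) error
}

// AutobatchSketch buffers Puts in memory and lets a background worker
// flush batches to the backing store, so the migration never holds the
// whole migrated state in memory at once.
type AutobatchSketch struct {
	backing Backing

	lock    sync.Mutex
	pending []blocks.Block // buffered, not yet flushed
	size    int            // bytes buffered so far
	cap     int            // flush threshold in bytes
	flushCh chan struct{}  // cap 1; nudges the flush worker
}

// Put is cheap: append to the in-memory batch and signal the worker
// once the batch crosses the size threshold. Get on an unflushed block
// would have to scan pending, which is why unflushed Gets are slow.
func (bs *AutobatchSketch) Put(ctx context.Context, blk blocks.Block) error {
	bs.lock.Lock()
	bs.pending = append(bs.pending, blk)
	bs.size += len(blk.RawData())
	full := bs.size >= bs.cap
	bs.lock.Unlock()
	if full {
		select {
		case bs.flushCh <- struct{}{}:
		default: // a flush is already queued
		}
	}
	return nil
}
```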

Additional Info

Checklist

Before you mark the PR ready for review, please make sure that:

  • All commits have a clear commit message.
  • The PR title is in the form of <PR type>: <area>: <change being made>
    • example: fix: mempool: Introduce a cache for valid signatures
    • PR type: fix, feat, INTERFACE BREAKING CHANGE, CONSENSUS BREAKING, build, chore, ci, docs, misc, perf, refactor, revert, style, test
    • area: api, chain, state, vm, data transfer, market, mempool, message, block production, multisig, networking, paychan, proving, sealing, wallet
  • This PR has tests for new functionality or change in behaviour
  • If new user-facing features are introduced, clear usage guidelines and/or documentation updates should be included in https://lotus.filecoin.io or Discussion Tutorials.
  • CI is green

@arajasek arajasek requested a review from a team as a code owner January 11, 2022 22:23
@arajasek arajasek changed the title Fast migration for v15 feat: state: Fast migration for v15 Jan 11, 2022
blockstore/autobatch.go: resolved review thread (outdated)
@codecov bot commented Jan 11, 2022

Codecov Report

Merging #7933 (3464dc2) into master (207d33e) will increase coverage by 0.21%.
The diff coverage is 43.78%.


@@            Coverage Diff             @@
##           master    #7933      +/-   ##
==========================================
+ Coverage   39.12%   39.33%   +0.21%     
==========================================
  Files         656      658       +2     
  Lines       70924    71112     +188     
==========================================
+ Hits        27746    27971     +225     
+ Misses      38382    38321      -61     
- Partials     4796     4820      +24     
| Impacted Files | Coverage Δ |
|---|---|
| chain/actors/builtin/paych/message5.go | 0.00% <0.00%> (ø) |
| chain/actors/builtin/paych/message6.go | 0.00% <0.00%> (ø) |
| cmd/lotus-shed/main.go | 0.00% <0.00%> (ø) |
| cmd/lotus-shed/migrations.go | 0.00% <0.00%> (ø) |
| chain/consensus/filcns/upgrades.go | 33.82% <16.66%> (+2.29%) ⬆️ |
| blockstore/autobatch.go | 61.40% <61.40%> (ø) |
| chain/actors/builtin/paych/message7.go | 86.66% <100.00%> (ø) |
| chain/actors/builtin/paych/v7.go | 72.41% <100.00%> (+7.96%) ⬆️ |
| chain/stmgr/forks.go | 46.84% <100.00%> (ø) |
| chain/stmgr/stmgr.go | 64.94% <100.00%> (ø) |
| ... and 38 more | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 207d33e...3464dc2.

@arajasek force-pushed the asr/migration-autobatch branch from bc5337d to 7559e43 on January 12, 2022 01:30
blockstore/autobatch.go: resolved review thread (outdated)
}

func (bs *AutobatchBlockstore) View(ctx context.Context, cid cid.Cid, callback func([]byte) error) error {
	return xerrors.New("unsupported")
Member:

I'd implement this, even if we just call get under the covers.

arajasek (Contributor, Author):

I might just drop the methods -- we don't actually need this type to implement Lotus's blockstore interface.
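
A minimal sketch of the reviewer's suggestion, delegating to the type's existing Get (parameter name changed to avoid shadowing the cid package; this is not the merged code):

```go
// View implemented "under the covers" via Get, as suggested above.
func (bs *AutobatchBlockstore) View(ctx context.Context, c cid.Cid, callback func([]byte) error) error {
	blk, err := bs.Get(ctx, c)
	if err != nil {
		return err
	}
	return callback(blk.RawData())
}
```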

blockstore/autobatch.go: 5 more resolved review threads (outdated)
require.NoError(t, ab.Put(ctx, b1))
require.NoError(t, ab.Put(ctx, b2))

ab.Flush(ctx)
Contributor:

Flush isn't implemented in autobatch.

arajasek (Contributor, Author):

Yeah, this will get updated. Thanks!

	default:
		autolog.Errorf("FLUSH ERRORED: %w, retrying in %v", putErr, bs.flushRetryDelay)
		time.Sleep(bs.flushRetryDelay)
		putErr = bs.doFlush(bs.flushCtx)
Contributor:

It seems weird to me to retry putting it -- what are the possible causes of an error that goes away?

arajasek (Contributor, Author):

@Stebalien thinks that could happen if the system is under stress -- with some backoff it may succeed?

Member:

Maybe it's not an issue with badger? But my thinking was:

  1. Transient error: retry will help.
  2. Non-transient error: we can't write anything else anyway, so we might as well retry.
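
A hedged sketch of the retry shape being discussed; doFlush, flushRetryDelay, and autolog mirror the snippet above, but the loop structure is illustrative, not the merged code:

```go
// Retry doFlush until it succeeds: a transient error may clear after a
// backoff, and on a persistent error nothing else can be written anyway.
putErr := bs.doFlush(bs.flushCtx)
for putErr != nil {
	autolog.Errorf("FLUSH ERRORED: %v, retrying in %v", putErr, bs.flushRetryDelay)
	time.Sleep(bs.flushRetryDelay)
	putErr = bs.doFlush(bs.flushCtx)
}
```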

Comment on lines +61 to +63
_, ok := bs.addedCids[blk.Cid()]
if !ok {
	bs.addedCids[blk.Cid()] = struct{}{}
Contributor:

What's the hit rate on this?

arajasek (Contributor, Author):

Not measured, but I expect it to be quite high -- think of the number of times we'll try to Put the empty deadline object.

Contributor:

Yeah, I'm just wondering if it's worth the added memory use (though if it's insignificant, this should be OK to keep even if it doesn't help that much).

arajasek (Contributor, Author):

Sounds good -- I do have a TODO to drop it if the memory use is a problem; we'll see in experiments.
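
For illustration, the deduplication under discussion amounts to roughly this inside Put, assuming addedCids is a map[cid.Cid]struct{} guarded by the store's lock (the pending buffer field is hypothetical):

```go
// Skip blocks we have already buffered or flushed; hot objects like the
// empty deadline then cost one map lookup per repeated Put instead of a
// buffered write.
if _, ok := bs.addedCids[blk.Cid()]; !ok {
	bs.addedCids[blk.Cid()] = struct{}{}
	bs.pending = append(bs.pending, blk) // hypothetical buffer field
}
```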

Comment on lines +130 to +134
// may seem backward to check the backingBs first, but that is the likeliest case
blk, err := bs.backingBs.Get(ctx, c)
if err == nil {
	return blk, nil
}
Contributor:

If we check backingBs, isn't this potentially racy with the way we do locks here? If we really want to avoid locking, we probably want to check backingBs once more at the end.

arajasek (Contributor, Author):

That's a good point -- I'm not super concerned about it, because for the migration we should actually never try to Get anything we've Put... but a second check of the bs at the end makes sense.

Member:

Avoiding taking the lock likely isn't worth it. If it's a problem, we could always use a read/write lock.

arajasek (Contributor, Author):

Will try and report back with perf.

arajasek (Contributor, Author):

Looks like this slowed us down -- trying again without to confirm.

arajasek (Contributor, Author):

Confirmed: locking in there slows us down from 870 migrations per second to 350.

Member:

You kind of need those locks, but they can be rwlocks.

Member:

And you mean the locking in this function? Shouldn't we never hit those locks?

arajasek (Contributor, Author):

If we want to lock to avoid raciness, it has to be at the very top of the method (so it always gets hit).
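
A hedged sketch of the Get shape the thread converges on: check the backing store first (the common case), then the in-memory buffer, then re-check the backing store to close the race where a flush lands between the two checks. lookupBuffered is a hypothetical helper over the pending batches, not the merged code:

```go
func (bs *AutobatchBlockstore) Get(ctx context.Context, c cid.Cid) (blocks.Block, error) {
	// likeliest case: the block has already been flushed to the backing store
	blk, err := bs.backingBs.Get(ctx, c)
	if err == nil {
		return blk, nil
	}
	if blk, ok := bs.lookupBuffered(c); ok { // hypothetical helper
		return blk, nil
	}
	// a flush may have completed between the checks above; look once more
	return bs.backingBs.Get(ctx, c)
}
```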

blockstore/autobatch.go: 4 more resolved review threads (1 outdated)

blockstore/autobatch.go: resolved review thread (outdated)

func (bs *AutobatchBlockstore) Shutdown(ctx context.Context) error {
	// shutdown the flush worker
	bs.shutdownCh <- struct{}{}
Member:

Hm. It's unfortunate that this will block indefinitely if we shut down twice.
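
One way to make Shutdown idempotent, sketched under the assumption of a sync.Once field on the store (this is not the merged fix):

```go
func (bs *AutobatchBlockstore) Shutdown(ctx context.Context) error {
	bs.shutdownOnce.Do(func() { // hypothetical sync.Once field
		// closing (rather than sending on) the channel means a second
		// Shutdown call returns immediately instead of blocking forever
		close(bs.shutdownCh)
	})
	return nil
}
```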

@Stebalien (Member) left a comment:

We can iterate on master.


@arajasek arajasek merged commit b161f56 into master Jan 12, 2022
@arajasek arajasek deleted the asr/migration-autobatch branch January 12, 2022 22:17
@arajasek arajasek mentioned this pull request Oct 19, 2022
Development

Successfully merging this pull request may close these issues.

v7 migration integration
6 participants