
feat: state: Fast migration for v15 #7933

Merged: 10 commits merged into master from asr/migration-autobatch on Jan 12, 2022
Conversation

@arajasek (Contributor) commented Jan 11, 2022

Related Issues

fixes #7870

Proposed Changes

The chief problem is that we can't hold the entire migrated state in memory (too much of it changes), but the migration also needs to be fast enough that the premigration doesn't take too long and the migration itself (ideally) doesn't lead to null blocks.

The newly added autobatch store achieves this, with a big caveat: it can take a very long time to Get things that have not yet been Flushed, and deletion is not supported at all.
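
For context, here is a minimal sketch of the autobatch idea. It is illustrative only; the type, field, and interface names below are assumptions, not the merged blockstore/autobatch.go:

```go
package blockstore

import (
	"context"
	"sync"

	blocks "github.com/ipfs/go-block-format"
	cid "github.com/ipfs/go-cid"
)

// Backing abstracts the on-disk store (e.g. badger); hypothetical minimal interface.
type Backing interface {
	Get(ctx context.Context, c cid.Cid) (blocks.Block, error)
	PutMany(ctx context.Context, blks []blocks.Block) error
}

// AutobatchSketch buffers Puts in memory and lets a background worker
// flush batches to the backing store, so the migration never holds the
// whole migrated state in memory at once.
type AutobatchSketch struct {
	backing Backing

	lock    sync.Mutex
	pending []blocks.Block // buffered, not yet flushed
	size    int            // bytes buffered so far
	cap     int            // flush threshold in bytes
	flushCh chan struct{}  // cap 1; nudges the flush worker
}

// Put is cheap: append to the in-memory batch and signal the worker
// once the batch crosses the size threshold. Get on an unflushed block
// would have to scan pending, which is why unflushed Gets are slow.
func (bs *AutobatchSketch) Put(ctx context.Context, blk blocks.Block) error {
	bs.lock.Lock()
	bs.pending = append(bs.pending, blk)
	bs.size += len(blk.RawData())
	full := bs.size >= bs.cap
	bs.lock.Unlock()
	if full {
		select {
		case bs.flushCh <- struct{}{}:
		default: // a flush is already queued
		}
	}
	return nil
}
```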

Additional Info

Checklist

Before you mark the PR ready for review, please make sure that:

  • All commits have a clear commit message.
  • The PR title is in the form of <PR type>: <area>: <change being made>
    • example: fix: mempool: Introduce a cache for valid signatures
    • PR type: fix, feat, INTERFACE BREAKING CHANGE, CONSENSUS BREAKING, build, chore, ci, docs, misc, perf, refactor, revert, style, test
    • area: api, chain, state, vm, data transfer, market, mempool, message, block production, multisig, networking, paychan, proving, sealing, wallet
  • This PR has tests for new functionality or change in behaviour
  • If new user-facing features are introduced, clear usage guidelines and/or documentation updates should be included in https://lotus.filecoin.io or Discussion Tutorials.
  • CI is green

@arajasek arajasek requested a review from a team as a code owner January 11, 2022 22:23
@arajasek arajasek changed the title Fast migration for v15 feat: state: Fast migration for v15 Jan 11, 2022
blockstore/autobatch.go: resolved review thread (outdated)
@codecov bot commented Jan 11, 2022

Codecov Report

Merging #7933 (3464dc2) into master (207d33e) will increase coverage by 0.21%.
The diff coverage is 43.78%.


@@            Coverage Diff             @@
##           master    #7933      +/-   ##
==========================================
+ Coverage   39.12%   39.33%   +0.21%     
==========================================
  Files         656      658       +2     
  Lines       70924    71112     +188     
==========================================
+ Hits        27746    27971     +225     
+ Misses      38382    38321      -61     
- Partials     4796     4820      +24     
| Impacted Files | Coverage Δ |
|---|---|
| chain/actors/builtin/paych/message5.go | 0.00% <0.00%> (ø) |
| chain/actors/builtin/paych/message6.go | 0.00% <0.00%> (ø) |
| cmd/lotus-shed/main.go | 0.00% <0.00%> (ø) |
| cmd/lotus-shed/migrations.go | 0.00% <0.00%> (ø) |
| chain/consensus/filcns/upgrades.go | 33.82% <16.66%> (+2.29%) ⬆️ |
| blockstore/autobatch.go | 61.40% <61.40%> (ø) |
| chain/actors/builtin/paych/message7.go | 86.66% <100.00%> (ø) |
| chain/actors/builtin/paych/v7.go | 72.41% <100.00%> (+7.96%) ⬆️ |
| chain/stmgr/forks.go | 46.84% <100.00%> (ø) |
| chain/stmgr/stmgr.go | 64.94% <100.00%> (ø) |
| ... and 38 more | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 207d33e...3464dc2.

@arajasek force-pushed the asr/migration-autobatch branch from bc5337d to 7559e43 on January 12, 2022 01:30
blockstore/autobatch.go: resolved review thread (outdated)
}

func (bs *AutobatchBlockstore) View(ctx context.Context, cid cid.Cid, callback func([]byte) error) error {
	return xerrors.New("unsupported")
Member:

I'd implement this, even if we just call get under the covers.

arajasek (Contributor, Author):

I might just drop the methods -- we don't actually need this type to implement Lotus's blockstore interface.
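
A minimal sketch of the reviewer's suggestion, delegating to the type's existing Get (parameter name changed to avoid shadowing the cid package; this is not the merged code):

```go
// View implemented "under the covers" via Get, as suggested above.
func (bs *AutobatchBlockstore) View(ctx context.Context, c cid.Cid, callback func([]byte) error) error {
	blk, err := bs.Get(ctx, c)
	if err != nil {
		return err
	}
	return callback(blk.RawData())
}
```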

blockstore/autobatch.go: 5 more resolved review threads (outdated)
require.NoError(t, ab.Put(ctx, b1))
require.NoError(t, ab.Put(ctx, b2))

ab.Flush(ctx)
Contributor:

Flush isn't implemented in autobatch.

arajasek (Contributor, Author):

Yeah, this will get updated. Thanks!

	default:
		autolog.Errorf("FLUSH ERRORED: %w, retrying in %v", putErr, bs.flushRetryDelay)
		time.Sleep(bs.flushRetryDelay)
		putErr = bs.doFlush(bs.flushCtx)
Contributor:

It seems weird to me to retry putting it -- what are the possible causes of an error that goes away?

arajasek (Contributor, Author):

@Stebalien thinks that could happen if the system is under stress -- with some backoff it may succeed?

Member:

Maybe it's not an issue with badger? But my thinking was:

  1. Transient error: retry will help.
  2. Non-transient error: we can't write anything else anyway, so we might as well retry.
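
A hedged sketch of the retry shape being discussed; doFlush, flushRetryDelay, and autolog mirror the snippet above, but the loop structure is illustrative, not the merged code:

```go
// Retry doFlush until it succeeds: a transient error may clear after a
// backoff, and on a persistent error nothing else can be written anyway.
putErr := bs.doFlush(bs.flushCtx)
for putErr != nil {
	autolog.Errorf("FLUSH ERRORED: %v, retrying in %v", putErr, bs.flushRetryDelay)
	time.Sleep(bs.flushRetryDelay)
	putErr = bs.doFlush(bs.flushCtx)
}
```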

Comment on lines +61 to +63
_, ok := bs.addedCids[blk.Cid()]
if !ok {
	bs.addedCids[blk.Cid()] = struct{}{}
Contributor:

What's the hit rate on this?

arajasek (Contributor, Author):

Not measured, but I expect it to be quite high -- think of the number of times we'll try to Put the empty deadline object.

Contributor:

Yeah, I'm just wondering if it's worth the added memory use (though if it's insignificant, this should be OK to keep even if it doesn't help that much).

arajasek (Contributor, Author):

Sounds good -- I do have a TODO to drop it if the memory use is a problem; we'll see in experiments.
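
For illustration, the deduplication under discussion amounts to roughly this inside Put, assuming addedCids is a map[cid.Cid]struct{} guarded by the store's lock (the pending buffer field is hypothetical):

```go
// Skip blocks we have already buffered or flushed; hot objects like the
// empty deadline then cost one map lookup per repeated Put instead of a
// buffered write.
if _, ok := bs.addedCids[blk.Cid()]; !ok {
	bs.addedCids[blk.Cid()] = struct{}{}
	bs.pending = append(bs.pending, blk) // hypothetical buffer field
}
```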

Comment on lines +130 to +134
// may seem backward to check the backingBs first, but that is the likeliest case
blk, err := bs.backingBs.Get(ctx, c)
if err == nil {
	return blk, nil
}
Contributor:

If we check backingBs, isn't this potentially racy with the way we do locks here? If we really want to avoid locking, we probably want to check backingBs once more at the end.

arajasek (Contributor, Author):

That's a good point -- I'm not super concerned about it, because for the migration we should actually never try to Get anything we've Put... but a second check of the bs at the end makes sense.

Member:

Avoiding taking the lock likely isn't worth it. If it's a problem, we could always use a read/write lock.

arajasek (Contributor, Author):

Will try and report back with perf.

arajasek (Contributor, Author):

Looks like this slowed us down -- trying again without to confirm.

arajasek (Contributor, Author):

Confirmed: locking in there slows us down from 870 migrations per second to 350.

Member:

You kind of need those locks, but they can be rwlocks.

Member:

And you mean the locking in this function? Shouldn't we never hit those locks?

arajasek (Contributor, Author):

If we want to lock to avoid raciness, it has to be at the very top of the method (so it always gets hit).
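
A hedged sketch of the Get shape the thread converges on: check the backing store first (the common case), then the in-memory buffer, then re-check the backing store to close the race where a flush lands between the two checks. lookupBuffered is a hypothetical helper over the pending batches, not the merged code:

```go
func (bs *AutobatchBlockstore) Get(ctx context.Context, c cid.Cid) (blocks.Block, error) {
	// likeliest case: the block has already been flushed to the backing store
	blk, err := bs.backingBs.Get(ctx, c)
	if err == nil {
		return blk, nil
	}
	if blk, ok := bs.lookupBuffered(c); ok { // hypothetical helper
		return blk, nil
	}
	// a flush may have completed between the checks above; look once more
	return bs.backingBs.Get(ctx, c)
}
```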

blockstore/autobatch.go: 4 more resolved review threads (1 outdated)

blockstore/autobatch.go: resolved review thread (outdated)

func (bs *AutobatchBlockstore) Shutdown(ctx context.Context) error {
	// shutdown the flush worker
	bs.shutdownCh <- struct{}{}
Member:

Hm. It's unfortunate that this will block indefinitely if we shut down twice.
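
One way to make Shutdown idempotent, sketched under the assumption of a sync.Once field on the store (this is not the merged fix):

```go
func (bs *AutobatchBlockstore) Shutdown(ctx context.Context) error {
	bs.shutdownOnce.Do(func() { // hypothetical sync.Once field
		// closing (rather than sending on) the channel means a second
		// Shutdown call returns immediately instead of blocking forever
		close(bs.shutdownCh)
	})
	return nil
}
```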

@Stebalien (Member) left a comment:

We can iterate on master.


@arajasek arajasek merged commit b161f56 into master Jan 12, 2022
@arajasek arajasek deleted the asr/migration-autobatch branch January 12, 2022 22:17
@arajasek arajasek mentioned this pull request Oct 19, 2022
Development

Successfully merging this pull request may close these issues.

v7 migration integration
6 participants