Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Moving GC for badger #6854

Merged
merged 35 commits into from
Jul 27, 2021
Merged

Moving GC for badger #6854

merged 35 commits into from
Jul 27, 2021

Conversation

vyzo
Copy link
Contributor

@vyzo vyzo commented Jul 23, 2021

Cherry-picked from #6728.

This adds the option to perform moving GC on a badger blockstore; this effectively reclaims all space.
It is also hooked into the splitstore, to perform moving GC once every 20 compactions (about once a week, user configurable).

Rationale: badger supports online GC, which is quite fast but ultimately ineffective at reclaiming space. It has been observed that the hotstore size slowly creeps up, despite frequent garbage collections.
This addresses the situation with real moving GC.

@vyzo vyzo requested review from Stebalien and raulk July 23, 2021 20:17
@vyzo vyzo requested a review from a team as a code owner July 23, 2021 20:17
@vyzo vyzo added epic/splitstore team/ignite Issues and PRs being tracked by Team Ignite at Protocol Labs labels Jul 23, 2021
@vyzo
Copy link
Contributor Author

vyzo commented Jul 23, 2021

dunno what's up with the gen-check failure, there is no information there.

@magik6k
Copy link
Contributor

magik6k commented Jul 23, 2021

dunno what's up with the gen-check failure, there is no information there.

When you expand the second-to-last thing on circle it shows you the diff. The usual solution is to just run make gen which should regenerate whatever it's complaining about (in this case config doc-gen stuff which just landed in #6848 - so just running make cfgdoc-gen should also be enough, and faster)

node/config/types.go Outdated Show resolved Hide resolved
@vyzo
Copy link
Contributor Author

vyzo commented Jul 24, 2021

Effectiveness of moving GC:

Before moving GC, after about a week with online GC in every compaction:

132G    ./datastore/splitstore/hot.badger

An hour after moving GC:

57G     ./datastore/splitstore/hot.badger.1627141269

@vyzo vyzo force-pushed the feat/moving-gc branch from e597f7c to 4a9c498 Compare July 25, 2021 06:00
@vyzo
Copy link
Contributor Author

vyzo commented Jul 25, 2021

rebased on master.

vyzo added 7 commits July 27, 2021 10:27
we can't really continue and leave a ticking bomb for the next restart; the user might not see it.
it was there to support a potential CopyTo interface; but we'll cross that bridge when we get there.
@vyzo
Copy link
Contributor Author

vyzo commented Jul 27, 2021

Changes per review comments (quite a few!):

  • Updated the interface to use functional options.
  • Renamed MovingGC to FullGC in the nomenclature.
  • Removed the target path option and made the new path resolve links.
  • Dropped the filter from the copy; we are not currently using it anyway.
  • Implemented the copy using the streaming interface.
  • Made path manipulation failures that require action panic; that seems to be the only sane way to proceed.
  • Made the state variables typed.
  • Renamed db2 to dbNext in the data structure.
  • Used EvalSymlinks where appropriate

@vyzo
Copy link
Contributor Author

vyzo commented Jul 27, 2021

ok, this is ready for the next round of review; in the meantime I'll deploy it in my discard node to get some timings and real-world burn.

@vyzo
Copy link
Contributor Author

vyzo commented Jul 27, 2021

Timings are looking really good:

2021-07-27T11:31:33.039Z        INFO    splitstore      splitstore/splitstore_gc.go:30  garbage collecting hotstore done        {"took": 535.387156281}

previously it was taking about 40min; that's 4.5 times faster.

@jennijuju jennijuju added this to the 1.11.1 milestone Jul 27, 2021
@jennijuju jennijuju added P1 P1: Must be resolved P2 P2: Should be resolved and removed P1 P1: Must be resolved labels Jul 27, 2021
@jennijuju
Copy link
Member

jennijuju commented Jul 27, 2021

We really want this in, however, it's not a hard blocker for v1.11.1

@vyzo vyzo merged commit 6751185 into master Jul 27, 2021
@vyzo vyzo deleted the feat/moving-gc branch July 27, 2021 20:27
@vyzo vyzo restored the feat/moving-gc branch July 27, 2021 20:27
@vyzo vyzo deleted the feat/moving-gc branch July 27, 2021 20:27
Copy link
Member

@Stebalien Stebalien left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly naming comments but also some questions around file names.

TL;DR: When I have 2 databases and 4 different paths (old, new, link target, backup), the names really matter.

blockstore/badger/blockstore.go Show resolved Hide resolved
blockstore/badger/blockstore.go Show resolved Hide resolved
blockstore/badger/blockstore.go Show resolved Hide resolved
blockstore/badger/blockstore.go Show resolved Hide resolved
blockstore/badger/blockstore.go Show resolved Hide resolved
panic(fmt.Errorf("error renaming old badger db dir from %s to %s: %w; USER ACTION REQUIRED", dbpath, oldpath, err)) //nolint
}

if err = os.Symlink(path, dbpath); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if I wasn't using symlinks?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is that a problem?
The issue with not using symlinks is moving across filesystems, which can take a long time.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically, what if I had the database at "foo". Now I'll end up with a symlink at "foo" pointing to a random directory.

If I started out with plain directories, I'd rather end up with plain directories. It's not critical, but the current version is going to confuse users.

  1. They initialize their repo.
  2. They see where their hotstore is.
  3. The hotstore gets moved and replaced by a symlink.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrm, it complicates things to try and keep it symlink-free, as we'll need two versions of things.

We should probably be clear to users that the splitstore owns the splitstore directory; they shouldn't muck with it.
The coldstore is another matter though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough.

dbpath := b.opts.Dir
oldpath := fmt.Sprintf("%s.old.%d", dbpath, time.Now().Unix())

if err = os.Rename(dbpath, oldpath); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't this just rename the symlink? Don't we need to move the actual file?

We need integration tests for this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see... we are renaming the symlink intentionally. Why not just delete it then delete linkPath?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, it's fine -- we'll delete deep in deleteDB which follows symlinks.

Also, I don't think we need an integration test for this, we can test pretty much everything with unit tests as the changes are pretty self-contained.

What do you want to add to the unit test?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, it's fine -- we'll delete deep in deleteDB which follows symlinks.

It's fine, but a bit weird.

What do you want to add to the unit test?

The unit test is fine. I thought there was a bug here which would have been caught by a basic test. I revisited this because I then did look at the test, and it clearly checked this case.

Your tests are good.

@Stebalien
Copy link
Member

Other feedback: The renaming logic with the symlink dance is strange. We should be preserving symlinks if the user is using them, and not using symlinks if the user isn't already using them.

  1. Not all filesystems support them.
  2. Are we sure we're not creating absolute symlinks? That could cause trouble when moving repos.

@vyzo
Copy link
Contributor Author

vyzo commented Jul 28, 2021

Not all filesystems support them.

Wait, is there a realistic situation where lotus would run in a filesystem that doesn't support symlinks?
Even windows support them these days.

Are we sure we're not creating absolute symlinks? That could cause trouble when moving repos.

Gah, you are right about this. We are creating absolute links, I'll fix in follow up.

@Stebalien
Copy link
Member

Wait, is there a realistic situation where lotus would run in a filesystem that doesn't support symlinks?

Well... probably not. Given that NFS does. Although someone probably does it anyways...

Sorry, I've had some recent encounters with symlinks.

@vyzo vyzo mentioned this pull request Jul 28, 2021
@vyzo
Copy link
Contributor Author

vyzo commented Jul 28, 2021

Follw up in #6905

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P2 P2: Should be resolved team/ignite Issues and PRs being tracked by Team Ignite at Protocol Labs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants