Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Size split the local store #18048

Closed
stuhood opened this issue Jan 20, 2023 · 0 comments · Fixed by #18153
Closed

Size split the local store #18048

stuhood opened this issue Jan 20, 2023 · 0 comments · Fixed by #18153
Assignees

Comments

@stuhood
Copy link
Member

stuhood commented Jan 20, 2023

Currently, all blobs (no matter their size) are stored in LMDB. But that is not necessarily optimal for larger blobs.


To resolve this, the local::Store could be split at a size threshold, such that small files continued to be stored in LMDB, but large files were stored directly as content addressed files (in storage similar to our existing immutable inputs storage).

This would involve (in no particular order):

  • introducing a store directory for large files, with a layout similar/identical to immutable inputs
  • adding a size threshold in Store to switch between small/large strategies.
  • adjusting Store::store_file (and the corresponding local::Store method) to take a file path
    • In the case of an immutable / destructive capture of a large file, the file could be digested in place, and then moved (or copied, if on another filesystem) into the store.
  • implementing Store::materialize_file by directly hardlinking where possible.
@thejcannon thejcannon self-assigned this Feb 1, 2023
huonw added a commit that referenced this issue Feb 12, 2023
This fixes #17065 by having remote cache loads be able to be streamed to
disk. In essence, the remote store now has a `load_file` method in addition to
`load_bytes`, and thus the caller can decide to download to a file instead.

This doesn't make progress towards #18048 (this PR doesn't touch the local
store at all), but I think it will help with integrating the remote store with
that code: in theory the `File` could be provided in a way that can be part of
the "large file pool" directly (and indeed, the decision about whether to
download to a file or into memory ties into that).

This also does a theoretically unnecessary extra pass over the data (as
discussed in #18231) to verify the digest, but I think it'd make sense to do
that as a future optimisation, since it'll require refactoring more deeply
(down into `sharded_lmdb` and `hashing`, I think) and is best to build on
#18153 once that lands.
stuhood added a commit that referenced this issue Mar 23, 2023
…n sandboxes (#18153)

Fixes #18048 by having `store` code now store/load from a different root
if the file size is above some threshold. Also fixes #18231 by verifying
content while downloading.

`materialize_file` now hardlinks (falling back to copying when the store
and workdirs are on separate filesystems) directly from this RO on-disk
file if the file doesn't need to be mutable. This means that
`--no-pantsd` and daemon restarts get the benefits of linking.

---------

Co-authored-by: Stu Hood <stuhood@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants