Upload large files directly from disk, without loading into memory #19049
Hmmm, I wonder if the use of an mmap'd file in async code is also an async hazard: I'm imagining memory accesses to pages of the file that aren't yet in memory will trigger effectively a sync read (i.e. the page fault will deschedule the whole OS thread, including the Rust-level async runtime, rather than letting the runtime work on another coroutine...).
huonw added a commit that referenced this issue on Jul 12, 2023:
This does preparatory refactoring towards #11149 (and probably also #19049), by adjusting `store::remote::ByteStore` in a few ways to make it easier to plop in new 'providers':

- package the various options for creating a provider into a struct, plus a bunch of mechanical refactoring to use that struct
- improved tests:
  - explicit tests for the REAPI provider
  - more extensive exercising of the `ByteStore` interface, including extra tests for combinations like "`load_file` when digest is missing", and (to me) more logical test names
  - testing the 'normal' `ByteStore` interface via a fully in-memory simple `Provider` instead of needing to run the `StubCAS`
  - testing some more things at lower levels, like the REAPI provider doing hash verification, the `LoadDestination::reset` methods doing the right thing, and `ByteStore::store_buffered`

The commits are mostly individually reviewable, although I'd recommend having a sense of the big picture as described above before going through them one-by-one.

After this, the next steps towards #11149 will be:

1. do something similar for the action cache
2. implement new providers, with some sort of factory function for going from `RemoteOptions` to an appropriate `Arc<dyn ByteStoreProvider + 'static>`
3. expose settings to select and configure those new providers
huonw added a commit that referenced this issue on Sep 12, 2023:
…9711) This (hopefully) optimises storing large blobs to a remote cache, by streaming them directly from the file stored on disk in the "FSDB". This builds on the FSDB local store work (#18153), relying on large objects being stored as an immutable file on disk, in the cache managed by Pants.

This is an optimisation in several ways:

- Cutting out an extra temporary file:
  - Previously `Store::store_large_blob_remote` would load the whole blob from the local store and then write it to a temporary file. This was appropriate with LMDB-backed blobs.
  - With the new FSDB, there's already a file that can be used, so the temporary file and its creation/writing overhead can be eliminated.
- Reducing sync IO in async tasks, due to mmap:
  - Previously `ByteStore::store_buffered` would take that temporary file and mmap it, to be able to slice into `Bytes` more efficiently... except this is secretly blocking/sync IO, happening within async tasks (AIUI: when accessing a mmap'd byte that's only on disk, not yet in memory, the whole OS thread is blocked/descheduled while the OS pulls the relevant part of the file into memory, i.e. `tokio` can't run another task on that thread).
  - The new approach uses normal `tokio` async IO mechanisms to read the file, and thus hopefully has higher concurrency.
  - (This also eliminates the unmaintained `memmap` dependency.)

I haven't benchmarked this, though. My main motivation is firming up the provider API before adding new byte store providers, for #11149. This also resolves some TODOs and even eliminates some `unsafe`, yay! The commits are individually reviewable.

Fixes #19049, fixes #14341 (`memmap` removed), closes #17234 (solves the same problem but with an approach that wasn't possible at the time).
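The streaming pattern described above can be sketched in a few lines. This is a hypothetical, simplified sync version using only the standard library (the actual change uses `tokio`'s async file IO, and the function name `stream_blob` is illustrative, not the Pants API): copy the blob from disk to the upload sink in fixed-size chunks, so peak memory stays at one chunk rather than the whole blob.

```rust
use std::fs::File;
use std::io::{Read, Write};
use std::path::Path;

/// Illustrative sketch: stream a file into `sink` one chunk at a time,
/// returning the total number of bytes copied. Memory use is bounded by
/// `chunk_size`, regardless of how large the blob on disk is.
fn stream_blob(path: &Path, chunk_size: usize, sink: &mut impl Write) -> std::io::Result<u64> {
    let mut file = File::open(path)?;
    let mut buf = vec![0u8; chunk_size];
    let mut total = 0u64;
    loop {
        let n = file.read(&mut buf)?;
        if n == 0 {
            break; // EOF reached
        }
        sink.write_all(&buf[..n])?;
        total += n as u64;
    }
    Ok(total)
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("example_blob.bin");
    std::fs::write(&path, vec![7u8; 10_000])?;
    let mut sink = Vec::new();
    let total = stream_blob(&path, 4096, &mut sink)?;
    assert_eq!(total, 10_000);
    assert_eq!(sink.len(), 10_000);
    Ok(())
}
```

In the async version, `File::open`/`read` become `tokio::fs::File::open` and `AsyncReadExt::read`, so a not-yet-cached chunk suspends only the current task instead of blocking the whole worker thread as an mmap page fault would.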
Is your feature request related to a problem? Please describe.
With #18153, large files produced are now cached as standalone files, on disk, rather than in the core LMDB structures. Uploading to a remote cache should be able to pipe these directly from that location on disk.
Currently they're pulled into memory, written to a temporary file (with an additional bug: it's sync IO in async code), and then that temporary file is uploaded via mmap. Breadcrumbs, in `src/rust/engine/fs/store`:

- `store::Store::ensure_remote_has_recursive` calls `store::Store::store_large_blob_remote`
- `store_large_blob_remote` calls `store::remote::ByteStore::store_buffered`
- `store_buffered` creates temporary files... `store_large_blob_remote` is provided with a `std::fs::File` for that temporary file

Describe the solution you'd like

`store::Store::ensure_remote_has_recursive` should pass a file handle into the remote cache, and that's manipulated/uploaded directly, without going through memory. The following code could be adjusted to just decide whether the blob is on disk (if so, go via the disk) or is in LMDB (if so, just pull into memory), resolving the TODO by removing the linking to wire chunk size:

pants/src/rust/engine/fs/store/src/lib.rs
Lines 831 to 839 in 4c708a0
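The proposed disk-vs-LMDB decision could be modelled with a small enum. This is only a sketch of the idea; the type and function names here are hypothetical, not the actual Pants engine API:

```rust
use std::path::PathBuf;

/// Hypothetical model of where a blob lives: large blobs are standalone
/// FSDB files and should be streamed from disk; small blobs live in LMDB
/// and can simply be pulled into memory.
enum BlobSource {
    OnDisk(PathBuf),   // FSDB: upload directly from the file on disk
    InMemory(Vec<u8>), // LMDB: small enough to hold in memory
}

/// Illustrative dispatch on the blob's location, standing in for the
/// upload-path decision described above.
fn describe(source: &BlobSource) -> String {
    match source {
        BlobSource::OnDisk(path) => format!("stream from {}", path.display()),
        BlobSource::InMemory(bytes) => format!("upload {} in-memory bytes", bytes.len()),
    }
}

fn main() {
    let small = BlobSource::InMemory(vec![0u8; 512]);
    let large = BlobSource::OnDisk(PathBuf::from("/cache/fsdb/abc123"));
    assert_eq!(describe(&small), "upload 512 in-memory bytes");
    assert_eq!(describe(&large), "stream from /cache/fsdb/abc123");
}
```

Dispatching on the source like this removes the need to tie the in-memory/on-disk split to the wire chunk size, which is the TODO the issue mentions.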
Describe alternatives you've considered
N/A
Additional context
Add any other context or screenshots about the feature request here.