New File System Implementation #15041
Thanks @ethanalee ! I'd add (no need to actually edit the top comment). |
I'd like to help however I can in moving this forward. |
@hcldan Help would certainly be appreciated here! For now, it's still early for writing code, but we've been reading. Another issue is testing. Reviewing the current tests and seeing if there is any lack of coverage would be good, though that might not be easy to do. One possible way to combine this with the previous point is to find interesting/important areas in |
Emscripten's FS shouldn't require multithreading though. |
Yes, definitely! The single-threaded case should continue to behave as it has been. Perhaps it will be smaller, but that's about it. The multithreaded case should get a lot faster (by avoiding constantly proxying to the main thread). |
I've been looking at this. It looks like FS is a MEMFS that lives in the main browser thread, and we rely on these syncfs calls to keep the data current in the MEMFS as it's stored in the indexeddb. But that seems to only take care of persistence, and not actually offload the storage to disk and out of memory... am I correct here? Right now the system calls in emscripten proxy to the main UI thread (with pthreads)... that means these calls are already async... Where are they async? Can we push that async interface down to the FS layer... and then perhaps shim in the same async adaptation that's being done for proxying in the case where it's a single threaded app? I started all this because the SF APIs don't have sync counterparts in the main browser thread, and I can plug in the async SF APIs inside the sidebar: also looking for help understanding this module architecture... any good place to start looking? |
I think I found some of the answers to my question here: https://github.com/emscripten-core/emscripten/blob/main/system/lib/pthread/library_pthread.c#L485 |
Would always proxying fs calls deadlock a single-threaded program? I don't know how this wait works. I'm having a hard time grokking the pthread code that looks at __proxy. It's a lot of code writing code. |
As far as requirements go:
|
I actually think it might be a nice IDBFS enhancement to leave all file metadata in IDBFS and put file content in SF (if available). But we would need to eliminate syncfs and the MEMFS copy of FS data and that would mean needing an async FS layer. |
I think that's not accurate. The filesystem operations are proxied synchronously to the main thread. We wait until they finish, so that they complete (write to memory, etc.), and also because we need the return value from them. Any proxying is not good, but synchronous proxying is especially bad, and proxying to the main thread is also very bad. The new version should avoid all that in most cases. Most of the code and metadata (directory structure) could be in wasm, and so it's accessible from all threads. Actual file data might be present in JS on a particular thread, and proxied to, but we can cache that in memory as well. |
I understand the distinction, but I call this async because we could simply wait for a callback instead of waiting for it to finish (the cb sets the result). In any case, I agree... we don't want to be proxying if we can avoid it, and we don't want to have to run this stuff on the main thread. idbfs, even in a worker, is going to need some sort of async API, right? |
I actually think that syncfs will need to remain in some form. This is for persistence usage. Think of this as writing to disk in a traditional file system. Persisting data would be on-demand, as requested by the user-level application code. |
Oh, I see, sorry for misunderstanding you. Yes, an async operation can happen in the thread that is proxied to. (It won't always be async, if the API it calls is sync, but in general async is necessary to support.) |
Why, when you can write directly to a file with StorageFoundation? This only really seems necessary because FS is sync and not all mount drivers can be.
I think writing to the filesystem in a traditional file system is a good indication that the data should be persisted. Keeping a copy of the FS in memory is absolutely horrible for our use cases. It's a non-starter. |
FS operations for an async backend would need the data to be present so it can be synchronously read. For example, you want all the data for video game graphics available to be read synchronously instead of waiting asynchronously. However, persisting to a file system could be async, since it does not matter to the user when that action completes. |
Not necessarily, we just discussed an approach currently in use with PTHREAD support to proxy (async) a sync api and wait for the result. |
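For readers unfamiliar with the mechanism being referenced here, below is a minimal sketch of proxying a synchronous call to another thread and blocking on the result, assuming Emscripten's emscripten/proxying.h API is available; the read_work_t struct and do_read function are illustrative stand-ins, not part of any real FS backend.

```c
// Minimal sketch: proxy a synchronous "read" to another thread and
// block until it completes. Assumes <emscripten/proxying.h>; the
// target thread must periodically run its proxying queue (e.g. by
// returning to its event loop) for queued work to execute.
#include <emscripten/proxying.h>
#include <pthread.h>

typedef struct {
  const char* path;  // file to read (illustrative)
  char* buf;         // destination buffer
  int result;        // bytes read, filled in by the target thread
} read_work_t;

static void do_read(void* arg) {
  read_work_t* work = (read_work_t*)arg;
  // A real backend would perform the (possibly async) I/O here and
  // only finish once work->result is set.
  work->result = 0;
}

// Called from any thread: queues do_read on fs_thread and blocks
// until it completes, so the caller sees a plain synchronous call.
int proxied_read(pthread_t fs_thread, const char* path, char* buf) {
  read_work_t work = { path, buf, -1 };
  em_proxying_queue* q = emscripten_proxy_get_system_queue();
  if (!emscripten_proxy_sync(q, fs_thread, do_read, &work)) {
    return -1;  // proxying failed (e.g. target thread exited)
  }
  return work.result;
}
```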
Google's behavior for PWA experience is pretty clear in what they expect (the browser could go away at any time). |
Agreed. But that is a problem in the domain of that application, not the FS, I would argue. |
It's not the application that requires synchronous reads... it's the low-level POSIX API that we are implementing. We do have ASYNCIFY that can come to the rescue, but I'm not sure we can ask all filesystem users to use ASYNCIFY. |
@sbc100 I understand that it's the low-level POSIX API. I meant that if a game application needs FS data in memory, that's up to the game application (which would likely be multithreaded anyway). I just want to avoid holding this data in memory... and I'd like to support the existing APIs that have no sync options (idbfs), but if everyone here is unwilling to make them use ASYNCIFY for it, I guess we are at an impasse. Hopefully with the FS in a worker thread instead we can proxy calls synchronously, make good use of the SF APIs, get much better performance characteristics, and leave the idbfs driver seldom used. If we decide to keep syncfs, I really hope that we can relegate it to only being needed for FS APIs that are async. BTW, the IndexedDB spec does say the sync APIs could be brought back if needed. This sounds like it might qualify as a good reason to bring them back to workers. |
Sync/Async
This is the matrix of the possible sync/async combinations:
The only really interesting one is when the user API is synchronous and the backend is asynchronous. Here are the available options in that situation:
Am I missing any useful options? Each of these options has a downside: (1) is too slow for many applications, (2) doesn't work on the main browser thread, and (3) may lead to surprising data loss. So IMO the best solution is to offer an asynchronous API for users who can rewrite their applications to use it, and also support all these other options where possible so that users can pick the one that works best for their use case otherwise.
Memory residency
Except in the case of option (3) above, I agree that we should make it possible for users to configure the system to not keep entire files in memory by default, but rather have a configurable in-memory (either Wasm memory or JS heap) cache of hot data backed by one of the persistent Web storage backends (whether or not the data is actually persisted across page loads). |
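To make the first recommendation concrete, here is a purely hypothetical sketch of what a callback-style asynchronous user API could look like. None of these names (wasmfs_async_read, wasmfs_read_cb) exist in Emscripten, and the stub body only stands in for a real async backend dispatch.

```c
#include <stddef.h>
#include <stdio.h>

// Hypothetical callback type: invoked when the async read completes.
typedef void (*wasmfs_read_cb)(int bytes_read, void* user_data);

// Hypothetical API: begin an async read and return immediately; cb
// fires later, once the backend (e.g. IndexedDB) finishes. The stub
// body is included only so the sketch is self-contained; a real
// implementation would defer the callback.
static void wasmfs_async_read(const char* path, void* buf, size_t len,
                              wasmfs_read_cb cb, void* user_data) {
  (void)path; (void)buf; (void)len;
  cb(0, user_data);  // pretend 0 bytes were read
}

static void on_level_loaded(int bytes_read, void* user_data) {
  (void)user_data;
  printf("loaded %d bytes; continue here instead of blocking\n", bytes_read);
}

int main(void) {
  static char buf[4096];
  // The caller's thread is never blocked: no proxying, no spinning.
  wasmfs_async_read("/levels/1.dat", buf, sizeof buf, on_level_loaded, NULL);
  return 0;
}
```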
Would making it easier to feed async data into stdin be in scope for the FS rewrite? |
I think this is an important distinction to make. syncfs is only used when we are persisting to cold storage, which will 1) be optional for users and 2) happen relatively infrequently (for example, when a user wants to save a game level). I also agree that we should have a backing store, which could be in-memory. This would not invoke syncfs and would instead be "automatic" in the sense that segments of files could be lazily loaded when required. |
I think this requirement is already covered by the above specifications, no? |
Just checking that stdio is included, not just other FSs. |
Hi everyone, just wanted to check on the current status... According to https://emscripten.org/docs/api_reference/Filesystem-API.html#new-file-system-wasmfs this still seems to be a "Work in Progress". Is that the case? Or can it already be used with some limitations? Are there any instructions on how to get started, or is it too early? |
Yes, still a work in progress. You can track WasmFS PRs as they land. If you don't need specific backends that don't exist yet (most of them, except for Memory files and JS File), and use only common syscalls, then things might work for you. Building with -sWASMFS enables the new system. One use case that might already work well enough, though, is simple file usage with pthreads. That will avoid most of the old FS's proxying overhead. Testing and finding bugs there would be helpful. And in general, contributions are welcome as always - the project linked to before has open issues for things. |
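As a hedged illustration of that "simple file usage with pthreads" case: plain C stdio from a worker thread, which under -sWASMFS should go through the new in-Wasm FS rather than proxying to the main thread. The file path below is an arbitrary choice, not official guidance.

```c
// Plain C file I/O; with -sWASMFS this goes through the new in-Wasm
// file system instead of the JS FS, so a pthread can operate on files
// without proxying every syscall to the main thread.
#include <pthread.h>
#include <stdio.h>

static void* worker(void* arg) {
  (void)arg;
  FILE* f = fopen("/data.txt", "r");
  if (f) {
    char buf[32] = {0};
    fread(buf, 1, sizeof buf - 1, f);
    printf("worker read: %s\n", buf);
    fclose(f);
  }
  return NULL;
}

int main(void) {
  FILE* f = fopen("/data.txt", "w");
  if (f) {
    fputs("hello wasmfs", f);
    fclose(f);
  }
  pthread_t t;
  pthread_create(&t, NULL, worker, NULL);
  pthread_join(&t, NULL);
  return 0;
}
```

A build command along the lines of `emcc main.c -pthread -sPTHREAD_POOL_SIZE=2 -sWASMFS -o test.html` would be the expected shape, though the exact flags required may vary by Emscripten version.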
Hi @kripken I am about to invest some time into testing this over the next few days. My use case is as follows:
Sounds achievable? Before the related docs exist, how do I do the actual mounting? Previously I used
Now with
|
@jozefchutka, I would take a look at the WasmFS node backend to get a sense for how a new WasmFS backend would be structured. Note that the backend interface is not stable and is actively changing, so you'll have to keep up with that for now. Eventually we will have a stable interface and proper documentation on bringing up a new backend as well as utilities to make it easier to do so. Relevant files:
|
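As a rough illustration of what "mounting" looks like in WasmFS terms (a backend is attached by creating a directory that uses it), here is a sketch against emscripten/wasmfs.h as it existed around this time; since the interface was explicitly unstable, treat the names and signatures here as assumptions.

```c
// Hedged sketch of attaching a backend with the (then-unstable)
// WasmFS C API. wasmfs_create_node_backend only works under Node;
// other storage backends would follow the same pattern once available.
#include <emscripten/wasmfs.h>
#include <stdio.h>
#include <sys/stat.h>

int main(void) {
  // Create a backend rooted at a host directory (Node only).
  backend_t node = wasmfs_create_node_backend("/host/data");

  // "Mount" it by creating a directory that uses that backend.
  if (wasmfs_create_directory("/data", 0777, node) != 0) {
    return 1;  // mount-point creation failed
  }

  // Subsequent POSIX I/O under /data goes through the node backend.
  FILE* f = fopen("/data/hello.txt", "w");
  if (f) {
    fputs("written through the node backend\n", f);
    fclose(f);
  }
  return 0;
}
```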
I wonder if mapping between external file paths and internal-to-Emscripten file paths could be a common concern? Previously I mounted one FS folder to a custom FS, but I didn't mount the whole FS as I still wanted to use MemFS for /tmp (and also I don't know if mounting the whole FS works or not.) But the custom FS could be given any arbitrary path, so I had to add my own filename mapping system so that any external path could be converted into a single-level filename within the mounted folder. This was a little bit hacky, and I'm not certain my system worked perfectly. Perhaps it would have been better to just append the external path to the mounted folder? But would that involve creating all the intermediate folders on the fly? I really don't know what the best solution is. If anyone else had the same need, and if Emscripten could offer its own utility, then I'd definitely want to switch to it. |
Yep, I definitely expect that to be common. I'm currently building out the Node backend (linked above) that follows this pattern, but once that works I expect to extract the structure into a common utility that could be used for accessing remote files via other APIs like WASI or XHR. |
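For the "append the external path under the mount point" variant discussed above, here is a sketch in plain POSIX C that creates intermediate directories on the fly; MOUNT_ROOT and map_external_path are made-up names for illustration.

```c
// Sketch: map an external path like "/a/b/c.txt" to
// "/mounted/a/b/c.txt", creating /mounted/a and /mounted/a/b as
// needed. Assumes external_path begins with '/'.
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

#define MOUNT_ROOT "/mounted"

int map_external_path(const char* external_path, char* out, size_t out_len) {
  if (snprintf(out, out_len, "%s%s", MOUNT_ROOT, external_path) >= (int)out_len) {
    return -1;  // path too long for the output buffer
  }
  // Walk the copied path; at each '/' create the directory so far.
  for (char* p = out + strlen(MOUNT_ROOT) + 1; *p; p++) {
    if (*p == '/') {
      *p = '\0';
      if (mkdir(out, 0777) != 0 && errno != EEXIST) {
        return -1;  // a real failure, not "already exists"
      }
      *p = '/';
    }
  }
  return 0;
}
```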
I'm working on an audio editing app that frequently works with 50MB+ files, but I have been hitting all kinds of memory limits on mobile devices. If I understand WasmFS correctly, if I implement an idbfs_backend module and link with -sWASMFS, that will replace the in-memory file system and C/C++ functions will use the new file system (avoiding memory). Is that correct? |
Yes, that's the idea. However I expect that you wouldn't need to write the backend yourself - we'll implement an indexed db backend as we bring WasmFS up to feature parity with the current file system. |
I've created an IDBFS backend and am running tests, but noticed some odd behaviour. Given the following test code:
This creates an IDBFS file, but appears to use the memory backend to insertChild instead of the IDBFS backend. The same problem occurs with wasmfs_create_directory. The design always seems to assume that the root (parent) backend is the memory backend and so new files and directories are not inserted in the correct backend. I have to create a '/first' directory and then a 'first/second' directory for the correct backend to be used. However the IDBFS backend only has the 'second' directory and not the 'first' directory. |
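Since the test code itself is not shown above, here is a hypothetical reconstruction based on the description; my_create_idbfs_backend stands in for the commenter's own backend factory, and wasmfs_create_file is the WasmFS C API for creating a file on a specific backend (signature assumed).

```c
// Hypothetical reconstruction of the elided test code.
#include <emscripten/wasmfs.h>
#include <sys/stat.h>

// Stand-in for the commenter's own IDBFS backend factory; the memory
// backend is substituted here only to keep the sketch self-contained.
static backend_t my_create_idbfs_backend(void) {
  return wasmfs_create_memory_backend();
}

int main(void) {
  backend_t idbfs = my_create_idbfs_backend();
  // The new file's data lives on the IDBFS backend, but the directory
  // entry (insertChild) lands in the root's *memory* backend, which is
  // the behavior being reported above.
  wasmfs_create_file("/persistent.txt", 0777, idbfs);
  return 0;
}
```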
We should add an option to change the root directory, which would allow setting its backend - just no one has gotten around to it, I think. I'm also not sure offhand what that API should look like. (A PR would be welcome.)
I'm not sure I understand, but that sounds like the expected behavior? The root is a MemoryBackend (until we add an API to allow other stuff) so we call |
If IDBFS children are inserted into the memory backend file system, then when the app closes and the memory file system is gone, all the IDBFS children are lost because they were never inserted in a permanent storage directory. There is no way to recover the files (their filenames, inodes, blocks, etc.). This also causes lost inodes and blocks in permanent storage. I'm not sure I understand the overall vision of the new file system. Are all the backends supposed to exist off of the memory backend as directories, such as /node, /idbfs, etc.? Or are they intended to be completely independent? Or both? Inserting a child from one backend into a different backend probably should not be allowed. Imagine inserting a memory backend file into an IDBFS backend and restarting the app. The file data would not exist, but the directory entry would. Backends may have to be treated like separate devices, with files copied between them. |
Right now the expectation is that arbitrary backends will be able to be mounted under the root in-memory backend, but otherwise backends should not be mounted on each other. It is also expected that applications will mount their persistent backends on startup to "discover" the previously-written data rather than having them be mounted automatically. I am contemplating a change that would allow backends to be arbitrarily mounted onto each other, although information about those mount points would be in-memory only and would not be visible to the backend implementations. That avoids the dangling directory entry problem you mentioned, but means that applications would be responsible for re-mounting all backends on every run. Given that capability, we could probably provide a weak function definition for creating the root backend. Users would be able to provide alternative definitions that return a different backend to be used as the root backend. That would be simpler and more flexible than providing e.g. new command line options for choosing the root backend. |
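A sketch of the weak-definition idea from that last paragraph: the symbol name wasmfs_create_root_backend is hypothetical, and wasmfs_create_memory_backend is assumed to be the existing in-memory backend factory from emscripten/wasmfs.h.

```c
// Hypothetical sketch of the weak-function idea; nothing here is a
// real Emscripten API.
#include <emscripten/wasmfs.h>

// Library side: a weak default that returns the in-memory backend.
__attribute__((weak)) backend_t wasmfs_create_root_backend(void) {
  return wasmfs_create_memory_backend();
}

// User side (in application code): a strong definition wins at link
// time, making e.g. a persistent backend the root instead:
//
//   backend_t wasmfs_create_root_backend(void) {
//     return my_create_idbfs_backend();
//   }
```

The appeal of this design, as described above, is that it needs no new command-line options: the linker picks the user's definition automatically if one is provided.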
Closing this in favor of tracking progress on https://github.com/orgs/emscripten-core/projects/1. |
FWIW, from isomorphic-git:
|
Did you find out a WASMFS solution for this? :) |
@patrickcorrigan, the JS API for WasmFS is still marked as "todo", so I'm not sure it's usable yet. See #15976. However, over at Kiwix PWA, we use WORKERFS in combination with the File System API, and it works very well even with files of 97GB. Note, though, that we don't pass a whole directory to the FS; we only pass one file at a time, which seems different from your use case. |
Thanks @Jaifroid. One file is all I need too :) I have been using MEMFS so far and it works great, but I run into issues with files of about 300-400 MB on Safari iOS. I was looking into reading directly from the OPFS and thought I'd try WASMFS. I will use WORKERFS. Thank you for letting me know, I really appreciate it :) I spent about an hour yesterday messing around with WASMFS and exploring the FS object it exposes but could not figure out a way to do it. |
Yes, MEMFS simulates a file system in memory, so you'll hit memory issues as soon as you go over the device's memory allocation. Since we work routinely with ZIM archives larger than 1GB, and often much larger, that was a non-starter for us. You have to be able to receive messages, and the file as a transferable object via postMessage, in the Worker JS that loads the WASM, so it means you probably have to build your WASM with that code, using |
@patrickcorrigan for the time being I am still using WORKERFS and sending Blobs and Files to and from the Worker (using stdout to build the output Blob). The benefit is unlimited file size; the downside is it's read-only. |
@jozefchutka Read-only is all I need. The only problem is my app draws to a canvas and I can't get it to run with proxy_to_worker or proxy_to_pthread. So I think I'm stuck for the moment. |
Objective: Write a high-performance multi-threaded file system layer that replaces the current implementation in library_fs.js.
Goals:
Design Sketch Link: Click here to view/comment