Reading HDF5 file that is larger than client memory #40
This is an interesting question! The combination you want -- local files, accessed through the browser, larger than 2 GB (or even larger than the system memory) -- presents several challenges.

For h5wasm, you could indeed use lazyFileLRU to deal with the large-size issue, but local filesystem access is only asynchronous, and (at the moment) the emscripten emulated filesystem can only be used in a synchronous mode. This is why lazyFile (and lazyFileLRU) have to run in a worker: there they can take advantage of the synchronous fetch calls that are no longer allowed in the main JS thread.

For jsfive, the library was written to be synchronous and to load the whole file into an ArrayBuffer on instantiation of the root jsfive.File object. For your use case this is bad, because the maximum size of an ArrayBuffer is set by the browser -- e.g. 2 GB in the current Chrome version on OSX -- which is much smaller than the files you want to process.

The only possibility I can see for solving your problem is to create a new version of jsfive ('jsfive-async') that replaces all buffer.slice operations with async function calls that read a local file through the javascript File API, which seems to allow random access once a file is picked through an `<input>` element:

```js
class AsyncBuffer {
  constructor(file_obj) {
    this.file_obj = file_obj;
  }
  async slice(start, stop) {
    return (await this.file_obj.slice(start, stop)).arrayBuffer();
  }
}
```

On top of that, all the classes that are instantiated by reading from the file (which is most of the classes in jsfive) would have to be rewritten to use an async initialization step. It's probably doable, but it would be a bit of work.
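To make that rewrite concrete, here is a hypothetical sketch of the async-initialization pattern being described (the class and field names are invented for illustration, not actual jsfive internals): since JavaScript constructors cannot be async, each class that reads from the file would expose a static async factory that awaits its AsyncBuffer reads.

```js
// Hypothetical sketch only -- names are illustrative, not jsfive internals.
class AsyncDataObjects {
  constructor(buffer, offset) {
    this.buffer = buffer;   // an AsyncBuffer as defined above
    this.offset = offset;
    this.headerBytes = null;
  }

  // Static async factory: performs the awaited reads the constructor can't.
  static async create(buffer, offset) {
    const obj = new AsyncDataObjects(buffer, offset);
    obj.headerBytes = await buffer.slice(offset, offset + 16);
    return obj;
  }
}

// usage: const dobj = await AsyncDataObjects.create(async_buf, address);
```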
Note that loading HDF5 requires random access to the file, not sequential "streaming" access, as the internal structure of the file is not linear (and is in fact fractal at times! -- see FractalHeap).
Thanks Brian, that's rich and insightful guidance. Your outline broadly makes sense to me.

Recap

My main takeaway from your comments above is to use the File API for random access without loading the whole file into memory as an ArrayBuffer. Your suggestions to use web workers and refactor to async functions also seem on point.

Values -> pointers

Beyond async, I think your suggestions entail that a new jsfive-async library would benefit from using pointers instead of directly loading values that contain large amounts of data. So, for example, I might read the whole file in small chunks via the File API, and track the byte offsets where large datasets in the HDF5 file start and stop. Then, in subsequent operations, I could quickly look up addresses for datasets A, B, C, and so forth in the source HDF5 file stored outside memory. That would let me load only dataset A after the initial whole-file scan, rather than needing to stream-read the whole file more than once. Such an instantiated jsfive-async HDF5 File object would be more of an index than a file with directly useful content. (See the sketch after this comment.)

HDF5 index file companion

I could also see making that jsfive-async HDF5 index object into a file itself. The index file would be much smaller than the source HDF5 file, but would enable fast retrieval and random access. The index file might even travel alongside the larger source HDF5 file. That would help memory-constrained clients. The index file would aid my particular use case, but I suspect it would be even more valuable for fast random-access retrieval from remote servers via HTTP range requests; I imagine that is a more prevalent use case. This idea underpins BAM and BAI files, which are common in genomics. This new index file would be like a BAI file for HDF5. HDF5 files seem more structurally complex than BAMs, which I think would be the main barrier here.

Questions

Beyond the considerable effort, do you see any fundamental issues with the outline above? It's also worth noting that I'm an HDF5 novice, so I may well be overlooking something. Briefly researching, the closest construct I found to an HDF5 index file is HDF5 virtual datasets (VDS). However, the VDS reference doesn't mention "stream" and has no relevant hits for "memory", so at a glance my hunch is that VDS does not address the use cases that an HDF5 index file would. Is there an existing solution in the HDF5 community for what HDF5 index files would solve?
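For illustration, a minimal sketch of the kind of external index described above, assuming a simple JSON layout (all field names here are invented; this is not an existing format):

```js
// Hypothetical index: dataset paths mapped to byte extents in the source file.
const exampleIndex = {
  source: "experiment.h5",   // assumed companion HDF5 file name
  datasets: {
    "/entry/data": { offset: 4096,       nbytes: 1073741824 },
    "/entry/mask": { offset: 1073745920, nbytes: 2097152 },
  },
};

// With the index, a memory-constrained client can read one dataset with a
// single ranged slice instead of re-scanning the whole file:
async function readRawDataset(file, index, path) {
  const { offset, nbytes } = index.datasets[path];
  return await file.slice(offset, offset + nbytes).arrayBuffer();
}
```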
jsfive already loads datasets (and groups, for that matter) "on demand", in the sense that calling Dataset.value triggers a read of the relevant bytes (either chunked or contiguous) to construct the desired output value. The Group and Dataset classes hold a reference to a Dataobjects instance once they are loaded, which has a bunch of addressing information similar to what you describe. If the underlying buffer for jsfive.File is changed to a random-access async system, I think it will already be pretty efficient.
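As a rough illustration of that on-demand behavior in the current synchronous jsfive (usage sketched from the jsfive README; the import style and file name are assumptions for your setup):

```js
import * as jsfive from "jsfive";  // assumed package/import style

const buf = await (await fetch("small_file.h5")).arrayBuffer();
const f = new jsfive.File(buf, "small_file.h5");

const ds = f.get("entry/data");  // cheap: walks metadata only
const values = ds.value;         // this access triggers the actual byte reads
```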
I think I have something that works... you can extract the compiled esm/index.mjs from the attachment at the bottom (I had to fix a problem with the filter pipeline; it's all working now), and then in your page do like below. I tested on a 16 GB local file and was able to load and browse the file just fine.

```js
import * as jsfive_async from './jsfive/dist/esm/index.mjs';

class AsyncBuffer {
  constructor(file_obj) {
    this.file_obj = file_obj;
  }
  async slice(start, stop) {
    return (await this.file_obj.slice(start, stop)).arrayBuffer();
  }
}

const file_input = document.getElementById("file_input");
file_input.onchange = async function() {
  const file = file_input.files[0];
  const async_buf = new AsyncBuffer(file);
  const f = new jsfive_async.File(async_buf);
  await f.ready;
  // ... then do stuff with the file, e.g.:
  window.f = f; // now you can play with it in the console
  console.log(f.keys);
  // if you have a group called 'entry':
  let entry = await f.get('entry');
  let dataset = await entry.get('data');
  // shape, dtype and value are all async now:
  console.log(await dataset.shape);
  console.log(await dataset.dtype);
  console.log(await dataset.value); // don't do this if your dataset is big!
}
```

(This is built from the async branch of jsfive.)
Would it be realistic for h5wasm to read remote files through HTTP range requests?
Yes, @axelboc, you can use HTTP range requests (lazyFile, lazyFileLRU, etc.) with h5wasm, but only in synchronous mode, not async. For the moment I think there is no support for async filesystems in emscripten, even if you write your own FS driver (the interface on the emscripten side is synchronous).
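For reference, a minimal sketch of the mechanism that lazyFile/lazyFileLRU depend on: a synchronous ranged XHR, which browsers only permit inside a worker. This is illustrative, not the actual lazyFileLRU code.

```js
// worker-only: synchronous XHR is disallowed on the main thread.
function fetchRangeSync(url, start, stop) {
  const xhr = new XMLHttpRequest();
  xhr.open("GET", url, false);  // false => synchronous
  // sync XHR can't use responseType "arraybuffer"; fetch raw bytes as text:
  xhr.overrideMimeType("text/plain; charset=x-user-defined");
  xhr.setRequestHeader("Range", `bytes=${start}-${stop - 1}`);  // inclusive end
  xhr.send(null);
  const text = xhr.responseText;
  const bytes = new Uint8Array(text.length);
  for (let i = 0; i < text.length; i++) bytes[i] = text.charCodeAt(i) & 0xff;
  return bytes;
}
```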
Right, okay. While taking a closer look at Emscripten's file system API, I noticed that they provide a WORKERFS file system for accessing File and Blob objects inside web workers. Would building h5wasm with WORKERFS support be an option?

EDIT: found this StackOverflow thread that might be of help: https://stackoverflow.com/questions/59128901/reading-large-user-provided-file-from-emscripten-chunk-at-a-time
That's a good catch! Maybe this is already supported out of the box with WORKERFS... it's worth a try.
See dist.zip in usnistgov/h5wasm#40 for experimental upstream code
Hi, can I use h5wasm for this? Thanks!
@Carnageous the issue is that the h5wasm examples load the file into the emscripten in-memory filesystem (MEMFS) via FS.writeFile, so the whole file ends up in RAM.

This clearly makes it impossible to use a file that exceeds client memory, effectively removing a key feature of HDF5. What are my options here?
You don't have to use the emscripten MEMFS filesystem if you don't want to. You'll get a synchronous, MEMFS "traditional" file backed by an ArrayBuffer if you invoke FS.writeFile as in the comment above, and in the h5wasm example docs. There are other virtual filesystems to choose from: see https://emscripten.org/docs/api_reference/Filesystem-API.html#file-systems.

The one that might solve the problem here is the WORKERFS filesystem. I haven't found any examples of usage yet, but the documentation suggests it does exactly what you are looking for. The h5wasm library would have to be compiled with support for WORKERFS, which is not happening right now. I am building it with support for IDBFS (allowing persisting files to browser IndexedDB storage between sessions) and can easily add support for WORKERFS. I don't think that will cause any conflicts or add much size to the library.

EDIT: here is an example of using WORKERFS I found: https://stackoverflow.com/questions/59128901/reading-large-user-provided-file-from-emscripten-chunk-at-a-time

EDITED again: I can't believe I missed that @axelboc had already posted the same stackoverflow link earlier in this thread.
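For contrast, a sketch of the MEMFS pattern under discussion, following the FS.writeFile usage described above (the `file` variable is assumed to come from a file input). This is the pattern that fails for larger-than-memory files, because the whole file is buffered in RAM before it is even written into the virtual filesystem:

```js
const { FS } = await h5wasm.ready;
const ab = await file.arrayBuffer();          // entire file read into memory here
FS.writeFile(file.name, new Uint8Array(ab));  // then copied again into MEMFS
const f = new h5wasm.File(file.name, "r");    // works, but only for small files
```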
@bmaranville thanks for the rapid response. I'll check out your example. For more context: my use case involves super-resolution microscopy for 3D imaging of chromosomes -- thousands of raw high-resolution images (video frames) that get processed downstream into 3D models and Hi-C maps. Until I found HDF5, these files have all been separate entities. HDF5 could be a game changer for us.
I haven't made a new release yet, but if you want to try it out, here is a zipped version of the library with WORKERFS support built in: h5wasm-0.4.7-extra.tar.gz. It also has an IIFE build, which you might want for playing around with workers (though you could bundle your own worker script, of course). Here is a working demo based on the SO post before (I put it in a folder within the unpacked h5wasm package, so that the relative path to ../dist/iife/h5wasm.js in the worker resolves):

```html
<html>
  <head>
    <script>
      const worker = new Worker("worker.js");
      function onClick() {
        const f = document.getElementById("in-file").files[0];
        worker.postMessage([ f ]);
      }
    </script>
  </head>
  <body>
    <input type="file" id="in-file" />
    <input type="button" onClick="onClick()" value="ok" />
  </body>
</html>
```

```js
// worker.js
self.importScripts('../dist/iife/h5wasm.js');

onmessage = async function(e) {
  const { FS } = await h5wasm.ready;
  const f_in = e.data[0];
  FS.mkdir('/work');
  // mount the browser File object read-only into the worker's filesystem:
  FS.mount(FS.filesystems.WORKERFS, { files: [f_in] }, '/work');
  const f = new h5wasm.File(`/work/${f_in.name}`, 'r');
  console.log(f);
}
```
I will start experimenting with this worker approach to handling large files (greater than available RAM). I am a bit unclear on how to interactively retrieve various datasets from within my app. Once mounted, how is this file made available to the hosting app?
The downside of the worker is that you have to access the h5wasm object indirectly, through the worker interface. With a service worker you can intercept fetch requests and define a REST API in the worker that is accessed from the main page script, or you can use an add-on like 'promise-worker' with a regular web worker if you want to get responses to messages sent to the worker. (A minimal sketch of the latter follows.)
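A minimal sketch of the promise-based request/response idea with a plain web worker. The message shapes, method names, and worker file name are invented for illustration:

```js
// main thread -- pair each request with a pending promise by id.
const worker = new Worker("h5-worker.js");  // hypothetical worker script
const pending = new Map();
let nextId = 0;

worker.onmessage = (e) => {
  const { id, result, error } = e.data;
  const { resolve, reject } = pending.get(id);
  pending.delete(id);
  error ? reject(new Error(error)) : resolve(result);
};

function callWorker(method, ...args) {
  return new Promise((resolve, reject) => {
    const id = nextId++;
    pending.set(id, { resolve, reject });
    worker.postMessage({ id, method, args });
  });
}

// e.g., if the worker implements "keys" and "readDataset" handlers:
// const keys = await callWorker("keys", "/entry");
// const data = await callWorker("readDataset", "/entry/data");
```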
Brian, this is a bit off topic, but I have a basic question about h5wasm regarding these larger-than-memory files. As a sanity check, I threw together a Jupyter notebook to play with the 6.5 GB file I am working with. The notebook uses h5py and reads the file without any trouble -- why does that work?
The issue is really two things: the emscripten file system does not allow async access right now, and all major browsers forbid running synchronous File API access from the main page (thread). You are only allowed to run synchronous (blocking) file access (or URL fetch!) from a web worker, so that you don't block the main javascript thread. The second thing is not likely to ever change: browsers are probably not going to allow sync file/fetch access from the main javascript thread again. The first thing might change -- emscripten might support async file access at some point. I don't completely understand all the discussions on this topic, but you can see emscripten-core/emscripten#15041.

I don't usually recommend jsfive over h5wasm, but for this particular use case it is possible to build jsfive in async mode, making all accessors (getting data, attributes, etc.) async as well, and then you can use it in the main thread. See #40 (comment) above. If there is demand for this, I will release this async version of jsfive, probably as a separate package.

EDIT: to answer your question more directly, the reason it works in Jupyter is that the HDF5 file is being read by the python process running directly on your OS, while the h5wasm code runs in your browser. You can also run h5wasm in nodejs, and it can do random access on files in the OS just fine -- just like h5py!
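A sketch of that nodejs usage, where h5wasm reads files on the real OS filesystem with random access, so file size is not limited by an ArrayBuffer. The import style and file path are assumptions; the exact import may vary by h5wasm version (see the h5wasm README):

```js
// node (ESM module) -- h5wasm reads directly from the OS filesystem here.
const h5wasm = await import("h5wasm");
await h5wasm.ready;

const f = new h5wasm.File("/path/to/big_file.h5", "r");  // hypothetical path
console.log(f.keys());            // browse the root group
const ds = f.get("entry/data");   // hypothetical dataset path
console.log(ds.shape);            // metadata only; no bulk read yet
f.close();
```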
Hi all, I might have a solution for this based on jsfive here: https://github.com/jrobinso/hdf5-indexed-reader. We are now remotely accessing individual datasets from an HDF5 file ~180 GB in size in the spacewalk project. @bmaranville had done most of the work in the async branch of jsfive. For now I have forked jsfive and added extensions to (1) use byte-range requests for just the portions of the file needed, and (2) support an optional index to find dataset offsets without walking the tree of ancestors and siblings. The tree walking through HDF5's internal b-tree container structure turns out to be very expensive over the web, as this metadata can be anywhere in the file: you end up generating an HTTP request for each individual container visited. In our use case this can be in the thousands, thus the need for the index. In some schemas, however, the index might not be needed; it is optional.
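The remote case can reuse the same slice() interface sketched earlier in this thread, backed by HTTP range requests instead of the File API. A sketch under those assumptions (illustrative only, not hdf5-indexed-reader's actual API; assumes the server honors Range headers):

```js
class RemoteAsyncBuffer {
  constructor(url) {
    this.url = url;
  }
  async slice(start, stop) {
    const response = await fetch(this.url, {
      headers: { Range: `bytes=${start}-${stop - 1}` },  // Range end is inclusive
    });
    if (response.status !== 206) {
      throw new Error("server did not honor the Range request");
    }
    return response.arrayBuffer();
  }
}
```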
The h5wasm and jsfive libraries look valuable for processing HDF5 files in web browsers. Thanks for making them available!
Use case
I want to read large HDF5 files in a web browser. More specifically, I want to let users select an HDF5 file from their local machine, then have my web application read that file, validate the HDF5 file's content per some custom rules, then upload the file to a server. This could turn an important yet disjointed 10-15 minute process into a coherent 30 second process for my users.
My users' HDF5 files are often too big to load completely into client RAM. So I want to read and validate HDF5 content in a streaming manner -- never loading the full file into RAM, only a part at a time, before involving a server.
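For the local half of this, a sketch of the incremental read the File API allows, keeping peak memory at the chunk size rather than the file size (the chunk size is arbitrary and the validation logic is elided):

```js
// Read a user-selected File one slice at a time.
async function* chunks(file, chunkSize = 8 * 1024 * 1024) {
  for (let pos = 0; pos < file.size; pos += chunkSize) {
    const end = Math.min(pos + chunkSize, file.size);
    yield await file.slice(pos, end).arrayBuffer();
  }
}

// usage inside an async function, given `file` from an <input type="file">:
// for await (const chunk of chunks(file)) { validate(chunk); }
```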
My investigation
I explored whether jsfive could load partial HDF5 files into memory, but your June presentation, GitHub comments, examples, and my own experiments indicate to me that's not yet possible in jsfive.
Maybe h5wasm is a better bet for stream-processing local HDF5 files in a web browser.
It seems this wasn't possible as of January 2022. Comments from June '22 look related: I see HDF5 data being requested in chunks (via HTTP range requests) in your lazyFileLRU demo. At a glance that seems like progress toward, but not quite sufficient for, my use case.
Feature request
So my understanding is that HDF5 files can be read in small chunks via h5wasm, but there's no current way to load, say, a 16 GB HDF5 file in a web browser if your computer only has 8 GB RAM.
Is that right? If so, please consider this a feature request to enable that! If not, could you point me to any examples?