
Reading HDF5 file that is larger than client memory #40

Open · eweitz opened this issue Nov 3, 2022 · 22 comments

@eweitz commented Nov 3, 2022

The h5wasm and jsfive libraries look valuable for processing HDF5 files in web browsers. Thanks for making them available!

Use case

I want to read large HDF5 files in a web browser. More specifically, I want to let users select an HDF5 file from their local machine, then have my web application read that file, validate the HDF5 file's content per some custom rules, then upload the file to a server. This could turn an important yet disjointed 10-15 minute process into a coherent 30 second process for my users.

My users' HDF5 files are often too big to load completely into client RAM. So I want to read and validate HDF5 content in a streaming manner -- never loading the full file into RAM, only a part at a time, before involving a server.

My investigation

I explored whether jsfive could load partial HDF5 files into memory, but your June presentation, GitHub comments, examples, and my own experiments indicate to me that's not yet possible in jsfive.

Maybe h5wasm is a better bet for stream-processing local HDF5 files in a web browser.

It seems this wasn't possible as of January 2022. Comments from June '22 look related: I see HDF5 data being requested in chunks (via HTTP range requests) in your lazyFileLRU demo. At a glance, that seems like progress toward my use case, but not quite sufficient for it.

Feature request

So my understanding is that HDF5 files can be read in small chunks via h5wasm, but there's currently no way to load, say, a 16 GB HDF5 file in a web browser if your computer has only 8 GB of RAM.

Is that right? If so, please consider this a feature request to enable that! If not, could you point me to any examples?

@bmaranville (Member)

This is an interesting question! The combination you want (local files, accessed through the browser, larger than 2 GB or even larger than system memory) presents several challenges...

For h5wasm, you could indeed use lazyFileLRU to deal with the large size issue, but local filesystem access is only asynchronous and (at the moment) the emscripten emulated filesystem can only be used in a synchronous mode. This is why lazyFile (and lazyFileLRU) have to run in a worker, because there they can take advantage of synchronous fetch calls that are no longer allowed in the main JS thread.
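For illustration, a synchronous ranged read might look like the sketch below (a hypothetical helper, not lazyFileLRU's actual code); this pattern only works inside a web worker, which is why lazyFile and lazyFileLRU have to live there.

function readRangeSync(url, start, stop) {
  const xhr = new XMLHttpRequest();
  xhr.open("GET", url, false);      // third argument false = synchronous
  xhr.responseType = "arraybuffer"; // settable on sync XHR in workers only
  xhr.setRequestHeader("Range", `bytes=${start}-${stop - 1}`);
  xhr.send();
  return new Uint8Array(xhr.response); // just the requested byte window
}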

For jsfive, the library was written to be synchronous and to load the whole file into an ArrayBuffer on instantiation of the root jsfive.File object. For your use case this is bad, because the maximum size of an ArrayBuffer is set by the browser, e.g. 2 GB in the current Chrome version on OSX, which is much smaller than the files you want to process.

The only possibility I can see for solving your problem is to create a new version of jsfive ('jsfive-async') that replaces all buffer.slice operations with async function calls that read a local file through the JavaScript File API, which seems to allow random access once a file is picked through an <input type="file" /> element. That access is always async, though, so the ArrayBuffer used as the main "storage" of jsfive would be replaced with a wrapper of the File object...

class AsyncBuffer {
  // Wraps a File object so it can stand in for the ArrayBuffer
  // jsfive normally uses as backing storage.
  constructor(file_obj) {
    this.file_obj = file_obj;
  }
  async slice(start, stop) {
    // File.slice returns a Blob without reading it; arrayBuffer()
    // then reads only the requested byte range.
    return (await this.file_obj.slice(start, stop)).arrayBuffer();
  }
}

On top of that, all the classes that are instantiated by reading from the file (which is most of the classes in jsfive) would have to be rewritten to have an async init() method in addition to the constructor, that would have to be awaited after each construction.
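A minimal sketch of that constructor-plus-init() pattern (illustrative names, not jsfive's actual source):

// Reads move out of the constructor into an awaitable init(),
// because constructors cannot be async.
class SuperBlock {
  constructor(buffer, offset) {
    this.buffer = buffer; // an AsyncBuffer, not a raw ArrayBuffer
    this.offset = offset;
  }
  async init() {
    this.header = await this.buffer.slice(this.offset, this.offset + 8);
    return this; // enables `await new SuperBlock(buf, 0).init()`
  }
}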

It's probably doable, but it would be a bit of work.

@bmaranville (Member)

Note that loading HDF5 requires random access to the file, not sequential "streaming" access, as the internal structure of the file is not linear (and is in fact fractal at times; see FractalHeap).

@eweitz (Author) commented Nov 4, 2022

Thanks Brian, that's rich and insightful guidance. Your outline broadly makes sense to me.

Recap

My main takeaway from your comments above is to use the File API for random access without loading the whole file into memory as an ArrayBuffer. Your suggestions to use web workers and refactor to async functions also seem on point.

Values -> pointers

Beyond async, I think your suggestions entail that a new jsfive-async library would benefit from using pointers instead of directly loading values that contain large amounts of data. So, for example, I might read the whole file in small chunks via the File API, and track the byte offsets where large datasets in the HDF5 file start and stop. Then, in subsequent operations, I could quickly look up addresses for datasets A, B, C and so forth in the source HDF5 file, whose bytes stay outside memory. That would let me load only dataset A after the initial whole-file scan, rather than stream-reading the whole file more than once.

Such an instantiated jsfive-async HDF5 File object would be more of an index than a file with directly useful content.
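A hypothetical sketch of such an offset index (dataset names and byte ranges are made up for illustration):

// Built during one initial scan: dataset path -> [byteStart, byteStop].
const datasetIndex = new Map([
  ["/A", [4096, 1052672]],
  ["/B", [1052672, 3149824]],
]);

// Later reads jump straight to a dataset's bytes via the File API,
// never re-reading the rest of the file.
async function readDatasetBytes(file, path) {
  const [start, stop] = datasetIndex.get(path);
  return await file.slice(start, stop).arrayBuffer();
}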

HDF5 index file companion

I could also see making that jsfive-async HDF5 index object into a file itself. The index file would be much smaller than the source HDF5 file, but enable fast retrieval and random access. The index file might even travel alongside the larger source HDF5 file. That'd help memory-constrained clients. The index file would aid my particular use case, but I suspect it'd be even more valuable for fast random-access retrieval from remote servers via HTTP range requests. I imagine that'd be a more prevalent use case.

This idea underpins BAM and BAI files, which are common in genomics. This new index file would be like BAI files for HDF5. HDF5 files seem more structurally complex than BAMs, which I think would be the main barrier here.

Questions

Beyond the considerable effort, do you see any fundamental issues with the outline above?

Also, it's worth noting that I'm an HDF5 novice, so I may well be overlooking something. Briefly researching, the closest construct I found to an HDF5 index file is HDF5 virtual datasets (VDS). However, the VDS reference doesn't mention "stream" and has no relevant hits for "memory", so at a glance my hunch is that VDS does not address the use cases that an HDF5 index file would. Is there an existing solution in the HDF5 community for what HDF5 index files would solve?

@bmaranville (Member)

jsfive already loads datasets (and groups, for that matter) "on demand", in the sense that accessing Dataset.value triggers a read of just the relevant bytes (either chunked or contiguous) to construct the desired output value. The Group and Dataset classes hold a reference to a Dataobjects instance once they are loaded, which carries addressing information similar to what you describe. If the underlying buffer for jsfive.File is changed to a random-access async system, I think it will already be pretty efficient.
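A hedged sketch of that lazy pattern (illustrative shape, not jsfive's actual source):

// The node keeps addressing info when opened; bytes are only read
// from the underlying buffer when .value is accessed.
class LazyDataset {
  constructor(dataobjects) {
    this.dataobjects = dataobjects; // chunk/contiguous addresses, dtype, shape
  }
  get value() {
    return this.dataobjects.get_data(); // reads just this dataset's bytes
  }
}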

@bmaranville (Member) commented Nov 4, 2022

I think I have something that works... you can extract the compiled esm/index.mjs from the attachment at the bottom (I had to fix a problem with the filter pipeline; it's all working now), and then in your page do something like the following. I tested on a 16 GB local file and was able to load and browse the file just fine.

import * as jsfive_async from './jsfive/dist/esm/index.mjs';

class AsyncBuffer {
  constructor(file_obj) {
    this.file_obj = file_obj;
  }
  async slice(start, stop) {
    return (await this.file_obj.slice(start, stop)).arrayBuffer();
  }
}

const file_input = document.getElementById("file_input");
file_input.onchange = async function() {
  const file = file_input.files[0];
  const async_buf = new AsyncBuffer(file);
  const f = new jsfive_async.File(async_buf);
  await f.ready;
  // ... then do stuff with the file, e.g. :
  window.f = f; // now you can play with it in the console
  console.log(f.keys);
  // if you have a group called 'entry':
  let entry = await f.get('entry');
  let dataset = await entry.get('data');
  // shape, dtype and value are all async now:
  console.log(await dataset.shape);
  console.log(await dataset.dtype);
  console.log(await dataset.value); // don't do this if your dataset is big!
};

(this is built from the async branch of jsfive, which I just pushed)
dist.zip

@axelboc (Collaborator) commented Nov 7, 2022

> For h5wasm, you could indeed use lazyFileLRU to deal with the large size issue, but local filesystem access is only asynchronous and (at the moment) the emscripten emulated filesystem can only be used in a synchronous mode. This is why lazyFile (and lazyFileLRU) have to run in a worker, because there they can take advantage of synchronous fetch calls that are no longer allowed in the main JS thread.

Would it be realistic for h5wasm to provide a way to bypass Emscripten and let JavaScript take care of random file access with await file.slice(start, end).arrayBuffer() or HTTP range requests?

@bmaranville (Member)

Yes, @axelboc, you can use HTTP range requests (lazyFile, lazyFileLRU, etc.) with h5wasm, but only in synchronous mode, not async. For the moment I don't think emscripten supports async filesystems, even if you write your own FS driver (the interface on the emscripten side is synchronous).
Note that because the access has to be synchronous, you have to make sync fetch calls with the range requests, and sync fetch calls are only permitted in a worker.

@axelboc (Collaborator) commented Nov 7, 2022

Right okay, so await file.slice(start, end).arrayBuffer() is just not possible because file system calls have to be synchronous.

While taking a closer look at Emscripten's file system API, I noticed that they provide a WORKERFS file system that supposedly "provides read-only access to File and Blob objects inside a worker without copying the entire data into memory and can potentially be used for huge files."

Would building h5wasm with WORKERFS and mounting it instead of the default MEMFS solve the issue of reading huge local files?

EDIT: found this StackOverflow thread that might be of help: https://stackoverflow.com/questions/59128901/reading-large-user-provided-file-from-emscripten-chunk-at-a-time

@bmaranville (Member)

That's a good catch! Maybe this is already supported out of the box with WORKERFS... it's worth a try.

eweitz added a commit to broadinstitute/single_cell_portal_core that referenced this issue Nov 10, 2022
See dist.zip in usnistgov/h5wasm#40 for experimental upstream code
@turner commented Dec 2, 2022

Hi,
I just found this thread. I am a bit unclear on the strategy for reading a file that exceeds client memory. My files range from a few hundred MB to a few GB. I will need the ability to retrieve a user-selected chunk from the larger file. The chunks are of a size that will fit in client memory.

Can I use h5wasm for this?

Thanks

@turner commented Dec 3, 2022

@Carnageous the issue is that h5wasm - as an intermediate step - immediately creates an ArrayBuffer which then gets written to disk:

const { FS } = await ready;
FS.writeFile(name, new Uint8Array(arrayBuffer)); // copies the whole file into MEMFS
const hdf5 = new h5wasmFile(name, 'r');

This clearly makes it impossible to use a file that exceeds client memory, effectively removing a key feature of HDF5: the ability to work easily with humongous files.

What are my options here?

@bmaranville (Member) commented Dec 3, 2022

You don't have to use the emscripten MEMFS filesystem if you don't want to. You'll get a "traditional" synchronous MEMFS file backed by an ArrayBuffer if you invoke FS.writeFile as in the comment above, and in the h5wasm example docs.

There are other virtual filesystems to choose from: see https://emscripten.org/docs/api_reference/Filesystem-API.html#file-systems. The one that might solve the problem here is the WORKERFS filesystem. I haven't found any examples of usage yet, but the documentation suggests it does exactly what you are looking for.

The h5wasm library would have to be compiled with support for the WORKERFS, which is not happening right now. I am building it with support for IDBFS (allowing persisting files to browser IndexedDB storage between sessions) and can easily add support for WORKERFS. I don't think that will cause any conflicts or add much size to the library.

EDIT: here is an example of using WORKERFS I found: https://stackoverflow.com/questions/59128901/reading-large-user-provided-file-from-emscripten-chunk-at-a-time

EDITED again: I can't believe I missed that @axelboc had already posted the same stackoverflow link earlier in this thread.

@turner commented Dec 3, 2022

@bmaranville thanks for the rapid response. I'll check out your example. For more context: my use case involves super-resolution microscopy for 3D imaging of chromosomes. So, thousands of raw high-resolution images (video frames) that get processed downstream into 3D models and Hi-C maps. Until I found HDF5, these files have all been separate entities. HDF5 could be a game changer for us.

@bmaranville (Member)

I haven't made a new release yet, but if you want to try it out, here is a zipped version of the library with WORKERFS support built in: h5wasm-0.4.7-extra.tar.gz. It also has an IIFE build, which you might want for playing around with workers (though you could bundle your own worker script, of course). Here is a working demo based on the SO post above (I put it in a folder within the unpacked h5wasm package, so that the path ../dist/iife/h5wasm.js made sense; YMMV):

<html>
<head>
<script>
    const worker = new Worker("worker.js");
    function onClick() {
        const f = document.getElementById("in-file").files[0];
        worker.postMessage([ f ]);
    }
</script>
</head>
<body>
    <input type="file" id="in-file" />
    <input type="button" onClick="onClick()" value="ok" />
</body>
</html>
// worker.js
// Load the IIFE build first so the h5wasm global exists below.
self.importScripts('../dist/iife/h5wasm.js');

onmessage = async function(e) {
    const { FS } = await h5wasm.ready;

    const f_in = e.data[0];

    // Mount the user's File object read-only; WORKERFS reads it
    // in chunks on demand instead of copying it into memory.
    FS.mkdir('/work');
    FS.mount(FS.filesystems.WORKERFS, { files: [f_in] }, '/work');

    const f = new h5wasm.File(`/work/${f_in.name}`, 'r');
    console.log(f);
}
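Once a file is mounted this way, partial reads keep memory bounded. A hedged sketch of lines that could follow inside the onmessage handler (the dataset path 'entry/data' is a placeholder):

// Read a slab of a dataset rather than its whole value.
const dataset = f.get('entry/data');        // placeholder path
console.log(dataset.shape);                 // shape is available without reading data
const firstRows = dataset.slice([[0, 10]]); // fetches only the first 10 rows
console.log(firstRows);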

@bmaranville (Member)

See the new release v0.4.8 on npm and GitHub.

@turner commented Dec 5, 2022

> See the new release v0.4.8 on npm and GitHub.

Very cool. Thanks Brian.

@turner commented Dec 21, 2022

I will start experimenting with this worker approach to handling large files (greater than available RAM). I am a bit unclear on how to interactively retrieve various datasets from within my app. Once mounted, how is this file made available to the hosting app?

@bmaranville (Member)

The downside of the worker is that you have to access the h5wasm object indirectly, through the worker interface. With a service worker you can intercept fetch requests and define a REST API in the worker that is accessed from the main page script, or you can use an add-on like promise-worker with a regular web worker if you want to get responses to messages sent to the worker.
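For the regular-webworker route, a minimal sketch of the request/response pattern that promise-worker automates (method names made up; a real worker would dispatch them to h5wasm):

// main.js: correlate each postMessage with its reply via an id.
let nextId = 0;
const pending = new Map();
const worker = new Worker("h5-worker.js");

worker.onmessage = (e) => {
  const { id, result } = e.data;
  pending.get(id)(result); // resolve the matching promise
  pending.delete(id);
};

function callWorker(method, args) {
  return new Promise((resolve) => {
    const id = nextId++;
    pending.set(id, resolve);
    worker.postMessage({ id, method, args });
  });
}

// e.g. const shape = await callWorker("getShape", ["/entry/data"]);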

@turner commented Dec 22, 2022

Brian, this is a bit off topic, but I have a basic question about h5wasm regarding these larger-than-memory files. As a sanity check I threw together a Jupyter notebook to play with the 6.5 GB file I am working with. The notebook uses h5py and works perfectly. Is there some fundamental limitation of the JS implementation that prevents using large files directly (without resorting to a worker)? Or is it just an issue of this being early in the development of h5wasm, and it's on the roadmap for sometime in the future? Thanks.

@bmaranville (Member) commented Dec 22, 2022

The issue is really two things: the emscripten file system does not allow async access right now, and all major browsers forbid synchronous File API access from the main page (thread). You are only allowed to run synchronous (blocking) file access (or URL fetch!) from a web worker, so that you don't block the main JavaScript thread.

The second thing is not likely to ever change: browsers are probably not going to allow sync file/fetch access from the main JavaScript thread again. The first thing might change: emscripten might support async file access at some point. I don't completely understand all the discussions around this topic, but you can see emscripten-core/emscripten#15041.

I don't usually recommend jsfive over h5wasm, but for this particular use case it is possible to build jsfive in async mode, and make all accessors (getting data, attributes etc.) async as well, and then you can use it in the main thread. See #40 (comment) above. If there is demand for this, I will release this async version of jsfive, probably as a separate package.

EDIT: to answer your question more directly, the reason it works in Jupyter is that the HDF5 file is being read by the Python process running directly on your OS, while the h5wasm code is running in your browser. You can also run h5wasm in nodejs, where it can do random access on files in the OS just fine, just like h5py!

@jrobinso commented Mar 2, 2023

Hi all, I might have a solution for this based on jsfive here: https://github.com/jrobinso/hdf5-indexed-reader. We are now remotely accessing individual datasets from an HDF5 file ~180 GB in size in the spacewalk project. @bmaranville had done most of the work in the async branch of jsfive. For now I have forked jsfive and added support to (1) use byte-range requests for just the portions of the file needed, and (2) consult an optional index to find dataset offsets without walking the tree of ancestors and siblings. Walking HDF5's internal b-tree of containers turns out to be very expensive over the web, as this metadata can be anywhere in the file: you end up generating an HTTP request for each individual container visited. In our use case this can be in the thousands, thus the need for the index. In some schemas, however, the index might not be needed; it is optional.
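For illustration, an external index of this kind might look like the sketch below (hypothetical layout and paths, not hdf5-indexed-reader's actual format):

// One entry per dataset: where its bytes live in the file, so a reader
// can issue a single ranged request instead of walking the b-tree over HTTP.
const index = {
  "/resolutions/5000/cells": { offset: 102400, size: 8388608 },
  "/resolutions/5000/bins": { offset: 8491008, size: 524288 },
};

async function fetchDatasetBytes(url, path) {
  const { offset, size } = index[path];
  const resp = await fetch(url, {
    headers: { Range: `bytes=${offset}-${offset + size - 1}` },
  });
  return new Uint8Array(await resp.arrayBuffer());
}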

@bmaranville (Member)

For local files, I'm working on a web worker proxy that exposes most of the h5wasm API (though all of it becomes async) through Comlink.

See the new PR at: #70
