Make fst work without memory mapping with an arbitrary "fake array" #15

phiresky · 2021-05-29T12:24:37Z

A better description is here: quickwit-oss/tantivy#1067

This allows fst to work without having the whole index in memory and on systems without memory mapping supported (e.g. WASM).

To make this viable for everything else, the dynamic invocation needs to be removed again or put behind a compile time flag.

BurntSushi · 2021-06-07T15:52:11Z

FWIW, this diff is taking this library further away from fst. See this discussion where (I think) we are trying to get tantivy back on to fst proper: BurntSushi#94

phiresky · 2021-06-10T11:50:16Z

@BurntSushi Yes, it does.. It would be great if the main fst crate had a way to be used with a generic get_bytes(from, to) API instead of forcibly relying on memory mapping. That would be useful for environments where memory mapping is not available (e.g. WASM), as well as environments where the file is not available directly - for example when the index is stored in an encrypted file and needs to be decrypted first, or it is fetched from a remote location. Encryption is how Element (Matrix) Desktop does search in messages, and due to the reliance on memory mapping in FST it has to load the whole search index into memory instead of only reading chunks.

From your blog article about fst:

Another way [...] is to maintain a cache that is bounded in size and stores chunks of the file in memory, but probably not all of it. When a region of the file needs to be accessed that’s not in the cache, a chunk of the file that includes that region is read and added to the cache (possibly evicting a chunk that hasn’t been accessed in a while). [...] Unfortunately, this approach is complex to implement and has other problems.

This POC proves that it is possible to use your FST crate without memory mapping in a pretty efficient manner, though as it is now I'm sure it reduces the performance for the mem-map use case due to dynamic invocation.

Would you be interested in modifying the upstream FST crate to allow it to work on non-slices? No need to actually implement a caching system, just make it possible to hook into the read requests if needed. I'm sure it can be done without any performance loss compared to the way it works now, though I don't know enough about Rust perf to say what the best approach is.

BurntSushi · 2021-06-10T12:20:57Z

@phiresky It does seem like a very good use case to satisfy, but I'd probably want to look at a design first before doing so. And yes, performance is definitely a concern, but so is complexity of implementation and API complexity.

There are other practical problems... e.g., While I might agree that it's a good idea and am OK adding something like it, the time and energy I have available to devote towards it is pretty limited and that might be a problem for y'all. So in that sense, I understand going your own way with tantivy-fst. It's easier to just iterate on what y'all need instead of trying to build it into a more generic library. I get that. But I wanted to chime in here and see if there was a way to move forward with one library, particularly since this change really ups the ante in terms of divergence. :-)

Anyway, this is a really cool change. It's awesome to see FST working without relying on mmap. I had thought it would be a lot harder than this.

fulmicoton · 2021-06-10T12:42:39Z

@BurntSushi @phiresky I am not merging this. @phiresky's original use case is to fetch the data from an http server or IPFS.
An fst is not good implementation for that.

We developed a sstable based dictionary that is better suited to that use case.
(The block size might be too big for your use case if you are close to the http server though )
https://github.com/quickwit-inc/tantivy/tree/quickwit/src/termdict/sstable_termdict

We will opensource a cleanup version. The block may actually small fst instead of sstable but the idea is the same.
Small blocks of dictionary, and an overall index to find the right block.

We'll probably only do 2 levels. (because of the properties of the storage we are working with: high latency but decent throughput)

phiresky added 5 commits May 17, 2021 17:26

kinda working

6562bb4

working

1ee0e5e

less indirection

e6a30a5

ulen

00d4b26

WORKS!

edfdf0a

phiresky mentioned this pull request May 29, 2021

Make tantivy work in the browser with a statically hosted database and on-demand fetching quickwit-oss/tantivy#1067

Draft

fulmicoton closed this Jun 10, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make fst work without memory mapping with an arbitrary "fake array" #15

Make fst work without memory mapping with an arbitrary "fake array" #15

phiresky commented May 29, 2021

BurntSushi commented Jun 7, 2021

phiresky commented Jun 10, 2021

BurntSushi commented Jun 10, 2021

fulmicoton commented Jun 10, 2021 •

edited

Loading

Make fst work without memory mapping with an arbitrary "fake array" #15

Make fst work without memory mapping with an arbitrary "fake array" #15

Conversation

phiresky commented May 29, 2021

BurntSushi commented Jun 7, 2021

phiresky commented Jun 10, 2021

BurntSushi commented Jun 10, 2021

fulmicoton commented Jun 10, 2021 • edited Loading

fulmicoton commented Jun 10, 2021 •

edited

Loading