Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make fst work without memory mapping with an arbitrary "fake array" #15

Closed
wants to merge 5 commits into from

Conversation

phiresky
Copy link

A better description is here: quickwit-oss/tantivy#1067

This allows fst to work without having the whole index in memory and on systems without memory mapping supported (e.g. WASM).

To make this viable for everything else, the dynamic invocation needs to be removed again or put behind a compile time flag.

@BurntSushi
Copy link

FWIW, this diff is taking this library further away from fst. See this discussion where (I think) we are trying to get tantivy back on to fst proper: BurntSushi#94

@phiresky
Copy link
Author

@BurntSushi Yes, it does.. It would be great if the main fst crate had a way to be used with a generic get_bytes(from, to) API instead of forcibly relying on memory mapping. That would be useful for environments where memory mapping is not available (e.g. WASM), as well as environments where the file is not available directly - for example when the index is stored in an encrypted file and needs to be decrypted first, or it is fetched from a remote location. Encryption is how Element (Matrix) Desktop does search in messages, and due to the reliance on memory mapping in FST it has to load the whole search index into memory instead of only reading chunks.

From your blog article about fst:

Another way [...] is to maintain a cache that is bounded in size and stores chunks of the file in memory, but probably not all of it. When a region of the file needs to be accessed that’s not in the cache, a chunk of the file that includes that region is read and added to the cache (possibly evicting a chunk that hasn’t been accessed in a while). [...] Unfortunately, this approach is complex to implement and has other problems.

This POC proves that it is possible to use your FST crate without memory mapping in a pretty efficient manner, though as it is now I'm sure it reduces the performance for the mem-map use case due to dynamic invocation.

Would you be interested in modifying the upstream FST crate to allow it to work on non-slices? No need to actually implement a caching system, just make it possible to hook into the read requests if needed. I'm sure it can be done without any performance loss compared to the way it works now, though I don't know enough about Rust perf to say what the best approach is.

@BurntSushi
Copy link

@phiresky It does seem like a very good use case to satisfy, but I'd probably want to look at a design first before doing so. And yes, performance is definitely a concern, but so is complexity of implementation and API complexity.

There are other practical problems... e.g., While I might agree that it's a good idea and am OK adding something like it, the time and energy I have available to devote towards it is pretty limited and that might be a problem for y'all. So in that sense, I understand going your own way with tantivy-fst. It's easier to just iterate on what y'all need instead of trying to build it into a more generic library. I get that. But I wanted to chime in here and see if there was a way to move forward with one library, particularly since this change really ups the ante in terms of divergence. :-)

Anyway, this is a really cool change. It's awesome to see FST working without relying on mmap. I had thought it would be a lot harder than this.

@fulmicoton
Copy link

fulmicoton commented Jun 10, 2021

@BurntSushi @phiresky I am not merging this. @phiresky's original use case is to fetch the data from an http server or IPFS.
An fst is not good implementation for that.

We developed a sstable based dictionary that is better suited to that use case.
(The block size might be too big for your use case if you are close to the http server though )
https://github.com/quickwit-inc/tantivy/tree/quickwit/src/termdict/sstable_termdict

We will opensource a cleanup version. The block may actually small fst instead of sstable but the idea is the same.
Small blocks of dictionary, and an overall index to find the right block.

We'll probably only do 2 levels. (because of the properties of the storage we are working with: high latency but decent throughput)

@fulmicoton fulmicoton closed this Jun 10, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants