Make tantivy work in the browser with a statically hosted database and on-demand fetching #1067

Draft: wants to merge 9 commits into base: main

@phiresky phiresky commented May 29, 2021

hi everyone!

I've managed to make tantivy work in WASM with partial loading of queries via HTTP Range requests. This means it's possible to host a tantivy search engine on a statically hosted website (or a distributed website on IPFS).

For example, a full-text search on a 14 GByte index takes 2 seconds and needs to download only ~1.5 MByte of the index.

Here's a demo using the english Wikipedia as well as the OpenLibrary metadata:

https://demo.phiresky.xyz/tmp-ytccrzsovkcjoylr/dist/index.html

Since tantivy heavily relies on memory mapping, this required some pretty deep changes both here and in tantivy-fst.

Of course, this whole thing is much less efficient than just using tantivy on the backend, but needing only a static file host is much cheaper, and this is much more efficient than doing the same thing with SQLite. This is also a unique feature no other database or search engine has (as far as I know).

Here's some details about how it works:

  • Instead of OwnedBytes being passed around, we pass around FileSlices and only get the actual content out of it as late as possible. I've added a new trait FakeArr<> that is used in tantivy-fst as well as tantivy for slicing operations etc to replace the native ones.
  • I've replaced most instances of usize with type Ulen = u64 because why would I want to be able to only load databases < 4GByte on a 32 bit system (like wasm)? :)
  • I've added a new implementation of Directory and FileHandle that hooks into JavaScript instead of the file system. This implementation uses two layers of caching and prefetching:
    1. Files are always read in chunks of (configurable) 32kByte. If a single byte is read, the whole 32kByte chunk is fetched and cached indefinitely.
    2. Each file handle has three virtual "read heads" with a current position. If a chunk is fetched that is sequential to the previous request, an increasing number of chunks is prefetched speculatively to reduce the number of requests in the future. This was very useful for SQLite but I'm not sure how much it does for tantivy since the DB structure is very different and most requests have predictable sizes anyways.
  • I've adjusted the tests to work with the new interfaces
  • The changes to tantivy-fst are in this companion PR: Make fst work without memory mapping with an arbitrary "fake array" quickwit-inc/fst#15
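
The two caching layers described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `ChunkCache`, `MAX_SPECULATION`, and the shape of the `fetch` callback are hypothetical names, and it tracks a single read head instead of three. The `fetch` closure stands in for whatever issues the actual HTTP Range request.

```rust
use std::collections::HashMap;

/// Chunk size in bytes (the description above mentions a configurable 32 kByte).
pub const CHUNK_SIZE: u64 = 32 * 1024;
/// Assumed upper bound on speculative prefetching; not taken from the PR.
pub const MAX_SPECULATION: u64 = 8;

/// Hypothetical chunk cache with one sequential "read head".
/// `fetch(first_chunk, chunk_count)` returns the bytes of those chunks.
pub struct ChunkCache<F: FnMut(u64, u64) -> Vec<u8>> {
    fetch: F,
    chunks: HashMap<u64, Vec<u8>>,
    /// Chunk index that the next miss would hit if access is sequential.
    next_expected: Option<u64>,
    /// Number of chunks requested per miss; doubles on sequential misses.
    speculation: u64,
}

impl<F: FnMut(u64, u64) -> Vec<u8>> ChunkCache<F> {
    pub fn new(fetch: F) -> Self {
        Self { fetch, chunks: HashMap::new(), next_expected: None, speculation: 1 }
    }

    /// Make sure `chunk` is cached, prefetching more when reads look sequential.
    fn load(&mut self, chunk: u64) {
        if self.chunks.contains_key(&chunk) {
            return;
        }
        self.speculation = if self.next_expected == Some(chunk) {
            (self.speculation * 2).min(MAX_SPECULATION)
        } else {
            1
        };
        let count = self.speculation;
        let data = (self.fetch)(chunk, count);
        // Split the response into chunk-sized pieces and cache them indefinitely.
        for (i, piece) in data.chunks(CHUNK_SIZE as usize).enumerate() {
            self.chunks.insert(chunk + i as u64, piece.to_vec());
        }
        self.next_expected = Some(chunk + count);
    }

    /// Read up to `len` bytes starting at `offset` (fewer at end of file).
    pub fn read(&mut self, offset: u64, len: u64) -> Vec<u8> {
        let end = offset + len;
        let mut pos = offset;
        let mut out = Vec::with_capacity(len as usize);
        while pos < end {
            let chunk = pos / CHUNK_SIZE;
            self.load(chunk);
            let data = match self.chunks.get(&chunk) {
                Some(d) => d,
                None => break, // nothing was returned: past the end of the file
            };
            let start = (pos - chunk * CHUNK_SIZE) as usize;
            let stop = ((end - chunk * CHUNK_SIZE).min(CHUNK_SIZE) as usize).min(data.len());
            if start >= stop {
                break;
            }
            out.extend_from_slice(&data[start..stop]);
            pos = chunk * CHUNK_SIZE + stop as u64;
        }
        out
    }
}
```

With this sketch, reading a single byte costs one request for a whole chunk, and a miss on the chunk directly following the previous fetch doubles the number of chunks requested, up to the assumed cap.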

And now here are the reasons this probably can't be merged directly:

  • I've changed instances of usize to Ulen somewhat blindly, because it's tons of changes, so there might be some cases where it should be kept as usize

  • No compile time flag to change Ulen back to 32 bit if someone really needs it. I doubt many people actually use tantivy on 32bit systems though.

  • Dynamic invocation that probably causes unacceptable performance loss when using tantivy normally. There's dynamic invocation between FileSlice and FileHandle as well as between FakeArrSlice and FakeArr (the FileSlice and FakeArrSlice traits could probably be unified).

    I think this could be solved without losing any flexibility by either adding more generics and using static invocation, or by adding a compile time flag.

  • I based it on tantivy 0.14 not master

  • I did not care about the .pos file so far so that's still fetched as a whole if you need phrase queries

Here's the code of the demo: https://github.com/phiresky/tantivy-wasm/

@phiresky
Author

phiresky commented May 29, 2021

Another note: The same thing could be used with an IndexedDB backend to e.g. fix the Matrix search in the browser: matrix-org/seshat#84

So far I've only looked at read-only access, but writing should be possible similarly.

@ngbrown

ngbrown commented May 29, 2021

It's not clear that IPFS supports byte range requests apart from the gateway-browser connection. So the actual files stored may have to be pre-chunked and stored as separate files. If the "read heads" dynamically vary the ranges they read, would it be simple to replace the range query with a numbered file fetch?

@phiresky
Author

It's not clear that IPFS supports byte range requests apart from the gateway-browser connection.
may have to be pre-chunked

In my above-linked article I actually do split the database file into multiple chunks (db chunked in 10MB, fetched in chunks of 1kB), so my code here kinda already supports it. So that's useful if e.g. the CDN (≅ IPFS gateway) can only fetch and cache whole files, not file chunks.
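
To illustrate how a byte range can be mapped onto pre-chunked, numbered files, here is a minimal sketch. `FilePart` and `split_range` are hypothetical names, not code from the demo; the only assumption is that the original file was cut into consecutive pieces of a fixed `file_size`.

```rust
/// One piece of a read that lies entirely inside a single numbered file.
#[derive(Debug, PartialEq)]
pub struct FilePart {
    /// Index of the numbered file (0 = first piece of the original file).
    pub file_index: u64,
    /// Byte offset inside that numbered file.
    pub offset: u64,
    /// Number of bytes to read from that file.
    pub len: u64,
}

/// Split the range `start..start + len` of the original file into per-file reads.
pub fn split_range(start: u64, len: u64, file_size: u64) -> Vec<FilePart> {
    assert!(file_size > 0, "file_size must be non-zero");
    let end = start + len;
    let mut pos = start;
    let mut parts = Vec::new();
    while pos < end {
        let file_index = pos / file_size;
        let offset = pos % file_size;
        // Read up to the end of this numbered file or the end of the request.
        let take = (file_size - offset).min(end - pos);
        parts.push(FilePart { file_index, offset, len: take });
        pos += take;
    }
    parts
}
```

A read that straddles a file boundary simply turns into two smaller reads, each of which can be served by a whole-file fetch or by a range request against that one file.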

Someone actually is using the same method of fetching file chunks with IPFS and SQLite and it works pretty well (can't link it sadly). The main limitation seems to be that most IPFS gateways throw 429 errors after ~20 requests. I also don't know if IPFS supports fetching only parts of files from the network, but if not pre-chunking the files definitely works.

@fulmicoton
Collaborator

(For obvious reasons, this is unmergeable, but I assume that was not the purpose of this PR?)

Good job finding out how to use the new FileSlice API without any guidance.
We (Quickwit) introduced the API precisely to be able to fetch information from a distant directory.

I am curious how you bridged the gap between FS syscalls and HTTP GET requests in your WASM demo?

@lidel

lidel commented Jun 2, 2021

This is an exciting PoC – demonstrates a potential way of solving ipfs/distributed-wikipedia-mirror#76 ❤️

I can provide an answer for:

It's not clear that IPFS supports byte range requests [..]
I also don't know if IPFS supports fetching only parts of files from the network [..]

IPFS supports range requests, either as HTTP range requests to a gateway, or by passing offset / length to ipfs cat command. Data stored on IPFS is already chunked and represented as a DAG, and a range request will traverse the graph in a way that fetches the minimal amount of chunks to fulfill the range request.

Due to this, for most use cases, as long as IPFS is used, there is no need to do additional chunking at the filesystem level.

main limitation seems to be that most IPFS gateways throw 429 errors after ~20 requests

AFAIK this is not IPFS, but an artificial limitation introduced by a specific gateway instance via Nginx or a similar reverse proxy (to mitigate ongoing abuse). Solutions: switch gateways, run your own, or use IPFS directly.

@phiresky
Author

phiresky commented Jun 10, 2021

For obvious reason, this is unmergeable but I assume this was not the purpose of this PR?

Actually, it would be great to get as much as possible of this merged :) I put this draft PR up to find out if you're interested and how best to proceed; I could split it up into separate PRs, for example. None of the changes in this PR are wasm-specific and they could be useful for other use cases as well.

For example, the usize -> Ulen=u64 change accounts for the largest number of changed lines and could fairly easily be integrated without affecting any other use cases of tantivy, while allowing use of large indexes on 32-bit systems.
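
As a small illustration of why the alias matters, here is a sketch. Only the name `Ulen` comes from this PR; the helper functions are hypothetical. Offsets and lengths within the index are kept as 64-bit values, and the conversion to `usize` happens only where bytes are actually held in memory.

```rust
use std::convert::TryFrom;

/// File offsets and lengths are 64-bit regardless of the target's pointer width.
pub type Ulen = u64;

/// Convert to `usize` only for data that is actually loaded into memory.
/// Returns `None` if the length does not fit on the current target.
pub fn to_mem_len(len: Ulen) -> Option<usize> {
    usize::try_from(len).ok()
}

/// Whether an offset would be representable with a 32-bit `usize`,
/// as on wasm32 (simulated here with `u32`).
pub fn fits_in_32_bit(offset: Ulen) -> bool {
    u32::try_from(offset).is_ok()
}
```

An offset into a 14 GByte index fails `fits_in_32_bit`, while a 32 kByte chunk length converts to `usize` without trouble on any target.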

Good job finding out how to use the new FileSlice API without any guidance

The FileSlice API is good, but not that useful with how it is currently handled in the main branch I think - in many cases it is converted to OwnedBytes very soon when it could be converted later on:

  • the field norm file is immediately converted to OwnedBytes, i.e. fully loaded into memory. In my case, this would require fetching 60MB via HTTP, so my PR makes it read only the needed parts instead
  • the FST is loaded fully into memory. This is a multiple-gigabyte file in my case, so I changed it to use FileSlice until the actual bytes are read. This way I only have to fetch <100kB instead of 10GB.
  • the term info file is also loaded as a single OwnedBytes although the reads in the term info file are always for an exact and small byte range within the file.

Note that these changes aren't just useful for wasm, but also when memory mapping is not available in general as well as other use cases. For example, Element (Matrix) uses tantivy to index messages, but the search index is stored on disk encrypted. Their current implementation thus decrypts and loads the whole index into memory since you can't dynamically decrypt chunks with memory mapping. With these changes that could be changed to only decrypt the needed parts on demand when a query is run.

@fulmicoton I am curious on how you bridged the gap between the FS syscall to HTTP get requests in your WASM demo?

I'm not sure what exactly you mean - but that part is simple, I just compile tantivy to wasm, and implement the Directory and FileHandle traits with functions that hook into TypeScript XMLHttpRequests. The hard part was making sure the read_bytes() method of FileSlice is only called when actually needed, not just on the whole files.

Edit: I see there are already some comments by @fulmicoton about reducing reliance on memmaps here

@fulmicoton
Collaborator

Apart from fieldnorms, we solved all of the problems you mentioned above in a more efficient way here.
https://github.com/quickwit-inc/tantivy/tree/quickwit/src/termdict/sstable_termdict
FSTs are by nature (at least with this layout) not suited for this.

That work will be cleaned up and added to tantivy soonish.

@fulmicoton
Collaborator

I'm not sure what exactly you mean - but that part is simple, I just compile tantivy to wasm, and implement the Directory and FileHandle trait with functions that hook into Typescript XMLHttpRequests. The hard part was making sure the read_bytes() method of FileSlice is only called when actually needed, not just on the whole files.

Did you find a way to parallelize the requests?

@phiresky
Author

Did you find a way to parallelize the requests?

Most of the requests are sequential and synchronous, I didn't change anything there, except that it optimistically prefetches more data than needed using heuristics. I only parallelized the fetching of the actual document contents, since that was a pretty easy change - now it's a single HTTP request with multiple file ranges to get all the matched documents instead of one read for each document. I added the FileSlice method fn read_bytes_slice_multiple(&self, ranges: &[Range<Ulen>]) for that. The same could be done to fetch the terms and maybe more, but I'm not sure how that could transfer to the memory mapped implementation without using threads or async.
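
For illustration, here is a minimal sketch of how several half-open byte ranges could be turned into the value of a single HTTP Range header. The `range_header` function is a hypothetical helper, not part of the PR. The header syntax follows the HTTP range request specification, where both ends of a byte range are inclusive.

```rust
use std::ops::Range;

/// 64-bit length type, as introduced in this PR.
pub type Ulen = u64;

/// Format half-open ranges (`start..end`) as an HTTP `Range` header value.
/// HTTP byte ranges are inclusive on both ends, hence `end - 1`.
/// Returns `None` if there is nothing to fetch.
pub fn range_header(ranges: &[Range<Ulen>]) -> Option<String> {
    let parts: Vec<String> = ranges
        .iter()
        .filter(|r| r.start < r.end) // skip empty ranges
        .map(|r| format!("{}-{}", r.start, r.end - 1))
        .collect();
    if parts.is_empty() {
        None
    } else {
        Some(format!("bytes={}", parts.join(",")))
    }
}
```

A server that honors a multi-range request answers with a `multipart/byteranges` body that the client has to split again. Some servers ignore multi-range requests or return the whole file, so a fallback to one request per range may be needed.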

Apart from fieldnorms, we solved all of the problems you mentioned above in a more efficient way here.

That sounds great! Is there documentation about that somewhere?

@fulmicoton
Collaborator

@phiresky no, but you can have a look at the dictionary we use here.
tantivy = { git= "https://github.com/quickwit-inc/tantivy", rev="6d3e9087c"}

The idea is rather simple. FSTs inherently suck for that use case, because the locality of the data you need to read to look up one term is very bad.

Since we know precisely the characteristics of the storage we are dealing with, we just divide our dictionary into a tree of blocks.
The current implementation uses an sstable within each block because we want faster iteration over the terms, but the blocks could also be FSTs.
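
A rough sketch of the "tree of blocks" idea, reduced to two levels: a small sparse index that stays in memory, and sorted blocks that would each be fetched with one contiguous read. This is not the Quickwit sstable implementation; `BlockDict` and its methods are hypothetical, and the blocks are kept in memory here purely to keep the example self-contained.

```rust
/// Hypothetical two-level term dictionary.
pub struct BlockDict {
    /// Last term of each block, in sorted order (the sparse index).
    last_terms: Vec<String>,
    /// Sorted (term, value) entries per block; stands in for remote block data.
    blocks: Vec<Vec<(String, u64)>>,
}

impl BlockDict {
    /// Sort the entries and cut them into blocks of at most `block_len` terms.
    pub fn build(mut entries: Vec<(String, u64)>, block_len: usize) -> Self {
        assert!(block_len > 0, "block_len must be non-zero");
        entries.sort();
        let blocks: Vec<Vec<(String, u64)>> =
            entries.chunks(block_len).map(|c| c.to_vec()).collect();
        let last_terms = blocks
            .iter()
            .map(|b| b.last().expect("chunks are never empty").0.clone())
            .collect();
        Self { last_terms, blocks }
    }

    /// Index of the only block that can contain `term`, found by binary
    /// search over the sparse index. No block data is touched here.
    pub fn block_for(&self, term: &str) -> Option<usize> {
        let idx = self.last_terms.partition_point(|last| last.as_str() < term);
        if idx < self.blocks.len() { Some(idx) } else { None }
    }

    /// Look up a term: one index search, then a scan of a single block.
    pub fn get(&self, term: &str) -> Option<u64> {
        let block = &self.blocks[self.block_for(term)?];
        block.iter().find(|(t, _)| t.as_str() == term).map(|(_, v)| *v)
    }
}
```

The point of the layout is locality: a lookup touches the in-memory index plus exactly one block, so over HTTP it needs a single range request of predictable size instead of many small, scattered reads.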

@ppodolsky
Contributor

For future researchers: I've embedded Tantivy in WASM and IPFS too, together with its latest optimizations like CachingDirectory, HotCache etc.

WASM module: https://github.com/izihawa/summa/tree/master/summa-wasm

Web interface: https://github.com/izihawa/summa/tree/master/summa-web

Blog post: https://habr.com/ru/post/690252/

@fulmicoton
Collaborator

@ppodolsky

First of all, that's super cool! Google Translate did wonders translating the Russian. You should take the time to translate it and republish it in English; it should interest a lot of people.

It's also impressive that you understood all of our little quickwit tricks to make this possible :).

One point where I am unhappy however:
You copy-pasted our code into your repository. You were kind enough to retain the license header and be transparent about the paternity. The code is under the AGPL license, which has a copyleft clause. Your repo cannot be under MIT if you use this code.

@ppodolsky
Contributor

Thank you for the note. I'm not happy with this vendoring either. The only reason for doing it is that the quickwit-directory package brings in some dependencies that cannot be compiled to WASM. And you know, while you are experimenting it is hard to wait for patches to be accepted upstream, sry.

If you are OK with it, I can refactor it and put some parts of the quickwit-directory under feature flag to make it usable in WASM.

@fulmicoton
Collaborator

@ppodolsky Yes I am not a lawyer but I think putting that code behind a feature flag and somehow clarifying the situation should be ok.

I am not sure what you are doing with summa, but if you want to use & take part in the dev of quickwit instead, I'm happy to discuss how we can offer you a more lenient license.

@mre

mre commented Nov 4, 2022

What would be the best way to move forward with this?

From what I can see

  • Make a decision on the usize to Ulen=u64 change. I think this could be moved out and merged as a preliminary step. If needed, we can add the mentioned feature flag for the Ulen change to guarantee backwards compatibility.
  • Decide if we want to keep the FakeArr trait and in general how the I/O handling is done. With conditional compilation we could build the I/O handling part for wasm targets only at no cost to the existing implementation.

@phiresky
Author

phiresky commented Nov 4, 2022

@mre

Since I wrote this PR tantivy has had lots of code changes. At least some of them (the removal of FST) should make WASM support easier. So it probably makes more sense to either start from scratch or look at one of the other approaches above (summa) than to base new work on the code in this PR.

The usize -> u64 change is only needed if you want to load databases > 4GB on 32-bit systems. It could be done in a completely standalone PR but idk if quickwit cares about it.

@ppodolsky
Contributor

ppodolsky commented Nov 4, 2022

@phiresky
I'm finishing the guide and code refactoring for Summa. I hope everything will be production ready next week. Feel free to contact me if you have any questions. I saw your activity in putting Wiki (or at least SQLite in the browser, excuse me if I'm confused) on IPFS and we may have something in common here.

@ppodolsky
Contributor

I've posted documentation on the WASM part of Summa.

The sources and docs may contain some valuable hints for those who want to launch a search index inside the browser and integrate it with IPFS or any other system that provides Range requests to files over HTTP.

https://izihawa.github.io/summa/ipfs-wasm-guide

@alzinging

Is it possible to get this demo and code back?

@ppodolsky
Contributor

@alzinging

Is it possible to get this demo and code back?

Both the code and the demo live in the repo, together with documentation and guides.
