Make tantivy work in the browser with a statically hosted database and on-demand fetching #1067
Another note: the same thing could be used with an IndexedDB backend to e.g. fix the Matrix search in the browser: matrix-org/seshat#84. So far I've only looked at read-only access, but writing should be possible in a similar way.
It's not clear that IPFS supports byte range requests apart from the gateway-browser connection, so the actual files stored may have to be pre-chunked and stored as separate files. If the "read heads" dynamically vary the ranges they read over, would it be simple to replace the range query with a numbered file fetch?
In my above-linked article I actually do split the database file into multiple chunks (the db is split into 10MB files and fetched in chunks of 1kB), so my code here kinda already supports it. That's useful if e.g. the CDN (≅ IPFS gateway) can only fetch and cache whole files, not file chunks. Someone is actually using the same method of fetching file chunks with IPFS and SQLite and it works pretty well (can't link it sadly). The main limitation seems to be that most IPFS gateways throw 429 errors after ~20 requests. I also don't know if IPFS supports fetching only parts of files from the network, but if not, pre-chunking the files definitely works.
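As a rough illustration of the pre-chunking idea above, here is a minimal sketch of how a logical byte range could be mapped onto numbered chunk files. It is not the code from the PR or the article; the `ChunkRequest` type and `split_into_chunk_requests` function are hypothetical names, and the chunk size is simply whatever the files were split with (10MB in the setup described above).

```rust
use std::ops::Range;

/// One range request against a single pre-chunked file (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub struct ChunkRequest {
    /// Index of the numbered chunk file that holds the bytes.
    pub chunk_index: u64,
    /// Byte range inside that chunk file (end exclusive).
    pub range: Range<u64>,
}

/// Splits a logical byte range of the whole database into per-chunk requests.
/// A read that crosses a chunk boundary becomes one request per chunk touched.
pub fn split_into_chunk_requests(range: Range<u64>, chunk_size: u64) -> Vec<ChunkRequest> {
    assert!(chunk_size > 0, "chunk_size must be non-zero");
    let mut requests = Vec::new();
    let mut pos = range.start;
    while pos < range.end {
        let chunk_index = pos / chunk_size;
        let chunk_start = chunk_index * chunk_size;
        // Stop at the end of the current chunk file or at the end of the read.
        let end = range.end.min(chunk_start + chunk_size);
        requests.push(ChunkRequest {
            chunk_index,
            range: (pos - chunk_start)..(end - chunk_start),
        });
        pos = end;
    }
    requests
}
```

With this mapping, a 1kB read that straddles the boundary between the first and second 10MB file turns into two small requests, one against each file, so a host that can only serve whole pre-chunked files or ranges within them still works.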
(For obvious reasons, this is unmergeable, but I assume that was not the purpose of this PR?) Good job finding out how to use the new FileSlice API without any guidance. I am curious how you bridged the gap between FS syscalls and HTTP GET requests in your WASM demo?
This is an exciting PoC – demonstrates a potential way of solving ipfs/distributed-wikipedia-mirror#76 ❤️ I can provide an answer for:
IPFS supports range requests, either as HTTP range requests to a gateway, or by passing Due to this, for most use cases, as long as IPFS is used, there is no need to do additional chunking at the filesystem level.
AFAIK this is not IPFS, but an artificial limitation introduced by a specific gateway instance on Nginx or a similar reverse proxy (to mitigate ongoing abuse). Solutions: switch gateways, run your own, or use IPFS directly.
Actually, it would be great to get as much as possible of this merged :) I put this draft PR up to find out if you're interested and how best to proceed. I could split it up into separate PRs, for example. None of the changes in this PR are wasm-specific and they could be useful for other use cases as well. For example, the usize -> Ulen = u64 change accounts for the largest number of changed lines and could fairly easily be integrated without affecting any other use cases of tantivy, while allowing the use of large indexes on 32-bit systems.
The FileSlice API is good, but not that useful with how it is currently handled in the main branch I think - in many cases it is converted to OwnedBytes very soon when it could be converted later on:
Note that these changes aren't just useful for wasm, but also when memory mapping is not available in general as well as other use cases. For example, Element (Matrix) uses tantivy to index messages, but the search index is stored on disk encrypted. Their current implementation thus decrypts and loads the whole index into memory since you can't dynamically decrypt chunks with memory mapping. With these changes that could be changed to only decrypt the needed parts on demand when a query is run.
I'm not sure what exactly you mean, but that part is simple: I just compile tantivy to wasm and implement the Directory and FileHandle traits with functions that hook into TypeScript XMLHttpRequests. The hard part was making sure the Edit: I see there are already some comments by @fulmicoton about reducing reliance on memmaps here
Apart from fieldnorms, we solved all of the problems you mentioned above in a more efficient way here. That work will be cleaned up and added to tantivy soonish.
Did you find a way to parallelize the requests?
Most of the requests are sequential and synchronous, I didn't change anything there, except that it optimistically prefetches more data than needed using heuristics. I only parallelized the fetching of the actual document contents, since that was a pretty easy change - now it's a single HTTP request with multiple file ranges to get all the matched documents instead of one read for each document. I added the FileSlice method
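To make the "single HTTP request with multiple file ranges" idea concrete, here is a minimal sketch of building a multi-range `Range` header value. It is an illustration, not the code from this PR; `coalesce_ranges` and `range_header_value` are hypothetical helpers. The only thing taken from the HTTP specification is the header syntax itself: `bytes=first-last,first-last` with inclusive end offsets. A server that honors such a request answers with a multipart/byteranges body, which the client then has to split back into the individual documents.

```rust
use std::ops::Range;

/// Drops empty ranges, sorts the rest and merges overlapping or adjacent ones,
/// so the request does not ask for the same bytes twice.
pub fn coalesce_ranges(mut ranges: Vec<Range<u64>>) -> Vec<Range<u64>> {
    ranges.retain(|r| r.start < r.end);
    ranges.sort_by_key(|r| r.start);
    let mut merged: Vec<Range<u64>> = Vec::new();
    for r in ranges {
        if let Some(last) = merged.last_mut() {
            if r.start <= last.end {
                // Overlaps or touches the previous range: extend it.
                last.end = last.end.max(r.end);
                continue;
            }
        }
        merged.push(r);
    }
    merged
}

/// Builds the value of an HTTP `Range` header covering all given ranges.
/// Input ranges are end-exclusive; the header uses inclusive end offsets.
/// Returns `None` when there is nothing to fetch.
pub fn range_header_value(ranges: Vec<Range<u64>>) -> Option<String> {
    let merged = coalesce_ranges(ranges);
    if merged.is_empty() {
        return None;
    }
    let parts: Vec<String> = merged
        .iter()
        .map(|r| format!("{}-{}", r.start, r.end - 1))
        .collect();
    Some(format!("bytes={}", parts.join(",")))
}
```

Merging ranges first is a design choice of this sketch: matched documents that sit next to each other in the store end up as one range, which keeps the header short.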
That sounds great! Is there documentation about that somewhere?
@phiresky no, but you can have a look at the dictionary we use here. The idea is rather simple. FSTs inherently suck for that use case, because the locality of the data you need to read to look up one term is very bad. Since we know precisely the characteristics of the storage we are dealing with, we just divide our dictionary into a tree of blocks.
For anyone researching this further: I've embedded Tantivy in WASM and IPFS too, together with its latest optimizations like CachingDirectory, HotCache etc. WASM module: https://github.com/izihawa/summa/tree/master/summa-wasm Web interface: https://github.com/izihawa/summa/tree/master/summa-web Blog post: https://habr.com/ru/post/690252/
First of all, that's super cool! Google Translate did wonders translating the Russian. You should take the time to translate it and republish it in English; it should interest a lot of people. It's also impressive that you understood all of our little quickwit tricks to make this possible :). One point where I am unhappy however:
Thank you for the note. I'm not happy with this vendoring either. The only reason for doing it is that the quickwit-directory package brings in some dependencies that do not compile to WASM. And you know, while you are experimenting it is hard to wait for patches to be accepted upstream, sry. If you are OK with it, I can refactor it and put some parts of quickwit-directory behind a feature flag to make it usable in WASM.
@ppodolsky Yes. I am not a lawyer, but I think putting that code behind a feature flag and somehow clarifying the situation should be ok. I am not sure what you are doing with summa, but if you want to use & take part in the dev of quickwit instead, I'm happy to discuss how we can offer you a more lenient license.
What would be the best way to move forward with this? From what I can see
Since I wrote this PR, tantivy has had lots of code changes. At least some of them (the removal of FST) should make WASM support easier. So it probably makes more sense to either start from scratch or look at one of the other approaches above (summa) than to base new work on the code in this PR. The usize -> u64 change is only needed if you want to load databases > 4GB on 32bit. It could be done in a completely standalone PR, but idk if quickwit cares about it.
@phiresky |
I've posted documentation on Summa covering its WASM part. The sources and docs may contain some valuable hints for those who want to run a search index inside the browser and integrate it with IPFS or any other system that supports HTTP Range requests for files.
Is it possible to get this demo and code back?
hi everyone!
I've managed to make tantivy work in WASM with partial loading of queries via HTTP Range requests. This means it's possible to host a tantivy search engine on a statically hosted website (or a distributed website on IPFS).
For example, doing a full text search on an index of size 14 GByte takes 2 seconds, and only needs to download ~1.5MByte of the index.
Here's a demo using the English Wikipedia as well as the OpenLibrary metadata:
https://demo.phiresky.xyz/tmp-ytccrzsovkcjoylr/dist/index.html
Since tantivy heavily relies on memory mapping, this required some pretty deep changes both here and in tantivy-fst.
Of course, this whole thing is much less efficient than just using tantivy on the backend, but needing only a static file host is much cheaper, and this is much more efficient than doing the same thing with SQLite. This is also a unique feature no other database or search engine has (as far as I know).
Here are some details about how it works:
- Instead of OwnedBytes being passed around, we pass around FileSlices and only get the actual content out of them as late as possible. I've added a new trait, FakeArr<>, that is used in tantivy-fst as well as tantivy for slicing operations etc. to replace the native ones.
- usize is replaced with type Ulen = u64, because why would I want to be able to only load databases < 4GByte on a 32 bit system (like wasm)? :)
And now here are the reasons this probably can't directly be merged:
- I've changed instances of usize to Ulen somewhat blindly, because it's tons of changes, so there might be some cases where it should be kept as usize.
- No compile time flag to change Ulen back to 32 bit if someone really needs it. I doubt many people actually use tantivy on 32bit systems though.
- Dynamic invocation that probably causes unacceptable performance loss when using tantivy normally. There's dynamic invocation between FileSlice and FileHandle as well as between FakeArrSlice and FakeArr (the FileSlice and FakeArrSlice traits could probably be unified). I think this could be solved without losing any flexibility by either adding more generics and using static invocation, or by adding a compile time flag.
- I based it on tantivy 0.14, not master.
- I did not care about the .pos file so far, so that's still fetched as a whole if you need phrase queries.
Here's the code of the demo: https://github.com/phiresky/tantivy-wasm/
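The idea described above, passing slices around and only reading the actual bytes as late as possible, can be sketched roughly as follows. This is not tantivy's FileSlice or the FakeArr trait from this PR: `RangeRead`, `LazySlice` and `CountingSource` are hypothetical names made up for the illustration, and only the `Ulen = u64` alias is taken from the description. The in-memory source stands in for whatever backend actually serves the bytes (for example HTTP range requests) and counts how often it is hit.

```rust
use std::cell::Cell;
use std::ops::Range;
use std::rc::Rc;

/// 64-bit length type, so offsets beyond 4GB work on 32-bit targets like wasm.
pub type Ulen = u64;

/// Hypothetical stand-in for a backend that can serve arbitrary byte ranges.
pub trait RangeRead {
    fn len(&self) -> Ulen;
    fn read_bytes(&self, range: Range<Ulen>) -> Vec<u8>;
}

/// In-memory source that counts reads, standing in for a remote file.
pub struct CountingSource {
    data: Vec<u8>,
    pub reads: Cell<usize>,
}

impl CountingSource {
    pub fn new(data: Vec<u8>) -> Self {
        Self { data, reads: Cell::new(0) }
    }
}

impl RangeRead for CountingSource {
    fn len(&self) -> Ulen {
        self.data.len() as Ulen
    }
    fn read_bytes(&self, range: Range<Ulen>) -> Vec<u8> {
        self.reads.set(self.reads.get() + 1);
        self.data[range.start as usize..range.end as usize].to_vec()
    }
}

/// A view onto part of a source. Slicing only adjusts offsets; no bytes
/// are fetched until `read` is called.
#[derive(Clone)]
pub struct LazySlice {
    source: Rc<dyn RangeRead>,
    range: Range<Ulen>,
}

impl LazySlice {
    pub fn new(source: Rc<dyn RangeRead>) -> Self {
        let len = source.len();
        Self { source, range: 0..len }
    }
    pub fn len(&self) -> Ulen {
        self.range.end - self.range.start
    }
    /// Narrows the view to `sub`, given relative to this slice.
    pub fn slice(&self, sub: Range<Ulen>) -> LazySlice {
        assert!(sub.start <= sub.end && sub.end <= self.len(), "sub-range out of bounds");
        LazySlice {
            source: Rc::clone(&self.source),
            range: (self.range.start + sub.start)..(self.range.start + sub.end),
        }
    }
    /// Materializes the bytes; this is the only call that touches the source.
    pub fn read(&self) -> Vec<u8> {
        self.source.read_bytes(self.range.clone())
    }
}
```

Narrowing a slice twice and then reading it hits the source exactly once, and only for the final few bytes. The `Rc<dyn RangeRead>` here is also an example of the dynamic invocation mentioned in the list of concerns; a generic parameter would trade that for static dispatch.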