Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can FST read bytes forward? #12355

Open
jpountz opened this issue Jun 7, 2023 · 7 comments
Open

Can FST read bytes forward? #12355

jpountz opened this issue Jun 7, 2023 · 7 comments

Comments

@jpountz
Copy link
Contributor

jpountz commented Jun 7, 2023

Description

Forked from https://lists.apache.org/thread/1fskhmz84pp60o41txsxj2193vt9txod: the fact that FST reads bytes backwards doesn't play well with BufferedIndexInput, which triggers a refill on every byte read. Could we reverse these sequences of bytes at index time so that they would be read in forward order at search time?

@mikemccand
Copy link
Member

I think FST.Builder's pack method (long removed -- it was complex and scary optimization) did exactly this? In addition to renumbering the nodes to make the FST smaller, it would reverse the byte order? Or am I mis-remembering :)

@jpountz
Copy link
Contributor Author

jpountz commented Jun 9, 2023

You remember correctly. I refrained from mentioning FST#pack explicitly exactly because, like you said, it was scary. 😆

@mikemccand
Copy link
Member

Maybe we could de-scare (de-fang?) it upon resurrection? E.g. don't try to renumber the nodes for compression ... keep the nodes existing numbers (or maybe, reversed order, so forward only properly still holds), while reversing the bytes ... it might make it just non-scary enough to be worthwhile. Not sure.

Reading backwards is also bad even for MMapDirectory ... OS's expect and optimize for forward reading (readahead). If the FSTs are fully hot it won't matter so much (though maybe even CPU caching does some kind of readahead opto?), but if the FSTs are cold (which really ought to be rare -- these are indices of indices afterall), then it could matter quite a bit.

@dungba88
Copy link
Contributor

I just stumbled this, I agreed that reading backward is not cache-friendly. Is there a reason why we write it in backward in the first place? We are specially reversing the byte order on addNode so that we would read later in backward.

@mikemccand
Copy link
Member

+1 to find a way to reverse the bytes at compilation time.

The reversal of bytes during FST compilation is so hard to think about! It happens because the FST is logically append-only, and sort of grows backwards (from the suffixes, inwards onto prefixes), and the newly written nodes always point backwards to the already written (appended to growing byte[], or, soon DataOutput).

But logically we ought to be able to write all the bytes backwards, then reverse them, but then when resolving absolute or relative node addresses at FST read time, we'd need to re-reverse those addresses. Or, we could try to rewrite the embedded node address references during/after reversal so we don't need to re-reverse on each node read? The pointers will necessarily be different (take different number of byte[] after reversal) since small node addresses would become big node addresses and take more bytes to encode absolute. It might even make the FST larger, since the common suffixes today will have smallish/earliesh node addresses. This is similar to what pack used to do (actually rewrite addresses), and it was hairy.

So maybe for starters we do the simple "reverse byte[] after writing them all" and then "re-reverse addresses on decode"? I wonder if Tantivy FST has some sort of post-write-reverse step? Or does it always do cache-unfriendly read backwards during FST traversal?

@dungba88
Copy link
Contributor

Looking at https://github.com/BurntSushi/fst/blob/master/src/raw/node.rs, it seems Tantivy also read bytes in backward. However this Node class only works with byte array. I think the array was memory-mapped from file before that.

@dungba88
Copy link
Contributor

reverse byte[] after writing them all

Interestingly we are specifically reverse the byte[] after the write to make it backward. To make it forward we can simply not do the reverse.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants