Can FST read bytes forward? #12355

jpountz · 2023-06-07T11:17:30Z

Description

Forked from https://lists.apache.org/thread/1fskhmz84pp60o41txsxj2193vt9txod: the fact that FST reads bytes backwards doesn't play well with BufferedIndexInput, which triggers a refill on every byte read. Could we reverse these sequences of bytes at index time so that they would be read in forward order at search time?

mikemccand · 2023-06-09T10:43:52Z

I think FST.Builder's pack method (long removed -- it was complex and scary optimization) did exactly this? In addition to renumbering the nodes to make the FST smaller, it would reverse the byte order? Or am I mis-remembering :)

jpountz · 2023-06-09T11:59:34Z

You remember correctly. I refrained from mentioning FST#pack explicitly exactly because, like you said, it was scary. 😆

mikemccand · 2023-06-09T12:03:01Z

Maybe we could de-scare (de-fang?) it upon resurrection? E.g. don't try to renumber the nodes for compression ... keep the nodes existing numbers (or maybe, reversed order, so forward only properly still holds), while reversing the bytes ... it might make it just non-scary enough to be worthwhile. Not sure.

Reading backwards is also bad even for MMapDirectory ... OS's expect and optimize for forward reading (readahead). If the FSTs are fully hot it won't matter so much (though maybe even CPU caching does some kind of readahead opto?), but if the FSTs are cold (which really ought to be rare -- these are indices of indices afterall), then it could matter quite a bit.

dungba88 · 2023-11-11T04:08:31Z

I just stumbled this, I agreed that reading backward is not cache-friendly. Is there a reason why we write it in backward in the first place? We are specially reversing the byte order on addNode so that we would read later in backward.

mikemccand · 2023-11-13T12:35:30Z

+1 to find a way to reverse the bytes at compilation time.

The reversal of bytes during FST compilation is so hard to think about! It happens because the FST is logically append-only, and sort of grows backwards (from the suffixes, inwards onto prefixes), and the newly written nodes always point backwards to the already written (appended to growing byte[], or, soon DataOutput).

But logically we ought to be able to write all the bytes backwards, then reverse them, but then when resolving absolute or relative node addresses at FST read time, we'd need to re-reverse those addresses. Or, we could try to rewrite the embedded node address references during/after reversal so we don't need to re-reverse on each node read? The pointers will necessarily be different (take different number of byte[] after reversal) since small node addresses would become big node addresses and take more bytes to encode absolute. It might even make the FST larger, since the common suffixes today will have smallish/earliesh node addresses. This is similar to what pack used to do (actually rewrite addresses), and it was hairy.

So maybe for starters we do the simple "reverse byte[] after writing them all" and then "re-reverse addresses on decode"? I wonder if Tantivy FST has some sort of post-write-reverse step? Or does it always do cache-unfriendly read backwards during FST traversal?

dungba88 · 2023-11-13T13:37:04Z

Looking at https://github.com/BurntSushi/fst/blob/master/src/raw/node.rs, it seems Tantivy also read bytes in backward. However this Node class only works with byte array. I think the array was memory-mapped from file before that.

dungba88 · 2023-11-15T00:18:01Z

reverse byte[] after writing them all

Interestingly we are specifically reverse the byte[] after the write to make it backward. To make it forward we can simply not do the reverse.

jpountz mentioned this issue Jun 7, 2023

Can backward reading be less adversarial to BufferedIndexInput? #12356

Closed

mikemccand mentioned this issue Nov 14, 2023

Allow FST builder to use different writer (#12543) #12624

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Can FST read bytes forward? #12355

Can FST read bytes forward? #12355

jpountz commented Jun 7, 2023

mikemccand commented Jun 9, 2023

jpountz commented Jun 9, 2023

mikemccand commented Jun 9, 2023

dungba88 commented Nov 11, 2023

mikemccand commented Nov 13, 2023

dungba88 commented Nov 13, 2023

dungba88 commented Nov 15, 2023

Can FST read bytes forward? #12355

Can FST read bytes forward? #12355

Comments

jpountz commented Jun 7, 2023

Description

mikemccand commented Jun 9, 2023

jpountz commented Jun 9, 2023

mikemccand commented Jun 9, 2023

dungba88 commented Nov 11, 2023

mikemccand commented Nov 13, 2023

dungba88 commented Nov 13, 2023

dungba88 commented Nov 15, 2023