Format #94

fulmicoton · 2020-01-17T09:02:33Z

No description provided.

Expose StreamState

BurntSushi · 2020-01-17T11:32:25Z

@fulmicoton You seem to keep submitting PRs by mistake. Is there a way to prevent that from happening?

Also, this has caused me to take a look at your fork, and to be honest, this seems like a pretty poor situation to be in. Would it be possible for you to elaborate on the motivation for some of these changes? In particular, I notice that many of them are cosmetic, which is going to make it quite difficult to bring in any new changes to fst.

I also notice that you removed my name from the license copyright and also the authors list in the Cargo.toml. Could you explain why you did that?

fulmicoton · 2020-01-17T12:43:26Z

You seem to keep submitting PRs by mistake. Is there a way to prevent that from happening?

I sincerely apologize for all the noise. I'll try to investigate if it is possible to change the default destination branch when creating a PR, but I think this is hardcoded in github.

Also, this has caused me to take a look at your fork, and to be honest, this seems like a pretty poor > situation to be in.

Would it be possible for you to elaborate on the motivation for some of these changes?

The original motivating changes were :

Changing the data from an enum to a generics over Data: Deref<Target=[u8]>.
I need to manage my own Arc<MMap> objects.
Adding a way to stream the intersection with an automaton, with the state of the automaton. This is useful for the levenshtein automaton, if someone want to incorporate the levenshtein distance into the score

Post fork, a recent change makes it possible to stream the fst backward.

In particular, I notice that many of them are cosmetic, which is going to make it quite difficult to
bring in any new changes to fst.

I was under the impression that you were not welcoming the two motivating changes anyway, so
I ended up going through the cargo clippy warnings today, yes.

I also notice that you removed my name from the license copyright and also the authors list in the
Cargo.toml. Could you explain why you did that?

Your name is in a comment right next to the authors section.
I wanted to make sure you would not get mails about a bug in this repo.
I can put your name only or both our names if your prefer.
Let me know if you want more explicit marks in the LICENSE, README etc.

BurntSushi · 2020-01-17T12:50:57Z

Yeah I'd rather have my name added back to both places. I appreciate you being conscientious about bug reports. :-) If that becomes a serious issue that we can certainly re-evaluate. But I think it should be fine.

I think in general it would be nice to find a way to re-align here and share one crate. I'm not sure when I'll have time to think about that more deeply, but for when that time comes, to be clear, the important things blocking that for you are the generics over Deref<Target=[u8]> and streaming the intersection of an automaton along with its state? Anything else?

Post fork, a recent change makes it possible to stream the fst backward.

Oh yeah that would be great to get back into fst proper.

fulmicoton · 2020-01-17T13:09:30Z

Yeah I'd rather have my name added back to both places

I put your name back in the Cargo.toml.
https://crates.io/crates/tantivy-fst

What is the second place? If it is the license your name is in it of course.
I can add your name in the README. (Right now it is just a disclaimer asking people to depend on your crate not this one.)

I think in general it would be nice to find a way to re-align here and share one crate.

That would be great!

the important things blocking that for you are the generics over Deref<Target=[u8]> and streaming > the intersection of an automaton along with its state?

To be entirely honest, AFAIK noone is relying on streaming the state with the automaton (I should doublecheck) and I am more and more convinced that it is better to use automaton that are matching exact distance. (LevDistance=1, LevDistance=2) as opposed to (LevDistance<=2)
The rationale is that increasing the lev distance should only happen if it can yield result that could make it to the top 10. In most case, this should be avoided.

streaming backward, and Deref<Target=[u8]> are required though.

Also, just to put that on your radar : there is a bit of a push for moving tantivy IO to something that is not married with MMap. The motivation here is having tantivy work on WASI, having a distant index (on S3 for instance), and more recently matrix wants an encrypted directory.

I think this is not for tomorrow, but tantivy dictionary may move to something that can work over an io::Read. I am not entirely sure of what would be the implication, but it is possible we might use fst differently. (as an index over an SSTable, or in a pyramid of blocks to be more cache friendly and be pageable)

fulmicoton · 2020-01-17T13:22:56Z

(I asked on the github community forums if it is possible to set the default PR branch.
https://github.uint.cloudmunity/t5/How-to-use-Git-and-GitHub/Changing-the-default-PR-target-when-creating-a-PR-from-a-fork/td-p/43662 )

BurntSushi · 2020-01-17T13:26:50Z

If it is the license your name is in it of course.

Oh, I must have misread the diff. All is good then. Thanks!

Also, just to put that on your radar : there is a bit of a push for moving tantivy IO to something that is not married with MMap. The motivation here is having tantivy work on WASI, having a distant index (on S3 for instance), and more recently matrix wants an encrypted directory.

I think this is not for tomorrow, but tantivy dictionary may move to something that can work over an io::Read. I am not entirely sure of what would be the implication, but it is possible we might use fst differently. (as an index over an SSTable, or in a pyramid of blocks to be more cache friendly and be pageable)

Yeah, I think it'd be pretty challenging to use fsts without memory maps.

BurntSushi · 2020-02-21T21:11:10Z

@fulmicoton Is there any particular reason that you use Deref<[u8]> instead of AsRef<[u8]>? Generally Deref isn't meant to be parameterized over like this, and AsRef<[u8]> would be more appropriate. If I used AsRef<[u8]>, would that work for you?

I'm going to try exploring adding your two proposed features along with polishing up a bunch of stuff. I'd really like to get you back on fst proper, assuming you would like that. If there's anything else that's required, please let me know, as I'll probably be working on this over the weekend.

BurntSushi · 2020-02-21T21:13:27Z

I'm also planning on dropping the fst-regex and fst-levenshtein crates, and instead bring those back into fst proper as optional features. I know you already have your own Levenshtein implementation, so I don't plan to improve on the existing one too much with the hope that you'll upstream your implementation. As for fst-regex, my hope there is to transition to using the regex-automata crate instead of the ad hoc implementation that exists now. regex-automata has oodles of features, including performance tuning options, as well as DFA minimization and all that jazz.

This change was not nearly as bad as I thought it would be. Instead of trying to provide all the different possible ways to store some bytes, we instead make FSTs maximally flexible by accepting any type that can cheaply provide byte slice. This should resolve a number of issues with folks constructing FSTs in ways that weren't supported by the old constructors. As a bonus, we no longer need to directly depend on a specific implementation of memory maps. Conveniently, the `Mmap` type in the `memmap` crate already implements `AsRef<[u8]>`, so using memory maps is as simple as let mmap = memmap::Mmap::map(&File::open("something.fst").unwrap()); let fst = Fst::new(mmap).unwrap(); Fixes #92, Fixes #94, Fixes #97

fulmicoton · 2020-02-26T07:03:03Z

I know you already have your own Levenshtein implementation, so I don't plan to improve on the existing one too much with the hope that you'll upstream your implementation.

If by upstreaming the implementation you mean sending a PR with all of the levenshtein_automata code, I don't think this is a good idea.
These automatons can be useful without fsts and should live in a separate crate. For instance, I already used them on a trie and on an sstable.

I can add a levenshtein_automata as a dependency in the fst-levenshtein crate, and add the code to integrate it in the fst crate, is this what you had in mind?

As for fst-regex, my hope there is to transition to using the regex-automata crate instead of the ad hoc implementation that exists now. regex-automata has oodles of features, including performance tuning options, as well as DFA minimization and all that jazz.

That would be awesome.

BurntSushi · 2020-02-26T12:02:26Z

@fulmicoton Ah yeah, keeping it as a separate crate makes sense then! You can make support for fst::Automaton optional like I did for regex-automata.

fulmicoton · 2020-02-26T13:07:33Z

@BurntSushi I believe this is already the case today.

Meilisearch for instance uses your fst crate and the levenshtein_automata crate.

fulmicoton · 2020-02-26T13:09:58Z

Is there any particular reason that you use Deref<[u8]> instead of AsRef<[u8]>?

Apologies I forgot to answer this question.
My choice was only motivated by ignorance and laziness.

I am not sure what is the right one for this job. If you tell me AsRef makes more sense here, I believe you.

I think i picked Deref<Target=[u8]> because it implied a smaller diff.

BurntSushi · 2020-02-26T13:15:57Z

@fulmicoton Gotya, great. The next release of fst (whether 0.4 or 1.0) will have AsRef<[u8]>.

Another question: how big a deal is it to have the FST format break on you? That is, if fst 0.4 comes out and refuses to read older indexes, is that something you can deal with on your end? (i.e., Do your end users expect to have to re-index things occasionally?)

fulmicoton · 2020-02-26T15:16:05Z

Another question: how big a deal is it to have the FST format break on you? That is, if fst 0.4 comes out and refuses to read older indexes, is that something you can deal with on your end? (i.e., Do your end users expect to have to re-index things occasionally?)

Users are expected to keep the mean to reindex everything.
Tantivy still breaks the index format about once every two non hot-fix release nowadays.
I try to reduce it, but we switching back to the fst crate is reason enough to justify breaking the index format.

In the future, if you update the fst crate often and break the index format, I will simply delay the updates version to every year or every 6 months or so.

BurntSushi · 2020-02-26T15:19:27Z

@fulmicoton Aye. I do not anticipate breaking the index format often. If I do a 1.0 release, then I'll commit to index format stability for 1.x.y.

(My plan is to add optional CRC32 checking on the FST. Building an FST will always write a checksum, but opening an FST will not check it by default. It will be opt-in.)

fulmicoton · 2020-02-26T15:40:57Z

Understood.

FYI, the CRC32 check is a bit of anti feature for tantivy.
Tantivy already has a footer for each files it writes, that contains a CRC32 check and information about tantivy version.

BurntSushi · 2020-02-26T16:02:24Z

Yeah. If computing the CRC32 leads to measurable overhead during construction, then I'll make it optional to compute as well. Otherwise it should be practically free. But other people have experienced index corruption probably due to hardware failurea, so this is an important diagnostic tool.

fulmicoton · 2020-02-27T01:06:20Z

@BurntSushi I would be surprised if you see any overhead. No need to make it optional. I just wanted to let you know this is not something that is useful for tantivy...

This change was not nearly as bad as I thought it would be. Instead of trying to provide all the different possible ways to store some bytes, we instead make FSTs maximally flexible by accepting any type that can cheaply provide byte slice. This should resolve a number of issues with folks constructing FSTs in ways that weren't supported by the old constructors. As a bonus, we no longer need to directly depend on a specific implementation of memory maps. Conveniently, the `Mmap` type in the `memmap` crate already implements `AsRef<[u8]>`, so using memory maps is as simple as let mmap = memmap::Mmap::map(&File::open("something.fst").unwrap()); let fst = Fst::new(mmap).unwrap(); Fixes #92, Fixes #94, Fixes #97

fulmicoton and others added 30 commits February 20, 2019 09:16

Modified license.

21d2d70

Removed mmap and set

ade4483

Removed FstData from Streams.

b4650d1

Using Into

3a0925a

Using Vec<u8> by default

4b2f4c9

Removing FstData

49a4a9f

Not copying the fst data

a445080

simpler way to deal with the borrow checker

121f4bb

Moving fst-regex within the same crate

c5f7147

Editing version and cleanup

cc2d05b

Version 0.1

8e40cb5

Merge branch 'master' of https://github.com/BurntSushi/fst

c5c2854

Added the PR BurntSushi#61 to tantivy-fst

e10e671

Adapted travis ci

abe48ea

Fixing doctest

eec2957

Added travis ci badge in README

943d01c

Address review commits

5f50407

Merge pull request #1 from hntd187/master

b4da669

Expose StreamState

Added vscode to gitignore

24c2f88

Refactored to support min and max as bound

36953d1

Refactored to support min and max as bound

2205238

added basic needed functions

5f39bf3

Refactored with starting and next transition methods

54d501e

Added reverse to stream interface

f2e2d98

added tests for next transtion and starting transition

14ba77a

Refactored stack logic

1ff80bb

made basic reverse traversal work

df92a73

made more return stack work

10cc7ef

refactored done

6286c77

fixed bug with seek done

ada4235

Edition 2018

617b0b8

fulmicoton closed this Jan 17, 2020

BurntSushi mentioned this pull request Jun 7, 2021

Make fst work without memory mapping with an arbitrary "fake array" quickwit-inc/fst#15

Closed

phiresky mentioned this pull request Jun 10, 2021

Make tantivy work in the browser with a statically hosted database and on-demand fetching quickwit-oss/tantivy#1067

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Format #94

Format #94

fulmicoton commented Jan 17, 2020

BurntSushi commented Jan 17, 2020

fulmicoton commented Jan 17, 2020

BurntSushi commented Jan 17, 2020 •

edited

Loading

fulmicoton commented Jan 17, 2020

fulmicoton commented Jan 17, 2020

BurntSushi commented Jan 17, 2020

BurntSushi commented Feb 21, 2020

BurntSushi commented Feb 21, 2020

fulmicoton commented Feb 26, 2020

BurntSushi commented Feb 26, 2020

fulmicoton commented Feb 26, 2020

fulmicoton commented Feb 26, 2020

BurntSushi commented Feb 26, 2020

fulmicoton commented Feb 26, 2020

BurntSushi commented Feb 26, 2020

fulmicoton commented Feb 26, 2020

BurntSushi commented Feb 26, 2020

fulmicoton commented Feb 27, 2020

Format #94

Format #94

Conversation

fulmicoton commented Jan 17, 2020

BurntSushi commented Jan 17, 2020

fulmicoton commented Jan 17, 2020

BurntSushi commented Jan 17, 2020 • edited Loading

fulmicoton commented Jan 17, 2020

fulmicoton commented Jan 17, 2020

BurntSushi commented Jan 17, 2020

BurntSushi commented Feb 21, 2020

BurntSushi commented Feb 21, 2020

fulmicoton commented Feb 26, 2020

BurntSushi commented Feb 26, 2020

fulmicoton commented Feb 26, 2020

fulmicoton commented Feb 26, 2020

BurntSushi commented Feb 26, 2020

fulmicoton commented Feb 26, 2020

BurntSushi commented Feb 26, 2020

fulmicoton commented Feb 26, 2020

BurntSushi commented Feb 26, 2020

fulmicoton commented Feb 27, 2020

BurntSushi commented Jan 17, 2020 •

edited

Loading