Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Format #94

Closed
wants to merge 72 commits into from
Closed

Format #94

wants to merge 72 commits into from

Conversation

fulmicoton
Copy link
Contributor

No description provided.

@fulmicoton fulmicoton closed this Jan 17, 2020
@BurntSushi
Copy link
Owner

@fulmicoton You seem to keep submitting PRs by mistake. Is there a way to prevent that from happening?

Also, this has caused me to take a look at your fork, and to be honest, this seems like a pretty poor situation to be in. Would it be possible for you to elaborate on the motivation for some of these changes? In particular, I notice that many of them are cosmetic, which is going to make it quite difficult to bring in any new changes to fst.

I also notice that you removed my name from the license copyright and also the authors list in the Cargo.toml. Could you explain why you did that?

@fulmicoton
Copy link
Contributor Author

You seem to keep submitting PRs by mistake. Is there a way to prevent that from happening?

I sincerely apologize for all the noise. I'll try to investigate if it is possible to change the default destination branch when creating a PR, but I think this is hardcoded in github.

Also, this has caused me to take a look at your fork, and to be honest, this seems like a pretty poor > situation to be in.

Would it be possible for you to elaborate on the motivation for some of these changes?

The original motivating changes were :

  • Changing the data from an enum to a generics over Data: Deref<Target=[u8]>.
    I need to manage my own Arc<MMap> objects.
  • Adding a way to stream the intersection with an automaton, with the state of the automaton. This is useful for the levenshtein automaton, if someone want to incorporate the levenshtein distance into the score

Post fork, a recent change makes it possible to stream the fst backward.

In particular, I notice that many of them are cosmetic, which is going to make it quite difficult to
bring in any new changes to fst.

I was under the impression that you were not welcoming the two motivating changes anyway, so
I ended up going through the cargo clippy warnings today, yes.

I also notice that you removed my name from the license copyright and also the authors list in the
Cargo.toml. Could you explain why you did that?

Your name is in a comment right next to the authors section.
I wanted to make sure you would not get mails about a bug in this repo.
I can put your name only or both our names if your prefer.
Let me know if you want more explicit marks in the LICENSE, README etc.

@BurntSushi
Copy link
Owner

BurntSushi commented Jan 17, 2020

Yeah I'd rather have my name added back to both places. I appreciate you being conscientious about bug reports. :-) If that becomes a serious issue that we can certainly re-evaluate. But I think it should be fine.

I think in general it would be nice to find a way to re-align here and share one crate. I'm not sure when I'll have time to think about that more deeply, but for when that time comes, to be clear, the important things blocking that for you are the generics over Deref<Target=[u8]> and streaming the intersection of an automaton along with its state? Anything else?

Post fork, a recent change makes it possible to stream the fst backward.

Oh yeah that would be great to get back into fst proper.

@fulmicoton
Copy link
Contributor Author

Yeah I'd rather have my name added back to both places

I put your name back in the Cargo.toml.
https://crates.io/crates/tantivy-fst

What is the second place? If it is the license your name is in it of course.
I can add your name in the README. (Right now it is just a disclaimer asking people to depend on your crate not this one.)

I think in general it would be nice to find a way to re-align here and share one crate.

That would be great!

the important things blocking that for you are the generics over Deref<Target=[u8]> and streaming > the intersection of an automaton along with its state?

To be entirely honest, AFAIK noone is relying on streaming the state with the automaton (I should doublecheck) and I am more and more convinced that it is better to use automaton that are matching exact distance. (LevDistance=1, LevDistance=2) as opposed to (LevDistance<=2)
The rationale is that increasing the lev distance should only happen if it can yield result that could make it to the top 10. In most case, this should be avoided.

streaming backward, and Deref<Target=[u8]> are required though.

Also, just to put that on your radar : there is a bit of a push for moving tantivy IO to something that is not married with MMap. The motivation here is having tantivy work on WASI, having a distant index (on S3 for instance), and more recently matrix wants an encrypted directory.

I think this is not for tomorrow, but tantivy dictionary may move to something that can work over an io::Read. I am not entirely sure of what would be the implication, but it is possible we might use fst differently. (as an index over an SSTable, or in a pyramid of blocks to be more cache friendly and be pageable)

@fulmicoton
Copy link
Contributor Author

(I asked on the github community forums if it is possible to set the default PR branch.
https://github.uint.cloudmunity/t5/How-to-use-Git-and-GitHub/Changing-the-default-PR-target-when-creating-a-PR-from-a-fork/td-p/43662 )

@BurntSushi
Copy link
Owner

If it is the license your name is in it of course.

Oh, I must have misread the diff. All is good then. Thanks!

Also, just to put that on your radar : there is a bit of a push for moving tantivy IO to something that is not married with MMap. The motivation here is having tantivy work on WASI, having a distant index (on S3 for instance), and more recently matrix wants an encrypted directory.

I think this is not for tomorrow, but tantivy dictionary may move to something that can work over an io::Read. I am not entirely sure of what would be the implication, but it is possible we might use fst differently. (as an index over an SSTable, or in a pyramid of blocks to be more cache friendly and be pageable)

Yeah, I think it'd be pretty challenging to use fsts without memory maps.

@BurntSushi
Copy link
Owner

@fulmicoton Is there any particular reason that you use Deref<[u8]> instead of AsRef<[u8]>? Generally Deref isn't meant to be parameterized over like this, and AsRef<[u8]> would be more appropriate. If I used AsRef<[u8]>, would that work for you?

I'm going to try exploring adding your two proposed features along with polishing up a bunch of stuff. I'd really like to get you back on fst proper, assuming you would like that. If there's anything else that's required, please let me know, as I'll probably be working on this over the weekend.

@BurntSushi
Copy link
Owner

I'm also planning on dropping the fst-regex and fst-levenshtein crates, and instead bring those back into fst proper as optional features. I know you already have your own Levenshtein implementation, so I don't plan to improve on the existing one too much with the hope that you'll upstream your implementation. As for fst-regex, my hope there is to transition to using the regex-automata crate instead of the ad hoc implementation that exists now. regex-automata has oodles of features, including performance tuning options, as well as DFA minimization and all that jazz.

BurntSushi added a commit that referenced this pull request Feb 24, 2020
This change was not nearly as bad as I thought it would be. Instead of
trying to provide all the different possible ways to store some bytes,
we instead make FSTs maximally flexible by accepting any type that can
cheaply provide byte slice. This should resolve a number of issues with
folks constructing FSTs in ways that weren't supported by the old
constructors.

As a bonus, we no longer need to directly depend on a specific
implementation of memory maps. Conveniently, the `Mmap` type in the
`memmap` crate already implements `AsRef<[u8]>`, so using memory maps is
as simple as

    let mmap = memmap::Mmap::map(&File::open("something.fst").unwrap());
    let fst = Fst::new(mmap).unwrap();

Fixes #92, Fixes #94, Fixes #97
BurntSushi added a commit that referenced this pull request Feb 25, 2020
This change was not nearly as bad as I thought it would be. Instead of
trying to provide all the different possible ways to store some bytes,
we instead make FSTs maximally flexible by accepting any type that can
cheaply provide byte slice. This should resolve a number of issues with
folks constructing FSTs in ways that weren't supported by the old
constructors.

As a bonus, we no longer need to directly depend on a specific
implementation of memory maps. Conveniently, the `Mmap` type in the
`memmap` crate already implements `AsRef<[u8]>`, so using memory maps is
as simple as

    let mmap = memmap::Mmap::map(&File::open("something.fst").unwrap());
    let fst = Fst::new(mmap).unwrap();

Fixes #92, Fixes #94, Fixes #97
@fulmicoton
Copy link
Contributor Author

I know you already have your own Levenshtein implementation, so I don't plan to improve on the existing one too much with the hope that you'll upstream your implementation.

If by upstreaming the implementation you mean sending a PR with all of the levenshtein_automata code, I don't think this is a good idea.
These automatons can be useful without fsts and should live in a separate crate. For instance, I already used them on a trie and on an sstable.

I can add a levenshtein_automata as a dependency in the fst-levenshtein crate, and add the code to integrate it in the fst crate, is this what you had in mind?

As for fst-regex, my hope there is to transition to using the regex-automata crate instead of the ad hoc implementation that exists now. regex-automata has oodles of features, including performance tuning options, as well as DFA minimization and all that jazz.

That would be awesome.

@BurntSushi
Copy link
Owner

@fulmicoton Ah yeah, keeping it as a separate crate makes sense then! You can make support for fst::Automaton optional like I did for regex-automata.

@fulmicoton
Copy link
Contributor Author

@BurntSushi I believe this is already the case today.

Meilisearch for instance uses your fst crate and the levenshtein_automata crate.

@fulmicoton
Copy link
Contributor Author

Is there any particular reason that you use Deref<[u8]> instead of AsRef<[u8]>?

Apologies I forgot to answer this question.
My choice was only motivated by ignorance and laziness.

I am not sure what is the right one for this job. If you tell me AsRef makes more sense here, I believe you.

I think i picked Deref<Target=[u8]> because it implied a smaller diff.

@BurntSushi
Copy link
Owner

@fulmicoton Gotya, great. The next release of fst (whether 0.4 or 1.0) will have AsRef<[u8]>.

Another question: how big a deal is it to have the FST format break on you? That is, if fst 0.4 comes out and refuses to read older indexes, is that something you can deal with on your end? (i.e., Do your end users expect to have to re-index things occasionally?)

@fulmicoton
Copy link
Contributor Author

Another question: how big a deal is it to have the FST format break on you? That is, if fst 0.4 comes out and refuses to read older indexes, is that something you can deal with on your end? (i.e., Do your end users expect to have to re-index things occasionally?)

Users are expected to keep the mean to reindex everything.
Tantivy still breaks the index format about once every two non hot-fix release nowadays.
I try to reduce it, but we switching back to the fst crate is reason enough to justify breaking the index format.

In the future, if you update the fst crate often and break the index format, I will simply delay the updates version to every year or every 6 months or so.

@BurntSushi
Copy link
Owner

@fulmicoton Aye. I do not anticipate breaking the index format often. If I do a 1.0 release, then I'll commit to index format stability for 1.x.y.

(My plan is to add optional CRC32 checking on the FST. Building an FST will always write a checksum, but opening an FST will not check it by default. It will be opt-in.)

@fulmicoton
Copy link
Contributor Author

Understood.

FYI, the CRC32 check is a bit of anti feature for tantivy.
Tantivy already has a footer for each files it writes, that contains a CRC32 check and information about tantivy version.

@BurntSushi
Copy link
Owner

Yeah. If computing the CRC32 leads to measurable overhead during construction, then I'll make it optional to compute as well. Otherwise it should be practically free. But other people have experienced index corruption probably due to hardware failurea, so this is an important diagnostic tool.

@fulmicoton
Copy link
Contributor Author

@BurntSushi I would be surprised if you see any overhead. No need to make it optional. I just wanted to let you know this is not something that is useful for tantivy...

BurntSushi added a commit that referenced this pull request Mar 6, 2020
This change was not nearly as bad as I thought it would be. Instead of
trying to provide all the different possible ways to store some bytes,
we instead make FSTs maximally flexible by accepting any type that can
cheaply provide byte slice. This should resolve a number of issues with
folks constructing FSTs in ways that weren't supported by the old
constructors.

As a bonus, we no longer need to directly depend on a specific
implementation of memory maps. Conveniently, the `Mmap` type in the
`memmap` crate already implements `AsRef<[u8]>`, so using memory maps is
as simple as

    let mmap = memmap::Mmap::map(&File::open("something.fst").unwrap());
    let fst = Fst::new(mmap).unwrap();

Fixes #92, Fixes #94, Fixes #97
BurntSushi added a commit that referenced this pull request Mar 6, 2020
This change was not nearly as bad as I thought it would be. Instead of
trying to provide all the different possible ways to store some bytes,
we instead make FSTs maximally flexible by accepting any type that can
cheaply provide byte slice. This should resolve a number of issues with
folks constructing FSTs in ways that weren't supported by the old
constructors.

As a bonus, we no longer need to directly depend on a specific
implementation of memory maps. Conveniently, the `Mmap` type in the
`memmap` crate already implements `AsRef<[u8]>`, so using memory maps is
as simple as

    let mmap = memmap::Mmap::map(&File::open("something.fst").unwrap());
    let fst = Fst::new(mmap).unwrap();

Fixes #92, Fixes #94, Fixes #97
BurntSushi added a commit that referenced this pull request Mar 7, 2020
This change was not nearly as bad as I thought it would be. Instead of
trying to provide all the different possible ways to store some bytes,
we instead make FSTs maximally flexible by accepting any type that can
cheaply provide byte slice. This should resolve a number of issues with
folks constructing FSTs in ways that weren't supported by the old
constructors.

As a bonus, we no longer need to directly depend on a specific
implementation of memory maps. Conveniently, the `Mmap` type in the
`memmap` crate already implements `AsRef<[u8]>`, so using memory maps is
as simple as

    let mmap = memmap::Mmap::map(&File::open("something.fst").unwrap());
    let fst = Fst::new(mmap).unwrap();

Fixes #92, Fixes #94, Fixes #97
BurntSushi added a commit that referenced this pull request Mar 7, 2020
This change was not nearly as bad as I thought it would be. Instead of
trying to provide all the different possible ways to store some bytes,
we instead make FSTs maximally flexible by accepting any type that can
cheaply provide byte slice. This should resolve a number of issues with
folks constructing FSTs in ways that weren't supported by the old
constructors.

As a bonus, we no longer need to directly depend on a specific
implementation of memory maps. Conveniently, the `Mmap` type in the
`memmap` crate already implements `AsRef<[u8]>`, so using memory maps is
as simple as

    let mmap = memmap::Mmap::map(&File::open("something.fst").unwrap());
    let fst = Fst::new(mmap).unwrap();

Fixes #92, Fixes #94, Fixes #97
BurntSushi added a commit that referenced this pull request Mar 7, 2020
This change was not nearly as bad as I thought it would be. Instead of
trying to provide all the different possible ways to store some bytes,
we instead make FSTs maximally flexible by accepting any type that can
cheaply provide byte slice. This should resolve a number of issues with
folks constructing FSTs in ways that weren't supported by the old
constructors.

As a bonus, we no longer need to directly depend on a specific
implementation of memory maps. Conveniently, the `Mmap` type in the
`memmap` crate already implements `AsRef<[u8]>`, so using memory maps is
as simple as

    let mmap = memmap::Mmap::map(&File::open("something.fst").unwrap());
    let fst = Fst::new(mmap).unwrap();

Fixes #92, Fixes #94, Fixes #97
BurntSushi added a commit that referenced this pull request Mar 8, 2020
This change was not nearly as bad as I thought it would be. Instead of
trying to provide all the different possible ways to store some bytes,
we instead make FSTs maximally flexible by accepting any type that can
cheaply provide byte slice. This should resolve a number of issues with
folks constructing FSTs in ways that weren't supported by the old
constructors.

As a bonus, we no longer need to directly depend on a specific
implementation of memory maps. Conveniently, the `Mmap` type in the
`memmap` crate already implements `AsRef<[u8]>`, so using memory maps is
as simple as

    let mmap = memmap::Mmap::map(&File::open("something.fst").unwrap());
    let fst = Fst::new(mmap).unwrap();

Fixes #92, Fixes #94, Fixes #97
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants