Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat!: support to customize tokenizer #2992

Merged
merged 25 commits into from
Oct 21, 2024
Merged

Conversation

BubbleCal
Copy link
Contributor

@BubbleCal BubbleCal commented Oct 10, 2024

users can customize the tokenizer:

This introduces a breaking change: we used en_stem as default tokenizer before, which stems the words, but this PR switches the default tokenizer to be without stemming

BubbleCal and others added 17 commits September 24, 2024 18:01
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
@github-actions github-actions bot added the enhancement New feature or request label Oct 10, 2024
@BubbleCal BubbleCal changed the title feat: support to customize tokenizer feat!: support to customize tokenizer Oct 10, 2024
@BubbleCal BubbleCal force-pushed the tokenizer branch 2 times, most recently from 0aafd0f to 6ec08a7 Compare October 10, 2024 08:53
@wjones127
Copy link
Contributor

Could we also support the AsciiFoldingFilter as an option? This will allow normalizing text with accents.

@SaintBacchus
Copy link
Collaborator

nice work. will the pr have some chinese tokenizers?

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
@BubbleCal
Copy link
Contributor Author

Could we also support the AsciiFoldingFilter as an option? This will allow normalizing text with accents.

Yeah, added

@BubbleCal
Copy link
Contributor Author

nice work. will the pr have some chinese tokenizers?

No, the tokenizer is based on tantivy's, unfortunately it doesn't support Chinese/Japanese

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
@codecov-commenter
Copy link

codecov-commenter commented Oct 18, 2024

Codecov Report

Attention: Patch coverage is 60.48780% with 81 lines in your changes missing coverage. Please review.

Project coverage is 78.12%. Comparing base (f9024ce) to head (95d929a).
Report is 71 commits behind head on main.

Files with missing lines Patch % Lines
rust/lance-index/src/scalar/inverted/tokenizer.rs 42.55% 49 Missing and 5 partials ⚠️
rust/lance-index/src/scalar/inverted/index.rs 37.50% 12 Missing and 3 partials ⚠️
rust/lance-index/src/scalar.rs 11.11% 8 Missing ⚠️
rust/lance-index/src/scalar/inverted/builder.rs 94.87% 1 Missing and 3 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2992      +/-   ##
==========================================
- Coverage   78.19%   78.12%   -0.08%     
==========================================
  Files         239      240       +1     
  Lines       76782    76943     +161     
  Branches    76782    76943     +161     
==========================================
+ Hits        60043    60112      +69     
- Misses      13669    13740      +71     
- Partials     3070     3091      +21     
Flag Coverage Δ
unittests 78.12% <60.48%> (-0.08%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


🚨 Try these New Features:

@BubbleCal BubbleCal marked this pull request as ready for review October 18, 2024 06:50
@BubbleCal BubbleCal requested a review from westonpace October 18, 2024 09:02
Comment on lines +1397 to +1406
stem: bool, default False
This is for the ``INVERTED`` index. If True, the index will stem the
tokens.
remove_stop_words: bool, default False
This is for the ``INVERTED`` index. If True, the index will remove
stop words.
ascii_folding: bool, default False
This is for the ``INVERTED`` index. If True, the index will convert
non-ascii characters to ascii characters if possible.
This would remove accents like "é" -> "e".
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm on the fence whether these should be true by default. I feel like for more english text, quality would be better if these are all true. But for use cases like searching over code, I think all of these should likely be false. False seems like an okay default then, but we should definitely highlight in the user guide that setting these all to true would make sense for prose.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, usually stemming can improve recall.
Setting them to false by default cause tantivy does so

Comment on lines +8 to +10
/// Tokenizer configs
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct TokenizerConfig {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we store this anywhere? When I run optimize_indices(), will it use the same configuration as when I run create_scalar_index() with tokenizer configuration?

Copy link
Contributor Author

@BubbleCal BubbleCal Oct 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes we store it in the index, and optimize_indices would load the existing index, then it would use this configuration, see here

@wjones127
Copy link
Contributor

BTW it would be cool to have a standalone Tokenizer object users could play with. Right now the tokenizer is a black box. But users might want to run search queries through it to debug. I could imagine in LanceDB we might have a nice UI to help users do this. It would be nice if you could do something like:

tokenizer = LanceTokenizer(
    lang="en",
    lowercase=True,
    ascii_fold=True
)
test_results = tokenizer.tokenize("Hello World!")
assert test_results == [
    { "text": "hello", "pos": 0 },
    { "text": "world", "pos": 1 },
]

dataset.create_scalar_index("INVERTED", tokenizer=tokenizer)

@BubbleCal
Copy link
Contributor Author

BTW it would be cool to have a standalone Tokenizer object users could play with. Right now the tokenizer is a black box. But users might want to run search queries through it to debug. I could imagine in LanceDB we might have a nice UI to help users do this. It would be nice if you could do something like:

tokenizer = LanceTokenizer(
    lang="en",
    lowercase=True,
    ascii_fold=True
)
test_results = tokenizer.tokenize("Hello World!")
assert test_results == [
    { "text": "hello", "pos": 0 },
    { "text": "world", "pos": 1 },
]

dataset.create_scalar_index("INVERTED", tokenizer=tokenizer)

cool idea! it would be very useful for debugging, will add this in next PR

@BubbleCal BubbleCal requested a review from wjones127 October 21, 2024 05:11
Copy link
Contributor

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

Note, because there is a breaking change, we need to increment the minor version. You can either do that in this PR, or wait for Weston's PR (also a breaking change) to merge and rerun the CI job.

@westonpace
Copy link
Contributor

(the mentioned PR is merged so should just need a rebase)

@wjones127 wjones127 merged commit c152d36 into lancedb:main Oct 21, 2024
21 of 22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants