feat!: support to customize tokenizer #2992
Conversation
Could we also support the AsciiFoldingFilter as an option? This would allow normalizing text with accents.

Nice work. Will the PR have some Chinese tokenizers?
Yeah, added.

No, the tokenizer is based on tantivy's; unfortunately it doesn't support Chinese/Japanese.
Codecov Report

```
@@            Coverage Diff             @@
##             main    #2992      +/-   ##
==========================================
- Coverage   78.19%   78.12%   -0.08%
==========================================
  Files         239      240       +1
  Lines       76782    76943     +161
==========================================
+ Hits        60043    60112      +69
- Misses      13669    13740      +71
- Partials     3070     3091      +21
```
```
stem: bool, default False
    This is for the ``INVERTED`` index. If True, the index will stem the
    tokens.
remove_stop_words: bool, default False
    This is for the ``INVERTED`` index. If True, the index will remove
    stop words.
ascii_folding: bool, default False
    This is for the ``INVERTED`` index. If True, the index will convert
    non-ASCII characters to ASCII characters if possible.
    This would remove accents, e.g. "é" -> "e".
```
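As a sketch of how these options come together, assuming the pylance API shape implied by the docstring above (the dataset path and column name are made up, and the exact keyword spellings should be checked against your installed version):

```python
import lance

# Hypothetical dataset and text column, purely for illustration.
dataset = lance.dataset("/tmp/docs.lance")

# Build a full-text (INVERTED) index with the new tokenizer options.
# All three options default to False.
dataset.create_scalar_index(
    "body",
    index_type="INVERTED",
    stem=True,               # index "running" as "run"
    remove_stop_words=True,  # drop "the", "a", "of", ...
    ascii_folding=True,      # fold "é" to "e"
)
```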
I'm on the fence about whether these should be true by default. I feel like for most English text, quality would be better if these are all true. But for use cases like searching over code, I think all of these should likely be false. False seems like an okay default then, but we should definitely highlight in the user guide that setting them all to true makes sense for prose.
Right, usually stemming can improve recall.
They're set to false by default because that's what tantivy does.
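As a rough illustration of why stemming helps recall (a toy suffix-stripper, not how tantivy's en_stem actually works): the query term and the indexed term reduce to the same root, so a match that exact-token lookup would miss is recovered.

```python
def naive_stem(token: str) -> str:
    """Toy suffix-stripping stemmer, for illustration only."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

doc_tokens = ["users", "were", "searching", "documents"]
query_tokens = ["searched"]

# Without stemming the query misses the document entirely.
assert not set(query_tokens) & set(doc_tokens)

# With stemming both sides reduce to "search", so the document is recalled.
assert set(map(naive_stem, query_tokens)) & set(map(naive_stem, doc_tokens))
```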
```rust
/// Tokenizer configs
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct TokenizerConfig {
```
Do we store this anywhere? When I run optimize_indices(), will it use the same configuration as when I ran create_scalar_index() with a tokenizer configuration?
Yes, we store it in the index. optimize_indices loads the existing index and then uses this configuration; see here.
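In user-facing terms that means something like the sketch below; the dataset path and column name are invented, and I'm assuming the optimizer is reached via dataset.optimize.optimize_indices() as in recent pylance versions.

```python
import lance

dataset = lance.dataset("/tmp/docs.lance")  # hypothetical path

# Tokenizer options are given once, at index creation, and persisted
# alongside the index as its TokenizerConfig.
dataset.create_scalar_index("body", index_type="INVERTED", ascii_folding=True)

# ... append more rows to the dataset ...

# Updating the index takes no tokenizer arguments: the stored config is
# loaded from the existing index and reused for the new data.
dataset.optimize.optimize_indices()
```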
BTW it would be cool to have a standalone tokenizer API:

```python
tokenizer = LanceTokenizer(
    lang="en",
    lowercase=True,
    ascii_fold=True,
)
test_results = tokenizer.tokenize("Hello World!")
assert test_results == [
    {"text": "hello", "pos": 0},
    {"text": "world", "pos": 1},
]
dataset.create_scalar_index("INVERTED", tokenizer=tokenizer)
```
Cool idea! It would be very useful for debugging; will add this in the next PR.
Looks good.
Note: because there is a breaking change, we need to increment the minor version. You can either do that in this PR, or wait for Weston's PR (also a breaking change) to merge and then rerun the CI job.
(The mentioned PR is merged, so it should just need a rebase.)
Users can customize the tokenizer.

Solves feat: support accent insensitive full text search #2996.

This introduces a breaking change: we previously used en_stem as the default tokenizer, which stems words, but this PR switches the default tokenizer to one without stemming.
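For anyone who depended on the old en_stem default, a hedged sketch of how to opt back into stemming after this change (the keyword name is taken from the docstring discussed above; the dataset path and column are placeholders):

```python
import lance

dataset = lance.dataset("/tmp/docs.lance")  # placeholder path

# The default tokenizer no longer stems, so request it explicitly to keep
# behavior equivalent to the previous en_stem default.
dataset.create_scalar_index("body", index_type="INVERTED", stem=True)
```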