feat!: support to customize tokenizer #2992
Conversation
Could we also support the AsciiFoldingFilter as an option? This would allow normalizing text with accents.

Nice work. Will the PR have some Chinese tokenizers?
Yeah, added.

No, the tokenizer is based on tantivy's; unfortunately it doesn't support Chinese/Japanese.
Codecov Report

```
@@            Coverage Diff             @@
##             main    #2992      +/-   ##
==========================================
- Coverage   78.19%   78.12%   -0.08%
==========================================
  Files         239      240       +1
  Lines       76782    76943     +161
==========================================
+ Hits        60043    60112      +69
- Misses      13669    13740      +71
- Partials     3070     3091      +21
```
```
stem: bool, default False
    This is for the ``INVERTED`` index. If True, the index will stem the
    tokens.
remove_stop_words: bool, default False
    This is for the ``INVERTED`` index. If True, the index will remove
    stop words.
ascii_folding: bool, default False
    This is for the ``INVERTED`` index. If True, the index will convert
    non-ASCII characters to ASCII characters if possible.
    This would remove accents, e.g. "é" -> "e".
```
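As a sketch of how these options come together, assuming the pylance API shape implied by the docstring above (the dataset path and column name are made up, and the exact keyword spellings should be checked against your installed version):

```python
import lance

# Hypothetical dataset and text column, purely for illustration.
dataset = lance.dataset("/tmp/docs.lance")

# Build a full-text (INVERTED) index with the new tokenizer options.
# All three options default to False.
dataset.create_scalar_index(
    "body",
    index_type="INVERTED",
    stem=True,               # index "running" as "run"
    remove_stop_words=True,  # drop "the", "a", "of", ...
    ascii_folding=True,      # fold "é" to "e"
)
```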
I'm on the fence about whether these should be true by default. I feel like for most English text, quality would be better if these are all true. But for use cases like searching over code, I think all of these should likely be false. False seems like an okay default then, but we should definitely highlight in the user guide that setting them all to true makes sense for prose.
Right, usually stemming can improve recall.
They're set to false by default because that's what tantivy does.
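As a rough illustration of why stemming helps recall (a toy suffix-stripper, not how tantivy's en_stem actually works): the query term and the indexed term reduce to the same root, so a match that exact-token lookup would miss is recovered.

```python
def naive_stem(token: str) -> str:
    """Toy suffix-stripping stemmer, for illustration only."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

doc_tokens = ["users", "were", "searching", "documents"]
query_tokens = ["searched"]

# Without stemming the query misses the document entirely.
assert not set(query_tokens) & set(doc_tokens)

# With stemming both sides reduce to "search", so the document is recalled.
assert set(map(naive_stem, query_tokens)) & set(map(naive_stem, doc_tokens))
```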
```rust
/// Tokenizer configs
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct TokenizerConfig {
```
Do we store this anywhere? When I run optimize_indices(), will it use the same configuration as when I ran create_scalar_index() with a tokenizer configuration?
Yes, we store it in the index. optimize_indices loads the existing index and then uses this configuration; see here.
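In user-facing terms that means something like the sketch below; the dataset path and column name are invented, and I'm assuming the optimizer is reached via dataset.optimize.optimize_indices() as in recent pylance versions.

```python
import lance

dataset = lance.dataset("/tmp/docs.lance")  # hypothetical path

# Tokenizer options are given once, at index creation, and persisted
# alongside the index as its TokenizerConfig.
dataset.create_scalar_index("body", index_type="INVERTED", ascii_folding=True)

# ... append more rows to the dataset ...

# Updating the index takes no tokenizer arguments: the stored config is
# loaded from the existing index and reused for the new data.
dataset.optimize.optimize_indices()
```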
BTW it would be cool to have a standalone tokenizer API:

```python
tokenizer = LanceTokenizer(
    lang="en",
    lowercase=True,
    ascii_fold=True,
)
test_results = tokenizer.tokenize("Hello World!")
assert test_results == [
    {"text": "hello", "pos": 0},
    {"text": "world", "pos": 1},
]
dataset.create_scalar_index("INVERTED", tokenizer=tokenizer)
```
Cool idea! It would be very useful for debugging; will add this in the next PR.
Looks good.
Note: because there is a breaking change, we need to increment the minor version. You can either do that in this PR, or wait for Weston's PR (also a breaking change) to merge and then rerun the CI job.
(The mentioned PR is merged, so it should just need a rebase.)
Users can customize the tokenizer.

Solves feat: support accent insensitive full text search #2996.

This introduces a breaking change: we previously used en_stem as the default tokenizer, which stems words, but this PR switches the default tokenizer to one without stemming.
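For anyone who depended on the old en_stem default, a hedged sketch of how to opt back into stemming after this change (the keyword name is taken from the docstring discussed above; the dataset path and column are placeholders):

```python
import lance

dataset = lance.dataset("/tmp/docs.lance")  # placeholder path

# The default tokenizer no longer stems, so request it explicitly to keep
# behavior equivalent to the previous en_stem default.
dataset.create_scalar_index("body", index_type="INVERTED", stem=True)
```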