wordseg_best unable to transform English words correctly #9891
Hi, short answer: WordSegmenterModel doesn't support multi-lingual word segmentation; it is always trained on a specific language. The annotator is for languages that require segmentation, and the model you are using only supports Thai. To cover several languages you would need a mix of WordSegmenterModel for …
Sorry, closed by mistake. That being said, we will look into why the content with Thai (even with a few English words) is not performing well.
Hi @maziyarpanahi, what if a single field contains a mix of English & Thai words? Like …
That's what @danilojsl will investigate, to see if that's possible. For now, only the language of that model can be segmented via WordSegmenterModel.
Alright. Thanks @maziyarpanahi
Hi @jslim89, as @maziyarpanahi pointed out, WordSegmenter is not multi-lingual. All these models assume the document contents will be in one language only. So, suppose a sentence has a mix of languages. In that case, it will segment/combine the characters based on the language the model was trained for (in this example Thai); the other characters will be treated as single characters since the model does not know how to segment/combine them. One way to change this behavior would be for WordSegmenter to internally run a regular Tokenizer annotator, leave the tokens with non-Thai characters as they are, and only segment the Thai tokens. This requires a change in the code; @maziyarpanahi, let me know if we should proceed.
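For illustration, here is a minimal sketch of that idea in plain Python, with a `segment_thai` callable standing in for the trained model; both the helper and the regex-based script detection are assumptions for the sketch, not Spark NLP code:

```python
import re

THAI_RUN = re.compile(r"[\u0E00-\u0E7F]+")  # Thai Unicode block

def segment_mixed(text, segment_thai):
    """Tokenize non-Thai spans with whitespace splitting and
    delegate Thai spans to a language-specific segmenter."""
    tokens = []
    last = 0
    for match in THAI_RUN.finditer(text):
        # Non-Thai span before this Thai run: plain whitespace tokens
        tokens.extend(text[last:match.start()].split())
        # Thai span: hand off to the trained segmenter (stand-in here)
        tokens.extend(segment_thai(match.group()))
        last = match.end()
    tokens.extend(text[last:].split())
    return tokens
```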
@danilojsl that's interesting, it will make the annotator more flexible for sure. However, would this mean passing a TokenizerModel somehow to WordSegmenter? (That makes it complicated for saving and serialization.) What we can do instead is have a RegexTokenizer inside and control it via some parameters:
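(The original parameter list was not preserved in this thread; the sketch below only illustrates the idea, and the setter names `setEnableRegexTokenizer`, `setToLowercase`, and `setPattern` are assumptions, not a released API.)

```python
# Assumed parameter names for illustration only.
word_segmenter = (
    WordSegmenterModel.pretrained("wordseg_best", "th")
    .setInputCols(["document"])
    .setOutputCol("token")
    .setEnableRegexTokenizer(True)  # run an internal RegexTokenizer first
    .setToLowercase(False)          # optional normalization before matching
    .setPattern("\\s+")             # split non-Thai runs on whitespace
)
```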
This way we can easily save those parameters, and it allows users to customize how to tokenize the words with whitespace between them.
Description
I'm using wordseg_best, and the example given works as expected.
However, when I try it with a mix of English and Thai, the English words are not segmented properly.
Expected Behavior
English words in the mixed input should come back as whole tokens.
Current Behavior
English words are split into individual characters; only the Thai text is segmented correctly.
Possible Solution
Steps to Reproduce
Run the unit test
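The unit test itself isn't included in the thread; a minimal pipeline along these lines (the mixed-language sample string is an assumption) reproduces the behavior:

```python
import sparknlp
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import WordSegmenterModel

spark = sparknlp.start()

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

word_segmenter = WordSegmenterModel.pretrained("wordseg_best", "th") \
    .setInputCols(["document"]) \
    .setOutputCol("token")

pipeline = Pipeline(stages=[document_assembler, word_segmenter])

# Mixed Thai/English sample (illustrative); the English word comes back
# split into single characters instead of one token.
data = spark.createDataFrame([["Hello ภาษาไทยง่ายนิดเดียว"]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("token.result").show(truncate=False)
```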
Context
I'm doing a benchmark with pythainlp.
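For reference, the pythainlp side of such a benchmark would look roughly like this (the sample text is illustrative):

```python
from pythainlp.tokenize import word_tokenize

# "newmm" is pythainlp's default dictionary-based engine
print(word_tokenize("Hello ภาษาไทยง่ายนิดเดียว", engine="newmm"))
```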
Your Environment
Spark NLP version: 4.0.0
Apache Spark version: 3.2.0
Java version: openjdk version "11.0.15" 2022-04-19
Operating system: Ubuntu 18.04.4 LTS