wordseg_best unable to transform english words correctly #9891

Closed
jslim89 opened this issue Jul 1, 2022 · 7 comments · Fixed by #12854 or #12902

jslim89 commented Jul 1, 2022

Description

I'm using wordseg_best, and the example given works as expected.

However, when I try it with a mix of English & Thai, the English words are not segmented properly.

Expected Behavior

+---------------------------------------------------------------------------------------------------------------------------------------+
|term_text                                                                                                                       |
+---------------------------------------------------------------------------------------------------------------------------------------+
|[oem, loomma, สำหรับ, ฐาน, ลำโพง, apple, homepod, อุปกรณ์, เครื่อง, เสียง, ยึด, ขา, ตั้ง, ไม้, แข็ง, ตั้ง, พื้น, speaker, stands, null]|
|[v3i, 100, original, motorola, razr, v3i, quad, band, flip, gsm, bluetooth, mp3, unlocked, mobile, phone, console, gaming, controllers]|
+---------------------------------------------------------------------------------------------------------------------------------------+

Current Behavior

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|term_masterbrain                                                                                                                                                                                                                                                         |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[o, e, m, l, o, o, m, m, a, สำหรับฐาน, ล, ำ, โพง, a, p, p, l, e, h, o, m, e, p, o, d, อุปกรณ์, เครื่อง, เสียง, ยึด, ขา, ตั้ง, ไม้, แข็ง, ตั้ง, พื้น, s, p, e, a, k, e, r, s, t, a, n, d, snull]                                                                          |
|[v, 3, i1, 0, 0, o, r, i, g, i, n, a, l, m, o, t, o, r, o, l, a, r, a, z, r, v, 3, i, q, u, a, d, b, a, n, d, f, l, i, p, g, s, m, b, l, u, e, t, o, o, t, h, m, p3unlockedmobile, p, h, o, n, e, c, o, n, s, o, l, e, g, a, m, i, n, g, c, o, n, t, r, o, l, l, e, r, s]|
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Steps to Reproduce

Run the unit test

import unittest

from pyspark.sql import SparkSession
from sparknlp.annotator import *
from sparknlp.base import DocumentAssembler, Pipeline, Finisher

class TestThaiNlp(unittest.TestCase):

    def setUp(self):
        self.spark = SparkSession.builder \
            .master('local') \
            .appName('vision') \
            .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:4.0.0") \
            .getOrCreate()
        # Note the trailing commas: each row must be a one-element tuple,
        # not a bare string, or createDataFrame will reject the schema.
        self.df = self.spark.createDataFrame(
            [
                ('oem loomma สำหรับฐานลำโพง apple homepod อุปกรณ์เครื่องเสียงยึดขาตั้งไม้แข็งตั้งพื้น speaker stands null',),
                ('v3i 100 original motorola razr v3i quad band flip gsm bluetooth mp3 unlocked mobile phone console gaming controllers',),
            ],
            [
                'text',
            ]
        )

    def test_sparknlp(self):
        field = 'text'
        document_assembler = DocumentAssembler() \
            .setInputCol(field) \
            .setOutputCol(f'{field}_document')
        word_seg = WordSegmenterModel.pretrained('wordseg_best', 'th') \
            .setInputCols(f'{field}_document') \
            .setOutputCol(f'{field}_token')
        finisher = Finisher() \
            .setInputCols([f'{field}_token']) \
            .setIncludeMetadata(True)
        pipeline = Pipeline(stages=[document_assembler, word_seg, finisher])
        result = pipeline.fit(self.df).transform(self.df).withColumnRenamed(f'finished_{field}_token', f'term_{field}')
        result.select(f'term_{field}').show(2, False)

    def tearDown(self):
        self.spark.stop()

Context

I'm running a benchmark against pythainlp.

Your Environment

  • Spark NLP version: 4.0.0
  • Apache Spark version: 3.2.0
  • Java version: openjdk version "11.0.15" 2022-04-19
  • Operating System and version: Ubuntu 18.04.4 LTS
@maziyarpanahi
Member

Hi,

Short answer: WordSegmenterModel doesn't support multi-lingual word segmentation; it is always trained on a specific language.

WordSegmenterModel is for languages that require segmentation, and the model you are using only supports Thai.

Since this annotator is always trained over a specific language, for mixed content you need a combination of WordSegmenterModel for Thai and Tokenizer for English. I would suggest using LanguageDetectorDL to detect the language of each row/document, and then, based on the value of that column, using one of those two annotators to tokenize the content. (Or, if you already have a way to separate the DataFrame by language, you can run different pipelines for different languages.)
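The routing idea above can be sketched in plain Python. This is a minimal illustration only: a Thai-codepoint heuristic stands in for LanguageDetectorDL, which in a real Spark NLP pipeline would add a language column used to send each row to either the Thai WordSegmenterModel or the English Tokenizer.

```python
def dominant_script(text: str) -> str:
    """Classify a document as 'th' or 'en' by comparing character counts.

    Stand-in for LanguageDetectorDL; the Thai Unicode block is U+0E00-U+0E7F.
    """
    thai = sum(1 for ch in text if "\u0e00" <= ch <= "\u0e7f")
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    return "th" if thai > latin else "en"

docs = [
    "motorola razr v3i quad band flip",
    "อุปกรณ์เครื่องเสียงยึดขาตั้ง",
]
for d in docs:
    # A real pipeline would dispatch 'th' rows to WordSegmenterModel
    # and 'en' rows to Tokenizer based on this label.
    print(d, "->", dominant_script(d))
```
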

@maziyarpanahi
Member

Sorry, closed by mistake. That being said, we will look into why content with Thai (even with a few English words) is not performing well.
@danilojsl

@jslim89
Author

jslim89 commented Jul 1, 2022

Hi @maziyarpanahi, what if a single field contains a mix of English & Thai words? Like oem loomma สำหรับฐานลำโพง apple homepod อุปกรณ์เครื่องเสียงยึดขาตั้งไม้แข็งตั้งพื้น speaker stands null.
It's not possible to process that with spark-nlp, right?

@maziyarpanahi
Member

Hi @maziyarpanahi, what if a single field contains a mix of English & Thai words? Like oem loomma สำหรับฐานลำโพง apple homepod อุปกรณ์เครื่องเสียงยึดขาตั้งไม้แข็งตั้งพื้น speaker stands null. It's not possible to process that with spark-nlp, right?

That's what @danilojsl will investigate, to see if that's possible. For now, only the language the model was trained on can be segmented via WordSegmenterModel.

@jslim89
Author

jslim89 commented Jul 1, 2022

Alright. Thanks @maziyarpanahi

@danilojsl
Contributor

danilojsl commented Aug 18, 2022

Hi @jslim89, as @maziyarpanahi pointed out, WordSegmenter is not multi-lingual. All these models assume the document content is in a single language. So, suppose a sentence has a mix of languages. In that case, the model will segment/combine characters based on the language it was trained on (in this example, Thai), while the other characters will each be treated as a single character, since the model does not know how to segment/combine them.

One way to change this behavior would be for WordSegmenter to internally run a regular Tokenizer annotator, leave tokens with non-Thai characters as they are, and only segment the Thai tokens.
For this example, it would run the word segmenter algorithm only on the [สำหรับฐานลำโพง, อุปกรณ์เครื่องเสียงยึดขาตั้งไม้แข็งตั้งพื้น] tokens.
So the output would look something like this:
[oem, loomma, word_segmenter_output, apple, homepod, word_segmenter_output, speaker, stands, null]

This behavior requires a change in the code; @maziyarpanahi, let me know if we should proceed.
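The proposed behavior can be sketched as follows. This is an illustrative pure-Python mock-up, not the actual WordSegmenter implementation: `fake_segmenter` is an identity stand-in for the trained wordseg_best model, which would actually split a Thai token into words.

```python
import re

def is_thai(token: str) -> bool:
    # A token counts as Thai if any character falls in the Thai
    # Unicode block (U+0E00-U+0E7F).
    return any("\u0e00" <= ch <= "\u0e7f" for ch in token)

def segment_mixed(text: str, thai_segmenter) -> list:
    """Whitespace-tokenize first, then hand only the Thai tokens to the
    segmenter; non-Thai tokens pass through unchanged."""
    out = []
    for token in re.split(r"\s+", text.strip()):
        if is_thai(token):
            out.extend(thai_segmenter(token))
        else:
            out.append(token)
    return out

# Identity stand-in for the real Thai word segmenter model.
fake_segmenter = lambda t: [t]
print(segment_mixed("apple homepod สำหรับฐานลำโพง speaker", fake_segmenter))
```

With a real segmenter plugged in, the Thai token would expand into its constituent words while the English tokens stay intact, matching the expected output above.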

@maziyarpanahi
Member

@danilojsl that's interesting; it will certainly make the annotator more flexible. However, would this mean passing a TokenizerModel somehow to WordSegmenter? (That makes saving and serialization complicated.)

What we can do is have a RegexTokenizer inside and control it via some parameters:

  • enableRegexTokenizer
  • if enabled, the following parameters are used to configure the RegexTokenizer and get the results internally:
  • .setToLowercase(true)
  • .setPattern("\\s+")

This way we can easily save those parameters, and it allows users to customize how to tokenize words that have whitespace between them.

@maziyarpanahi maziyarpanahi linked a pull request Sep 29, 2022 that will close this issue
@maziyarpanahi maziyarpanahi linked a pull request Oct 11, 2022 that will close this issue