wordseg_best unable to transform english words correctly #9891

Closed
jslim89 opened this issue Jul 1, 2022 · 7 comments · Fixed by #12854 or #12902

jslim89 commented Jul 1, 2022

Description

I'm using wordseg_best, and the example given works as expected.

However, when I try it with a mix of English & Thai, the English words are not segmented properly.

Expected Behavior

+---------------------------------------------------------------------------------------------------------------------------------------+
|term_text                                                                                                                       |
+---------------------------------------------------------------------------------------------------------------------------------------+
|[oem, loomma, สำหรับ, ฐาน, ลำโพง, apple, homepod, อุปกรณ์, เครื่อง, เสียง, ยึด, ขา, ตั้ง, ไม้, แข็ง, ตั้ง, พื้น, speaker, stands, null]|
|[v3i, 100, original, motorola, razr, v3i, quad, band, flip, gsm, bluetooth, mp3, unlocked, mobile, phone, console, gaming, controllers]|
+---------------------------------------------------------------------------------------------------------------------------------------+

Current Behavior

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|term_masterbrain                                                                                                                                                                                                                                                         |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[o, e, m, l, o, o, m, m, a, สำหรับฐาน, ล, ำ, โพง, a, p, p, l, e, h, o, m, e, p, o, d, อุปกรณ์, เครื่อง, เสียง, ยึด, ขา, ตั้ง, ไม้, แข็ง, ตั้ง, พื้น, s, p, e, a, k, e, r, s, t, a, n, d, snull]                                                                          |
|[v, 3, i1, 0, 0, o, r, i, g, i, n, a, l, m, o, t, o, r, o, l, a, r, a, z, r, v, 3, i, q, u, a, d, b, a, n, d, f, l, i, p, g, s, m, b, l, u, e, t, o, o, t, h, m, p3unlockedmobile, p, h, o, n, e, c, o, n, s, o, l, e, g, a, m, i, n, g, c, o, n, t, r, o, l, l, e, r, s]|
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Steps to Reproduce

Run the unit test

import unittest

from pyspark.sql import SparkSession
from sparknlp.annotator import *
from sparknlp.base import DocumentAssembler, Pipeline, Finisher

class TestThaiNlp(unittest.TestCase):

    def setUp(self):
        self.spark = SparkSession.builder \
            .master('local') \
            .appName('vision') \
            .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:4.0.0") \
            .getOrCreate()
        # Note the trailing commas: each row must be a one-element tuple,
        # not a bare string, or createDataFrame will reject the schema.
        self.df = self.spark.createDataFrame(
            [
                ('oem loomma สำหรับฐานลำโพง apple homepod อุปกรณ์เครื่องเสียงยึดขาตั้งไม้แข็งตั้งพื้น speaker stands null',),
                ('v3i 100 original motorola razr v3i quad band flip gsm bluetooth mp3 unlocked mobile phone console gaming controllers',),
            ],
            [
                'text',
            ]
        )

    def test_sparknlp(self):
        field = 'text'
        document_assembler = DocumentAssembler() \
            .setInputCol(field) \
            .setOutputCol(f'{field}_document')
        word_seg = WordSegmenterModel.pretrained('wordseg_best', 'th') \
            .setInputCols(f'{field}_document') \
            .setOutputCol(f'{field}_token')
        finisher = Finisher() \
            .setInputCols([f'{field}_token']) \
            .setIncludeMetadata(True)
        pipeline = Pipeline(stages=[document_assembler, word_seg, finisher])
        result = pipeline.fit(self.df).transform(self.df).withColumnRenamed(f'finished_{field}_token', f'term_{field}')
        result.select(f'term_{field}').show(2, False)

    def tearDown(self):
        self.spark.stop()

Context

I'm running a benchmark against pythainlp.

Your Environment

  • Spark NLP version: 4.0.0
  • Apache Spark version: 3.2.0
  • Java version: openjdk version "11.0.15" 2022-04-19
  • Operating System and version: Ubuntu 18.04.4 LTS
@maziyarpanahi
Member

Hi,

Short answer: WordSegmenterModel doesn't support multi-lingual word segmentation; it is always trained on a specific language.

WordSegmenterModel is for languages that require segmentation, and the model you are using only supports Thai.

Since this annotator is always trained over a specific language, for mixed content you need a combination of WordSegmenterModel for Thai and Tokenizer for English. I would suggest using LanguageDetectorDL to detect the language of each row/document, and then, based on the value of that column, using one of those two annotators to tokenize the content. (Or, if you already have a way to separate the DataFrame by language, you can run different pipelines for different languages.)
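The routing idea above can be sketched in plain Python. This is a minimal illustration only: a Thai-codepoint heuristic stands in for LanguageDetectorDL, which in a real Spark NLP pipeline would add a language column used to send each row to either the Thai WordSegmenterModel or the English Tokenizer.

```python
def dominant_script(text: str) -> str:
    """Classify a document as 'th' or 'en' by comparing character counts.

    Stand-in for LanguageDetectorDL; the Thai Unicode block is U+0E00-U+0E7F.
    """
    thai = sum(1 for ch in text if "\u0e00" <= ch <= "\u0e7f")
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    return "th" if thai > latin else "en"

docs = [
    "motorola razr v3i quad band flip",
    "อุปกรณ์เครื่องเสียงยึดขาตั้ง",
]
for d in docs:
    # A real pipeline would dispatch 'th' rows to WordSegmenterModel
    # and 'en' rows to Tokenizer based on this label.
    print(d, "->", dominant_script(d))
```
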

@maziyarpanahi
Member

Sorry, closed by mistake. That being said, we will look into why content with Thai (even with a few English words) is not performing well.
@danilojsl

@jslim89
Author

jslim89 commented Jul 1, 2022

Hi @maziyarpanahi, what if a single field contains a mix of English & Thai words? Like oem loomma สำหรับฐานลำโพง apple homepod อุปกรณ์เครื่องเสียงยึดขาตั้งไม้แข็งตั้งพื้น speaker stands null.
It's not possible to process that with spark-nlp, right?

@maziyarpanahi
Member

Hi @maziyarpanahi, what if a single field contains a mix of English & Thai words? Like oem loomma สำหรับฐานลำโพง apple homepod อุปกรณ์เครื่องเสียงยึดขาตั้งไม้แข็งตั้งพื้น speaker stands null. It's not possible to process that with spark-nlp, right?

That's what @danilojsl will investigate, to see if that's possible. For now, only the language the model was trained on can be segmented via WordSegmenterModel.

@jslim89
Author

jslim89 commented Jul 1, 2022

Alright. Thanks @maziyarpanahi

@danilojsl
Contributor

danilojsl commented Aug 18, 2022

Hi @jslim89, as @maziyarpanahi pointed out, WordSegmenter is not multi-lingual. All these models assume the document content is in a single language. So, suppose a sentence has a mix of languages. In that case, the model will segment/combine characters based on the language it was trained on (in this example, Thai), while the other characters will each be treated as a single character, since the model does not know how to segment/combine them.

One way to change this behavior would be for WordSegmenter to internally run a regular Tokenizer annotator, leave tokens with non-Thai characters as they are, and only segment the Thai tokens.
For this example, it would run the word segmenter algorithm only on the [สำหรับฐานลำโพง, อุปกรณ์เครื่องเสียงยึดขาตั้งไม้แข็งตั้งพื้น] tokens.
So the output would look something like this:
[oem, loomma, word_segmenter_output, apple, homepod, word_segmenter_output, speaker, stands, null]

This behavior requires a change in the code; @maziyarpanahi, let me know if we should proceed.
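The proposed behavior can be sketched as follows. This is an illustrative pure-Python mock-up, not the actual WordSegmenter implementation: `fake_segmenter` is an identity stand-in for the trained wordseg_best model, which would actually split a Thai token into words.

```python
import re

def is_thai(token: str) -> bool:
    # A token counts as Thai if any character falls in the Thai
    # Unicode block (U+0E00-U+0E7F).
    return any("\u0e00" <= ch <= "\u0e7f" for ch in token)

def segment_mixed(text: str, thai_segmenter) -> list:
    """Whitespace-tokenize first, then hand only the Thai tokens to the
    segmenter; non-Thai tokens pass through unchanged."""
    out = []
    for token in re.split(r"\s+", text.strip()):
        if is_thai(token):
            out.extend(thai_segmenter(token))
        else:
            out.append(token)
    return out

# Identity stand-in for the real Thai word segmenter model.
fake_segmenter = lambda t: [t]
print(segment_mixed("apple homepod สำหรับฐานลำโพง speaker", fake_segmenter))
```

With a real segmenter plugged in, the Thai token would expand into its constituent words while the English tokens stay intact, matching the expected output above.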

@maziyarpanahi
Member

@danilojsl that's interesting; it will certainly make the annotator more flexible. However, would this mean passing a TokenizerModel somehow to WordSegmenter? (That makes saving and serialization complicated.)

What we can do is have a RegexTokenizer inside and control it via some parameters:

  • enableRegexTokenizer
  • if enabled, the following parameters are used to configure the RegexTokenizer and get the results internally:
  • .setToLowercase(true)
  • .setPattern("\\s+")

This way we can easily save those parameters, and it allows users to customize how to tokenize words that have whitespace between them.

@maziyarpanahi maziyarpanahi linked a pull request Sep 29, 2022 that will close this issue
@maziyarpanahi maziyarpanahi linked a pull request Oct 11, 2022 that will close this issue