How to train Linear Chain CRF for word segmentation? #7485

jackieair · 2022-03-29T07:53:14Z

Name of the Spark NLP feature whose docs need improvement:
Linear Chain CRF

What you think the docs should say:
Hi, first I want to thank you for this great NLP project.

I am new to NLP and specifically want to use LinearChainCrf for Chinese word segmentation.
As far as I know, a CRF needs feature templates (or feature functions, such as Unigram/Bigram) for training, as in CRF++.
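
For readers unfamiliar with the feature templates mentioned here, a minimal Python sketch of how a CRF++-style unigram template such as `U01:%x[-1,0]` could be expanded into a feature string. This is not Spark NLP code: `expand_template` is a hypothetical helper, and the `_PAD` value for positions outside the sentence is an assumption made for this illustration.

```python
import re

# Matches CRF++-style macros of the form %x[row_offset,column].
_MACRO = re.compile(r"%x\[(-?\d+),(\d+)\]")

def expand_template(template, rows, pos):
    """Expand one template for the token at index `pos`.

    `rows` is a list of feature rows, one per token (here, one per character);
    each row is a list of column values.
    """
    def substitute(match):
        offset, col = int(match.group(1)), int(match.group(2))
        idx = pos + offset
        if idx < 0 or idx >= len(rows):
            return "_PAD"  # assumed placeholder for out-of-range positions
        return rows[idx][col]
    return _MACRO.sub(substitute, template)

# One column per character: the character itself.
rows = [["我"], ["爱"], ["北"], ["京"]]
templates = ["U00:%x[0,0]", "U01:%x[-1,0]", "U02:%x[-1,0]/%x[0,0]"]

# Features generated for the third character ("北").
features = [expand_template(t, rows, 2) for t in templates]
print(features)
```

Each expanded string, paired with an output tag, would correspond to one feature function in a CRF++-style model.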

However, I could not find any instructions on how to use LinearChainCrf. I don't see how to set up the training pipeline for CRF (not NerCrf), what dataset format it requires, etc.

Could you please offer some help : )

maziyarpanahi · 2022-03-29T08:06:01Z

Hi @jackieair

I don't believe we have LinearChainCrf in Spark or Spark NLP. However, for training new word segmentation models for languages that don't use whitespace, such as Chinese, Korean, etc., we have a feature called WordSegmenter:

Document: https://nlp.johnsnowlabs.com/docs/en/annotators#wordsegmenter
APIs regarding parameters: https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp.annotator.WordSegmenterApproach.html

I can see that we are missing a notebook that demonstrates how to use this annotator for training a word segmenter. We will add this notebook shortly.

maziyarpanahi · 2022-03-29T08:07:02Z

@DevinTDHa Could you please add a new directory named chinese here, with a notebook that shows how to use WordSegmenterApproach for training Chinese word segmentation?

https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/jupyter/training

maziyarpanahi · 2022-03-29T08:24:20Z

> Hi, thanks for your prompt reply.
>
> I guess you've implemented LinearChainCrf already: https://github.com/JohnSnowLabs/spark-nlp/tree/master/src/main/scala/com/johnsnowlabs/ml/crf
>
> Can this version be used for word segmentation or POS-Tagging? Sorry I have little knowledge in this area, so the question may sound dumb.

We use NerCrf for training Named Entity Recognition. However, since POS and NER are both token classification tasks, then yes: you can just pretend your POS tags are NER tags. (We do have a very fast and accurate trainable POS annotator, but NerCrf can be used for POS as well.) It cannot be used for word segmentation, though: that is a different task that requires changes in how the CRF works, and our model is designed for token classification at the moment.

But it could be a good idea to look into the existing CRF and see whether we can use it for word segmentation, if there is a paper for this implementation.

maziyarpanahi · 2022-03-29T08:25:42Z

@jackieair I don't know what happened; your comment disappeared, but it still exists in my reply.

jackieair · 2022-03-29T08:51:01Z

Hi, thanks for your prompt reply.
I guess you've implemented LinearChainCrf already: https://github.com/JohnSnowLabs/spark-nlp/tree/master/src/main/scala/com/johnsnowlabs/ml/crf
Can this version be used for word segmentation or POS-Tagging? Sorry I have little knowledge in this area, so the question may sound dumb.

> We use NerCrf for training Named Entity Recognition. However, since POS and NER are both token classification tasks, then yes: you can just pretend your POS tags are NER tags. (We do have a very fast and accurate trainable POS annotator, but NerCrf can be used for POS as well.) It cannot be used for word segmentation, though: that is a different task that requires changes in how the CRF works, and our model is designed for token classification at the moment.
>
> But it could be a good idea to look into the existing CRF and see whether we can use it for word segmentation, if there is a paper for this implementation.

Did you mean that I can use NerCrf for POS Tagging?

If yes, then I can train a NerCrfModel for the POS task; I noticed NerCrf is based on LinearChainCrf.

The dataset I would use is backoff2005, so the major work is to convert backoff2005 to the required format. Do I understand correctly?
Or do I only need to set up documentAssembler, tokenizer, posTagger, and embeddings, like this:
https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/main/scala/com/johnsnowlabs/nlp/annotators/ner/crf/NerCrfApproach.scala

jackieair · 2022-03-29T08:52:07Z

> @jackieair I don't know what happened; your comment disappeared, but it still exists in my reply.

Hi, yes, I don't know why this happened. : )

maziyarpanahi · 2022-03-29T08:55:44Z

Yes, since POS is like NER, you can use NerCrfApproach for training POS. There are examples of how to do so:

You need a dataset in CoNLL 2003 format. (You will need something to convert your dataset to that format, which is the accepted format for most of the trainable annotators in Spark NLP.)
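
As a rough illustration of that conversion step (not the maintainers' tooling), here is a minimal Python sketch. It assumes your data is already available as a list of sentences, each a list of (token, tag) pairs. The function name `to_conll` and the `-X-` filler used for the POS and chunk columns are assumptions made for this example; the tag you want to learn is written in the last column, so a POS tag takes the place of the NER label.

```python
def to_conll(sentences):
    """Render sentences as CoNLL 2003 style text.

    `sentences` is a list of sentences; each sentence is a list of
    (token, tag) pairs. The tag goes in the fourth (label) column.
    """
    # Document header line, followed by a blank line.
    lines = ["-DOCSTART- -X- -X- O", ""]
    for sentence in sentences:
        for token, tag in sentence:
            # Columns: token, POS, chunk, label. The POS and chunk columns
            # hold a dummy value here (an assumption of this sketch).
            lines.append(f"{token} -X- -X- {tag}")
        lines.append("")  # a blank line ends each sentence
    return "\n".join(lines)

sentences = [
    [("I", "PRP"), ("run", "VBP")],
    [("Dogs", "NNS"), ("bark", "VBP")],
]
conll_text = to_conll(sentences)
print(conll_text)
```

The resulting text can be saved to a file and read with the CoNLL reader shown in the training notebooks.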

That repository has lots of examples. I highly suggest this part, which teaches you how to do most of the NLP tasks in Spark NLP (from notebook #1 to #16): https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/Certification_Trainings/Public

jackieair · 2022-03-29T08:58:54Z

@maziyarpanahi Many many thanks!

I'm so grateful for your kind help and prompt replies. I wish you a good day!

maziyarpanahi · 2022-03-30T08:29:09Z

@jackieair Here is a simple example of how to train a word segmenter: https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/training/chinese/word-segmentation/WordSegmenter_train_chinese_segmentation.ipynb

Obviously, the larger and better the dataset, the higher the accuracy.
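
For preparing such a dataset, a minimal Python sketch of turning pre-segmented text into character-level tags. It assumes the input has one sentence per line with words separated by whitespace (as in the backoff2005 training files), and it assumes the tag scheme LL (first character of a word), MM (middle), RR (last), and LR (single-character word) with `|` as the delimiter, as seen in the Spark NLP WordSegmenter examples. `tag_characters` is a hypothetical helper name; check the notebook above for the exact format it expects.

```python
def tag_characters(segmented_line):
    """Convert 'word word word' into 'char|TAG char|TAG ...'."""
    tagged = []
    for word in segmented_line.split():
        if len(word) == 1:
            # A single-character word is both the left and right boundary.
            tagged.append(f"{word}|LR")
        else:
            tagged.append(f"{word[0]}|LL")                    # first character
            tagged.extend(f"{ch}|MM" for ch in word[1:-1])   # middle characters
            tagged.append(f"{word[-1]}|RR")                   # last character
    return " ".join(tagged)

print(tag_characters("我 爱 北京 天安门"))
# 我|LR 爱|LR 北|LL 京|RR 天|LL 安|MM 门|RR
```

Writing one tagged sentence per line gives a plain-text file that can serve as a starting point for the training data.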

jackieair · 2022-03-30T09:07:40Z

Hi, I'm trying to train a NerCrfModel using lines 69-104 of the following file:
https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/main/scala/com/johnsnowlabs/nlp/annotators/ner/crf/NerCrfApproach.scala

However, when I use the pretrained PerceptronModel and WordEmbeddingsModel, an error like this occurs:

Exception in thread "main" com.amazonaws.SdkClientException: Unable to execute HTTP request: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1175)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1121)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:770)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:744)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:726)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:686)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:668)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:532)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:512)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4921)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4867)
at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1467)
at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1326)
at com.johnsnowlabs.client.aws.AWSGateway.getObjectFromS3(AWSGateway.scala:91)
at com.johnsnowlabs.client.aws.AWSGateway.getMetadata(AWSGateway.scala:77)
at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.downloadMetadataIfNeed(S3ResourceDownloader.scala:62)
at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.resolveLink(S3ResourceDownloader.scala:68)
at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.getDownloadSize(S3ResourceDownloader.scala:145)
at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.getDownloadSize(ResourceDownloader.scala:445)
at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadResource(ResourceDownloader.scala:370)
at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadModel(ResourceDownloader.scala:405)
at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadModel(ResourceDownloader.scala:400)
at com.johnsnowlabs.nlp.HasPretrained.pretrained(HasPretrained.scala:44)
at com.johnsnowlabs.nlp.HasPretrained.pretrained$(HasPretrained.scala:41)
at com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel$.com$johnsnowlabs$nlp$annotators$pos$perceptron$ReadablePretrainedPerceptron$$super$pretrained(PerceptronModel.scala:160)
at com.johnsnowlabs.nlp.annotators.pos.perceptron.ReadablePretrainedPerceptron.pretrained(PerceptronModel.scala:154)
at com.johnsnowlabs.nlp.annotators.pos.perceptron.ReadablePretrainedPerceptron.pretrained$(PerceptronModel.scala:154)
at com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel$.pretrained(PerceptronModel.scala:160)
at com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel$.pretrained(PerceptronModel.scala:160)
at com.johnsnowlabs.nlp.HasPretrained.pretrained(HasPretrained.scala:51)
at com.johnsnowlabs.nlp.HasPretrained.pretrained$(HasPretrained.scala:51)
at com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel$.com$johnsnowlabs$nlp$annotators$pos$perceptron$ReadablePretrainedPerceptron$$super$pretrained(PerceptronModel.scala:160)
at com.johnsnowlabs.nlp.annotators.pos.perceptron.ReadablePretrainedPerceptron.pretrained(PerceptronModel.scala:148)
at com.johnsnowlabs.nlp.annotators.pos.perceptron.ReadablePretrainedPerceptron.pretrained$(PerceptronModel.scala:148)
at com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel$.pretrained(PerceptronModel.scala:160)
at com.huawei.bigdata.ml.crftest.CrfTest$.main(CrfTest.scala:90)
at com.huawei.bigdata.ml.crftest.CrfTest.main(CrfTest.scala)
Caused by: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
at sun.security.ssl.Alerts.getSSLException(Alerts.java:192)
at sun.security.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1946)
at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:316)
at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:310)
at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1639)
at sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:223)
at sun.security.ssl.Handshaker.processLoop(Handshaker.java:1037)
at sun.security.ssl.Handshaker.process_record(Handshaker.java:965)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1064)
at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1367)
at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1395)
at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1379)
at com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:436)
at com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:384)
at com.amazonaws.http.conn.ssl.SdkTLSSocketFactory.connectSocket(SdkTLSSocketFactory.java:142)
at com.amazonaws.thirdparty.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
at com.amazonaws.thirdparty.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:374)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.amazonaws.http.conn.ClientConnectionManagerFactory$Handler.invoke(ClientConnectionManagerFactory.java:76)
at com.amazonaws.http.conn.$Proxy20.connect(Unknown Source)
at com.amazonaws.thirdparty.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
at com.amazonaws.thirdparty.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
at com.amazonaws.thirdparty.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
at com.amazonaws.thirdparty.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
at com.amazonaws.thirdparty.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
at com.amazonaws.thirdparty.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1297)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1113)
... 35 more
Caused by: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:450)
at sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:317)
at sun.security.validator.Validator.validate(Validator.java:262)
at sun.security.ssl.X509TrustManagerImpl.validate(X509TrustManagerImpl.java:330)
at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:237)
at sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:132)
at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1621)
... 62 more
Caused by: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
at sun.security.provider.certpath.SunCertPathBuilder.build(SunCertPathBuilder.java:141)
at sun.security.provider.certpath.SunCertPathBuilder.engineBuild(SunCertPathBuilder.java:126)
at java.security.cert.CertPathBuilder.build(CertPathBuilder.java:280)
at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:445)
... 68 more

I tried many methods, but unfortunately they all failed.

Is there any way to download the pretrained models so that I can load them directly from local storage instead of downloading them when training on a cluster?

maziyarpanahi · 2022-03-30T09:15:58Z

It seems you are behind some sort of proxy/firewall. You can use Google Colab or Kaggle for free with GPU, or you can follow these steps to download any model and use .load for offline use:
https://github.com/JohnSnowLabs/spark-nlp#offline

I'll close this, as it is no longer an issue (mainly the WordSegmenter example).

jackieair · 2022-03-30T09:22:29Z

> It seems you are behind some sort of proxy/firewall. You can use Google Colab or Kaggle for free with GPU, or you can follow these steps to download any model and use .load for offline use: https://github.com/JohnSnowLabs/spark-nlp#offline
>
> I'll close this, as it is no longer an issue (mainly the WordSegmenter example).

Thanks!

jackieair added the documentation label Mar 29, 2022

jackieair assigned C-K-Loan and maziyarpanahi Mar 29, 2022

maziyarpanahi assigned DevinTDHa and unassigned C-K-Loan Mar 29, 2022

jackieair closed this as completed Mar 29, 2022

jackieair reopened this Mar 29, 2022

JohnSnowLabs deleted a comment from jackieair Mar 29, 2022

maziyarpanahi closed this as completed Mar 30, 2022
