How to train Linear Chain CRF for word segmentation? #7485

Closed
jackieair opened this issue Mar 29, 2022 · 12 comments

@jackieair

jackieair commented Mar 29, 2022

Name of the Spark NLP feature whose docs need improvement:
Linear Chain CRF

What you think the docs should say:
Hi, I want to thank you for this great NLP project first.

I am new to NLP and specifically want to use LinearChainCrf for Chinese word segmentation.
As far as I know, a CRF needs feature templates (or feature functions, such as Unigram/Bigram) for training, as in CRF++.

However, I found no instructions on how to use LinearChainCrf. I don't see how to set up the training pipeline for CRF (not NerCrf), what dataset format it requires, etc.

Could you please offer some help? :)

@maziyarpanahi
Member

Hi @jackieair

I don't believe we have LinearChainCrf in Spark or Spark NLP. However, for training new word segmentation models for languages that do not separate words with whitespace, such as Chinese and Korean, we have a feature called WordSegmenter:

I can see that we are missing a notebook that demonstrates how to use this annotator for training a word segmenter. We will add this notebook shortly.

@maziyarpanahi maziyarpanahi assigned DevinTDHa and unassigned C-K-Loan Mar 29, 2022
@maziyarpanahi
Member

@DevinTDHa Could you please add a new directory named chinese here, with a notebook that shows how to use WordSegmenterApproach for training Chinese word segmentation?

https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/jupyter/training

@jackieair jackieair reopened this Mar 29, 2022
@maziyarpanahi
Member

> Hi, thanks for your prompt reply.
>
> I guess you've implemented LinearChainCrf already: https://github.com/JohnSnowLabs/spark-nlp/tree/master/src/main/scala/com/johnsnowlabs/ml/crf
>
> Can this version be used for word segmentation or POS tagging? Sorry, I have little knowledge in this area, so the question may sound dumb.

We use NerCrf for training Named Entity Recognition. However, since POS and NER are both token classification tasks, then yes: you can just pretend your POS tags are NER tags. (We do have a very fast and accurate trainable POS annotator, but NerCrf can be used for POS as well. It cannot be used for word segmentation, though: that is a different task that requires changes in how the CRF works, and our model is currently designed for token classification.)

Still, it could be a good idea to look into the existing CRF and see whether we can use it for word segmentation, if there is a paper for this implementation.

@JohnSnowLabs JohnSnowLabs deleted a comment from jackieair Mar 29, 2022
@maziyarpanahi
Member

@jackieair I don't know what happened; your comment disappeared! But it still exists in my reply.

@jackieair
Author

> Hi, thanks for your prompt reply.
> I guess you've implemented LinearChainCrf already: https://github.com/JohnSnowLabs/spark-nlp/tree/master/src/main/scala/com/johnsnowlabs/ml/crf
> Can this version be used for word segmentation or POS tagging? Sorry, I have little knowledge in this area, so the question may sound dumb.
>
> We use NerCrf for training Named Entity Recognition. However, since POS and NER are both token classification tasks, then yes: you can just pretend your POS tags are NER tags. (We do have a very fast and accurate trainable POS annotator, but NerCrf can be used for POS as well. It cannot be used for word segmentation, though: that is a different task that requires changes in how the CRF works, and our model is currently designed for token classification.)
>
> Still, it could be a good idea to look into the existing CRF and see whether we can use it for word segmentation, if there is a paper for this implementation.

Did you mean that I can use NerCrf for POS tagging?

If so, I can train a NerCrfModel for the POS task. I noticed that NerCrf is based on LinearChainCrf.

The dataset I would use is backoff2005, so the main work is converting backoff2005 to the required format. Do I understand that correctly?
Or do I only need to set up documentAssembler, tokenizer, posTagger, and embeddings as shown here:
https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/main/scala/com/johnsnowlabs/nlp/annotators/ner/crf/NerCrfApproach.scala

@jackieair
Author

> @jackieair I don't know what happened; your comment disappeared! But it still exists in my reply.

Hi, yes, I don't know why this happened. :)

@maziyarpanahi
Member

Yes, since POS is like NER, you can use NerCrfApproach for training POS. There are examples of how to do so:

You need a dataset in CoNLL 2003 format. (You will need some tool to convert your dataset to that format, which is the format accepted by most of the trainable annotators in Spark NLP.)

That repository has lots of examples. I highly recommend this part, which teaches you how to do most NLP tasks in Spark NLP (notebooks #1 to #16): https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/Certification_Trainings/Public
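
For readers who need to produce that format themselves, here is a minimal sketch. It assumes the usual CoNLL 2003 layout: four space-separated columns per line (token, POS, chunk, label), a blank line after each sentence, and a `-DOCSTART-` header line. Following the suggestion above to treat POS tags as NER tags, it puts the tag to be learned in the fourth column and fills the two unused columns with `-X-`. The function name `to_conll2003` and the filler choice are illustrative and not part of Spark NLP; which columns a given trainer actually reads should be checked against its documentation.

```python
from typing import Iterable, List, Tuple

# One sentence is a list of (token, tag) pairs.
Sentence = List[Tuple[str, str]]

# Conventional document header line of CoNLL 2003 files.
DOCSTART = "-DOCSTART- -X- -X- O"


def to_conll2003(sentences: Iterable[Sentence], filler: str = "-X-") -> str:
    """Render sentences as CoNLL 2003 text, one `token POS chunk label` line per token.

    Assumption: the tag to learn goes in the fourth (label) column, and the
    POS and chunk columns, which are not used here, hold a filler value.
    """
    lines = [DOCSTART, ""]
    for sentence in sentences:
        for token, tag in sentence:
            # Columns are space-separated, so a token must not contain whitespace.
            if not token or any(ch.isspace() for ch in token):
                raise ValueError(f"invalid token: {token!r}")
            lines.append(f"{token} {filler} {filler} {tag}")
        lines.append("")  # a blank line terminates each sentence
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical two-token sentence with POS tags in the label column.
    print(to_conll2003([[("I", "PRP"), ("run", "VBP")]]))
```

Writing the returned string to a UTF-8 file gives a dataset that a CoNLL 2003 reader should be able to load.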

@jackieair
Author

@maziyarpanahi Many many thanks!

I'm so grateful for your kind help and prompt replies. I wish you a good day!

@maziyarpanahi
Member

@jackieair Here is a simple example of how to train a word segmenter: https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/training/chinese/word-segmentation/WordSegmenter_train_chinese_segmentation.ipynb

Obviously, the larger and better the dataset, the higher the accuracy.
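
As a rough illustration of the data preparation involved: as far as I understand the WordSegmenterApproach documentation, segmentation is treated as character tagging with four labels, LL (first character of a word), MM (middle), RR (last character), and LR (a single-character word). The sketch below converts one line of an already segmented corpus (words separated by whitespace, as in the bakeoff data mentioned earlier) into `character|TAG` pairs. The function name `tag_characters` and the `|` delimiter are assumptions for illustration; the delimiter must match whatever is passed to the reader that loads the file.

```python
def tag_characters(segmented_line: str, delimiter: str = "|") -> str:
    """Convert 'word1 word2 ...' into 'c|TAG c|TAG ...' with LL/MM/RR/LR tags.

    Assumption: no character in the text is equal to the delimiter itself.
    """
    pairs = []
    # str.split() with no argument splits on any Unicode whitespace,
    # including the ideographic space (U+3000) found in some corpora.
    for word in segmented_line.split():
        if len(word) == 1:
            tags = ["LR"]  # a word consisting of a single character
        else:
            # first character, zero or more middle characters, last character
            tags = ["LL"] + ["MM"] * (len(word) - 2) + ["RR"]
        pairs.extend(f"{ch}{delimiter}{tag}" for ch, tag in zip(word, tags))
    return " ".join(pairs)


if __name__ == "__main__":
    print(tag_characters("上海 计划 到"))
```

Applying this to each line of the segmented corpus and writing the results to a text file, one sentence per line, gives a file in the tagged form that the training notebook reads.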

@jackieair
Author

Hi, I'm trying to train a NerCrfModel using lines 69-104 from the following:
https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/main/scala/com/johnsnowlabs/nlp/annotators/ner/crf/NerCrfApproach.scala

However, when I use the pretrained PerceptronModel and WordEmbeddingsModel, an error like this occurs:

Exception in thread "main" com.amazonaws.SdkClientException: Unable to execute HTTP request: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1175)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1121)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:770)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:744)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:726)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:686)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:668)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:532)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:512)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4921)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4867)
at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1467)
at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1326)
at com.johnsnowlabs.client.aws.AWSGateway.getObjectFromS3(AWSGateway.scala:91)
at com.johnsnowlabs.client.aws.AWSGateway.getMetadata(AWSGateway.scala:77)
at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.downloadMetadataIfNeed(S3ResourceDownloader.scala:62)
at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.resolveLink(S3ResourceDownloader.scala:68)
at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.getDownloadSize(S3ResourceDownloader.scala:145)
at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.getDownloadSize(ResourceDownloader.scala:445)
at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadResource(ResourceDownloader.scala:370)
at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadModel(ResourceDownloader.scala:405)
at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadModel(ResourceDownloader.scala:400)
at com.johnsnowlabs.nlp.HasPretrained.pretrained(HasPretrained.scala:44)
at com.johnsnowlabs.nlp.HasPretrained.pretrained$(HasPretrained.scala:41)
at com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel$.com$johnsnowlabs$nlp$annotators$pos$perceptron$ReadablePretrainedPerceptron$$super$pretrained(PerceptronModel.scala:160)
at com.johnsnowlabs.nlp.annotators.pos.perceptron.ReadablePretrainedPerceptron.pretrained(PerceptronModel.scala:154)
at com.johnsnowlabs.nlp.annotators.pos.perceptron.ReadablePretrainedPerceptron.pretrained$(PerceptronModel.scala:154)
at com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel$.pretrained(PerceptronModel.scala:160)
at com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel$.pretrained(PerceptronModel.scala:160)
at com.johnsnowlabs.nlp.HasPretrained.pretrained(HasPretrained.scala:51)
at com.johnsnowlabs.nlp.HasPretrained.pretrained$(HasPretrained.scala:51)
at com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel$.com$johnsnowlabs$nlp$annotators$pos$perceptron$ReadablePretrainedPerceptron$$super$pretrained(PerceptronModel.scala:160)
at com.johnsnowlabs.nlp.annotators.pos.perceptron.ReadablePretrainedPerceptron.pretrained(PerceptronModel.scala:148)
at com.johnsnowlabs.nlp.annotators.pos.perceptron.ReadablePretrainedPerceptron.pretrained$(PerceptronModel.scala:148)
at com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel$.pretrained(PerceptronModel.scala:160)
at com.huawei.bigdata.ml.crftest.CrfTest$.main(CrfTest.scala:90)
at com.huawei.bigdata.ml.crftest.CrfTest.main(CrfTest.scala)
Caused by: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
at sun.security.ssl.Alerts.getSSLException(Alerts.java:192)
at sun.security.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1946)
at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:316)
at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:310)
at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1639)
at sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:223)
at sun.security.ssl.Handshaker.processLoop(Handshaker.java:1037)
at sun.security.ssl.Handshaker.process_record(Handshaker.java:965)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1064)
at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1367)
at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1395)
at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1379)
at com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:436)
at com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:384)
at com.amazonaws.http.conn.ssl.SdkTLSSocketFactory.connectSocket(SdkTLSSocketFactory.java:142)
at com.amazonaws.thirdparty.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
at com.amazonaws.thirdparty.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:374)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.amazonaws.http.conn.ClientConnectionManagerFactory$Handler.invoke(ClientConnectionManagerFactory.java:76)
at com.amazonaws.http.conn.$Proxy20.connect(Unknown Source)
at com.amazonaws.thirdparty.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
at com.amazonaws.thirdparty.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
at com.amazonaws.thirdparty.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
at com.amazonaws.thirdparty.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
at com.amazonaws.thirdparty.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
at com.amazonaws.thirdparty.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1297)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1113)
... 35 more
Caused by: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:450)
at sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:317)
at sun.security.validator.Validator.validate(Validator.java:262)
at sun.security.ssl.X509TrustManagerImpl.validate(X509TrustManagerImpl.java:330)
at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:237)
at sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:132)
at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1621)
... 62 more
Caused by: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
at sun.security.provider.certpath.SunCertPathBuilder.build(SunCertPathBuilder.java:141)
at sun.security.provider.certpath.SunCertPathBuilder.engineBuild(SunCertPathBuilder.java:126)
at java.security.cert.CertPathBuilder.build(CertPathBuilder.java:280)
at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:445)
... 68 more

I tried many approaches, but unfortunately they all failed.

Is there any way to download the pretrained models so that I can load them directly from local storage, instead of downloading them when training on the cluster?

@maziyarpanahi
Member

It seems you are behind some sort of proxy or firewall. You can use Google Colab or Kaggle for free with a GPU, or you can follow these steps to download any model and use .load for offline use:
https://github.com/JohnSnowLabs/spark-nlp#offline

I'll close this as it is no longer an issue. (mainly WordSegmenter example)
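
A minimal sketch of the offline route, assuming the pretrained POS model has already been downloaded and extracted to a directory the driver can read. The helper names and the sanity check are illustrative only: the check relies on Spark ML writers storing a `metadata` directory at the root of a saved model. `PerceptronModel.load` is imported lazily, so the check itself runs without Spark NLP installed.

```python
from pathlib import Path


def looks_like_saved_model(model_dir: str) -> bool:
    """Return True if the directory has the `metadata` folder of a saved Spark ML model."""
    return (Path(model_dir) / "metadata").is_dir()


def load_pos_model_offline(model_dir: str):
    """Load a previously downloaded and extracted PerceptronModel without network access.

    Assumptions: `model_dir` is a local (or locally mounted) path, and the
    column names below match the rest of the pipeline.
    """
    if not looks_like_saved_model(model_dir):
        raise FileNotFoundError(f"no saved model found at {model_dir}")
    # Imported here so that the path check does not require Spark NLP.
    from sparknlp.annotator import PerceptronModel

    return (
        PerceptronModel.load(model_dir)
        .setInputCols(["document", "token"])
        .setOutputCol("pos")
    )
```

On a cluster, the model directory usually has to live on storage that every node can reach (HDFS, for example); in that case the local path check above does not apply and the storage URI is passed to `.load` directly.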

@jackieair
Author

> It seems you are behind some sort of proxy or firewall. You can use Google Colab or Kaggle for free with a GPU, or you can follow these steps to download any model and use .load for offline use: https://github.com/JohnSnowLabs/spark-nlp#offline
>
> I'll close this as it is no longer an issue. (mainly WordSegmenter example)

Thanks!
