How to train Linear Chain CRF for word segmentation? #7485
Comments
Hi @jackieair, I don't believe we have LinearChainCrf in Spark or Spark NLP. However, for training new word segmentation models for languages that don't separate words with whitespace, such as Chinese, Korean, etc., we have a feature called WordSegmenter:
I can see that we are missing a notebook that demonstrates how to use this annotator to train a word segmenter. We will add this notebook shortly.
@DevinTDHa Could you please add a new directory here: https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/jupyter/training
We use NerCrf for training Named Entity Recognition. However, since POS and NER are both token classification tasks, then yes: you can just pretend your POS tags are NER tags. (We do have a very fast and accurate trainable POS annotator, but NerCrf can be used for POS as well.) It cannot be used for word segmentation, though. That is a different task that requires changes in how the CRF works, and our model is designed for token classification at the moment. It could still be a good idea to look into the existing CRF and see whether we can use it for word segmentation, if there is a paper for this implementation.
@jackieair I don't know what happened, your comment disappeared! Its text still exists in my reply, though.
Did you mean that I can use NerCrf for POS tagging? If yes, then I can train a NerCrfModel for the POS task; I noticed NerCrf is based on LinearChainCrf. The dataset I would use is backoff2005, so the main work is to convert backoff2005 to the required format. Do I understand it correctly?
Hi, yes, I don't know why this happened. : )
Yes, since POS tagging is like NER, you can use NerCrfApproach for training POS. There are examples of how to do so:
You need a dataset in CoNLL 2003 format, which is the format accepted by most of the trainable annotators in Spark NLP, so you will need something to convert your dataset to it. That repository has lots of examples. I highly suggest this part, which teaches you how to do most NLP tasks in Spark NLP (notebooks #1 to #16): https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/Certification_Trainings/Public
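For illustration, the conversion step mentioned above could look roughly like the sketch below. It is not taken from the Spark NLP repository: the function name `to_conll` is hypothetical, and it assumes the usual four-column CoNLL 2003 layout (token, POS, chunk, NER label) with `-X-` as a placeholder in the two middle columns. Since the idea is to treat POS tags as NER tags, the tag you want to learn goes in the last column. Please check the result against the workshop examples before training.

```python
from typing import Iterable, List, Tuple

# One sentence is a list of (token, tag) pairs.
Sentence = List[Tuple[str, str]]


def to_conll(sentences: Iterable[Sentence], placeholder: str = "-X-") -> str:
    """Render (token, tag) sentences as CoNLL 2003 style text.

    Assumption: the label to be learned goes in the 4th column, and the
    POS and chunk columns are filled with a placeholder.
    """
    lines = ["-DOCSTART- -X- -X- O", ""]  # conventional document header
    for sentence in sentences:
        for token, tag in sentence:
            lines.append(f"{token} {placeholder} {placeholder} {tag}")
        lines.append("")  # a blank line ends each sentence
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical toy data; the tags are only examples.
    sample = [[("我", "PN"), ("爱", "VV"), ("北京", "NR")]]
    print(to_conll(sample))
```

The output can be written to a file and read with the CoNLL reader used in the training notebooks.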
@maziyarpanahi Many, many thanks! I'm so grateful for your kind help and prompt replies. I wish you a good day!
@jackieair Here is a simple example of how to train a word segmenter: https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/training/chinese/word-segmentation/WordSegmenter_train_chinese_segmentation.ipynb Obviously, the larger and better the dataset, the higher the accuracy.
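If your corpus is whitespace-segmented text (as in backoff2005), one way to prepare it for training is to label each character with its position inside the word. The sketch below is only an assumption about the expected format, not the notebook's code: the helper names are hypothetical, and the tag set (LL for the first character of a word, MM for a middle character, RR for the last character, LR for a single-character word) follows my understanding of the `character|TAG` style shown in the Spark NLP documentation. Please verify it against the notebook before relying on it.

```python
from typing import List


def tag_word(word: str) -> List[str]:
    """Label each character of one word with its position tag."""
    if len(word) == 1:
        return [f"{word}|LR"]  # single-character word
    middle = [f"{ch}|MM" for ch in word[1:-1]]
    return [f"{word[0]}|LL", *middle, f"{word[-1]}|RR"]


def segmented_line_to_tagged(line: str) -> str:
    """Convert 'word word word' into 'char|TAG char|TAG ...'."""
    # str.split() with no argument also splits on full-width spaces.
    return " ".join(tok for word in line.split() for tok in tag_word(word))


if __name__ == "__main__":
    print(segmented_line_to_tagged("十四 不是 四十"))
    # 十|LL 四|RR 不|LL 是|RR 四|LL 十|RR
```

Each converted line can then be saved to a text file and loaded as a POS-style training set, as the notebook does.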
Hi, I'm trying to train a NerCrfModel using lines 69-104 from the following: However, when I use the pretrained PerceptronModel and WordEmbeddingsModel, an error like this occurs:
I tried many approaches, but unfortunately they all failed. Is there a way to download the pretrained models so that I can load them from a local path instead of downloading them during training on the cluster?
It seems you are behind some sort of proxy/firewall. You can use Google Colab or Kaggle for free with a GPU, or you can follow these steps to simply download any model and use it. I'll close this as it is no longer an issue (mainly the WordSegmenter example).
Thanks!
Name of the Spark NLP feature whose docs need improvement:
Linear Chain CRF
What you think the docs should say:
Hi, I want to thank you for this great NLP project first.
I am new to NLP and want to use exactly LinearChainCrf for Chinese word segmentation.
As far as I know, a CRF needs feature templates (or feature functions, like unigram/bigram) for training, as in CRF++.
However, I found no instructions on how to use LinearChainCrf. I don't see how to set up the training pipeline for CRF (not NerCrf), what dataset format it requires, etc.
Could you please offer some help : )