forked from vistec-AI/crfcut

Thai sentence segmentation with conditional random fields

PyThaiNLP/crfcut

CRF-Cut: Sentence Segmentation

The objective of CRF-Cut (Conditional Random Fields - Cut) is to split Thai text into sentences so that the resulting sentences can be used in later processing.

To train the model, we take sentences, tokenize them into words, and assign each word a label: I (inside of sentence) or E (end of sentence).
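As a rough illustration of the labeling step, the sketch below tags every token of a sentence with I except the last one, which gets E. It assumes the sentences are already tokenized (in practice a tokenizer such as `pythainlp.tokenize.word_tokenize` would produce the tokens); the function name `label_sentences` and the sample tokens are hypothetical and not taken from the repository.

```python
from typing import List, Tuple


def label_sentences(tokenized_sentences: List[List[str]]) -> Tuple[List[str], List[str]]:
    """Flatten tokenized sentences into one token stream with I/E labels.

    Assumption: the last token of each sentence is labeled E and all
    other tokens are labeled I. Empty sentences are skipped.
    """
    tokens: List[str] = []
    labels: List[str] = []
    for sentence in tokenized_sentences:
        if not sentence:
            continue
        tokens.extend(sentence)
        labels.extend(["I"] * (len(sentence) - 1) + ["E"])
    return tokens, labels


# Hypothetical, already-tokenized input: two short sentences.
sentences = [["ฉัน", "กิน", "ข้าว"], ["เขา", "ไป", "โรงเรียน"]]
tokens, labels = label_sentences(sentences)
for token, label in zip(tokens, labels):
    print(token, label)
```

The flat token and label lists are the kind of sequence pair a CRF trainer expects: one label per token, in order.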

The results of CRF-Cut trained on different datasets are as follows:

| dataset-train | dataset-validate | I-precision | I-recall | I-fscore | E-precision | E-recall | E-fscore | space-correct |
|---|---|---|---|---|---|---|---|---|
| Ted | Ted | 0.99 | 0.99 | 0.99 | 0.74 | 0.70 | 0.72 | 0.82 |
| Ted | Orchid | 0.95 | 0.99 | 0.97 | 0.73 | 0.24 | 0.36 | 0.73 |
| Ted | Fake review | 0.98 | 0.99 | 0.98 | 0.86 | 0.70 | 0.77 | 0.78 |
| Orchid | Ted | 0.98 | 0.98 | 0.98 | 0.56 | 0.59 | 0.58 | 0.71 |
| Orchid | Orchid | 0.98 | 0.99 | 0.99 | 0.85 | 0.71 | 0.77 | 0.87 |
| Orchid | Fake review | 0.97 | 0.99 | 0.98 | 0.77 | 0.63 | 0.69 | 0.70 |
| Fake review | Ted | 0.99 | 0.95 | 0.97 | 0.42 | 0.85 | 0.56 | 0.56 |
| Fake review | Orchid | 0.97 | 0.96 | 0.96 | 0.48 | 0.59 | 0.53 | 0.67 |
| Fake review | Fake review | 1 | 1 | 1 | 0.98 | 0.96 | 0.97 | 0.97 |
| Ted + Orchid + Fake review | Ted | 0.99 | 0.98 | 0.99 | 0.66 | 0.77 | 0.71 | 0.78 |
| Ted + Orchid + Fake review | Orchid | 0.98 | 0.98 | 0.98 | 0.73 | 0.66 | 0.69 | 0.82 |
| Ted + Orchid + Fake review | Fake review | 1 | 1 | 1 | 0.98 | 0.95 | 0.96 | 0.96 |
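
For readers unfamiliar with the per-label columns, the sketch below shows how precision, recall, and F-score for a single label (I or E) can be computed from gold and predicted label sequences. It is a plain-Python illustration with made-up labels, not the evaluation code used to produce the table, and it does not cover the space-correct column, which this README does not define. The helper name `label_prf` is hypothetical.

```python
from typing import List, Tuple


def label_prf(y_true: List[str], y_pred: List[str], label: str) -> Tuple[float, float, float]:
    """Return (precision, recall, f-score) for one label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    fscore = 2 * precision * recall / total if total else 0.0
    return precision, recall, fscore


# Made-up example: the second sentence boundary is predicted one token early.
y_true = ["I", "I", "E", "I", "I", "E"]
y_pred = ["I", "I", "E", "I", "E", "I"]
e_scores = label_prf(y_true, y_pred, "E")
i_scores = label_prf(y_true, y_pred, "I")
print("E:", e_scores)
print("I:", i_scores)
```

Because E labels are far rarer than I labels, a few misplaced boundaries lower the E scores much more than the I scores, which matches the pattern in the table.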

Google colab:

Sentence Breaking Journal

What doesn't work

  • POS-perceptron
  • Larger features than window = 2, max_n_gram = 3
  • Number of verbs to the left and right
  • Rule-based override
  • L2 regularization - also not practical
  • POS-artagger - did not really help; too slow
  • ORCHID - different domains get totally different results

What to try

  • TNC

What worked

  • Fake "convolutions" of window = 2, max_n_gram = 3
  • L1 regularization of 1
  • Predict end of sentence (space) instead of beginning of sentence
  • Custom POS - only faster convergence
  • Try with ORCHID to compare performance more fairly - 87% vs 95% SOTA
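
The window and n-gram settings above can be sketched as follows: for each token, collect every n-gram (n up to max_n_gram) that lies within window tokens on either side. This is a minimal sketch, not the repository's exact feature function; the padding token `xxpad`, the feature-name format, and the helper names `extract_features` and `train_crf` are assumptions. The parameter dictionary uses python-crfsuite's `c1` (L1) and `c2` (L2) coefficients, with `c2` set to 0 on the assumption that L2 was dropped.

```python
from typing import Dict, List

PAD = "xxpad"  # hypothetical padding token for positions outside the text


def extract_features(tokens: List[str], window: int = 2, max_n_gram: int = 3) -> List[List[str]]:
    """Build n-gram features over a +/- `window` token context for each token."""
    padded = [PAD] * window + tokens + [PAD] * window
    all_features: List[List[str]] = []
    for i in range(window, len(padded) - window):
        feats: List[str] = []
        for n in range(1, max_n_gram + 1):
            # Slide an n-gram across the context [i - window, i + window].
            for j in range(i - window, i + window + 2 - n):
                gram = "|".join(padded[j:j + n])
                feats.append(f"word_{n}_{j - i}_{j - i + n}={gram}")
        all_features.append(feats)
    return all_features


# L1 regularization of 1; c2 = 0.0 (no L2) is an assumption.
CRF_PARAMS: Dict[str, float] = {"c1": 1.0, "c2": 0.0}


def train_crf(xseqs: List[List[List[str]]], yseqs: List[List[str]], model_path: str) -> None:
    """Train a CRF with python-crfsuite (imported lazily; not run here)."""
    import pycrfsuite

    trainer = pycrfsuite.Trainer(verbose=False)
    trainer.set_params(CRF_PARAMS)
    for xseq, yseq in zip(xseqs, yseqs):
        trainer.append(xseq, yseq)
    trainer.train(model_path)


features = extract_features(["ฉัน", "กิน", "ข้าว"])
print(len(features), len(features[0]))
```

With window = 2 and max_n_gram = 3, each token gets 5 unigram, 4 bigram, and 3 trigram features, 12 in total, which keeps the feature set small compared with the larger settings listed under "What doesn't work".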

Requirements

  • pythainlp
  • python-crfsuite
  • pandas
  • numpy
  • scikit-learn
  • tqdm
