docker build -t nlp:v1 . && docker run --rm -it --name nlp-service -p 0.0.0.0:8080:8080 nlp:v1
You may need to change `base_url` in the code.
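Once the container is up, a quick smoke test might look like the sketch below; note that the route `/nlp` and the payload shape are hypothetical placeholders, so check the mock scripts for the real `base_url` and request format:

```python
import requests

# Matches the port mapping in the docker run command above;
# change it if the container runs on a different host/port.
base_url = "http://localhost:8080"

# Hypothetical route and payload; see mock/nlp.py for the actual API.
resp = requests.post(f"{base_url}/nlp", json={"text": "国务院关税税则委员会按程序决定"})
print(resp.json())
```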
- `mock/db.py`: tests data insertion and query;
- `mock/nlp.py`: tests the three NLP tasks;
See `scripts/step1-build_database.ipynb`:

- simply load the excel file, then use `BeautifulSoup` to parse the HTML text;
- use `sqlite3` as the embedded DB and dump the DB file as `data/data.db`;
- store the parsed `title`, `published_date` and `content` in the DB (a condensed sketch of these steps follows).
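The sketch below captures the gist of this step; the input path, column names, and table schema are illustrative assumptions, not the exact ones in the notebook:

```python
import sqlite3

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical input path; the real file is whatever the notebook loads.
df = pd.read_excel("data/source.xlsx")

conn = sqlite3.connect("data/data.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS articles "
    "(title TEXT, published_date TEXT, content TEXT)"
)

for _, row in df.iterrows():
    # Strip HTML tags, keeping only the visible text.
    text = BeautifulSoup(row["content"], "html.parser").get_text()
    conn.execute(
        "INSERT INTO articles (title, published_date, content) VALUES (?, ?, ?)",
        (row["title"], row["published_date"], text),
    )

conn.commit()
conn.close()
```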
Word segmentation is built directly on spaCy's `zh_core_web_md` model.
- use `spacy` because it is an efficient multi-task NLP framework;
- note that I chose version `2.3.0` over `3.x` because the latest version reports an unknown bug in NER fine-tuning;
- there are four Chinese models available; I chose `zh_core_web_md` because it strikes a balance between model size and performance;
- model `zh_core_web_md==2.3.0` uses `jieba` as its segmenter;
An example of imperfect word segmentation:
Input: 国务院关税税则委员会按程序决定,对相关商品延长排除期限。
Output: ['国务院', '关税税', '则', '委员会', '按', '程序', '决定', ',', '对', '相关', '商品', '延长', '排除', '期限', '。']
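This output can be reproduced in a few lines (assuming the `zh_core_web_md` model has been downloaded):

```python
import spacy

# spaCy 2.3's Chinese pipeline delegates tokenization to jieba.
nlp = spacy.load("zh_core_web_md")
doc = nlp("国务院关税税则委员会按程序决定,对相关商品延长排除期限。")
print([token.text for token in doc])
```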
Since `jieba` was introduced many years ago, I further tested the same input on `pkuseg`, a more recent segmenter:
pkuseg output: ['国务院', '关税税', '则', '委员会', '按', '程序', '决定', ',', '对', '相关', '商品', '延长', '排除', '期限']
which did not really seem better...
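For reference, the `pkuseg` side of the comparison only takes a few lines (using its default model, with no user dictionary):

```python
import pkuseg

seg = pkuseg.pkuseg()  # default model
print(seg.cut("国务院关税税则委员会按程序决定,对相关商品延长排除期限。"))
```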
For the training code, please refer to `step2-train_ner.py`.
- source of labeled data: CLUENER2020;
- extract entities with label `gov`;
- train the NER pipeline based on spaCy's `zh_core_web_md` model (a condensed sketch follows this list).
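The sketch below shows a fine-tuning loop in the spirit of `step2-train_ner.py`; the training example and batch size are illustrative, not the actual CLUENER2020 data handling:

```python
import random

import spacy
from spacy.util import minibatch

nlp = spacy.load("zh_core_web_md")
ner = nlp.get_pipe("ner")
ner.add_label("gov")  # the entity label extracted from CLUENER2020

# Each example: (text, {"entities": [(start, end, label), ...]})
TRAIN_DATA = [
    ("公安部门依法打击违法犯罪活动", {"entities": [(0, 4, "gov")]}),
]

# Freeze the other pipes so only the NER weights are updated.
other_pipes = [p for p in nlp.pipe_names if p != "ner"]
with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.resume_training()
    for _ in range(10):  # 10 iterations, as mentioned below
        random.shuffle(TRAIN_DATA)
        for batch in minibatch(TRAIN_DATA, size=8):
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer)

nlp.to_disk("saved_model")  # later zipped into saved_model.zip
```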
To avoid training upon each deployment, I have trained and dumped the model in `saved_model.zip` (tracked and managed by git-lfs).
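At serving time, the fine-tuned pipeline can then be restored directly (assuming the archive is unzipped to a `saved_model` directory):

```python
import spacy

# Restore the dumped pipeline instead of retraining on each deployment.
nlp = spacy.load("saved_model")
```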
After 10 iterations of training, the pipeline was able to recognize '公安部门' and even a complex entity like '省级住房和城乡建设、水利、财政部门'.
Segmentation errors can lead to NER errors. Using the same example as in Task 1:
Input: 国务院关税税则委员会按程序决定,对相关商品延长排除期限。
TASK1: SEG {'segment': ['国务院', '关税税', '则', '委员会', '按', '程序', '决定', ',', '对', '相关', '商品', '延长', '排除', '期限', '。']}
TASK2: NER {'entities': [{'entity': '国务院', 'beginning_position': 0}, {'entity': '委员会', 'beginning_position': 7}]}
where the gold prediction should be '国务院关税税则委员会'. NER could possibly perform better on this sample with a better segmenter.
Identify the action, the object of the action, and the modifier of the object, given that the subject is a department.
After reading materials on dependency parsing, I decided to build a rule-based dependency parser.
Everything related is packed and well-annotated in `model/finetune_ner.py`. Key functions: `_subtree_boundary`, `_dependency_analysis`, `extract_components_from_root` (a simplified sketch of the idea follows).
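Below is a much-simplified illustration of the rule-based idea, not the actual implementation of those functions: start from a verb, take its direct object, and treat whatever sits to the left inside the object's subtree as the modifier. The `dobj` label here is an assumption about what the parser emits for these sentences.

```python
import spacy

nlp = spacy.load("zh_core_web_md")

def extract_components(text):
    """Toy action/object/modifier extraction from the dependency tree."""
    components = []
    for token in nlp(text):
        if token.pos_ != "VERB":
            continue
        for child in token.children:
            if child.dep_ == "dobj":  # direct object of the verb
                # Tokens in the object's subtree that precede it act as modifiers.
                modifier = "".join(t.text for t in child.subtree if t.i < child.i)
                components.append(
                    {"action": token.text, "object": child.text,
                     "modifier": modifier or None}
                )
    return components

print(extract_components("公安部门依法打击利用黑客手段的违法犯罪活动"))
```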
In the following example, SEG and NER look fine, but the parser failed to find the action, object, and modifier:
Input: 公安部门依法打击利用黑客手段提供有偿“刷课”服务违法犯罪活动
TASK1: SEG {'segment': ['公安', '部门', '依法', '打击', '利用', '黑客', '手段', '提供', '有偿', '“', '刷课', '”', '服务', '违法', '犯罪', '活动']}
TASK2: NER {'entities': [{'entity': '公安部门', 'beginning_position': 0}]}
TASK3: DEP {'components': []}
It could be that the object and modifier are too long and ambiguous. After replacing '提供有偿“刷课”服务' with '的', the sentence seems to make more sense, and the parser managed to give a better result:
Input: 公安部门依法打击利用黑客手段的违法犯罪活动
TASK1: SEG {'segment': ['公安', '部门', '依法', '打击', '利用', '黑客', '手段', '的', '违法', '犯罪', '活动']}
TASK2: NER {'entities': [{'entity': '公安部门', 'beginning_position': 0}]}
TASK3: DEP {'components': [{'action': '打击', 'action_position': 6, 'object': '违法犯罪活动', 'object_position': 15, 'modifier': '利用黑客手段的', 'modifier_position': 8}]}
The dependency tree of the original sentence looks like:
and for the modified sentence:
In the original version, '利用' is a direct child of '打击' with dependency `ccomp` (meaning 'clausal complement'). Obviously, the dependency parser treated '打击利用' as a clause. Besides, '提供' is treated as a `conj` (conjunct) of '打击', which is also incorrect.
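The arcs behind both trees can also be inspected without rendering; a quick way to compare the two parses (`spacy.displacy` can draw the trees shown above):

```python
import spacy

nlp = spacy.load("zh_core_web_md")
for text in [
    "公安部门依法打击利用黑客手段提供有偿“刷课”服务违法犯罪活动",
    "公安部门依法打击利用黑客手段的违法犯罪活动",
]:
    # Print (token, dependency label, head) triples for each parse.
    print([(t.text, t.dep_, t.head.text) for t in nlp(text)])
```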