stanfordcorenlp is a Python wrapper for Stanford CoreNLP. It provides a simple API for text processing tasks such as tokenization, part-of-speech tagging, named entity recognition, constituency parsing, dependency parsing, and more.
Java 1.8+ (check with the command: java -version) (Download Page)
Stanford CoreNLP 3.7.0 (Download Page)
pip install stanfordcorenlp
# Simple usage
from stanfordcorenlp import StanfordCoreNLP
nlp = StanfordCoreNLP(r'G:/JavaLibraries/stanford-corenlp-full-2016-10-31/')
sentence = 'Guangdong University of Foreign Studies is located in Guangzhou.'
print('Tokenize:', nlp.word_tokenize(sentence))
print('Part of Speech:', nlp.pos_tag(sentence))
print('Named Entities:', nlp.ner(sentence))
print('Constituency Parsing:', nlp.parse(sentence))
print('Dependency Parsing:', nlp.dependency_parse(sentence))
Output format (as printed under Python 2; Python 3 omits the u prefix on strings):
# Tokenize
[u'Guangdong', u'University', u'of', u'Foreign', u'Studies', u'is', u'located', u'in', u'Guangzhou', u'.']
# Part of Speech
[(u'Guangdong', u'NNP'), (u'University', u'NNP'), (u'of', u'IN'), (u'Foreign', u'NNP'), (u'Studies', u'NNPS'), (u'is', u'VBZ'), (u'located', u'JJ'), (u'in', u'IN'), (u'Guangzhou', u'NNP'), (u'.', u'.')]
# Named Entities
[(u'Guangdong', u'ORGANIZATION'), (u'University', u'ORGANIZATION'), (u'of', u'ORGANIZATION'), (u'Foreign', u'ORGANIZATION'), (u'Studies', u'ORGANIZATION'), (u'is', u'O'), (u'located', u'O'), (u'in', u'O'), (u'Guangzhou', u'LOCATION'), (u'.', u'O')]
# Constituency Parsing
(ROOT
  (S
    (NP
      (NP (NNP Guangdong) (NNP University))
      (PP (IN of)
        (NP (NNP Foreign) (NNPS Studies))))
    (VP (VBZ is)
      (ADJP (JJ located)
        (PP (IN in)
          (NP (NNP Guangzhou)))))
    (. .)))
# Dependency Parsing
[(u'ROOT', 0, 7), (u'compound', 2, 1), (u'nsubjpass', 7, 2), (u'case', 5, 3), (u'compound', 5, 4), (u'nmod', 2, 5), (u'auxpass', 7, 6), (u'case', 9, 8), (u'nmod', 7, 9), (u'punct', 7, 10)]
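The tagger and parser results above are plain lists of tuples, so they are easy to post-process. The following is a minimal sketch, not part of the wrapper: group_entities and readable_dependencies are hypothetical helper names, and the reading of each dependency triple as (relation, head index, dependent index) with 1-based token indices and 0 for ROOT is inferred from the sample output above.

```python
from itertools import groupby

# Data copied from the sample output above (Python 3 string literals).
tokens = ['Guangdong', 'University', 'of', 'Foreign', 'Studies',
          'is', 'located', 'in', 'Guangzhou', '.']
ner_tags = [('Guangdong', 'ORGANIZATION'), ('University', 'ORGANIZATION'),
            ('of', 'ORGANIZATION'), ('Foreign', 'ORGANIZATION'),
            ('Studies', 'ORGANIZATION'), ('is', 'O'), ('located', 'O'),
            ('in', 'O'), ('Guangzhou', 'LOCATION'), ('.', 'O')]
dependencies = [('ROOT', 0, 7), ('compound', 2, 1), ('nsubjpass', 7, 2),
                ('case', 5, 3), ('compound', 5, 4), ('nmod', 2, 5),
                ('auxpass', 7, 6), ('case', 9, 8), ('nmod', 7, 9),
                ('punct', 7, 10)]


def group_entities(tagged, outside='O'):
    """Merge runs of adjacent tokens that share the same entity label."""
    entities = []
    for label, run in groupby(tagged, key=lambda pair: pair[1]):
        if label != outside:
            entities.append((' '.join(word for word, _ in run), label))
    return entities


def readable_dependencies(words, triples):
    """Swap token indices for words (assumed 1-based, with 0 meaning ROOT)."""
    lookup = ['ROOT'] + list(words)
    return [(rel, lookup[head], lookup[dep]) for rel, head, dep in triples]


print(group_entities(ner_tags))
for relation, head, dependent in readable_dependencies(tokens, dependencies):
    print(relation, head, dependent)
```

Because the grouping relies only on adjacency, two distinct entities of the same type that sit directly next to each other would be merged into one span.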
Note: to process a language other than English, you must download the additional model file for that language and place it in the .../stanford-corenlp-full-2016-10-31/ folder. For example, download the stanford-chinese-corenlp-2016-10-31-models.jar file if you want to process Chinese.
# -*- coding: utf-8 -*-
# Other human languages support, e.g. Chinese
nlp = StanfordCoreNLP(r'G:/JavaLibraries/stanford-corenlp-full-2016-10-31/', lang='zh')
sentence = '清华大学位于北京。'
print(nlp.word_tokenize(sentence))
print(nlp.pos_tag(sentence))
print(nlp.ner(sentence))
print(nlp.parse(sentence))
print(nlp.dependency_parse(sentence))
The general annotate() call below loads all the models, which requires more memory, so initialize the server with a larger memory setting. 8 GB is recommended.
# General json output
nlp = StanfordCoreNLP(r'path_to_corenlp', memory='8g')
print(nlp.annotate(sentence))
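The JSON result can then be decoded with the standard json module. The sketch below assumes annotate() returns a JSON string in CoreNLP's usual layout (a top-level "sentences" list whose items each hold a "tokens" list). The sample string is hand-written and heavily abbreviated for illustration, not real server output, and tokens_from_annotation is a hypothetical helper name.

```python
import json

# Hand-written, abbreviated stand-in for the string returned by
# nlp.annotate(sentence); real output carries many more fields per token.
sample_output = '''
{"sentences": [{"index": 0, "tokens": [
  {"index": 1, "word": "Guangzhou", "pos": "NNP", "ner": "LOCATION"},
  {"index": 2, "word": ".", "pos": ".", "ner": "O"}
]}]}
'''


def tokens_from_annotation(raw):
    """Flatten the decoded annotation into (word, pos, ner) rows."""
    doc = json.loads(raw)
    rows = []
    for sentence in doc.get('sentences', []):
        for token in sentence.get('tokens', []):
            # pos and ner are only present if those annotators ran.
            rows.append((token['word'], token.get('pos'), token.get('ner')))
    return rows


print(tokens_from_annotation(sample_output))
```

Using .get() for the optional fields keeps the helper usable when only a subset of annotators was requested.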
You can specify properties:
- annotators: tokenize, ssplit, pos, lemma, ner, parse, depparse, dcoref (See Detail)
- pipelineLanguage: en, zh, ar, fr, de, es (English, Chinese, Arabic, French, German, Spanish) (See Annotator Support Detail)
- outputFormat: json, xml, text
text = 'Guangdong University of Foreign Studies is located in Guangzhou. ' \
'GDUFS is active in a full range of international cooperation and exchanges in education. '
props = {'annotators': 'tokenize,ssplit,pos', 'pipelineLanguage': 'en', 'outputFormat': 'xml'}
print(nlp.annotate(text, properties=props))
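A misspelled property value is easy to miss, so it can help to build the dictionary through a small checker. This is only a sketch: build_props is a hypothetical helper, not part of stanfordcorenlp, and the allowed values simply mirror the lists given in this README (CoreNLP itself may accept more).

```python
# Allowed values copied from the property lists in this README.
ANNOTATORS = {'tokenize', 'ssplit', 'pos', 'lemma', 'ner',
              'parse', 'depparse', 'dcoref'}
LANGUAGES = {'en', 'zh', 'ar', 'fr', 'de', 'es'}
OUTPUT_FORMATS = {'json', 'xml', 'text'}


def build_props(annotators, language='en', output_format='json'):
    """Return a properties dict, rejecting values not listed above."""
    unknown = [name for name in annotators if name not in ANNOTATORS]
    if unknown:
        raise ValueError('unsupported annotators: ' + ', '.join(unknown))
    if language not in LANGUAGES:
        raise ValueError('unsupported language: ' + language)
    if output_format not in OUTPUT_FORMATS:
        raise ValueError('unsupported output format: ' + output_format)
    return {
        'annotators': ','.join(annotators),  # comma-separated, no spaces
        'pipelineLanguage': language,
        'outputFormat': output_format,
    }


props = build_props(['tokenize', 'ssplit', 'pos'], output_format='xml')
print(props)
```

The resulting dictionary has the same shape as the props example above and could be passed as the properties argument of annotate().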
# Use an existing server
nlp = StanfordCoreNLP('', port=80)
import logging
from stanfordcorenlp import StanfordCoreNLP
# Debug the wrapper
nlp = StanfordCoreNLP(r'path_or_host', logging_level=logging.DEBUG)
# Check more info from the CoreNLP Server
nlp = StanfordCoreNLP(r'path_or_host', quiet=False, logging_level=logging.DEBUG)