Datasets

Corpus

brightmart/nlp_chinese_corpus
- Contains Chinese NLP corpora at the scale of several million to ten million entries, including wiki2019zh, news2016zh, and baike2019qa
AI Practice Projects: Datasets (Chinese NLP Data Hub)
Chinese Task Benchmark Evaluation

Core Entity(Keywords) Extraction

thunlp/THUOCL
- Entries have been manually filtered

Language Model

[enwik8]
[text8]
[WikiText-103]
[One Billion Word]
[Penn TreeBank]

Word Embedding

Chinese Word Embedding
1. fastText. Trained on Common Crawl and Wikipedia. 300 dimensions. Chinese text is segmented with the Stanford word segmenter.
  - Useful. 2 million word vectors.
2. Tencent AI Lab Embedding Corpus for Chinese Words and Phrases
3. Embedding/Chinese-Word-Vectors
  - 100+ Chinese Word Vectors
English Word Embedding
1. 3Top/word2vec-api
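
Pre-trained vectors such as the fastText ones are commonly distributed as plain text: a header line with the vocabulary size and dimension, then one word per line followed by its vector components. The following is a minimal sketch of reading that layout and comparing two words. The function names `load_vectors` and `cosine` and the tiny in-memory sample are illustrative assumptions, not part of any project listed above.

```python
import io
import math


def load_vectors(handle, limit=None):
    """Read word vectors from a fastText-style text layout.

    First line: "<vocab_size> <dim>". Each later line: a word followed by <dim> floats.
    """
    _vocab_size, dim = map(int, handle.readline().split())
    vectors = {}
    for i, line in enumerate(handle):
        if limit is not None and i >= limit:
            break  # real files hold millions of rows; allow partial loads
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if len(values) != dim:
            continue  # skip malformed lines
        vectors[word] = [float(v) for v in values]
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Made-up 2-dimensional sample standing in for a real vector file.
sample = io.StringIO("3 2\ncat 1.0 0.0\ndog 0.8 0.6\ncar 0.0 1.0\n")
vectors = load_vectors(sample)
similarity = cosine(vectors["cat"], vectors["dog"])
```

With a downloaded file, the same function could be called on `open(path, encoding="utf-8")`, using `limit` to keep memory use manageable.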

QA

[NQ] Natural Questions: a Benchmark for Question Answering Research. Tom Kwiatkowski and Michael Collins, Research Scientists, Google AI Language. January 23, 2019. paper; blog
[SQuAD 2.0: Stanford Question Answering Dataset] Know What You Don’t Know: Unanswerable Questions for SQuAD. Pranav Rajpurkar, Robin Jia, Percy Liang. 2018 ACL.
[SQuAD 1.0: Stanford Question Answering Dataset] SQuAD: 100,000+ Questions for Machine Comprehension of Text. Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, Percy Liang. 2016 EMNLP.
[HotpotQA]
[NarrativeQA]
[TriviaQA]
[QuAC]
[CoQA]
[WikiQA]
[MS Marco]
[NewsQA]
[CNN/DailyMail news] Teaching machines to read and comprehend. Hermann et al. NIPS 2015.
[CBTest: Children’s Book Test] The goldilocks principle: Reading children’s books with explicit memory representations. Hill et al. 2015.
[DuReader] DuReader: a Chinese Machine Reading Comprehension Dataset from Real-world Applications. Wei He et al. 2018. Leaderboard
[CMRC 2018] Chinese Machine Reading Comprehension. HIT & iFLYTEK 2018. link
[bAbI] Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks. Jason Weston et al. 2015.
[Who-Did-What] Who did What: A Large-Scale Person-Centered Cloze Dataset. 2016 EMNLP. Takeshi Onishi et al.
[RACE] A reading comprehension dataset collected from middle and high school English exams in China. Lai et al. 2017.

Two Sentence Classification

[SNLI: Stanford Natural Language Inference] A large annotated corpus for learning natural language inference. Bowman et al. 2015. link
[SciTail] A textual entailment dataset from science question answering. Khot et al. AAAI. 2018. link
[QQP: Quora Question Pairs] Quora question pairs. Z. Chen, H. Zhang, X. Zhang, and L. Zhao. 2018.
[MRPC: Microsoft Research Paraphrase Corpus] Automatically constructing a corpus of sentential paraphrases. William B Dolan and Chris Brockett. 2005.
[MNLI: Multi-Genre Natural Language Inference] The RepEval 2017 Shared Task: Multi-Genre Natural Language Inference with Sentence Representations. N. Nangia, A. Williams, A. Lazaridou, and S. R. Bowman. 2017.
[RTE: Recognizing Textual Entailment] GLUE: A multi-task benchmark and analysis platform for natural language understanding. Alex Wang et al. 2018.
[WNLI: Winograd NLI] Derived from The Winograd Schema Challenge. Hector Levesque et al. 2012.
LCQMC

Two Sentence Relevance

[STS-B: Semantic Textual Similarity Benchmark] Semeval-2017 task 1: Semantic textual similarity-multilingual and cross-lingual focused evaluation. Daniel Cer et al. 2017.
[QNLI] Derived from the Stanford Question Answering Dataset (SQuAD 1.0). 2016.

Sentence Classification

[CoLA] Neural Network Acceptability Judgments. Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. 2018
[SST-2: Stanford Sentiment Treebank] Recursive deep models for semantic compositionality over a sentiment treebank. Richard Socher et al. 2013 EMNLP.

Tasks [NLU models & NLI tasks]

Single Sentence. [a sentence can be an arbitrary span of contiguous text or word sequence, rather than a linguistically plausible sentence.]
- single-sentence classification
  - CoLA (predict whether an English sentence is grammatically plausible.)
  - SST-2 (determine whether the sentiment of a sentence extracted from movie reviews is positive or negative)
- sequence labeling
  - NER:
  - POS
Pair Sentences.
- pairwise text classification
  - RTE (predict whether the hypothesis is entailed or not entailed by the premise)
  - MNLI (predict whether the hypothesis is an entailment, contradiction, or neutral with respect to the premise.)
  - WNLI (select the referent of a pronoun from a list of choices in a given sentence which contains the pronoun.)
  - QQP (predict whether two questions are semantically equivalent)
  - MRPC (predict whether the two sentences in a pair are semantically equivalent)
  - SNLI (widely used entailment dataset for NLI)
  - SciTail (assessing whether a given premise entails a given hypothesis)
- text similarity scoring
  - STS-B (Given a pair of sentences, the model predicts a real-value score indicating the semantic similarity of the two sentences)
- relevance ranking
  - QNLI (The task involves assessing whether a sentence contains the correct answer to a given query)
- sentiment analysis
QA.
- extractive QA
  - SQuAD 1.0 (extract answer from a context given a question)
  - SQuAD 2.0 (predict whether the question is answerable and, if so, extract the answer from the context)
  - CNN/DailyMail News (cloze-style reading comprehension)
- generative QA
  - Conversational QA
    - DREAM (a multiple-choice Dialogue-based REAding comprehension exaMination dataset). example
  - Single-turn QA
    - DuReader (summarize an answer from multiple documents for a question typed as Entity/Description/YesNo and as Fact/Opinion)
  - Machine Reading Comprehension
    - C3: Multiple-Choice Chinese machine reading Comprehension. github. example
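
The grouping above can be read as a small set of input shapes: a single text, or a pair of texts (two sentences, or a question and a candidate sentence). Here is a minimal sketch of that reading. The `Example` class, the `TASKS` mapping, and the sample sentences are hypothetical names and data introduced for illustration; the label counts are taken from the data scale comparison in this document.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Example:
    text_a: str                   # the sentence, premise, or question
    text_b: Optional[str] = None  # second sentence or hypothesis; None for single-sentence tasks
    label: Optional[str] = None


# Task name -> (category, number of labels). Only a subset of the tasks listed above.
TASKS = {
    "CoLA": ("single-sentence classification", 2),
    "SST-2": ("single-sentence classification", 2),
    "RTE": ("pairwise text classification", 2),
    "MNLI": ("pairwise text classification", 3),
    "QQP": ("pairwise text classification", 2),
    "STS-B": ("text similarity scoring", 1),
    "QNLI": ("relevance ranking", 2),
}


def is_pair_task(name):
    """True when the task consumes two texts rather than one."""
    return not TASKS[name][0].startswith("single-sentence")


# Made-up examples of the two shapes.
single = Example("The movie was a delight.", label="positive")
pair = Example("A man is playing a guitar.", "A person is making music.", label="entailment")
```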

Data process tools

jackalhan/qa_datasets_converter
- Dataset converter for NLP tasks such as question answering (QA): converts datasets from one format to another.
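
As an illustration of what such a format conversion involves, the following minimal sketch flattens SQuAD-style nested JSON into one record per question. It does not use the qa_datasets_converter API. The function name `flatten_squad` and the sample record are made up for illustration, while the field names follow the published SQuAD JSON layout (`data`, `paragraphs`, `context`, `qas`, `answers`, `answer_start`, plus `is_impossible` in SQuAD 2.0).

```python
def flatten_squad(dataset):
    """Turn SQuAD-style nested JSON into a flat list of question records."""
    records = []
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                answers = qa.get("answers", [])
                # SQuAD 2.0 marks unanswerable questions with is_impossible.
                unanswerable = qa.get("is_impossible", False) or not answers
                first = None if unanswerable else answers[0]
                records.append({
                    "id": qa["id"],
                    "question": qa["question"],
                    "context": context,
                    "answer": first["text"] if first else None,
                    "answer_start": first["answer_start"] if first else None,
                })
    return records


# Made-up sample in the SQuAD 2.0 layout.
context = "SQuAD was released by Stanford in 2016."
sample = {
    "data": [{
        "title": "SQuAD",
        "paragraphs": [{
            "context": context,
            "qas": [
                {"id": "q1", "question": "Who released SQuAD?", "is_impossible": False,
                 "answers": [{"text": "Stanford", "answer_start": context.index("Stanford")}]},
                {"id": "q2", "question": "Who funded SQuAD?", "is_impossible": True,
                 "answers": []},
            ],
        }],
    }]
}
rows = flatten_squad(sample)
```

A flat list like `rows` is easy to write out as CSV or JSON Lines, which is typically the target of this kind of conversion.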

Data scale comparison

Corpus	Task	#Train	#Dev	#Test	#Label	Metrics	Category	Source
CoLA	Acceptability	8.5k	1k	1k	2	Matthews corr	Single-Sentence Classification(GLUE)
SST-2	Sentiment	67k	872	1.8k	2	Accuracy	Single-Sentence Classification(GLUE)	movie reviews
STS-B	Similarity	7k	1.5k	1.4k	1	Pearson/Spearman corr	Text Similarity(GLUE)	multiple data resources
QNLI	QA/NLI	108k	5.7k	5.7k	2	Accuracy	Relevance Ranking(GLUE)	SQuAD 1.0
QQP	Paraphrase	364k	40k	391k	2	Accuracy/F1	Pairwise Text Classification(GLUE)	Quora
MRPC	Paraphrase	3.7k	408	1.7k	2	Accuracy/F1	Pairwise Text Classification(GLUE)	online news
MNLI	NLI	393k	20k	20k	3	Accuracy	Pairwise Text Classification(GLUE)
RTE	NLI	2.5k	276	3k	2	Accuracy	Pairwise Text Classification(GLUE)
WNLI	NLI	634	71	146	2	Accuracy	Pairwise Text Classification(GLUE)
SNLI	NLI	549k	9.8k	9.8k	3	Accuracy	Pairwise Text Classification	captions of the Flickr30k corpus
SciTail	NLI	23.5k	1.3k	2.1k	2	Accuracy	Pairwise Text Classification	science questions & relevant web sentences
SQuAD 1.0	QA	87.5k	10.5k	9.5k		EM/F1	Extractive QA	536 wiki pages
SQuAD 2.0	QA	130.3k	11.8k	8.8k		EM/F1	Extractive QA	348 wiki pages
NQ	QA	307.3k	7.8k	7.8k			Extractive QA	Google Search Engine
CMRC 2018	Span-Extraction Reading Comprehension	11.1k	3.2k	2.5k	span	EM/F1	Extractive QA	Chinese wiki pages
DuReader	Open-domain Question Answering	271.5k	10k	20k		ROUGE-L and BLEU4	Generative QA	Baidu Search & Baidu Zhidao
CNN/DailyMail	news
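
The Metrics column names several scores. As a rough illustration, the sketch below computes the Matthews correlation listed for CoLA, plus simplified exact match (EM) and token-level F1 of the kind listed for CMRC 2018. It is a minimal sketch under stated assumptions, not the official evaluation scripts: the function names are hypothetical, and normalization here only lowercases and splits on whitespace, whereas official scripts typically also strip punctuation and articles.

```python
import math
from collections import Counter


def matthews_corr(y_true, y_pred):
    """Matthews correlation coefficient for binary labels encoded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def normalize(text):
    """Simplified normalization: lowercase and split on whitespace."""
    return text.lower().split()


def exact_match(prediction, reference):
    """1.0 when the normalized answers are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall between two answers."""
    pred, ref = normalize(prediction), normalize(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Whitespace splitting suits English answers only; Chinese datasets would need a different tokenization, for example per character.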

Data related links

Baidu Release Chinese Dataset
chaotbot_corpus_Chinese

References

Multi-Task Deep Neural Networks for Natural Language Understanding. Xiaodong Liu et al. 2019.
