Korpora: Korean Corpora Archives

With the growing interest in natural language processing, governments, businesses, and individuals are increasingly releasing their data for free. However, because these datasets are scattered across many locations, even high-quality corpora often go unnoticed. File formats and storage layouts also differ from one dataset to the next, which makes them harder to use. As a result, individual users have to write their own download and preprocessing code each time.

Korpora is an open-source Python package that aims to reduce this inconvenience. The name comes from corpora, the plural of the word corpus, and is short for Korean Corpora. We hope that Korpora will serve as a starting point that encourages more Korean datasets to be released and, through them, raises Korean natural language processing to the next level.

List of corpora

Korpora provides the following corpora.

| corpus_name | description | link |
| --- | --- | --- |
| korean_chatbot_data | Question and answer pairs for training a chatbot | https://github.com/songys/Chatbot_data |
| kcbert | Comment data used for training the KcBERT model | https://github.com/Beomi/KcBERT |
| korean_hate_speech | Korean hate speech dataset | https://github.com/kocohub/korean-hate-speech |
| korean_petitions | Petitions to the Blue House | https://github.com/lovit/petitions_archive |
| kornli | Korean NLI | https://github.com/kakaobrain/KorNLUDatasets |
| korsts | Korean STS | https://github.com/kakaobrain/KorNLUDatasets |
| kowikitext | Korean Wikipedia text | https://github.com/lovit/kowikitext/ |
| namuwikitext | Namuwiki text | https://github.com/lovit/namuwikitext |
| naver_changwon_ner | NAVER x Changwon National University NER dataset | https://github.com/naver/nlp-challenge/tree/master/missions/ner |
| nsmc | NAVER Sentiment Movie Corpus | https://github.com/e9t/nsmc |
| question_pair | Korean question pair dataset | https://github.com/songys/Question_pair |
| modu_news | Modu Corpus: Newspaper | https://corpus.korean.go.kr |
| modu_messenger | Modu Corpus: Messenger | https://corpus.korean.go.kr |
| modu_mp | Modu Corpus: Morphemes | https://corpus.korean.go.kr |
| modu_ne | Modu Corpus: Named Entity | https://corpus.korean.go.kr |
| modu_spoken | Modu Corpus: Spoken | https://corpus.korean.go.kr |
| modu_web | Modu Corpus: Web | https://corpus.korean.go.kr |
| modu_written | Modu Corpus: Written | https://corpus.korean.go.kr |
| aihub_translation | Korean-English translation corpus | https://aihub.or.kr/aidata/87 |
| open_subtitles | Korean-English parallel corpus from movie subtitles | http://opus.nlpl.eu/OpenSubtitles-v2018.php |
| korean_parallel_koen_news | Korean-English parallel corpus | https://github.com/jungyeul/korean-parallel-corpora |

Information page

Detailed information on using Korpora is available on the information page, which is written in both Korean and English. We would like to thank Han Kyul Kim (@hank110) and Won Ik Cho (@warnikchow) (in alphabetical order) for the English translation.

If you would like a quick look at the core functions, see the Quick overview section below. For notes on execution and on adding or changing options, refer to the information page.

Quick overview

Installation

From source

git clone https://github.com/ko-nlp/Korpora
cd Korpora
python setup.py install

Using pip

pip install Korpora
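
To confirm that the installation succeeded, a small check like the one below may help. It is a minimal sketch that uses only the Python standard library; the helper name `is_installed` is hypothetical and not part of Korpora.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be found on the import path."""
    return importlib.util.find_spec(package_name) is not None


if is_installed("Korpora"):
    print("Korpora is available.")
else:
    print("Korpora was not found; try `pip install Korpora`.")
```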

Using in Python

Korpora is an open-source Python package. By default, it is used from a Python console. You can check the list of available corpora with the following Python code.

from Korpora import Korpora
Korpora.corpus_list()
{
   'kcbert': 'beomi@github 님이 만드신 KcBERT 학습데이터',  # KcBERT training data created by beomi@github
   'korean_chatbot_data': 'songys@github 님이 만드신 챗봇 문답 데이터',  # chatbot Q&A data created by songys@github
   'korean_hate_speech': '{inmoonlight,warnikchow,beomi}@github 님이 만드신 혐오댓글데이터',  # hate comment data created by {inmoonlight,warnikchow,beomi}@github
   'korean_petitions': 'lovit@github 님이 만드신 2017.08 ~ 2019.03 청와대 청원데이터',  # Blue House petition data (2017.08 ~ 2019.03) created by lovit@github
   'kornli': 'KakaoBrain 에서 제공하는 Natural Language Inference (NLI) 데이터',  # NLI data provided by KakaoBrain
   'korsts': 'KakaoBrain 에서 제공하는 Semantic Textual Similarity (STS) 데이터',  # STS data provided by KakaoBrain
   'kowikitext': "lovit@github 님이 만드신 wikitext 형식의 한국어 위키피디아 데이터",  # Korean Wikipedia data in wikitext format, created by lovit@github
   'namuwikitext': 'lovit@github 님이 만드신 wikitext 형식의 나무위키 데이터',  # Namuwiki data in wikitext format, created by lovit@github
   'naver_changwon_ner': '네이버 + 창원대 NER shared task data',  # NAVER + Changwon National University NER shared task data
   'nsmc': 'e9t@github 님이 만드신 Naver sentiment movie corpus v1.0',  # Naver sentiment movie corpus v1.0 created by e9t@github
   'question_pair': 'songys@github 님이 만드신 질문쌍(Paired Question v.2)',  # question pairs (Paired Question v.2) created by songys@github
   'modu_news': '국립국어원에서 만든 모두의 말뭉치: 뉴스 말뭉치',  # Modu Corpus by the National Institute of Korean Language: news corpus
   'modu_messenger': '국립국어원에서 만든 모두의 말뭉치: 메신저 말뭉치',  # Modu Corpus: messenger corpus
   'modu_mp': '국립국어원에서 만든 모두의 말뭉치: 형태 분석 말뭉치',  # Modu Corpus: morphological analysis corpus
   'modu_ne': '국립국어원에서 만든 모두의 말뭉치: 개체명 분석 말뭉치',  # Modu Corpus: named entity analysis corpus
   'modu_spoken': '국립국어원에서 만든 모두의 말뭉치: 구어 말뭉치',  # Modu Corpus: spoken corpus
   'modu_web': '국립국어원에서 만든 모두의 말뭉치: 웹 말뭉치',  # Modu Corpus: web corpus
   'modu_written': '국립국어원에서 만든 모두의 말뭉치: 문어 말뭉치',  # Modu Corpus: written corpus
   'aihub_translation': "AI Hub 에서 제공하는 번역용 병렬 말뭉치 (구어 + 대화 + 뉴스 + 한국문화 + 조례 + 지자체웹사이트)",  # AI Hub parallel translation corpus (spoken + conversation + news + Korean culture + decree + local government website)
   'aihub_spoken_translation': "AI Hub 에서 제공하는 번역용 병렬 말뭉치 (구어)",  # AI Hub parallel translation corpus (spoken)
   'aihub_conversation_translation': "AI Hub 에서 제공하는 번역용 병렬 말뭉치 (대화)",  # AI Hub parallel translation corpus (conversation)
   'aihub_news_translation': "AI Hub 에서 제공하는 번역용 병렬 말뭉치 (뉴스)",  # AI Hub parallel translation corpus (news)
   'aihub_korean_culture_translation': "AI Hub 에서 제공하는 번역용 병렬 말뭉치 (한국문화)",  # AI Hub parallel translation corpus (Korean culture)
   'aihub_decree_translation': "AI Hub 에서 제공하는 번역용 병렬 말뭉치 (조례)",  # AI Hub parallel translation corpus (decree)
   'aihub_government_website_translation': "AI Hub 에서 제공하는 번역용 병렬 말뭉치 (지자체웹사이트)",  # AI Hub parallel translation corpus (local government website)
   'open_subtitles': 'Open parallel corpus (OPUS) 에서 제공하는 영화 자막 번역 병렬 말뭉치',  # parallel corpus of translated movie subtitles provided by OPUS
}
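
Because the listing shown above is a dictionary keyed by corpus name, it can be searched with ordinary Python. The sketch below assumes that shape and uses a shortened stand-in dictionary instead of calling Korpora; the helper name `find_corpora` is hypothetical.

```python
def find_corpora(listing: dict, keyword: str) -> list:
    """Return the sorted corpus names that contain `keyword`."""
    return sorted(name for name in listing if keyword in name)


# Stand-in for the dictionary shown above; descriptions are shortened
# English placeholders, not the strings Korpora actually returns.
sample_listing = {
    "kcbert": "KcBERT training data",
    "modu_news": "Modu Corpus: news",
    "modu_web": "Modu Corpus: web",
    "nsmc": "Naver sentiment movie corpus",
}

print(find_corpora(sample_listing, "modu"))  # ['modu_news', 'modu_web']
```

With Korpora installed, passing the result of `Korpora.corpus_list()` in place of `sample_listing` should work the same way, provided it returns a dictionary like the one displayed.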

From the Python console, you can download the KcBERT training data with the following Python code. The corpus is downloaded to the Korpora directory under the user's home directory (~/Korpora). To download a different dataset, pass its corpus name from the list above as the argument.

from Korpora import Korpora
Korpora.fetch("kcbert")

To download all corpora provided by Korpora, use the following Python code. All datasets are downloaded to ~/Korpora.

from Korpora import Korpora
Korpora.fetch('all')

The following code loads the KcBERT training dataset from the Python console. If the corpus does not exist locally, it is first downloaded to ~/Korpora. The corpus data is then stored in the Python variable corpus. To load a different dataset, pass its corpus name from the list above as the argument.

from Korpora import Korpora
corpus = Korpora.load("kcbert")
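
Since both fetching and loading place data under ~/Korpora, inspecting that directory is a quick way to see what has already been downloaded. The sketch below assumes that each corpus is stored in its own subdirectory, which is an assumption about the on-disk layout rather than something stated here; `local_corpora` is a hypothetical helper, not a Korpora function.

```python
from pathlib import Path


def local_corpora(root_dir: Path) -> list:
    """List the names of subdirectories found under `root_dir`.

    Assumption: each downloaded corpus lives in its own subdirectory.
    Returns an empty list if `root_dir` does not exist.
    """
    if not root_dir.is_dir():
        return []
    return sorted(p.name for p in root_dir.iterdir() if p.is_dir())


default_root = Path.home() / "Korpora"  # the ~/Korpora directory mentioned above
print(local_corpora(default_root))
```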

Using in a terminal

Korpora also works from the terminal through its command line interface (CLI), so you can use it without starting a Python console. The following command downloads the KcBERT training dataset. The dataset is downloaded to ~/Korpora.

korpora fetch --corpus kcbert

The following command downloads the KcBERT training dataset and the chatbot Q&A pair dataset in a single call. Three or more datasets can be downloaded in the same way. Datasets are downloaded to ~/Korpora.

korpora fetch --corpus kcbert korean_chatbot_data

You can download all corpora provided by Korpora from your terminal with the following command. Datasets are downloaded to ~/Korpora.

korpora fetch --corpus all
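
When the fetch commands above need to be driven from a script, the argument list can be assembled in Python. This is a minimal sketch that only builds and prints the command; `fetch_command` is a hypothetical helper, and the flags are the ones shown in the commands above.

```python
import shlex


def fetch_command(*corpus_names: str) -> list:
    """Build the argument list for `korpora fetch --corpus ...`."""
    if not corpus_names:
        raise ValueError("at least one corpus name is required")
    return ["korpora", "fetch", "--corpus", *corpus_names]


cmd = fetch_command("kcbert", "korean_chatbot_data")
print(shlex.join(cmd))  # korpora fetch --corpus kcbert korean_chatbot_data
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` to actually run the download, which keeps corpus names as separate arguments and avoids shell quoting issues.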

From the terminal, you can also create a dataset for training a language model. Here, creating such a dataset means extracting only the sentences from the corpora provided by Korpora and dumping them to a text file. A sample command is shown below. It processes all corpora provided by Korpora (all) in one batch and builds a single training corpus for a language model. Downloading and preprocessing happen in the same run: if a corpus does not exist locally, it is downloaded to ~/Korpora. The result is a single file named all.train, created in output_dir.

korpora lmdata \
  --corpus all \
  --output_dir ~/works/lmdata
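
To make the idea of dumping only the sentences to a text file concrete, here is a rough sketch of that kind of processing. It is not Korpora's implementation: the function name `dump_sentences`, the one-sentence-per-line format, and the skipping of blank entries are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path
from typing import Iterable


def dump_sentences(corpora: Iterable[Iterable[str]], output_file: Path) -> int:
    """Write every non-empty sentence to `output_file`, one per line.

    Returns the number of lines written.
    """
    output_file.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with output_file.open("w", encoding="utf-8") as f:
        for sentences in corpora:
            for sentence in sentences:
                sentence = sentence.strip()
                if sentence:  # skip blank entries
                    f.write(sentence + "\n")
                    count += 1
    return count


# Small demonstration with in-memory stand-ins for two corpora.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "lmdata" / "all.train"
    written = dump_sentences([["first sentence", " "], ["second sentence"]], out)
    print(written, out.read_text(encoding="utf-8").splitlines())
```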

License

  • Korpora is licensed under the Creative Commons Attribution 4.0 (CC-BY 4.0) license. This license applies only to the Korpora package and its components.
  • Its users have the following rights.
    • Share : They are free to reproduce, distribute, exhibit, perform, and publicly transmit the material (including changing its format).
    • Adapt : They can remix, transform, and build upon the material for any purpose, even commercially.
  • Its users have the following obligations. As long as these obligations are fulfilled, the user rights listed above are valid.
    • Attribution : They must indicate that they have used Korpora.
    • No additional restrictions : For derivative works of Korpora, they may not impose a license stricter than CC-BY.
    • For example, if you have only downloaded and used Korpora, you need to fulfill only the 'attribution' obligation. However, if you create and distribute models, documents, or any other derivative works of Korpora, you must fulfill both the 'attribution' and 'no additional restrictions' obligations.
  • Each corpus adheres to its own license policy. Please check the license of the corpus before using it!