With growing interest in natural language processing, governments, companies, and individuals are increasingly releasing their data for free. However, because these datasets are scattered across many locations, even high-quality corpora often go unnoticed. Their file formats and storage layouts also differ, which makes them hard to use, so each user ends up writing download and preprocessing code from scratch every time.
Korpora is an open-source Python package that aims to reduce this inconvenience.
The name Korpora comes from corpora, the plural of corpus.
Korpora is short for Korean Corpora.
We hope that Korpora will serve as a starting point that encourages the release of more Korean datasets and takes Korean natural language processing to the next level.
Korpora provides the following corpora.
Detailed information on how to use Korpora is available from the link below.
The information page is written in both Korean and English.
We would like to thank Han Kyul Kim (@hank110) and Won Ik Cho (@warnikchow) (in alphabetical order) for the English translation.
For a quick look at the core functions, please refer to the Quick overview part below.
For notes on execution and on adding or changing options, please refer to the information page linked above.
From source
git clone https://github.com/ko-nlp/Korpora
cd Korpora
python setup.py install
Using pip
pip install Korpora
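After installing, it can be useful to confirm that the package is importable before running the examples. The following is a minimal sketch using only the standard library; the helper name is_installed is hypothetical and not part of Korpora.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be found on the import path."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # "Korpora" is the import name used throughout this document.
    if is_installed("Korpora"):
        print("Korpora is installed.")
    else:
        print("Korpora is not installed; try: pip install Korpora")
```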
Korpora is an open-source Python package.
By default, it runs in a Python console.
You can check the list of available corpora with the following Python code.
from Korpora import Korpora
Korpora.corpus_list()
{
'kcbert': 'beomi@github 님이 만드신 KcBERT 학습데이터',
'korean_chatbot_data': 'songys@github 님이 만드신 챗봇 문답 데이터',
'korean_hate_speech': '{inmoonlight,warnikchow,beomi}@github 님이 만드신 혐오댓글데이터',
'korean_petitions': 'lovit@github 님이 만드신 2017.08 ~ 2019.03 청와대 청원데이터',
'kornli': 'KakaoBrain 에서 제공하는 Natural Language Inference (NLI) 데이터',
'korsts': 'KakaoBrain 에서 제공하는 Semantic Textual Similarity (STS) 데이터',
'kowikitext': "lovit@github 님이 만드신 wikitext 형식의 한국어 위키피디아 데이터",
'namuwikitext': 'lovit@github 님이 만드신 wikitext 형식의 나무위키 데이터',
'naver_changwon_ner': '네이버 + 창원대 NER shared task data',
'nsmc': 'e9t@github 님이 만드신 Naver sentiment movie corpus v1.0',
'question_pair': 'songys@github 님이 만드신 질문쌍(Paired Question v.2)',
'modu_news': '국립국어원에서 만든 모두의 말뭉치: 뉴스 말뭉치',
'modu_messenger': '국립국어원에서 만든 모두의 말뭉치: 메신저 말뭉치',
'modu_mp': '국립국어원에서 만든 모두의 말뭉치: 형태 분석 말뭉치',
'modu_ne': '국립국어원에서 만든 모두의 말뭉치: 개체명 분석 말뭉치',
'modu_spoken': '국립국어원에서 만든 모두의 말뭉치: 구어 말뭉치',
'modu_web': '국립국어원에서 만든 모두의 말뭉치: 웹 말뭉치',
'modu_written': '국립국어원에서 만든 모두의 말뭉치: 문어 말뭉치',
'aihub_translation': "AI Hub 에서 제공하는 번역용 병렬 말뭉치 (구어 + 대화 + 뉴스 + 한국문화 + 조례 + 지자체웹사이트)",
'aihub_spoken_translation': "AI Hub 에서 제공하는 번역용 병렬 말뭉치 (구어)",
'aihub_conversation_translation': "AI Hub 에서 제공하는 번역용 병렬 말뭉치 (대화)",
'aihub_news_translation': "AI Hub 에서 제공하는 번역용 병렬 말뭉치 (뉴스)",
'aihub_korean_culture_translation': "AI Hub 에서 제공하는 번역용 병렬 말뭉치 (한국문화)",
'aihub_decree_translation': "AI Hub 에서 제공하는 번역용 병렬 말뭉치 (조례)",
'aihub_government_website_translation': "AI Hub 에서 제공하는 번역용 병렬 말뭉치 (지자체웹사이트)",
'open_subtitles': 'Open parallel corpus (OPUS) 에서 제공하는 영화 자막 번역 병렬 말뭉치',
}
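Since the output above maps corpus names to descriptions, related corpora can be picked out by name. The following is a minimal sketch that uses a small hand-copied excerpt of that output (with descriptions abbreviated in English) instead of calling Korpora, so that it runs without the package; the helper names_with_prefix is hypothetical.

```python
def names_with_prefix(corpus_list: dict, prefix: str) -> list:
    """Return the sorted corpus names that start with the given prefix."""
    return sorted(name for name in corpus_list if name.startswith(prefix))


# Hand-copied excerpt of the listing above. With Korpora installed, the
# result of Korpora.corpus_list() would be passed in instead.
excerpt = {
    "kcbert": "KcBERT training data",
    "modu_news": "Modu corpus: news",
    "modu_web": "Modu corpus: web",
    "aihub_news_translation": "AI Hub parallel corpus (news)",
}

print(names_with_prefix(excerpt, "modu_"))  # ['modu_news', 'modu_web']
```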
From the Python console, you can download the KcBERT training data with the following Python code.
The corpus is downloaded to the Korpora directory under the user's home directory (~/Korpora).
To download a different dataset, pass its name, as shown in the list above, as the argument.
from Korpora import Korpora
Korpora.fetch("kcbert")
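To see what a download actually placed on disk, the download directory can be listed with the standard library. This is a minimal sketch, assuming only that files end up somewhere under ~/Korpora as stated above; the layout inside that directory is not assumed, and the helper list_downloaded_files is hypothetical.

```python
from pathlib import Path


def list_downloaded_files(root: Path) -> list:
    """Return the paths of all files under root, relative to root, sorted."""
    if not root.is_dir():
        return []
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_file()
    )


if __name__ == "__main__":
    # ~/Korpora is the download location named in this document.
    for name in list_downloaded_files(Path.home() / "Korpora"):
        print(name)
```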
If you want to download all corpora provided by Korpora, use the following Python code.
All datasets are downloaded to ~/Korpora.
from Korpora import Korpora
Korpora.fetch('all')
With the following code, you can load the KcBERT training dataset from your Python console.
If the corpus does not exist locally, it is first downloaded to ~/Korpora.
The corpus data is then stored in the Python variable corpus.
To load a different dataset, pass its name, as shown in the list above, as the argument.
from Korpora import Korpora
corpus = Korpora.load("kcbert")
You can also run Korpora from your terminal (Command Line Interface, CLI).
This lets you use Korpora without starting a Python console.
You can download the KcBERT training dataset from your terminal with the following command.
The dataset is downloaded to ~/Korpora.
korpora fetch --corpus kcbert
With the following command, you can download the KcBERT training dataset and the chatbot Q&A pair dataset at the same time.
Three or more datasets can be downloaded at once in the same way.
The datasets are downloaded to ~/Korpora.
korpora fetch --corpus kcbert korean_chatbot_data
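When the corpus names come from a script, the same command line can be assembled programmatically. The following is a minimal sketch that only builds the argument list shown above and leaves running it to the caller; build_fetch_command is a hypothetical helper, and it uses nothing beyond the fetch subcommand and the --corpus option that appear in this document.

```python
import shlex


def build_fetch_command(corpus_names: list) -> list:
    """Build the argument list for `korpora fetch --corpus NAME [NAME ...]`."""
    if not corpus_names:
        raise ValueError("at least one corpus name is required")
    return ["korpora", "fetch", "--corpus", *corpus_names]


command = build_fetch_command(["kcbert", "korean_chatbot_data"])
print(shlex.join(command))  # korpora fetch --corpus kcbert korean_chatbot_data
```

Once the korpora command is available on the PATH, the list could be handed to subprocess.run(command, check=True).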
You can download all corpora provided by Korpora from your terminal with the following command.
The datasets are downloaded to ~/Korpora.
korpora fetch --corpus all
From your terminal, you can also create a dataset for training a language model.
Here, creating a language model training dataset means extracting only the sentences from the corpora provided by Korpora and saving them in a text file.
A sample command is shown below.
It processes all corpora provided by Korpora (all) in one batch and creates a single training dataset for a language model.
The command handles both downloading and preprocessing.
If a corpus does not exist locally, it is downloaded to ~/Korpora.
The result is a single output file named all.train, created in output_dir.
korpora lmdata \
--corpus all \
--output_dir ~/works/lmdata
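Once the command finishes, the result can be inspected without loading the whole file into memory. This is a minimal sketch, assuming the output is a plain UTF-8 text file with one text per line (an assumption; this document only states that sentences are saved to a text file); the helper head is hypothetical.

```python
from itertools import islice
from pathlib import Path


def head(path: Path, n: int = 5) -> list:
    """Return the first n lines of a UTF-8 text file, without newlines."""
    with path.open(encoding="utf-8") as f:
        return [line.rstrip("\n") for line in islice(f, n)]


if __name__ == "__main__":
    # Path taken from the command above: <output_dir>/all.train
    output_file = Path("~/works/lmdata").expanduser() / "all.train"
    if output_file.exists():
        for line in head(output_file):
            print(line)
```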
- Korpora is licensed under the Creative Commons License (CCL) 4.0, CC-BY. This license covers the Korpora package and all of its components.
- Users have the following rights.
  - Share: Users are free to reproduce, distribute, exhibit, perform, and publicly transmit the material, including in a changed format.
  - Adapt: Users can remix, transform, and build upon the material for any purpose, even commercially.
- Users have the following obligations. The rights listed above remain valid as long as these obligations are fulfilled.
  - Attribution: Users must indicate that they have used Korpora.
  - No additional restrictions: Users may not apply a license stricter than CC-BY to derivative works that use Korpora.
- For example, if you only download and use Korpora, you need to fulfill only the 'attribution' obligation. However, if you create and distribute models, documents, or any other derivative works of Korpora, you must fulfill both the 'attribution' and 'no additional restrictions' obligations.
- Each corpus is covered by its own license. Please check the license of a corpus before using it!