The COVID-19 corpus repository contains research articles annotated with the following biomedical entities of interest: Disorder, Species, Chemical or Drug, Gene or Protein, Enzyme, Anatomy, Biological Process, Molecular Function, Cellular Component, Pathway and microRNA.
Two different datasets are provided:
- CORD-19 full-text articles with more than 31 million annotations.
- PubMed abstracts with more than 680 thousand annotations.
The annotated corpora are freely available and can be used to further research on topics related to COVID-19, helping to build a better understanding of the disease, find effective drugs and reduce the impact of the pandemic.
A blog post is available at https://hands-on-tech.github.io/2020/03/28/covid19-corpus.html.
Full-text research articles related to COVID-19 topics. The raw text and a detailed description are available on the official CORD-19 corpus Kaggle page.
Download the latest version of the CORD-19 annotated corpus.
Overall corpus statistics:
- Number of articles: 33 375
- Number of entity annotation occurrences: 31 272 212
- Number of unique entity annotations: 141 604
Number of annotations per entity type:
Entity | # Occurrences | # Unique |
---|---|---|
Disorder | 5638277 | 18704 |
Species | 5899678 | 30343 |
Chemical or Drug | 4458126 | 11173 |
Gene or Protein | 2013425 | 57738 |
Enzyme | 372308 | 1480 |
Anatomy | 5420584 | 10373 |
Biological Process | 3701117 | 7765 |
Molecular Function | 842418 | 1722 |
Cellular Component | 2542276 | 1099 |
Pathway | 382338 | 517 |
microRNA | 1665 | 690 |
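As a quick sanity check, the per-type counts in the table add up to the overall corpus statistics. The following is a minimal sketch in Python with the table values copied by hand; the variable names are illustrative and are not part of the repository.

```python
# Per-entity counts copied from the CORD-19 table: (occurrences, unique).
cord19_counts = {
    "Disorder": (5638277, 18704),
    "Species": (5899678, 30343),
    "Chemical or Drug": (4458126, 11173),
    "Gene or Protein": (2013425, 57738),
    "Enzyme": (372308, 1480),
    "Anatomy": (5420584, 10373),
    "Biological Process": (3701117, 7765),
    "Molecular Function": (842418, 1722),
    "Cellular Component": (2542276, 1099),
    "Pathway": (382338, 517),
    "microRNA": (1665, 690),
}

# Sum each column across all entity types.
total_occurrences = sum(occ for occ, _ in cord19_counts.values())
total_unique = sum(uniq for _, uniq in cord19_counts.values())

print(total_occurrences)  # 31272212
print(total_unique)       # 141604
```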
Technical description of the annotated CORD-19 corpus is available here.
Abstracts of research articles from PubMed related to COVID-19 topics. A blog post about building this corpus is available at https://hands-on-tech.github.io/2020/03/28/covid19-corpus.html.
Download the latest version of the annotated PubMed corpus.
Overall corpus statistics:
- Number of abstracts: 17 740
- Number of entity annotation occurrences: 683 349
- Number of unique entity annotations: 29 423
Number of annotations per entity type:
Entity | # Occurrences | # Unique |
---|---|---|
Disorder | 183528 | 4477 |
Species | 128356 | 2170 |
Chemical or Drug | 70619 | 2768 |
Gene or Protein | 51114 | 15025 |
Enzyme | 7892 | 282 |
Anatomy | 106401 | 2369 |
Biological Process | 74286 | 1561 |
Molecular Function | 15089 | 383 |
Cellular Component | 39451 | 263 |
Pathway | 6587 | 97 |
microRNA | 26 | 28 |
Technical description of the annotated PubMed corpus is available here.
The following resources were applied to annotate each entity type:
- Disorder (DISO): Unified Medical Language System (UMLS)
- Species (SPEC): NCBI Taxonomy
- Chemical or Drug (CHED): ChEBI
- Gene or Protein (PRGE): NER with CRFs and normalization with UniProt
- Enzyme (ENZY): ExPASy
- Anatomy (ANAT): Unified Medical Language System (UMLS)
- Biological Process (PROC): Gene Ontology (GO) and UMLS
- Molecular Function (FUNC): Gene Ontology (GO)
- Cellular Component (COMP): Gene Ontology (GO)
- Pathway (PATH): NCBI BioSystems
- microRNA (MRNA): miRBase
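When working with the annotations programmatically, it can help to keep the entity codes listed above in a lookup table. The sketch below simply transcribes that list into Python; whether the annotation files use exactly these four-letter codes as labels is an assumption, and `ENTITY_TYPES` and `describe` are hypothetical names.

```python
# Entity code -> (entity name, annotation resource), transcribed from the list above.
ENTITY_TYPES = {
    "DISO": ("Disorder", "UMLS"),
    "SPEC": ("Species", "NCBI Taxonomy"),
    "CHED": ("Chemical or Drug", "ChEBI"),
    "PRGE": ("Gene or Protein", "NER with CRFs and normalization with UniProt"),
    "ENZY": ("Enzyme", "ExPASy"),
    "ANAT": ("Anatomy", "UMLS"),
    "PROC": ("Biological Process", "Gene Ontology (GO) and UMLS"),
    "FUNC": ("Molecular Function", "Gene Ontology (GO)"),
    "COMP": ("Cellular Component", "Gene Ontology (GO)"),
    "PATH": ("Pathway", "NCBI BioSystems"),
    "MRNA": ("microRNA", "miRBase"),
}


def describe(code: str) -> str:
    """Return a readable description for an entity code (case-insensitive)."""
    key = code.upper()
    name, resource = ENTITY_TYPES[key]
    return f"{name} ({key}): {resource}"


print(describe("prge"))
```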
For more details, please check the article. Unfortunately, the dictionaries cannot be shared for download due to the UMLS usage license. Note, however, that the Disorder and Species entities were extended to include COVID-19 entities of interest.
Neji is the tool used for NER (Named Entity Recognition) and normalization. It is optimized for biomedical scientific articles and provides an easy-to-use CLI. For more details, please check the article.
- CORD-19 annotated corpus.
- Initial release.
- Annotate "methods", "results" and "conclusions" sections from JSON files.
Possible next steps to improve the COVID-19 corpus:
- Annotate "methods", "results" and "conclusions" sections from JSON files;
- Further optimize resources to target entities related to COVID-19;
- Include additional entities of relevance;
- Annotate PMC and Elsevier full-text articles;
- Collect co-occurrences to understand which entities are most often related;
- Index articles and annotations and provide access to a search tool.
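The co-occurrence idea in the list above could be prototyped roughly as follows. This is only a sketch: the input structure (a dictionary mapping an article identifier to a list of concept labels) and the `count_cooccurrences` helper are hypothetical and do not reflect the actual format of the annotation files.

```python
from collections import Counter
from itertools import combinations


def count_cooccurrences(annotations_by_article):
    """Count how many articles mention each unordered pair of concepts."""
    pair_counts = Counter()
    for concepts in annotations_by_article.values():
        # De-duplicate within an article and sort so each pair has one canonical order.
        unique_concepts = sorted(set(concepts))
        pair_counts.update(combinations(unique_concepts, 2))
    return pair_counts


# Toy data, invented purely for illustration.
toy = {
    "article-1": ["DISO:covid-19", "SPEC:sars-cov-2", "ANAT:lung"],
    "article-2": ["DISO:covid-19", "SPEC:sars-cov-2"],
    "article-3": ["DISO:covid-19", "ANAT:lung", "ANAT:lung"],
}

counts = count_cooccurrences(toy)
for pair, n in counts.most_common():
    print(pair, n)
```

Counting at the article level keeps the example simple; a sentence-level or window-based count would be a natural refinement once the annotation offsets are taken into account.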
If you would like to know more or contribute, please send an e-mail to david.marques.campos@gmail.com or create a ticket on GitHub.
The annotations and scripts are free to use and released under the MIT license.