Commit
* Add model 2022-10-19-ner_covid_trials_en
* Add model 2022-10-19-ner_jsl_en
* Add model 2022-10-25-t5_base_pubmedqa_en
* Add model 2022-10-25-ner_oncology_en
* Add model 2022-10-25-ner_oncology_therapy_en
* Add model 2022-10-25-ner_oncology_diagnosis_en
* Add model 2022-10-25-ner_oncology_tnm_en
* Add model 2022-10-25-ner_oncology_anatomy_general_en
* Add model 2022-10-25-ner_oncology_demographics_en
* Add model 2022-10-25-ner_oncology_test_en
* Add model 2022-10-25-ner_oncology_unspecific_posology_en
* Add model 2022-10-25-ner_oncology_anatomy_granular_en
* Add model 2022-10-25-ner_oncology_response_to_treatment_en
* Add model 2022-10-25-ner_oncology_biomarker_en
* Add model 2022-10-25-ner_oncology_posology_en
* updated bancmark
* Benchmark format updating
* Benchmark format updating
* Benchmark format updating
* Benchmark format updating
* Update 2022-10-25-ner_oncology_anatomy_general_en.md
* Benchmark format updating
* Benchmark format updating
* Benchmark format updating
* Benchmark format update
* Benchmark format update
* Benchmark format update
* Benchmark format update
* Add model 2022-10-28-sbiobertresolve_icd10pcs_augmented_en
* Update 2022-10-28-sbiobertresolve_icd10pcs_augmented_en.md
* Add model 2022-10-29-icd10cm_mapper_en
* 2022-10-30-abbreviation_mapper_augmented_en (#13005)
* Add model 2022-10-30-abbreviation_mapper_augmented_en
* Add model 2022-11-02-icd10cm_resolver_pipeline_en
* Delete 2022-11-02-icd10cm_resolver_pipeline_en.md
Co-authored-by: Ahmetemintek <ahmetemin.tek.66@gmail.com>
* Add model 2022-11-02-icd10cm_resolver_pipeline_en (#13017)
Co-authored-by: Ahmetemintek <ahmetemin.tek.66@gmail.com>
* Add model 2022-11-03-oncology_general_pipeline_en (#13031)
Co-authored-by: mauro-nievoff <mauro.nievasoffidani@gmail.com>
* 2022-11-04-oncology_diagnosis_pipeline_en (#13038)
* Add model 2022-11-04-oncology_diagnosis_pipeline_en
* Add model 2022-11-04-oncology_biomarker_pipeline_en
* Add model 2022-11-04-oncology_therapy_pipeline_en
* Update 2022-11-04-oncology_diagnosis_pipeline_en.md
* Update 2022-11-04-oncology_diagnosis_pipeline_en.md
* Update 2022-11-04-oncology_diagnosis_pipeline_en.md
Co-authored-by: mauro-nievoff <mauro.nievasoffidani@gmail.com>
Co-authored-by: mauro-nievoff <55700369+mauro-nievoff@users.noreply.github.com>
Co-authored-by: muhammetsnts <76607915+muhammetsnts@users.noreply.github.com>
Co-authored-by: Ahmetemintek <ahmetemin.tek.66@gmail.com>
Co-authored-by: Veysel Kocaman <vkocaman@gmail.com>
Co-authored-by: HashamUlHaq <Haashaamulhaq@gmail.com>
Co-authored-by: mauro-nievoff <mauro.nievasoffidani@gmail.com>
Co-authored-by: mauro-nievoff <55700369+mauro-nievoff@users.noreply.github.com>
Co-authored-by: jsl-models <74001263+jsl-models@users.noreply.github.com>
Co-authored-by: muhammetsnts <76607915+muhammetsnts@users.noreply.github.com>
1 parent 02b94aa, commit 333c95c
Showing 23 changed files with 3,799 additions and 0 deletions.
209 changes: 209 additions & 0 deletions
docs/_posts/Ahmetemintek/2022-10-19-ner_covid_trials_en.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,209 @@ | ||
--- | ||
layout: model | ||
title: Extract Entities in Covid Trials | ||
author: John Snow Labs | ||
name: ner_covid_trials | ||
date: 2022-10-19 | ||
tags: [ner, en, clinical, licensed, covid] | ||
task: Named Entity Recognition | ||
language: en | ||
edition: Spark NLP for Healthcare 4.2.0 | ||
spark_version: 3.0 | ||
supported: true | ||
article_header: | ||
type: cover | ||
use_language_switcher: "Python-Scala-Java" | ||
--- | ||
|
||
## Description | ||
|
||
Pretrained deep learning named entity recognition (NER) model that extracts COVID-related clinical terminology from COVID trials. | ||
|
||
## Predicted Entities | ||
|
||
`Stage`, `Severity`, `Virus`, `Trial_Design`, `Trial_Phase`, `N_Patients`, `Institution`, `Statistical_Indicator`, `Section_Header`, `Cell_Type`, `Cellular_component`, `Viral_components`, `Physiological_reaction`, `Biological_molecules`, `Admission_Discharge`, `Age`, `BMI`, `Cerebrovascular_Disease`, `Date`, `Death_Entity`, `Diabetes`, `Disease_Syndrome_Disorder`, `Dosage`, `Drug_Ingredient`, `Employment`, `Frequency`, `Gender`, `Heart_Disease`, `Hypertension`, `Obesity`, `Pulse`, `Race_Ethnicity`, `Respiration`, `Route`, `Smoking`, `Time`, `Total_Cholesterol`, `Treatment`, `VS_Finding`, `Vaccine`, `Vaccine_Name` | ||
|
||
{:.btn-box} | ||
[Live Demo](https://demo.johnsnowlabs.com/healthcare/NER_COVID/){:.button.button-orange} | ||
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon} | ||
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_covid_trials_en_4.2.0_3.0_1666177383134.zip){:.button.button-orange.button-orange-trans.arr.button-icon} | ||
|
||
## How to use | ||
|
||
|
||
|
||
<div class="tabs-box" markdown="1"> | ||
{% include programmingLanguageSelectScalaPythonNLU.html %} | ||
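```python | ||
from pyspark.ml import Pipeline | ||
from sparknlp.base import DocumentAssembler | ||
from sparknlp.annotator import SentenceDetectorDLModel, Tokenizer, WordEmbeddingsModel, NerConverter | ||
from sparknlp_jsl.annotator import MedicalNerModel | ||
|
||
# Assumes `spark` is an active session started with Spark NLP for Healthcare | ||
documentAssembler = DocumentAssembler()\ | ||
.setInputCol("text")\ | ||
.setOutputCol("document") | ||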
|
||
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") \ | ||
.setInputCols(["document"]) \ | ||
.setOutputCol("sentence") | ||
|
||
tokenizer = Tokenizer()\ | ||
.setInputCols(["sentence"])\ | ||
.setOutputCol("token") | ||
|
||
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\ | ||
.setInputCols(["sentence", "token"])\ | ||
.setOutputCol("embeddings") | ||
|
||
ner = MedicalNerModel.pretrained("ner_covid_trials","en","clinical/models")\ | ||
.setInputCols(["sentence","token","embeddings"])\ | ||
.setOutputCol("ner")\ | ||
.setLabelCasing("upper") | ||
|
||
ner_converter = NerConverter() \ | ||
.setInputCols(["sentence", "token", "ner"]) \ | ||
.setOutputCol("ner_chunk") | ||
|
||
ner_pipeline = Pipeline(stages=[ | ||
documentAssembler, | ||
sentenceDetector, | ||
tokenizer, | ||
word_embeddings, | ||
ner, | ||
ner_converter]) | ||
|
||
empty_data = spark.createDataFrame([[""]]).toDF("text") | ||
|
||
ner_model = ner_pipeline.fit(empty_data) | ||
|
||
text= """In December 2019 , a group of patients with the acute respiratory disease was detected in Wuhan , Hubei Province of China . A month later , a new beta-coronavirus was identified as the cause of the 2019 coronavirus infection . SARS-CoV-2 is a coronavirus that belongs to the group of β-coronaviruses of the subgenus Coronaviridae . The SARS-CoV-2 is the third known zoonotic coronavirus disease after severe acute respiratory syndrome ( SARS ) and Middle Eastern respiratory syndrome ( MERS ). The diagnosis of SARS-CoV-2 recommended by the WHO , CDC is the collection of a sample from the upper respiratory tract ( nasal and oropharyngeal exudate ) or from the lower respiratory tractsuch as expectoration of endotracheal aspirate and bronchioloalveolar lavage and its analysis using the test of real-time polymerase chain reaction ( qRT-PCR ).In 2020, the first COVID‑19 vaccine was developed and made available to the public through emergency authorizations and conditional approvals.""" | ||
|
||
# Run the fitted pipeline (`ner_model`) on the sample text | ||
results = ner_model.transform(spark.createDataFrame([[text]]).toDF('text')) | ||
``` | ||
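```scala | ||
import org.apache.spark.ml.Pipeline | ||
import spark.implicits._ // required for .toDS; assumes an active SparkSession named `spark` | ||
// Spark NLP and Spark NLP for Healthcare annotators are assumed to be imported already | ||
|
||
val document_assembler = new DocumentAssembler() | ||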
.setInputCol("text") | ||
.setOutputCol("document") | ||
|
||
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") | ||
.setInputCols("document") | ||
.setOutputCol("sentence") | ||
|
||
val tokenizer = new Tokenizer() | ||
.setInputCols("sentence") | ||
.setOutputCol("token") | ||
|
||
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical" ,"en", "clinical/models") | ||
.setInputCols(Array("sentence","token")) | ||
.setOutputCol("embeddings") | ||
|
||
val ner_model = MedicalNerModel.pretrained("ner_covid_trials", "en", "clinical/models") | ||
.setInputCols(Array("sentence", "token", "embeddings")) | ||
.setOutputCol("ner") | ||
|
||
val ner_converter = new NerConverter() | ||
.setInputCols(Array("sentence", "token", "ner")) | ||
.setOutputCol("ner_chunk") | ||
|
||
val pipeline = new Pipeline().setStages(Array(document_assembler, | ||
sentence_detector, | ||
tokenizer, | ||
word_embeddings, | ||
ner_model, | ||
ner_converter)) | ||
|
||
val data = Seq("""In December 2019 , a group of patients with the acute respiratory disease was detected in Wuhan , Hubei Province of China . A month later , a new beta-coronavirus was identified as the cause of the 2019 coronavirus infection . SARS-CoV-2 is a coronavirus that belongs to the group of β-coronaviruses of the subgenus Coronaviridae . The SARS-CoV-2 is the third known zoonotic coronavirus disease after severe acute respiratory syndrome ( SARS ) and Middle Eastern respiratory syndrome ( MERS ). The diagnosis of SARS-CoV-2 recommended by the WHO , CDC is the collection of a sample from the upper respiratory tract ( nasal and oropharyngeal exudate ) or from the lower respiratory tractsuch as expectoration of endotracheal aspirate and bronchioloalveolar lavage and its analysis using the test of real-time polymerase chain reaction ( qRT-PCR ).In 2020, the first COVID‑19 vaccine was developed and made available to the public through emergency authorizations and conditional approvals.""").toDS.toDF("text") | ||
|
||
val result = pipeline.fit(data).transform(data) | ||
``` | ||
</div> | ||
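
The pipeline above writes one annotation per detected entity to the `ner_chunk` column. The following is a minimal sketch of flattening such annotations into rows with the same columns as the Results table. Plain dictionaries stand in for the annotations so that the example runs without Spark; the field names (`begin`, `end`, `result`, and the `entity` and `sentence` metadata keys) are assumptions about the annotation layout, and `chunks_to_rows` and `filter_by_entity` are hypothetical helpers, not part of the library.

```python
# Plain dicts stand in for the annotations in the `ner_chunk` column.
# The field names below are assumptions about the annotation layout.
def chunks_to_rows(annotations):
    """Flatten chunk annotations into rows shaped like the Results table."""
    rows = []
    for ann in annotations:
        meta = ann["metadata"]
        rows.append({
            "chunks": ann["result"],               # the matched text
            "begin": ann["begin"],                 # character offset of the first character
            "end": ann["end"],                     # character offset of the last character
            "sentence_id": int(meta["sentence"]),  # metadata values are assumed to be strings
            "entities": meta["entity"],            # predicted label
        })
    return rows


def filter_by_entity(rows, label):
    """Keep only the rows whose predicted label equals `label`."""
    return [row for row in rows if row["entities"] == label]


# Two annotations copied from the first rows of the Results table.
sample = [
    {"begin": 3, "end": 15, "result": "December 2019",
     "metadata": {"entity": "Date", "sentence": "0"}},
    {"begin": 48, "end": 72, "result": "acute respiratory disease",
     "metadata": {"entity": "Disease_Syndrome_Disorder", "sentence": "0"}},
]

rows = chunks_to_rows(sample)
for row in rows:
    print(row)
```

With a real result DataFrame, the same mapping would be applied to the collected `ner_chunk` annotations rather than to hand-written dictionaries.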
|
||
## Results | ||
|
||
```bash | ||
| | chunks | begin | end | sentence_id | entities | | ||
|---:|:------------------------------------|--------:|------:|--------------:|:--------------------------| | ||
| 0 | December 2019 | 3 | 15 | 0 | Date | | ||
| 1 | acute respiratory disease | 48 | 72 | 0 | Disease_Syndrome_Disorder | | ||
| 2 | beta-coronavirus | 146 | 161 | 1 | Virus | | ||
| 3 | 2019 | 198 | 201 | 1 | Date | | ||
| 4 | coronavirus infection | 203 | 223 | 1 | Disease_Syndrome_Disorder | | ||
| 5 | SARS-CoV-2 | 228 | 237 | 2 | Virus | | ||
| 6 | coronavirus | 244 | 254 | 2 | Virus | | ||
| 7 | β-coronaviruses | 285 | 299 | 2 | Virus | | ||
| 8 | subgenus Coronaviridae | 308 | 329 | 2 | Virus | | ||
| 9 | SARS-CoV-2 | 337 | 346 | 3 | Virus | | ||
| 10 | zoonotic coronavirus disease | 367 | 394 | 3 | Disease_Syndrome_Disorder | | ||
| 11 | severe acute respiratory syndrome | 402 | 434 | 3 | Disease_Syndrome_Disorder | | ||
| 12 | SARS | 438 | 441 | 3 | Disease_Syndrome_Disorder | | ||
| 13 | Middle Eastern respiratory syndrome | 449 | 483 | 3 | Disease_Syndrome_Disorder | | ||
| 14 | MERS | 487 | 490 | 3 | Disease_Syndrome_Disorder | | ||
| 15 | SARS-CoV-2 | 513 | 522 | 4 | Virus | | ||
| 16 | WHO | 543 | 545 | 4 | Institution | | ||
| 17 | CDC | 549 | 551 | 4 | Institution | | ||
| 18 | 2020 | 852 | 855 | 5 | Date | | ||
| 19 | COVID‑19 vaccine | 868 | 883 | 5 | Vaccine_Name | | ||
``` | ||
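
The `begin` and `end` columns in the table above are character offsets into the input text, and the values shown are consistent with `end` being inclusive. A small sketch, using only the first sentence of the sample text and a hypothetical helper `chunk_text`, shows how to recover a chunk from its offsets:

```python
# First sentence of the sample text, copied verbatim (including the spaces
# before punctuation, which the offsets depend on).
text = ("In December 2019 , a group of patients with the acute respiratory "
        "disease was detected in Wuhan , Hubei Province of China .")


def chunk_text(text, begin, end):
    """Return the substring for a chunk whose `end` offset is inclusive."""
    # Python slices exclude the stop index, so add one to the inclusive end.
    return text[begin:end + 1]


print(chunk_text(text, 3, 15))   # December 2019
print(chunk_text(text, 48, 72))  # acute respiratory disease
```

Because the offsets are tied to the exact characters of the input, any edit to the text (for example, inserting a missing space) shifts every offset after it.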
|
||
{:.model-param} | ||
## Model Information | ||
|
||
{:.table-model} | ||
|---|---| | ||
|Model Name:|ner_covid_trials| | ||
|Compatibility:|Spark NLP for Healthcare 4.2.0+| | ||
|License:|Licensed| | ||
|Edition:|Official| | ||
|Input Labels:|[sentence, token, embeddings]| | ||
|Output Labels:|[ner]| | ||
|Language:|en| | ||
|Size:|14.8 MB| | ||
|
||
## References | ||
|
||
This model was trained on COVID trial data sampled from clinicaltrials.gov and annotated in-house. | ||
|
||
## Benchmarking | ||
|
||
```bash | ||
label tp fp fn total precision recall f1 | ||
Institution 34 8 20 55.0 0.7958 0.6343 0.706 | ||
VS_Finding 19 2 1 20.0 0.9048 0.95 0.9268 | ||
Respiration 5 0 0 5.0 1.0 1.0 1.0 | ||
Cerebrovascular_D... 5 2 2 7.0 0.7143 0.7143 0.7143 | ||
Cell_Type 152 27 14 167.0 0.8479 0.9123 0.8789 | ||
Heart_Disease 36 3 5 41.0 0.9231 0.878 0.9 | ||
Severity 57 25 3 60.0 0.6881 0.95 0.7981 | ||
N_Patients 27 3 1 29.0 0.8871 0.9483 0.9167 | ||
Pulse 12 2 0 12.0 0.8571 1.0 0.9231 | ||
Obesity 3 0 0 3.0 1.0 1.0 1.0 | ||
Admission_Discharge 85 3 0 85.0 0.9659 1.0 0.9827 | ||
Diabetes 8 0 0 8.0 1.0 1.0 1.0 | ||
Section_Header 94 8 13 108.0 0.9154 0.8711 0.8927 | ||
Age 22 1 0 22.0 0.9429 1.0 0.9706 | ||
Cellular_component 40 21 10 50.0 0.6534 0.8 0.7193 | ||
Hypertension 10 0 0 10.0 1.0 1.0 1.0 | ||
BMI 5 1 1 6.0 0.8333 0.8333 0.8333 | ||
Trial_Phase 13 0 1 14.0 0.9398 0.9286 0.9341 | ||
Employment 98 12 8 107.0 0.8874 0.9206 0.9037 | ||
Statistical_Indic... 76 29 11 88.0 0.7206 0.8689 0.7879 | ||
Time 2 0 1 3.0 1.0 0.6667 0.8 | ||
Total_Cholesterol 14 1 2 17.0 0.9355 0.8529 0.8923 | ||
Drug_Ingredient 327 33 67 395.0 0.9084 0.8281 0.8664 | ||
Physiological_rea... 27 7 14 41.0 0.7864 0.6585 0.7168 | ||
Treatment 66 4 25 92.0 0.9433 0.7228 0.8185 | ||
Vaccine 20 1 2 23.0 0.9531 0.8841 0.9173 | ||
Disease_Syndrome_... 774 70 41 816.0 0.9171 0.9495 0.933 | ||
Virus 121 8 23 144.0 0.9365 0.8403 0.8858 | ||
Frequency 57 1 2 59.9 0.9787 0.9556 0.967 | ||
Route 37 4 10 47.0 0.9024 0.7872 0.8409 | ||
Death_Entity 20 9 3 23.0 0.6897 0.8696 0.7692 | ||
Stage 4 0 7 12.0 1.0 0.3889 0.56 | ||
Vaccine_Name 10 1 0 10.0 0.9091 1.0 0.9524 | ||
Trial_Design 32 13 8 41.0 0.7149 0.7951 0.7529 | ||
Biological_molecules 251 91 53 305.0 0.7335 0.8233 0.7758 | ||
Date 98 5 2 100.0 0.9492 0.98 0.9643 | ||
Race_Ethnicity 0 0 2 2.0 0.0 0.0 0.0 | ||
Gender 46 1 0 46.0 0.9787 1.0 0.9892 | ||
Dosage 49 9 24 73.0 0.8376 0.6712 0.7452 | ||
Viral_components 18 10 15 34.0 0.6512 0.549 0.5957 | ||
|
||
macro - - - - - - 0.8382 | ||
micro - - - - - - 0.8704 | ||
``` |
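
The precision, recall, and F1 columns follow from the `tp`, `fp`, and `fn` counts. The sketch below uses the standard definitions and checks them against the `VS_Finding` row. The source does not state how the `macro` and `micro` rows were computed, so the two averaging functions are assumptions based on the usual conventions (macro averages the per-label F1 scores, micro pools the counts first); all function names are hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard per-label metrics; returns 0.0 where a denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def macro_f1(counts):
    """Unweighted mean of the per-label F1 scores."""
    scores = [precision_recall_f1(*c)[2] for c in counts.values()]
    return sum(scores) / len(scores)


def micro_f1(counts):
    """F1 computed from counts pooled over all labels."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return precision_recall_f1(tp, fp, fn)[2]


# (tp, fp, fn) for two rows copied from the benchmark table.
counts = {
    "VS_Finding": (19, 2, 1),
    "Heart_Disease": (36, 3, 5),
}

p, r, f1 = precision_recall_f1(*counts["VS_Finding"])
print(round(p, 4), round(r, 4), round(f1, 4))  # 0.9048 0.95 0.9268
print(round(macro_f1(counts), 4), round(micro_f1(counts), 4))
```

The two-label averages printed on the last line are only illustrative; reproducing the table's `macro` and `micro` rows would require all labels and the averaging scheme actually used.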