2022-11-22-ner_deid_generic_bert_ro (#13121)

* Add model 2022-11-22-ner_deid_generic_bert_ro
* Add model 2022-11-22-ner_clinical_bert_ro
* Add model 2022-11-22-ner_living_species_300_es

Co-authored-by: Cabir40 <cabir4006@gmail.com>
JohnSnowLabs · Nov 22, 2022 · 45b58cd
1 parent 8a3d199

Showing 3 changed files with 526 additions and 0 deletions.
diff --git a/docs/_posts/Cabir40/2022-11-22-ner_clinical_bert_ro.md b/docs/_posts/Cabir40/2022-11-22-ner_clinical_bert_ro.md
@@ -0,0 +1,187 @@
+---
+layout: model
+title: Detect Clinical Entities in Romanian (Bert, Base, Cased)
+author: John Snow Labs
+name: ner_clinical_bert
+date: 2022-11-22
+tags: [licensed, clinical, ro, ner, bert]
+task: Named Entity Recognition
+language: ro
+edition: Healthcare NLP 4.2.2
+spark_version: 3.0
+supported: true
+article_header:
+  type: cover
+use_language_switcher: "Python-Scala-Java"
+---
+
+## Description
+
+Extract clinical entities from Romanian clinical texts. This model is trained using `bert_base_cased` embeddings.
+
+## Predicted Entities
+
+`Measurements`, `Form`, `Symptom`, `Route`, `Procedure`, `Disease_Syndrome_Disorder`, `Score`, `Drug_Ingredient`, `Pulse`, `Frequency`, `Date`, `Body_Part`, `Drug_Brand_Name`, `Time`, `Direction`, `Medical_Device`, `Imaging_Technique`, `Test`, `Imaging_Findings`, `Imaging_Test`, `Test_Result`, `Weight`, `Clinical_Dept`, `Units`, `Dosage`
+
+{:.btn-box}
+[Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI){:.button.button-orange}
+[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
+[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_clinical_bert_ro_4.2.2_3.0_1669124033852.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
+
+## How to use
+
+
+
+<div class="tabs-box" markdown="1">
+{% include programmingLanguageSelectScalaPythonNLU.html %}
+```python
+documentAssembler = DocumentAssembler()\
+.setInputCol("text")\
+.setOutputCol("document")
+
+sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
+.setInputCols(["document"])\
+.setOutputCol("sentence")
+
+tokenizer = Tokenizer()\
+.setInputCols(["sentence"])\
+.setOutputCol("token")
+
+word_embeddings = BertEmbeddings.pretrained("bert_base_cased", "ro") \
+.setInputCols(["sentence", "token"]) \
+.setOutputCol("embeddings")
+
+clinical_ner = MedicalNerModel.pretrained("ner_clinical_bert","ro","clinical/models")\
+.setInputCols(["sentence","token","embeddings"])\
+.setOutputCol("ner")
+
+ner_converter = NerConverter()\
+.setInputCols(["sentence","token","ner"])\
+.setOutputCol("ner_chunk")
+
+nlpPipeline = Pipeline(stages=[
+documentAssembler,
+sentenceDetector,
+tokenizer,
+word_embeddings,
+clinical_ner,
+ner_converter])
+
+data = spark.createDataFrame([[""" Solicitare: Angio CT cardio-toracic Dg. de trimitere Atrezie de valva pulmonara. Hipoplazie VS. Atrezie VAV stang. Anastomoza Glenn. Sp. Tromboza la nivelul anastomozei. Trimis de: Sectia Clinica Cardiologie (dr. Sue T.) Procedura Aparat GE Revolution HD. Branula albastra montata la nivelul membrului superior drept. Scout. Se administreaza 30 ml Iomeron 350 cu flux 2.2 ml/s, urmate de 20 ml ser fiziologic cu acelasi flux. Se efectueaza o examinare angio-CT cardiotoracica cu achizitii secventiale prospective la o frecventa cardiaca medie de 100/min."""]]).toDF("text")
+
+result = nlpPipeline.fit(data).transform(data)
+```
+```scala
+val document_assembler = new DocumentAssembler()
+.setInputCol("text")
+.setOutputCol("document")
+
+val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
+.setInputCols(Array("document"))
+.setOutputCol("sentence")
+
+val tokenizer = new Tokenizer()
+.setInputCols(Array("sentence"))
+.setOutputCol("token")
+
+val embeddings = BertEmbeddings.pretrained("bert_base_cased", "ro")
+.setInputCols(Array("sentence", "token"))
+.setOutputCol("embeddings")
+
+val ner_model = MedicalNerModel.pretrained("ner_clinical_bert", "ro", "clinical/models")
+.setInputCols(Array("sentence", "token", "embeddings"))
+.setOutputCol("ner")
+
+val ner_converter = new NerConverter()
+.setInputCols(Array("sentence", "token", "ner"))
+.setOutputCol("ner_chunk")
+
+val pipeline = new Pipeline().setStages(Array(document_assembler,
+sentence_detector,
+tokenizer,
+embeddings,
+ner_model,
+ner_converter))
+
+val data = Seq("""Solicitare: Angio CT cardio-toracic Dg. de trimitere Atrezie de valva pulmonara. Hipoplazie VS. Atrezie VAV stang. Anastomoza Glenn. Sp. Tromboza la nivelul anastomozei. Trimis de: Sectia Clinica Cardiologie (dr. Sue T.) Procedura Aparat GE Revolution HD. Branula albastra montata la nivelul membrului superior drept. Scout. Se administreaza 30 ml Iomeron 350 cu flux 2.2 ml/s, urmate de 20 ml ser fiziologic cu acelasi flux. Se efectueaza o examinare angio-CT cardiotoracica cu achizitii secventiale prospective la o frecventa cardiaca medie de 100/min.""").toDS.toDF("text")
+
+val result = pipeline.fit(data).transform(data)
+```
+
+{:.nlu-block}
+```python
+import nlu
+nlu.load("ro.embed.clinical.bert.base_cased").predict(""" Solicitare: Angio CT cardio-toracic Dg. de trimitere Atrezie de valva pulmonara. Hipoplazie VS. Atrezie VAV stang. Anastomoza Glenn. Sp. Tromboza la nivelul anastomozei. Trimis de: Sectia Clinica Cardiologie (dr. Sue T.) Procedura Aparat GE Revolution HD. Branula albastra montata la nivelul membrului superior drept. Scout. Se administreaza 30 ml Iomeron 350 cu flux 2.2 ml/s, urmate de 20 ml ser fiziologic cu acelasi flux. Se efectueaza o examinare angio-CT cardiotoracica cu achizitii secventiale prospective la o frecventa cardiaca medie de 100/min.""")
+```
+</div>
+
+## Results
+
+```bash
++--------------------------+-------------------------+
+|chunks                    |entities                 |
++--------------------------+-------------------------+
+|Angio CT cardio-toracic   |Imaging_Test             |
+|Atrezie                   |Disease_Syndrome_Disorder|
+|valva pulmonara           |Body_Part                |
+|Hipoplazie                |Disease_Syndrome_Disorder|
+|VS                        |Body_Part                |
+|Atrezie                   |Disease_Syndrome_Disorder|
+|VAV stang                 |Body_Part                |
+|Anastomoza Glenn          |Disease_Syndrome_Disorder|
+|Tromboza                  |Disease_Syndrome_Disorder|
+|Sectia Clinica Cardiologie|Clinical_Dept            |
+|GE Revolution HD          |Medical_Device           |
+|Branula albastra          |Medical_Device           |
+|membrului superior drept  |Body_Part                |
+|Scout                     |Body_Part                |
+|30 ml                     |Dosage                   |
+|Iomeron 350               |Drug_Ingredient          |
+|2.2 ml/s                  |Dosage                   |
+|20 ml                     |Dosage                   |
+|ser fiziologic            |Drug_Ingredient          |
+|angio-CT                  |Imaging_Test             |
++--------------------------+-------------------------+
+```
+
+{:.model-param}
+## Model Information
+
+{:.table-model}
+|---|---|
+|Model Name:|ner_clinical_bert|
+|Compatibility:|Healthcare NLP 4.2.2+|
+|License:|Licensed|
+|Edition:|Official|
+|Input Labels:|[sentence, token, embeddings]|
+|Output Labels:|[ner]|
+|Language:|ro|
+|Size:|16.3 MB|
+
+## Benchmarking
+
+```bash
+label                  precision    recall  f1-score   support
+Body_Part                   0.91      0.93      0.92       679
+Clinical_Dept               0.68      0.65      0.67        97
+Date                        0.99      0.99      0.99        87
+Direction                   0.66      0.76      0.70        50
+Disease_Syndrome_Disorder   0.73      0.76      0.74       121
+Dosage                      0.78      1.00      0.87        38
+Drug_Ingredient             0.90      0.94      0.92        48
+Form                        1.00      1.00      1.00         6
+Imaging_Findings            0.86      0.82      0.84       201
+Imaging_Technique           0.92      0.92      0.92        26
+Imaging_Test                0.93      0.98      0.95       205
+Measurements                0.71      0.69      0.70       214
+Medical_Device              0.85      0.81      0.83        42
+Pulse                       0.82      1.00      0.90         9
+Route                       1.00      0.91      0.95        33
+Score                       1.00      0.98      0.99        41
+Time                        1.00      1.00      1.00        28
+Units                       0.60      0.93      0.73        88
+Weight                      0.82      1.00      0.90         9
+micro-avg                   0.84      0.87      0.86      2037
+macro-avg                   0.70      0.74      0.72      2037
+weighted-avg                0.84      0.87      0.85      2037
+```
diff --git a/docs/_posts/Cabir40/2022-11-22-ner_deid_generic_bert_ro.md b/docs/_posts/Cabir40/2022-11-22-ner_deid_generic_bert_ro.md
@@ -0,0 +1,173 @@
+---
+layout: model
+title: Detect PHI for Generic Deidentification in Romanian (BERT)
+author: John Snow Labs
+name: ner_deid_generic_bert
+date: 2022-11-22
+tags: [licensed, clinical, ro, deidentification, phi, generic, bert]
+task: Named Entity Recognition
+language: ro
+edition: Healthcare NLP 4.2.2
+spark_version: 3.0
+supported: true
+article_header:
+  type: cover
+use_language_switcher: "Python-Scala-Java"
+---
+
+## Description
+
+The Named Entity Recognition annotator allows a generic model to be trained with a deep learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art NER model: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs.
+
+Deidentification NER (Romanian) is a Named Entity Recognition model that annotates text to find protected health information that may need to be de-identified. It is trained with bert_base_cased embeddings and can detect 7 generic entities.
+
+The model is trained on a combination of custom datasets, using several data augmentation techniques.
+
+## Predicted Entities
+
+`AGE`, `CONTACT`, `DATE`, `ID`, `LOCATION`, `NAME`, `PROFESSION`
+
+{:.btn-box}
+[Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI/){:.button.button-orange}
+[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_PHI_TEXT_MULTI.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
+[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_bert_ro_4.2.2_3.0_1669122326582.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
+
+## How to use
+
+
+
+<div class="tabs-box" markdown="1">
+{% include programmingLanguageSelectScalaPythonNLU.html %}
+```python
+documentAssembler = DocumentAssembler()\
+        .setInputCol("text")\
+        .setOutputCol("document")
+
+sentenceDetector = SentenceDetector()\
+        .setInputCols(["document"])\
+        .setOutputCol("sentence")
+
+tokenizer = Tokenizer()\
+        .setInputCols(["sentence"])\
+        .setOutputCol("token")
+
+embeddings = BertEmbeddings.pretrained("bert_base_cased", "ro")\
+        .setInputCols(["sentence","token"])\
+        .setOutputCol("word_embeddings")
+
+clinical_ner = MedicalNerModel.pretrained("ner_deid_generic_bert", "ro", "clinical/models")\
+        .setInputCols(["sentence","token","word_embeddings"])\
+        .setOutputCol("ner")
+
+ner_converter = NerConverter()\
+        .setInputCols(["sentence", "token", "ner"])\
+        .setOutputCol("ner_chunk")
+
+nlpPipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner, ner_converter])
+
+text = """
+Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România
+Tel: +40(235)413773
+Data setului de analize: 25 May 2022 15:36:00
+Nume si Prenume : BUREAN MARIA, Varsta: 77
+Medic : Agota Evelyn Tımar
+C.N.P : 2450502264401"""
+
+data = spark.createDataFrame([[text]]).toDF("text")
+
+results = nlpPipeline.fit(data).transform(data)
+```
+```scala
+val documentAssembler = new DocumentAssembler()
+        .setInputCol("text")
+        .setOutputCol("document")
+
+val sentenceDetector = new SentenceDetector()
+        .setInputCols(Array("document"))
+        .setOutputCol("sentence")
+
+val tokenizer = new Tokenizer()
+        .setInputCols(Array("sentence"))
+        .setOutputCol("token")
+
+val embeddings = BertEmbeddings.pretrained("bert_base_cased", "ro")
+        .setInputCols(Array("sentence","token"))
+        .setOutputCol("word_embeddings")
+
+val clinical_ner = MedicalNerModel.pretrained("ner_deid_generic_bert", "ro", "clinical/models")
+        .setInputCols(Array("sentence","token","word_embeddings"))
+        .setOutputCol("ner")
+
+val ner_converter = new NerConverter()
+        .setInputCols(Array("sentence", "token", "ner"))
+        .setOutputCol("ner_chunk")
+
+val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner, ner_converter))
+
+val text = """Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România
+Tel: +40(235)413773
+Data setului de analize: 25 May 2022 15:36:00
+Nume si Prenume : BUREAN MARIA, Varsta: 77
+Medic : Agota Evelyn Tımar
+C.N.P : 2450502264401"""
+
+val data = Seq(text).toDS.toDF("text")
+
+val results = pipeline.fit(data).transform(data)
+```
+</div>
+
+## Results
+
+```bash
++----------------------------+---------+
+|chunk                       |ner_label|
++----------------------------+---------+
+|Spitalul Pentru Ochi de Deal|LOCATION |
+|Drumul Oprea Nr             |LOCATION |
+|972                         |LOCATION |
+|Vaslui                      |LOCATION |
+|737405                      |LOCATION |
+|+40(235)413773              |CONTACT  |
+|25 May 2022                 |DATE     |
+|BUREAN MARIA                |NAME     |
+|77                          |AGE      |
+|Agota Evelyn Tımar          |NAME     |
+|2450502264401               |ID       |
++----------------------------+---------+
+```
+
+{:.model-param}
+## Model Information
+
+{:.table-model}
+|---|---|
+|Model Name:|ner_deid_generic_bert|
+|Compatibility:|Healthcare NLP 4.2.2+|
+|License:|Licensed|
+|Edition:|Official|
+|Input Labels:|[sentence, token, embeddings]|
+|Output Labels:|[ner]|
+|Language:|ro|
+|Size:|16.5 MB|
+
+## References
+
+- Custom John Snow Labs datasets
+- Data augmentation techniques
+
+## Benchmarking
+
+```bash
+       label  precision    recall  f1-score   support
+         AGE       0.95      0.97      0.96      1186
+     CONTACT       0.99      0.98      0.98       366
+        DATE       0.96      0.92      0.94      4518
+          ID       1.00      1.00      1.00       679
+    LOCATION       0.91      0.90      0.90      1683
+        NAME       0.93      0.96      0.94      2916
+  PROFESSION       0.87      0.85      0.86       161
+   micro-avg       0.94      0.94      0.94     11509
+   macro-avg       0.94      0.94      0.94     11509
+weighted-avg       0.95      0.94      0.94     11509
+```
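
Review note on the `ner_deid_generic_bert` benchmark: the summary rows can be sanity-checked against the per-label rows. The sketch below is a minimal illustration, assuming the `macro-avg` and `weighted-avg` rows follow the usual definitions (unweighted mean across labels, and mean weighted by support); the commit does not say how they were computed. The helper names `macro_avg` and `weighted_avg` are hypothetical and not part of any library.

```python
# Per-label rows copied from the ner_deid_generic_bert benchmark table:
# (label, precision, recall, f1-score, support)
rows = [
    ("AGE",        0.95, 0.97, 0.96, 1186),
    ("CONTACT",    0.99, 0.98, 0.98,  366),
    ("DATE",       0.96, 0.92, 0.94, 4518),
    ("ID",         1.00, 1.00, 1.00,  679),
    ("LOCATION",   0.91, 0.90, 0.90, 1683),
    ("NAME",       0.93, 0.96, 0.94, 2916),
    ("PROFESSION", 0.87, 0.85, 0.86,  161),
]

SUPPORT = 4  # index of the support column in each row


def macro_avg(rows, col):
    """Unweighted mean of one metric column across labels (assumed definition)."""
    return sum(r[col] for r in rows) / len(rows)


def weighted_avg(rows, col):
    """Mean of one metric column, weighted by each label's support (assumed definition)."""
    total = sum(r[SUPPORT] for r in rows)
    return sum(r[col] * r[SUPPORT] for r in rows) / total


total_support = sum(r[SUPPORT] for r in rows)

# Columns 1, 2, 3 are precision, recall, f1-score.
macro = [round(macro_avg(rows, c), 2) for c in (1, 2, 3)]
weighted = [round(weighted_avg(rows, c), 2) for c in (1, 2, 3)]

print("support     ", total_support)
print("macro-avg   ", macro)
print("weighted-avg", weighted)
```

Under those assumptions the recomputed support total and both average rows agree with the published table to two decimals. The `micro-avg` row is not recomputed here, since it pools raw true-positive, false-positive, and false-negative counts that the table does not list.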