2022-11-22-ner_deid_generic_bert_ro (#13121)
* Add model 2022-11-22-ner_deid_generic_bert_ro

* Add model 2022-11-22-ner_clinical_bert_ro

* Add model 2022-11-22-ner_living_species_300_es

Co-authored-by: Cabir40 <cabir4006@gmail.com>
jsl-models and Cabir40 authored Nov 22, 2022
1 parent 8a3d199 commit 45b58cd
Showing 3 changed files with 526 additions and 0 deletions.
187 changes: 187 additions & 0 deletions docs/_posts/Cabir40/2022-11-22-ner_clinical_bert_ro.md
@@ -0,0 +1,187 @@
---
layout: model
title: Detect Clinical Entities in Romanian (Bert, Base, Cased)
author: John Snow Labs
name: ner_clinical_bert
date: 2022-11-22
tags: [licensed, clinical, ro, ner, bert]
task: Named Entity Recognition
language: ro
edition: Healthcare NLP 4.2.2
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Extract clinical entities from Romanian clinical texts. This model is trained using `bert_base_cased` embeddings.

## Predicted Entities

`Measurements`, `Form`, `Symptom`, `Route`, `Procedure`, `Disease_Syndrome_Disorder`, `Score`, `Drug_Ingredient`, `Pulse`, `Frequency`, `Date`, `Body_Part`, `Drug_Brand_Name`, `Time`, `Direction`, `Dosage`, `Medical_Device`, `Imaging_Technique`, `Test`, `Imaging_Findings`, `Imaging_Test`, `Test_Result`, `Weight`, `Clinical_Dept`, `Units`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_clinical_bert_ro_4.2.2_3.0_1669124033852.zip){:.button.button-orange.button-orange-trans.arr.button-icon}

## How to use



<div class="tabs-box" markdown="1">
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
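# Assumes Spark NLP and the licensed Healthcare NLP package (sparknlp_jsl) are installed,
# and that an active session is available as `spark`.
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetectorDLModel, Tokenizer, BertEmbeddings, NerConverter
from sparknlp_jsl.annotator import MedicalNerModel

# Wrap the raw text column as a Spark NLP document.
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Split documents into sentences with the multilingual sentence detector.
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Romanian BERT embeddings, the same embeddings the NER model was trained with.
word_embeddings = BertEmbeddings.pretrained("bert_base_cased", "ro")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

# Token-level clinical NER tags.
clinical_ner = MedicalNerModel.pretrained("ner_clinical_bert", "ro", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

# Merge token-level tags into entity chunks.
ner_converter = NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

nlpPipeline = Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    clinical_ner,
    ner_converter])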

data = spark.createDataFrame([[""" Solicitare: Angio CT cardio-toracic Dg. de trimitere Atrezie de valva pulmonara. Hipoplazie VS. Atrezie VAV stang. Anastomoza Glenn. Sp. Tromboza la nivelul anastomozei. Trimis de: Sectia Clinica Cardiologie (dr. Sue T.) Procedura Aparat GE Revolution HD. Branula albastra montata la nivelul membrului superior drept. Scout. Se administreaza 30 ml Iomeron 350 cu flux 2.2 ml/s, urmate de 20 ml ser fiziologic cu acelasi flux. Se efectueaza o examinare angio-CT cardiotoracica cu achizitii secventiale prospective la o frecventa cardiaca medie de 100/min."""]]).toDF("text")

result = nlpPipeline.fit(data).transform(data)
```
```scala
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")

val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")

val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")

val embeddings = BertEmbeddings.pretrained("bert_base_cased", "ro")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")

val ner_model = MedicalNerModel.pretrained("ner_clinical_bert", "ro", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")

val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")

// Use Pipeline (an estimator) here: PipelineModel is what fit() returns and has no setStages.
val pipeline = new Pipeline().setStages(Array(document_assembler,
  sentence_detector,
  tokenizer,
  embeddings,
  ner_model,
  ner_converter))

val data = Seq("""Solicitare: Angio CT cardio-toracic Dg. de trimitere Atrezie de valva pulmonara. Hipoplazie VS. Atrezie VAV stang. Anastomoza Glenn. Sp. Tromboza la nivelul anastomozei. Trimis de: Sectia Clinica Cardiologie (dr. Sue T.) Procedura Aparat GE Revolution HD. Branula albastra montata la nivelul membrului superior drept. Scout. Se administreaza 30 ml Iomeron 350 cu flux 2.2 ml/s, urmate de 20 ml ser fiziologic cu acelasi flux. Se efectueaza o examinare angio-CT cardiotoracica cu achizitii secventiale prospective la o frecventa cardiaca medie de 100/min.""").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("ro.embed.clinical.bert.base_cased").predict(""" Solicitare: Angio CT cardio-toracic Dg. de trimitere Atrezie de valva pulmonara. Hipoplazie VS. Atrezie VAV stang. Anastomoza Glenn. Sp. Tromboza la nivelul anastomozei. Trimis de: Sectia Clinica Cardiologie (dr. Sue T.) Procedura Aparat GE Revolution HD. Branula albastra montata la nivelul membrului superior drept. Scout. Se administreaza 30 ml Iomeron 350 cu flux 2.2 ml/s, urmate de 20 ml ser fiziologic cu acelasi flux. Se efectueaza o examinare angio-CT cardiotoracica cu achizitii secventiale prospective la o frecventa cardiaca medie de 100/min.""")
```
</div>

## Results

```bash
+--------------------------+-------------------------+
|chunks                    |entities                 |
+--------------------------+-------------------------+
|Angio CT cardio-toracic   |Imaging_Test             |
|Atrezie                   |Disease_Syndrome_Disorder|
|valva pulmonara           |Body_Part                |
|Hipoplazie                |Disease_Syndrome_Disorder|
|VS                        |Body_Part                |
|Atrezie                   |Disease_Syndrome_Disorder|
|VAV stang                 |Body_Part                |
|Anastomoza Glenn          |Disease_Syndrome_Disorder|
|Tromboza                  |Disease_Syndrome_Disorder|
|Sectia Clinica Cardiologie|Clinical_Dept            |
|GE Revolution HD          |Medical_Device           |
|Branula albastra          |Medical_Device           |
|membrului superior drept  |Body_Part                |
|Scout                     |Body_Part                |
|30 ml                     |Dosage                   |
|Iomeron 350               |Drug_Ingredient          |
|2.2 ml/s                  |Dosage                   |
|20 ml                     |Dosage                   |
|ser fiziologic            |Drug_Ingredient          |
|angio-CT                  |Imaging_Test             |
+--------------------------+-------------------------+
```
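
The `ner_chunk` rows above are produced by merging token-level tags into multi-token entities, which is the job of the `NerConverter` stage. The following is a minimal plain-Python sketch of that grouping idea, not the Spark NLP implementation. The helper name `iob_to_chunks` is hypothetical, and the token-level tags in the example are assumed for illustration; only the resulting chunks and labels come from the table above.

```python
def iob_to_chunks(tokens, tags):
    """Group token-level IOB tags into (chunk_text, label) pairs."""
    chunks = []
    current_tokens, current_label = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_label):
            # A new entity starts: flush the open one, then open a fresh chunk.
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], label
        elif prefix == "I":
            # Continuation of the currently open entity.
            current_tokens.append(token)
        else:
            # An "O" tag closes any open entity.
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        chunks.append((" ".join(current_tokens), current_label))
    return chunks


# Assumed token-level tags for a fragment of the example sentence.
tokens = ["Atrezie", "de", "valva", "pulmonara"]
tags = ["B-Disease_Syndrome_Disorder", "O", "B-Body_Part", "I-Body_Part"]
chunks = iob_to_chunks(tokens, tags)
print(chunks)
```

Under these assumed tags, the two tokens `valva` and `pulmonara` collapse into a single `Body_Part` chunk, matching the row in the table.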

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|ner_clinical_bert|
|Compatibility:|Healthcare NLP 4.2.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|ro|
|Size:|16.3 MB|

## Benchmarking

```bash
label                      precision  recall  f1-score  support
Body_Part                       0.91    0.93      0.92      679
Clinical_Dept                   0.68    0.65      0.67       97
Date                            0.99    0.99      0.99       87
Direction                       0.66    0.76      0.70       50
Disease_Syndrome_Disorder       0.73    0.76      0.74      121
Dosage                          0.78    1.00      0.87       38
Drug_Ingredient                 0.90    0.94      0.92       48
Form                            1.00    1.00      1.00        6
Imaging_Findings                0.86    0.82      0.84      201
Imaging_Technique               0.92    0.92      0.92       26
Imaging_Test                    0.93    0.98      0.95      205
Measurements                    0.71    0.69      0.70      214
Medical_Device                  0.85    0.81      0.83       42
Pulse                           0.82    1.00      0.90        9
Route                           1.00    0.91      0.95       33
Score                           1.00    0.98      0.99       41
Time                            1.00    1.00      1.00       28
Units                           0.60    0.93      0.73       88
Weight                          0.82    1.00      0.90        9
micro-avg                       0.84    0.87      0.86     2037
macro-avg                       0.70    0.74      0.72     2037
weighted-avg                    0.84    0.87      0.85     2037
```
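
For readers checking the table, the `f1-score` column is the harmonic mean of precision and recall. Below is a small sketch using two rows copied from the table. The function name `f1_score` is illustrative, and because the published precision and recall are already rounded to two decimals, a recomputed value can differ from the printed one in the last digit for some rows.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the benchmarking table.
rows = {"Body_Part": (0.91, 0.93), "Units": (0.60, 0.93)}
for label, (p, r) in rows.items():
    print(f"{label}: {f1_score(p, r):.2f}")
```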
173 changes: 173 additions & 0 deletions docs/_posts/Cabir40/2022-11-22-ner_deid_generic_bert_ro.md
@@ -0,0 +1,173 @@
---
layout: model
title: Detect PHI for Generic Deidentification in Romanian (BERT)
author: John Snow Labs
name: ner_deid_generic_bert
date: 2022-11-22
tags: [licensed, clinical, ro, deidentification, phi, generic, bert]
task: Named Entity Recognition
language: ro
edition: Healthcare NLP 4.2.2
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

The Named Entity Recognition annotator allows a generic model to be trained using a deep learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art NER model: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs.

Deidentification NER (Romanian) is a Named Entity Recognition model that annotates text to find protected health information (PHI) that may need to be de-identified. It is trained with `bert_base_cased` embeddings and can detect 7 generic entities.

This NER model is trained on a combination of custom datasets, using several data augmentation mechanisms.

## Predicted Entities

`AGE`, `CONTACT`, `DATE`, `ID`, `LOCATION`, `NAME`, `PROFESSION`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_PHI_TEXT_MULTI.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_bert_ro_4.2.2_3.0_1669122326582.zip){:.button.button-orange.button-orange-trans.arr.button-icon}

## How to use



<div class="tabs-box" markdown="1">
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
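# Assumes Spark NLP and the licensed Healthcare NLP package (sparknlp_jsl) are installed,
# and that an active session is available as `spark`.
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer, BertEmbeddings, NerConverter
from sparknlp_jsl.annotator import MedicalNerModel

# Wrap the raw text column as a Spark NLP document.
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Romanian BERT embeddings, the same embeddings the NER model was trained with.
embeddings = BertEmbeddings.pretrained("bert_base_cased", "ro")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("word_embeddings")

# Token-level PHI tags.
clinical_ner = MedicalNerModel.pretrained("ner_deid_generic_bert", "ro", "clinical/models")\
    .setInputCols(["sentence", "token", "word_embeddings"])\
    .setOutputCol("ner")

# Merge token-level tags into entity chunks.
ner_converter = NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

nlpPipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner, ner_converter])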

text = """
Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România
Tel: +40(235)413773
Data setului de analize: 25 May 2022 15:36:00
Nume si Prenume : BUREAN MARIA, Varsta: 77
Medic : Agota Evelyn Tımar
C.N.P : 2450502264401"""

data = spark.createDataFrame([[text]]).toDF("text")

results = nlpPipeline.fit(data).transform(data)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")

val sentenceDetector = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")

val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")

val embeddings = BertEmbeddings.pretrained("bert_base_cased", "ro")
.setInputCols(Array("sentence","token"))
.setOutputCol("word_embeddings")

val clinical_ner = MedicalNerModel.pretrained("ner_deid_generic_bert", "ro", "clinical/models")
.setInputCols(Array("sentence","token","word_embeddings"))
.setOutputCol("ner")

val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner, ner_converter))

val text = """Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România
Tel: +40(235)413773
Data setului de analize: 25 May 2022 15:36:00
Nume si Prenume : BUREAN MARIA, Varsta: 77
Medic : Agota Evelyn Tımar
C.N.P : 2450502264401"""

val data = Seq(text).toDS.toDF("text")

val results = pipeline.fit(data).transform(data)
```
</div>

## Results

```bash
+----------------------------+---------+
|chunk                       |ner_label|
+----------------------------+---------+
|Spitalul Pentru Ochi de Deal|LOCATION |
|Drumul Oprea Nr             |LOCATION |
|972                         |LOCATION |
|Vaslui                      |LOCATION |
|737405                      |LOCATION |
|+40(235)413773              |CONTACT  |
|25 May 2022                 |DATE     |
|BUREAN MARIA                |NAME     |
|77                          |AGE      |
|Agota Evelyn Tımar          |NAME     |
|2450502264401               |ID       |
+----------------------------+---------+
```
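
Once PHI chunks are detected, a typical next step is to mask them. The sketch below illustrates the idea in plain Python on one line of the example text, using chunks and labels from the table above. It is an illustration only, not the library's deidentification annotator: the helper `mask_chunks` is hypothetical, and it relies on simple string replacement, whereas a real pipeline would use the character offsets that come with each chunk.

```python
def mask_chunks(text, chunks):
    """Replace each detected chunk with a <LABEL> placeholder."""
    masked = text
    # Longest chunks first, so a short chunk never clobbers part of a longer one.
    for chunk, label in sorted(chunks, key=lambda c: len(c[0]), reverse=True):
        masked = masked.replace(chunk, f"<{label}>")
    return masked


line = "Nume si Prenume : BUREAN MARIA, Varsta: 77"
chunks = [("BUREAN MARIA", "NAME"), ("77", "AGE")]
masked_line = mask_chunks(line, chunks)
print(masked_line)
```

String replacement is enough for a one-line demonstration, but it would also mask unrelated occurrences of a short chunk such as `77` elsewhere in a longer document, which is why offset-based replacement is the safer design.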

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|ner_deid_generic_bert|
|Compatibility:|Healthcare NLP 4.2.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|ro|
|Size:|16.5 MB|

## References

- Custom John Snow Labs datasets
- Data augmentation techniques

## Benchmarking

```bash
label                      precision  recall  f1-score  support
AGE                             0.95    0.97      0.96     1186
CONTACT                         0.99    0.98      0.98      366
DATE                            0.96    0.92      0.94     4518
ID                              1.00    1.00      1.00      679
LOCATION                        0.91    0.90      0.90     1683
NAME                            0.93    0.96      0.94     2916
PROFESSION                      0.87    0.85      0.86      161
micro-avg                       0.94    0.94      0.94    11509
macro-avg                       0.94    0.94      0.94    11509
weighted-avg                    0.95    0.94      0.94    11509
```
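
The `macro-avg` and `weighted-avg` rows can be related back to the per-label rows: the macro average is the unweighted mean over labels, while the weighted average weights each label by its support. The sketch below recomputes both from the seven rows above. The variable and function names are illustrative, and because the inputs are the rounded published values, the results are approximate. The `micro-avg` row cannot be recomputed this way, since it needs the raw counts of true and false positives.

```python
# (precision, recall, f1, support) per label, copied from the table above.
rows = {
    "AGE": (0.95, 0.97, 0.96, 1186),
    "CONTACT": (0.99, 0.98, 0.98, 366),
    "DATE": (0.96, 0.92, 0.94, 4518),
    "ID": (1.00, 1.00, 1.00, 679),
    "LOCATION": (0.91, 0.90, 0.90, 1683),
    "NAME": (0.93, 0.96, 0.94, 2916),
    "PROFESSION": (0.87, 0.85, 0.86, 161),
}
total_support = sum(r[3] for r in rows.values())


def macro_avg(col):
    """Unweighted mean of one metric column across labels."""
    return sum(r[col] for r in rows.values()) / len(rows)


def weighted_avg(col):
    """Mean of one metric column, weighted by each label's support."""
    return sum(r[col] * r[3] for r in rows.values()) / total_support


for name, col in (("precision", 0), ("recall", 1), ("f1-score", 2)):
    print(f"{name}: macro={macro_avg(col):.2f} weighted={weighted_avg(col):.2f}")
print("support:", total_support)
```

The gap between the two averages is small here because the low-support `PROFESSION` label is the only one scoring well below the rest.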