2022-11-22-ner_deid_generic_bert_ro (#13121)
* Add model 2022-11-22-ner_deid_generic_bert_ro
* Add model 2022-11-22-ner_clinical_bert_ro
* Add model 2022-11-22-ner_living_species_300_es

Co-authored-by: Cabir40 <cabir4006@gmail.com>
1 parent 8a3d199, commit 45b58cd
Showing 3 changed files with 526 additions and 0 deletions.
@@ -0,0 +1,187 @@
---
layout: model
title: Detect Clinical Entities in Romanian (Bert, Base, Cased)
author: John Snow Labs
name: ner_clinical_bert
date: 2022-11-22
tags: [licensed, clinical, ro, ner, bert]
task: Named Entity Recognition
language: ro
edition: Healthcare NLP 4.2.2
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Extract clinical entities from Romanian clinical texts. This model is trained using `bert_base_cased` embeddings.

## Predicted Entities

`Measurements`, `Form`, `Symptom`, `Route`, `Procedure`, `Disease_Syndrome_Disorder`, `Score`, `Drug_Ingredient`, `Pulse`, `Frequency`, `Date`, `Body_Part`, `Drug_Brand_Name`, `Time`, `Direction`, `Dosage`, `Medical_Device`, `Imaging_Technique`, `Test`, `Imaging_Findings`, `Imaging_Test`, `Test_Result`, `Weight`, `Clinical_Dept`, `Units`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_clinical_bert_ro_4.2.2_3.0_1669124033852.zip){:.button.button-orange.button-orange-trans.arr.button-icon}

## How to use

<div class="tabs-box" markdown="1">
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetectorDLModel, Tokenizer, BertEmbeddings, NerConverter
from sparknlp_jsl.annotator import MedicalNerModel

# Assumes an active Spark session named `spark` with Healthcare NLP loaded.
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = BertEmbeddings.pretrained("bert_base_cased", "ro")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

clinical_ner = MedicalNerModel.pretrained("ner_clinical_bert", "ro", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

# Merge token-level tags into entity chunks.
ner_converter = NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

nlpPipeline = Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    clinical_ner,
    ner_converter])

data = spark.createDataFrame([[""" Solicitare: Angio CT cardio-toracic Dg. de trimitere Atrezie de valva pulmonara. Hipoplazie VS. Atrezie VAV stang. Anastomoza Glenn. Sp. Tromboza la nivelul anastomozei. Trimis de: Sectia Clinica Cardiologie (dr. Sue T.) Procedura Aparat GE Revolution HD. Branula albastra montata la nivelul membrului superior drept. Scout. Se administreaza 30 ml Iomeron 350 cu flux 2.2 ml/s, urmate de 20 ml ser fiziologic cu acelasi flux. Se efectueaza o examinare angio-CT cardiotoracica cu achizitii secventiale prospective la o frecventa cardiaca medie de 100/min."""]]).toDF("text")

result = nlpPipeline.fit(data).transform(data)
```
```scala
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
import spark.implicits._
// MedicalNerModel comes from the licensed Healthcare NLP library.

val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val embeddings = BertEmbeddings.pretrained("bert_base_cased", "ro")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

val ner_model = MedicalNerModel.pretrained("ner_clinical_bert", "ro", "clinical/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner")

val ner_converter = new NerConverter()
  .setInputCols(Array("sentence", "token", "ner"))
  .setOutputCol("ner_chunk")

// An unfitted Pipeline (not a PipelineModel) is what exposes setStages and fit.
val pipeline = new Pipeline().setStages(Array(document_assembler,
  sentence_detector,
  tokenizer,
  embeddings,
  ner_model,
  ner_converter))

val data = Seq("""Solicitare: Angio CT cardio-toracic Dg. de trimitere Atrezie de valva pulmonara. Hipoplazie VS. Atrezie VAV stang. Anastomoza Glenn. Sp. Tromboza la nivelul anastomozei. Trimis de: Sectia Clinica Cardiologie (dr. Sue T.) Procedura Aparat GE Revolution HD. Branula albastra montata la nivelul membrului superior drept. Scout. Se administreaza 30 ml Iomeron 350 cu flux 2.2 ml/s, urmate de 20 ml ser fiziologic cu acelasi flux. Se efectueaza o examinare angio-CT cardiotoracica cu achizitii secventiale prospective la o frecventa cardiaca medie de 100/min.""").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
```

{:.nlu-block}
```python
import nlu
nlu.load("ro.embed.clinical.bert.base_cased").predict(""" Solicitare: Angio CT cardio-toracic Dg. de trimitere Atrezie de valva pulmonara. Hipoplazie VS. Atrezie VAV stang. Anastomoza Glenn. Sp. Tromboza la nivelul anastomozei. Trimis de: Sectia Clinica Cardiologie (dr. Sue T.) Procedura Aparat GE Revolution HD. Branula albastra montata la nivelul membrului superior drept. Scout. Se administreaza 30 ml Iomeron 350 cu flux 2.2 ml/s, urmate de 20 ml ser fiziologic cu acelasi flux. Se efectueaza o examinare angio-CT cardiotoracica cu achizitii secventiale prospective la o frecventa cardiaca medie de 100/min.""")
```
</div>

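The last pipeline stage, `NerConverter`, turns token-level NER tags into the entity chunks shown under Results. The plain-Python sketch below illustrates that general idea, assuming IOB-style tags (`B-`, `I-`, `O`). The helper `iob_to_chunks`, the tokenization, and the tags in the example are illustrative assumptions, not the library's implementation or actual model output.

```python
from typing import List, Tuple


def iob_to_chunks(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge IOB-tagged tokens into (chunk_text, label) pairs."""
    chunks: List[Tuple[str, str]] = []
    current: List[str] = []  # tokens of the chunk being built
    label = ""               # label of the chunk being built

    def flush() -> None:
        # Emit the chunk collected so far, if any.
        if current:
            chunks.append((" ".join(current), label))
            current.clear()

    for token, tag in zip(tokens, tags):
        prefix, _, tag_label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and tag_label != label):
            # A new entity starts here (or an I- tag appears without a matching B-).
            flush()
            label = tag_label
            current.append(token)
        elif prefix == "I":
            current.append(token)  # continue the current entity
        else:
            flush()                # "O": outside any entity
            label = ""
    flush()
    return chunks


# Hypothetical tokens and tags for part of the sample sentence.
tokens = ["Se", "administreaza", "30", "ml", "Iomeron", "350", "cu", "flux", "2.2", "ml/s"]
tags = ["O", "O", "B-Dosage", "I-Dosage", "B-Drug_Ingredient", "I-Drug_Ingredient",
        "O", "O", "B-Dosage", "I-Dosage"]
print(iob_to_chunks(tokens, tags))
```

Joining tokens with a single space is a simplification; a real converter would more likely rebuild chunk text from character offsets.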

## Results

```bash
+--------------------------+-------------------------+
|chunks                    |entities                 |
+--------------------------+-------------------------+
|Angio CT cardio-toracic   |Imaging_Test             |
|Atrezie                   |Disease_Syndrome_Disorder|
|valva pulmonara           |Body_Part                |
|Hipoplazie                |Disease_Syndrome_Disorder|
|VS                        |Body_Part                |
|Atrezie                   |Disease_Syndrome_Disorder|
|VAV stang                 |Body_Part                |
|Anastomoza Glenn          |Disease_Syndrome_Disorder|
|Tromboza                  |Disease_Syndrome_Disorder|
|Sectia Clinica Cardiologie|Clinical_Dept            |
|GE Revolution HD          |Medical_Device           |
|Branula albastra          |Medical_Device           |
|membrului superior drept  |Body_Part                |
|Scout                     |Body_Part                |
|30 ml                     |Dosage                   |
|Iomeron 350               |Drug_Ingredient          |
|2.2 ml/s                  |Dosage                   |
|20 ml                     |Dosage                   |
|ser fiziologic            |Drug_Ingredient          |
|angio-CT                  |Imaging_Test             |
+--------------------------+-------------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|ner_clinical_bert|
|Compatibility:|Healthcare NLP 4.2.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|ro|
|Size:|16.3 MB|

## Benchmarking

```bash
label                     precision  recall  f1-score  support
Body_Part                      0.91    0.93      0.92      679
Clinical_Dept                  0.68    0.65      0.67       97
Date                           0.99    0.99      0.99       87
Direction                      0.66    0.76      0.70       50
Disease_Syndrome_Disorder      0.73    0.76      0.74      121
Dosage                         0.78    1.00      0.87       38
Drug_Ingredient                0.90    0.94      0.92       48
Form                           1.00    1.00      1.00        6
Imaging_Findings               0.86    0.82      0.84      201
Imaging_Technique              0.92    0.92      0.92       26
Imaging_Test                   0.93    0.98      0.95      205
Measurements                   0.71    0.69      0.70      214
Medical_Device                 0.85    0.81      0.83       42
Pulse                          0.82    1.00      0.90        9
Route                          1.00    0.91      0.95       33
Score                          1.00    0.98      0.99       41
Time                           1.00    1.00      1.00       28
Units                          0.60    0.93      0.73       88
Weight                         0.82    1.00      0.90        9
micro-avg                      0.84    0.87      0.86     2037
macro-avg                      0.70    0.74      0.72     2037
weighted-avg                   0.84    0.87      0.85     2037
```

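For readers checking the table, the `f1-score` column is conventionally the harmonic mean of precision and recall. The short sketch below recomputes it for three rows; `f1_score` is a helper defined here for illustration, and the inputs are the rounded values from the table, so a recomputed value for other rows may differ by 0.01 from the reported one.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the benchmark table above.
rows = {
    "Body_Part": (0.91, 0.93),
    "Imaging_Test": (0.93, 0.98),
    "Units": (0.60, 0.93),
}

for label, (p, r) in rows.items():
    print(f"{label:<14} f1 = {f1_score(p, r):.2f}")
```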
173 changes: 173 additions & 0 deletions
docs/_posts/Cabir40/2022-11-22-ner_deid_generic_bert_ro.md
@@ -0,0 +1,173 @@
---
layout: model
title: Detect PHI for Generic Deidentification in Romanian (BERT)
author: John Snow Labs
name: ner_deid_generic_bert
date: 2022-11-22
tags: [licensed, clinical, ro, deidentification, phi, generic, bert]
task: Named Entity Recognition
language: ro
edition: Healthcare NLP 4.2.2
spark_version: 3.0
supported: true
article_header:
  type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Named Entity Recognition annotators allow a generic model to be trained using a deep learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art NER model: Chiu & Nichols, "Named Entity Recognition with Bidirectional LSTM-CNNs".

Deidentification NER (Romanian) is a Named Entity Recognition model that annotates text to find protected health information (PHI) that may need to be de-identified. It is trained with `bert_base_cased` embeddings and can detect 7 generic entities.

This NER model is trained on a combination of custom datasets, with several data augmentation mechanisms applied.

## Predicted Entities

`AGE`, `CONTACT`, `DATE`, `ID`, `LOCATION`, `NAME`, `PROFESSION`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/healthcare/DEID_PHI_TEXT_MULTI/){:.button.button-orange}
[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/DEID_PHI_TEXT_MULTI.ipynb){:.button.button-orange.button-orange-trans.co.button-icon}
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/ner_deid_generic_bert_ro_4.2.2_3.0_1669122326582.zip){:.button.button-orange.button-orange-trans.arr.button-icon}

## How to use

<div class="tabs-box" markdown="1">
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer, BertEmbeddings, NerConverter
from sparknlp_jsl.annotator import MedicalNerModel

# Assumes an active Spark session named `spark` with Healthcare NLP loaded.
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = BertEmbeddings.pretrained("bert_base_cased", "ro")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("word_embeddings")

clinical_ner = MedicalNerModel.pretrained("ner_deid_generic_bert", "ro", "clinical/models")\
    .setInputCols(["sentence", "token", "word_embeddings"])\
    .setOutputCol("ner")

# Merge token-level tags into entity chunks.
ner_converter = NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

nlpPipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner, ner_converter])

text = """
Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România
Tel: +40(235)413773
Data setului de analize: 25 May 2022 15:36:00
Nume si Prenume : BUREAN MARIA, Varsta: 77
Medic : Agota Evelyn Tımar
C.N.P : 2450502264401"""

data = spark.createDataFrame([[text]]).toDF("text")

results = nlpPipeline.fit(data).transform(data)
```
```scala
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
import spark.implicits._
// MedicalNerModel comes from the licensed Healthcare NLP library.

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val embeddings = BertEmbeddings.pretrained("bert_base_cased", "ro")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("word_embeddings")

val clinical_ner = MedicalNerModel.pretrained("ner_deid_generic_bert", "ro", "clinical/models")
  .setInputCols(Array("sentence", "token", "word_embeddings"))
  .setOutputCol("ner")

val ner_converter = new NerConverter()
  .setInputCols(Array("sentence", "token", "ner"))
  .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, embeddings, clinical_ner, ner_converter))

val text = """Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România
Tel: +40(235)413773
Data setului de analize: 25 May 2022 15:36:00
Nume si Prenume : BUREAN MARIA, Varsta: 77
Medic : Agota Evelyn Tımar
C.N.P : 2450502264401"""

val data = Seq(text).toDS.toDF("text")

val results = pipeline.fit(data).transform(data)
```
</div>

## Results

```bash
+----------------------------+---------+
|chunk                       |ner_label|
+----------------------------+---------+
|Spitalul Pentru Ochi de Deal|LOCATION |
|Drumul Oprea Nr             |LOCATION |
|972                         |LOCATION |
|Vaslui                      |LOCATION |
|737405                      |LOCATION |
|+40(235)413773              |CONTACT  |
|25 May 2022                 |DATE     |
|BUREAN MARIA                |NAME     |
|77                          |AGE      |
|Agota Evelyn Tımar          |NAME     |
|2450502264401               |ID       |
+----------------------------+---------+
```

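Once PHI chunks like those in the table are available, a downstream step can mask them in the source text. The sketch below shows one simple way to do that in plain Python. It is not the library's de-identification annotator: `locate_chunks` and `mask` are hypothetical helpers, spans use Python's exclusive end offsets, and the chunk list is copied by hand from the table rather than read from the pipeline output.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (begin, end_exclusive, label)


def locate_chunks(text: str, chunks: List[Tuple[str, str]]) -> List[Span]:
    """Find each chunk in document order and return its character span."""
    spans: List[Span] = []
    cursor = 0
    for chunk, label in chunks:
        begin = text.index(chunk, cursor)  # raises ValueError if the chunk is missing
        end = begin + len(chunk)
        spans.append((begin, end, label))
        cursor = end  # search for the next chunk only after this one
    return spans


def mask(text: str, spans: List[Span]) -> str:
    """Replace each span with <LABEL>, right to left so earlier offsets stay valid."""
    for begin, end, label in sorted(spans, reverse=True):
        text = text[:begin] + f"<{label}>" + text[end:]
    return text


text = """Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România
Tel: +40(235)413773
Data setului de analize: 25 May 2022 15:36:00
Nume si Prenume : BUREAN MARIA, Varsta: 77
Medic : Agota Evelyn Tımar
C.N.P : 2450502264401"""

# (chunk, label) pairs in document order, as listed in the Results table.
chunks = [
    ("Spitalul Pentru Ochi de Deal", "LOCATION"),
    ("Drumul Oprea Nr", "LOCATION"),
    ("972", "LOCATION"),
    ("Vaslui", "LOCATION"),
    ("737405", "LOCATION"),
    ("+40(235)413773", "CONTACT"),
    ("25 May 2022", "DATE"),
    ("BUREAN MARIA", "NAME"),
    ("77", "AGE"),
    ("Agota Evelyn Tımar", "NAME"),
    ("2450502264401", "ID"),
]

masked = mask(text, locate_chunks(text, chunks))
print(masked)
```

Working from character offsets rather than plain string replacement avoids masking the wrong occurrence of a short chunk such as `77`, which also appears inside the phone number.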

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|ner_deid_generic_bert|
|Compatibility:|Healthcare NLP 4.2.2+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence, token, embeddings]|
|Output Labels:|[ner]|
|Language:|ro|
|Size:|16.5 MB|

## References

- Custom John Snow Labs datasets
- Data augmentation techniques

## Benchmarking

```bash
label                     precision  recall  f1-score  support
AGE                            0.95    0.97      0.96     1186
CONTACT                        0.99    0.98      0.98      366
DATE                           0.96    0.92      0.94     4518
ID                             1.00    1.00      1.00      679
LOCATION                       0.91    0.90      0.90     1683
NAME                           0.93    0.96      0.94     2916
PROFESSION                     0.87    0.85      0.86      161
micro-avg                      0.94    0.94      0.94    11509
macro-avg                      0.94    0.94      0.94    11509
weighted-avg                   0.95    0.94      0.94    11509
```

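The `macro-avg` and `weighted-avg` rows can be checked against the per-label rows. Assuming the usual definitions (macro is the unweighted mean over labels, weighted is the mean weighted by support), the sketch below recomputes both from the rounded table values. The `macro` and `weighted` helpers are defined here for illustration. The micro average is left out because it needs raw true/false positive counts, which the table does not give.

```python
# (label, precision, recall, f1, support) copied from the benchmark table above.
rows = [
    ("AGE", 0.95, 0.97, 0.96, 1186),
    ("CONTACT", 0.99, 0.98, 0.98, 366),
    ("DATE", 0.96, 0.92, 0.94, 4518),
    ("ID", 1.00, 1.00, 1.00, 679),
    ("LOCATION", 0.91, 0.90, 0.90, 1683),
    ("NAME", 0.93, 0.96, 0.94, 2916),
    ("PROFESSION", 0.87, 0.85, 0.86, 161),
]

total_support = sum(row[4] for row in rows)


def macro(col: int) -> float:
    """Unweighted mean of one metric column over all labels."""
    return sum(row[col] for row in rows) / len(rows)


def weighted(col: int) -> float:
    """Mean of one metric column, weighted by each label's support."""
    return sum(row[col] * row[4] for row in rows) / total_support


print("total support:", total_support)
for name, col in (("precision", 1), ("recall", 2), ("f1-score", 3)):
    print(f"{name:<10} macro = {macro(col):.2f}  weighted = {weighted(col):.2f}")
```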