Models hub legal (#12829)
* 2022-09-19-legre_indemnifications_en (#12758)

* Add model 2022-09-19-legre_indemnifications_en

* Add model 2022-09-19-legner_bert_indemnifications_en

Co-authored-by: josejuanmartinez <jjmcarrascosa@gmail.com>

* 2022-09-20-legclf_cuad_confidentiality_clause_en (#12770)

* Add model 2022-09-20-legclf_cuad_confidentiality_clause_en

* Add model 2022-09-20-legclf_cuad_indemnifications_clause_en

* Add model 2022-09-20-legclf_cuad_licenses_clause_en

* Add model 2022-09-20-legclf_cuad_obligations_clause_en

* Add model 2022-09-20-legclf_cuad_whereas_clause_en

Co-authored-by: josejuanmartinez <jjmcarrascosa@gmail.com>

* Add model 2022-09-27-legclf_cuad_licenses_clause_en (#12827)

Co-authored-by: josejuanmartinez <jjmcarrascosa@gmail.com>

* Add model 2022-09-27-legclf_cuad_indemnifications_clause_en (#12828)

Co-authored-by: josejuanmartinez <jjmcarrascosa@gmail.com>

Co-authored-by: jsl-models <74001263+jsl-models@users.noreply.github.com>
josejuanmartinez and jsl-models authored Sep 27, 2022
1 parent 8bf66a2 commit 34e0555
Showing 2 changed files with 227 additions and 0 deletions.
@@ -0,0 +1,114 @@
---
layout: model
title: Legal Indemnifications Clause Binary Classifier
author: John Snow Labs
name: legclf_cuad_indemnifications_clause
date: 2022-09-27
tags: [cuad, indemnifications, en, licensed]
task: Text Classification
language: en
edition: Spark NLP for Legal 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `indemnifications` clause type. To use it, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only individual sentences instead of the whole text, so it is better to skip it unless you specifically want to do Binary Classification at sentence level.

If you have big legal documents and you want to look for clauses, we recommend splitting the documents first using any of the techniques covered in our Spark NLP for Legal Workshop Tokenization & Splitting Tutorial (available [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.
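
The first technique, paragraph splitting by multiline, can be sketched in plain Python. This is a minimal illustration independent of Spark NLP; `split_paragraphs` and the sample contract text are hypothetical, not part of the library:

```python
import re

def split_paragraphs(text: str) -> list:
    """Split a document into paragraphs on blank lines (multiline splitting)."""
    # Two or more consecutive newlines mark a paragraph boundary.
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]

contract = (
    "1. INDEMNIFICATION.\nThe Supplier shall indemnify the Buyer against all claims.\n\n"
    "2. CONFIDENTIALITY.\nEach party shall keep the terms of this Agreement confidential."
)
paragraphs = split_paragraphs(contract)
# -> two paragraphs, one per numbered clause
```

Each resulting paragraph can then be fed to the classifier as a separate row in the `clause_text` column.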

Take into account that the embeddings used by this model support up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (also covered in the tutorial linked above).

This model can be combined with any of the other 200+ Legal Clause Classifiers available in Models Hub, producing as output a True/False value for each of the legal clause models you add.

## Predicted Entities

`other`, `indemnifications`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
<button class="button button-orange" disabled>Open in Colab</button>
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_cuad_indemnifications_clause_en_1.0.0_3.0_1664272531526.zip){:.button.button-orange.button-orange-trans.arr.button-icon}

## How to use



<div class="tabs-box" markdown="1">
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
# DocumentAssembler, BertSentenceEmbeddings and Pipeline come from the
# open-source Spark NLP / PySpark packages; `legal` is provided by the
# licensed Spark NLP for Legal library, and `spark` is an active session.
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import BertSentenceEmbeddings

documentAssembler = DocumentAssembler() \
    .setInputCol("clause_text") \
    .setOutputCol("document")

embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

docClassifier = legal.ClassifierDLModel.pretrained("legclf_cuad_indemnifications_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

nlpPipeline = Pipeline(stages=[
    documentAssembler,
    embeddings,
    docClassifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text")
model = nlpPipeline.fit(df)
result = model.transform(df)
```

</div>

## Results

```bash
+------------------+
|            result|
+------------------+
|[indemnifications]|
|           [other]|
|           [other]|
|[indemnifications]|
+------------------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_cuad_indemnifications_clause|
|Type:|legal|
|Compatibility:|Spark NLP for Legal 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|21.9 MB|

## References

In-house annotations on CUAD dataset

## Benchmarking

```bash
                  precision    recall  f1-score   support

indemnifications       1.00      0.83      0.91        12
           other       0.83      1.00      0.91        10

        accuracy                           0.91        22
       macro avg       0.92      0.92      0.91        22
    weighted avg       0.92      0.91      0.91        22
```
@@ -0,0 +1,113 @@
---
layout: model
title: Legal Licenses Clause Binary Classifier
author: John Snow Labs
name: legclf_cuad_licenses_clause
date: 2022-09-27
tags: [cuad, licenses, en, licensed]
task: Text Classification
language: en
edition: Spark NLP for Legal 1.0.0
spark_version: 3.0
supported: true
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

This model is a Binary Classifier (True, False) for the `licenses` clause type. To use it, make sure you provide enough context as input. Adding a Sentence Splitter to the pipeline will make the model see only individual sentences instead of the whole text, so it is better to skip it unless you specifically want to do Binary Classification at sentence level.

If you have big legal documents and you want to look for clauses, we recommend splitting the documents first using any of the techniques covered in our Spark NLP for Legal Workshop Tokenization & Splitting Tutorial (available [here](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Legal/1.Tokenization_Splitting.ipynb)), namely:
- Paragraph splitting (by multiline);
- Splitting by headers / subheaders;
- etc.

Take into account that the embeddings used by this model support up to 512 tokens. If your text is longer than that, consider splitting it into smaller pieces (also covered in the tutorial linked above).
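
As a rough workaround for the 512-token limit, a long clause can be greedily packed into whitespace-token chunks. This is a minimal sketch; `chunk_by_tokens` is an illustrative helper, and BERT wordpiece counts can exceed whitespace token counts, so a safety margin below 512 is advisable:

```python
def chunk_by_tokens(text: str, max_tokens: int = 512) -> list:
    """Greedily pack whitespace tokens into chunks of at most max_tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]

long_clause = " ".join(["term"] * 1200)
chunks = chunk_by_tokens(long_clause)
# 1200 tokens -> 3 chunks (512, 512 and 176 tokens)
```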

This model can be combined with any of the other 200+ Legal Clause Classifiers available in Models Hub, producing as output a True/False value for each of the legal clause models you add.
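
The combined outputs can then be reduced to per-clause flags; a minimal post-processing sketch, where the `clause_flags` helper and the example predictions are hypothetical rather than library API:

```python
def clause_flags(model_outputs: dict) -> dict:
    """Turn each binary classifier's predicted label into a True/False flag.

    model_outputs maps a clause type to the label its classifier returned:
    either the clause name itself, or "other".
    """
    return {clause: label == clause for clause, label in model_outputs.items()}

predictions = {"licenses": "licenses", "indemnifications": "other", "whereas": "other"}
flags = clause_flags(predictions)
# -> {"licenses": True, "indemnifications": False, "whereas": False}
```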

## Predicted Entities

`other`, `licenses`

{:.btn-box}
[Live Demo](https://demo.johnsnowlabs.com/finance/CLASSIFY_LEGAL_CLAUSES/){:.button.button-orange}
<button class="button button-orange" disabled>Open in Colab</button>
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/legal/models/legclf_cuad_licenses_clause_en_1.0.0_3.0_1664272270378.zip){:.button.button-orange.button-orange-trans.arr.button-icon}

## How to use



<div class="tabs-box" markdown="1">
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
# DocumentAssembler, BertSentenceEmbeddings and Pipeline come from the
# open-source Spark NLP / PySpark packages; `legal` is provided by the
# licensed Spark NLP for Legal library, and `spark` is an active session.
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import BertSentenceEmbeddings

documentAssembler = DocumentAssembler() \
    .setInputCol("clause_text") \
    .setOutputCol("document")

embeddings = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

docClassifier = legal.ClassifierDLModel.pretrained("legclf_cuad_licenses_clause", "en", "legal/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("category")

nlpPipeline = Pipeline(stages=[
    documentAssembler,
    embeddings,
    docClassifier])

df = spark.createDataFrame([["YOUR TEXT HERE"]]).toDF("clause_text")
model = nlpPipeline.fit(df)
result = model.transform(df)
```

</div>

## Results

```bash
+----------+
|    result|
+----------+
|[licenses]|
|   [other]|
|   [other]|
|[licenses]|
+----------+
```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|legclf_cuad_licenses_clause|
|Type:|legal|
|Compatibility:|Spark NLP for Legal 1.0.0+|
|License:|Licensed|
|Edition:|Official|
|Input Labels:|[sentence_embeddings]|
|Output Labels:|[category]|
|Language:|en|
|Size:|22.0 MB|

## References

In-house annotations on CUAD dataset

## Benchmarking

```bash
              precision    recall  f1-score   support

    licenses       1.00      0.60      0.75        10
       other       0.84      1.00      0.91        21

    accuracy                           0.87        31
   macro avg       0.92      0.80      0.83        31
weighted avg       0.89      0.87      0.86        31
```
