SPARKNLP-809: Add warning to ForZeroShot annotators #13798

Merged
2 changes: 1 addition & 1 deletion docs/en/annotator_entries/DocumentAssembler.md
@@ -4,7 +4,7 @@ DocumentAssembler

{%- capture description -%}
Prepares data into a format that is processable by Spark NLP. This is the entry point for every Spark NLP pipeline.
The `DocumentAssembler` reads `String` columns. Additionally, setCleanupMode
can be used to pre-process the text (Default: `disabled`). For possible options please refer to the parameters section.

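For example, a minimal assembler with cleanup enabled might look like this (a sketch using the Python API; `shrink` is one of the available cleanup modes):

```python
from sparknlp.base import DocumentAssembler

# Read the "text" String column and emit a "document" annotation column.
# "shrink" removes new lines and tabs and merges repeated spaces.
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document") \
    .setCleanupMode("shrink")
```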
For more extended examples on document pre-processing see the
5 changes: 3 additions & 2 deletions docs/en/annotators.md
@@ -110,12 +110,11 @@ Additionally, these transformers are available.
{% include templates/anno_table_entry.md path="./transformers" name="AlbertForQuestionAnswering" summary="AlbertForQuestionAnswering can load ALBERT Models with a span classification head on top for extractive question-answering tasks like SQuAD."%}
{% include templates/anno_table_entry.md path="./transformers" name="AlbertForTokenClassification" summary="AlbertForTokenClassification can load ALBERT Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks."%}
{% include templates/anno_table_entry.md path="./transformers" name="AlbertForSequenceClassification" summary="AlbertForSequenceClassification can load ALBERT Models with sequence classification/regression head on top e.g. for multi-class document classification tasks."%}
{% include templates/anno_table_entry.md path="./transformers" name="BartTransformer" summary="BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension Transformer"%}
{% include templates/anno_table_entry.md path="./transformers" name="BertForQuestionAnswering" summary="BertForQuestionAnswering can load Bert Models with a span classification head on top for extractive question-answering tasks like SQuAD."%}
{% include templates/anno_table_entry.md path="./transformers" name="BertForSequenceClassification" summary="Bert Models with sequence classification/regression head on top."%}
{% include templates/anno_table_entry.md path="./transformers" name="BertForTokenClassification" summary="BertForTokenClassification can load Bert Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks."%}
{% include templates/anno_table_entry.md path="./transformers" name="BertForZeroShotClassification" summary="BertForZeroShotClassification using a ModelForSequenceClassification trained on NLI (natural language inference) tasks."%}
{% include templates/anno_table_entry.md path="./transformers" name="BertSentenceEmbeddings" summary="Sentence-level embeddings using BERT. BERT (Bidirectional Encoder Representations from Transformers) provides dense vector representations for natural language by using a deep, pre-trained neural network with the Transformer architecture."%}
{% include templates/anno_table_entry.md path="./transformers" name="CamemBertEmbeddings" summary="CamemBert is based on Facebook’s RoBERTa model released in 2019."%}
{% include templates/anno_table_entry.md path="./transformers" name="CamemBertForSequenceClassification" summary="CamemBertForSequenceClassification can load CamemBERT Models with sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks."%}
@@ -127,6 +126,7 @@ Additionally, these transformers are available.
{% include templates/anno_table_entry.md path="./transformers" name="DistilBertForQuestionAnswering" summary="DistilBertForQuestionAnswering can load DistilBert Models with a span classification head on top for extractive question-answering tasks like SQuAD."%}
{% include templates/anno_table_entry.md path="./transformers" name="DistilBertForSequenceClassification" summary="DistilBertForSequenceClassification can load DistilBERT Models with sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks."%}
{% include templates/anno_table_entry.md path="./transformers" name="DistilBertForTokenClassification" summary="DistilBertForTokenClassification can load DistilBERT Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks."%}
{% include templates/anno_table_entry.md path="./transformers" name="DistilBertForZeroShotClassification" summary="DistilBertForZeroShotClassification using a ModelForSequenceClassification trained on NLI (natural language inference) tasks."%}
{% include templates/anno_table_entry.md path="./transformers" name="ElmoEmbeddings" summary="Word embeddings from ELMo (Embeddings from Language Models), a language model trained on the 1 Billion Word Benchmark."%}
{% include templates/anno_table_entry.md path="./transformers" name="GPT2Transformer" summary="GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset of 8 million web pages."%}
{% include templates/anno_table_entry.md path="./transformers" name="HubertForCTC" summary="Hubert Model with a language modeling head on top for Connectionist Temporal Classification (CTC)."%}
@@ -139,6 +139,7 @@ Additionally, these transformers are available.
{% include templates/anno_table_entry.md path="./transformers" name="RoBertaForQuestionAnswering" summary="RoBertaForQuestionAnswering can load RoBERTa Models with a span classification head on top for extractive question-answering tasks like SQuAD."%}
{% include templates/anno_table_entry.md path="./transformers" name="RoBertaForSequenceClassification" summary="RoBertaForSequenceClassification can load RoBERTa Models with sequence classification/regression head on top e.g. for multi-class document classification tasks."%}
{% include templates/anno_table_entry.md path="./transformers" name="RoBertaForTokenClassification" summary="RoBertaForTokenClassification can load RoBERTa Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks."%}
{% include templates/anno_table_entry.md path="./transformers" name="RoBertaForZeroShotClassification" summary="RoBertaForZeroShotClassification using a ModelForSequenceClassification trained on NLI (natural language inference) tasks."%}
{% include templates/anno_table_entry.md path="./transformers" name="RoBertaSentenceEmbeddings" summary="Sentence-level embeddings using RoBERTa."%}
{% include templates/anno_table_entry.md path="./transformers" name="SpanBertCoref" summary="A coreference resolution model based on SpanBert."%}
{% include templates/anno_table_entry.md path="./transformers" name="SwinForImageClassification" summary="SwinImageClassification is an image classifier based on Swin."%}
3 changes: 3 additions & 0 deletions docs/en/transformer_entries/BertForZeroShotClassification.md
@@ -8,6 +8,9 @@ language inference) tasks. Equivalent of `BertForSequenceClassification` models,
but these models don't require a hardcoded number of potential classes; they can be chosen at
runtime. This usually means it is slower, but much more flexible.

Note that the model will loop through all provided labels. So the more labels you have, the
longer this process will take.

Any combination of sequences and labels can be passed and each combination will be posed as a
premise/hypothesis pair and passed to the pretrained model.

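For example, the candidate labels can be set at runtime; a minimal sketch, assuming the `setCandidateLabels` setter exposed by Spark NLP's zero-shot annotators:

```python
from sparknlp.annotator import BertForZeroShotClassification

# Candidate labels are chosen at runtime; each one is scored as an NLI
# hypothesis against every input sequence.
sequenceClassifier = BertForZeroShotClassification.pretrained() \
    .setInputCols(["token", "document"]) \
    .setOutputCol("label") \
    .setCandidateLabels(["urgent", "mobile", "travel", "movie", "music", "sport"])
```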
140 changes: 140 additions & 0 deletions docs/en/transformer_entries/DistilBertForZeroShotClassification.md
@@ -0,0 +1,140 @@
{%- capture title -%}
DistilBertForZeroShotClassification
{%- endcapture -%}

{%- capture description -%}
DistilBertForZeroShotClassification using a `ModelForSequenceClassification` trained on NLI
(natural language inference) tasks. Equivalent of `DistilBertForSequenceClassification`
models, but these models don't require a hardcoded number of potential classes; they can be
chosen at runtime. This usually means it is slower, but much more flexible.

Note that the model will loop through all provided labels. So the more labels you have, the
longer this process will take.

Any combination of sequences and labels can be passed and each combination will be posed as a
premise/hypothesis pair and passed to the pretrained model.

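Concretely, N sequences and M labels produce N x M premise/hypothesis pairs, which is why the
number of labels drives runtime. A plain-Python illustration of the pairing (the exact
hypothesis template is internal to the model; a common NLI phrasing is assumed here):

```python
sequences = ["I loved this movie when I was a child.", "It was pretty boring."]
labels = ["positive", "negative"]

# Every sequence/label combination becomes one premise/hypothesis pair,
# so 2 sequences x 2 labels = 4 forward passes through the NLI model.
pairs = [(seq, f"This example is {label}.") for seq in sequences for label in labels]
```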
Pretrained models can be loaded with `pretrained` of the companion object:

```scala
val sequenceClassifier = DistilBertForZeroShotClassification.pretrained()
.setInputCols("token", "document")
.setOutputCol("label")
```

The default model is `"distilbert_base_zero_shot_classifier_uncased_mnli"`, if no name is
provided.

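A specific checkpoint and language can also be requested by name; a sketch, assuming the usual
two-argument `pretrained(name, lang)` overload:

```python
from sparknlp.annotator import DistilBertForZeroShotClassification

# Load the default English checkpoint explicitly by name.
sequenceClassifier = DistilBertForZeroShotClassification.pretrained(
        "distilbert_base_zero_shot_classifier_uncased_mnli", "en") \
    .setInputCols(["token", "document"]) \
    .setOutputCol("label")
```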
For available pretrained models please see the
[Models Hub](https://sparknlp.org/models?task=Text+Classification).

To see which models are compatible and how to import them see
https://github.com/JohnSnowLabs/spark-nlp/discussions/5669.
{%- endcapture -%}

{%- capture input_anno -%}
DOCUMENT, TOKEN
{%- endcapture -%}

{%- capture output_anno -%}
CATEGORY
{%- endcapture -%}

{%- capture python_example -%}
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")

tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")

sequenceClassifier = DistilBertForZeroShotClassification.pretrained() \
.setInputCols(["token", "document"]) \
.setOutputCol("label") \
.setCaseSensitive(True)

pipeline = Pipeline().setStages([
documentAssembler,
tokenizer,
sequenceClassifier
])

data = spark.createDataFrame([["I loved this movie when I was a child."], ["It was pretty boring."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("label.result").show(truncate=False)
+------+
|result|
+------+
|[pos] |
|[neg] |
+------+
{%- endcapture -%}

{%- capture scala_example -%}
import spark.implicits._
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")

val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")

val sequenceClassifier = DistilBertForZeroShotClassification.pretrained()
.setInputCols("token", "document")
.setOutputCol("label")
.setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(
documentAssembler,
tokenizer,
sequenceClassifier
))

val data = Seq("I loved this movie when I was a child.", "It was pretty boring.").toDF("text")
val result = pipeline.fit(data).transform(data)

result.select("label.result").show(false)
+------+
|result|
+------+
|[pos] |
|[neg] |
+------+

{%- endcapture -%}

{%- capture api_link -%}
[DistilBertForZeroShotClassification](/api/com/johnsnowlabs/nlp/annotators/classifier/dl/DistilBertForZeroShotClassification)
{%- endcapture -%}

{%- capture python_api_link -%}
[DistilBertForZeroShotClassification](/api/python/reference/autosummary/sparknlp/annotator/classifier_dl/distil_bert_for_zero_shot_classification/index.html#sparknlp.annotator.classifier_dl.distil_bert_for_zero_shot_classification.DistilBertForZeroShotClassification)

{%- endcapture -%}

{%- capture source_link -%}
[DistilBertForZeroShotClassification](https://github.com/JohnSnowLabs/spark-nlp/tree/master/src/main/scala/com/johnsnowlabs/nlp/annotators/classifier/dl/DistilBertForZeroShotClassification.scala)
{%- endcapture -%}

{% include templates/anno_template.md
title=title
description=description
input_anno=input_anno
output_anno=output_anno
python_example=python_example
scala_example=scala_example
api_link=api_link
python_api_link=python_api_link
source_link=source_link
%}
139 changes: 139 additions & 0 deletions docs/en/transformer_entries/RoBertaForZeroShotClassification.md
@@ -0,0 +1,139 @@
{%- capture title -%}
RoBertaForZeroShotClassification
{%- endcapture -%}

{%- capture description -%}
RoBertaForZeroShotClassification using a `ModelForSequenceClassification` trained on NLI
(natural language inference) tasks. Equivalent of `RoBertaForSequenceClassification` models,
but these models don't require a hardcoded number of potential classes; they can be chosen at
runtime. This usually means it is slower, but much more flexible.

Note that the model will loop through all provided labels. So the more labels you have, the
longer this process will take.

Any combination of sequences and labels can be passed and each combination will be posed as a
premise/hypothesis pair and passed to the pretrained model.

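Each pair is scored by the NLI model and the winning label is returned as a CATEGORY
annotation. After transforming a DataFrame (as in the examples below), predictions and their
metadata can be inspected; a sketch, assuming the standard Spark NLP annotation schema:

```python
from pyspark.sql import functions as F

# Explode the annotation array and inspect each prediction's result and
# metadata (assumed to carry the per-label scores).
result.select(F.explode("label").alias("ann")) \
    .select("ann.result", "ann.metadata") \
    .show(truncate=False)
```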
Pretrained models can be loaded with `pretrained` of the companion object:

```scala
val sequenceClassifier = RoBertaForZeroShotClassification.pretrained()
.setInputCols("token", "document")
.setOutputCol("label")
```

The default model is `"roberta_base_zero_shot_classifier_nli"`, if no name is provided.

For available pretrained models please see the
[Models Hub](https://sparknlp.org/models?task=Text+Classification).

To see which models are compatible and how to import them see
https://github.com/JohnSnowLabs/spark-nlp/discussions/5669.

{%- endcapture -%}

{%- capture input_anno -%}
DOCUMENT, TOKEN
{%- endcapture -%}

{%- capture output_anno -%}
CATEGORY
{%- endcapture -%}

{%- capture python_example -%}
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")

tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")

sequenceClassifier = RoBertaForZeroShotClassification.pretrained() \
.setInputCols(["token", "document"]) \
.setOutputCol("label") \
.setCaseSensitive(True)

pipeline = Pipeline().setStages([
documentAssembler,
tokenizer,
sequenceClassifier
])

data = spark.createDataFrame([["I loved this movie when I was a child."], ["It was pretty boring."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("label.result").show(truncate=False)
+------+
|result|
+------+
|[pos] |
|[neg] |
+------+
{%- endcapture -%}

{%- capture scala_example -%}
import spark.implicits._
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")

val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")

val sequenceClassifier = RoBertaForZeroShotClassification.pretrained()
.setInputCols("token", "document")
.setOutputCol("label")
.setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(
documentAssembler,
tokenizer,
sequenceClassifier
))

val data = Seq("I loved this movie when I was a child.", "It was pretty boring.").toDF("text")
val result = pipeline.fit(data).transform(data)

result.select("label.result").show(false)
+------+
|result|
+------+
|[pos] |
|[neg] |
+------+

{%- endcapture -%}

{%- capture api_link -%}
[RoBertaForZeroShotClassification](/api/com/johnsnowlabs/nlp/annotators/classifier/dl/RoBertaForZeroShotClassification)
{%- endcapture -%}

{%- capture python_api_link -%}
[RoBertaForZeroShotClassification](/api/python/reference/autosummary/sparknlp/annotator/classifier_dl/roberta_for_zero_shot_classification/index.html#sparknlp.annotator.classifier_dl.roberta_for_zero_shot_classification.RoBertaForZeroShotClassification)
{%- endcapture -%}

{%- capture source_link -%}
[RoBertaForZeroShotClassification](https://github.com/JohnSnowLabs/spark-nlp/tree/master/src/main/scala/com/johnsnowlabs/nlp/annotators/classifier/dl/RoBertaForZeroShotClassification.scala)
{%- endcapture -%}

{% include templates/anno_template.md
title=title
description=description
input_anno=input_anno
output_anno=output_anno
python_example=python_example
scala_example=scala_example
api_link=api_link
python_api_link=python_api_link
source_link=source_link
%}
@@ -28,6 +28,9 @@ class BertForZeroShotClassification(AnnotatorModel,
number of potential classes; they can be chosen at runtime. This usually means it is slower,
but much more flexible.

Note that the model will loop through all provided labels. So the more labels you have, the
longer this process will take.

Any combination of sequences and labels can be passed and each combination will be posed as a premise/hypothesis
pair and passed to the pretrained model.

@@ -28,6 +28,9 @@ class DistilBertForZeroShotClassification(AnnotatorModel,
number of potential classes; they can be chosen at runtime. This usually means it is slower,
but much more flexible.

Note that the model will loop through all provided labels. So the more labels you have, the
longer this process will take.

Any combination of sequences and labels can be passed and each combination will be posed as a premise/hypothesis
pair and passed to the pretrained model.

@@ -27,6 +27,9 @@ class RoBertaForZeroShotClassification(AnnotatorModel,
number of potential classes; they can be chosen at runtime. This usually means it is slower,
but much more flexible.

Note that the model will loop through all provided labels. So the more labels you have, the
longer this process will take.

Any combination of sequences and labels can be passed and each combination will be posed as a premise/hypothesis
pair and passed to the pretrained model.

8 changes: 4 additions & 4 deletions python/sparknlp/base/document_assembler.py
@@ -24,10 +24,10 @@ class DocumentAssembler(AnnotatorTransformer):
"""Prepares data into a format that is processable by Spark NLP.

This is the entry point for every Spark NLP pipeline. The
`DocumentAssembler` reads ``String`` columns. Additionally,
:meth:`.setCleanupMode` can be used to pre-process the
text (Default: ``disabled``). For possible options please refer to the
parameters section.

For more extended examples on document pre-processing see the
`Examples <https://github.com/JohnSnowLabs/spark-nlp/blob/master/examples/python/annotation/text/english/document-assembler/Loading_Documents_With_DocumentAssembler.ipynb>`__.
6 changes: 3 additions & 3 deletions src/main/scala/com/johnsnowlabs/nlp/DocumentAssembler.scala
@@ -25,9 +25,9 @@ import org.apache.spark.sql.types._
import org.apache.spark.sql.{DataFrame, Dataset, Row}

/** Prepares data into a format that is processable by Spark NLP. This is the entry point for
* every Spark NLP pipeline. The `DocumentAssembler` reads `String` columns. Additionally,
* [[setCleanupMode]] can be used to pre-process the text (Default: `disabled`). For possible
* options please refer to the parameters section.
*
* For more extended examples on document pre-processing see the
* [[https://github.com/JohnSnowLabs/spark-nlp/blob/master/examples/python/annotation/text/english/document-assembler/Loading_Documents_With_DocumentAssembler.ipynb Examples]].