SPARKNLP-809: Add warning to ForZeroShot annotators #13798

Merged
2 changes: 1 addition & 1 deletion docs/en/annotator_entries/DocumentAssembler.md
@@ -4,7 +4,7 @@ DocumentAssembler

{%- capture description -%}
Prepares data into a format that is processable by Spark NLP. This is the entry point for every Spark NLP pipeline.
The `DocumentAssembler` reads `String` columns. Additionally, setCleanupMode
can be used to pre-process the text (Default: `disabled`). For possible options please refer to the parameters section.

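For example, a minimal assembler with cleanup enabled might look like this (a sketch using the Python API; `shrink` is one of the available cleanup modes):

```python
from sparknlp.base import DocumentAssembler

# Read the "text" String column and emit a "document" annotation column.
# "shrink" removes new lines and tabs and merges repeated spaces.
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document") \
    .setCleanupMode("shrink")
```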
For more extended examples on document pre-processing see the
5 changes: 3 additions & 2 deletions docs/en/annotators.md
@@ -110,12 +110,11 @@ Additionally, these transformers are available.
{% include templates/anno_table_entry.md path="./transformers" name="AlbertForQuestionAnswering" summary="AlbertForQuestionAnswering can load ALBERT Models with a span classification head on top for extractive question-answering tasks like SQuAD."%}
{% include templates/anno_table_entry.md path="./transformers" name="AlbertForTokenClassification" summary="AlbertForTokenClassification can load ALBERT Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks."%}
{% include templates/anno_table_entry.md path="./transformers" name="AlbertForSequenceClassification" summary="AlbertForSequenceClassification can load ALBERT Models with sequence classification/regression head on top e.g. for multi-class document classification tasks."%}
{% include templates/anno_table_entry.md path="./transformers" name="BartTransformer" summary="BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension Transformer"%}
{% include templates/anno_table_entry.md path="./transformers" name="BertForQuestionAnswering" summary="BertForQuestionAnswering can load Bert Models with a span classification head on top for extractive question-answering tasks like SQuAD."%}
{% include templates/anno_table_entry.md path="./transformers" name="BertForSequenceClassification" summary="Bert Models with sequence classification/regression head on top."%}
{% include templates/anno_table_entry.md path="./transformers" name="BertForTokenClassification" summary="BertForTokenClassification can load Bert Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks."%}
{% include templates/anno_table_entry.md path="./transformers" name="BertForZeroShotClassification" summary="BertForZeroShotClassification using a ModelForSequenceClassification trained on NLI (natural language inference) tasks."%}
{% include templates/anno_table_entry.md path="./transformers" name="BertSentenceEmbeddings" summary="Sentence-level embeddings using BERT. BERT (Bidirectional Encoder Representations from Transformers) provides dense vector representations for natural language by using a deep, pre-trained neural network with the Transformer architecture."%}
{% include templates/anno_table_entry.md path="./transformers" name="CamemBertEmbeddings" summary="CamemBert is based on Facebook’s RoBERTa model released in 2019."%}
{% include templates/anno_table_entry.md path="./transformers" name="CamemBertForSequenceClassification" summary="CamemBertForSequenceClassification can load CamemBERT Models with sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks."%}
@@ -127,6 +126,7 @@ Additionally, these transformers are available.
{% include templates/anno_table_entry.md path="./transformers" name="DistilBertForQuestionAnswering" summary="DistilBertForQuestionAnswering can load DistilBert Models with a span classification head on top for extractive question-answering tasks like SQuAD."%}
{% include templates/anno_table_entry.md path="./transformers" name="DistilBertForSequenceClassification" summary="DistilBertForSequenceClassification can load DistilBERT Models with sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks."%}
{% include templates/anno_table_entry.md path="./transformers" name="DistilBertForTokenClassification" summary="DistilBertForTokenClassification can load DistilBERT Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks."%}
{% include templates/anno_table_entry.md path="./transformers" name="DistilBertForZeroShotClassification" summary="DistilBertForZeroShotClassification using a ModelForSequenceClassification trained on NLI (natural language inference) tasks."%}
{% include templates/anno_table_entry.md path="./transformers" name="ElmoEmbeddings" summary="Word embeddings from ELMo (Embeddings from Language Models), a language model trained on the 1 Billion Word Benchmark."%}
{% include templates/anno_table_entry.md path="./transformers" name="GPT2Transformer" summary="GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset of 8 million web pages."%}
{% include templates/anno_table_entry.md path="./transformers" name="HubertForCTC" summary="Hubert Model with a language modeling head on top for Connectionist Temporal Classification (CTC)."%}
@@ -139,6 +139,7 @@ Additionally, these transformers are available.
{% include templates/anno_table_entry.md path="./transformers" name="RoBertaForQuestionAnswering" summary="RoBertaForQuestionAnswering can load RoBERTa Models with a span classification head on top for extractive question-answering tasks like SQuAD."%}
{% include templates/anno_table_entry.md path="./transformers" name="RoBertaForSequenceClassification" summary="RoBertaForSequenceClassification can load RoBERTa Models with sequence classification/regression head on top e.g. for multi-class document classification tasks."%}
{% include templates/anno_table_entry.md path="./transformers" name="RoBertaForTokenClassification" summary="RoBertaForTokenClassification can load RoBERTa Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks."%}
{% include templates/anno_table_entry.md path="./transformers" name="RoBertaForZeroShotClassification" summary="RoBertaForZeroShotClassification using a ModelForSequenceClassification trained on NLI (natural language inference) tasks."%}
{% include templates/anno_table_entry.md path="./transformers" name="RoBertaSentenceEmbeddings" summary="Sentence-level embeddings using RoBERTa."%}
{% include templates/anno_table_entry.md path="./transformers" name="SpanBertCoref" summary="A coreference resolution model based on SpanBert."%}
{% include templates/anno_table_entry.md path="./transformers" name="SwinForImageClassification" summary="SwinImageClassification is an image classifier based on Swin."%}
3 changes: 3 additions & 0 deletions docs/en/transformer_entries/BertForZeroShotClassification.md
@@ -8,6 +8,9 @@ language inference) tasks. Equivalent of `BertForSequenceClassification` models,
but these models don't require a hardcoded number of potential classes; they can be chosen at
runtime. This usually means it is slower, but much more flexible.

Note that the model will loop through all provided labels. So the more labels you have, the
longer this process will take.

Any combination of sequences and labels can be passed and each combination will be posed as a
premise/hypothesis pair and passed to the pretrained model.

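For example, the candidate labels can be set at runtime; a minimal sketch, assuming the `setCandidateLabels` setter exposed by Spark NLP's zero-shot annotators:

```python
from sparknlp.annotator import BertForZeroShotClassification

# Candidate labels are chosen at runtime; each one is scored as an NLI
# hypothesis against every input sequence.
sequenceClassifier = BertForZeroShotClassification.pretrained() \
    .setInputCols(["token", "document"]) \
    .setOutputCol("label") \
    .setCandidateLabels(["urgent", "mobile", "travel", "movie", "music", "sport"])
```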
140 changes: 140 additions & 0 deletions docs/en/transformer_entries/DistilBertForZeroShotClassification.md
@@ -0,0 +1,140 @@
{%- capture title -%}
DistilBertForZeroShotClassification
{%- endcapture -%}

{%- capture description -%}
DistilBertForZeroShotClassification using a `ModelForSequenceClassification` trained on NLI
(natural language inference) tasks. Equivalent of `DistilBertForSequenceClassification`
models, but these models don't require a hardcoded number of potential classes; they can be
chosen at runtime. This usually means it is slower, but much more flexible.

Note that the model will loop through all provided labels. So the more labels you have, the
longer this process will take.

Any combination of sequences and labels can be passed and each combination will be posed as a
premise/hypothesis pair and passed to the pretrained model.

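Concretely, N sequences and M labels produce N x M premise/hypothesis pairs, which is why the
number of labels drives runtime. A plain-Python illustration of the pairing (the exact
hypothesis template is internal to the model; a common NLI phrasing is assumed here):

```python
sequences = ["I loved this movie when I was a child.", "It was pretty boring."]
labels = ["positive", "negative"]

# Every sequence/label combination becomes one premise/hypothesis pair,
# so 2 sequences x 2 labels = 4 forward passes through the NLI model.
pairs = [(seq, f"This example is {label}.") for seq in sequences for label in labels]
```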
Pretrained models can be loaded with `pretrained` of the companion object:

```scala
val sequenceClassifier = DistilBertForZeroShotClassification.pretrained()
.setInputCols("token", "document")
.setOutputCol("label")
```

The default model is `"distilbert_base_zero_shot_classifier_uncased_mnli"`, if no name is
provided.

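A specific checkpoint and language can also be requested by name; a sketch, assuming the usual
two-argument `pretrained(name, lang)` overload:

```python
from sparknlp.annotator import DistilBertForZeroShotClassification

# Load the default English checkpoint explicitly by name.
sequenceClassifier = DistilBertForZeroShotClassification.pretrained(
        "distilbert_base_zero_shot_classifier_uncased_mnli", "en") \
    .setInputCols(["token", "document"]) \
    .setOutputCol("label")
```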
For available pretrained models please see the
[Models Hub](https://sparknlp.org/models?task=Text+Classification).

To see which models are compatible and how to import them see
https://github.com/JohnSnowLabs/spark-nlp/discussions/5669.
{%- endcapture -%}

{%- capture input_anno -%}
DOCUMENT, TOKEN
{%- endcapture -%}

{%- capture output_anno -%}
CATEGORY
{%- endcapture -%}

{%- capture python_example -%}
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")

tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")

sequenceClassifier = DistilBertForZeroShotClassification.pretrained() \
.setInputCols(["token", "document"]) \
.setOutputCol("label") \
.setCaseSensitive(True)

pipeline = Pipeline().setStages([
documentAssembler,
tokenizer,
sequenceClassifier
])

data = spark.createDataFrame([["I loved this movie when I was a child."], ["It was pretty boring."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("label.result").show(truncate=False)
+------+
|result|
+------+
|[pos] |
|[neg] |
+------+
{%- endcapture -%}

{%- capture scala_example -%}
import spark.implicits._
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")

val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")

val sequenceClassifier = DistilBertForZeroShotClassification.pretrained()
.setInputCols("token", "document")
.setOutputCol("label")
.setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(
documentAssembler,
tokenizer,
sequenceClassifier
))

val data = Seq("I loved this movie when I was a child.", "It was pretty boring.").toDF("text")
val result = pipeline.fit(data).transform(data)

result.select("label.result").show(false)
+------+
|result|
+------+
|[pos] |
|[neg] |
+------+

{%- endcapture -%}

{%- capture api_link -%}
[DistilBertForZeroShotClassification](/api/com/johnsnowlabs/nlp/annotators/classifier/dl/DistilBertForZeroShotClassification)
{%- endcapture -%}

{%- capture python_api_link -%}
[DistilBertForZeroShotClassification](/api/python/reference/autosummary/sparknlp/annotator/classifier_dl/distil_bert_for_zero_shot_classification/index.html#sparknlp.annotator.classifier_dl.distil_bert_for_zero_shot_classification.DistilBertForZeroShotClassification)

{%- endcapture -%}

{%- capture source_link -%}
[DistilBertForZeroShotClassification](https://github.com/JohnSnowLabs/spark-nlp/tree/master/src/main/scala/com/johnsnowlabs/nlp/annotators/classifier/dl/DistilBertForZeroShotClassification.scala)
{%- endcapture -%}

{% include templates/anno_template.md
title=title
description=description
input_anno=input_anno
output_anno=output_anno
python_example=python_example
scala_example=scala_example
api_link=api_link
python_api_link=python_api_link
source_link=source_link
%}
139 changes: 139 additions & 0 deletions docs/en/transformer_entries/RoBertaForZeroShotClassification.md
@@ -0,0 +1,139 @@
{%- capture title -%}
RoBertaForZeroShotClassification
{%- endcapture -%}

{%- capture description -%}
RoBertaForZeroShotClassification using a `ModelForSequenceClassification` trained on NLI
(natural language inference) tasks. Equivalent of `RoBertaForSequenceClassification` models,
but these models don't require a hardcoded number of potential classes; they can be chosen at
runtime. This usually means it is slower, but much more flexible.

Note that the model will loop through all provided labels. So the more labels you have, the
longer this process will take.

Any combination of sequences and labels can be passed and each combination will be posed as a
premise/hypothesis pair and passed to the pretrained model.

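Each pair is scored by the NLI model and the winning label is returned as a CATEGORY
annotation. After transforming a DataFrame (as in the examples below), predictions and their
metadata can be inspected; a sketch, assuming the standard Spark NLP annotation schema:

```python
from pyspark.sql import functions as F

# Explode the annotation array and inspect each prediction's result and
# metadata (assumed to carry the per-label scores).
result.select(F.explode("label").alias("ann")) \
    .select("ann.result", "ann.metadata") \
    .show(truncate=False)
```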
Pretrained models can be loaded with `pretrained` of the companion object:

```scala
val sequenceClassifier = RoBertaForZeroShotClassification.pretrained()
.setInputCols("token", "document")
.setOutputCol("label")
```

The default model is `"roberta_base_zero_shot_classifier_nli"`, if no name is provided.

For available pretrained models please see the
[Models Hub](https://sparknlp.org/models?task=Text+Classification).

To see which models are compatible and how to import them see
https://github.com/JohnSnowLabs/spark-nlp/discussions/5669.

{%- endcapture -%}

{%- capture input_anno -%}
DOCUMENT, TOKEN
{%- endcapture -%}

{%- capture output_anno -%}
CATEGORY
{%- endcapture -%}

{%- capture python_example -%}
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")

tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")

sequenceClassifier = RoBertaForZeroShotClassification.pretrained() \
.setInputCols(["token", "document"]) \
.setOutputCol("label") \
.setCaseSensitive(True)

pipeline = Pipeline().setStages([
documentAssembler,
tokenizer,
sequenceClassifier
])

data = spark.createDataFrame([["I loved this movie when I was a child."], ["It was pretty boring."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("label.result").show(truncate=False)
+------+
|result|
+------+
|[pos] |
|[neg] |
+------+
{%- endcapture -%}

{%- capture scala_example -%}
import spark.implicits._
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")

val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")

val sequenceClassifier = RoBertaForZeroShotClassification.pretrained()
.setInputCols("token", "document")
.setOutputCol("label")
.setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(
documentAssembler,
tokenizer,
sequenceClassifier
))

val data = Seq("I loved this movie when I was a child.", "It was pretty boring.").toDF("text")
val result = pipeline.fit(data).transform(data)

result.select("label.result").show(false)
+------+
|result|
+------+
|[pos] |
|[neg] |
+------+

{%- endcapture -%}

{%- capture api_link -%}
[RoBertaForZeroShotClassification](/api/com/johnsnowlabs/nlp/annotators/classifier/dl/RoBertaForZeroShotClassification)
{%- endcapture -%}

{%- capture python_api_link -%}
[RoBertaForZeroShotClassification](/api/python/reference/autosummary/sparknlp/annotator/classifier_dl/roberta_for_zero_shot_classification/index.html#sparknlp.annotator.classifier_dl.roberta_for_zero_shot_classification.RoBertaForZeroShotClassification)
{%- endcapture -%}

{%- capture source_link -%}
[RoBertaForZeroShotClassification](https://github.com/JohnSnowLabs/spark-nlp/tree/master/src/main/scala/com/johnsnowlabs/nlp/annotators/classifier/dl/RoBertaForZeroShotClassification.scala)
{%- endcapture -%}

{% include templates/anno_template.md
title=title
description=description
input_anno=input_anno
output_anno=output_anno
python_example=python_example
scala_example=scala_example
api_link=api_link
python_api_link=python_api_link
source_link=source_link
%}
@@ -28,6 +28,9 @@ class BertForZeroShotClassification(AnnotatorModel,
number of potential classes; they can be chosen at runtime. This usually means it is slower,
but much more flexible.

Note that the model will loop through all provided labels. So the more labels you have, the
longer this process will take.

Any combination of sequences and labels can be passed and each combination will be posed as a premise/hypothesis
pair and passed to the pretrained model.

@@ -28,6 +28,9 @@ class DistilBertForZeroShotClassification(AnnotatorModel,
number of potential classes; they can be chosen at runtime. This usually means it is slower,
but much more flexible.

Note that the model will loop through all provided labels. So the more labels you have, the
longer this process will take.

Any combination of sequences and labels can be passed and each combination will be posed as a premise/hypothesis
pair and passed to the pretrained model.

@@ -27,6 +27,9 @@ class RoBertaForZeroShotClassification(AnnotatorModel,
number of potential classes; they can be chosen at runtime. This usually means it is slower,
but much more flexible.

Note that the model will loop through all provided labels. So the more labels you have, the
longer this process will take.

Any combination of sequences and labels can be passed and each combination will be posed as a premise/hypothesis
pair and passed to the pretrained model.

8 changes: 4 additions & 4 deletions python/sparknlp/base/document_assembler.py
@@ -24,10 +24,10 @@ class DocumentAssembler(AnnotatorTransformer):
"""Prepares data into a format that is processable by Spark NLP.

This is the entry point for every Spark NLP pipeline. The
`DocumentAssembler` reads ``String`` columns. Additionally,
:meth:`.setCleanupMode` can be used to pre-process the
text (Default: ``disabled``). For possible options please refer to the
parameters section.

For more extended examples on document pre-processing see the
`Examples <https://github.com/JohnSnowLabs/spark-nlp/blob/master/examples/python/annotation/text/english/document-assembler/Loading_Documents_With_DocumentAssembler.ipynb>`__.
6 changes: 3 additions & 3 deletions src/main/scala/com/johnsnowlabs/nlp/DocumentAssembler.scala
@@ -25,9 +25,9 @@ import org.apache.spark.sql.types._
import org.apache.spark.sql.{DataFrame, Dataset, Row}

/** Prepares data into a format that is processable by Spark NLP. This is the entry point for
* every Spark NLP pipeline. The `DocumentAssembler` reads `String` columns. Additionally,
* [[setCleanupMode]] can be used to pre-process the text (Default: `disabled`). For possible
* options please refer to the parameters section.
*
* For more extended examples on document pre-processing see the
* [[https://github.com/JohnSnowLabs/spark-nlp/blob/master/examples/python/annotation/text/english/document-assembler/Loading_Documents_With_DocumentAssembler.ipynb Examples]].