From 0cc970a962f84b3222710467e7171c5e690a0e02 Mon Sep 17 00:00:00 2001 From: ahmedlone127 Date: Sun, 14 Jul 2024 19:46:34 +0500 Subject: [PATCH 1/7] Fixing default names for Phi2 and MistralAI (#14338) * Fixing default names for Phi2 and MistralAI * Phi2 is 2.7B in size --------- Co-authored-by: Maziyar Panahi --- .../nlp/annotators/seq2seq/MistralTransformer.html | 6 +++--- .../nlp/annotators/seq2seq/Phi2Transformer.html | 2 +- .../johnsnowlabs/nlp/annotators/seq2seq/index.html | 8 ++++---- .../annotator/seq2seq/mistral_transformer.html | 10 +++++----- .../sparknlp/annotator/seq2seq/phi2_transformer.html | 6 +++--- .../annotator/seq2seq/mistral_transformer/index.html | 6 +++--- .../annotator/seq2seq/phi2_transformer/index.html | 6 +++--- .../annotator/seq2seq/mistral_transformer.py | 10 +++++----- .../sparknlp/annotator/seq2seq/phi2_transformer.py | 6 +++--- .../nlp/annotators/seq2seq/MistralTransformer.scala | 8 ++++---- .../nlp/annotators/seq2seq/Phi2Transformer.scala | 12 ++++++------ .../nlp/annotators/seq2seq/MistralTestSpec.scala | 2 +- 12 files changed, 41 insertions(+), 41 deletions(-) diff --git a/docs/api/com/johnsnowlabs/nlp/annotators/seq2seq/MistralTransformer.html b/docs/api/com/johnsnowlabs/nlp/annotators/seq2seq/MistralTransformer.html index 049e910424d89a..6b77bb475b6259 100644 --- a/docs/api/com/johnsnowlabs/nlp/annotators/seq2seq/MistralTransformer.html +++ b/docs/api/com/johnsnowlabs/nlp/annotators/seq2seq/MistralTransformer.html @@ -286,9 +286,9 @@

process extensive textual input, expanding its utility in handling more complex tasks.

In summary, Mistral 7B represents a notable advancement in language models, offering a reliable and versatile solution for various natural language processing challenges.

Pretrained models can be loaded with pretrained of the companion object:

val mistral = MistralTransformer.pretrained()
   .setInputCols("document")
-  .setOutputCol("generation")
-
-The default model is "mistral-7b", if no name is provided. For available pretrained models
+  .setOutputCol("generation")
+
+The default model is "mistral_7b", if no name is provided. For available pretrained models
please see the Models Hub.

For extended examples of usage, see -MistralTestSpec.

References:

Paper Abstract:

We introduce Mistral 7B v0.1, a 7-billion-parameter language model engineered for superior +MistralTestSpec.

References:

Paper Abstract:

We introduce Mistral 7B v0.1, a 7-billion-parameter language model engineered for superior performance and efficiency. Mistral 7B outperforms Llama 2 13B across all evaluated benchmarks, and Llama 1 34B in reasoning, mathematics, and code generation. Our model leverages grouped-query attention (GQA) for faster inference, coupled with sliding window @@ -305,7 +305,7 @@

.setInputCol("text") .setOutputCol("documents") -val mistral = MistralTransformer.pretrained("mistral-7b") +val mistral = MistralTransformer.pretrained("mistral_7b") .setInputCols(Array("documents")) .setMinOutputLength(10) .setMaxOutputLength(50) diff --git a/docs/api/com/johnsnowlabs/nlp/annotators/seq2seq/Phi2Transformer.html b/docs/api/com/johnsnowlabs/nlp/annotators/seq2seq/Phi2Transformer.html index 66127d9bf656bb..4f5409c45443cb 100644 --- a/docs/api/com/johnsnowlabs/nlp/annotators/seq2seq/Phi2Transformer.html +++ b/docs/api/com/johnsnowlabs/nlp/annotators/seq2seq/Phi2Transformer.html @@ -311,7 +311,7 @@

.setInputCol("text") .setOutputCol("documents") -val Phi2 = Phi2Transformer.pretrained("Phi2-7b") +val Phi2 = Phi2Transformer.pretrained("phi2_7b") .setInputCols(Array("documents")) .setMinOutputLength(10) .setMaxOutputLength(50) diff --git a/docs/api/com/johnsnowlabs/nlp/annotators/seq2seq/index.html b/docs/api/com/johnsnowlabs/nlp/annotators/seq2seq/index.html index c63e37ee760c8f..29e29cd55d177e 100644 --- a/docs/api/com/johnsnowlabs/nlp/annotators/seq2seq/index.html +++ b/docs/api/com/johnsnowlabs/nlp/annotators/seq2seq/index.html @@ -1040,9 +1040,9 @@

Type Members

process extensive textual input, expanding its utility in handling more complex tasks.

In summary, Mistral 7B represents a notable advancement in language models, offering a reliable and versatile solution for various natural language processing challenges.

Pretrained models can be loaded with pretrained of the companion object:

val mistral = MistralTransformer.pretrained()
   .setInputCols("document")
-  .setOutputCol("generation")
-
-The default model is "mistral-7b", if no name is provided. For available pretrained models
+  .setOutputCol("generation")
+
+The default model is "mistral_7b", if no name is provided. For available pretrained models
please see the Models Hub.

For extended examples of usage, see -MistralTestSpec.

References:

Paper Abstract:

We introduce Mistral 7B v0.1, a 7-billion-parameter language model engineered for superior +MistralTestSpec.

References:

Paper Abstract:

We introduce Mistral 7B v0.1, a 7-billion-parameter language model engineered for superior performance and efficiency. Mistral 7B outperforms Llama 2 13B across all evaluated benchmarks, and Llama 1 34B in reasoning, mathematics, and code generation. Our model leverages grouped-query attention (GQA) for faster inference, coupled with sliding window @@ -1059,7 +1059,7 @@

Type Members

.setInputCol("text") .setOutputCol("documents") -val mistral = MistralTransformer.pretrained("mistral-7b") +val mistral = MistralTransformer.pretrained("mistral_7b") .setInputCols(Array("documents")) .setMinOutputLength(10) .setMaxOutputLength(50) @@ -1134,7 +1134,7 @@

Type Members

.setInputCol("text") .setOutputCol("documents") -val Phi2 = Phi2Transformer.pretrained("Phi2-7b") +val Phi2 = Phi2Transformer.pretrained("phi2_7b") .setInputCols(Array("documents")) .setMinOutputLength(10) .setMaxOutputLength(50) diff --git a/docs/api/python/modules/sparknlp/annotator/seq2seq/mistral_transformer.html b/docs/api/python/modules/sparknlp/annotator/seq2seq/mistral_transformer.html index 4456d2a7d70a0a..d73a77c9461133 100644 --- a/docs/api/python/modules/sparknlp/annotator/seq2seq/mistral_transformer.html +++ b/docs/api/python/modules/sparknlp/annotator/seq2seq/mistral_transformer.html @@ -387,7 +387,7 @@

Source code for sparknlp.annotator.seq2seq.mistral_transformer

... .setOutputCol("generation") - The default model is ``"mistral-7b"``, if no name is provided. For available + The default model is ``"mistral_7b"``, if no name is provided. For available pretrained models please see the `Models Hub <https://sparknlp.org/models?q=mistral>`__. @@ -435,7 +435,7 @@

Source code for sparknlp.annotator.seq2seq.mistral_transformer

References
----------
- `Mistral 7B <https://mistral.ai/news/announcing-mistral-7b/>`__
- https://github.com/mistralai/mistral-src

**Paper Abstract:**
@@ -458,7 +458,7 @@

Source code for sparknlp.annotator.seq2seq.mistral_transformer

>>> documentAssembler = DocumentAssembler() \\ ... .setInputCol("text") \\ ... .setOutputCol("documents") - >>> mistral = MistralTransformer.pretrained("mistral-7b") \\ + >>> mistral = MistralTransformer.pretrained("mistral_7b") \\ ... .setInputCols(["documents"]) \\ ... .setMaxOutputLength(50) \\ ... .setOutputCol("generation") @@ -670,13 +670,13 @@

Source code for sparknlp.annotator.seq2seq.mistral_transformer

return MistralTransformer(java_model=jModel)
@staticmethod
-[docs] def pretrained(name="mistral-7b", lang="en", remote_loc=None):
+[docs] def pretrained(name="mistral_7b", lang="en", remote_loc=None):
"""Downloads and loads a pretrained model.
Parameters
----------
name : str, optional
- Name of the pretrained model, by default "mistral-7b"
+ Name of the pretrained model, by default "mistral_7b"
lang : str, optional
Language of the pretrained model, by default "en"
remote_loc : str, optional
diff --git a/docs/api/python/modules/sparknlp/annotator/seq2seq/phi2_transformer.html index 07ee3e86c061c8..26e7ce683ee170 100644
--- a/docs/api/python/modules/sparknlp/annotator/seq2seq/phi2_transformer.html
+++ b/docs/api/python/modules/sparknlp/annotator/seq2seq/phi2_transformer.html
@@ -451,7 +451,7 @@

Source code for sparknlp.annotator.seq2seq.phi2_transformer

>>> documentAssembler = DocumentAssembler() \\ ... .setInputCol("text") \\ ... .setOutputCol("documents") - >>> phi2 = Phi2Transformer.pretrained("phi2-7b") \\ + >>> phi2 = Phi2Transformer.pretrained("phi2_7b") \\ ... .setInputCols(["documents"]) \\ ... .setMaxOutputLength(50) \\ ... .setOutputCol("generation") @@ -647,13 +647,13 @@

Source code for sparknlp.annotator.seq2seq.phi2_transformer

return Phi2Transformer(java_model=jModel)
@staticmethod
-[docs] def pretrained(name="phi2-7b", lang="en", remote_loc=None):
+[docs] def pretrained(name="phi2_7b", lang="en", remote_loc=None):
"""Downloads and loads a pretrained model.
Parameters
----------
name : str, optional
- Name of the pretrained model, by default "phi2-7b"
+ Name of the pretrained model, by default "phi2_7b"
lang : str, optional
Language of the pretrained model, by default "en"
remote_loc : str, optional
diff --git a/docs/api/python/reference/autosummary/sparknlp/annotator/seq2seq/mistral_transformer/index.html index 0ca0553b9d8768..b632b744fa5637 100644
--- a/docs/api/python/reference/autosummary/sparknlp/annotator/seq2seq/mistral_transformer/index.html
+++ b/docs/api/python/reference/autosummary/sparknlp/annotator/seq2seq/mistral_transformer/index.html
@@ -543,7 +543,7 @@

Classes

... .setOutputCol("generation")

-The default model is "mistral-7b", if no name is provided. For available
+The default model is "mistral_7b", if no name is provided. For available
pretrained models please see the Models Hub.

@@ -772,12 +772,12 @@

Classes
-static pretrained(name='mistral-7b', lang='en', remote_loc=None)[source]#
+static pretrained(name='mistral_7b', lang='en', remote_loc=None)[source]#

Downloads and loads a pretrained model.

Parameters:
-
name : str, optional

Name of the pretrained model, by default “mistral-7b”

+
name : str, optional

Name of the pretrained model, by default “mistral_7b”

lang : str, optional

Language of the pretrained model, by default “en”

diff --git a/docs/api/python/reference/autosummary/sparknlp/annotator/seq2seq/phi2_transformer/index.html b/docs/api/python/reference/autosummary/sparknlp/annotator/seq2seq/phi2_transformer/index.html index 596773e05d4d21..ea8b3e5ac7ae33 100644 --- a/docs/api/python/reference/autosummary/sparknlp/annotator/seq2seq/phi2_transformer/index.html +++ b/docs/api/python/reference/autosummary/sparknlp/annotator/seq2seq/phi2_transformer/index.html @@ -608,7 +608,7 @@

Classes

>>> documentAssembler = DocumentAssembler() \
... .setInputCol("text") \
... .setOutputCol("documents")
->>> phi2 = Phi2Transformer.pretrained("phi2-7b") \
+>>> phi2 = Phi2Transformer.pretrained("phi2_7b") \
... .setInputCols(["documents"]) \
... .setMaxOutputLength(50) \
... .setOutputCol("generation")
@@ -795,12 +795,12 @@

Classes
-static pretrained(name='phi2-7b', lang='en', remote_loc=None)[source]#
+static pretrained(name='phi2_7b', lang='en', remote_loc=None)[source]#

Downloads and loads a pretrained model.

Parameters:
-
name : str, optional

Name of the pretrained model, by default “phi2-7b”

+
name : str, optional

Name of the pretrained model, by default “phi2_7b”

lang : str, optional

Language of the pretrained model, by default “en”

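The Python and Scala source diffs below change only the default `name` argument of `pretrained`, so code that relied on the no-argument call picks up the corrected names automatically. As a minimal Scala sketch of what the new defaults resolve to (illustrative only; it assumes a running Spark session with the Spark NLP annotators on the classpath, and `lang` left at its `"en"` default):

```scala
import com.johnsnowlabs.nlp.annotators.seq2seq.{MistralTransformer, Phi2Transformer}

// After this patch, a bare pretrained() call resolves to the renamed
// defaults ("mistral_7b" and "phi2") instead of the old
// "mistral-7b"/"Phi2-7b" names.
val mistral = MistralTransformer.pretrained() // same as pretrained("mistral_7b", "en")
val phi2 = Phi2Transformer.pretrained()       // same as pretrained("phi2", "en")
```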
diff --git a/python/sparknlp/annotator/seq2seq/mistral_transformer.py index 29eff367e5b52f..893f9a871c11c7 100644
--- a/python/sparknlp/annotator/seq2seq/mistral_transformer.py
+++ b/python/sparknlp/annotator/seq2seq/mistral_transformer.py
@@ -44,7 +44,7 @@ class MistralTransformer(AnnotatorModel, HasBatchedAnnotate, HasEngine):
... .setOutputCol("generation")
- The default model is ``"mistral-7b"``, if no name is provided. For available
+ The default model is ``"mistral_7b"``, if no name is provided. For available
pretrained models please see the `Models Hub <https://sparknlp.org/models?q=mistral>`__.
@@ -92,7 +92,7 @@ class MistralTransformer(AnnotatorModel, HasBatchedAnnotate, HasEngine):
References
----------
- `Mistral 7B <https://mistral.ai/news/announcing-mistral-7b/>`__
- https://github.com/mistralai/mistral-src
**Paper Abstract:**
@@ -115,7 +115,7 @@ class MistralTransformer(AnnotatorModel, HasBatchedAnnotate, HasEngine):
>>> documentAssembler = DocumentAssembler() \\
... .setInputCol("text") \\
... .setOutputCol("documents")
- >>> mistral = MistralTransformer.pretrained("mistral-7b") \\
+ >>> mistral = MistralTransformer.pretrained("mistral_7b") \\
... .setInputCols(["documents"]) \\
... .setMaxOutputLength(50) \\
... .setOutputCol("generation")
@@ -327,13 +327,13 @@ def loadSavedModel(folder, spark_session, use_openvino=False):
return MistralTransformer(java_model=jModel)
@staticmethod
- def pretrained(name="mistral-7b", lang="en", remote_loc=None):
+ def pretrained(name="mistral_7b", lang="en", remote_loc=None):
"""Downloads and loads a pretrained model.
Parameters
----------
name : str, optional
- Name of the pretrained model, by default "mistral-7b"
+ Name of the pretrained model, by default "mistral_7b"
lang : str, optional
Language of the pretrained model, by default "en"
remote_loc : str, optional
diff --git a/python/sparknlp/annotator/seq2seq/phi2_transformer.py index e7cf7604da03c4..d2eaaad2b960e7 100644
--- a/python/sparknlp/annotator/seq2seq/phi2_transformer.py
+++ b/python/sparknlp/annotator/seq2seq/phi2_transformer.py
@@ -108,7 +108,7 @@ class Phi2Transformer(AnnotatorModel, HasBatchedAnnotate, HasEngine):
>>> documentAssembler = DocumentAssembler() \\
... .setInputCol("text") \\
... .setOutputCol("documents")
- >>> phi2 = Phi2Transformer.pretrained("phi2-7b") \\
+ >>> phi2 = Phi2Transformer.pretrained("phi2") \\
... .setInputCols(["documents"]) \\
... .setMaxOutputLength(50) \\
... .setOutputCol("generation")
@@ -304,13 +304,13 @@ def loadSavedModel(folder, spark_session, use_openvino=False):
return Phi2Transformer(java_model=jModel)
@staticmethod
- def pretrained(name="phi2-7b", lang="en", remote_loc=None):
+ def pretrained(name="phi2", lang="en", remote_loc=None):
"""Downloads and loads a pretrained model.
Parameters
----------
name : str, optional
- Name of the pretrained model, by default "phi2-7b"
+ Name of the pretrained model, by default "phi2"
lang : str, optional
Language of the pretrained model, by default "en"
remote_loc : str, optional
diff --git a/src/main/scala/com/johnsnowlabs/nlp/annotators/seq2seq/MistralTransformer.scala index 0614b7b91ffd31..43ab7a9f6264dd 100644
--- a/src/main/scala/com/johnsnowlabs/nlp/annotators/seq2seq/MistralTransformer.scala
+++ b/src/main/scala/com/johnsnowlabs/nlp/annotators/seq2seq/MistralTransformer.scala
@@ -69,14 +69,14 @@ import org.json4s.jackson.JsonMethods._
* .setInputCols("document")
* .setOutputCol("generation")
* }}}
- * The default model is `"mistral-7b"`, if no name is provided. For available pretrained models
+ * The default model is `"mistral_7b"`, if no name is provided. For available pretrained models
* please see the [[https://sparknlp.org/models?q=mistral Models Hub]].
*
* For extended examples of usage, see
* [[https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/test/scala/com/johnsnowlabs/nlp/annotators/seq2seq/MistralTestSpec.scala MistralTestSpec]].
*
* '''References:'''
* - [[https://mistral.ai/news/announcing-mistral-7b/ Mistral 7B]]
* - [[https://github.com/mistralai/mistral-src]]
*
* '''Paper Abstract:'''
@@ -106,7 +106,7 @@ import org.json4s.jackson.JsonMethods._
* .setInputCol("text")
* .setOutputCol("documents")
*
- * val mistral = MistralTransformer.pretrained("mistral-7b")
+ * val mistral = MistralTransformer.pretrained("mistral_7b")
* .setInputCols(Array("documents"))
* .setMinOutputLength(10)
* .setMaxOutputLength(50)
@@ -323,7 +323,7 @@ class MistralTransformer(override val uid: String)
trait ReadablePretrainedMistralTransformerModel
extends ParamsAndFeaturesReadable[MistralTransformer]
with HasPretrained[MistralTransformer] {
- override val defaultModelName: Some[String] = Some("mistral-7b")
+ override val defaultModelName: Some[String] = Some("mistral_7b")
/** Java compliant-overrides */
override def pretrained(): MistralTransformer = super.pretrained()
diff --git a/src/main/scala/com/johnsnowlabs/nlp/annotators/seq2seq/Phi2Transformer.scala index 9f7657eeeac09c..fbb16fa7e13ea2 100644
--- a/src/main/scala/com/johnsnowlabs/nlp/annotators/seq2seq/Phi2Transformer.scala
+++ b/src/main/scala/com/johnsnowlabs/nlp/annotators/seq2seq/Phi2Transformer.scala
@@ -113,7 +113,7 @@ import org.json4s.jackson.JsonMethods._
* .setInputCol("text")
* .setOutputCol("documents")
*
- * val Phi2 = Phi2Transformer.pretrained("Phi2-7b")
+ * val Phi2 = Phi2Transformer.pretrained("phi2")
* .setInputCols(Array("documents"))
* .setMinOutputLength(10)
* .setMaxOutputLength(50)
@@ -323,8 +323,8 @@ class Phi2Transformer(override val uid: String)
path,
spark,
wrappers.get,
- LLAMA2Transformer.suffix,
- LLAMA2Transformer.openvinoFile)
+ Phi2Transformer.suffix,
+ Phi2Transformer.openvinoFile)
}
}
}
@@ -332,7 +332,7 @@ trait ReadablePretrainedPhi2TransformerModel
extends ParamsAndFeaturesReadable[Phi2Transformer]
with HasPretrained[Phi2Transformer] {
- override val defaultModelName: Some[String] = Some("Phi2-7b")
+ override val defaultModelName: Some[String] = Some("phi2")
/** Java compliant-overrides */
override def pretrained(): Phi2Transformer = super.pretrained()
@@
-351,7 +351,7 @@ trait ReadPhi2TransformerDLModel extends ReadOnnxModel with ReadOpenvinoModel {
override val onnxFile: String = "phi2_onnx"
val suffix: String = "_phi2"
- override val openvinoFile: String = "llama2_openvino"
+ override val openvinoFile: String = "phi2_openvino"
def readModel(instance: Phi2Transformer, path: String, spark: SparkSession): Unit = {
instance.getEngine match {
@@ -363,7 +363,7 @@ trait ReadPhi2TransformerDLModel extends ReadOnnxModel with ReadOpenvinoModel {
instance.setModelIfNotSet(spark, Some(onnxWrappers), None)
case Openvino.name =>
val ovWrapper =
- readOpenvinoModel(path, spark, "_llama2_ov")
+ readOpenvinoModel(path, spark, "_phi2_ov")
instance.setModelIfNotSet(spark, None, Some(ovWrapper))
case _ =>
throw new Exception(notSupportedEngineError)
diff --git a/src/test/scala/com/johnsnowlabs/nlp/annotators/seq2seq/MistralTestSpec.scala index 0a51ae130360f2..d307c67ccfa9d8 100644
--- a/src/test/scala/com/johnsnowlabs/nlp/annotators/seq2seq/MistralTestSpec.scala
+++ b/src/test/scala/com/johnsnowlabs/nlp/annotators/seq2seq/MistralTestSpec.scala
@@ -24,7 +24,7 @@ import org.scalatest.flatspec.AnyFlatSpec
class MistralTestSpec extends AnyFlatSpec {
- "mistral-7b" should "should handle temperature=0 correctly and not crash when predicting more than 1 element with doSample=True" taggedAs SlowTest in {
+ "mistral_7b" should "handle temperature=0 correctly and not crash when predicting more than 1 element with doSample=True" taggedAs SlowTest in {
// Even though the paper states temperature in interval [0,1), using temperature=0 will result in division by 0 error.
// Also DoSample=True may result in infinities being generated and distFiltered.length==0 which results in exception if we don't return 0 instead internally.
val testData = ResourceHelper.spark

From a070adce78404221a3c0df0da737e07bd9df0256 Mon Sep 17 00:00:00 2001
From: Danilo Burbano <37355249+danilojsl@users.noreply.github.com>
Date: Sun, 14 Jul 2024 09:56:51 -0500
Subject: [PATCH 2/7] [SPARKNLP-1052] Adding random suffix to avoid duplication in spark files (#14340)

---
.../com/johnsnowlabs/ml/onnx/OnnxWrapper.scala | 9 ++++++++-
.../johnsnowlabs/ml/openvino/OpenvinoWrapper.scala | 12 +++++++++---
.../johnsnowlabs/ml/util/LoadExternalModel.scala | 13 +++++++++++++
.../com/johnsnowlabs/util/ZipArchiveUtil.scala | 7 ++++++-
4 files changed, 36 insertions(+), 5 deletions(-)

diff --git a/src/main/scala/com/johnsnowlabs/ml/onnx/OnnxWrapper.scala index 3b08931558a41a..6e748faa72ee63 100644
--- a/src/main/scala/com/johnsnowlabs/ml/onnx/OnnxWrapper.scala
+++ b/src/main/scala/com/johnsnowlabs/ml/onnx/OnnxWrapper.scala
@@ -20,6 +20,7 @@ import ai.onnxruntime.OrtSession.SessionOptions
import ai.onnxruntime.OrtSession.SessionOptions.{ExecutionMode, OptLevel}
import ai.onnxruntime.providers.OrtCUDAProviderOptions
import ai.onnxruntime.{OrtEnvironment, OrtSession}
+import com.johnsnowlabs.ml.util.LoadExternalModel
import com.johnsnowlabs.util.{ConfigHelper, FileHelper, ZipArchiveUtil}
import org.apache.spark.SparkFiles
import org.apache.spark.sql.SparkSession
@@ -114,9 +115,10 @@ object OnnxWrapper {
.toString
// 2.
Unpack archive + val randomSuffix = generateRandomSuffix(onnxFileSuffix) val folder = if (zipped) - ZipArchiveUtil.unzip(new File(modelPath), Some(tmpFolder), onnxFileSuffix) + ZipArchiveUtil.unzip(new File(modelPath), Some(tmpFolder), randomSuffix) else modelPath @@ -151,6 +153,11 @@ object OnnxWrapper { onnxWrapper } + private def generateRandomSuffix(fileSuffix: Option[String]): Option[String] = { + val randomSuffix = Some(LoadExternalModel.generateRandomString(10)) + Some(s"${randomSuffix.get}${fileSuffix.getOrElse("")}") + } + private def mapToSessionOptionsObject(sessionOptions: Map[String, String]): SessionOptions = { val providers = OrtEnvironment.getAvailableProviders if (providers.toArray.map(x => x.toString).contains("CUDA")) { diff --git a/src/main/scala/com/johnsnowlabs/ml/openvino/OpenvinoWrapper.scala b/src/main/scala/com/johnsnowlabs/ml/openvino/OpenvinoWrapper.scala index dd8b5f466a2927..fa5908383bae97 100644 --- a/src/main/scala/com/johnsnowlabs/ml/openvino/OpenvinoWrapper.scala +++ b/src/main/scala/com/johnsnowlabs/ml/openvino/OpenvinoWrapper.scala @@ -17,8 +17,8 @@ package com.johnsnowlabs.ml.openvino import com.johnsnowlabs.ml.util.LoadExternalModel.notSupportedEngineError -import com.johnsnowlabs.ml.util.{ONNX, Openvino, TensorFlow} -import com.johnsnowlabs.util.{ConfigHelper, ConfigLoader, FileHelper, ZipArchiveUtil} +import com.johnsnowlabs.ml.util.{LoadExternalModel, ONNX, Openvino, TensorFlow} +import com.johnsnowlabs.util.{FileHelper, ZipArchiveUtil} import org.apache.commons.io.{FileUtils, FilenameUtils} import org.apache.spark.SparkFiles import org.apache.spark.sql.SparkSession @@ -113,9 +113,10 @@ object OpenvinoWrapper { .toAbsolutePath .toString + val randomSuffix = generateRandomSuffix(ovFileSuffix) val folder = if (zipped) - ZipArchiveUtil.unzip(new File(modelPath), Some(tmpFolder), ovFileSuffix) + ZipArchiveUtil.unzip(new File(modelPath), Some(tmpFolder), randomSuffix) else modelPath @@ -151,6 +152,11 @@ object OpenvinoWrapper { openvinoWrapper } + private def generateRandomSuffix(fileSuffix: Option[String]): Option[String] = { + val randomSuffix = Some(LoadExternalModel.generateRandomString(10)) + Some(s"${randomSuffix.get}${fileSuffix.getOrElse("")}") + } + /** Convert the model at srcPath to OpenVINO IR Format and export to exportPath. * * @param srcPath diff --git a/src/main/scala/com/johnsnowlabs/ml/util/LoadExternalModel.scala b/src/main/scala/com/johnsnowlabs/ml/util/LoadExternalModel.scala index 93cab6a0a89dd7..cd0761f0f9daa3 100644 --- a/src/main/scala/com/johnsnowlabs/ml/util/LoadExternalModel.scala +++ b/src/main/scala/com/johnsnowlabs/ml/util/LoadExternalModel.scala @@ -22,6 +22,7 @@ import com.johnsnowlabs.nlp.util.io.{ExternalResource, ReadAs, ResourceHelper} import java.io.File import java.nio.file.Paths import scala.io.Source +import scala.util.Random object LoadExternalModel { @@ -228,4 +229,16 @@ object LoadExternalModel { f } + /** Generates a random alphanumeric string of a given length. 
+ * + * @param n + * the length of the generated string + * @return + * a random alphanumeric string of length n + */ + def generateRandomString(n: Int): String = { + val alphanumeric = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789" + (1 to n).map(_ => alphanumeric(Random.nextInt(alphanumeric.length))).mkString + } + } diff --git a/src/main/scala/com/johnsnowlabs/util/ZipArchiveUtil.scala b/src/main/scala/com/johnsnowlabs/util/ZipArchiveUtil.scala index 8c85f2915561f3..0443471e5080bc 100644 --- a/src/main/scala/com/johnsnowlabs/util/ZipArchiveUtil.scala +++ b/src/main/scala/com/johnsnowlabs/util/ZipArchiveUtil.scala @@ -135,7 +135,7 @@ object ZipArchiveUtil { val zip = new ZipFile(file) zip.entries.asScala foreach { entry => - val entryName = if (suffix.isDefined) suffix.get + "_" + entry.getName else entry.getName + val entryName = buildEntryName(entry, suffix) val entryPath = { if (entryName.startsWith(basename)) entryName.substring(0, basename.length) @@ -165,4 +165,9 @@ object ZipArchiveUtil { destDir.getPath } + private def buildEntryName(entry: ZipEntry, suffix: Option[String]): String = { + val entryName = if (suffix.isDefined) suffix.get + "_" + entry.getName else entry.getName + entryName.split("_").distinct.mkString("_") + } + } From e120e61e7c5bb2e0592c26aaea6a85bc6265df09 Mon Sep 17 00:00:00 2001 From: Danilo Burbano <37355249+danilojsl@users.noreply.github.com> Date: Sun, 14 Jul 2024 09:59:05 -0500 Subject: [PATCH 3/7] [SPARKNLP-1015] Restructuring Readme and Documentation (#14341) --- README.md | 1227 +++------------------------------- docs/_data/navigation.yml | 6 + docs/en/advanced_settings.md | 142 ++++ docs/en/features.md | 120 ++++ docs/en/install.md | 435 +++++++++++- docs/en/pipelines.md | 1035 +++------------------------- 6 files changed, 906 insertions(+), 2059 deletions(-) create mode 100644 docs/en/advanced_settings.md create mode 100644 docs/en/features.md diff --git a/README.md b/README.md index cb7c32736e8638..fe8f9fe9fcc625 100644 --- a/README.md +++ b/README.md @@ -29,148 +29,17 @@ It also offers tasks such as **Tokenization**, **Word Segmentation**, **Part-of- Take a look at our official Spark NLP page: [https://sparknlp.org/](https://sparknlp.org/) for user documentation and examples -## Community support - -- [Slack](https://join.slack.com/t/spark-nlp/shared_invite/zt-198dipu77-L3UWNe_AJ8xqDk0ivmih5Q) For live discussion with the Spark NLP community and the team -- [GitHub](https://github.com/JohnSnowLabs/spark-nlp) Bug reports, feature requests, and contributions -- [Discussions](https://github.com/JohnSnowLabs/spark-nlp/discussions) Engage with other community members, share ideas, - and show off how you use Spark NLP! 
-- [Medium](https://medium.com/spark-nlp) Spark NLP articles -- [YouTube](https://www.youtube.com/channel/UCmFOjlpYEhxf_wJUDuz6xxQ/videos) Spark NLP video tutorials - -## Table of contents - -- [Features](#features) -- [Requirements](#requirements) -- [Quick Start](#quick-start) -- [Apache Spark Support](#apache-spark-support) -- [Scala & Python Support](#scala-and-python-support) -- [Databricks Support](#databricks-support) -- [EMR Support](#emr-support) -- [Using Spark NLP](#usage) - - [Packages Cheatsheet](#packages-cheatsheet) - - [Spark Packages](#spark-packages) - - [Scala](#scala) - - [Maven](#maven) - - [SBT](#sbt) - - [Python](#python) - - [Pip/Conda](#pipconda) - - [Compiled JARs](#compiled-jars) - - [Apache Zeppelin](#apache-zeppelin) - - [Jupyter Notebook](#jupyter-notebook-python) - - [Google Colab Notebook](#google-colab-notebook) - - [Kaggle Kernel](#kaggle-kernel) - - [Databricks Cluster](#databricks-cluster) - - [EMR Cluster](#emr-cluster) - - [GCP Dataproc](#gcp-dataproc) - - [Spark NLP Configuration](#spark-nlp-configuration) -- [Pipelines & Models](#pipelines-and-models) - - [Pipelines](#pipelines) - - [Models](#models) -- [Offline](#offline) -- [Examples](#examples) -- [FAQ](#faq) -- [Citation](#citation) -- [Contributing](#contributing) - ## Features - -- Tokenization -- Trainable Word Segmentation -- Stop Words Removal -- Token Normalizer -- Document Normalizer -- Document & Text Splitter -- Stemmer -- Lemmatizer -- NGrams -- Regex Matching -- Text Matching -- Chunking -- Date Matcher -- Sentence Detector -- Deep Sentence Detector (Deep learning) -- Dependency parsing (Labeled/unlabeled) -- SpanBertCorefModel (Coreference Resolution) -- Part-of-speech tagging -- Sentiment Detection (ML models) -- Spell Checker (ML and DL models) -- Word Embeddings (GloVe and Word2Vec) -- Doc2Vec (based on Word2Vec) -- BERT Embeddings (TF Hub & HuggingFace models) -- DistilBERT Embeddings (HuggingFace models) -- CamemBERT Embeddings (HuggingFace models) -- RoBERTa Embeddings (HuggingFace models) -- DeBERTa Embeddings (HuggingFace v2 & v3 models) -- XLM-RoBERTa Embeddings (HuggingFace models) -- Longformer Embeddings (HuggingFace models) -- ALBERT Embeddings (TF Hub & HuggingFace models) -- XLNet Embeddings -- ELMO Embeddings (TF Hub models) -- Universal Sentence Encoder (TF Hub models) -- BERT Sentence Embeddings (TF Hub & HuggingFace models) -- RoBerta Sentence Embeddings (HuggingFace models) -- XLM-RoBerta Sentence Embeddings (HuggingFace models) -- INSTRUCTOR Embeddings (HuggingFace models) -- E5 Embeddings (HuggingFace models) -- MPNet Embeddings (HuggingFace models) -- UAE Embeddings (HuggingFace models) -- OpenAI Embeddings -- Sentence & Chunk Embeddings -- Unsupervised keywords extraction -- Language Detection & Identification (up to 375 languages) -- Multi-class & Multi-labe Sentiment analysis (Deep learning) -- Multi-class Text Classification (Deep learning) -- BERT for Token & Sequence Classification & Question Answering -- DistilBERT for Token & Sequence Classification & Question Answering -- CamemBERT for Token & Sequence Classification & Question Answering -- ALBERT for Token & Sequence Classification & Question Answering -- RoBERTa for Token & Sequence Classification & Question Answering -- DeBERTa for Token & Sequence Classification & Question Answering -- XLM-RoBERTa for Token & Sequence Classification & Question Answering -- Longformer for Token & Sequence Classification & Question Answering -- MPnet for Token & Sequence Classification & Question Answering -- XLNet 
for Token & Sequence Classification -- Zero-Shot NER Model -- Zero-Shot Text Classification by Transformers (ZSL) -- Neural Machine Translation (MarianMT) -- Many-to-Many multilingual translation model (Facebook M2M100) -- Table Question Answering (TAPAS) -- Text-To-Text Transfer Transformer (Google T5) -- Generative Pre-trained Transformer 2 (OpenAI GPT2) -- Seq2Seq for NLG, Translation, and Comprehension (Facebook BART) -- Chat and Conversational LLMs (Facebook Llama-2) -- Vision Transformer (Google ViT) -- Swin Image Classification (Microsoft Swin Transformer) -- ConvNext Image Classification (Facebook ConvNext) -- Vision Encoder Decoder for image-to-text like captioning -- Zero-Shot Image Classification by OpenAI's CLIP -- Automatic Speech Recognition (Wav2Vec2) -- Automatic Speech Recognition (HuBERT) -- Automatic Speech Recognition (OpenAI Whisper) -- Named entity recognition (Deep learning) -- Easy ONNX, OpenVINO, and TensorFlow integrations -- GPU Support -- Full integration with Spark ML functions -- +31000 pre-trained models in +200 languages! -- +6000 pre-trained pipelines in +200 languages! -- Multi-lingual NER models: Arabic, Bengali, Chinese, Danish, Dutch, English, Finnish, French, German, Hebrew, Italian, - Japanese, Korean, Norwegian, Persian, Polish, Portuguese, Russian, Spanish, Swedish, Urdu, and more. - -## Requirements - -To use Spark NLP you need the following requirements: - -- Java 8 and 11 -- Apache Spark 3.5.x, 3.4.x, 3.3.x, 3.2.x, 3.1.x, 3.0.x - -**GPU (optional):** - -Spark NLP 5.4.0 is built with ONNX 1.17.0 and TensorFlow 2.7.1 deep learning engines. The minimum following NVIDIA® software are only required for GPU support: - -- NVIDIA® GPU drivers version 450.80.02 or higher -- CUDA® Toolkit 11.2 -- cuDNN SDK 8.1.0 +- [Text Preprocessing](https://sparknlp.org/docs/en/features#text-preproccesing) +- [Parsing and Analysis](https://sparknlp.org/docs/en/features#parsing-and-analysis) +- [Sentiment and Classification](https://sparknlp.org/docs/en/features#sentiment-and-classification) +- [Embeddings](https://sparknlp.org/docs/en/features#embeddings) +- [Classification and Question Answering](https://sparknlp.org/docs/en/features#classification-and-question-answering-models) +- [Machine Translation and Generation](https://sparknlp.org/docs/en/features#machine-translation-and-generation) +- [Image and Speech](https://sparknlp.org/docs/en/features#image-and-speech) +- [Integration and Interoperability (ONNX, OpenVINO)](https://sparknlp.org/docs/en/features#integration-and-interoperability) +- [Pre-trained Models (36000+ in +200 languages)](https://sparknlp.org/docs/en/features#pre-trained-models) +- [Multi-lingual Support](https://sparknlp.org/docs/en/features#multi-lingual-support) ## Quick Start @@ -225,7 +94,27 @@ Output: ['Mona Lisa', 'Leonardo', 'Louvre', 'Paris'] For more examples, you can visit our dedicated [examples](https://github.com/JohnSnowLabs/spark-nlp/tree/master/examples) to showcase all Spark NLP use cases! 
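The cheatsheet that follows pairs each Maven artifact with its Python start call; on the Scala side the equivalent entry point is `SparkNLP.start`. A sketch of the Scala equivalents (illustrative only; it assumes the matching artifact, e.g. `spark-nlp-gpu` for the GPU flag, is on the classpath):

```scala
import com.johnsnowlabs.nlp.SparkNLP

// CPU session, mirroring sparknlp.start() in the cheatsheet below.
val spark = SparkNLP.start()

// With the spark-nlp-gpu artifact instead, the GPU flag mirrors
// sparknlp.start(gpu=True):
// val sparkGpu = SparkNLP.start(gpu = true)
```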
-## Apache Spark Support +### Packages Cheatsheet + +This is a cheatsheet for corresponding Spark NLP Maven package to Apache Spark / PySpark major version: + +| Apache Spark | Spark NLP on CPU | Spark NLP on GPU | Spark NLP on AArch64 (linux) | Spark NLP on Apple Silicon | +|-------------------------|--------------------|----------------------------|--------------------------------|--------------------------------------| +| 3.0/3.1/3.2/3.3/3.4/3.5 | `spark-nlp` | `spark-nlp-gpu` | `spark-nlp-aarch64` | `spark-nlp-silicon` | +| Start Function | `sparknlp.start()` | `sparknlp.start(gpu=True)` | `sparknlp.start(aarch64=True)` | `sparknlp.start(apple_silicon=True)` | + +NOTE: `M1/M2` and `AArch64` are under `experimental` support. Access and support to these architectures are limited by the +community and we had to build most of the dependencies by ourselves to make them compatible. We support these two +architectures, however, they may not work in some environments. + +## Pipelines and Models +For a quick example of using pipelines and models take a look at our official [documentation](https://sparknlp.org/docs/en/install#pipelines-and-models) + +#### Please check out our Models Hub for the full list of [pre-trained models](https://sparknlp.org/models) with examples, demo, benchmark, and more + +## Platform and Ecosystem Support + +### Apache Spark Support Spark NLP *5.4.0* has been built on top of Apache Spark 3.4 while fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x @@ -236,15 +125,10 @@ Spark NLP *5.4.0* has been built on top of Apache Spark 3.4 while fully supports | 5.2.x | YES | YES | YES | YES | YES | YES | NO | NO | | 5.1.x | Partially | YES | YES | YES | YES | YES | NO | NO | | 5.0.x | YES | YES | YES | YES | YES | YES | NO | NO | -| 4.4.x | YES | YES | YES | YES | YES | YES | NO | NO | -| 4.3.x | NO | NO | YES | YES | YES | YES | NO | NO | -| 4.2.x | NO | NO | YES | YES | YES | YES | NO | NO | -| 4.1.x | NO | NO | YES | YES | YES | YES | NO | NO | -| 4.0.x | NO | NO | YES | YES | YES | YES | NO | NO | Find out more about `Spark NLP` versions from our [release notes](https://github.com/JohnSnowLabs/spark-nlp/releases). 
-## Scala and Python Support +### Scala and Python Support | Spark NLP | Python 3.6 | Python 3.7 | Python 3.8 | Python 3.9 | Python 3.10| Scala 2.11 | Scala 2.12 | |-----------|------------|------------|------------|------------|------------|------------|------------| @@ -252,737 +136,87 @@ Find out more about `Spark NLP` versions from our [release notes](https://github | 5.2.x | NO | YES | YES | YES | YES | NO | YES | | 5.1.x | NO | YES | YES | YES | YES | NO | YES | | 5.0.x | NO | YES | YES | YES | YES | NO | YES | -| 4.4.x | NO | YES | YES | YES | YES | NO | YES | -| 4.3.x | YES | YES | YES | YES | YES | NO | YES | -| 4.2.x | YES | YES | YES | YES | YES | NO | YES | -| 4.1.x | YES | YES | YES | YES | NO | NO | YES | -| 4.0.x | YES | YES | YES | YES | NO | NO | YES | -## Databricks Support +Find out more about 4.x `SparkNLP` versions in our official [documentation](https://sparknlp.org/docs/en/install#apache-spark-support) + +### Databricks Support Spark NLP 5.4.0 has been tested and is compatible with the following runtimes: -**CPU:** - -- 9.1 -- 9.1 ML -- 10.1 -- 10.1 ML -- 10.2 -- 10.2 ML -- 10.3 -- 10.3 ML -- 10.4 -- 10.4 ML -- 10.5 -- 10.5 ML -- 11.0 -- 11.0 ML -- 11.1 -- 11.1 ML -- 11.2 -- 11.2 ML -- 11.3 -- 11.3 ML -- 12.0 -- 12.0 ML -- 12.1 -- 12.1 ML -- 12.2 -- 12.2 ML -- 13.0 -- 13.0 ML -- 13.1 -- 13.1 ML -- 13.2 -- 13.2 ML -- 13.3 -- 13.3 ML -- 14.0 -- 14.0 ML -- 14.1 -- 14.1 ML -- 14.2 -- 14.2 ML -- 14.3 -- 14.3 ML - -**GPU:** - -- 9.1 ML & GPU -- 10.1 ML & GPU -- 10.2 ML & GPU -- 10.3 ML & GPU -- 10.4 ML & GPU -- 10.5 ML & GPU -- 11.0 ML & GPU -- 11.1 ML & GPU -- 11.2 ML & GPU -- 11.3 ML & GPU -- 12.0 ML & GPU -- 12.1 ML & GPU -- 12.2 ML & GPU -- 13.0 ML & GPU -- 13.1 ML & GPU -- 13.2 ML & GPU -- 13.3 ML & GPU -- 14.0 ML & GPU -- 14.1 ML & GPU -- 14.2 ML & GPU -- 14.3 ML & GPU - -## EMR Support +| **CPU** | **GPU** | +|--------------------|--------------------| +| 14.0 / 14.0 ML | 14.0 ML & GPU | +| 14.1 / 14.1 ML | 14.1 ML & GPU | +| 14.2 / 14.2 ML | 14.2 ML & GPU | +| 14.3 / 14.3 ML | 14.3 ML & GPU | + +We are compatible with older runtimes. For a full list check databricks support in our official [documentation](https://sparknlp.org/docs/en/install#databricks-support) + +### EMR Support Spark NLP 5.4.0 has been tested and is compatible with the following EMR releases: -- emr-6.2.0 -- emr-6.3.0 -- emr-6.3.1 -- emr-6.4.0 -- emr-6.5.0 -- emr-6.6.0 -- emr-6.7.0 -- emr-6.8.0 -- emr-6.9.0 -- emr-6.10.0 -- emr-6.11.0 -- emr-6.12.0 -- emr-6.13.0 -- emr-6.14.0 -- emr-6.15.0 -- emr-7.0.0 +| **EMR Release** | +|--------------------| +| emr-6.13.0 | +| emr-6.14.0 | +| emr-6.15.0 | +| emr-7.0.0 | + +We are compatible with older EMR releases. For a full list check EMR support in our official [documentation](https://sparknlp.org/docs/en/install#emr-support) Full list of [Amazon EMR 6.x releases](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-6x.html) Full list of [Amazon EMR 7.x releases](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-7x.html) NOTE: The EMR 6.1.0 and 6.1.1 are not supported. 
-## Usage - -## Packages Cheatsheet - -This is a cheatsheet for corresponding Spark NLP Maven package to Apache Spark / PySpark major version: - -| Apache Spark | Spark NLP on CPU | Spark NLP on GPU | Spark NLP on AArch64 (linux) | Spark NLP on Apple Silicon | -|-------------------------|--------------------|----------------------------|--------------------------------|--------------------------------------| -| 3.0/3.1/3.2/3.3/3.4/3.5 | `spark-nlp` | `spark-nlp-gpu` | `spark-nlp-aarch64` | `spark-nlp-silicon` | -| Start Function | `sparknlp.start()` | `sparknlp.start(gpu=True)` | `sparknlp.start(aarch64=True)` | `sparknlp.start(apple_silicon=True)` | - -NOTE: `M1/M2` and `AArch64` are under `experimental` support. Access and support to these architectures are limited by the -community and we had to build most of the dependencies by ourselves to make them compatible. We support these two -architectures, however, they may not work in some environments. - -## Spark Packages +## Installation ### Command line (requires internet connection) +To install spark-nlp packages through command line follow [these instructions](https://sparknlp.org/docs/en/install#command-line) from our official documentation -Spark NLP supports all major releases of Apache Spark 3.0.x, Apache Spark 3.1.x, Apache Spark 3.2.x, Apache Spark 3.3.x, Apache Spark 3.4.x, and Apache Spark 3.5.x - -#### Apache Spark 3.x (3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x - Scala 2.12) - -```sh -# CPU - -spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0 - -pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0 - -spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0 -``` - -The `spark-nlp` has been published to -the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp). - -```sh -# GPU - -spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.4.0 - -pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.4.0 - -spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.4.0 - -``` - -The `spark-nlp-gpu` has been published to -the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-gpu). - -```sh -# AArch64 - -spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.4.0 - -pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.4.0 - -spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.4.0 - -``` - -The `spark-nlp-aarch64` has been published to -the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-aarch64). - -```sh -# M1/M2 (Apple Silicon) - -spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.4.0 - -pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.4.0 - -spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.4.0 - -``` - -The `spark-nlp-silicon` has been published to -the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-silicon). - -**NOTE**: In case you are using large pretrained models like UniversalSentenceEncoder, you need to have the following -set in your SparkSession: - -```sh -spark-shell \ - --driver-memory 16g \ - --conf spark.kryoserializer.buffer.max=2000M \ - --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0 -``` - -## Scala +### Scala Spark NLP supports Scala 2.12.15 if you are using Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x versions. Our packages are -deployed to Maven central. 
To add any of our packages as a dependency in your application you can follow these -coordinates: - -### Maven - -**spark-nlp** on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x: - -```xml - - - com.johnsnowlabs.nlp - spark-nlp_2.12 - 5.4.0 - -``` - -**spark-nlp-gpu:** - -```xml - - - com.johnsnowlabs.nlp - spark-nlp-gpu_2.12 - 5.4.0 - -``` - -**spark-nlp-aarch64:** - -```xml - - - com.johnsnowlabs.nlp - spark-nlp-aarch64_2.12 - 5.4.0 - -``` - -**spark-nlp-silicon:** - -```xml - - - com.johnsnowlabs.nlp - spark-nlp-silicon_2.12 - 5.4.0 - -``` - -### SBT - -**spark-nlp** on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x: - -```sbtshell -// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp -libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "5.4.0" -``` - -**spark-nlp-gpu:** - -```sbtshell -// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-gpu -libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-gpu" % "5.4.0" -``` - -**spark-nlp-aarch64:** - -```sbtshell -// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-aarch64 -libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-aarch64" % "5.4.0" -``` - -**spark-nlp-silicon:** - -```sbtshell -// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-silicon -libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-silicon" % "5.4.0" -``` - -Maven -Central: [https://mvnrepository.com/artifact/com.johnsnowlabs.nlp](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp) +deployed to Maven central. To add any of our packages as a dependency in your application you can follow [these instructions](https://sparknlp.org/docs/en/install#scala-and-java) +from our official documentation. If you are interested, there is a simple SBT project for Spark NLP to guide you on how to use it in your projects [Spark NLP SBT Starter](https://github.com/maziyarpanahi/spark-nlp-starter) -## Python - -Spark NLP supports Python 3.6.x and above depending on your major PySpark version. - -### Python without explicit Pyspark installation - -### Pip/Conda - -If you installed pyspark through pip/conda, you can install `spark-nlp` through the same channel. - -Pip: - -```bash -pip install spark-nlp==5.4.0 -``` - -Conda: - -```bash -conda install -c johnsnowlabs spark-nlp -``` - -PyPI [spark-nlp package](https://pypi.org/project/spark-nlp/) / -Anaconda [spark-nlp package](https://anaconda.org/JohnSnowLabs/spark-nlp) - -Then you'll have to create a SparkSession either from Spark NLP: - -```python -import sparknlp - -spark = sparknlp.start() -``` - -or manually: - -```python -spark = SparkSession.builder - .appName("Spark NLP") - .master("local[*]") - .config("spark.driver.memory", "16G") - .config("spark.driver.maxResultSize", "0") - .config("spark.kryoserializer.buffer.max", "2000M") - .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0") - .getOrCreate() -``` - -If using local jars, you can use `spark.jars` instead for comma-delimited jar files. For cluster setups, of course, -you'll have to put the jars in a reachable location for all driver and executor nodes. 
- -**Quick example:** - -```python -import sparknlp -from sparknlp.pretrained import PretrainedPipeline - -# create or get Spark Session - -spark = sparknlp.start() - -sparknlp.version() -spark.version - -# download, load and annotate a text by pre-trained pipeline - -pipeline = PretrainedPipeline('recognize_entities_dl', 'en') -result = pipeline.annotate('The Mona Lisa is a 16th century oil painting created by Leonardo') -``` - -## Compiled JARs - -### Build from source - -#### spark-nlp - -- FAT-JAR for CPU on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x - -```bash -sbt assembly -``` - -- FAT-JAR for GPU on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x - -```bash -sbt -Dis_gpu=true assembly -``` - -- FAT-JAR for M! on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x - -```bash -sbt -Dis_silicon=true assembly -``` - -### Using the jar manually - -If for some reason you need to use the JAR, you can either download the Fat JARs provided here or download it -from [Maven Central](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp). - -To add JARs to spark programs use the `--jars` option: - -```sh -spark-shell --jars spark-nlp.jar -``` - -The preferred way to use the library when running spark programs is using the `--packages` option as specified in -the `spark-packages` section. - -## Apache Zeppelin - -Use either one of the following options - -- Add the following Maven Coordinates to the interpreter's library list - -```bash -com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0 -``` - -- Add a path to pre-built jar from [here](#compiled-jars) in the interpreter's library list making sure the jar is - available to driver path - -### Python in Zeppelin - -Apart from the previous step, install the python module through pip - -```bash -pip install spark-nlp==5.4.0 -``` - -Or you can install `spark-nlp` from inside Zeppelin by using Conda: - -```bash -python.conda install -c johnsnowlabs spark-nlp -``` - -Configure Zeppelin properly, use cells with %spark.pyspark or any interpreter name you chose. - -Finally, in Zeppelin interpreter settings, make sure you set properly zeppelin.python to the python you want to use and -install the pip library with (e.g. `python3`). - -An alternative option would be to set `SPARK_SUBMIT_OPTIONS` (zeppelin-env.sh) and make sure `--packages` is there as -shown earlier since it includes both scala and python side installation. - -## Jupyter Notebook (Python) - -**Recommended:** - -The easiest way to get this done on Linux and macOS is to simply install `spark-nlp` and `pyspark` PyPI packages and -launch the Jupyter from the same Python environment: - -```sh -$ conda create -n sparknlp python=3.8 -y -$ conda activate sparknlp -# spark-nlp by default is based on pyspark 3.x -$ pip install spark-nlp==5.4.0 pyspark==3.3.1 jupyter -$ jupyter notebook -``` - -Then you can use `python3` kernel to run your code with creating SparkSession via `spark = sparknlp.start()`. 
- -**Optional:** - -If you are in different operating systems and require to make Jupyter Notebook run by using pyspark, you can follow -these steps: - -```bash -export SPARK_HOME=/path/to/your/spark/folder -export PYSPARK_PYTHON=python3 -export PYSPARK_DRIVER_PYTHON=jupyter -export PYSPARK_DRIVER_PYTHON_OPTS=notebook - -pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0 -``` - -Alternatively, you can mix in using `--jars` option for pyspark + `pip install spark-nlp` - -If not using pyspark at all, you'll have to run the instructions -pointed [here](#python-without-explicit-pyspark-installation) - -## Google Colab Notebook - -Google Colab is perhaps the easiest way to get started with spark-nlp. It requires no installation or setup other than -having a Google account. - -Run the following code in Google Colab notebook and start using spark-nlp right away. - -```sh -# This is only to setup PySpark and Spark NLP on Colab -!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash -``` - -This script comes with the two options to define `pyspark` and `spark-nlp` versions via options: - -```sh -# -p is for pyspark -# -s is for spark-nlp -# -g will enable upgrading libcudnn8 to 8.1.0 on Google Colab for GPU usage -# by default they are set to the latest -!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.4.0 -``` - -[Spark NLP quick start on Google Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/quick_start_google_colab.ipynb) -is a live demo on Google Colab that performs named entity recognitions and sentiment analysis by using Spark NLP -pretrained pipelines. - -## Kaggle Kernel - -Run the following code in Kaggle Kernel and start using spark-nlp right away. - -```sh -# Let's setup Kaggle for Spark NLP and PySpark -!wget https://setup.johnsnowlabs.com/kaggle.sh -O - | bash -``` - -This script comes with the two options to define `pyspark` and `spark-nlp` versions via options: - -```sh -# -p is for pyspark -# -s is for spark-nlp -# -g will enable upgrading libcudnn8 to 8.1.0 on Kaggle for GPU usage -# by default they are set to the latest -!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.4.0 -``` - -[Spark NLP quick start on Kaggle Kernel](https://www.kaggle.com/mozzie/spark-nlp-named-entity-recognition) is a live -demo on Kaggle Kernel that performs named entity recognitions by using Spark NLP pretrained pipeline. - -## Databricks Cluster - -1. Create a cluster if you don't have one already - -2. On a new cluster or existing one you need to add the following to the `Advanced Options -> Spark` tab: - - ```bash - spark.kryoserializer.buffer.max 2000M - spark.serializer org.apache.spark.serializer.KryoSerializer - ``` - -3. In `Libraries` tab inside your cluster you need to follow these steps: - - 3.1. Install New -> PyPI -> `spark-nlp==5.4.0` -> Install - - 3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0` -> Install - -4. Now you can attach your notebook to the cluster and use Spark NLP! - -NOTE: Databricks' runtimes support different Apache Spark major releases. Please make sure you choose the correct Spark -NLP Maven package name (Maven Coordinate) for your runtime from -our [Packages Cheatsheet](https://github.com/JohnSnowLabs/spark-nlp#packages-cheatsheet) - -## EMR Cluster - -To launch EMR clusters with Apache Spark/PySpark and Spark NLP correctly you need to have bootstrap and software -configuration. 
- -A sample of your bootstrap script - -```.sh -#!/bin/bash -set -x -e - -echo -e 'export PYSPARK_PYTHON=/usr/bin/python3 -export HADOOP_CONF_DIR=/etc/hadoop/conf -export SPARK_JARS_DIR=/usr/lib/spark/jars -export SPARK_HOME=/usr/lib/spark' >> $HOME/.bashrc && source $HOME/.bashrc - -sudo python3 -m pip install awscli boto spark-nlp - -set +x -exit 0 - -``` - -A sample of your software configuration in JSON on S3 (must be public access): - -```.json -[{ - "Classification": "spark-env", - "Configurations": [{ - "Classification": "export", - "Properties": { - "PYSPARK_PYTHON": "/usr/bin/python3" - } - }] -}, -{ - "Classification": "spark-defaults", - "Properties": { - "spark.yarn.stagingDir": "hdfs:///tmp", - "spark.yarn.preserve.staging.files": "true", - "spark.kryoserializer.buffer.max": "2000M", - "spark.serializer": "org.apache.spark.serializer.KryoSerializer", - "spark.driver.maxResultSize": "0", - "spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0" - } -}] -``` - -A sample of AWS CLI to launch EMR cluster: - -```.sh -aws emr create-cluster \ ---name "Spark NLP 5.4.0" \ ---release-label emr-6.2.0 \ ---applications Name=Hadoop Name=Spark Name=Hive \ ---instance-type m4.4xlarge \ ---instance-count 3 \ ---use-default-roles \ ---log-uri "s3:///" \ ---bootstrap-actions Path=s3:///emr-bootstrap.sh,Name=custome \ ---configurations "https:///sparknlp-config.json" \ ---ec2-attributes KeyName=,EmrManagedMasterSecurityGroup=,EmrManagedSlaveSecurityGroup= \ ---profile -``` - -## GCP Dataproc - -1. Create a cluster if you don't have one already as follows. - -At gcloud shell: - -```bash -gcloud services enable dataproc.googleapis.com \ - compute.googleapis.com \ - storage-component.googleapis.com \ - bigquery.googleapis.com \ - bigquerystorage.googleapis.com -``` - -```bash -REGION= -``` - -```bash -BUCKET_NAME= -gsutil mb -c standard -l ${REGION} gs://${BUCKET_NAME} -``` - -```bash -REGION= -ZONE= -CLUSTER_NAME= -BUCKET_NAME= -``` - -You can set image-version, master-machine-type, worker-machine-type, -master-boot-disk-size, worker-boot-disk-size, num-workers as your needs. -If you use the previous image-version from 2.0, you should also add ANACONDA to optional-components. -And, you should enable gateway. -Don't forget to set the maven coordinates for the jar in properties. - -```bash -gcloud dataproc clusters create ${CLUSTER_NAME} \ - --region=${REGION} \ - --zone=${ZONE} \ - --image-version=2.0 \ - --master-machine-type=n1-standard-4 \ - --worker-machine-type=n1-standard-2 \ - --master-boot-disk-size=128GB \ - --worker-boot-disk-size=128GB \ - --num-workers=2 \ - --bucket=${BUCKET_NAME} \ - --optional-components=JUPYTER \ - --enable-component-gateway \ - --metadata 'PIP_PACKAGES=spark-nlp spark-nlp-display google-cloud-bigquery google-cloud-storage' \ - --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh \ - --properties spark:spark.serializer=org.apache.spark.serializer.KryoSerializer,spark:spark.driver.maxResultSize=0,spark:spark.kryoserializer.buffer.max=2000M,spark:spark.jars.packages=com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0 -``` - -2. On an existing one, you need to install spark-nlp and spark-nlp-display packages from PyPI. +### Python -3. Now, you can attach your notebook to the cluster and use the Spark NLP! +Spark NLP supports Python 3.7.x and above depending on your major PySpark version. 
+Check all available installations for Python in our official [documentation](https://sparknlp.org/docs/en/install#python) -## Spark NLP Configuration -You can change the following Spark NLP configurations via Spark Configuration: +### Compiled JARs +To compile the jars from source follow [these instructions](https://sparknlp.org/docs/en/compiled#jars) from our official documenation -| Property Name | Default | Meaning | -|---------------------------------------------------------|----------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| `spark.jsl.settings.pretrained.cache_folder` | `~/cache_pretrained` | The location to download and extract pretrained `Models` and `Pipelines`. By default, it will be in User's Home directory under `cache_pretrained` directory | -| `spark.jsl.settings.storage.cluster_tmp_dir` | `hadoop.tmp.dir` | The location to use on a cluster for temporarily files such as unpacking indexes for WordEmbeddings. By default, this locations is the location of `hadoop.tmp.dir` set via Hadoop configuration for Apache Spark. NOTE: `S3` is not supported and it must be local, HDFS, or DBFS | -| `spark.jsl.settings.annotator.log_folder` | `~/annotator_logs` | The location to save logs from annotators during training such as `NerDLApproach`, `ClassifierDLApproach`, `SentimentDLApproach`, `MultiClassifierDLApproach`, etc. By default, it will be in User's Home directory under `annotator_logs` directory | -| `spark.jsl.settings.aws.credentials.access_key_id` | `None` | Your AWS access key to use your S3 bucket to store log files of training models or access tensorflow graphs used in `NerDLApproach` | -| `spark.jsl.settings.aws.credentials.secret_access_key` | `None` | Your AWS secret access key to use your S3 bucket to store log files of training models or access tensorflow graphs used in `NerDLApproach` | -| `spark.jsl.settings.aws.credentials.session_token` | `None` | Your AWS MFA session token to use your S3 bucket to store log files of training models or access tensorflow graphs used in `NerDLApproach` | -| `spark.jsl.settings.aws.s3_bucket` | `None` | Your AWS S3 bucket to store log files of training models or access tensorflow graphs used in `NerDLApproach` | -| `spark.jsl.settings.aws.region` | `None` | Your AWS region to use your S3 bucket to store log files of training models or access tensorflow graphs used in `NerDLApproach` | -| `spark.jsl.settings.onnx.gpuDeviceId` | `0` | Constructs CUDA execution provider options for the specified non-negative device id. | -| `spark.jsl.settings.onnx.intraOpNumThreads` | `6` | Sets the size of the CPU thread pool used for executing a single graph, if executing on a CPU. | -| `spark.jsl.settings.onnx.optimizationLevel` | `ALL_OPT` | Sets the optimization level of this options object, overriding the old setting. | -| `spark.jsl.settings.onnx.executionMode` | `SEQUENTIAL` | Sets the execution mode of this options object, overriding the old setting. 
|
 
-### How to set Spark NLP Configuration
+## Platform-Specific Instructions
 
-**SparkSession:**
-
-You can use `.config()` during SparkSession creation to set Spark NLP configurations.
+For detailed instructions on how to use Spark NLP on supported platforms, please refer to our official documentation:
 
-```python
-from pyspark.sql import SparkSession
-
-spark = SparkSession.builder
-    .master("local[*]")
-    .config("spark.driver.memory", "16G")
-    .config("spark.driver.maxResultSize", "0")
-    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
-    .config("spark.kryoserializer.buffer.max", "2000m")
-    .config("spark.jsl.settings.pretrained.cache_folder", "sample_data/pretrained")
-    .config("spark.jsl.settings.storage.cluster_tmp_dir", "sample_data/storage")
-    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0")
-    .getOrCreate()
-```
+| Platform | Supported Language(s) |
+|-------------------------|-----------------------|
+| [Apache Zeppelin](https://sparknlp.org/docs/en/install#apache-zeppelin) | Scala, Python |
+| [Jupyter Notebook](https://sparknlp.org/docs/en/install#jupter-notebook) | Python |
+| [Google Colab Notebook](https://sparknlp.org/docs/en/install#google-colab-notebook) | Python |
+| [Kaggle Kernel](https://sparknlp.org/docs/en/install#kaggle-kernel) | Python |
+| [Databricks Cluster](https://sparknlp.org/docs/en/install#databricks-cluster) | Scala, Python |
+| [EMR Cluster](https://sparknlp.org/docs/en/install#emr-cluster) | Scala, Python |
+| [GCP Dataproc Cluster](https://sparknlp.org/docs/en/install#gcp-dataproc) | Scala, Python |
 
-**spark-shell:**
-
-```sh
-spark-shell \
-  --driver-memory 16g \
-  --conf spark.driver.maxResultSize=0 \
-  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer
-  --conf spark.kryoserializer.buffer.max=2000M \
-  --conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \
-  --conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \
-  --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0
-```
-**pyspark:**
+### Offline
 
-```sh
-pyspark \
-  --driver-memory 16g \
-  --conf spark.driver.maxResultSize=0 \
-  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer
-  --conf spark.kryoserializer.buffer.max=2000M \
-  --conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \
-  --conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \
-  --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0
-```
-
-**Databricks:**
-
-On a new cluster or existing one you need to add the following to the `Advanced Options -> Spark` tab:
+The Spark NLP library and all pre-trained models/pipelines can be used entirely offline, with no access to the Internet.
+Please check [these instructions](https://sparknlp.org/docs/en/install#s3-integration) from our official documentation
+to use Spark NLP offline.
 
-```bash
-spark.kryoserializer.buffer.max 2000M
-spark.serializer org.apache.spark.serializer.KryoSerializer
-spark.jsl.settings.pretrained.cache_folder dbfs:/PATH_TO_CACHE
-spark.jsl.settings.storage.cluster_tmp_dir dbfs:/PATH_TO_STORAGE
-spark.jsl.settings.annotator.log_folder dbfs:/PATH_TO_LOGS
-```
+## Advanced Settings
 
-NOTE: If this is an existing cluster, after adding new configs or changing existing properties you need to restart it.
+You can change Spark NLP configurations via Spark properties.
+Please check [these instructions](https://sparknlp.org/docs/en/install#sparknlp-properties) from our official documentation. 
### S3 Integration @@ -991,302 +225,24 @@ In Spark NLP we can define S3 locations to: - Export log files of training models - Store tensorflow graphs used in `NerDLApproach` -**Logging:** - -To configure S3 path for logging while training models. We need to set up AWS credentials as well as an S3 path - -```bash -spark.conf.set("spark.jsl.settings.annotator.log_folder", "s3://my/s3/path/logs") -spark.conf.set("spark.jsl.settings.aws.credentials.access_key_id", "MY_KEY_ID") -spark.conf.set("spark.jsl.settings.aws.credentials.secret_access_key", "MY_SECRET_ACCESS_KEY") -spark.conf.set("spark.jsl.settings.aws.s3_bucket", "my.bucket") -spark.conf.set("spark.jsl.settings.aws.region", "my-region") -``` - -Now you can check the log on your S3 path defined in *spark.jsl.settings.annotator.log_folder* property. -Make sure to use the prefix *s3://*, otherwise it will use the default configuration. - -**Tensorflow Graphs:** - -To reference S3 location for downloading graphs. We need to set up AWS credentials - -```bash -spark.conf.set("spark.jsl.settings.aws.credentials.access_key_id", "MY_KEY_ID") -spark.conf.set("spark.jsl.settings.aws.credentials.secret_access_key", "MY_SECRET_ACCESS_KEY") -spark.conf.set("spark.jsl.settings.aws.region", "my-region") -``` - -**MFA Configuration:** - -In case your AWS account is configured with MFA. You will need first to get temporal credentials and add session token -to the configuration as shown in the examples below -For logging: - -```bash -spark.conf.set("spark.jsl.settings.aws.credentials.session_token", "MY_TOKEN") -``` - -An example of a bash script that gets temporal AWS credentials can be -found [here](https://github.com/JohnSnowLabs/spark-nlp/blob/master/scripts/aws_tmp_credentials.sh) -This script requires three arguments: - -```bash -./aws_tmp_credentials.sh iam_user duration serial_number -``` - -## Pipelines and Models - -### Pipelines - -**Quick example:** - -```scala -import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline -import com.johnsnowlabs.nlp.SparkNLP - -SparkNLP.version() - -val testData = spark.createDataFrame(Seq( - (1, "Google has announced the release of a beta version of the popular TensorFlow machine learning library"), - (2, "Donald John Trump (born June 14, 1946) is the 45th and current president of the United States") -)).toDF("id", "text") - -val pipeline = PretrainedPipeline("explain_document_dl", lang = "en") - -val annotation = pipeline.transform(testData) - -annotation.show() -/* -import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline -import com.johnsnowlabs.nlp.SparkNLP -2.5.0 -testData: org.apache.spark.sql.DataFrame = [id: int, text: string] -pipeline: com.johnsnowlabs.nlp.pretrained.PretrainedPipeline = PretrainedPipeline(explain_document_dl,en,public/models) -annotation: org.apache.spark.sql.DataFrame = [id: int, text: string ... 
10 more fields] -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -| id| text| document| token| sentence| checked| lemma| stem| pos| embeddings| ner| entities| -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -| 1|Google has announ...|[[document, 0, 10...|[[token, 0, 5, Go...|[[document, 0, 10...|[[token, 0, 5, Go...|[[token, 0, 5, Go...|[[token, 0, 5, go...|[[pos, 0, 5, NNP,...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 5, Go...| -| 2|The Paris metro w...|[[document, 0, 11...|[[token, 0, 2, Th...|[[document, 0, 11...|[[token, 0, 2, Th...|[[token, 0, 2, Th...|[[token, 0, 2, th...|[[pos, 0, 2, DT, ...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 4, 8, Pa...| -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -*/ - -annotation.select("entities.result").show(false) - -/* -+----------------------------------+ -|result | -+----------------------------------+ -|[Google, TensorFlow] | -|[Donald John Trump, United States]| -+----------------------------------+ -*/ -``` - -#### Showing Available Pipelines - -There are functions in Spark NLP that will list all the available Pipelines -of a particular language for you: - -```scala -import com.johnsnowlabs.nlp.pretrained.ResourceDownloader - -ResourceDownloader.showPublicPipelines(lang = "en") -/* -+--------------------------------------------+------+---------+ -| Pipeline | lang | version | -+--------------------------------------------+------+---------+ -| dependency_parse | en | 2.0.2 | -| analyze_sentiment_ml | en | 2.0.2 | -| check_spelling | en | 2.1.0 | -| match_datetime | en | 2.1.0 | - ... -| explain_document_ml | en | 3.1.3 | -+--------------------------------------------+------+---------+ -*/ -``` - -Or if we want to check for a particular version: - -```scala -import com.johnsnowlabs.nlp.pretrained.ResourceDownloader - -ResourceDownloader.showPublicPipelines(lang = "en", version = "3.1.0") -/* -+---------------------------------------+------+---------+ -| Pipeline | lang | version | -+---------------------------------------+------+---------+ -| dependency_parse | en | 2.0.2 | - ... -| clean_slang | en | 3.0.0 | -| clean_pattern | en | 3.0.0 | -| check_spelling | en | 3.0.0 | -| dependency_parse | en | 3.0.0 | -+---------------------------------------+------+---------+ -*/ -``` - -#### Please check out our Models Hub for the full list of [pre-trained pipelines](https://sparknlp.org/models) with examples, demos, benchmarks, and more - -### Models +Please check [these instructions](https://sparknlp.org/docs/en/install#s3-integration) from our official documentation. 
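+
+For example, pointing training logs at S3 boils down to a handful of properties like these (a minimal sketch; the bucket, region, and path values are placeholders, and the full credential setup is covered in the linked instructions):
+
+```python
+spark.conf.set("spark.jsl.settings.annotator.log_folder", "s3://my/s3/path/logs")
+spark.conf.set("spark.jsl.settings.aws.s3_bucket", "my.bucket")
+spark.conf.set("spark.jsl.settings.aws.region", "my-region")
+```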
-**Some selected languages: -** `Afrikaans, Arabic, Armenian, Basque, Bengali, Breton, Bulgarian, Catalan, Czech, Dutch, English, Esperanto, Finnish, French, Galician, German, Greek, Hausa, Hebrew, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Latin, Latvian, Marathi, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Somali, Southern Sotho, Spanish, Swahili, Swedish, Tswana, Turkish, Ukrainian, Zulu` +## Documentation -**Quick online example:** - -```python -# load NER model trained by deep learning approach and GloVe word embeddings -ner_dl = NerDLModel.pretrained('ner_dl') -# load NER model trained by deep learning approach and BERT word embeddings -ner_bert = NerDLModel.pretrained('ner_dl_bert') -``` - -```scala -// load French POS tagger model trained by Universal Dependencies -val french_pos = PerceptronModel.pretrained("pos_ud_gsd", lang = "fr") -// load Italian LemmatizerModel -val italian_lemma = LemmatizerModel.pretrained("lemma_dxc", lang = "it") -```` - -**Quick offline example:** - -- Loading `PerceptronModel` annotator model inside Spark NLP Pipeline - -```scala -val french_pos = PerceptronModel.load("/tmp/pos_ud_gsd_fr_2.0.2_2.4_1556531457346/") - .setInputCols("document", "token") - .setOutputCol("pos") -``` - -#### Showing Available Models - -There are functions in Spark NLP that will list all the available Models -of a particular Annotator and language for you: - -```scala -import com.johnsnowlabs.nlp.pretrained.ResourceDownloader - -ResourceDownloader.showPublicModels(annotator = "NerDLModel", lang = "en") -/* -+---------------------------------------------+------+---------+ -| Model | lang | version | -+---------------------------------------------+------+---------+ -| onto_100 | en | 2.1.0 | -| onto_300 | en | 2.1.0 | -| ner_dl_bert | en | 2.2.0 | -| onto_100 | en | 2.4.0 | -| ner_conll_elmo | en | 3.2.2 | -+---------------------------------------------+------+---------+ -*/ -``` - -Or if we want to check for a particular version: - -```scala -import com.johnsnowlabs.nlp.pretrained.ResourceDownloader - -ResourceDownloader.showPublicModels(annotator = "NerDLModel", lang = "en", version = "3.1.0") -/* -+----------------------------+------+---------+ -| Model | lang | version | -+----------------------------+------+---------+ -| onto_100 | en | 2.1.0 | -| ner_aspect_based_sentiment | en | 2.6.2 | -| ner_weibo_glove_840B_300d | en | 2.6.2 | -| nerdl_atis_840b_300d | en | 2.7.1 | -| nerdl_snips_100d | en | 2.7.3 | -+----------------------------+------+---------+ -*/ -``` - -And to see a list of available annotators, you can use: - -```scala -import com.johnsnowlabs.nlp.pretrained.ResourceDownloader - -ResourceDownloader.showAvailableAnnotators() -/* -AlbertEmbeddings -AlbertForTokenClassification -AssertionDLModel -... -XlmRoBertaSentenceEmbeddings -XlnetEmbeddings -*/ -``` - -#### Please check out our Models Hub for the full list of [pre-trained models](https://sparknlp.org/models) with examples, demo, benchmark, and more - -## Offline - -Spark NLP library and all the pre-trained models/pipelines can be used entirely offline with no access to the Internet. 
-If you are behind a proxy or a firewall with no access to the Maven repository (to download packages) or/and no access -to S3 (to automatically download models and pipelines), you can simply follow the instructions to have Spark NLP without -any limitations offline: - -- Instead of using the Maven package, you need to load our Fat JAR -- Instead of using PretrainedPipeline for pretrained pipelines or the `.pretrained()` function to download pretrained - models, you will need to manually download your pipeline/model from [Models Hub](https://sparknlp.org/models), - extract it, and load it. - -Example of `SparkSession` with Fat JAR to have Spark NLP offline: - -```python -spark = SparkSession.builder - .appName("Spark NLP") - .master("local[*]") - .config("spark.driver.memory", "16G") - .config("spark.driver.maxResultSize", "0") - .config("spark.kryoserializer.buffer.max", "2000M") - .config("spark.jars", "/tmp/spark-nlp-assembly-5.4.0.jar") - .getOrCreate() -``` - -- You can download provided Fat JARs from each [release notes](https://github.com/JohnSnowLabs/spark-nlp/releases), - please pay attention to pick the one that suits your environment depending on the device (CPU/GPU) and Apache Spark - version (3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x) -- If you are local, you can load the Fat JAR from your local FileSystem, however, if you are in a cluster setup you need - to put the Fat JAR on a distributed FileSystem such as HDFS, DBFS, S3, etc. ( - i.e., `hdfs:///tmp/spark-nlp-assembly-5.4.0.jar`) - -Example of using pretrained Models and Pipelines in offline: - -```python -# instead of using pretrained() for online: -# french_pos = PerceptronModel.pretrained("pos_ud_gsd", lang="fr") -# you download this model, extract it, and use .load -french_pos = PerceptronModel.load("/tmp/pos_ud_gsd_fr_2.0.2_2.4_1556531457346/") - .setInputCols("document", "token") - .setOutputCol("pos") - -# example for pipelines -# instead of using PretrainedPipeline -# pipeline = PretrainedPipeline('explain_document_dl', lang='en') -# you download this pipeline, extract it, and use PipelineModel -PipelineModel.load("/tmp/explain_document_dl_en_2.0.2_2.4_1556530585689/") -``` - -- Since you are downloading and loading models/pipelines manually, this means Spark NLP is not downloading the most - recent and compatible models/pipelines for you. Choosing the right model/pipeline is on you -- If you are local, you can load the model/pipeline from your local FileSystem, however, if you are in a cluster setup - you need to put the model/pipeline on a distributed FileSystem such as HDFS, DBFS, S3, etc. ( - i.e., `hdfs:///tmp/explain_document_dl_en_2.0.2_2.4_1556530585689/`) - -## Examples +### Examples Need more **examples**? Check out our dedicated [Spark NLP Examples](https://github.com/JohnSnowLabs/spark-nlp/tree/master/examples) repository to showcase all Spark NLP use cases! Also, don't forget to check [Spark NLP in Action](https://sparknlp.org/demo) built by Streamlit. 
-### All examples: [spark-nlp/examples](https://github.com/JohnSnowLabs/spark-nlp/tree/master/examples) +#### All examples: [spark-nlp/examples](https://github.com/JohnSnowLabs/spark-nlp/tree/master/examples) -## FAQ +### FAQ [Check our Articles and Videos page here](https://sparknlp.org/learn) -## Citation +### Citation We have published a [paper](https://www.sciencedirect.com/science/article/pii/S2665963821000063) that you can cite for the Spark NLP library: @@ -1307,6 +263,15 @@ the Spark NLP library: } ``` +## Community support + +- [Slack](https://join.slack.com/t/spark-nlp/shared_invite/zt-198dipu77-L3UWNe_AJ8xqDk0ivmih5Q) For live discussion with the Spark NLP community and the team +- [GitHub](https://github.com/JohnSnowLabs/spark-nlp) Bug reports, feature requests, and contributions +- [Discussions](https://github.com/JohnSnowLabs/spark-nlp/discussions) Engage with other community members, share ideas, + and show off how you use Spark NLP! +- [Medium](https://medium.com/spark-nlp) Spark NLP articles +- [YouTube](https://www.youtube.com/channel/UCmFOjlpYEhxf_wJUDuz6xxQ/videos) Spark NLP video tutorials + ## Contributing We appreciate any sort of contributions: diff --git a/docs/_data/navigation.yml b/docs/_data/navigation.yml index 21b4f372614dd6..c6e75a2a846237 100755 --- a/docs/_data/navigation.yml +++ b/docs/_data/navigation.yml @@ -36,6 +36,12 @@ sparknlp: url: /docs/en/quickstart - title: Install Spark NLP url: /docs/en/install + - title: Advanced Settings + url: /docs/en/advanced_settings + - title: Features + url: /docs/en/features + - title: Pipelines and Models + url: /docs/en/pipelines - title: General Concepts url: /docs/en/concepts - title: Annotators diff --git a/docs/en/advanced_settings.md b/docs/en/advanced_settings.md new file mode 100644 index 00000000000000..84c8dc5751187e --- /dev/null +++ b/docs/en/advanced_settings.md @@ -0,0 +1,142 @@ +--- +layout: docs +header: true +seotitle: Spark NLP - Advanced Settings +title: Spark NLP - Advanced Settings +permalink: /docs/en/advanced_settings +key: docs-install +modify_date: "2024-07-04" +show_nav: true +sidebar: + nav: sparknlp +--- + +
+
+## SparkNLP Properties
+
+You can change the following Spark NLP configurations via Spark Configuration:
+
+| Property Name | Default | Meaning |
+|---------------------------------------------------------|----------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| `spark.jsl.settings.pretrained.cache_folder` | `~/cache_pretrained` | The location to download and extract pretrained `Models` and `Pipelines`. By default, it will be in the user's home directory under the `cache_pretrained` directory |
+| `spark.jsl.settings.storage.cluster_tmp_dir` | `hadoop.tmp.dir` | The location to use on a cluster for temporary files, such as unpacked indexes for WordEmbeddings. By default, this location is the value of `hadoop.tmp.dir` set via Hadoop configuration for Apache Spark. NOTE: `S3` is not supported and it must be local, HDFS, or DBFS |
+| `spark.jsl.settings.annotator.log_folder` | `~/annotator_logs` | The location to save logs from annotators during training, such as `NerDLApproach`, `ClassifierDLApproach`, `SentimentDLApproach`, `MultiClassifierDLApproach`, etc. By default, it will be in the user's home directory under the `annotator_logs` directory |
+| `spark.jsl.settings.aws.credentials.access_key_id` | `None` | Your AWS access key to use your S3 bucket to store log files of training models or access tensorflow graphs used in `NerDLApproach` |
+| `spark.jsl.settings.aws.credentials.secret_access_key` | `None` | Your AWS secret access key to use your S3 bucket to store log files of training models or access tensorflow graphs used in `NerDLApproach` |
+| `spark.jsl.settings.aws.credentials.session_token` | `None` | Your AWS MFA session token to use your S3 bucket to store log files of training models or access tensorflow graphs used in `NerDLApproach` |
+| `spark.jsl.settings.aws.s3_bucket` | `None` | Your AWS S3 bucket to store log files of training models or access tensorflow graphs used in `NerDLApproach` |
+| `spark.jsl.settings.aws.region` | `None` | Your AWS region to use your S3 bucket to store log files of training models or access tensorflow graphs used in `NerDLApproach` |
+| `spark.jsl.settings.onnx.gpuDeviceId` | `0` | Constructs CUDA execution provider options for the specified non-negative device id. |
+| `spark.jsl.settings.onnx.intraOpNumThreads` | `6` | Sets the size of the CPU thread pool used for executing a single graph, if executing on a CPU. |
+| `spark.jsl.settings.onnx.optimizationLevel` | `ALL_OPT` | Sets the optimization level of this options object, overriding the old setting. |
+| `spark.jsl.settings.onnx.executionMode` | `SEQUENTIAL` | Sets the execution mode of this options object, overriding the old setting. |
+
+### How to set Spark NLP Configuration
+
+**SparkSession:**
+
+You can use `.config()` during SparkSession creation to set Spark NLP configurations. 
+
+```python
+from pyspark.sql import SparkSession
+
+spark = SparkSession.builder \
+    .master("local[*]") \
+    .config("spark.driver.memory", "16G") \
+    .config("spark.driver.maxResultSize", "0") \
+    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
+    .config("spark.kryoserializer.buffer.max", "2000m") \
+    .config("spark.jsl.settings.pretrained.cache_folder", "sample_data/pretrained") \
+    .config("spark.jsl.settings.storage.cluster_tmp_dir", "sample_data/storage") \
+    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0") \
+    .getOrCreate()
+```
+
+**spark-shell:**
+
+```sh
+spark-shell \
+  --driver-memory 16g \
+  --conf spark.driver.maxResultSize=0 \
+  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
+  --conf spark.kryoserializer.buffer.max=2000M \
+  --conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \
+  --conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \
+  --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0
+```
+
+**pyspark:**
+
+```sh
+pyspark \
+  --driver-memory 16g \
+  --conf spark.driver.maxResultSize=0 \
+  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
+  --conf spark.kryoserializer.buffer.max=2000M \
+  --conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \
+  --conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \
+  --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0
+```
+
+**Databricks:**
+
+On a new cluster or existing one you need to add the following to the `Advanced Options -> Spark` tab:
+
+```bash
+spark.kryoserializer.buffer.max 2000M
+spark.serializer org.apache.spark.serializer.KryoSerializer
+spark.jsl.settings.pretrained.cache_folder dbfs:/PATH_TO_CACHE
+spark.jsl.settings.storage.cluster_tmp_dir dbfs:/PATH_TO_STORAGE
+spark.jsl.settings.annotator.log_folder dbfs:/PATH_TO_LOGS
+```
+
+NOTE: If this is an existing cluster, after adding new configs or changing existing properties you need to restart it.
+
+
+### S3 Integration
+
+**Logging:**
+
+To configure an S3 path for logging while training models, we need to set up AWS credentials as well as an S3 path:
+
+```bash
+spark.conf.set("spark.jsl.settings.annotator.log_folder", "s3://my/s3/path/logs")
+spark.conf.set("spark.jsl.settings.aws.credentials.access_key_id", "MY_KEY_ID")
+spark.conf.set("spark.jsl.settings.aws.credentials.secret_access_key", "MY_SECRET_ACCESS_KEY")
+spark.conf.set("spark.jsl.settings.aws.s3_bucket", "my.bucket")
+spark.conf.set("spark.jsl.settings.aws.region", "my-region")
+```
+
+Now you can check the logs on the S3 path defined in the *spark.jsl.settings.annotator.log_folder* property.
+Make sure to use the prefix *s3://*, otherwise it will use the default configuration.
+
+**Tensorflow Graphs:**
+
+To reference an S3 location for downloading graphs, we need to set up AWS credentials:
+
+```bash
+spark.conf.set("spark.jsl.settings.aws.credentials.access_key_id", "MY_KEY_ID")
+spark.conf.set("spark.jsl.settings.aws.credentials.secret_access_key", "MY_SECRET_ACCESS_KEY")
+spark.conf.set("spark.jsl.settings.aws.region", "my-region")
+```
+
+**MFA Configuration:**
+
+In case your AWS account is configured with MFA, 
you will first need to get temporary credentials and add the session token
+to the configuration as shown in the example below.
+For logging:
+
+```bash
+spark.conf.set("spark.jsl.settings.aws.credentials.session_token", "MY_TOKEN")
+```
+
+An example of a bash script that gets temporary AWS credentials can be
+found [here](https://github.com/JohnSnowLabs/spark-nlp/blob/master/scripts/aws_tmp_credentials.sh).
+This script requires three arguments:
+
+```bash
+./aws_tmp_credentials.sh iam_user duration serial_number
+```
+
+</div>
\ No newline at end of file diff --git a/docs/en/features.md b/docs/en/features.md new file mode 100644 index 00000000000000..1a9a5b80470828 --- /dev/null +++ b/docs/en/features.md @@ -0,0 +1,120 @@ +--- +layout: docs +header: true +seotitle: Spark NLP - Features +title: Spark NLP - Features +permalink: /docs/en/features +key: docs-install +modify_date: "2024-07-03" +show_nav: true +sidebar: + nav: sparknlp +--- + + +
+ +## Text Preprocessing +- Tokenization +- Trainable Word Segmentation +- Stop Words Removal +- Token Normalizer +- Document Normalizer +- Document & Text Splitter +- Stemmer +- Lemmatizer +- NGrams +- Regex Matching +- Text Matching +- Spell Checker (ML and DL models) + +## Parsing and Analysis +- Chunking +- Date Matcher +- Sentence Detector +- Deep Sentence Detector (Deep learning) +- Dependency parsing (Labeled/unlabeled) +- SpanBertCorefModel (Coreference Resolution) +- Part-of-speech tagging +- Named entity recognition (Deep learning) +- Unsupervised keywords extraction +- Language Detection & Identification (up to 375 languages) + +## Sentiment and Classification +- Sentiment Detection (ML models) +- Multi-class & Multi-label Sentiment analysis (Deep learning) +- Multi-class Text Classification (Deep learning) +- Zero-Shot NER Model +- Zero-Shot Text Classification by Transformers (ZSL) + +## Embeddings +- Word Embeddings (GloVe and Word2Vec) +- Doc2Vec (based on Word2Vec) +- BERT Embeddings (TF Hub & HuggingFace models) +- DistilBERT Embeddings (HuggingFace models) +- CamemBERT Embeddings (HuggingFace models) +- RoBERTa Embeddings (HuggingFace models) +- DeBERTa Embeddings (HuggingFace v2 & v3 models) +- XLM-RoBERTa Embeddings (HuggingFace models) +- Longformer Embeddings (HuggingFace models) +- ALBERT Embeddings (TF Hub & HuggingFace models) +- XLNet Embeddings +- ELMO Embeddings (TF Hub models) +- Universal Sentence Encoder (TF Hub models) +- BERT Sentence Embeddings (TF Hub & HuggingFace models) +- RoBerta Sentence Embeddings (HuggingFace models) +- XLM-RoBerta Sentence Embeddings (HuggingFace models) +- INSTRUCTOR Embeddings (HuggingFace models) +- E5 Embeddings (HuggingFace models) +- MPNet Embeddings (HuggingFace models) +- UAE Embeddings (HuggingFace models) +- OpenAI Embeddings +- Sentence & Chunk Embeddings + +## Classification and Question Answering Models +- BERT for Token & Sequence Classification & Question Answering +- DistilBERT for Token & Sequence Classification & Question Answering +- CamemBERT for Token & Sequence Classification & Question Answering +- ALBERT for Token & Sequence Classification & Question Answering +- RoBERTa for Token & Sequence Classification & Question Answering +- DeBERTa for Token & Sequence Classification & Question Answering +- XLM-RoBERTa for Token & Sequence Classification & Question Answering +- Longformer for Token & Sequence Classification & Question Answering +- MPnet for Token & Sequence Classification & Question Answering +- XLNet for Token & Sequence Classification + +## Machine Translation and Generation +- Neural Machine Translation (MarianMT) +- Many-to-Many multilingual translation model (Facebook M2M100) +- Table Question Answering (TAPAS) +- Text-To-Text Transfer Transformer (Google T5) +- Generative Pre-trained Transformer 2 (OpenAI GPT2) +- Seq2Seq for NLG, Translation, and Comprehension (Facebook BART) +- Chat and Conversational LLMs (Facebook Llama-2) + +## Image and Speech +- Vision Transformer (Google ViT) +- Swin Image Classification (Microsoft Swin Transformer) +- ConvNext Image Classification (Facebook ConvNext) +- Vision Encoder Decoder for image-to-text like captioning +- Zero-Shot Image Classification by OpenAI's CLIP +- Automatic Speech Recognition (Wav2Vec2) +- Automatic Speech Recognition (HuBERT) +- Automatic Speech Recognition (OpenAI Whisper) + +## Integration and Interoperability +- Easy 
[ONNX](https://github.com/JohnSnowLabs/spark-nlp/tree/feature/SPARKNLP-1015-Modernizing-GitHub-repo/examples/python/transformers/onnx), [OpenVINO](https://github.com/JohnSnowLabs/spark-nlp/tree/feature/SPARKNLP-1015-Modernizing-GitHub-repo/examples/python/transformers/openvino), and [TensorFlow](https://github.com/JohnSnowLabs/spark-nlp/tree/feature/SPARKNLP-1015-Modernizing-GitHub-repo/examples/python/transformers) integrations
+- Full integration with Spark ML functions (see the sketch at the end of this page)
+- GPU Support
+
+## Pre-trained Models
+- +31000 pre-trained models in +200 languages!
+- +6000 pre-trained pipelines in +200 languages!
+
+#### Please check out our Models Hub for the full list of [pre-trained models](https://sparknlp.org/models) with examples, demos, benchmarks, and more
+
+## Multi-lingual Support
+- Multi-lingual NER models: Arabic, Bengali, Chinese, Danish, Dutch, English, Finnish, French, German, Hebrew, Italian,
+  Japanese, Korean, Norwegian, Persian, Polish, Portuguese, Russian, Spanish, Swedish, Urdu, and more.
+
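+For instance, the Spark ML integration mentioned above means Spark NLP annotators drop straight into a regular `pyspark.ml.Pipeline` (a minimal sketch; assumes a session created via `sparknlp.start()`):
+
+```python
+import sparknlp
+from pyspark.ml import Pipeline
+from sparknlp.base import DocumentAssembler
+from sparknlp.annotator import Tokenizer
+
+spark = sparknlp.start()
+
+# Spark NLP annotators are regular Spark ML pipeline stages
+document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
+tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
+pipeline = Pipeline(stages=[document_assembler, tokenizer])
+
+df = spark.createDataFrame([("Spark NLP ships with many pre-trained models.",)], ["text"])
+pipeline.fit(df).transform(df).select("token.result").show(truncate=False)
+```
+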
\ No newline at end of file
diff --git a/docs/en/install.md b/docs/en/install.md
index 4bc861a2c0d496..3d32683830df96 100644
--- a/docs/en/install.md
+++ b/docs/en/install.md
@@ -5,7 +5,7 @@ seotitle: Spark NLP - Installation
 title: Spark NLP - Installation
 permalink: /docs/en/install
 key: docs-install
-modify_date: "2023-05-10"
+modify_date: "2024-07-04"
 show_nav: true
 sidebar:
   nav: sparknlp
@@ -35,6 +35,14 @@ spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0
 spark-shell --jars spark-nlp-assembly-5.4.0.jar
 ```
 
+**GPU (optional):**
+
+Spark NLP 5.4.0 is built with the ONNX 1.17.0 and TensorFlow 2.7.1 deep learning engines. The following NVIDIA® software is required for GPU support:
+
+- NVIDIA® GPU drivers version 450.80.02 or higher
+- CUDA® Toolkit 11.2
+- cuDNN SDK 8.1.0
+
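+With the Python package, the GPU build can be requested directly at session start (a minimal sketch; `sparknlp.start(gpu=True)` pulls in the `spark-nlp-gpu` package instead of the default CPU one):
+
+```python
+import sparknlp
+
+# starts a SparkSession backed by the GPU build of Spark NLP
+spark = sparknlp.start(gpu=True)
+print(sparknlp.version())
+```
+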
 ## Python
 
@@ -95,15 +103,73 @@ spark = SparkSession.builder \
     .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0") \
     .getOrCreate()
 ```
+
+If using local jars, you can use `spark.jars` instead, with comma-delimited jar files (see the sketch at the end of this section). For cluster setups,
+you'll have to put the jars in a location reachable by all driver and executor nodes.
+
+### Python without explicit PySpark installation
+
+### Pip/Conda
+
+If you installed pyspark through pip/conda, you can install `spark-nlp` through the same channel.
+
+Pip:
+
+```bash
+pip install spark-nlp==5.4.0
+```
+
+Conda:
+
+```bash
+conda install -c johnsnowlabs spark-nlp
+```
+
+PyPI [spark-nlp package](https://pypi.org/project/spark-nlp/) /
+Anaconda [spark-nlp package](https://anaconda.org/JohnSnowLabs/spark-nlp)
+
+Then you'll have to create a SparkSession, which you can do directly from Spark NLP:
+
+```python
+import sparknlp
+
+spark = sparknlp.start()
+```
+
+**Quick example:**
+
+```python
+import sparknlp
+from sparknlp.pretrained import PretrainedPipeline
+
+# create or get Spark Session
+
+spark = sparknlp.start()
+
+sparknlp.version()
+spark.version
+
+# download, load and annotate a text by pre-trained pipeline
+
+pipeline = PretrainedPipeline('recognize_entities_dl', 'en')
+result = pipeline.annotate('The Mona Lisa is a 16th century oil painting created by Leonardo')
+```
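+
+As mentioned at the start of this section, a session built on a local fat JAR via `spark.jars` might look like this (a sketch; the JAR path is a placeholder for wherever you stored the assembly):
+
+```python
+from pyspark.sql import SparkSession
+
+spark = SparkSession.builder \
+    .appName("Spark NLP") \
+    .master("local[*]") \
+    .config("spark.jars", "/tmp/spark-nlp-assembly-5.4.0.jar") \
+    .getOrCreate()
+```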
 ## Scala and Java
 
+To use Spark NLP you need the following requirements:
+
+- Java 8 or 11
+- Apache Spark 3.5.x, 3.4.x, 3.3.x, 3.2.x, 3.1.x, 3.0.x
+
 #### Maven
 
 **spark-nlp** on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x
 
+The `spark-nlp` package has been published to
+the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp).
+
 ```xml
 
 
@@ -240,6 +306,81 @@ as expected.
+
+## Command line
+
+Spark NLP supports all major releases of Apache Spark 3.0.x, Apache Spark 3.1.x, Apache Spark 3.2.x, Apache Spark 3.3.x, Apache Spark 3.4.x, and Apache Spark 3.5.x.
+These steps require an internet connection.
+
+#### Apache Spark 3.x (3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x - Scala 2.12)
+
+```sh
+# CPU
+
+spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0
+
+pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0
+
+spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0
+```
+
+The `spark-nlp` package has been published to
+the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp).
+
+```sh
+# GPU
+
+spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.4.0
+
+pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.4.0
+
+spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.4.0
+
+```
+
+The `spark-nlp-gpu` package has been published to
+the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-gpu).
+
+```sh
+# AArch64
+
+spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.4.0
+
+pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.4.0
+
+spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.4.0
+
+```
+
+The `spark-nlp-aarch64` package has been published to
+the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-aarch64).
+
+```sh
+# M1/M2 (Apple Silicon)
+
+spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.4.0
+
+pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.4.0
+
+spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.4.0
+
+```
+
+The `spark-nlp-silicon` package has been published to
+the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-silicon).
+
+**NOTE**: In case you are using large pretrained models like UniversalSentenceEncoder, you need to have the following
+set in your SparkSession:
+
+```sh
+spark-shell \
+  --driver-memory 16g \
+  --conf spark.kryoserializer.buffer.max=2000M \
+  --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0
+```
+
+## Installation for M1 & M2 Chips
+
 ### Scala and Java for M1
 
 Adding Spark NLP to your Scala or Java project is easy:
@@ -370,6 +511,258 @@ Run the following code in Kaggle Kernel and start using spark-nlp right away.
+
+## Apache Zeppelin
+
+Use either of the following options:
+
+- Add the following Maven Coordinates to the interpreter's library list
+
+```bash
+com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0
+```
+
+- Add a path to a pre-built jar from [here](#compiled-jars) in the interpreter's library list, making sure the jar is
+  available on the driver path
+
+## Python in Zeppelin
+
+Apart from the previous step, install the Python module through pip:
+
+```bash
+pip install spark-nlp==5.4.0
+```
+
+Or you can install `spark-nlp` from inside Zeppelin by using Conda:
+
+```bash
+python.conda install -c johnsnowlabs spark-nlp
+```
+
+Configure Zeppelin properly and use cells with `%spark.pyspark` or whichever interpreter name you chose.
+
+Finally, in the Zeppelin interpreter settings, make sure you properly set `zeppelin.python` to the Python you want to use and
+have installed the pip library with (e.g. `python3`).
+
+An alternative option would be to set `SPARK_SUBMIT_OPTIONS` (zeppelin-env.sh) and make sure `--packages` is there, as
+shown earlier, since it covers both the Scala and Python sides of the installation.
+
+## Jupyter Notebook
+
+**Recommended:**
+
+The easiest way to get this done on Linux and macOS is to simply install the `spark-nlp` and `pyspark` PyPI packages and
+launch Jupyter from the same Python environment:
+
+```sh
+$ conda create -n sparknlp python=3.8 -y
+$ conda activate sparknlp
+# spark-nlp by default is based on pyspark 3.x
+$ pip install spark-nlp==5.4.0 pyspark==3.3.1 jupyter
+$ jupyter notebook
+```
+
+Then you can use the `python3` kernel to run your code, creating a SparkSession via `spark = sparknlp.start()`.
+
+**Optional:**
+
+If you are on a different operating system and need to run Jupyter Notebook through pyspark, you can follow
+these steps:
+
+```bash
+export SPARK_HOME=/path/to/your/spark/folder
+export PYSPARK_PYTHON=python3
+export PYSPARK_DRIVER_PYTHON=jupyter
+export PYSPARK_DRIVER_PYTHON_OPTS=notebook
+
+pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0
+```
+
+Alternatively, you can combine the `--jars` option for pyspark with `pip install spark-nlp`.
+
+If you are not using pyspark at all, follow the instructions
+pointed [here](#python-without-explicit-pyspark-installation).
+
+## Databricks Cluster
+
+1. Create a cluster if you don't have one already
+
+2. On a new cluster or existing one you need to add the following to the `Advanced Options -> Spark` tab:
+
+   ```bash
+   spark.kryoserializer.buffer.max 2000M
+   spark.serializer org.apache.spark.serializer.KryoSerializer
+   ```
+
+3. In the `Libraries` tab inside your cluster you need to follow these steps:
+
+   3.1. Install New -> PyPI -> `spark-nlp==5.4.0` -> Install
+
+   3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0` -> Install
+
+4. Now you can attach your notebook to the cluster and use Spark NLP!
+
+NOTE: Databricks' runtimes support different Apache Spark major releases. Please make sure you choose the correct Spark
+NLP Maven package name (Maven Coordinate) for your runtime from
+our [Packages Cheatsheet](https://github.com/JohnSnowLabs/spark-nlp#packages-cheatsheet).
+
+## EMR Cluster
+
+To launch EMR clusters with Apache Spark/PySpark and Spark NLP correctly, you need to have a bootstrap script and a software
+configuration. 
+
+A sample of your bootstrap script:
+
+```sh
+#!/bin/bash
+set -x -e
+
+echo -e 'export PYSPARK_PYTHON=/usr/bin/python3
+export HADOOP_CONF_DIR=/etc/hadoop/conf
+export SPARK_JARS_DIR=/usr/lib/spark/jars
+export SPARK_HOME=/usr/lib/spark' >> $HOME/.bashrc && source $HOME/.bashrc
+
+sudo python3 -m pip install awscli boto spark-nlp
+
+set +x
+exit 0
+
+```
+
+A sample of your software configuration in JSON on S3 (must be public access):
+
+```json
+[{
+  "Classification": "spark-env",
+  "Configurations": [{
+    "Classification": "export",
+    "Properties": {
+      "PYSPARK_PYTHON": "/usr/bin/python3"
+    }
+  }]
+},
+{
+  "Classification": "spark-defaults",
+  "Properties": {
+    "spark.yarn.stagingDir": "hdfs:///tmp",
+    "spark.yarn.preserve.staging.files": "true",
+    "spark.kryoserializer.buffer.max": "2000M",
+    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
+    "spark.driver.maxResultSize": "0",
+    "spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0"
+  }
+}]
+```
+
+A sample AWS CLI command to launch an EMR cluster:
+
+```sh
+aws emr create-cluster \
+--name "Spark NLP 5.4.0" \
+--release-label emr-6.2.0 \
+--applications Name=Hadoop Name=Spark Name=Hive \
+--instance-type m4.4xlarge \
+--instance-count 3 \
+--use-default-roles \
+--log-uri "s3:///" \
+--bootstrap-actions Path=s3:///emr-bootstrap.sh,Name=custom \
+--configurations "https:///sparknlp-config.json" \
+--ec2-attributes KeyName=,EmrManagedMasterSecurityGroup=,EmrManagedSlaveSecurityGroup= \
+--profile 
+```
+
+## GCP Dataproc
+
+1. Create a cluster, if you don't have one already, as follows.
+
+In the gcloud shell:
+
+```bash
+gcloud services enable dataproc.googleapis.com \
+  compute.googleapis.com \
+  storage-component.googleapis.com \
+  bigquery.googleapis.com \
+  bigquerystorage.googleapis.com
+```
+
+```bash
+REGION=
+```
+
+```bash
+BUCKET_NAME=
+gsutil mb -c standard -l ${REGION} gs://${BUCKET_NAME}
+```
+
+```bash
+REGION=
+ZONE=
+CLUSTER_NAME=
+BUCKET_NAME=
+```
+
+You can set image-version, master-machine-type, worker-machine-type,
+master-boot-disk-size, worker-boot-disk-size, and num-workers according to your needs.
+If you use an image-version earlier than 2.0, you should also add ANACONDA to the optional components.
+You should also enable the component gateway.
+Don't forget to set the Maven coordinates for the jar in the properties.
+
+```bash
+gcloud dataproc clusters create ${CLUSTER_NAME} \
+  --region=${REGION} \
+  --zone=${ZONE} \
+  --image-version=2.0 \
+  --master-machine-type=n1-standard-4 \
+  --worker-machine-type=n1-standard-2 \
+  --master-boot-disk-size=128GB \
+  --worker-boot-disk-size=128GB \
+  --num-workers=2 \
+  --bucket=${BUCKET_NAME} \
+  --optional-components=JUPYTER \
+  --enable-component-gateway \
+  --metadata 'PIP_PACKAGES=spark-nlp spark-nlp-display google-cloud-bigquery google-cloud-storage' \
+  --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh \
+  --properties spark:spark.serializer=org.apache.spark.serializer.KryoSerializer,spark:spark.driver.maxResultSize=0,spark:spark.kryoserializer.buffer.max=2000M,spark:spark.jars.packages=com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0
+```
+
+2. On an existing cluster, you need to install the spark-nlp and spark-nlp-display packages from PyPI (see the sketch after these steps).
+
+3. Now you can attach your notebook to the cluster and use Spark NLP! 
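+For step 2, the PyPI installation on an existing cluster might look like this (a sketch; the exact mechanism depends on how you manage Python packages on your cluster nodes):
+
+```bash
+pip install spark-nlp==5.4.0 spark-nlp-display
+```
+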
+
+
+## Apache Spark Support
+
+Spark NLP *5.4.0* has been built on top of Apache Spark 3.4 while fully supporting Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x.
+
+| Spark NLP | Apache Spark 3.5.x | Apache Spark 3.4.x | Apache Spark 3.3.x | Apache Spark 3.2.x | Apache Spark 3.1.x | Apache Spark 3.0.x | Apache Spark 2.4.x | Apache Spark 2.3.x |
+|-----------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|
+| 5.4.x | YES | YES | YES | YES | YES | YES | NO | NO |
+| 5.3.x | YES | YES | YES | YES | YES | YES | NO | NO |
+| 5.2.x | YES | YES | YES | YES | YES | YES | NO | NO |
+| 5.1.x | Partially | YES | YES | YES | YES | YES | NO | NO |
+| 5.0.x | YES | YES | YES | YES | YES | YES | NO | NO |
+| 4.4.x | YES | YES | YES | YES | YES | YES | NO | NO |
+| 4.3.x | NO | NO | YES | YES | YES | YES | NO | NO |
+| 4.2.x | NO | NO | YES | YES | YES | YES | NO | NO |
+| 4.1.x | NO | NO | YES | YES | YES | YES | NO | NO |
+| 4.0.x | NO | NO | YES | YES | YES | YES | NO | NO |
+
+Find out more about `Spark NLP` versions from our [release notes](https://github.com/JohnSnowLabs/spark-nlp/releases).
+
+## Scala and Python Support
+
+| Spark NLP | Python 3.6 | Python 3.7 | Python 3.8 | Python 3.9 | Python 3.10| Scala 2.11 | Scala 2.12 |
+|-----------|------------|------------|------------|------------|------------|------------|------------|
+| 5.3.x | NO | YES | YES | YES | YES | NO | YES |
+| 5.2.x | NO | YES | YES | YES | YES | NO | YES |
+| 5.1.x | NO | YES | YES | YES | YES | NO | YES |
+| 5.0.x | NO | YES | YES | YES | YES | NO | YES |
+| 4.4.x | NO | YES | YES | YES | YES | NO | YES |
+| 4.3.x | YES | YES | YES | YES | YES | NO | YES |
+| 4.2.x | YES | YES | YES | YES | YES | NO | YES |
+| 4.1.x | YES | YES | YES | YES | NO | NO | YES |
+| 4.0.x | YES | YES | YES | YES | NO | NO | YES |
+
+
 ## Databricks Support
 
 Spark NLP 5.4.0 has been tested and is compatible with the following runtimes:
@@ -867,4 +1260,44 @@ PipelineModel.load("/tmp/explain_document_dl_en_2.0.2_2.4_1556530585689/")
 - Since you are downloading and loading models/pipelines manually, this means Spark NLP is not downloading the most
   recent and compatible models/pipelines for you. Choosing the right model/pipeline is on you
 - If you are local, you can load the model/pipeline from your local FileSystem, however, if you are in a cluster setup
   you need to put the model/pipeline on a distributed FileSystem such as HDFS, DBFS, S3, etc. (i.e., `hdfs:///tmp/explain_document_dl_en_2.0.2_2.4_1556530585689/`)
+
+## Compiled JARs
+
+### Build from source
+
+#### spark-nlp
+
+- FAT-JAR for CPU on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x
+
+```bash
+sbt assembly
+```
+
+- FAT-JAR for GPU on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x
+
+```bash
+sbt -Dis_gpu=true assembly
+```
+
+- FAT-JAR for M1 on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x
+
+```bash
+sbt -Dis_silicon=true assembly
+```
+
+### Using the jar manually
+
+If for some reason you need to use the JAR, you can either download the Fat JARs provided in the [release notes](https://github.com/JohnSnowLabs/spark-nlp/releases) or download them
+from [Maven Central](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp).
+
+To add JARs to Spark programs, use the `--jars` option:
+
+```sh
+spark-shell --jars spark-nlp.jar
+```
+
+The preferred way to use the library when running Spark programs is the `--packages` option, as specified in
+the `spark-packages` section.
+
+
diff --git a/docs/en/pipelines.md b/docs/en/pipelines.md index 43728d43863270..0204f8c62b88f9 100644 --- a/docs/en/pipelines.md +++ b/docs/en/pipelines.md @@ -5,7 +5,7 @@ seotitle: Spark NLP - Pipelines title: Spark NLP - Pipelines permalink: /docs/en/pipelines key: docs-pipelines -modify_date: "2021-11-20" +modify_date: "2024-07-04" show_nav: true sidebar: nav: sparknlp @@ -13,96 +13,24 @@ sidebar:
-Pretrained Pipelines have moved to Models Hub. -Please follow this link for the updated list of all models and pipelines: -[Models Hub](https://sparknlp.org/models) -{:.success} - -
- -## English - -**NOTE:** -`noncontrib` pipelines are compatible with `Windows` operating systems. - -{:.table-model-big} -| Pipelines | Name | -| -------------------- | ---------------------- | -| [Explain Document ML](#explaindocumentml) | `explain_document_ml` -| [Explain Document DL](#explaindocumentdl) | `explain_document_dl` -| [Explain Document DL Win]() | `explain_document_dl_noncontrib` -| Explain Document DL Fast | `explain_document_dl_fast` -| Explain Document DL Fast Win | `explain_document_dl_fast_noncontrib` | -| [Recognize Entities DL](#recognizeentitiesdl) | `recognize_entities_dl` | -| Recognize Entities DL Win | `recognize_entities_dl_noncontrib` | -| [OntoNotes Entities Small](#ontorecognizeentitiessm) | `onto_recognize_entities_sm` | -| [OntoNotes Entities Large](#ontorecognizeentitieslg) | `onto_recognize_entities_lg` | -| [Match Datetime](#matchdatetime) | `match_datetime` | -| [Match Pattern](#matchpattern) | `match_pattern` | -| [Match Chunk](#matchchunks) | `match_chunks` | -| Match Phrases | `match_phrases`| -| Clean Stop | `clean_stop`| -| Clean Pattern | `clean_pattern`| -| Clean Slang | `clean_slang`| -| Check Spelling | `check_spelling`| -| Analyze Sentiment | `analyze_sentiment` | -| Analyze Sentiment DL | `analyze_sentimentdl_use_imdb` | -| Analyze Sentiment DL | `analyze_sentimentdl_use_twitter` | -| Dependency Parse | `dependency_parse` | - -
- -### explain_document_ml - -{% highlight scala %} -import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline -import com.johnsnowlabs.nlp.SparkNLP - -SparkNLP.version() - -val testData = spark.createDataFrame(Seq( -(1, "Google has announced the release of a beta version of the popular TensorFlow machine learning library"), -(2, "The Paris metro will soon enter the 21st century, ditching single-use paper tickets for rechargeable electronic cards.") -)).toDF("id", "text") +## Pipelines and Models -val pipeline = PretrainedPipeline("explain_document_ml", lang="en") +### Pipelines -val annotation = pipeline.transform(testData) - -annotation.show() - -/* -2.0.8 -testData: org.apache.spark.sql.DataFrame = [id: int, text: string] -pipeline: com.johnsnowlabs.nlp.pretrained.PretrainedPipeline = PretrainedPipeline(explain_document_ml,en,public/models) -annotation: org.apache.spark.sql.DataFrame = [id: int, text: string ... 7 more fields] -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -| id| text| document| sentence| token| checked| lemmas| stems| pos| -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -| 1|Google has announ...|[[document, 0, 10...|[[document, 0, 10...|[[token, 0, 5, Go...|[[token, 0, 5, Go...|[[token, 0, 5, Go...|[[token, 0, 5, go...|[[pos, 0, 5, NNP,...| -| 2|The Paris metro w...|[[document, 0, 11...|[[document, 0, 11...|[[token, 0, 2, Th...|[[token, 0, 2, Th...|[[token, 0, 2, Th...|[[token, 0, 2, th...|[[pos, 0, 2, DT, ...| -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -*/ - -{% endhighlight %} - -
- -### explain_document_dl - -{% highlight scala %} +**Quick example:** +```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline import com.johnsnowlabs.nlp.SparkNLP SparkNLP.version() val testData = spark.createDataFrame(Seq( -(1, "Google has announced the release of a beta version of the popular TensorFlow machine learning library"), -(2, "Donald John Trump (born June 14, 1946) is the 45th and current president of the United States") + (1, "Google has announced the release of a beta version of the popular TensorFlow machine learning library"), + (2, "Donald John Trump (born June 14, 1946) is the 45th and current president of the United States") )).toDF("id", "text") -val pipeline = PretrainedPipeline("explain_document_dl", lang="en") +val pipeline = PretrainedPipeline("explain_document_dl", lang = "en") val annotation = pipeline.transform(testData) @@ -110,7 +38,7 @@ annotation.show() /* import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline import com.johnsnowlabs.nlp.SparkNLP -2.0.8 +2.5.0 testData: org.apache.spark.sql.DataFrame = [id: int, text: string] pipeline: com.johnsnowlabs.nlp.pretrained.PretrainedPipeline = PretrainedPipeline(explain_document_dl,en,public/models) annotation: org.apache.spark.sql.DataFrame = [id: int, text: string ... 10 more fields] @@ -132,888 +60,141 @@ annotation.select("entities.result").show(false) |[Donald John Trump, United States]| +----------------------------------+ */ +``` -{% endhighlight %} +#### Showing Available Pipelines -
+There are functions in Spark NLP that will list all the available Pipelines +of a particular language for you: -### recognize_entities_dl - -{% highlight scala %} - -import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline -import com.johnsnowlabs.nlp.SparkNLP - -SparkNLP.version() - -val testData = spark.createDataFrame(Seq( -(1, "Google has announced the release of a beta version of the popular TensorFlow machine learning library"), -(2, "Donald John Trump (born June 14, 1946) is the 45th and current president of the United States") -)).toDF("id", "text") - -val pipeline = PretrainedPipeline("recognize_entities_dl", lang="en") - -val annotation = pipeline.transform(testData) - -annotation.show() - -/* -import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline -import com.johnsnowlabs.nlp.SparkNLP -2.0.8 -testData: org.apache.spark.sql.DataFrame = [id: int, text: string] -pipeline: com.johnsnowlabs.nlp.pretrained.PretrainedPipeline = PretrainedPipeline(entity_recognizer_dl,en,public/models) -annotation: org.apache.spark.sql.DataFrame = [id: int, text: string ... 6 more fields] -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -| id| text| document| sentence| token| embeddings| ner| ner_converter| -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -| 1|Google has announ...|[[document, 0, 10...|[[document, 0, 10...|[[token, 0, 5, Go...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 5, Go...| -| 2|Donald John Trump...|[[document, 0, 92...|[[document, 0, 92...|[[token, 0, 5, Do...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 16, D...| -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -*/ - -annotation.select("entities.result").show(false) +```scala +import com.johnsnowlabs.nlp.pretrained.ResourceDownloader +ResourceDownloader.showPublicPipelines(lang = "en") /* -+----------------------------------+ -|result | -+----------------------------------+ -|[Google, TensorFlow] | -|[Donald John Trump, United States]| -+----------------------------------+ ++--------------------------------------------+------+---------+ +| Pipeline | lang | version | ++--------------------------------------------+------+---------+ +| dependency_parse | en | 2.0.2 | +| analyze_sentiment_ml | en | 2.0.2 | +| check_spelling | en | 2.1.0 | +| match_datetime | en | 2.1.0 | + ... +| explain_document_ml | en | 3.1.3 | ++--------------------------------------------+------+---------+ */ +``` -{% endhighlight %} - -
+Or if we want to check for a particular version: -### onto_recognize_entities_sm - -Trained by **NerDLApproach** annotator with **Char CNNs - BiLSTM - CRF** and **GloVe Embeddings** on the **OntoNotes** corpus and supports the identification of 18 entities. - -{% highlight scala %} - -import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline -import com.johnsnowlabs.nlp.SparkNLP - -SparkNLP.version() - -val testData = spark.createDataFrame(Seq( -(1, "Johnson first entered politics when elected in 2001 as a member of Parliament. He then served eight years as the mayor of London, from 2008 to 2016, before rejoining Parliament. "), -(2, "A little less than a decade later, dozens of self-driving startups have cropped up while automakers around the world clamor, wallet in hand, to secure their place in the fast-moving world of fully automated transportation.") -)).toDF("id", "text") - -val pipeline = PretrainedPipeline("onto_recognize_entities_sm", lang="en") - -val annotation = pipeline.transform(testData) - -annotation.show() - -/* -import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline -import com.johnsnowlabs.nlp.SparkNLP -2.1.0 -testData: org.apache.spark.sql.DataFrame = [id: int, text: string] -pipeline: com.johnsnowlabs.nlp.pretrained.PretrainedPipeline = PretrainedPipeline(onto_recognize_entities_sm,en,public/models) -annotation: org.apache.spark.sql.DataFrame = [id: int, text: string ... 6 more fields] -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -| id| text| document| token| embeddings| ner| entities| -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -| 1|Johnson first ent...|[[document, 0, 17...|[[token, 0, 6, Jo...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 6, Jo...| -| 2|A little less tha...|[[document, 0, 22...|[[token, 0, 0, A,...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 32, A...| -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -*/ - -annotation.select("entities.result").show(false) +```scala +import com.johnsnowlabs.nlp.pretrained.ResourceDownloader +ResourceDownloader.showPublicPipelines(lang = "en", version = "3.1.0") /* -+---------------------------------------------------------------------------------+ -|result | -+---------------------------------------------------------------------------------+ -|[Johnson, first, 2001, Parliament, eight years, London, 2008 to 2016, Parliament]| -|[A little less than a decade later, dozens] | -+---------------------------------------------------------------------------------+ ++---------------------------------------+------+---------+ +| Pipeline | lang | version | ++---------------------------------------+------+---------+ +| dependency_parse | en | 2.0.2 | + ... +| clean_slang | en | 3.0.0 | +| clean_pattern | en | 3.0.0 | +| check_spelling | en | 3.0.0 | +| dependency_parse | en | 3.0.0 | ++---------------------------------------+------+---------+ */ +``` -{% endhighlight %} +#### Please check out our Models Hub for the full list of [pre-trained pipelines](https://sparknlp.org/models) with examples, demos, benchmarks, and more -
+### Models -### onto_recognize_entities_lg +**Some selected languages: +** `Afrikaans, Arabic, Armenian, Basque, Bengali, Breton, Bulgarian, Catalan, Czech, Dutch, English, Esperanto, Finnish, French, Galician, German, Greek, Hausa, Hebrew, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Latin, Latvian, Marathi, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Somali, Southern Sotho, Spanish, Swahili, Swedish, Tswana, Turkish, Ukrainian, Zulu` -Trained by **NerDLApproach** annotator with **Char CNNs - BiLSTM - CRF** and **GloVe Embeddings** on the **OntoNotes** corpus and supports the identification of 18 entities. +**Quick online example:** -{% highlight scala %} +```python +# load NER model trained by deep learning approach and GloVe word embeddings +ner_dl = NerDLModel.pretrained('ner_dl') +# load NER model trained by deep learning approach and BERT word embeddings +ner_bert = NerDLModel.pretrained('ner_dl_bert') +``` -import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline -import com.johnsnowlabs.nlp.SparkNLP +```scala +// load French POS tagger model trained by Universal Dependencies +val french_pos = PerceptronModel.pretrained("pos_ud_gsd", lang = "fr") +// load Italian LemmatizerModel +val italian_lemma = LemmatizerModel.pretrained("lemma_dxc", lang = "it") +```` -SparkNLP.version() +**Quick offline example:** -val testData = spark.createDataFrame(Seq( -(1, "Johnson first entered politics when elected in 2001 as a member of Parliament. He then served eight years as the mayor of London, from 2008 to 2016, before rejoining Parliament. "), -(2, "A little less than a decade later, dozens of self-driving startups have cropped up while automakers around the world clamor, wallet in hand, to secure their place in the fast-moving world of fully automated transportation.") -)).toDF("id", "text") - -val pipeline = PretrainedPipeline("onto_recognize_entities_lg", lang="en") +- Loading `PerceptronModel` annotator model inside Spark NLP Pipeline -val annotation = pipeline.transform(testData) +```scala +val french_pos = PerceptronModel.load("/tmp/pos_ud_gsd_fr_2.0.2_2.4_1556531457346/") + .setInputCols("document", "token") + .setOutputCol("pos") +``` -annotation.show() +#### Showing Available Models -/* -import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline -import com.johnsnowlabs.nlp.SparkNLP -2.1.0 -testData: org.apache.spark.sql.DataFrame = [id: int, text: string] -pipeline: com.johnsnowlabs.nlp.pretrained.PretrainedPipeline = PretrainedPipeline(onto_recognize_entities_lg,en,public/models) -annotation: org.apache.spark.sql.DataFrame = [id: int, text: string ... 
6 more fields] -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -| id| text| document| token| embeddings| ner| entities| -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -| 1|Johnson first ent...|[[document, 0, 17...|[[token, 0, 6, Jo...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 6, Jo...| -| 2|A little less tha...|[[document, 0, 22...|[[token, 0, 0, A,...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 32, A...| -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -*/ +There are functions in Spark NLP that will list all the available Models +of a particular Annotator and language for you: -annotation.select("entities.result").show(false) +```scala +import com.johnsnowlabs.nlp.pretrained.ResourceDownloader +ResourceDownloader.showPublicModels(annotator = "NerDLModel", lang = "en") /* -+-------------------------------------------------------------------------------+ -|result | -+-------------------------------------------------------------------------------+ -|[Johnson, first, 2001, Parliament, eight years, London, 2008, 2016, Parliament]| -|[A little less than a decade later, dozens] | -+-------------------------------------------------------------------------------+ ++---------------------------------------------+------+---------+ +| Model | lang | version | ++---------------------------------------------+------+---------+ +| onto_100 | en | 2.1.0 | +| onto_300 | en | 2.1.0 | +| ner_dl_bert | en | 2.2.0 | +| onto_100 | en | 2.4.0 | +| ner_conll_elmo | en | 3.2.2 | ++---------------------------------------------+------+---------+ */ +``` -{% endhighlight %} - -
- -### match_datetime - -#### DateMatcher yyyy/MM/dd - -{% highlight scala %} - -import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline -import com.johnsnowlabs.nlp.SparkNLP - -SparkNLP.version() - -val testData = spark.createDataFrame(Seq( -(1, "I would like to come over and see you in 01/02/2019."), -(2, "Donald John Trump (born June 14, 1946) is the 45th and current president of the United States") -)).toDF("id", "text") +Or if we want to check for a particular version: -val pipeline = PretrainedPipeline("match_datetime", lang="en") - -val annotation = pipeline.transform(testData) - -annotation.show() +```scala +import com.johnsnowlabs.nlp.pretrained.ResourceDownloader +ResourceDownloader.showPublicModels(annotator = "NerDLModel", lang = "en", version = "3.1.0") /* -import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline -import com.johnsnowlabs.nlp.SparkNLP -2.0.8 -testData: org.apache.spark.sql.DataFrame = [id: int, text: string] -pipeline: com.johnsnowlabs.nlp.pretrained.PretrainedPipeline = PretrainedPipeline(match_datetime,en,public/models) -annotation: org.apache.spark.sql.DataFrame = [id: int, text: string ... 4 more fields] -+---+--------------------+--------------------+--------------------+--------------------+--------------------+ -| id| text| document| sentence| token| date| -+---+--------------------+--------------------+--------------------+--------------------+--------------------+ -| 1|I would like to c...|[[document, 0, 51...|[[document, 0, 51...|[[token, 0, 0, I,...|[[date, 41, 50, 2...| -| 2|Donald John Trump...|[[document, 0, 92...|[[document, 0, 92...|[[token, 0, 5, Do...|[[date, 24, 36, 1...| -+---+--------------------+--------------------+--------------------+--------------------+--------------------+ ++----------------------------+------+---------+ +| Model | lang | version | ++----------------------------+------+---------+ +| onto_100 | en | 2.1.0 | +| ner_aspect_based_sentiment | en | 2.6.2 | +| ner_weibo_glove_840B_300d | en | 2.6.2 | +| nerdl_atis_840b_300d | en | 2.7.1 | +| nerdl_snips_100d | en | 2.7.3 | ++----------------------------+------+---------+ */ +``` -annotation.select("date.result").show(false) +And to see a list of available annotators, you can use: -/* -+------------+ -|result | -+------------+ -|[2019/01/02]| -|[1946/06/14]| -+------------+ -*/ - -{% endhighlight %} - -
- -### match_pattern - -RegexMatcher (match phone numbers) - -{% highlight scala %} - -import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline -import com.johnsnowlabs.nlp.SparkNLP - -SparkNLP.version() - -val testData = spark.createDataFrame(Seq( -(1, "You should call Mr. Jon Doe at +33 1 79 01 22 89") -)).toDF("id", "text") - -val pipeline = PretrainedPipeline("match_pattern", lang="en") - -val annotation = pipeline.transform(testData) - -annotation.show() - -/* -import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline -import com.johnsnowlabs.nlp.SparkNLP -2.0.8 -testData: org.apache.spark.sql.DataFrame = [id: int, text: string] -pipeline: com.johnsnowlabs.nlp.pretrained.PretrainedPipeline = PretrainedPipeline(match_pattern,en,public/models) -annotation: org.apache.spark.sql.DataFrame = [id: int, text: string ... 4 more fields] -+---+--------------------+--------------------+--------------------+--------------------+--------------------+ -| id| text| document| sentence| token| regex| -+---+--------------------+--------------------+--------------------+--------------------+--------------------+ -| 1|You should call M...|[[document, 0, 47...|[[document, 0, 47...|[[token, 0, 2, Yo...|[[chunk, 31, 47, ...| -+---+--------------------+--------------------+--------------------+--------------------+--------------------+ -*/ - -annotation.select("regex.result").show(false) - -/* -+-------------------+ -|result | -+-------------------+ -|[+33 1 79 01 22 89]| -+-------------------+ -*/ - -{% endhighlight %} - -
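
The same approach generalizes to custom patterns. Below is a sketch of a hand-built `RegexMatcher`, assuming an external rule file where each line holds a regular expression and an identifier separated by the chosen delimiter; the file name and rule shown are only illustrative:

{% highlight scala %}
import com.johnsnowlabs.nlp.annotators.RegexMatcher

// Match every occurrence of the rules listed in regex_rules.txt,
// e.g. a line such as: \+\d{2}( \d+)+~phone_number
val regexMatcher = new RegexMatcher()
  .setInputCols("sentence")
  .setOutputCol("regex")
  .setExternalRules("regex_rules.txt", "~")
  .setStrategy("MATCH_ALL")
{% endhighlight %}
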

### match_chunks

The pipeline uses the part-of-speech regex `<DT>?<JJ>*<NN>+` to extract noun chunks: an optional determiner, followed by any number of adjectives and one or more nouns.

{% highlight scala %}

import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.SparkNLP

SparkNLP.version()

val testData = spark.createDataFrame(Seq(
(1, "The book has many chapters"),
(2, "the little yellow dog barked at the cat")
)).toDF("id", "text")

val pipeline = PretrainedPipeline("match_chunks", lang="en")

val annotation = pipeline.transform(testData)

annotation.show()

/*
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.SparkNLP
2.0.8
testData: org.apache.spark.sql.DataFrame = [id: int, text: string]
pipeline: com.johnsnowlabs.nlp.pretrained.PretrainedPipeline = PretrainedPipeline(match_chunks,en,public/models)
annotation: org.apache.spark.sql.DataFrame = [id: int, text: string ... 5 more fields]
+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| id| text| document| sentence| token| pos| chunk|
+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| 1|The book has many...|[[document, 0, 25...|[[document, 0, 25...|[[token, 0, 2, Th...|[[pos, 0, 2, DT, ...|[[chunk, 0, 7, Th...|
| 2|the little yellow...|[[document, 0, 38...|[[document, 0, 38...|[[token, 0, 2, th...|[[pos, 0, 2, DT, ...|[[chunk, 0, 20, t...|
+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
*/

annotation.select("chunk.result").show(false)

/*
+--------------------------------+
|result                          |
+--------------------------------+
|[The book]                      |
|[the little yellow dog, the cat]|
+--------------------------------+
*/

{% endhighlight %}
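
For comparison, roughly the same behavior can be assembled by hand. The sketch below is illustrative rather than the pipeline's exact composition: it assumes document, sentence, token, and POS stages already exist upstream, and only shows a `Chunker` configured with the same part-of-speech regex.

{% highlight scala %}
import com.johnsnowlabs.nlp.annotators.Chunker

// A Chunker stage using the same POS regex as the match_chunks pipeline.
// Upstream stages (document assembler, sentence detector, tokenizer,
// POS tagger) are assumed to be defined elsewhere.
val chunker = new Chunker()
  .setInputCols("sentence", "pos")
  .setOutputCol("chunk")
  .setRegexParsers(Array("<DT>?<JJ>*<NN>+"))
{% endhighlight %}
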
- -## French - -{:.table-model-big} -| Pipelines | Name | -| ----------------------- | --------------------- | -| [Explain Document Large](#french-explain_document_lg) | `explain_document_lg` | -| [Explain Document Medium](#french-explain_document_md) | `explain_document_md` | -| [Entity Recognizer Large](#french-entity_recognizer_lg) | `entity_recognizer_lg` | -| [Entity Recognizer Medium](#french-entity_recognizer_md) | `entity_recognizer_md` | - -{:.table-model-big} -|Feature | Description| -|---|----| -|**NER**|Trained by **NerDLApproach** annotator with **Char CNNs - BiLSTM - CRF** and **GloVe Embeddings** on the **WikiNER** corpus and supports the identification of `PER`, `LOC`, `ORG` and `MISC` entities -|**Lemma**|Trained by **Lemmatizer** annotator on **lemmatization-lists** by `Michal Měchura` -|**POS**| Trained by **PerceptronApproach** annotator on the [Universal Dependencies](https://universaldependencies.org/treebanks/fr_gsd/index.html) -|**Size**| Model size indicator, **md** and **lg**. The large pipeline uses **glove_840B_300** and the medium uses **glove_6B_300** WordEmbeddings - -
- -### French explain_document_lg - -{% highlight scala %} - -import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline -import com.johnsnowlabs.nlp.SparkNLP - -SparkNLP.version() - -val pipeline = PretrainedPipeline("explain_document_lg", lang="fr") - -val testData = spark.createDataFrame(Seq( -(1, "Contrairement à Quentin Tarantino, le cinéma français ne repart pas les mains vides de la compétition cannoise."), -(2, "Emmanuel Jean-Michel Frédéric Macron est le fils de Jean-Michel Macron, né en 1950, médecin, professeur de neurologie au CHU d'Amiens4 et responsable d'enseignement à la faculté de médecine de cette même ville5, et de Françoise Noguès, médecin conseil à la Sécurité sociale") -)).toDF("id", "text") - -val annotation = pipeline.transform(testData) - -annotation.show() +```scala +import com.johnsnowlabs.nlp.pretrained.ResourceDownloader +ResourceDownloader.showAvailableAnnotators() /* -import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline -import com.johnsnowlabs.nlp.SparkNLP -2.0.8 -pipeline: com.johnsnowlabs.nlp.pretrained.PretrainedPipeline = PretrainedPipeline(explain_document_lg,fr,public/models) -testData: org.apache.spark.sql.DataFrame = [id: bigint, text: string] -annotation: org.apache.spark.sql.DataFrame = [id: bigint, text: string ... 8 more fields] -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -| id| text| document| token| sentence| lemma| pos| embeddings| ner| entities| -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -| 0|Contrairement à Q...|[[document, 0, 11...|[[token, 0, 12, C...|[[document, 0, 11...|[[token, 0, 12, C...|[[pos, 0, 12, ADV...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 16, 32, ...| -| 1|Emmanuel Jean-Mic...|[[document, 0, 27...|[[token, 0, 7, Em...|[[document, 0, 27...|[[token, 0, 7, Em...|[[pos, 0, 7, PROP...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 35, E...| -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ +AlbertEmbeddings +AlbertForTokenClassification +AssertionDLModel +... +XlmRoBertaSentenceEmbeddings +XlnetEmbeddings */ +``` -annotation.select("entities.result").show(false) - -/*+-------------------------------------------------------------------------------------------------------------+ -|result | -+-------------------------------------------------------------------------------------------------------------+ -|[Quentin Tarantino] | -|[Emmanuel Jean-Michel Frédéric Macron, Jean-Michel Macron, CHU d'Amiens4, Françoise Noguès, Sécurité sociale]| -+-------------------------------------------------------------------------------------------------------------+ -*/ - -{% endhighlight %} - -

### French explain_document_md

{% highlight scala %}

import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.SparkNLP

SparkNLP.version()

val pipeline = PretrainedPipeline("explain_document_md", lang="fr")

val testData = spark.createDataFrame(Seq(
(1, "Contrairement à Quentin Tarantino, le cinéma français ne repart pas les mains vides de la compétition cannoise."),
(2, "Emmanuel Jean-Michel Frédéric Macron est le fils de Jean-Michel Macron, né en 1950, médecin, professeur de neurologie au CHU d'Amiens4 et responsable d'enseignement à la faculté de médecine de cette même ville5, et de Françoise Noguès, médecin conseil à la Sécurité sociale")
)).toDF("id", "text")

val annotation = pipeline.transform(testData)

annotation.show()

/*
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.SparkNLP
2.0.8
pipeline: com.johnsnowlabs.nlp.pretrained.PretrainedPipeline = PretrainedPipeline(explain_document_md,fr,public/models)
testData: org.apache.spark.sql.DataFrame = [id: bigint, text: string]
annotation: org.apache.spark.sql.DataFrame = [id: bigint, text: string ... 8 more fields]
+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| id| text| document| token| sentence| lemma| pos| embeddings| ner| entities|
+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| 0|Contrairement à Q...|[[document, 0, 11...|[[token, 0, 12, C...|[[document, 0, 11...|[[token, 0, 12, C...|[[pos, 0, 12, ADV...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 16, 32, ...|
| 1|Emmanuel Jean-Mic...|[[document, 0, 27...|[[token, 0, 7, Em...|[[document, 0, 27...|[[token, 0, 7, Em...|[[pos, 0, 7, PROP...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 35, E...|
+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
*/

annotation.select("entities.result").show(false)

/*
+----------------------------------------------------------------------------------------------------------------+
|result |
+----------------------------------------------------------------------------------------------------------------+
|[Quentin Tarantino] |
|[Emmanuel Jean-Michel Frédéric Macron, Jean-Michel Macron, au CHU d'Amiens4, Françoise Noguès, Sécurité sociale]|
+----------------------------------------------------------------------------------------------------------------+
*/

{% endhighlight %}
- -### French entity_recognizer_lg - -{% highlight scala %} - -import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline -import com.johnsnowlabs.nlp.SparkNLP - -SparkNLP.version() - -val pipeline = PretrainedPipeline("entity_recognizer_lg", lang="fr") - -val testData = spark.createDataFrame(Seq( -(1, "Contrairement à Quentin Tarantino, le cinéma français ne repart pas les mains vides de la compétition cannoise."), -(2, "Emmanuel Jean-Michel Frédéric Macron est le fils de Jean-Michel Macron, né en 1950, médecin, professeur de neurologie au CHU d'Amiens4 et responsable d'enseignement à la faculté de médecine de cette même ville5, et de Françoise Noguès, médecin conseil à la Sécurité sociale") -)).toDF("id", "text") - -val annotation = pipeline.transform(testData) - -annotation.show() - -/* -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -| id| text| document| token| sentence| embeddings| ner| entities| -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -| 0|Contrairement à Q...|[[document, 0, 11...|[[token, 0, 12, C...|[[document, 0, 11...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 16, 32, ...| -| 1|Emmanuel Jean-Mic...|[[document, 0, 27...|[[token, 0, 7, Em...|[[document, 0, 27...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 35, E...| -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -*/ - -annotation.select("entities.result").show(false) - -/* -+-------------------------------------------------------------------------------------------------------------+ -|result | -+-------------------------------------------------------------------------------------------------------------+ -|[Quentin Tarantino] | -|[Emmanuel Jean-Michel Frédéric Macron, Jean-Michel Macron, CHU d'Amiens4, Françoise Noguès, Sécurité sociale]| -+-------------------------------------------------------------------------------------------------------------+ -*/ - -{% endhighlight %} - -
- -### French entity_recognizer_md - -{% highlight scala %} - -import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline -import com.johnsnowlabs.nlp.SparkNLP - -SparkNLP.version() - -val pipeline = PretrainedPipeline("entity_recognizer_md", lang="fr") - -val testData = spark.createDataFrame(Seq( -(1, "Contrairement à Quentin Tarantino, le cinéma français ne repart pas les mains vides de la compétition cannoise."), -(2, "Emmanuel Jean-Michel Frédéric Macron est le fils de Jean-Michel Macron, né en 1950, médecin, professeur de neurologie au CHU d'Amiens4 et responsable d'enseignement à la faculté de médecine de cette même ville5, et de Françoise Noguès, médecin conseil à la Sécurité sociale") -)).toDF("id", "text") - -val annotation = pipeline.transform(testData) - -annotation.show() - -/* -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -| id| text| document| token| sentence| embeddings| ner| entities| -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -| 0|Contrairement à Q...|[[document, 0, 11...|[[token, 0, 12, C...|[[document, 0, 11...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 16, 32, ...| -| 1|Emmanuel Jean-Mic...|[[document, 0, 27...|[[token, 0, 7, Em...|[[document, 0, 27...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 35, E...| -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -*/ - -annotation.select("entities.result").show(false) - -/*+-------------------------------------------------------------------------------------------------------------+ -|result | -+----------------------------------------------------------------------------------------------------------------+ -|[Quentin Tarantino] | -|[Emmanuel Jean-Michel Frédéric Macron, Jean-Michel Macron, au CHU d'Amiens4, Françoise Noguès, Sécurité sociale]| -+----------------------------------------------------------------------------------------------------------------+ -*/ - -{% endhighlight %} - -
- -## Italian - -{:.table-model-big} -| Pipelines | Name | -| ----------------------- | --------------------- | -| [Explain Document Large](#italian-explain_document_lg) | `explain_document_lg` | -| [Explain Document Medium](#italian-explain_document_md) | `explain_document_md` | -| [Entity Recognizer Large](#italian-entity_recognizer_lg) | `entity_recognizer_lg` | -| [Entity Recognizer Medium](#italian-entity_recognizer_md) | `entity_recognizer_md` | - -{:.table-model-big} -|Feature | Description| -|---|----| -|**NER**|Trained by **NerDLApproach** annotator with **Char CNNs - BiLSTM - CRF** and **GloVe Embeddings** on the **WikiNER** corpus and supports the identification of `PER`, `LOC`, `ORG` and `MISC` entities -|**Lemma**|Trained by **Lemmatizer** annotator on **DXC Technology** dataset -|**POS**| Trained by **PerceptronApproach** annotator on the [Universal Dependencies](https://universaldependencies.org/treebanks/it_isdt/index.html) -|**Size**| Model size indicator, **md** and **lg**. The large pipeline uses **glove_840B_300** and the medium uses **glove_6B_300** WordEmbeddings - -
- -### Italian explain_document_lg - -{% highlight scala %} - -import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline -import com.johnsnowlabs.nlp.SparkNLP - -SparkNLP.version() - -val pipeline = PretrainedPipeline("explain_document_lg", lang="it") - -val testData = spark.createDataFrame(Seq( -(1, "La FIFA ha deciso: tre giornate a Zidane, due a Materazzi"), -(2, "Reims, 13 giugno 2019 – Domani può essere la giornata decisiva per il passaggio agli ottavi di finale dei Mondiali femminili.") -)).toDF("id", "text") - -val annotation = pipeline.transform(testData) - -annotation.show() - -/* -import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline -import com.johnsnowlabs.nlp.SparkNLP -2.0.8 -pipeline: com.johnsnowlabs.nlp.pretrained.PretrainedPipeline = PretrainedPipeline(explain_document_lg,it,public/models) -testData: org.apache.spark.sql.DataFrame = [id: int, text: string] -annotation: org.apache.spark.sql.DataFrame = [id: int, text: string ... 8 more fields] -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -| id| text| document| token| sentence| lemma| pos| embeddings| ner| entities| -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -| 1|La FIFA ha deciso...|[[document, 0, 56...|[[token, 0, 1, La...|[[document, 0, 56...|[[token, 0, 1, La...|[[pos, 0, 1, DET,...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 3, 6, FI...| -| 2|Reims, 13 giugno ...|[[document, 0, 12...|[[token, 0, 4, Re...|[[document, 0, 12...|[[token, 0, 4, Re...|[[pos, 0, 4, PROP...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 4, Re...| -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -*/ - -annotation.select("entities.result").show(false) - -/* -+-----------------------------------+ -|result | -+-----------------------------------+ -|[FIFA, Zidane, Materazzi] | -|[Reims, Domani, Mondiali femminili]| -+-----------------------------------+ -*/ - -{% endhighlight %} - -

### Italian explain_document_md

{% highlight scala %}

import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.SparkNLP

SparkNLP.version()

val pipeline = PretrainedPipeline("explain_document_md", lang="it")

val testData = spark.createDataFrame(Seq(
(1, "La FIFA ha deciso: tre giornate a Zidane, due a Materazzi"),
(2, "Reims, 13 giugno 2019 – Domani può essere la giornata decisiva per il passaggio agli ottavi di finale dei Mondiali femminili.")
)).toDF("id", "text")

val annotation = pipeline.transform(testData)

annotation.show()

/*
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.SparkNLP
2.0.8
pipeline: com.johnsnowlabs.nlp.pretrained.PretrainedPipeline = PretrainedPipeline(explain_document_md,it,public/models)
testData: org.apache.spark.sql.DataFrame = [id: int, text: string]
annotation: org.apache.spark.sql.DataFrame = [id: int, text: string ... 8 more fields]
+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| id| text| document| token| sentence| lemma| pos| embeddings| ner| entities|
+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| 1|La FIFA ha deciso...|[[document, 0, 56...|[[token, 0, 1, La...|[[document, 0, 56...|[[token, 0, 1, La...|[[pos, 0, 1, DET,...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 9, La...|
| 2|Reims, 13 giugno ...|[[document, 0, 12...|[[token, 0, 4, Re...|[[document, 0, 12...|[[token, 0, 4, Re...|[[pos, 0, 4, PROP...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 4, Re...|
+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
*/

annotation.select("entities.result").show(false)

/*
+-------------------------------+
|result |
+-------------------------------+
|[La FIFA, Zidane, Materazzi]|
|[Reims, Domani, Mondiali] |
+-------------------------------+
*/

{% endhighlight %}

### Italian entity_recognizer_lg

{% highlight scala %}

import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.SparkNLP

SparkNLP.version()

val pipeline = PretrainedPipeline("entity_recognizer_lg", lang="it")

val testData = spark.createDataFrame(Seq(
(1, "La FIFA ha deciso: tre giornate a Zidane, due a Materazzi"),
(2, "Reims, 13 giugno 2019 – Domani può essere la giornata decisiva per il passaggio agli ottavi di finale dei Mondiali femminili.")
)).toDF("id", "text")

val annotation = pipeline.transform(testData)

annotation.show()

/*
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.SparkNLP
2.0.8
pipeline: com.johnsnowlabs.nlp.pretrained.PretrainedPipeline = PretrainedPipeline(entity_recognizer_lg,it,public/models)
testData: org.apache.spark.sql.DataFrame = [id: int, text: string]
annotation: org.apache.spark.sql.DataFrame = [id: int, text: string ... 6 more fields]
+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| id| text| document| token| sentence| embeddings| ner| entities|
+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| 1|La FIFA ha deciso...|[[document, 0, 56...|[[token, 0, 1, La...|[[document, 0, 56...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 3, 6, FI...|
| 2|Reims, 13 giugno ...|[[document, 0, 12...|[[token, 0, 4, Re...|[[document, 0, 12...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 4, Re...|
+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
*/

annotation.select("entities.result").show(false)

/*
+-----------------------------------+
|result |
+-----------------------------------+
|[FIFA, Zidane, Materazzi] |
|[Reims, Domani, Mondiali femminili]|
+-----------------------------------+
*/

{% endhighlight %}

### Italian entity_recognizer_md

{% highlight scala %}

import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.SparkNLP

SparkNLP.version()

val pipeline = PretrainedPipeline("entity_recognizer_md", lang="it")

val testData = spark.createDataFrame(Seq(
(1, "La FIFA ha deciso: tre giornate a Zidane, due a Materazzi"),
(2, "Reims, 13 giugno 2019 – Domani può essere la giornata decisiva per il passaggio agli ottavi di finale dei Mondiali femminili.")
)).toDF("id", "text")

val annotation = pipeline.transform(testData)

annotation.show()

/*
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.SparkNLP
2.0.8
pipeline: com.johnsnowlabs.nlp.pretrained.PretrainedPipeline = PretrainedPipeline(entity_recognizer_md,it,public/models)
testData: org.apache.spark.sql.DataFrame = [id: int, text: string]
annotation: org.apache.spark.sql.DataFrame = [id: int, text: string ... 6 more fields]
+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| id| text| document| token| sentence| embeddings| ner| entities|
+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| 1|La FIFA ha deciso...|[[document, 0, 56...|[[token, 0, 1, La...|[[document, 0, 56...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 9, La...|
| 2|Reims, 13 giugno ...|[[document, 0, 12...|[[token, 0, 4, Re...|[[document, 0, 12...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 4, Re...|
+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
*/

annotation.select("entities.result").show(false)

/*
+-------------------------------+
|result |
+-------------------------------+
|[La FIFA, Zidane, Materazzi]|
|[Reims, Domani, Mondiali] |
+-------------------------------+
*/

{% endhighlight %}
- -## Spanish - -{:.table-model-big} -| Pipeline | Name | Build | lang | Description | Offline | -|:-------------------------|:-----------------------|:-------|:-------|:----------|:----------| -| Explain Document Small | `explain_document_sm` | 2.4.0 | `es` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_sm_es_2.4.0_2.4_1581977077084.zip) | -| Explain Document Medium | `explain_document_md` | 2.4.0 | `es` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_md_es_2.4.0_2.4_1581976836224.zip) | -| Explain Document Large | `explain_document_lg` | 2.4.0 | `es` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_lg_es_2.4.0_2.4_1581975536033.zip) | -| Entity Recognizer Small | `entity_recognizer_sm` | 2.4.0 | `es` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_sm_es_2.4.0_2.4_1581978479912.zip) | -| Entity Recognizer Medium | `entity_recognizer_md` | 2.4.0 | `es` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_md_es_2.4.0_2.4_1581978260094.zip) | -| Entity Recognizer Large | `entity_recognizer_lg` | 2.4.0 | `es` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_lg_es_2.4.0_2.4_1581977172660.zip) | - -{:.table-model-big} -| Feature | Description | -|:----------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| **Lemma** | Trained by **Lemmatizer** annotator on **lemmatization-lists** by `Michal Měchura` | -| **POS** | Trained by **PerceptronApproach** annotator on the [Universal Dependencies](https://universaldependencies.org/treebanks/es_gsd/index.html) | -| **NER** | Trained by **NerDLApproach** annotator with **Char CNNs - BiLSTM - CRF** and **GloVe Embeddings** on the **WikiNER** corpus and supports the identification of `PER`, `LOC`, `ORG` and `MISC` entities | -|**Size**| Model size indicator, **sm**, **md**, and **lg**. The small pipelines use **glove_100d**, the medium pipelines use **glove_6B_300**, and large pipelines use **glove_840B_300** WordEmbeddings - -
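
Usage mirrors the French and Italian examples above. A minimal sketch for the Spanish medium pipeline follows; the sample sentence is only illustrative:

{% highlight scala %}
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline

// Load the Spanish medium pipeline and extract entities from a sample sentence
val pipeline = PretrainedPipeline("explain_document_md", lang="es")

val testData = spark.createDataFrame(Seq(
(1, "Gabriel García Márquez nació en Aracataca, Colombia.")
)).toDF("id", "text")

pipeline.transform(testData).select("entities.result").show(false)
{% endhighlight %}
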
- -## Russian - -{:.table-model-big} -| Pipeline | Name | Build | lang | Description | Offline | -|:-------------------------|:-----------------------|:-------|:-------|:----------|:----------| -| Explain Document Small | `explain_document_sm` | 2.4.4 | `ru` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_sm_ru_2.4.4_2.4_1584017142719.zip) | -| Explain Document Medium | `explain_document_md` | 2.4.4 | `ru` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_md_ru_2.4.4_2.4_1584016917220.zip) | -| Explain Document Large | `explain_document_lg` | 2.4.4 | `ru` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_lg_ru_2.4.4_2.4_1584015824836.zip) | -| Entity Recognizer Small | `entity_recognizer_sm` | 2.4.4 | `ru` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_sm_ru_2.4.4_2.4_1584018543619.zip) | -| Entity Recognizer Medium | `entity_recognizer_md` | 2.4.4 | `ru` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_md_ru_2.4.4_2.4_1584018332357.zip) | -| Entity Recognizer Large | `entity_recognizer_lg` | 2.4.4 | `ru` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_lg_ru_2.4.4_2.4_1584017227871.zip) | - -{:.table-model-big} -| Feature | Description | -|:----------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| **Lemma** | Trained by **Lemmatizer** annotator on the [Universal Dependencies](https://universaldependencies.org/treebanks/ru_gsd/index.html)| -| **POS** | Trained by **PerceptronApproach** annotator on the [Universal Dependencies](https://universaldependencies.org/treebanks/ru_gsd/index.html) | -| **NER** | Trained by **NerDLApproach** annotator with **Char CNNs - BiLSTM - CRF** and **GloVe Embeddings** on the **WikiNER** corpus and supports the identification of `PER`, `LOC`, `ORG` and `MISC` entities | - -
- -## Dutch - -{:.table-model-big} -| Pipeline | Name | Build | lang | Description | Offline | -|:-------------------------|:-----------------------|:-------|:-------|:----------|:----------| -| Explain Document Small | `explain_document_sm` | 2.5.0 | `nl` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_sm_nl_2.5.0_2.4_1588546621618.zip) | -| Explain Document Medium | `explain_document_md` | 2.5.0 | `nl` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_md_nl_2.5.0_2.4_1588546605329.zip) | -| Explain Document Large | `explain_document_lg` | 2.5.0 | `nl` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_lg_nl_2.5.0_2.4_1588612556770.zip) | -| Entity Recognizer Small | `entity_recognizer_sm` | 2.5.0 | `nl` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_sm_nl_2.5.0_2.4_1588546655907.zip) | -| Entity Recognizer Medium | `entity_recognizer_md` | 2.5.0 | `nl` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_md_nl_2.5.0_2.4_1588546645304.zip) | -| Entity Recognizer Large | `entity_recognizer_lg` | 2.5.0 | `nl` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_lg_nl_2.5.0_2.4_1588612569958.zip) | - -
- -## Norwegian - -{:.table-model-big} -| Pipeline | Name | Build | lang | Description | Offline | -|:-------------------------|:-----------------------|:-------|:-------|:----------|:----------| -| Explain Document Small | `explain_document_sm` | 2.5.0 | `no` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_sm_no_2.5.0_2.4_1588784132955.zip) | -| Explain Document Medium | `explain_document_md` | 2.5.0 | `no` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_md_no_2.5.0_2.4_1588783879809.zip) | -| Explain Document Large | `explain_document_lg` | 2.5.0 | `no` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_lg_no_2.5.0_2.4_1588782610672.zip) | -| Entity Recognizer Small | `entity_recognizer_sm` | 2.5.0 | `no` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_sm_no_2.5.0_2.4_1588794567766.zip) | -| Entity Recognizer Medium | `entity_recognizer_md` | 2.5.0 | `no` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_md_no_2.5.0_2.4_1588794357614.zip) | -| Entity Recognizer Large | `entity_recognizer_lg` | 2.5.0 | `no` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_lg_no_2.5.0_2.4_1588793261642.zip) | - -
- -## Polish - -{:.table-model-big} -| Pipeline | Name | Build | lang | Description | Offline | -|:-------------------------|:-----------------------|:-------|:-------|:----------|:----------| -| Explain Document Small | `explain_document_sm` | 2.5.0 | `pl` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_sm_pl_2.5.0_2.4_1588531081173.zip) | -| Explain Document Medium | `explain_document_md` | 2.5.0 | `pl` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_md_pl_2.5.0_2.4_1588530841737.zip) | -| Explain Document Large | `explain_document_lg` | 2.5.0 | `pl` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_lg_pl_2.5.0_2.4_1588529695577.zip) | -| Entity Recognizer Small | `entity_recognizer_sm` | 2.5.0 | `pl` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_sm_pl_2.5.0_2.4_1588532616080.zip) | -| Entity Recognizer Medium | `entity_recognizer_md` | 2.5.0 | `pl` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_md_pl_2.5.0_2.4_1588532376753.zip) | -| Entity Recognizer Large | `entity_recognizer_lg` | 2.5.0 | `pl` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_lg_pl_2.5.0_2.4_1588531171903.zip) | - -
- -## Portuguese - -{:.table-model-big} -| Pipeline | Name | Build | lang | Description | Offline | -|:-------------------------|:-----------------------|:-------|:-------|:----------|:----------| -| Explain Document Small | `explain_document_sm` | 2.5.0 | `pt` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_sm_pt_2.5.0_2.4_1588501423743.zip) | -| Explain Document Medium | `explain_document_md` | 2.5.0 | `pt` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_md_pt_2.5.0_2.4_1588501189804.zip) | -| Explain Document Large | `explain_document_lg` | 2.5.0 | `pt` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_lg_pt_2.5.0_2.4_1588500056427.zip) | -| Entity Recognizer Small | `entity_recognizer_sm` | 2.5.0 | `pt` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_sm_pt_2.5.0_2.4_1588502815900.zip) | -| Entity Recognizer Medium | `entity_recognizer_md` | 2.5.0 | `pt` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_md_pt_2.5.0_2.4_1588502606198.zip) | -| Entity Recognizer Large | `entity_recognizer_lg` | 2.5.0 | `pt` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_lg_pt_2.5.0_2.4_1588501526324.zip) | - -

## Multi-language

{:.table-model-big}
| Pipeline | Name | Build | lang | Description | Offline |
|:-------------------------|:-----------------------|:-------|:-------|:----------|:----------|
| LanguageDetectorDL | `detect_language_7` | 2.5.2 | `xx` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/detect_language_7_xx_2.5.0_2.4_1591875676774.zip) |
| LanguageDetectorDL | `detect_language_20` | 2.5.2 | `xx` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/detect_language_20_xx_2.5.0_2.4_1591875683182.zip) |

* The model with 7 languages: Czech, German, English, Spanish, French, Italian, and Slovak
* The model with 20 languages: Bulgarian, Czech, German, Greek, English, Spanish, Finnish, French, Croatian, Hungarian, Italian, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Swedish, Turkish, and Ukrainian
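
A language detection pipeline is used the same way as the other pretrained pipelines. A minimal sketch, assuming the detected language lands in an output column named `language`; the sample sentences are only illustrative:

{% highlight scala %}
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline

// Detect the language of each input document with the 20-language model
val pipeline = PretrainedPipeline("detect_language_20", lang="xx")

val testData = spark.createDataFrame(Seq(
(1, "Spark NLP is an open-source text processing library."),
(2, "Spark NLP est une bibliothèque open source de traitement de texte.")
)).toDF("id", "text")

pipeline.transform(testData).select("language.result").show(false)
{% endhighlight %}
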

## How to use

### Online

To use Spark NLP pretrained pipelines, you can call `PretrainedPipeline` with the pipeline's name and its language (the default is `en`):

{% highlight python %}

pipeline = PretrainedPipeline('explain_document_dl', lang='en')

{% endhighlight %}

The same in Scala:

{% highlight scala %}

val pipeline = PretrainedPipeline("explain_document_dl", lang="en")

{% endhighlight %}
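
For quick experiments on a single string, a loaded pipeline can also be used without building a DataFrame first. A minimal sketch using the pipeline's `annotate` helper:

{% highlight scala %}
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline

val pipeline = PretrainedPipeline("explain_document_dl", lang="en")

// annotate() runs the full pipeline on one string and returns a
// Map from output column name to that column's annotation results
val result = pipeline.annotate("Harry Potter is a great movie.")
println(result("entities"))
{% endhighlight %}
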

### Offline

If you have any trouble using the online pipelines or models in your environment (for example, if it is air-gapped), you can download them directly for offline use.

After downloading an offline model or pipeline and extracting it, here is how you can use it inside your code (the path can point to shared storage such as HDFS in a cluster):

{% highlight scala %}
import org.apache.spark.ml.PipelineModel

val advancedPipeline = PipelineModel.load("/tmp/explain_document_dl_en_2.0.2_2.4_1556530585689/")
// Use the loaded pipeline for prediction
advancedPipeline.transform(predictionDF)

{% endhighlight %}

#### Please check out our Models Hub for the full list of [pre-trained models](https://sparknlp.org/models) with examples, demos, benchmarks, and more
\ No newline at end of file From 257fd57f2088afa7eb2cdf6ec7e47d278dfaf376 Mon Sep 17 00:00:00 2001 From: Prabod Rathnayaka Date: Mon, 15 Jul 2024 01:04:12 +1000 Subject: [PATCH 4/7] Added custom stop token id support (#14344) --- .../scala/com/johnsnowlabs/ml/ai/LLAMA2.scala | 16 ++-- .../com/johnsnowlabs/ml/ai/Mistral.scala | 18 +++-- .../scala/com/johnsnowlabs/ml/ai/Phi2.scala | 12 ++- .../ml/ai/util/Generation/Generate.scala | 29 +++++--- .../Logit/LogitWarper/TopKLogitWarper.scala | 8 +- .../Logit/LogitWarper/TopPLogitWarper.scala | 34 +++++++-- .../util/Generation/Search/BeamScorer.scala | 4 +- .../Generation/Search/BeamSearchScorer.scala | 8 +- .../nlp/HasGeneratorProperties.scala | 15 ++++ .../seq2seq/LLAMA2Transformer.scala | 6 +- .../seq2seq/MistralTransformer.scala | 6 +- .../annotators/seq2seq/Phi2Transformer.scala | 6 +- .../LogitProcess/LogitProcessorTest.scala | 23 ++++++ .../Logit/LogitWarper/LogitWarperTest.scala | 74 +++++++++++++++++++ 14 files changed, 212 insertions(+), 47 deletions(-) create mode 100644 src/test/scala/com/johnsnowlabs/ml/ai/util/Generation/Logit/LogitWarper/LogitWarperTest.scala diff --git a/src/main/scala/com/johnsnowlabs/ml/ai/LLAMA2.scala b/src/main/scala/com/johnsnowlabs/ml/ai/LLAMA2.scala index ed3444a3059ee2..9e9757d0115c37 100644 --- a/src/main/scala/com/johnsnowlabs/ml/ai/LLAMA2.scala +++ b/src/main/scala/com/johnsnowlabs/ml/ai/LLAMA2.scala @@ -79,8 +79,8 @@ private[johnsnowlabs] class LLAMA2( */ def encode(sentences: Seq[Annotation]): Seq[Array[Int]] = { sentences.map(s => { - val sentWithTask = s.result - spp.getSppModel.encodeAsIds(sentWithTask) + val sentWithTask = "_" + s.result + Array(bosTokenId) ++ spp.getSppModel.encodeAsIds(sentWithTask) }) } @@ -97,7 +97,8 @@ private[johnsnowlabs] class LLAMA2( randomSeed: Option[Long], ignoreTokenIds: Array[Int] = Array(), beamSize: Int, - maxInputLength: Int): Array[Array[Int]] = { + maxInputLength: Int, + stopTokenIds: Array[Int]): Array[Array[Int]] = { val ignoreTokenIdsInt = ignoreTokenIds val expandedDecoderInputsVals = batch val sequencesLength = expandedDecoderInputsVals.map(x => x.length).toArray @@ -165,7 +166,8 @@ private[johnsnowlabs] class LLAMA2( ignoreTokenIdsInt, session, applySoftmax = true, - ovInferRequest = ovInferRequest) + ovInferRequest = ovInferRequest, + stopTokenIds = stopTokenIds) modelOutputs } @@ -184,7 +186,8 @@ private[johnsnowlabs] class LLAMA2( randomSeed: Option[Long] = None, ignoreTokenIds: Array[Int] = Array(), beamSize: Int, - maxInputLength: Int): Seq[Annotation] = { + maxInputLength: Int, + stopTokenIds: Array[Int]): Seq[Annotation] = { val batchDecoder = sentences.grouped(batchSize).toArray.flatMap { batch => val batchSP = encode(batch) @@ -201,7 +204,8 @@ private[johnsnowlabs] class LLAMA2( randomSeed, ignoreTokenIds, beamSize, - maxInputLength) + maxInputLength, + stopTokenIds) decode(spIds) diff --git a/src/main/scala/com/johnsnowlabs/ml/ai/Mistral.scala b/src/main/scala/com/johnsnowlabs/ml/ai/Mistral.scala index 58d074a90cba32..e37ee56abac5e5 100644 --- a/src/main/scala/com/johnsnowlabs/ml/ai/Mistral.scala +++ b/src/main/scala/com/johnsnowlabs/ml/ai/Mistral.scala @@ -78,8 +78,8 @@ private[johnsnowlabs] class Mistral( */ def encode(sentences: Seq[Annotation]): Seq[Array[Int]] = { sentences.map(s => { - val sentWithTask = s.result - spp.getSppModel.encodeAsIds(sentWithTask) + val sentWithTask = "_" + s.result + Array(bosTokenId) ++ spp.getSppModel.encodeAsIds(sentWithTask) }) } @@ -96,7 +96,8 @@ private[johnsnowlabs] class Mistral( randomSeed: Option[Long], 
ignoreTokenIds: Array[Int] = Array(), beamSize: Int, - maxInputLength: Int): Array[Array[Int]] = { + maxInputLength: Int, + stopTokenIds: Array[Int] = Array()): Array[Array[Int]] = { val ignoreTokenIdsInt = ignoreTokenIds val expandedDecoderInputsVals = batch val sequencesLength = expandedDecoderInputsVals.map(x => x.length).toArray @@ -162,8 +163,9 @@ private[johnsnowlabs] class Mistral( randomSeed, ignoreTokenIdsInt, session, - applySoftmax = false, - ovInferRequest = ovInferRequest) + applySoftmax = true, + ovInferRequest = ovInferRequest, + stopTokenIds = stopTokenIds) // decoderOutputs modelOutputs @@ -183,7 +185,8 @@ private[johnsnowlabs] class Mistral( randomSeed: Option[Long] = None, ignoreTokenIds: Array[Int] = Array(), beamSize: Int, - maxInputLength: Int): Seq[Annotation] = { + maxInputLength: Int, + stopTokenIds: Array[Int]): Seq[Annotation] = { val batchDecoder = sentences.grouped(batchSize).toArray.flatMap { batch => val batchSP = encode(batch) @@ -200,7 +203,8 @@ private[johnsnowlabs] class Mistral( randomSeed, ignoreTokenIds, beamSize, - maxInputLength) + maxInputLength, + stopTokenIds) decode(spIds) diff --git a/src/main/scala/com/johnsnowlabs/ml/ai/Phi2.scala b/src/main/scala/com/johnsnowlabs/ml/ai/Phi2.scala index 400a103abb22cd..36fa9927431663 100644 --- a/src/main/scala/com/johnsnowlabs/ml/ai/Phi2.scala +++ b/src/main/scala/com/johnsnowlabs/ml/ai/Phi2.scala @@ -103,7 +103,8 @@ private[johnsnowlabs] class Phi2( randomSeed: Option[Long], ignoreTokenIds: Array[Int] = Array(), beamSize: Int, - maxInputLength: Int): Array[Array[Int]] = { + maxInputLength: Int, + stopTokenIds: Array[Int]): Array[Array[Int]] = { val ignoreTokenIdsInt = ignoreTokenIds val expandedDecoderInputsVals = batch val sequencesLength = expandedDecoderInputsVals.map(x => x.length).toArray @@ -169,7 +170,8 @@ private[johnsnowlabs] class Phi2( ignoreTokenIdsInt, session, applySoftmax = false, - ovInferRequest = ovInferRequest) + ovInferRequest = ovInferRequest, + stopTokenIds = stopTokenIds) // decoderOutputs modelOutputs @@ -189,7 +191,8 @@ private[johnsnowlabs] class Phi2( randomSeed: Option[Long] = None, ignoreTokenIds: Array[Int] = Array(), beamSize: Int, - maxInputLength: Int): Seq[Annotation] = { + maxInputLength: Int, + stopTokenIds: Array[Int]): Seq[Annotation] = { val batchDecoder = sentences.grouped(batchSize).toArray.flatMap { batch => val batchSP = encode(batch) @@ -206,7 +209,8 @@ private[johnsnowlabs] class Phi2( randomSeed, ignoreTokenIds, beamSize, - maxInputLength) + maxInputLength, + stopTokenIds) decode(spIds) diff --git a/src/main/scala/com/johnsnowlabs/ml/ai/util/Generation/Generate.scala b/src/main/scala/com/johnsnowlabs/ml/ai/util/Generation/Generate.scala index 4e4140f7735ab2..24d2ac1d3f6696 100644 --- a/src/main/scala/com/johnsnowlabs/ml/ai/util/Generation/Generate.scala +++ b/src/main/scala/com/johnsnowlabs/ml/ai/util/Generation/Generate.scala @@ -104,7 +104,8 @@ trait Generate { ignoreTokenIds: Array[Int] = Array(), session: Either[Session, (OrtEnvironment, OrtSession)], applySoftmax: Boolean = true, - ovInferRequest: Option[InferRequest] = None): Array[Array[Int]] = { + ovInferRequest: Option[InferRequest] = None, + stopTokenIds: Array[Int] = Array()): Array[Array[Int]] = { // TODO: Add support for ignoreTokenIds @@ -117,8 +118,8 @@ trait Generate { noRepeatNgramSize = noRepeatNgramSize, vocabSize = vocabSize)) - logitProcessorList.addProcess( - new MinLengthLogitProcessor(eosTokenId, minOutputLength, vocabSize)) +// logitProcessorList.addProcess( +// new 
MinLengthLogitProcessor(eosTokenId, minOutputLength, vocabSize)) logitProcessorList.addProcess(new TemperatureLogitWarper(temperature)) @@ -148,7 +149,8 @@ trait Generate { randomSeed, session, applySoftmax, - ovInferRequest) + ovInferRequest, + stopTokenIds) } /** Beam Search for text generation @@ -193,7 +195,8 @@ trait Generate { randomSeed: Option[Long], session: Either[Session, (OrtEnvironment, OrtSession)], applySoftmax: Boolean, - ovInferRequest: Option[InferRequest] = None): Array[Array[Int]] = { + ovInferRequest: Option[InferRequest] = None, + stopTokenIds: Array[Int] = Array()): Array[Array[Int]] = { val inputIds = inputIdsVal val batchSize = beamScorer.getBeamHypothesesSeq.length val numBeams = beamScorer.getNumBeams @@ -227,21 +230,22 @@ trait Generate { // Optionally Apply log softmax to model outputs var nextTokenScores = if (applySoftmax) nextTokenLogits.map(logSoftmax) else nextTokenLogits - // Process the logits by defined logit processors val nextTokenScoresProcessed = logitProcessor.process(expandedInputs, nextTokenScores, currentLength) + // Process the logits by defined logit warpers + if (doSample) { + nextTokenScores = + logitProcessor.warp(expandedInputs, nextTokenScoresProcessed, currentLength) + } // Add previous beam scores to the output - nextTokenScores = nextTokenScoresProcessed.zipWithIndex.map { case (x, ind1) => + nextTokenScores = nextTokenScores.zipWithIndex.map { case (x, ind1) => x.zipWithIndex.map { case (y, _) => y + beamScores(ind1) } } - // Process the logits by defined logit warpers - if (doSample) { - nextTokenScores = logitProcessor.warp(expandedInputs, nextTokenScores, currentLength) - } + // Reshape next token score to (batchSize, vocabSize * numBeams) val vocabSize = nextTokenScores.head.length val reshapedNextTokenScores = @@ -290,7 +294,8 @@ trait Generate { padTokenId, eosTokenId, beamIndices, - currentLength) + currentLength, + stopTokenIds) val newBeamScores = beamOutputs._1.flatMap(_.toList) val beamNextTokens = beamOutputs._2.flatMap(_.toList) val beamIdx = beamOutputs._3.flatMap(_.toList) diff --git a/src/main/scala/com/johnsnowlabs/ml/ai/util/Generation/Logit/LogitWarper/TopKLogitWarper.scala b/src/main/scala/com/johnsnowlabs/ml/ai/util/Generation/Logit/LogitWarper/TopKLogitWarper.scala index 4d60a0e1684eda..f63fbba4ea7b1a 100644 --- a/src/main/scala/com/johnsnowlabs/ml/ai/util/Generation/Logit/LogitWarper/TopKLogitWarper.scala +++ b/src/main/scala/com/johnsnowlabs/ml/ai/util/Generation/Logit/LogitWarper/TopKLogitWarper.scala @@ -43,7 +43,13 @@ class TopKLogitWarper( } private def getTopKIndices(logits: Array[Float], k: Int): Array[Int] = { - logits.indices.sortBy(logits(_)).reverse.take(k).toArray + // ignore float.NegativeInfinity values + val topKIndices = new ArrayBuffer[Int]() + val sortedLogits = logits.zipWithIndex.filter(_._1 != filterValue).sortBy(-_._1) + for ((_, i) <- sortedLogits.take(k)) { + topKIndices += i + } + topKIndices.toArray } private def maskNotTopKValues(logits: Array[Float], topKIndices: Array[Int]): Array[Float] = { diff --git a/src/main/scala/com/johnsnowlabs/ml/ai/util/Generation/Logit/LogitWarper/TopPLogitWarper.scala b/src/main/scala/com/johnsnowlabs/ml/ai/util/Generation/Logit/LogitWarper/TopPLogitWarper.scala index 85e0dcf0e2893a..9c0ce72c6e45ce 100644 --- a/src/main/scala/com/johnsnowlabs/ml/ai/util/Generation/Logit/LogitWarper/TopPLogitWarper.scala +++ b/src/main/scala/com/johnsnowlabs/ml/ai/util/Generation/Logit/LogitWarper/TopPLogitWarper.scala @@ -24,13 +24,13 @@ class TopPLogitWarper(val p: 
Double, val minTokensToKeep: Int = 1) extends Logit val logitsUpd = scores.map(_.clone()) // Deep copy of the scores if (p < 1.0) { - val scoresFiltered = scores.map(_.filterNot(_.isInfinite)) // Filter out infinite values - val scoresShape = Array(scoresFiltered.length, scoresFiltered.head.length) - val topPThreshold = math.ceil(p * scoresShape.last).toInt // Determine top-p threshold + val scoresFiltered = scores // Filter out infinite values + val scoresSoftmaxed = scoresFiltered.map(softmax) // Softmax the scores - for ((logits, i) <- scores.zipWithIndex) { - val topPIndices = getTopPIndices(logits, topPThreshold) - val maskedValues = maskNotTopPValues(logits, topPIndices) + for ((logits, i) <- scoresSoftmaxed.zipWithIndex) { + val topPIndices = getTopPIndices(logits, p) + // Mask the values that are not in the top-p + val maskedValues = maskNotTopPValues(logitsUpd(i), topPIndices) logitsUpd(i) = maskedValues } } @@ -38,8 +38,26 @@ class TopPLogitWarper(val p: Double, val minTokensToKeep: Int = 1) extends Logit logitsUpd } - private def getTopPIndices(logits: Array[Float], k: Int): Array[Int] = { - logits.zipWithIndex.sortBy(-_._1).take(k).map(_._2) + private def getTopPIndices(logits: Array[Float], p: Double): Array[Int] = { + // sort the logits in descending order + var sortedLogits = logits.zipWithIndex.sortBy(-_._1) + + // filter out the negative infinity values + sortedLogits = sortedLogits.filter(_._1 > 0.0) + + // cumulative sum of the probabilities + val cumSum = sortedLogits.map(_._1).scanLeft(0.0)(_ + _) + + // find the index of the last element that is less than p + val lastIdx = cumSum.indexWhere(_ >= p) + // if the last index is less than the minimum tokens to keep, return the top p tokens + + if (lastIdx < minTokensToKeep) { + sortedLogits.take(math.ceil(p * logits.length).toInt).map(_._2) + } else { + sortedLogits.take(lastIdx).map(_._2) + } + } private def maskNotTopPValues(logits: Array[Float], topPIndices: Array[Int]): Array[Float] = { diff --git a/src/main/scala/com/johnsnowlabs/ml/ai/util/Generation/Search/BeamScorer.scala b/src/main/scala/com/johnsnowlabs/ml/ai/util/Generation/Search/BeamScorer.scala index 2fcbcada95337f..9f6eaed16b6361 100644 --- a/src/main/scala/com/johnsnowlabs/ml/ai/util/Generation/Search/BeamScorer.scala +++ b/src/main/scala/com/johnsnowlabs/ml/ai/util/Generation/Search/BeamScorer.scala @@ -26,7 +26,8 @@ abstract class BeamScorer() { padTokenId: Int, eosTokenId: Int, beamIndices: Seq[Array[Int]], - currentLength: Int): (Array[Array[Float]], Array[Array[Int]], Array[Array[Int]]) + currentLength: Int, + stopTokenIds: Array[Int]): (Array[Array[Float]], Array[Array[Int]], Array[Array[Int]]) def finalize( inputIds: Seq[Array[Int]], @@ -40,4 +41,5 @@ abstract class BeamScorer() { def getBeamHypothesesSeq: Seq[BeamHypotheses] def getNumBeams: Int def isDone: Boolean + def getDone: Array[Boolean] } diff --git a/src/main/scala/com/johnsnowlabs/ml/ai/util/Generation/Search/BeamSearchScorer.scala b/src/main/scala/com/johnsnowlabs/ml/ai/util/Generation/Search/BeamSearchScorer.scala index fbc4cb466215bc..577da0571698ab 100644 --- a/src/main/scala/com/johnsnowlabs/ml/ai/util/Generation/Search/BeamSearchScorer.scala +++ b/src/main/scala/com/johnsnowlabs/ml/ai/util/Generation/Search/BeamSearchScorer.scala @@ -43,6 +43,8 @@ class BeamSearchScorer( override def getNumBeams: Int = numBeams private val done: Array[Boolean] = Array.fill(batchSize)(false) + override def getDone: Array[Boolean] = done + override def process( inputIds: Seq[Array[Int]], nextScores: 
Seq[Array[Float]], @@ -51,7 +53,8 @@ class BeamSearchScorer( padTokenId: Int, eosTokenId: Int, beamIndices: Seq[Array[Int]], - currentLength: Int): (Array[Array[Float]], Array[Array[Int]], Array[Array[Int]]) = { + currentLength: Int, + stopTokenIds: Array[Int]): (Array[Array[Float]], Array[Array[Int]], Array[Array[Int]]) = { // val currentLength = inputIds.length val batchSize = this.beamHypothesesSeq.length val nextBeamScores = Array.ofDim[Float](batchSize, this.beamSize) @@ -75,7 +78,8 @@ class BeamSearchScorer( val nextIndex = nextIndices(batchIdx)(beamTokenRank) val batchBeamIdx = batchIdx * this.beamSize + nextIndex - if (eosTokenId == nextToken) { + // either eos token or stop tokens are found + if (eosTokenId == nextToken || stopTokenIds.contains(nextToken)) { if (beamTokenRank >= this.beamSize) { break } diff --git a/src/main/scala/com/johnsnowlabs/nlp/HasGeneratorProperties.scala b/src/main/scala/com/johnsnowlabs/nlp/HasGeneratorProperties.scala index eeddad13aacd32..6f13d946a21bc3 100644 --- a/src/main/scala/com/johnsnowlabs/nlp/HasGeneratorProperties.scala +++ b/src/main/scala/com/johnsnowlabs/nlp/HasGeneratorProperties.scala @@ -222,4 +222,19 @@ trait HasGeneratorProperties { /** @group getParam */ def getNReturnSequences: Int = $(nReturnSequences) + + /** Stop tokens to terminate the generation + * + * @group param + */ + var stopTokenIds = + new IntArrayParam(this, "stopTokens", "Stop tokens to terminate the generation") + + /** @group setParam */ + def setStopTokenIds(value: Array[Int]): this.type = { + set(stopTokenIds, value) + } + + /** @group getParam */ + def getStopTokenIds: Array[Int] = $(stopTokenIds) } diff --git a/src/main/scala/com/johnsnowlabs/nlp/annotators/seq2seq/LLAMA2Transformer.scala b/src/main/scala/com/johnsnowlabs/nlp/annotators/seq2seq/LLAMA2Transformer.scala index b9c114ea62de5f..9ecec85caa2520 100644 --- a/src/main/scala/com/johnsnowlabs/nlp/annotators/seq2seq/LLAMA2Transformer.scala +++ b/src/main/scala/com/johnsnowlabs/nlp/annotators/seq2seq/LLAMA2Transformer.scala @@ -235,7 +235,8 @@ class LLAMA2Transformer(override val uid: String) ignoreTokenIds -> Array(), batchSize -> 1, beamSize -> 1, - maxInputLength -> 4096) + maxInputLength -> 4096, + stopTokenIds -> Array()) /** takes a document and annotations and produces new annotations of this annotator's annotation * type @@ -269,7 +270,8 @@ class LLAMA2Transformer(override val uid: String) randomSeed = this.randomSeed, ignoreTokenIds = $(ignoreTokenIds), beamSize = $(beamSize), - maxInputLength = $(maxInputLength)) + maxInputLength = $(maxInputLength), + stopTokenIds = $(stopTokenIds)) } else { Seq() } diff --git a/src/main/scala/com/johnsnowlabs/nlp/annotators/seq2seq/MistralTransformer.scala b/src/main/scala/com/johnsnowlabs/nlp/annotators/seq2seq/MistralTransformer.scala index 43ab7a9f6264dd..ba2cf5af900030 100644 --- a/src/main/scala/com/johnsnowlabs/nlp/annotators/seq2seq/MistralTransformer.scala +++ b/src/main/scala/com/johnsnowlabs/nlp/annotators/seq2seq/MistralTransformer.scala @@ -243,7 +243,8 @@ class MistralTransformer(override val uid: String) ignoreTokenIds -> Array(), batchSize -> 1, beamSize -> 1, - maxInputLength -> 4096) + maxInputLength -> 4096, + stopTokenIds -> Array()) /** takes a document and annotations and produces new annotations of this annotator's annotation * type @@ -277,7 +278,8 @@ class MistralTransformer(override val uid: String) randomSeed = this.randomSeed, ignoreTokenIds = $(ignoreTokenIds), beamSize = $(beamSize), - maxInputLength = $(maxInputLength)) + 
maxInputLength = $(maxInputLength),
+        stopTokenIds = $(stopTokenIds))
     } else {
       Seq()
     }
diff --git a/src/main/scala/com/johnsnowlabs/nlp/annotators/seq2seq/Phi2Transformer.scala b/src/main/scala/com/johnsnowlabs/nlp/annotators/seq2seq/Phi2Transformer.scala
index fbb16fa7e13ea2..ecb8dbb88f768c 100644
--- a/src/main/scala/com/johnsnowlabs/nlp/annotators/seq2seq/Phi2Transformer.scala
+++ b/src/main/scala/com/johnsnowlabs/nlp/annotators/seq2seq/Phi2Transformer.scala
@@ -266,7 +266,8 @@ class Phi2Transformer(override val uid: String)
     ignoreTokenIds -> Array(),
     batchSize -> 1,
     beamSize -> 1,
-    maxInputLength -> 4096)
+    maxInputLength -> 4096,
+    stopTokenIds -> Array())
 
   /** takes a document and annotations and produces new annotations of this annotator's annotation
    * type
@@ -300,7 +301,8 @@ class Phi2Transformer(override val uid: String)
         randomSeed = this.randomSeed,
         ignoreTokenIds = $(ignoreTokenIds),
         beamSize = $(beamSize),
-        maxInputLength = $(maxInputLength))
+        maxInputLength = $(maxInputLength),
+        stopTokenIds = $(stopTokenIds))
     } else {
       Seq()
     }
diff --git a/src/test/scala/com/johnsnowlabs/ml/ai/util/Generation/Logit/LogitProcess/LogitProcessorTest.scala b/src/test/scala/com/johnsnowlabs/ml/ai/util/Generation/Logit/LogitProcess/LogitProcessorTest.scala
index 8ea0a1a5a26b71..c21fe6079259a2 100644
--- a/src/test/scala/com/johnsnowlabs/ml/ai/util/Generation/Logit/LogitProcess/LogitProcessorTest.scala
+++ b/src/test/scala/com/johnsnowlabs/ml/ai/util/Generation/Logit/LogitProcess/LogitProcessorTest.scala
@@ -69,4 +69,27 @@ class LogitProcessorTest extends AnyFlatSpec {
     assert(forcedScoresMultiple(1) == 0)
   }
 
+  "MinLengthLogitProcessor" should "process correctly" taggedAs FastTest in {
+
+    val vocabSize = 32
+    val scoresBatches: Array[Array[Float]] = Array(Array.fill(vocabSize)(1.0f))
+
+    val minLength = 2
+    val minLengthLogitProcessor = new MinLengthLogitProcessor(
+      eosTokenId = vocabSize - 1,
+      minLength = minLength,
+      vocabSize = vocabSize)
+
+    // if the min length is not reached, the eos token should be suppressed
+    val processedScores =
+      minLengthLogitProcessor.call(Seq.empty, scoresBatches, minLength - 1).head
+
+    assert(processedScores(vocabSize - 1) == Float.NegativeInfinity)
+
+    // if the min length is reached, the eos token should not be suppressed
+    val processedScoresAfter =
+      minLengthLogitProcessor.call(Seq.empty, scoresBatches, minLength).head
+
+    assert(processedScoresAfter(vocabSize - 1) == 1.0f)
+  }
 }
diff --git a/src/test/scala/com/johnsnowlabs/ml/ai/util/Generation/Logit/LogitWarper/LogitWarperTest.scala b/src/test/scala/com/johnsnowlabs/ml/ai/util/Generation/Logit/LogitWarper/LogitWarperTest.scala
new file mode 100644
index 00000000000000..7f7112f5879ce9
--- /dev/null
+++ b/src/test/scala/com/johnsnowlabs/ml/ai/util/Generation/Logit/LogitWarper/LogitWarperTest.scala
@@ -0,0 +1,74 @@
+package com.johnsnowlabs.ml.ai.util.Generation.Logit.LogitWarper
+
+import com.johnsnowlabs.tags.FastTest
+import org.scalatest.flatspec.AnyFlatSpec
+
+class LogitWarperTest extends AnyFlatSpec {
+
+  "TopKLogitWarper" should "process correctly" taggedAs FastTest in {
+    val vocabSize = 10
+    val topK = 5
+
+    val logitWarper = new TopKLogitWarper(k = topK, minTokensToKeep = 1)
+    val scoresBatches: Array[Array[Float]] =
+      Array(Array(0.1f, 0.2f, 0.3f, 0.4f, 0.5f, 0.6f, 0.7f, 0.8f, 0.9f, 1.0f))
+
+    val processedScores = logitWarper.call(Seq.empty, scoresBatches, 1).head
+
+    // Check that the top 5 scores are kept unchanged and the rest are set to -inf
+    assert(processedScores(0) == Float.NegativeInfinity)
+    assert(processedScores(1) == Float.NegativeInfinity)
+    assert(processedScores(2) == Float.NegativeInfinity)
+    assert(processedScores(3) == Float.NegativeInfinity)
+    assert(processedScores(4) == Float.NegativeInfinity)
+    assert(processedScores(5) == 0.6f)
+    assert(processedScores(6) == 0.7f)
+    assert(processedScores(7) == 0.8f)
+    assert(processedScores(8) == 0.9f)
+    assert(processedScores(9) == 1.0f)
+
+  }
+
+  "TemperatureLogitWarper" should "process correctly" taggedAs FastTest in {
+    val vocabSize = 10
+    val temperature = 0.5f
+
+    val logitWarper = new TemperatureLogitWarper(temperature = temperature)
+    val scoresBatches: Array[Array[Float]] =
+      Array(Array(0.1f, 0.2f, 0.3f, 0.4f, 0.5f, 0.6f, 0.7f, 0.8f, 0.9f, 1.0f))
+
+    val processedScores = logitWarper.call(Seq.empty, scoresBatches, 1).head
+
+    // Check that the scores are correctly scaled by the temperature
+    processedScores.zipWithIndex.foreach({ case (score, i) =>
+      assert(score == scoresBatches(0)(i) / temperature)
+    })
+
+  }
+
+  "TopPLogitWarper" should "process correctly" taggedAs FastTest in {
+    val vocabSize = 10
+    val topP = 0.5f
+
+    val logitWarper = new TopPLogitWarper(p = topP, minTokensToKeep = 1)
+    val scoresBatches: Array[Array[Float]] =
+      Array(Array(0.1f, 0.2f, 0.3f, 0.4f, 0.5f, 0.6f, 0.7f, 0.8f, 0.9f, Float.NegativeInfinity))
+
+    val processedScores = logitWarper.call(Seq.empty, scoresBatches, 1).head
+
+    // print out the processed scores
+    processedScores.foreach(println)
+
+    // Check that only the tokens inside the top-p probability mass (indices 5 through 8) keep their scores; the rest are -inf
+    assert(processedScores(0) == Float.NegativeInfinity)
+    assert(processedScores(1) == Float.NegativeInfinity)
+    assert(processedScores(2) == Float.NegativeInfinity)
+    assert(processedScores(3) == Float.NegativeInfinity)
+    assert(processedScores(4) == Float.NegativeInfinity)
+    assert(processedScores(5) !== Float.NegativeInfinity)
+    assert(processedScores(6) !== Float.NegativeInfinity)
+    assert(processedScores(7) !== Float.NegativeInfinity)
+    assert(processedScores(8) !== Float.NegativeInfinity)
+    assert(processedScores(9) == Float.NegativeInfinity)
+  }
+}
From 61e1892738d6d62775fcccb236725a7c90accf08 Mon Sep 17 00:00:00 2001
From: David Cecchini
Date: Sun, 14 Jul 2024 12:04:43 -0300
Subject: [PATCH 5/7] Update 2023-03-01-t5_flan_base_xx.md (#14345)

Fix typo in example code.
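For context, `DocumentAssembler` reads a single raw text column and therefore exposes the singular `setInputCol`/`setOutputCol` setters, while downstream annotators such as `T5Transformer` consume annotation columns through the plural `setInputCols`. A minimal sketch of the corrected pipeline (the `t5_flan_base` model name comes from the page's own example; the output column name here is illustrative):

```scala
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.seq2seq.T5Transformer
import org.apache.spark.ml.Pipeline

// DocumentAssembler takes one raw text column, hence the singular setters
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")      // was mistakenly .setInputCols("text")
  .setOutputCol("document") // was mistakenly .setOutputCols("document")

// Annotators read annotation columns, hence the plural setInputCols
val t5 = T5Transformer.pretrained("t5_flan_base", "xx")
  .setInputCols("document")
  .setOutputCol("output")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
```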
---
 docs/_posts/Cabir40/2023-03-01-t5_flan_base_xx.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/_posts/Cabir40/2023-03-01-t5_flan_base_xx.md b/docs/_posts/Cabir40/2023-03-01-t5_flan_base_xx.md
index a62fda8156585a..f0641bc33902a3 100644
--- a/docs/_posts/Cabir40/2023-03-01-t5_flan_base_xx.md
+++ b/docs/_posts/Cabir40/2023-03-01-t5_flan_base_xx.md
@@ -50,8 +50,8 @@ result = pipeline.fit(data).transform(data)
 ```
 ```scala
 val documentAssembler = new DocumentAssembler()
-.setInputCols("text")
-.setOutputCols("document")
+.setInputCol("text")
+.setOutputCol("document")
 
 val t5 = T5Transformer.pretrained("t5_flan_base","xx")
 .setInputCols("document")
@@ -81,4 +81,4 @@ val result = pipeline.fit(data).transform(data)
 
 ## References
 
-https://huggingface.co/google/flan-t5-base
\ No newline at end of file
+https://huggingface.co/google/flan-t5-base
From c18508dea96f5d58fa7192f4f2667b91a02efe2e Mon Sep 17 00:00:00 2001
From: Maziyar Panahi
Date: Sun, 14 Jul 2024 18:27:47 +0200
Subject: [PATCH 6/7] Bump version to 5.4.1 [run doc]

---
 CHANGELOG | 15 ++++
 README.md | 11 +--
 build.sbt | 2 +-
 docs/_layouts/landing.html | 2 +-
 docs/en/concepts.md | 2 +-
 docs/en/examples.md | 4 +-
 docs/en/hardware_acceleration.md | 2 +-
 docs/en/install.md | 54 ++++++------
 docs/en/spark_nlp.md | 2 +-
 python/README.md | 88 +++++++++----------
 python/docs/conf.py | 2 +-
 python/setup.py | 2 +-
 python/sparknlp/__init__.py | 4 +-
 scripts/colab_setup.sh | 2 +-
 scripts/kaggle_setup.sh | 2 +-
 scripts/sagemaker_setup.sh | 2 +-
 .../scala/com/johnsnowlabs/nlp/SparkNLP.scala | 2 +-
 .../scala/com/johnsnowlabs/util/Build.scala | 3 +-
 18 files changed, 109 insertions(+), 92 deletions(-)

diff --git a/CHANGELOG b/CHANGELOG
index a7d44214610baf..e61057fcec5940 100644
--- a/CHANGELOG
+++ b/CHANGELOG
@@ -1,3 +1,18 @@
+========
+5.4.1
+========
+----------------
+New Features & Enhancements
+----------------
+* Added support for loading duplicate models in Spark NLP, allowing multiple models from the same annotator to be loaded simultaneously.
+* Updated the README for better coherence and added new pages to the website.
+* Added support for a stop IDs list to halt text generation in Phi, Mistral, and Llama annotators.
+
+----------------
+Bug Fixes
+----------------
+* Fixed the default model names for Phi2 and Mistral AI annotators.
+
 ========
 5.4.0
 ========
diff --git a/README.md b/README.md
index fe8f9fe9fcc625..d36039f3940836 100644
--- a/README.md
+++ b/README.md
@@ -166,7 +166,7 @@ Spark NLP 5.4.0 has been tested and is compatible with the following EMR release
 We are compatible with older EMR releases. For a full list check EMR support in our official [documentation](https://sparknlp.org/docs/en/install#emr-support)
 
 Full list of [Amazon EMR 6.x releases](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-6x.html)
 Full list of [Amazon EMR 7.x releases](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-7x.html)
 
 NOTE: The EMR 6.1.0 and 6.1.1 are not supported.
 
@@ -182,7 +182,7 @@ deployed to Maven central. To add any of our packages as a dependency in your ap
 from our official documentation.
If you are interested, there is a simple SBT project for Spark NLP to guide you on how to use it in your
 projects [Spark NLP SBT Starter](https://github.com/maziyarpanahi/spark-nlp-starter)
 
 ### Python
 
@@ -215,7 +215,7 @@ to use Spark NLP offline
 
 ## Advanced Settings
 
 You can change Spark NLP configurations via Spark properties configuration.
 Please check [these instructions](https://sparknlp.org/docs/en/install#sparknlp-properties) from our official documentation.
 
 ### S3 Integration
 
 In Spark NLP we can define S3 locations to:
 
 Please check [these instructions](https://sparknlp.org/docs/en/install#s3-integration) from our official documentation.
 
 ## Documentation
 
 ### Examples
 
@@ -260,7 +260,7 @@ the Spark NLP library:
     keywords = {Spark, Natural language processing, Deep learning, Tensorflow, Cluster},
     abstract = {Spark NLP is a Natural Language Processing (NLP) library built on top of Apache Spark ML. It provides simple, performant & accurate NLP annotations for machine learning pipelines that can scale easily in a distributed environment. Spark NLP comes with 1100+ pretrained pipelines and models in more than 192+ languages. It supports nearly all the NLP tasks and modules that can be used seamlessly in a cluster. Downloaded more than 2.7 million times and experiencing 9x growth since January 2020, Spark NLP is used by 54% of healthcare organizations as the world’s most widely used NLP library in the enterprise.}
 }
 }
 ```
 
 ## Community support
 
@@ -288,3 +288,4 @@ Clone the repo and submit your pull-requests! Or directly create issues in this
 
 ## John Snow Labs
 
 [http://johnsnowlabs.com](http://johnsnowlabs.com)
diff --git a/build.sbt b/build.sbt
index 9e0e57ac29e51b..15ed5090c58334 100644
--- a/build.sbt
+++ b/build.sbt
@@ -6,7 +6,7 @@ name := getPackageName(is_silicon, is_gpu, is_aarch64)
 
 organization := "com.johnsnowlabs.nlp"
 
-version := "5.4.0"
+version := "5.4.1"
 
 (ThisBuild / scalaVersion) := scalaVer
 
diff --git a/docs/_layouts/landing.html b/docs/_layouts/landing.html
index ee4766b9904aa2..e6201dedfbe193 100755
--- a/docs/_layouts/landing.html
+++ b/docs/_layouts/landing.html
@@ -201,7 +201,7 @@

{{ _section.title }}

{% highlight bash %} # Using PyPI - $ pip install spark-nlp==5.4.0 + $ pip install spark-nlp==5.4.1 # Using Anaconda/Conda $ conda install -c johnsnowlabs spark-nlp diff --git a/docs/en/concepts.md b/docs/en/concepts.md index 61295da699db91..b6ea00a8d1c171 100644 --- a/docs/en/concepts.md +++ b/docs/en/concepts.md @@ -66,7 +66,7 @@ $ java -version $ conda create -n sparknlp python=3.7 -y $ conda activate sparknlp # spark-nlp by default is based on pyspark 3.x -$ pip install spark-nlp==5.4.0 pyspark==3.3.1 jupyter +$ pip install spark-nlp==5.4.1 pyspark==3.3.1 jupyter $ jupyter notebook ``` diff --git a/docs/en/examples.md b/docs/en/examples.md index adc9b982acf24b..9ba7698fa65afc 100644 --- a/docs/en/examples.md +++ b/docs/en/examples.md @@ -18,7 +18,7 @@ $ java -version # should be Java 8 (Oracle or OpenJDK) $ conda create -n sparknlp python=3.7 -y $ conda activate sparknlp -$ pip install spark-nlp==5.4.0 pyspark==3.3.1 +$ pip install spark-nlp==5.4.1 pyspark==3.3.1 ```
@@ -40,7 +40,7 @@ This script comes with the two options to define `pyspark` and `spark-nlp` versi # -p is for pyspark # -s is for spark-nlp # by default they are set to the latest -!bash colab.sh -p 3.2.3 -s 5.4.0 +!bash colab.sh -p 3.2.3 -s 5.4.1 ``` [Spark NLP quick start on Google Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/quick_start_google_colab.ipynb) is a live demo on Google Colab that performs named entity recognitions and sentiment analysis by using Spark NLP pretrained pipelines. diff --git a/docs/en/hardware_acceleration.md b/docs/en/hardware_acceleration.md index eaa8802d53a55f..ca87d75debc680 100644 --- a/docs/en/hardware_acceleration.md +++ b/docs/en/hardware_acceleration.md @@ -49,7 +49,7 @@ Since the new Transformer models such as BERT for Word and Sentence embeddings a | DeBERTa Large | +477%(5.8x) | | Longformer Base | +52%(1.5x) | -Spark NLP 5.4.0 is built with TensorFlow 2.7.1 and the following NVIDIA® software are only required for GPU support: +Spark NLP 5.4.1 is built with TensorFlow 2.7.1 and the following NVIDIA® software are only required for GPU support: - NVIDIA® GPU drivers version 450.80.02 or higher - CUDA® Toolkit 11.2 diff --git a/docs/en/install.md b/docs/en/install.md index 3d32683830df96..eb4cb54d728905 100644 --- a/docs/en/install.md +++ b/docs/en/install.md @@ -17,22 +17,22 @@ sidebar: ```bash # Install Spark NLP from PyPI -pip install spark-nlp==5.4.0 +pip install spark-nlp==5.4.1 # Install Spark NLP from Anaconda/Conda conda install -c johnsnowlabs spark-nlp # Load Spark NLP with Spark Shell -spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0 +spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.1 # Load Spark NLP with PySpark -pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0 +pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.1 # Load Spark NLP with Spark Submit -spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0 +spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.1 # Load Spark NLP as external JAR after compiling and building Spark NLP by `sbt assembly` -spark-shell --jars spark-nlp-assembly-5.4.0.jar +spark-shell --jars spark-nlp-assembly-5.4.1.jar ``` **GPU (optional):** @@ -55,7 +55,7 @@ python version, consider sticking to lower versions of Spark.
#### Quick Install
 
 
 Let's create a new Conda environment to manage all the dependencies there. You can use Python Virtual Environment if you prefer or not have any environment.
 
 ```bash
@@ -92,7 +92,7 @@ spark = sparknlp.start()
 
 If you need to manually start SparkSession because you have other configurations and `sparknlp.start()` is not including them, you can manually start the SparkSession with:
 
 ```python
 spark = SparkSession.builder \
     .appName("Spark NLP") \
     .master("local[*]") \
@@ -109,7 +109,7 @@ you'll have to put the jars in a reachable location for all driver and executor
 
 ### Python without explicit Pyspark installation
 
 ### Pip/Conda
 
 
 If you installed pyspark through pip/conda, you can install `spark-nlp` through the same channel.
 
 Pip:
 
 ```bash
-pip install spark-nlp==5.4.0
+pip install spark-nlp==5.4.1
 ```
 
 Conda:
 
 ```bash
 conda install -c johnsnowlabs spark-nlp
 ```
@@ -131,7 +131,7 @@ Then you'll have to create a SparkSession either from Spark NLP:
 
 ```python
 import sparknlp
 
 spark = sparknlp.start()
 ```
@@ -142,7 +142,7 @@
 import sparknlp
 from sparknlp.pretrained import PretrainedPipeline
 
 # create or get Spark Session
 
 spark = sparknlp.start()
 
 sparknlp.version()
@@ -154,28 +154,28 @@
 pipeline = PretrainedPipeline('recognize_entities_dl', 'en')
 result = pipeline.annotate('The Mona Lisa is a 16th century oil painting created by Leonardo')
 ```
 
 ## Scala and Java
 
 To use Spark NLP you need the following requirements:
 
 - Java 8 and 11
 - Apache Spark 3.5.x, 3.4.x, 3.3.x, 3.2.x, 3.1.x, 3.0.x
 
 #### Maven
 
 **spark-nlp** on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x
 
 The `spark-nlp` has been published to
 the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp).
 
 ```xml
 
     com.johnsnowlabs.nlp
     spark-nlp_2.12
-    5.4.0
+    5.4.1
 
 ```
 
@@ -257,7 +257,7 @@ at the moment, only the standard variant of the M1 is supported. Other variants
 M1 Pro/Max/Ultra, M2) will most likely not work.
 
 Make sure the following prerequisites are met:
 
 
 1. An M1 compiled java version needs to be installed. For example to install the Zulu Java 11 JDK head to [Download Azul JDKs](https://www.azul.com/downloads/?version=java-11-lts&os=macos&architecture=arm-64-bit&package=jdk) and install that java version.
 
    rosetta, you can run the following commands in your shell:
 
    ```shell
    johnsnow@m1mac ~ % cat $(which java) | file -
    /dev/stdin: Mach-O 64-bit executable arm64
    ```
 
 ```
 rocksdbjni-6.20.3.jar
 ```
 
 to find the jar you have to remove. After removing the jar, the pipelines should work
 as expected.
@@ -350,7 +350,7 @@ pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.4.0
 
-spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.4.0
+spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.4.1
 
 ```
 
 The `spark-nlp-aarch64` has been published to
 the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-aarch64).
 
@@ -372,7 +372,7 @@ the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/s
 
 **NOTE**: In case you are using large pretrained models like UniversalSentenceEncoder, you need to have the following
 set in your SparkSession:
 
 ```sh
 spark-shell \
   --driver-memory 16g \
   --conf spark.kryoserializer.buffer.max=2000M \
@@ -454,7 +454,7 @@ to install Spark NLP for your system.
 
 ### Starting Spark NLP
 
 Spark NLP needs to be started with the `aarch64` flag set to `true`:
 
 For Scala:
 
 ```scala
 import com.johnsnowlabs.nlp.SparkNLP
 
 val spark = SparkNLP.start(aarch64 = true)
 ```
 
 For Python:
 
 ```python
 import sparknlp
 
 spark = sparknlp.start(aarch64=True)
 ```
 
 ## Google Colab Notebook
 
 Google Colab is perhaps the easiest way to get started with spark-nlp. It requires no installation or setup other
 than having a Google account.
 
 Or you can install `spark-nlp` from inside Zeppelin by using Conda:
 
 ```bash
 python.conda install -c johnsnowlabs spark-nlp
 ```
 
 Configure Zeppelin properly, use cells with %spark.pyspark or any interpreter name you chose.
 
 Finally, in Zeppelin interpreter settings, make sure you set properly zeppelin.python to the python you want to use and
 
 An alternative option would be to set `SPARK_SUBMIT_OPTIONS` (zeppelin-env.sh) and make sure `--packages` is there as
 shown earlier since it includes both scala and python side installation.
 
 ## Jupyter Notebook
 
 **Recommended:**
 
 The easiest way to get this done on Linux and macOS is to simply install `spark-nlp` and `pyspark` PyPI packages and
@@ -812,7 +812,7 @@ Spark NLP 5.4.0 has been tested and is compatible with the following runtimes:
 
 - 9.1 ML & GPU
 - 10.1 ML & GPU
 - 10.2 ML & GPU
 - 10.3 ML & GPU
 - 10.4 ML & GPU
 - 10.5 ML & GPU
@@ -840,12 +840,12 @@ Spark NLP 5.4.0 has been tested and is compatible with the following runtimes:
 
 ```bash
 spark.kryoserializer.buffer.max 2000M
 spark.serializer org.apache.spark.serializer.KryoSerializer
 ```
 
 3. In `Libraries` tab inside your cluster you need to follow these steps:
 
     3.1. Install New -> PyPI -> `spark-nlp` -> Install
 
-    3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0` -> Install
+    3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.1` -> Install
 
diff --git a/docs/en/spark_nlp.md b/docs/en/spark_nlp.md
index dac35142b800e6..db8df8250a2b6d 100644
--- a/docs/en/spark_nlp.md
+++ b/docs/en/spark_nlp.md
@@ -25,7 +25,7 @@ Spark NLP is built on top of **Apache Spark 3.x**. For using Spark NLP you need:
 
 **GPU (optional):**
 
-Spark NLP 5.4.0 is built with TensorFlow 2.7.1 and the following NVIDIA® software are only required for GPU support:
+Spark NLP 5.4.1 is built with TensorFlow 2.7.1 and the following NVIDIA® software are only required for GPU support:
 
 - NVIDIA® GPU drivers version 450.80.02 or higher
 - CUDA® Toolkit 11.2
diff --git a/python/README.md b/python/README.md
index cb7c32736e8638..b247a2b965a13f 100644
--- a/python/README.md
+++ b/python/README.md
@@ -166,7 +166,7 @@ To use Spark NLP you need the following requirements:
 
 **GPU (optional):**
 
-Spark NLP 5.4.0 is built with ONNX 1.17.0 and TensorFlow 2.7.1 deep learning engines. The minimum following NVIDIA® software are only required for GPU support:
+Spark NLP 5.4.1 is built with ONNX 1.17.0 and TensorFlow 2.7.1 deep learning engines.
The minimum following NVIDIA® software are only required for GPU support: - NVIDIA® GPU drivers version 450.80.02 or higher - CUDA® Toolkit 11.2 @@ -182,7 +182,7 @@ $ java -version $ conda create -n sparknlp python=3.7 -y $ conda activate sparknlp # spark-nlp by default is based on pyspark 3.x -$ pip install spark-nlp==5.4.0 pyspark==3.3.1 +$ pip install spark-nlp==5.4.1 pyspark==3.3.1 ``` In Python console or Jupyter `Python3` kernel: @@ -227,7 +227,7 @@ For more examples, you can visit our dedicated [examples](https://github.com/Joh ## Apache Spark Support -Spark NLP *5.4.0* has been built on top of Apache Spark 3.4 while fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x +Spark NLP *5.4.1* has been built on top of Apache Spark 3.4 while fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x | Spark NLP | Apache Spark 3.5.x | Apache Spark 3.4.x | Apache Spark 3.3.x | Apache Spark 3.2.x | Apache Spark 3.1.x | Apache Spark 3.0.x | Apache Spark 2.4.x | Apache Spark 2.3.x | |-----------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------| @@ -260,7 +260,7 @@ Find out more about `Spark NLP` versions from our [release notes](https://github ## Databricks Support -Spark NLP 5.4.0 has been tested and is compatible with the following runtimes: +Spark NLP 5.4.1 has been tested and is compatible with the following runtimes: **CPU:** @@ -333,7 +333,7 @@ Spark NLP 5.4.0 has been tested and is compatible with the following runtimes: ## EMR Support -Spark NLP 5.4.0 has been tested and is compatible with the following EMR releases: +Spark NLP 5.4.1 has been tested and is compatible with the following EMR releases: - emr-6.2.0 - emr-6.3.0 @@ -383,11 +383,11 @@ Spark NLP supports all major releases of Apache Spark 3.0.x, Apache Spark 3.1.x, ```sh # CPU -spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0 +spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.1 -pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0 +pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.1 -spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0 +spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.1 ``` The `spark-nlp` has been published to @@ -396,11 +396,11 @@ the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/s ```sh # GPU -spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.4.0 +spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.4.1 -pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.4.0 +pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.4.1 -spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.4.0 +spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.4.1 ``` @@ -410,11 +410,11 @@ the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/s ```sh # AArch64 -spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.4.0 +spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.4.1 -pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.4.0 +pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.4.1 -spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.4.0 +spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.4.1 ``` @@ -424,11 +424,11 @@ the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/s 
```sh # M1/M2 (Apple Silicon) -spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.4.0 +spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.4.1 -pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.4.0 +pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.4.1 -spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.4.0 +spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.4.1 ``` @@ -442,7 +442,7 @@ set in your SparkSession: spark-shell \ --driver-memory 16g \ --conf spark.kryoserializer.buffer.max=2000M \ - --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0 + --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.1 ``` ## Scala @@ -460,7 +460,7 @@ coordinates: com.johnsnowlabs.nlp spark-nlp_2.12 - 5.4.0 + 5.4.1 ``` @@ -471,7 +471,7 @@ coordinates: com.johnsnowlabs.nlp spark-nlp-gpu_2.12 - 5.4.0 + 5.4.1 ``` @@ -482,7 +482,7 @@ coordinates: com.johnsnowlabs.nlp spark-nlp-aarch64_2.12 - 5.4.0 + 5.4.1 ``` @@ -493,7 +493,7 @@ coordinates: com.johnsnowlabs.nlp spark-nlp-silicon_2.12 - 5.4.0 + 5.4.1 ``` @@ -503,28 +503,28 @@ coordinates: ```sbtshell // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp -libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "5.4.0" +libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "5.4.1" ``` **spark-nlp-gpu:** ```sbtshell // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-gpu -libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-gpu" % "5.4.0" +libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-gpu" % "5.4.1" ``` **spark-nlp-aarch64:** ```sbtshell // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-aarch64 -libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-aarch64" % "5.4.0" +libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-aarch64" % "5.4.1" ``` **spark-nlp-silicon:** ```sbtshell // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-silicon -libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-silicon" % "5.4.0" +libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-silicon" % "5.4.1" ``` Maven @@ -546,7 +546,7 @@ If you installed pyspark through pip/conda, you can install `spark-nlp` through Pip: ```bash -pip install spark-nlp==5.4.0 +pip install spark-nlp==5.4.1 ``` Conda: @@ -575,7 +575,7 @@ spark = SparkSession.builder .config("spark.driver.memory", "16G") .config("spark.driver.maxResultSize", "0") .config("spark.kryoserializer.buffer.max", "2000M") - .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0") + .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.1") .getOrCreate() ``` @@ -646,7 +646,7 @@ Use either one of the following options - Add the following Maven Coordinates to the interpreter's library list ```bash -com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0 +com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.1 ``` - Add a path to pre-built jar from [here](#compiled-jars) in the interpreter's library list making sure the jar is @@ -657,7 +657,7 @@ com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0 Apart from the previous step, install the python module through pip ```bash -pip install spark-nlp==5.4.0 +pip install spark-nlp==5.4.1 ``` Or you can install `spark-nlp` from inside Zeppelin by using Conda: @@ -685,7 +685,7 @@ launch the Jupyter from the same Python environment: $ conda create -n sparknlp python=3.8 -y $ conda activate sparknlp # spark-nlp by default is based on pyspark 3.x -$ pip install 
spark-nlp==5.4.0 pyspark==3.3.1 jupyter +$ pip install spark-nlp==5.4.1 pyspark==3.3.1 jupyter $ jupyter notebook ``` @@ -702,7 +702,7 @@ export PYSPARK_PYTHON=python3 export PYSPARK_DRIVER_PYTHON=jupyter export PYSPARK_DRIVER_PYTHON_OPTS=notebook -pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0 +pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.1 ``` Alternatively, you can mix in using `--jars` option for pyspark + `pip install spark-nlp` @@ -729,7 +729,7 @@ This script comes with the two options to define `pyspark` and `spark-nlp` versi # -s is for spark-nlp # -g will enable upgrading libcudnn8 to 8.1.0 on Google Colab for GPU usage # by default they are set to the latest -!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.4.0 +!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.4.1 ``` [Spark NLP quick start on Google Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/quick_start_google_colab.ipynb) @@ -752,7 +752,7 @@ This script comes with the two options to define `pyspark` and `spark-nlp` versi # -s is for spark-nlp # -g will enable upgrading libcudnn8 to 8.1.0 on Kaggle for GPU usage # by default they are set to the latest -!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.4.0 +!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.4.1 ``` [Spark NLP quick start on Kaggle Kernel](https://www.kaggle.com/mozzie/spark-nlp-named-entity-recognition) is a live @@ -771,9 +771,9 @@ demo on Kaggle Kernel that performs named entity recognitions by using Spark NLP 3. In `Libraries` tab inside your cluster you need to follow these steps: - 3.1. Install New -> PyPI -> `spark-nlp==5.4.0` -> Install + 3.1. Install New -> PyPI -> `spark-nlp==5.4.1` -> Install - 3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0` -> Install + 3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.1` -> Install 4. Now you can attach your notebook to the cluster and use Spark NLP! 
@@ -824,7 +824,7 @@ A sample of your software configuration in JSON on S3 (must be public access): "spark.kryoserializer.buffer.max": "2000M", "spark.serializer": "org.apache.spark.serializer.KryoSerializer", "spark.driver.maxResultSize": "0", - "spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0" + "spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.1" } }] ``` @@ -833,7 +833,7 @@ A sample of AWS CLI to launch EMR cluster: ```.sh aws emr create-cluster \ ---name "Spark NLP 5.4.0" \ +--name "Spark NLP 5.4.1" \ --release-label emr-6.2.0 \ --applications Name=Hadoop Name=Spark Name=Hive \ --instance-type m4.4xlarge \ @@ -897,7 +897,7 @@ gcloud dataproc clusters create ${CLUSTER_NAME} \ --enable-component-gateway \ --metadata 'PIP_PACKAGES=spark-nlp spark-nlp-display google-cloud-bigquery google-cloud-storage' \ --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh \ - --properties spark:spark.serializer=org.apache.spark.serializer.KryoSerializer,spark:spark.driver.maxResultSize=0,spark:spark.kryoserializer.buffer.max=2000M,spark:spark.jars.packages=com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0 + --properties spark:spark.serializer=org.apache.spark.serializer.KryoSerializer,spark:spark.driver.maxResultSize=0,spark:spark.kryoserializer.buffer.max=2000M,spark:spark.jars.packages=com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.1 ``` 2. On an existing one, you need to install spark-nlp and spark-nlp-display packages from PyPI. @@ -940,7 +940,7 @@ spark = SparkSession.builder .config("spark.kryoserializer.buffer.max", "2000m") .config("spark.jsl.settings.pretrained.cache_folder", "sample_data/pretrained") .config("spark.jsl.settings.storage.cluster_tmp_dir", "sample_data/storage") - .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0") + .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.1") .getOrCreate() ``` @@ -954,7 +954,7 @@ spark-shell \ --conf spark.kryoserializer.buffer.max=2000M \ --conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \ --conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \ - --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0 + --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.1 ``` **pyspark:** @@ -967,7 +967,7 @@ pyspark \ --conf spark.kryoserializer.buffer.max=2000M \ --conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \ --conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \ - --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0 + --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.1 ``` **Databricks:** @@ -1239,7 +1239,7 @@ spark = SparkSession.builder .config("spark.driver.memory", "16G") .config("spark.driver.maxResultSize", "0") .config("spark.kryoserializer.buffer.max", "2000M") - .config("spark.jars", "/tmp/spark-nlp-assembly-5.4.0.jar") + .config("spark.jars", "/tmp/spark-nlp-assembly-5.4.1.jar") .getOrCreate() ``` @@ -1248,7 +1248,7 @@ spark = SparkSession.builder version (3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x) - If you are local, you can load the Fat JAR from your local FileSystem, however, if you are in a cluster setup you need to put the Fat JAR on a distributed FileSystem such as HDFS, DBFS, S3, etc. 
( - i.e., `hdfs:///tmp/spark-nlp-assembly-5.4.0.jar`) + i.e., `hdfs:///tmp/spark-nlp-assembly-5.4.1.jar`) Example of using pretrained Models and Pipelines in offline: diff --git a/python/docs/conf.py b/python/docs/conf.py index 88d28fb0e8a4e8..93eb44d0e868e5 100644 --- a/python/docs/conf.py +++ b/python/docs/conf.py @@ -23,7 +23,7 @@ author = "John Snow Labs" # The full version, including alpha/beta/rc tags -release = "5.4.0" +release = "5.4.1" pyspark_version = "3.2.3" # -- General configuration --------------------------------------------------- diff --git a/python/setup.py b/python/setup.py index 53fb03dbfdd3e5..196ee7b848c7e1 100644 --- a/python/setup.py +++ b/python/setup.py @@ -41,7 +41,7 @@ # project code, see # https://packaging.python.org/en/latest/single_source_version.html - version='5.4.0', # Required + version='5.4.1', # Required # This is a one-line description or tagline of what your project does. This # corresponds to the 'Summary' metadata field: diff --git a/python/sparknlp/__init__.py b/python/sparknlp/__init__.py index 39bd6341e0a15f..982d00072d58bb 100644 --- a/python/sparknlp/__init__.py +++ b/python/sparknlp/__init__.py @@ -129,7 +129,7 @@ def start(gpu=False, The initiated Spark session. """ - current_version = "5.4.0" + current_version = "5.4.1" if params is None: params = {} @@ -310,4 +310,4 @@ def version(): str The current Spark NLP version. """ - return '5.4.0' + return '5.4.1' diff --git a/scripts/colab_setup.sh b/scripts/colab_setup.sh index 1871e9364d837d..e701577fa515a4 100644 --- a/scripts/colab_setup.sh +++ b/scripts/colab_setup.sh @@ -1,7 +1,7 @@ #!/bin/bash #default values for pyspark, spark-nlp, and SPARK_HOME -SPARKNLP="5.4.0" +SPARKNLP="5.4.1" PYSPARK="3.2.3" while getopts s:p:g option diff --git a/scripts/kaggle_setup.sh b/scripts/kaggle_setup.sh index 847624604a69a9..fd7d36543e949e 100644 --- a/scripts/kaggle_setup.sh +++ b/scripts/kaggle_setup.sh @@ -1,7 +1,7 @@ #!/bin/bash #default values for pyspark, spark-nlp, and SPARK_HOME -SPARKNLP="5.4.0" +SPARKNLP="5.4.1" PYSPARK="3.2.3" while getopts s:p:g option diff --git a/scripts/sagemaker_setup.sh b/scripts/sagemaker_setup.sh index 2b147480f4ed5a..283622a9d94d32 100644 --- a/scripts/sagemaker_setup.sh +++ b/scripts/sagemaker_setup.sh @@ -1,7 +1,7 @@ #!/bin/bash # Default values for pyspark, spark-nlp, and SPARK_HOME -SPARKNLP="5.4.0" +SPARKNLP="5.4.1" PYSPARK="3.2.3" echo "Setup SageMaker for PySpark $PYSPARK and Spark NLP $SPARKNLP" diff --git a/src/main/scala/com/johnsnowlabs/nlp/SparkNLP.scala b/src/main/scala/com/johnsnowlabs/nlp/SparkNLP.scala index d87a3f5d47e860..eee9f04736aa33 100644 --- a/src/main/scala/com/johnsnowlabs/nlp/SparkNLP.scala +++ b/src/main/scala/com/johnsnowlabs/nlp/SparkNLP.scala @@ -20,7 +20,7 @@ import org.apache.spark.sql.SparkSession object SparkNLP { - val currentVersion = "5.4.0" + val currentVersion = "5.4.1" val MavenSpark3 = s"com.johnsnowlabs.nlp:spark-nlp_2.12:$currentVersion" val MavenGpuSpark3 = s"com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:$currentVersion" val MavenSparkSilicon = s"com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:$currentVersion" diff --git a/src/main/scala/com/johnsnowlabs/util/Build.scala b/src/main/scala/com/johnsnowlabs/util/Build.scala index 7d20d0cf72106a..672eda7bedb3e0 100644 --- a/src/main/scala/com/johnsnowlabs/util/Build.scala +++ b/src/main/scala/com/johnsnowlabs/util/Build.scala @@ -17,5 +17,6 @@ package com.johnsnowlabs.util object Build { - val version: String = "5.4.0" + val version: String = "5.4.1" } + From 
cb5bd21597fce7a7235542cdea043126379b1545 Mon Sep 17 00:00:00 2001 From: github-actions Date: Sun, 14 Jul 2024 16:37:02 +0000 Subject: [PATCH 7/7] Update Scala and Python APIs --- docs/api/com/index.html | 8 +-- .../com/johnsnowlabs/client/CloudClient.html | 8 +-- .../com/johnsnowlabs/client/CloudManager.html | 8 +-- .../johnsnowlabs/client/CloudResources$.html | 8 +-- .../com/johnsnowlabs/client/CloudStorage.html | 8 +-- .../client/aws/AWSAnonymousCredentials.html | 8 +-- .../client/aws/AWSBasicCredentials.html | 8 +-- .../johnsnowlabs/client/aws/AWSClient.html | 8 +-- .../client/aws/AWSCredentialsProvider.html | 8 +-- .../johnsnowlabs/client/aws/AWSGateway.html | 8 +-- .../client/aws/AWSProfileCredentials.html | 8 +-- .../client/aws/AWSTokenCredentials.html | 8 +-- .../client/aws/CredentialParams.html | 8 +-- .../johnsnowlabs/client/aws/Credentials.html | 8 +-- .../com/johnsnowlabs/client/aws/index.html | 8 +-- .../client/azure/AzureClient.html | 8 +-- .../client/azure/AzureGateway.html | 8 +-- .../com/johnsnowlabs/client/azure/index.html | 8 +-- .../johnsnowlabs/client/gcp/GCPClient.html | 8 +-- .../johnsnowlabs/client/gcp/GCPGateway.html | 8 +-- .../com/johnsnowlabs/client/gcp/index.html | 8 +-- docs/api/com/johnsnowlabs/client/index.html | 8 +-- .../client/util/CloudHelper$.html | 8 +-- .../com/johnsnowlabs/client/util/index.html | 8 +-- .../johnsnowlabs/collections/SearchTrie$.html | 8 +-- .../johnsnowlabs/collections/SearchTrie.html | 8 +-- .../collections/StorageSearchTrie$.html | 8 +-- .../collections/StorageSearchTrie.html | 8 +-- .../com/johnsnowlabs/collections/index.html | 8 +-- docs/api/com/johnsnowlabs/index.html | 8 +-- docs/api/com/johnsnowlabs/ml/ai/DeBerta.html | 8 +-- .../ml/ai/MergeTokenStrategy$.html | 8 +-- .../johnsnowlabs/ml/ai/OpenAICompletion.html | 8 +-- .../johnsnowlabs/ml/ai/OpenAIEmbeddings$.html | 8 +-- .../johnsnowlabs/ml/ai/OpenAIEmbeddings.html | 8 +-- docs/api/com/johnsnowlabs/ml/ai/index.html | 8 +-- .../com/johnsnowlabs/ml/ai/model/Choice.html | 8 +-- .../ml/ai/model/CompletionResponse.html | 8 +-- .../ml/ai/model/EmbeddingData.html | 8 +-- .../ml/ai/model/TextEmbeddingResponse.html | 8 +-- .../com/johnsnowlabs/ml/ai/model/Usage.html | 8 +-- .../johnsnowlabs/ml/ai/model/UsageData.html | 8 +-- .../com/johnsnowlabs/ml/ai/model/index.html | 8 +-- .../ml/ai/seq2seq/DecoderProcessor.html | 8 +-- .../ml/ai/seq2seq/OnnxT5EncoderDecoder.html | 8 +-- .../ai/seq2seq/OpenvinoT5EncoderDecoder.html | 8 +-- .../ml/ai/seq2seq/T5EncoderDecoder.html | 8 +-- .../com/johnsnowlabs/ml/ai/seq2seq/index.html | 8 +-- .../ml/ai/t5/OnnxT5EncoderDecoder.html | 8 +-- .../t5/T5EncoderDecoder$DecoderProcessor.html | 8 +-- .../ml/ai/t5/T5EncoderDecoder.html | 8 +-- docs/api/com/johnsnowlabs/ml/ai/t5/index.html | 8 +-- .../ml/ai/util/Generation/Generate.html | 20 +++---- .../ai/util/Generation/GenerationConfig.html | 8 +-- .../ml/ai/util/Generation/Logit/Logit.html | 8 +-- .../ForcedTokenLogitProcessor.html | 8 +-- .../Logit/LogitProcess/LogitProcessor.html | 8 +-- .../LogitProcess/MinLengthLogitProcessor.html | 8 +-- .../NoRepeatNgramsLogitProcessor.html | 8 +-- .../RepetitionPenaltyLogitProcessor.html | 8 +-- .../LogitProcess/SuppressLogitProcessor.html | 8 +-- .../Generation/Logit/LogitProcess/index.html | 8 +-- .../Generation/Logit/LogitProcessorList.html | 8 +-- .../Logit/LogitWarper/LogitWarper.html | 8 +-- .../LogitWarper/TemperatureLogitWarper.html | 8 +-- .../Logit/LogitWarper/TopKLogitWarper.html | 8 +-- .../Logit/LogitWarper/TopPLogitWarper.html | 8 +-- 
.../Generation/Logit/LogitWarper/index.html | 8 +-- .../ml/ai/util/Generation/Logit/index.html | 8 +-- .../Generation/Search/BeamHypotheses.html | 8 +-- .../ai/util/Generation/Search/BeamScorer.html | 30 +++++++--- .../Generation/Search/BeamSearchScorer.html | 30 +++++++--- .../ml/ai/util/Generation/Search/index.html | 8 +-- .../ml/ai/util/Generation/index.html | 8 +-- .../com/johnsnowlabs/ml/ai/util/index.html | 8 +-- docs/api/com/johnsnowlabs/ml/crf/Attr.html | 8 +-- .../com/johnsnowlabs/ml/crf/AttrFeature.html | 8 +-- .../api/com/johnsnowlabs/ml/crf/AttrStat.html | 8 +-- .../com/johnsnowlabs/ml/crf/CrfDataset.html | 8 +-- .../com/johnsnowlabs/ml/crf/CrfParams.html | 8 +-- .../johnsnowlabs/ml/crf/DatasetEncoder.html | 8 +-- .../johnsnowlabs/ml/crf/DatasetMetadata.html | 8 +-- .../johnsnowlabs/ml/crf/DatasetReader$.html | 8 +-- .../johnsnowlabs/ml/crf/EdgeCalculator$.html | 8 +-- .../com/johnsnowlabs/ml/crf/FbCalculator.html | 8 +-- .../api/com/johnsnowlabs/ml/crf/Instance.html | 8 +-- .../johnsnowlabs/ml/crf/InstanceLabels.html | 8 +-- .../johnsnowlabs/ml/crf/L2DecayStrategy.html | 8 +-- .../johnsnowlabs/ml/crf/LinearChainCrf.html | 8 +-- .../ml/crf/LinearChainCrfModel.html | 8 +-- .../ml/crf/SerializedDatasetMetadata.html | 8 +-- .../ml/crf/SerializedLinearChainCrfModel.html | 8 +-- .../ml/crf/SparseArray$$SeqWrapper.html | 8 +-- .../com/johnsnowlabs/ml/crf/SparseArray$.html | 8 +-- .../com/johnsnowlabs/ml/crf/SparseArray.html | 8 +-- .../ml/crf/TextSentenceAttrs.html | 8 +-- .../ml/crf/TextSentenceLabels.html | 8 +-- .../com/johnsnowlabs/ml/crf/Transition.html | 8 +-- .../com/johnsnowlabs/ml/crf/VectorMath$.html | 8 +-- .../com/johnsnowlabs/ml/crf/WordAttrs.html | 8 +-- docs/api/com/johnsnowlabs/ml/crf/index.html | 8 +-- docs/api/com/johnsnowlabs/ml/index.html | 8 +-- .../com/johnsnowlabs/ml/onnx/OnnxSession.html | 8 +-- .../ml/onnx/OnnxWrapper$$DecoderWrappers.html | 8 +-- ...er$$EncoderDecoderWithoutPastWrappers.html | 8 +-- .../OnnxWrapper$$EncoderDecoderWrappers.html | 8 +-- .../johnsnowlabs/ml/onnx/OnnxWrapper$.html | 8 +-- .../com/johnsnowlabs/ml/onnx/OnnxWrapper.html | 8 +-- .../johnsnowlabs/ml/onnx/ReadOnnxModel.html | 8 +-- ...sources$$implicits$$OnnxSessionResult.html | 8 +-- .../ml/onnx/TensorResources$$implicits$.html | 8 +-- .../ml/onnx/TensorResources$.html | 8 +-- .../johnsnowlabs/ml/onnx/TensorResources.html | 8 +-- .../johnsnowlabs/ml/onnx/WriteOnnxModel.html | 8 +-- docs/api/com/johnsnowlabs/ml/onnx/index.html | 8 +-- .../OpenvinoWrapper$$DecoderWrappers.html | 8 +-- ...er$$EncoderDecoderWithoutPastWrappers.html | 8 +-- ...envinoWrapper$$EncoderDecoderWrappers.html | 8 +-- .../ml/openvino/OpenvinoWrapper$.html | 8 +-- .../ml/openvino/OpenvinoWrapper.html | 8 +-- .../ml/openvino/ReadOpenvinoModel.html | 8 +-- .../ml/openvino/WriteOpenvinoModel.html | 8 +-- .../com/johnsnowlabs/ml/openvino/index.html | 8 +-- .../tensorflow/ClassifierDatasetEncoder.html | 8 +-- .../ClassifierDatasetEncoderParams.html | 8 +-- .../ml/tensorflow/DatasetEncoderParams.html | 8 +-- .../johnsnowlabs/ml/tensorflow/Logging.html | 8 +-- .../ml/tensorflow/ModelSignature.html | 8 +-- .../johnsnowlabs/ml/tensorflow/NerBatch$.html | 8 +-- .../johnsnowlabs/ml/tensorflow/NerBatch.html | 8 +-- .../ml/tensorflow/NerDatasetEncoder.html | 8 +-- .../ml/tensorflow/ReadTensorflowModel.html | 8 +-- .../ml/tensorflow/SentenceGrouper.html | 8 +-- .../ml/tensorflow/TensorResources$.html | 8 +-- .../ml/tensorflow/TensorResources.html | 8 +-- .../ml/tensorflow/TensorflowClassifier.html | 8 +-- 
.../ml/tensorflow/TensorflowWrapper$.html | 8 +-- .../ml/tensorflow/TensorflowWrapper.html | 8 +-- .../johnsnowlabs/ml/tensorflow/Variables.html | 8 +-- .../ml/tensorflow/WriteTensorflowModel.html | 8 +-- .../com/johnsnowlabs/ml/tensorflow/index.html | 8 +-- .../sentencepiece/ReadSentencePieceModel.html | 8 +-- .../sentencepiece/SentencePieceException.html | 8 +-- .../sentencepiece/SentencePieceProcessor.html | 8 +-- .../sentencepiece/SentencePieceWrapper$.html | 8 +-- .../WriteSentencePieceModel.html | 8 +-- .../ml/tensorflow/sentencepiece/index.html | 8 +-- ...delSignatureConstants$$AttentionMask$.html | 8 +-- ...lSignatureConstants$$AttentionMaskV1$.html | 8 +-- ...SignatureConstants$$AudioValuesInput$.html | 8 +-- ...s$$CachedDecoderEncoderAttentionMask$.html | 8 +-- ...stants$$CachedDecoderEncoderInputIds$.html | 8 +-- ...eConstants$$CachedDecoderInputCache1$.html | 8 +-- ...eConstants$$CachedDecoderInputCache2$.html | 8 +-- ...tureConstants$$CachedDecoderInputIds$.html | 8 +-- ...natureConstants$$CachedEncoderOutput$.html | 8 +-- ...gnatureConstants$$CachedLogitsOutput$.html | 8 +-- ...delSignatureConstants$$CachedOutPut2$.html | 8 +-- ...delSignatureConstants$$CachedOutput1$.html | 8 +-- .../sign/ModelSignatureConstants$$DType$.html | 8 +-- ...atureConstants$$DecoderAttentionMask$.html | 8 +-- ...ureConstants$$DecoderCachedCache1Key$.html | 8 +-- ...ureConstants$$DecoderCachedCache2Key$.html | 8 +-- ...ts$$DecoderCachedEncoderAttentionKey$.html | 8 +-- ...stants$$DecoderCachedEncoderStateKey$.html | 8 +-- ...eConstants$$DecoderCachedInputIdsKey$.html | 8 +-- ...natureConstants$$DecoderCachedOutput$.html | 8 +-- ...stants$$DecoderCachedOutputCache1Key$.html | 8 +-- ...stants$$DecoderCachedOutputCache2Key$.html | 8 +-- ...ureConstants$$DecoderCachedOutputKey$.html | 8 +-- ...nstants$$DecoderEncoderAttentionMask$.html | 8 +-- ...ureConstants$$DecoderEncoderInputIds$.html | 8 +-- ...onstants$$DecoderInitOutputCache1Key$.html | 8 +-- ...onstants$$DecoderInitOutputCache2Key$.html | 8 +-- ...lSignatureConstants$$DecoderInputIds$.html | 8 +-- ...delSignatureConstants$$DecoderOutput$.html | 8 +-- .../ModelSignatureConstants$$DimCount$.html | 8 +-- ...atureConstants$$EncoderAttentionMask$.html | 8 +-- ...gnatureConstants$$EncoderContextMask$.html | 8 +-- ...lSignatureConstants$$EncoderInputIds$.html | 8 +-- ...delSignatureConstants$$EncoderOutput$.html | 8 +-- ...lSignatureConstants$$EndLogitsOutput$.html | 8 +-- ...ignatureConstants$$InitCachedOutPut2$.html | 8 +-- ...ignatureConstants$$InitCachedOutput1$.html | 8 +-- ...nts$$InitDecoderEncoderAttentionMask$.html | 8 +-- ...onstants$$InitDecoderEncoderInputIds$.html | 8 +-- ...natureConstants$$InitDecoderInputIds$.html | 8 +-- ...SignatureConstants$$InitLogitsOutput$.html | 8 +-- .../ModelSignatureConstants$$InputIds$.html | 8 +-- .../ModelSignatureConstants$$InputIdsV1$.html | 8 +-- ...lSignatureConstants$$LastHiddenState$.html | 8 +-- ...ignatureConstants$$LastHiddenStateV1$.html | 8 +-- ...odelSignatureConstants$$LogitsOutput$.html | 8 +-- .../sign/ModelSignatureConstants$$Name$.html | 8 +-- ...SignatureConstants$$PixelValuesInput$.html | 8 +-- ...odelSignatureConstants$$PoolerOutput$.html | 8 +-- ...elSignatureConstants$$PoolerOutputV1$.html | 8 +-- ...elSignatureConstants$$SerializedSize$.html | 8 +-- ...odelSignatureConstants$$ShapeDimList$.html | 8 +-- ...ignatureConstants$$StartLogitsOutput$.html | 8 +-- ...lSignatureConstants$$TFInfoDescriptor.html | 8 +-- ...lSignatureConstants$$TFInfoNameMapper.html | 8 +-- 
...stants$$TapasLogitsAggregationOutput$.html | 8 +-- ...ignatureConstants$$TapasLogitsOutput$.html | 8 +-- ...odelSignatureConstants$$TokenTypeIds$.html | 8 +-- ...elSignatureConstants$$TokenTypeIdsV1$.html | 8 +-- .../sign/ModelSignatureConstants$.html | 8 +-- .../sign/ModelSignatureManager$.html | 8 +-- .../ml/tensorflow/sign/index.html | 8 +-- ...inAlg$$implicits$$ExtendedDenseMatrix.html | 8 +-- .../ml/util/LinAlg$$implicits$.html | 8 +-- .../api/com/johnsnowlabs/ml/util/LinAlg$.html | 8 +-- .../ml/util/LoadExternalModel$.html | 27 +++++++-- .../com/johnsnowlabs/ml/util/ModelArch$.html | 8 +-- .../com/johnsnowlabs/ml/util/ModelEngine.html | 8 +-- docs/api/com/johnsnowlabs/ml/util/ONNX$.html | 8 +-- .../com/johnsnowlabs/ml/util/Openvino$.html | 8 +-- .../com/johnsnowlabs/ml/util/PyTorch$.html | 8 +-- .../com/johnsnowlabs/ml/util/TensorFlow$.html | 8 +-- .../com/johnsnowlabs/ml/util/Unknown$.html | 8 +-- docs/api/com/johnsnowlabs/ml/util/index.html | 8 +-- .../johnsnowlabs/nlp/ActivationFunction$.html | 8 +-- .../nlp/Annotation$$AnnotationContainer.html | 8 +-- ...nnotation$$extractors$$AnnotationData.html | 8 +-- .../nlp/Annotation$$extractors$.html | 8 +-- .../api/com/johnsnowlabs/nlp/Annotation$.html | 8 +-- docs/api/com/johnsnowlabs/nlp/Annotation.html | 8 +-- .../AnnotationAudio$$AnnotationContainer.html | 8 +-- .../nlp/AnnotationAudio$$AudioFields.html | 8 +-- .../johnsnowlabs/nlp/AnnotationAudio$.html | 8 +-- .../com/johnsnowlabs/nlp/AnnotationAudio.html | 8 +-- .../AnnotationImage$$AnnotationContainer.html | 8 +-- .../nlp/AnnotationImage$$ImageFields.html | 8 +-- .../johnsnowlabs/nlp/AnnotationImage$.html | 8 +-- .../com/johnsnowlabs/nlp/AnnotationImage.html | 8 +-- .../johnsnowlabs/nlp/AnnotatorApproach.html | 8 +-- .../com/johnsnowlabs/nlp/AnnotatorModel.html | 8 +-- .../com/johnsnowlabs/nlp/AnnotatorType$.html | 8 +-- .../com/johnsnowlabs/nlp/AudioAssembler$.html | 8 +-- .../com/johnsnowlabs/nlp/AudioAssembler.html | 8 +-- docs/api/com/johnsnowlabs/nlp/CanBeLazy.html | 8 +-- docs/api/com/johnsnowlabs/nlp/Doc2Chunk$.html | 8 +-- docs/api/com/johnsnowlabs/nlp/Doc2Chunk.html | 8 +-- .../johnsnowlabs/nlp/DocumentAssembler$.html | 8 +-- .../johnsnowlabs/nlp/DocumentAssembler.html | 8 +-- .../johnsnowlabs/nlp/EmbeddingsFinisher$.html | 8 +-- .../johnsnowlabs/nlp/EmbeddingsFinisher.html | 8 +-- .../com/johnsnowlabs/nlp/FeaturesReader.html | 8 +-- .../com/johnsnowlabs/nlp/FeaturesWriter.html | 8 +-- docs/api/com/johnsnowlabs/nlp/Finisher$.html | 8 +-- docs/api/com/johnsnowlabs/nlp/Finisher.html | 8 +-- .../com/johnsnowlabs/nlp/GraphFinisher.html | 8 +-- .../nlp/HasAudioFeatureProperties.html | 8 +-- .../johnsnowlabs/nlp/HasBatchedAnnotate.html | 8 +-- .../nlp/HasBatchedAnnotateAudio.html | 8 +-- .../nlp/HasBatchedAnnotateImage.html | 8 +-- .../nlp/HasCandidateLabelsProperties.html | 8 +-- .../nlp/HasCaseSensitiveProperties.html | 8 +-- .../HasClassifierActivationProperties.html | 8 +-- .../nlp/HasEnableCachingProperties.html | 8 +-- docs/api/com/johnsnowlabs/nlp/HasEngine.html | 8 +-- .../api/com/johnsnowlabs/nlp/HasFeatures.html | 8 +-- .../nlp/HasGeneratorProperties.html | 57 ++++++++++++++++-- .../nlp/HasImageFeatureProperties.html | 8 +-- .../nlp/HasInputAnnotationCols.html | 8 +-- .../nlp/HasMultipleInputAnnotationCols.html | 8 +-- .../nlp/HasOutputAnnotationCol.html | 8 +-- .../nlp/HasOutputAnnotatorType.html | 8 +-- .../com/johnsnowlabs/nlp/HasPretrained.html | 8 +-- .../HasProtectedParams$ProtectedParam.html | 8 +-- .../johnsnowlabs/nlp/HasProtectedParams.html | 8 +-- 
.../reference/autosummary/sparknlp/index.html | 2 +- .../internal/annotator_java_ml/index.html | 2 +- .../internal/annotator_transformer/index.html | 2 +- .../internal/extended_java_wrapper/index.html | 2 +- .../autosummary/sparknlp/internal/index.html | 2 +- .../params_getters_setters/index.html | 2 +- .../sparknlp/internal/recursive/index.html | 2 +- .../sparknlp/logging/comet/index.html | 2 +- .../autosummary/sparknlp/logging/index.html | 2 +- .../sparknlp/pretrained/index.html | 2 +- .../pretrained/pretrained_pipeline/index.html | 2 +- .../pretrained/resource_downloader/index.html | 2 +- .../sparknlp/pretrained/utils/index.html | 2 +- .../sparknlp/training/conll/index.html | 2 +- .../sparknlp/training/conllu/index.html | 2 +- .../autosummary/sparknlp/training/index.html | 2 +- .../sparknlp/training/pos/index.html | 2 +- .../sparknlp/training/pub_tator/index.html | 2 +- .../training/spacy_to_annotation/index.html | 2 +- .../sparknlp/training/tfgraphs/index.html | 2 +- .../sparknlp/upload_to_hub/index.html | 2 +- .../autosummary/sparknlp/util/index.html | 2 +- docs/api/python/reference/index.html | 2 +- docs/api/python/search.html | 2 +- docs/api/python/searchindex.js | 2 +- .../python/static/documentation_options.js | 2 +- docs/api/python/third_party/Comet.html | 2 +- docs/api/python/third_party/MLflow.html | 2 +- docs/api/python/third_party/index.html | 2 +- docs/api/python/user_guide/annotation.html | 2 +- docs/api/python/user_guide/annotators.html | 2 +- .../python/user_guide/custom_pipelines.html | 2 +- docs/api/python/user_guide/helpers.html | 2 +- docs/api/python/user_guide/index.html | 2 +- .../python/user_guide/light_pipelines.html | 2 +- .../user_guide/pretrained_pipelines.html | 2 +- docs/api/python/user_guide/training.html | 2 +- docs/api/scala/collection/compat/index.html | 8 +-- docs/api/scala/collection/index.html | 8 +-- docs/api/scala/index.html | 8 +-- .../scala/com/johnsnowlabs/util/Build.scala | 1 - 1537 files changed, 5440 insertions(+), 4991 deletions(-) diff --git a/docs/api/com/index.html b/docs/api/com/index.html index 42bd9076f9892d..28ddc3863625e0 100644 --- a/docs/api/com/index.html +++ b/docs/api/com/index.html @@ -3,9 +3,9 @@ - Spark NLP 5.4.0 ScalaDoc - com - - + Spark NLP 5.4.1 ScalaDoc - com + + @@ -28,7 +28,7 @@
@@ -501,7 +501,7 @@

    Concrete Value Members

    def
-      generate(inputIds: Seq[Array[Int]], decoderEncoderStateTensors: Either[Tensor, OnnxTensor], encoderAttentionMaskTensors: Either[Tensor, OnnxTensor], decoderInputs: Array[Array[Int]], maxOutputLength: Int, minOutputLength: Int, doSample: Boolean, beamSize: Int, numReturnSequences: Int, temperature: Double, topK: Int, topP: Double, repetitionPenalty: Double, noRepeatNgramSize: Int, vocabSize: Int, eosTokenId: Int, paddingTokenId: Int, randomSeed: Option[Long], ignoreTokenIds: Array[Int] = Array(), session: Either[Session, (OrtEnvironment, OrtSession)], applySoftmax: Boolean = true, ovInferRequest: Option[InferRequest] = None): Array[Array[Int]]
+      generate(inputIds: Seq[Array[Int]], decoderEncoderStateTensors: Either[Tensor, OnnxTensor], encoderAttentionMaskTensors: Either[Tensor, OnnxTensor], decoderInputs: Array[Array[Int]], maxOutputLength: Int, minOutputLength: Int, doSample: Boolean, beamSize: Int, numReturnSequences: Int, temperature: Double, topK: Int, topP: Double, repetitionPenalty: Double, noRepeatNgramSize: Int, vocabSize: Int, eosTokenId: Int, paddingTokenId: Int, randomSeed: Option[Long], ignoreTokenIds: Array[Int] = Array(), session: Either[Session, (OrtEnvironment, OrtSession)], applySoftmax: Boolean = true, ovInferRequest: Option[InferRequest] = None, stopTokenIds: Array[Int] = Array()): Array[Array[Int]]

    Text Generation using Beam Search

diff --git a/docs/api/com/johnsnowlabs/ml/ai/util/Generation/GenerationConfig.html b/docs/api/com/johnsnowlabs/ml/ai/util/Generation/GenerationConfig.html
index 45355ed1702d1f..96fcb2846d6e14 100644
--- a/docs/api/com/johnsnowlabs/ml/ai/util/Generation/GenerationConfig.html
+++ b/docs/api/com/johnsnowlabs/ml/ai/util/Generation/GenerationConfig.html
@@ -3,9 +3,9 @@
-      Spark NLP 5.4.0 ScalaDoc - com.johnsnowlabs.ml.ai.util.Generation.GenerationConfig
+      Spark NLP 5.4.1 ScalaDoc - com.johnsnowlabs.ml.ai.util.Generation.GenerationConfig
@@ -28,7 +28,7 @@
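The new stopTokenIds parameter lets beam-search generation stop a sequence on any of several token ids rather than only on the single eosTokenId. A minimal Scala sketch of that termination rule (illustrative only, not the Spark NLP internals; the example ids are hypothetical):

object StopTokenCheck {
  // A sequence is finished once its last generated token is the EOS id or any
  // configured stop token id.
  def isFinished(tokenIds: Array[Int], eosTokenId: Int, stopTokenIds: Array[Int]): Boolean =
    tokenIds.nonEmpty && (tokenIds.last == eosTokenId || stopTokenIds.contains(tokenIds.last))

  def main(args: Array[String]): Unit = {
    val eos   = 2
    val stops = Array(32007, 32009) // hypothetical "<|end|>"-style template tokens
    println(isFinished(Array(11, 42, 32007), eos, stops)) // true: stop token reached
    println(isFinished(Array(11, 42, 7), eos, stops))     // false: keep generating
  }
}

Defaulting stopTokenIds to Array() keeps the previous behavior, so existing callers of generate are unaffected.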

  • + + + + + + + + abstract + def + + + getDone: Array[Boolean] + + +
  • @@ -364,9 +380,9 @@

    Abstract Value Members

  • - + - + @@ -375,7 +391,7 @@

    Abstract Value Members

    def - process(inputIds: Seq[Array[Int]], nextScores: Seq[Array[Float]], nextTokens: Seq[Array[Int]], nextIndices: Seq[Array[Int]], padTokenId: Int, eosTokenId: Int, beamIndices: Seq[Array[Int]], currentLength: Int): (Array[Array[Float]], Array[Array[Int]], Array[Array[Int]]) + process(inputIds: Seq[Array[Int]], nextScores: Seq[Array[Float]], nextTokens: Seq[Array[Int]], nextIndices: Seq[Array[Int]], padTokenId: Int, eosTokenId: Int, beamIndices: Seq[Array[Int]], currentLength: Int, stopTokenIds: Array[Int]): (Array[Array[Float]], Array[Array[Int]], Array[Array[Int]]) diff --git a/docs/api/com/johnsnowlabs/ml/ai/util/Generation/Search/BeamSearchScorer.html b/docs/api/com/johnsnowlabs/ml/ai/util/Generation/Search/BeamSearchScorer.html index 35fec92e6a080a..7a0adc1cdd3d49 100644 --- a/docs/api/com/johnsnowlabs/ml/ai/util/Generation/Search/BeamSearchScorer.html +++ b/docs/api/com/johnsnowlabs/ml/ai/util/Generation/Search/BeamSearchScorer.html @@ -3,9 +3,9 @@ - Spark NLP 5.4.0 ScalaDoc - com.johnsnowlabs.ml.ai.util.Generation.Search.BeamSearchScorer - - + Spark NLP 5.4.1 ScalaDoc - com.johnsnowlabs.ml.ai.util.Generation.Search.BeamSearchScorer + + @@ -28,7 +28,7 @@
  • +
  • + + + + + + + + + def + + + getDone: Array[Boolean] + + +
    Definition Classes
    BeamSearchScorerBeamScorer
  • @@ -724,9 +740,9 @@

    Value Members

  • - + - + @@ -735,7 +751,7 @@

    Value Members

    def - process(inputIds: Seq[Array[Int]], nextScores: Seq[Array[Float]], nextTokens: Seq[Array[Int]], nextIndices: Seq[Array[Int]], padTokenId: Int, eosTokenId: Int, beamIndices: Seq[Array[Int]], currentLength: Int): (Array[Array[Float]], Array[Array[Int]], Array[Array[Int]]) + process(inputIds: Seq[Array[Int]], nextScores: Seq[Array[Float]], nextTokens: Seq[Array[Int]], nextIndices: Seq[Array[Int]], padTokenId: Int, eosTokenId: Int, beamIndices: Seq[Array[Int]], currentLength: Int, stopTokenIds: Array[Int]): (Array[Array[Float]], Array[Array[Int]], Array[Array[Int]])
    Definition Classes
    BeamSearchScorerBeamScorer
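The extended BeamScorer API pairs the new stopTokenIds argument of process with the new getDone accessor: process can finalize beams that hit a stop token, and getDone reports which batch entries have finished. The ToyScorer below is a self-contained, illustrative model of that bookkeeping, not Spark NLP code.

import scala.collection.mutable.ArrayBuffer

final class ToyScorer(batchSize: Int) {
  private val done = ArrayBuffer.fill(batchSize)(false)
  def getDone: Array[Boolean] = done.toArray
  def process(nextTokens: Array[Int], eosTokenId: Int, stopTokenIds: Array[Int]): Unit =
    nextTokens.zipWithIndex.foreach { case (tok, i) =>
      if (tok == eosTokenId || stopTokenIds.contains(tok)) done(i) = true
    }
}

val scorer = new ToyScorer(batchSize = 2)
scorer.process(nextTokens = Array(11, 99), eosTokenId = 2, stopTokenIds = Array(99))
println(scorer.getDone.mkString(", "))   // false, true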
    diff --git a/docs/api/com/johnsnowlabs/ml/ai/util/Generation/Search/index.html b/docs/api/com/johnsnowlabs/ml/ai/util/Generation/Search/index.html index 7578ef143830f2..abca335853f72f 100644 --- a/docs/api/com/johnsnowlabs/ml/ai/util/Generation/Search/index.html +++ b/docs/api/com/johnsnowlabs/ml/ai/util/Generation/Search/index.html @@ -3,9 +3,9 @@ - Spark NLP 5.4.0 ScalaDoc - com.johnsnowlabs.ml.ai.util.Generation.Search - - + Spark NLP 5.4.1 ScalaDoc - com.johnsnowlabs.ml.ai.util.Generation.Search + + @@ -28,7 +28,7 @@
  • +
  • + + + + + + + + + def + + + generateRandomString(n: Int): String + + +

    Generates a random alphanumeric string of a given length.

    Generates a random alphanumeric string of a given length. +

    n

    + the length of the generated string

    returns

    + a random alphanumeric string of length n
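A minimal sketch matching the documented contract of generateRandomString, assuming the standard library's Random.alphanumeric stream; the actual Spark NLP implementation may differ.

import scala.util.Random

def generateRandomString(n: Int): String =
  Random.alphanumeric.take(n).mkString

println(generateRandomString(8))   // e.g. "aZ3kQ9xP"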

  • diff --git a/docs/api/com/johnsnowlabs/ml/util/ModelArch$.html b/docs/api/com/johnsnowlabs/ml/util/ModelArch$.html index 042901645993f7..5f2e4df6e9ba35 100644 --- a/docs/api/com/johnsnowlabs/ml/util/ModelArch$.html +++ b/docs/api/com/johnsnowlabs/ml/util/ModelArch$.html @@ -3,9 +3,9 @@ - Spark NLP 5.4.0 ScalaDoc - com.johnsnowlabs.ml.util.ModelArch - - + Spark NLP 5.4.1 ScalaDoc - com.johnsnowlabs.ml.util.ModelArch + + @@ -28,7 +28,7 @@
  • + + + + + + + + + def + + + getStopTokenIds: Array[Int] + +

  • @@ -1277,6 +1293,22 @@

    Value Members

    setRepetitionPenalty(value: Double): HasGeneratorProperties.this +

    +
  • + + + + + + + + + def + + + setStopTokenIds(value: Array[Int]): HasGeneratorProperties.this + +

  • @@ -1342,6 +1374,23 @@

    Value Members

    +
  • + + + + + + + + + val + + + stopTokenIds: IntArrayParam + + +

    Stop tokens to terminate the generation +
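A usage sketch for the new parameter on an annotator that mixes in HasGeneratorProperties. The token id below is a placeholder; the real id for a stop sequence depends on the model's tokenizer.

import com.johnsnowlabs.nlp.annotators.seq2seq.MistralTransformer

val mistral = MistralTransformer.pretrained("mistral_7b")
  .setInputCols("documents")
  .setOutputCol("generation")
  .setMaxOutputLength(50)
  .setStopTokenIds(Array(2))   // assumed </s> id, tokenizer-specific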

  • diff --git a/docs/api/com/johnsnowlabs/nlp/HasImageFeatureProperties.html b/docs/api/com/johnsnowlabs/nlp/HasImageFeatureProperties.html index a3f6c31babfbb9..24fd8515ef0184 100644 --- a/docs/api/com/johnsnowlabs/nlp/HasImageFeatureProperties.html +++ b/docs/api/com/johnsnowlabs/nlp/HasImageFeatureProperties.html @@ -3,9 +3,9 @@ - Spark NLP 5.4.0 ScalaDoc - com.johnsnowlabs.nlp.HasImageFeatureProperties - - + Spark NLP 5.4.1 ScalaDoc - com.johnsnowlabs.nlp.HasImageFeatureProperties + + @@ -28,7 +28,7 @@
  • + + + + + + + + + def + + + getStopTokenIds: Array[Int] + + +

    Definition Classes
    HasGeneratorProperties
  • @@ -3280,6 +3296,22 @@

    Value Members

    +
  • + + + + + + + + + def + + + setStopTokenIds(value: Array[Int]): WhisperForCTC.this.type + + +

    Definition Classes
    HasGeneratorProperties
  • @@ -3378,6 +3410,24 @@

    Value Members

    It contains TF model signatures for the loaded saved model

    +
  • + + + + + + + + + val + + + stopTokenIds: IntArrayParam + + +

    Stop tokens to terminate the generation +

    Stop tokens to terminate the generation +

    Definition Classes
    HasGeneratorProperties
  • diff --git a/docs/api/com/johnsnowlabs/nlp/annotators/audio/feature_extractor/AudioUtils$.html b/docs/api/com/johnsnowlabs/nlp/annotators/audio/feature_extractor/AudioUtils$.html index f264976548bec8..f18af3b4333e09 100644 --- a/docs/api/com/johnsnowlabs/nlp/annotators/audio/feature_extractor/AudioUtils$.html +++ b/docs/api/com/johnsnowlabs/nlp/annotators/audio/feature_extractor/AudioUtils$.html @@ -3,9 +3,9 @@ - Spark NLP 5.4.0 ScalaDoc - com.johnsnowlabs.nlp.annotators.audio.feature_extractor.AudioUtils - - + Spark NLP 5.4.1 ScalaDoc - com.johnsnowlabs.nlp.annotators.audio.feature_extractor.AudioUtils + + @@ -28,7 +28,7 @@
  • + + + + + + + + + def + + + getStopTokenIds: Array[Int] + + +

    Definition Classes
    HasGeneratorProperties
  • @@ -3247,6 +3263,22 @@

    Value Members

    Definition Classes
    HasImageFeatureProperties
    +
  • + + + + + + + + + def + + + setStopTokenIds(value: Array[Int]): VisionEncoderDecoderForImageCaptioning.this.type + + +

    Definition Classes
    HasGeneratorProperties
  • @@ -3362,6 +3394,24 @@

    Value Members

    Resize the input to the given size.

    Resize the input to the given size. If a tuple is provided, it should be (width, height). If only an integer is provided, then the input will be resized to (size, size). Only has an effect if do_resize is set to True.

    Definition Classes
    HasImageFeatureProperties
    +
  • + + + + + + + + + val + + + stopTokenIds: IntArrayParam + + +

    Stop tokens to terminate the generation +

    Stop tokens to terminate the generation +

    Definition Classes
    HasGeneratorProperties
  • diff --git a/docs/api/com/johnsnowlabs/nlp/annotators/cv/index.html b/docs/api/com/johnsnowlabs/nlp/annotators/cv/index.html index efd0ab434a670a..7859716a8becfe 100644 --- a/docs/api/com/johnsnowlabs/nlp/annotators/cv/index.html +++ b/docs/api/com/johnsnowlabs/nlp/annotators/cv/index.html @@ -3,9 +3,9 @@ - Spark NLP 5.4.0 ScalaDoc - com.johnsnowlabs.nlp.annotators.cv - - + Spark NLP 5.4.1 ScalaDoc - com.johnsnowlabs.nlp.annotators.cv + + @@ -28,7 +28,7 @@
  • + + + + + + + + + def + + + getStopTokenIds: Array[Int] + + +

    Definition Classes
    HasGeneratorProperties
  • @@ -2852,6 +2868,22 @@

    Value Members

    +
  • + + + + + + + + + def + + + setStopTokenIds(value: Array[Int]): BartTransformer.this.type + + +

    Definition Classes
    HasGeneratorProperties
  • @@ -2965,6 +2997,24 @@

    Value Members

    It contains TF model signatures for the loaded saved model

    +
  • + + + + + + + + + val + + + stopTokenIds: IntArrayParam + + +

    Stop tokens to terminate the generation +

    Stop tokens to terminate the generation +

    Definition Classes
    HasGeneratorProperties
  • diff --git a/docs/api/com/johnsnowlabs/nlp/annotators/seq2seq/GPT2Transformer$.html b/docs/api/com/johnsnowlabs/nlp/annotators/seq2seq/GPT2Transformer$.html index 3a9bcfcf7fa560..37fb8ceae58dd0 100644 --- a/docs/api/com/johnsnowlabs/nlp/annotators/seq2seq/GPT2Transformer$.html +++ b/docs/api/com/johnsnowlabs/nlp/annotators/seq2seq/GPT2Transformer$.html @@ -3,9 +3,9 @@ - Spark NLP 5.4.0 ScalaDoc - com.johnsnowlabs.nlp.annotators.seq2seq.GPT2Transformer - - + Spark NLP 5.4.1 ScalaDoc - com.johnsnowlabs.nlp.annotators.seq2seq.GPT2Transformer + + @@ -28,7 +28,7 @@
  • + + + + + + + + + def + + + getStopTokenIds: Array[Int] + +

    Definition Classes
    HasGeneratorProperties
  • @@ -2769,6 +2785,22 @@

    Value Members

    setRepetitionPenalty(value: Double): LLAMA2Transformer.this.type +

    Definition Classes
    HasGeneratorProperties
    +
  • + + + + + + + + + def + + + setStopTokenIds(value: Array[Int]): LLAMA2Transformer.this.type + +

    Definition Classes
    HasGeneratorProperties
  • @@ -2834,6 +2866,24 @@

    Value Members

    Definition Classes
    HasGeneratorProperties
    +
  • + + + + + + + + + val + + + stopTokenIds: IntArrayParam + + +

    Stop tokens to terminate the generation +

    Stop tokens to terminate the generation +

    Definition Classes
    HasGeneratorProperties
  • diff --git a/docs/api/com/johnsnowlabs/nlp/annotators/seq2seq/M2M100Transformer$.html b/docs/api/com/johnsnowlabs/nlp/annotators/seq2seq/M2M100Transformer$.html index 25d2a3353a296f..6453a2d64fd1f9 100644 --- a/docs/api/com/johnsnowlabs/nlp/annotators/seq2seq/M2M100Transformer$.html +++ b/docs/api/com/johnsnowlabs/nlp/annotators/seq2seq/M2M100Transformer$.html @@ -3,9 +3,9 @@ - Spark NLP 5.4.0 ScalaDoc - com.johnsnowlabs.nlp.annotators.seq2seq.M2M100Transformer - - + Spark NLP 5.4.1 ScalaDoc - com.johnsnowlabs.nlp.annotators.seq2seq.M2M100Transformer + + @@ -28,7 +28,7 @@
  • + + + + + + + + + def + + + getStopTokenIds: Array[Int] + + +

    Definition Classes
    HasGeneratorProperties
  • @@ -2830,6 +2846,22 @@

    Value Members

    +
  • + + + + + + + + + def + + + setStopTokenIds(value: Array[Int]): M2M100Transformer.this.type + + +

    Definition Classes
    HasGeneratorProperties
  • @@ -2942,6 +2974,24 @@

    Value Members

    Source Language (Default: en)

    +
  • + + + + + + + + + val + + + stopTokenIds: IntArrayParam + + +

    Stop tokens to terminate the generation +

    Stop tokens to terminate the generation +

    Definition Classes
    HasGeneratorProperties
  • diff --git a/docs/api/com/johnsnowlabs/nlp/annotators/seq2seq/MarianTransformer$.html b/docs/api/com/johnsnowlabs/nlp/annotators/seq2seq/MarianTransformer$.html index 10911241ee7384..0dfaf60e1feab0 100644 --- a/docs/api/com/johnsnowlabs/nlp/annotators/seq2seq/MarianTransformer$.html +++ b/docs/api/com/johnsnowlabs/nlp/annotators/seq2seq/MarianTransformer$.html @@ -3,9 +3,9 @@ - Spark NLP 5.4.0 ScalaDoc - com.johnsnowlabs.nlp.annotators.seq2seq.MarianTransformer - - + Spark NLP 5.4.1 ScalaDoc - com.johnsnowlabs.nlp.annotators.seq2seq.MarianTransformer + + @@ -28,7 +28,7 @@
  • + + + + + + + + + def + + + getStopTokenIds: Array[Int] + +

    Definition Classes
    HasGeneratorProperties
  • @@ -2772,6 +2788,22 @@

    Value Members

    setRepetitionPenalty(value: Double): MistralTransformer.this.type +

    Definition Classes
    HasGeneratorProperties
    +
  • + + + + + + + + + def + + + setStopTokenIds(value: Array[Int]): MistralTransformer.this.type + +

    Definition Classes
    HasGeneratorProperties
  • @@ -2837,6 +2869,24 @@

    Value Members

    Definition Classes
    HasGeneratorProperties
    +
  • + + + + + + + + + val + + + stopTokenIds: IntArrayParam + + +

    Stop tokens to terminate the generation +

    Stop tokens to terminate the generation +

    Definition Classes
    HasGeneratorProperties
  • diff --git a/docs/api/com/johnsnowlabs/nlp/annotators/seq2seq/Phi2Transformer$.html b/docs/api/com/johnsnowlabs/nlp/annotators/seq2seq/Phi2Transformer$.html index a355c312ca8170..4d674285c01a61 100644 --- a/docs/api/com/johnsnowlabs/nlp/annotators/seq2seq/Phi2Transformer$.html +++ b/docs/api/com/johnsnowlabs/nlp/annotators/seq2seq/Phi2Transformer$.html @@ -3,9 +3,9 @@ - Spark NLP 5.4.0 ScalaDoc - com.johnsnowlabs.nlp.annotators.seq2seq.Phi2Transformer - - + Spark NLP 5.4.1 ScalaDoc - com.johnsnowlabs.nlp.annotators.seq2seq.Phi2Transformer + + @@ -28,7 +28,7 @@
  • + + + + + + + + + def + + + getStopTokenIds: Array[Int] + +

    Definition Classes
    HasGeneratorProperties
  • @@ -2810,6 +2826,22 @@

    Value Members

    setRepetitionPenalty(value: Double): Phi2Transformer.this.type +

    Definition Classes
    HasGeneratorProperties
    +
  • + + + + + + + + + def + + + setStopTokenIds(value: Array[Int]): Phi2Transformer.this.type + +

    Definition Classes
    HasGeneratorProperties
  • @@ -2891,6 +2923,24 @@

    Value Members

    +
  • + + + + + + + + + val + + + stopTokenIds: IntArrayParam + + +

    Stop tokens to terminate the generation +

    Stop tokens to terminate the generation +

    Definition Classes
    HasGeneratorProperties
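The same accessors land on Phi2Transformer; a round-trip sketch setting the new param and reading it back with the matching getter. The id is again a placeholder for whatever the tokenizer maps a stop sequence to.

import com.johnsnowlabs.nlp.annotators.seq2seq.Phi2Transformer

val phi2 = Phi2Transformer.pretrained()   // default model
  .setInputCols("documents")
  .setOutputCol("generation")
  .setStopTokenIds(Array(50256))          // placeholder id

assert(phi2.getStopTokenIds.sameElements(Array(50256)))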
  • diff --git a/docs/api/com/johnsnowlabs/nlp/annotators/seq2seq/ReadBartTransformerDLModel.html b/docs/api/com/johnsnowlabs/nlp/annotators/seq2seq/ReadBartTransformerDLModel.html index f638e740092301..0c0c67879fc621 100644 --- a/docs/api/com/johnsnowlabs/nlp/annotators/seq2seq/ReadBartTransformerDLModel.html +++ b/docs/api/com/johnsnowlabs/nlp/annotators/seq2seq/ReadBartTransformerDLModel.html @@ -3,9 +3,9 @@ - Spark NLP 5.4.0 ScalaDoc - com.johnsnowlabs.nlp.annotators.seq2seq.ReadBartTransformerDLModel - - + Spark NLP 5.4.1 ScalaDoc - com.johnsnowlabs.nlp.annotators.seq2seq.ReadBartTransformerDLModel + + @@ -28,7 +28,7 @@