diff --git a/.gitignore b/.gitignore
index aa264460d82d79..e91a8952f7c8d5 100644
--- a/.gitignore
+++ b/.gitignore
@@ -338,3 +338,4 @@ python/docs/reference/_autosummary/**
# MS Visual Studio Code
**/.vscode/
+.metals/
\ No newline at end of file
diff --git a/CHANGELOG b/CHANGELOG
index 30f8d250436dee..3e611cda48316f 100644
--- a/CHANGELOG
+++ b/CHANGELOG
@@ -1,3 +1,21 @@
+========
+5.0.0
+========
+----------------
+New Features & Enhancements
+----------------
+* **NEW:** Introducing support for ONNX Runtime in Spark NLP. ONNX Runtime is a high-performance inference engine for machine learning models in the ONNX format, and it has proved to considerably speed up inference for many models.
+* **NEW:** Introducing the **InstructorEmbeddings** annotator in Spark NLP 🚀. `InstructorEmbeddings` can load new state-of-the-art INSTRUCTOR models, built on T5, for text embeddings.
+* **NEW:** Introducing the **E5Embeddings** annotator in Spark NLP 🚀. `E5Embeddings` can load new state-of-the-art E5 models, built on BERT, for text embeddings.
+* **NEW:** Introducing the **DocumentSimilarityRanker** annotator in Spark NLP 🚀. `DocumentSimilarityRanker` uses the LSH techniques in Spark MLlib to run approximate nearest-neighbour search on top of sentence embeddings. It aims to capture the semantic meaning of a document in a dense, continuous vector space and return the nearest matches from the ranker search; a minimal usage sketch follows this list.
+
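A minimal sketch of wiring the new annotators into a PySpark pipeline; the default pretrained E5 model and the ranker's setter names are assumptions drawn from the descriptions above, not confirmed APIs:

```python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import E5Embeddings, DocumentSimilarityRanker

# Turn raw text into Spark NLP documents
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# One dense vector per document from the default pretrained E5 model
embeddings = E5Embeddings.pretrained() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

# Approximate nearest-neighbour ranking (LSH) over the sentence embeddings;
# the setter names here are assumptions, not a confirmed API
ranker = DocumentSimilarityRanker() \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("doc_similarity_rankings") \
    .setNumberOfNeighbours(3)

pipeline = Pipeline(stages=[document_assembler, embeddings, ranker])
```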
+----------------
+Bug Fixes
+----------------
+* Fix BART issue with maxInputLength
+
+
+
========
4.4.4
========
diff --git a/README.md b/README.md
index 1e7009ef6ab9c6..a002cf66f1c285 100644
--- a/README.md
+++ b/README.md
@@ -22,7 +22,7 @@ environment.
Spark NLP comes with **17000+** pretrained **pipelines** and **models** in **200+** languages.
It also offers tasks such as **Tokenization**, **Word Segmentation**, **Part-of-Speech Tagging**, Word and Sentence **Embeddings**, **Named Entity Recognition**, **Dependency Parsing**, **Spell Checking**, **Text Classification**, **Sentiment Analysis**, **Token Classification**, **Machine Translation** (+180 languages), **Summarization**, **Question Answering**, **Table Question Answering**, **Text Generation**, **Image Classification**, **Automatic Speech Recognition**, **Zero-Shot Learning**, and many more [NLP tasks](#features).
-**Spark NLP** is the only open-source NLP library in **production** that offers state-of-the-art transformers such as **BERT**, **CamemBERT**, **ALBERT**, **ELECTRA**, **XLNet**, **DistilBERT**, **RoBERTa**, **DeBERTa**, **XLM-RoBERTa**, **Longformer**, **ELMO**, **Universal Sentence Encoder**, **Facebook BART**, **Google T5**, **MarianMT**, **OpenAI GPT2**, and **Vision Transformers (ViT)** not only to **Python** and **R**, but also to **JVM** ecosystem (**Java**, **Scala**, and **Kotlin**) at **scale** by extending **Apache Spark** natively.
+**Spark NLP** is the only open-source NLP library in **production** that offers state-of-the-art transformers such as **BERT**, **CamemBERT**, **ALBERT**, **ELECTRA**, **XLNet**, **DistilBERT**, **RoBERTa**, **DeBERTa**, **XLM-RoBERTa**, **Longformer**, **ELMO**, **Universal Sentence Encoder**, **Facebook BART**, **Instructor**, **E5**, **Google T5**, **MarianMT**, **OpenAI GPT2**, and **Vision Transformers (ViT)** not only to **Python** and **R**, but also to the **JVM** ecosystem (**Java**, **Scala**, and **Kotlin**) at **scale** by extending **Apache Spark** natively.
## Project's website
@@ -48,25 +48,25 @@ documentation and examples
- [Databricks Support](#databricks-support)
- [EMR Support](#emr-support)
- [Using Spark NLP](#usage)
- - [Packages Cheatsheet](#packages-cheatsheet)
- - [Spark Packages](#spark-packages)
- - [Scala](#scala)
- - [Maven](#maven)
- - [SBT](#sbt)
- - [Python](#python)
- - [Pip/Conda](#pipconda)
- - [Compiled JARs](#compiled-jars)
- - [Apache Zeppelin](#apache-zeppelin)
- - [Jupyter Notebook](#jupyter-notebook-python)
- - [Google Colab Notebook](#google-colab-notebook)
- - [Kaggle Kernel](#kaggle-kernel)
- - [Databricks Cluster](#databricks-cluster)
- - [EMR Cluster](#emr-cluster)
- - [GCP Dataproc](#gcp-dataproc)
- - [Spark NLP Configuration](#spark-nlp-configuration)
+ - [Packages Cheatsheet](#packages-cheatsheet)
+ - [Spark Packages](#spark-packages)
+ - [Scala](#scala)
+ - [Maven](#maven)
+ - [SBT](#sbt)
+ - [Python](#python)
+ - [Pip/Conda](#pipconda)
+ - [Compiled JARs](#compiled-jars)
+ - [Apache Zeppelin](#apache-zeppelin)
+ - [Jupyter Notebook](#jupyter-notebook-python)
+ - [Google Colab Notebook](#google-colab-notebook)
+ - [Kaggle Kernel](#kaggle-kernel)
+ - [Databricks Cluster](#databricks-cluster)
+ - [EMR Cluster](#emr-cluster)
+ - [GCP Dataproc](#gcp-dataproc)
+ - [Spark NLP Configuration](#spark-nlp-configuration)
- [Pipelines & Models](#pipelines-and-models)
- - [Pipelines](#pipelines)
- - [Models](#models)
+ - [Pipelines](#pipelines)
+ - [Models](#models)
- [Offline](#offline)
- [Examples](#examples)
- [FAQ](#faq)
@@ -110,6 +110,8 @@ documentation and examples
- BERT Sentence Embeddings (TF Hub & HuggingFace models)
- RoBerta Sentence Embeddings (HuggingFace models)
- XLM-RoBerta Sentence Embeddings (HuggingFace models)
+- Instructor Embeddings (HuggingFace models)
+- E5 Embeddings (HuggingFace models)
- Sentence Embeddings
- Chunk Embeddings
- Unsupervised keywords extraction
@@ -148,7 +150,7 @@ documentation and examples
- Automatic Speech Recognition (Wav2Vec2)
- Automatic Speech Recognition (HuBERT)
- Named entity recognition (Deep learning)
-- Easy TensorFlow integration
+- Easy ONNX and TensorFlow integrations
- GPU Support
- Full integration with Spark ML functions
- +12000 pre-trained models in +200 languages!
@@ -165,7 +167,7 @@ To use Spark NLP you need the following requirements:
**GPU (optional):**
-Spark NLP 4.4.4 is built with TensorFlow 2.7.1 and the following NVIDIA® software are only required for GPU support:
+Spark NLP 5.0.0 is built with the ONNX 1.15.1 and TensorFlow 2.7.1 deep learning engines. The following NVIDIA® software is required only for GPU support (minimum versions):
- NVIDIA® GPU drivers version 450.80.02 or higher
- CUDA® Toolkit 11.2
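With those drivers in place, the GPU build can be selected from Python via the `start()` helper; a small sketch, assuming the standard `sparknlp` Python package:

```python
import sparknlp

# Launches a SparkSession with the spark-nlp-gpu package on the classpath
spark = sparknlp.start(gpu=True)

print(sparknlp.version())  # expect 5.0.0
print(spark.version)       # the underlying Apache Spark version
```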
@@ -181,7 +183,7 @@ $ java -version
$ conda create -n sparknlp python=3.7 -y
$ conda activate sparknlp
# spark-nlp by default is based on pyspark 3.x
-$ pip install spark-nlp==4.4.4 pyspark==3.3.1
+$ pip install spark-nlp==5.0.0 pyspark==3.3.1
```
In Python console or Jupyter `Python3` kernel:
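A minimal sketch of such a session; the `explain_document_dl` pipeline name is an assumption based on the publicly listed English pipelines:

```python
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

# Start a SparkSession with spark-nlp on the classpath
spark = sparknlp.start()

# Download a pretrained pipeline (name assumed from the models hub)
pipeline = PretrainedPipeline("explain_document_dl", lang="en")

result = pipeline.annotate("Spark NLP 5.0.0 adds ONNX Runtime support.")
print(result["entities"])
```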
@@ -226,24 +228,23 @@ For more examples, you can visit our dedicated [examples](https://github.com/Joh
## Apache Spark Support
-Spark NLP *4.4.4* has been built on top of Apache Spark 3.2 while fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x
+Spark NLP *5.0.0* has been built on top of Apache Spark 3.4 while fully supporting Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x
| Spark NLP | Apache Spark 2.3.x | Apache Spark 2.4.x | Apache Spark 3.0.x | Apache Spark 3.1.x | Apache Spark 3.2.x | Apache Spark 3.3.x | Apache Spark 3.4.x |
|-----------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|
+| 5.0.x | NO | NO | YES | YES | YES | YES | YES |
| 4.4.x | NO | NO | YES | YES | YES | YES | YES |
-| 4.3.x | NO | NO | YES | YES | YES | YES | NO |
-| 4.2.x | NO | NO | YES | YES | YES | YES | NO |
-| 4.1.x | NO | NO | YES | YES | YES | YES | NO |
-| 4.0.x | NO | NO | YES | YES | YES | YES | NO |
-| 3.4.x | YES | YES | YES | YES | Partially | N/A | NO
+| 4.3.x | NO | NO | YES | YES | YES | YES | NO |
+| 4.2.x | NO | NO | YES | YES | YES | YES | NO |
+| 4.1.x | NO | NO | YES | YES | YES | YES | NO |
+| 4.0.x | NO | NO | YES | YES | YES | YES | NO |
+| 3.4.x | YES | YES | YES | YES | Partially | N/A | NO |
| 3.3.x | YES | YES | YES | YES | NO | NO | NO |
| 3.2.x | YES | YES | YES | YES | NO | NO | NO |
| 3.1.x | YES | YES | YES | YES | NO | NO | NO |
| 3.0.x | YES | YES | YES | YES | NO | NO | NO |
| 2.7.x | YES | YES | NO | NO | NO | NO | NO |
-NOTE: Starting 4.0.0 release, the default `spark-nlp` and `spark-nlp-gpu` packages are based on Scala 2.12.15 and Apache
-Spark 3.2 by default.
Find out more about `Spark NLP` versions from our [release notes](https://github.com/JohnSnowLabs/spark-nlp/releases).
@@ -251,6 +252,7 @@ Find out more about `Spark NLP` versions from our [release notes](https://github
| Spark NLP | Python 3.6 | Python 3.7 | Python 3.8 | Python 3.9 | Python 3.10| Scala 2.11 | Scala 2.12 |
|-----------|------------|------------|------------|------------|------------|------------|------------|
+| 5.0.x | NO | YES | YES | YES | YES | NO | YES |
| 4.4.x | NO | YES | YES | YES | YES | NO | YES |
| 4.3.x | YES | YES | YES | YES | YES | NO | YES |
| 4.2.x | YES | YES | YES | YES | YES | NO | YES |
@@ -265,12 +267,10 @@ Find out more about `Spark NLP` versions from our [release notes](https://github
## Databricks Support
-Spark NLP 4.4.4 has been tested and is compatible with the following runtimes:
+Spark NLP 5.0.0 has been tested and is compatible with the following runtimes:
**CPU:**
-- 7.3
-- 7.3 ML
- 9.1
- 9.1 ML
- 10.1
@@ -299,6 +299,10 @@ Spark NLP 4.4.4 has been tested and is compatible with the following runtimes:
- 12.2 ML
- 13.0
- 13.0 ML
+- 13.1
+- 13.1 ML
+- 13.2
+- 13.2 ML
**GPU:**
@@ -316,13 +320,12 @@ Spark NLP 4.4.4 has been tested and is compatible with the following runtimes:
- 12.1 ML & GPU
- 12.2 ML & GPU
- 13.0 ML & GPU
-
-NOTE: Spark NLP 4.x is based on TensorFlow 2.7.x which is compatible with CUDA11 and cuDNN 8.0.2. The only Databricks
-runtimes supporting CUDA 11 are 9.x and above as listed under GPU.
+- 13.1 ML & GPU
+- 13.2 ML & GPU
## EMR Support
-Spark NLP 4.4.4 has been tested and is compatible with the following EMR releases:
+Spark NLP 5.0.0 has been tested and is compatible with the following EMR releases:
- emr-6.2.0
- emr-6.3.0
@@ -334,6 +337,7 @@ Spark NLP 4.4.4 has been tested and is compatible with the following EMR release
- emr-6.8.0
- emr-6.9.0
- emr-6.10.0
+- emr-6.11.0
Full list of [Amazon EMR 6.x releases](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-6x.html)
@@ -365,11 +369,11 @@ Spark NLP supports all major releases of Apache Spark 3.0.x, Apache Spark 3.1.x,
```sh
# CPU
-spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.4
+spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.0.0
-pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.4
+pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.0.0
-spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.4
+spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.0.0
```
The `spark-nlp` has been published to
@@ -378,11 +382,11 @@ the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/s
```sh
# GPU
-spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.4.4
+spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.0.0
-pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.4.4
+pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.0.0
-spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.4.4
+spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.0.0
```
@@ -392,11 +396,11 @@ the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/s
```sh
# AArch64
-spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.4.4
+spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.0.0
-pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.4.4
+pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.0.0
-spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.4.4
+spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.0.0
```
@@ -406,11 +410,11 @@ the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/s
```sh
# M1/M2 (Apple Silicon)
-spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:4.4.4
+spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.0.0
-pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:4.4.4
+pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.0.0
-spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:4.4.4
+spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.0.0
```
@@ -424,7 +428,7 @@ set in your SparkSession:
spark-shell \
--driver-memory 16g \
--conf spark.kryoserializer.buffer.max=2000M \
- --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.4
+ --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.0.0
```
## Scala
@@ -442,7 +446,7 @@ coordinates:
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
-    <version>4.4.4</version>
+    <version>5.0.0</version>
</dependency>
```
@@ -453,7 +457,7 @@ coordinates:
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
-    <version>4.4.4</version>
+    <version>5.0.0</version>
</dependency>
```
@@ -464,7 +468,7 @@ coordinates:
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-aarch64_2.12</artifactId>
-    <version>4.4.4</version>
+    <version>5.0.0</version>
</dependency>
```
@@ -475,7 +479,7 @@ coordinates:
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-silicon_2.12</artifactId>
-    <version>4.4.4</version>
+    <version>5.0.0</version>
</dependency>
```
@@ -485,28 +489,28 @@ coordinates:
```sbtshell
// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp
-libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "4.4.4"
+libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "5.0.0"
```
**spark-nlp-gpu:**
```sbtshell
// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-gpu
-libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-gpu" % "4.4.4"
+libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-gpu" % "5.0.0"
```
**spark-nlp-aarch64:**
```sbtshell
// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-aarch64
-libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-aarch64" % "4.4.4"
+libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-aarch64" % "5.0.0"
```
**spark-nlp-silicon:**
```sbtshell
// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-silicon
-libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-silicon" % "4.4.4"
+libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-silicon" % "5.0.0"
```
Maven
@@ -528,7 +532,7 @@ If you installed pyspark through pip/conda, you can install `spark-nlp` through
Pip:
```bash
-pip install spark-nlp==4.4.4
+pip install spark-nlp==5.0.0
```
Conda:
@@ -557,7 +561,7 @@ spark = SparkSession.builder
.config("spark.driver.memory", "16G")
.config("spark.driver.maxResultSize", "0")
.config("spark.kryoserializer.buffer.max", "2000M")
- .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.4")
+ .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.0.0")
.getOrCreate()
```
@@ -628,7 +632,7 @@ Use either one of the following options
- Add the following Maven Coordinates to the interpreter's library list
```bash
-com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.4
+com.johnsnowlabs.nlp:spark-nlp_2.12:5.0.0
```
- Add a path to pre-built jar from [here](#compiled-jars) in the interpreter's library list making sure the jar is
@@ -639,7 +643,7 @@ com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.4
Apart from the previous step, install the python module through pip
```bash
-pip install spark-nlp==4.4.4
+pip install spark-nlp==5.0.0
```
Or you can install `spark-nlp` from inside Zeppelin by using Conda:
@@ -667,7 +671,7 @@ launch the Jupyter from the same Python environment:
$ conda create -n sparknlp python=3.8 -y
$ conda activate sparknlp
# spark-nlp by default is based on pyspark 3.x
-$ pip install spark-nlp==4.4.4 pyspark==3.3.1 jupyter
+$ pip install spark-nlp==5.0.0 pyspark==3.3.1 jupyter
$ jupyter notebook
```
@@ -684,13 +688,13 @@ export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
-pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.4
+pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.0.0
```
Alternatively, you can combine the `--jars` option for pyspark with `pip install spark-nlp`
If you are not using pyspark at all, you'll have to run the instructions
-pointed [here](#python-without-explicit-Pyspark-installation)
+pointed [here](#python-without-explicit-pyspark-installation)
## Google Colab Notebook
@@ -711,7 +715,7 @@ This script comes with the two options to define `pyspark` and `spark-nlp` versi
# -s is for spark-nlp
# -g will enable upgrading libcudnn8 to 8.1.0 on Google Colab for GPU usage
# by default they are set to the latest
-!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 4.4.4
+!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.0.0
```
[Spark NLP quick start on Google Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/quick_start_google_colab.ipynb)
@@ -734,7 +738,7 @@ This script comes with the two options to define `pyspark` and `spark-nlp` versi
# -s is for spark-nlp
# -g will enable upgrading libcudnn8 to 8.1.0 on Kaggle for GPU usage
# by default they are set to the latest
-!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 4.4.4
+!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.0.0
```
[Spark NLP quick start on Kaggle Kernel](https://www.kaggle.com/mozzie/spark-nlp-named-entity-recognition) is a live
@@ -753,9 +757,9 @@ demo on Kaggle Kernel that performs named entity recognitions by using Spark NLP
3. In `Libraries` tab inside your cluster you need to follow these steps:
- 3.1. Install New -> PyPI -> `spark-nlp==4.4.4` -> Install
+ 3.1. Install New -> PyPI -> `spark-nlp==5.0.0` -> Install
- 3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.4` -> Install
+ 3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:5.0.0` -> Install
4. Now you can attach your notebook to the cluster and use Spark NLP!
@@ -806,7 +810,7 @@ A sample of your software configuration in JSON on S3 (must be public access):
"spark.kryoserializer.buffer.max": "2000M",
"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.driver.maxResultSize": "0",
- "spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.4"
+ "spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:5.0.0"
}
}]
```
@@ -815,7 +819,7 @@ A sample of AWS CLI to launch EMR cluster:
```.sh
aws emr create-cluster \
---name "Spark NLP 4.4.4" \
+--name "Spark NLP 5.0.0" \
--release-label emr-6.2.0 \
--applications Name=Hadoop Name=Spark Name=Hive \
--instance-type m4.4xlarge \
@@ -879,7 +883,7 @@ gcloud dataproc clusters create ${CLUSTER_NAME} \
--enable-component-gateway \
--metadata 'PIP_PACKAGES=spark-nlp spark-nlp-display google-cloud-bigquery google-cloud-storage' \
--initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh \
- --properties spark:spark.serializer=org.apache.spark.serializer.KryoSerializer,spark:spark.driver.maxResultSize=0,spark:spark.kryoserializer.buffer.max=2000M,spark:spark.jars.packages=com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.4
+ --properties spark:spark.serializer=org.apache.spark.serializer.KryoSerializer,spark:spark.driver.maxResultSize=0,spark:spark.kryoserializer.buffer.max=2000M,spark:spark.jars.packages=com.johnsnowlabs.nlp:spark-nlp_2.12:5.0.0
```
2. On an existing one, you need to install spark-nlp and spark-nlp-display packages from PyPI.
@@ -918,7 +922,7 @@ spark = SparkSession.builder
.config("spark.kryoserializer.buffer.max", "2000m")
.config("spark.jsl.settings.pretrained.cache_folder", "sample_data/pretrained")
.config("spark.jsl.settings.storage.cluster_tmp_dir", "sample_data/storage")
- .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.4")
+ .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.0.0")
.getOrCreate()
```
@@ -932,7 +936,7 @@ spark-shell \
--conf spark.kryoserializer.buffer.max=2000M \
--conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \
--conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \
- --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.4
+ --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.0.0
```
**pyspark:**
@@ -945,7 +949,7 @@ pyspark \
--conf spark.kryoserializer.buffer.max=2000M \
--conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \
--conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \
- --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.4
+ --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.0.0
```
**Databricks:**
@@ -1217,7 +1221,7 @@ spark = SparkSession.builder
.config("spark.driver.memory", "16G")
.config("spark.driver.maxResultSize", "0")
.config("spark.kryoserializer.buffer.max", "2000M")
- .config("spark.jars", "/tmp/spark-nlp-assembly-4.4.4.jar")
+ .config("spark.jars", "/tmp/spark-nlp-assembly-5.0.0.jar")
.getOrCreate()
```
@@ -1226,7 +1230,7 @@ spark = SparkSession.builder
version (3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x)
- If you are running locally, you can load the Fat JAR from your local FileSystem; however, if you are in a cluster setup you need
to put the Fat JAR on a distributed FileSystem such as HDFS, DBFS, S3, etc. (
- i.e., `hdfs:///tmp/spark-nlp-assembly-4.4.4.jar`)
+ i.e., `hdfs:///tmp/spark-nlp-assembly-5.0.0.jar`)
Example of using pretrained Models and Pipelines in offline:
diff --git a/build.sbt b/build.sbt
index 83330ac60cd9ba..c9e37ecd4a699e 100644
--- a/build.sbt
+++ b/build.sbt
@@ -6,7 +6,7 @@ name := getPackageName(is_silicon, is_gpu, is_aarch64)
organization := "com.johnsnowlabs.nlp"
-version := "4.4.4"
+version := "5.0.0"
(ThisBuild / scalaVersion) := scalaVer
@@ -165,6 +165,16 @@ val tensorflowDependencies: Seq[sbt.ModuleID] =
else
Seq(tensorflowCPU)
+val onnxDependencies: Seq[sbt.ModuleID] =
+ if (is_gpu.equals("true"))
+ Seq(onnxGPU)
+ else if (is_silicon.equals("true"))
+ Seq(onnxCPU)
+ else if (is_aarch64.equals("true"))
+ Seq(onnxCPU)
+ else
+ Seq(onnxCPU)
+
lazy val mavenProps = settingKey[Unit]("workaround for Maven properties")
lazy val root = (project in file("."))
@@ -175,6 +185,7 @@ lazy val root = (project in file("."))
testDependencies ++
utilDependencies ++
tensorflowDependencies ++
+ onnxDependencies ++
typedDependencyParserDependencies,
// TODO potentially improve this?
mavenProps := {
diff --git a/conda/README.md b/conda/README.md
index 1ff6ddb645b173..b04e5a32133168 100644
--- a/conda/README.md
+++ b/conda/README.md
@@ -35,13 +35,13 @@ conda config --set anaconda_upload no
Build `spark-nlp` from the latest PyPI tar:
```bash
-conda build . --python=3.7 && conda build . --python=3.8 && conda build . --python=3.9
+conda build conda/
```
Example of uploading Conda package to Anaconda Cloud:
```bash
-anaconda upload /anaconda3/conda-bld/noarch/spark-nlp-version-py37_0.tar.bz2
+anaconda upload /anaconda3/conda-bld/noarch/spark-nlp-version-py_0.tar.bz2
```
## Install
diff --git a/conda/conda_build_config.yaml b/conda/conda_build_config.yaml
deleted file mode 100644
index 293f77e58aa60b..00000000000000
--- a/conda/conda_build_config.yaml
+++ /dev/null
@@ -1,4 +0,0 @@
-python:
- - 3.7
- - 3.8
- - 3.9
diff --git a/conda/meta.yaml b/conda/meta.yaml
index 3bf52ab98dc2cf..4ccecd03892a1e 100644
--- a/conda/meta.yaml
+++ b/conda/meta.yaml
@@ -1,30 +1,36 @@
-package:
- name: "spark-nlp"
- version: 4.4.4
+{% set name = "spark-nlp" %}
+{% set version = "4.4.0" %}
-app:
- entry: spark-nlp
- summary: Natural Language Understanding Library for Apache Spark.
+package:
+ name: {{ name|lower }}
+ version: {{ version }}
source:
- fn: spark-nlp-4.4.4.tar.gz
- url: https://files.pythonhosted.org/packages/f9/e4/5eb83ed1c68be9fca636f6c62f9e55da3f2e511818e2a8feb852d6986064/spark-nlp-4.4.4.tar.gz
- sha256: d9e2f017ab7cf6e82e775c38862f1a4ee32bbb0af6619e0b9051e6737711b5b6
+ url: https://pypi.io/packages/source/{{ name[0] }}/{{ name }}/spark-nlp-{{ version }}.tar.gz
+ sha256: e76fdd82b966ca169ba8a1fdcfe2e684fc63abaf88de841d2eb881cacb5e0105
+
build:
- noarch: generic
+ noarch: python
+ script: {{ PYTHON }} -m pip install . -vv
number: 0
- script: "python -m pip install . --no-deps -vv"
requirements:
- build:
- - python
+ host:
+ - python >=3.7,<3.11
+ - pip
run:
- - python
+ - python >=3.7,<3.11
+
+test:
+ imports:
+ - sparknlp
+ commands:
+ - pip check
+ requires:
+ - pip
about:
- home: https://github.com/JohnSnowLabs/spark-nlp/
- license: Apache License 2.0
- license_family: APACHE
- license_url: https://github.com/JohnSnowLabs/spark-nlp/blob/master/LICENSE
- description: John Snow Labs Spark-NLP is a natural language processing library built on top of Apache Spark ML. It provides simple, performant & accurate NLP annotations for machine learning pipelines, that scale easily in a distributed environment.
- summary: Natural Language Understanding Library for Apache Spark.
+ home: https://github.com/JohnSnowLabs/spark-nlp
+ summary: John Snow Labs Spark NLP is a natural language processing library built on top of Apache Spark ML. It provides simple, performant & accurate NLP annotations for machine learning pipelines that scale easily in a distributed environment.
+ license: Apache-2.0
+ license_file: LICENSE
diff --git a/docs/_layouts/landing.html b/docs/_layouts/landing.html
index 0ff09427399feb..ad8d347e9edffb 100755
--- a/docs/_layouts/landing.html
+++ b/docs/_layouts/landing.html
@@ -224,7 +224,7 @@
Transformers at Scale
Unlock the power of Large Language Models with Spark NLP 🚀, the only open-source library that delivers cutting-edge transformers for production such as
BERT, CamemBERT, ALBERT, ELECTRA, XLNet, DistilBERT, RoBERTa, DeBERTa,
- XLM-RoBERTa, Longformer, ELMO, Universal Sentence Encoder, Facebook BART, Google T5, MarianMT, OpenAI GPT2,
+ XLM-RoBERTa, Longformer, ELMO, Universal Sentence Encoder, Facebook BART, Instructor Embeddings, E5 Embeddings, Google T5, MarianMT, OpenAI GPT2,
Google ViT, ASR Wav2Vec2 and many more not only to Python and R, but also to the JVM ecosystem (Java, Scala, and Kotlin) at scale by extending Apache Spark natively
@@ -369,6 +369,22 @@
Value Members
@native()