
Commit

Merge pull request #13961 from JohnSnowLabs/release/511-release-candidate

* [SPARKNLP-906] Fix reading suffix (#13945)

* Sparknlp 888 Add ONNX support to MPNet embeddings (#13955)

* adding ONNX support to MPNet

* remove name in test

* updating default name for mpnet models in scala and python

* updating default model name

* Adding ONNX Support to ALBERT Token and Sequence Classification and Question Answering annotators (#13956)

* SPARKNLP-891 Adding ONNX support for AlbertQuestionAnswering
SPARKNLP-892 Adding ONNX support for AlbertSequenceClassification
SPARKNLP-893 Adding ONNX support for AlbertTokenClassification

* SPARKNLP-884 Enabling getVectors method to get word vectors as spark dataframe (#13957)

* [SPARKNLP-890] ONNX E5 MPnet example (#13958)

* Bump version to 5.1.1

* [SPARKNLP-891] [SPARKNLP-892] [SPARKNLP-893] Adding docs for ONNX support in AlbertXXX

* Fix misspelling [skip test]

* Fixing onnx saving path bug (#13959)

* fixing onnx write issue on windows

* fixing indentation

* fixing formatting

* fixing formatting

* final formatting fix

* Fix onnx saving bug

---------

Co-authored-by: Devin Ha <t.ha@tu-berlin.de>
Co-authored-by: Maziyar Panahi <maziyar.panahi@iscpif.fr>

---------

Co-authored-by: Devin Ha <33089471+DevinTDHa@users.noreply.github.com>
Co-authored-by: ahmedlone127 <ahmedlone127@gmail.com>
Co-authored-by: Danilo Burbano <37355249+danilojsl@users.noreply.github.com>
Co-authored-by: Danilo Burbano <danilo@johnsnowlabs.com>
Co-authored-by: Devin Ha <t.ha@tu-berlin.de>
6 people authored Sep 11, 2023
2 parents eec96df + 23829c6 commit e94899c
Showing 1,461 changed files with 16,679 additions and 4,974 deletions.
19 changes: 19 additions & 0 deletions CHANGELOG
@@ -1,3 +1,22 @@
========
5.1.1
========
----------------
New Features & Enhancements
----------------
* **NEW:** Introducing support for ONNX Runtime in the MPNet embedding annotator (see the usage sketch below)
* **NEW:** Introducing support for ONNX Runtime in AlbertForTokenClassification annotator
* **NEW:** Introducing support for ONNX Runtime in AlbertForSequenceClassification annotator
* **NEW:** Introducing support for ONNX Runtime in AlbertForQuestionAnswering annotator
* Implement `getVectors` feature in Word2VecModel, Doc2VecModel, and WordEmbeddingsModel annotators. This new feature exposes all tokens and their vectors in the loaded model as a Spark DataFrame (see the sketch below).
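
A minimal sketch of the new `getVectors` API, assuming a Spark session started via `sparknlp.start()` (the exact output schema may differ):

```python
import sparknlp
from sparknlp.annotator import WordEmbeddingsModel

spark = sparknlp.start()

# Load a pretrained word-embeddings model (downloads on first use)
model = WordEmbeddingsModel.pretrained("glove_100d", "en")

# New in 5.1.1: the loaded vocabulary is exposed as a Spark DataFrame
vectors_df = model.getVectors()
vectors_df.show(5, truncate=False)
```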

----------------
Bug Fixes
----------------
* Fix saving and loading `Whisper` models (see the round-trip sketch below)
* Fix saving ONNX models on the Windows operating system
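
Both fixes concern the standard Spark ML persistence path; below is a minimal round-trip sketch, using `WhisperForCTC` as one affected annotator (the save path is hypothetical, and the default pretrained model is downloaded on first use):

```python
from sparknlp.annotator import WhisperForCTC

# Fetch the default pretrained Whisper model (downloads on first use)
whisper = WhisperForCTC.pretrained()

# Persist locally; saving ONNX-backed models like this previously
# failed on Windows
whisper.write().overwrite().save("/tmp/whisper_model")  # hypothetical path

# Reload later without re-downloading
restored = WhisperForCTC.load("/tmp/whisper_model")
```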


========
5.1.0
========
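To illustrate the ONNX-backed annotators listed above, here is a hedged sketch of MPNet embeddings in a small pipeline; `all_mpnet_base_v2` is the documented default model name, and the engine (ONNX vs. TensorFlow) follows from the loaded model itself. The new Albert ONNX classifiers load analogously via their own `pretrained()` methods.

```python
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import MPNetEmbeddings
from pyspark.ml import Pipeline

spark = sparknlp.start()

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# MPNet sentence embeddings; ONNX Runtime is used for ONNX-exported models
embeddings = MPNetEmbeddings.pretrained("all_mpnet_base_v2", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("embeddings")

pipeline = Pipeline(stages=[document_assembler, embeddings])
data = spark.createDataFrame([["ONNX-backed MPNet embeddings in Spark NLP."]]).toDF("text")

result = pipeline.fit(data).transform(data)
result.select("embeddings.embeddings").show(1, truncate=80)
```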
88 changes: 44 additions & 44 deletions README.md
@@ -170,7 +170,7 @@ To use Spark NLP you need the following requirements:

**GPU (optional):**

- Spark NLP 5.1.0 is built with ONNX 1.15.1 and TensorFlow 2.7.1 deep learning engines. The following NVIDIA® software is required only for GPU support:
+ Spark NLP 5.1.1 is built with ONNX 1.15.1 and TensorFlow 2.7.1 deep learning engines. The following NVIDIA® software is required only for GPU support:

- NVIDIA® GPU drivers version 450.80.02 or higher
- CUDA® Toolkit 11.2
@@ -186,7 +186,7 @@ $ java -version
$ conda create -n sparknlp python=3.7 -y
$ conda activate sparknlp
# spark-nlp by default is based on pyspark 3.x
- $ pip install spark-nlp==5.1.0 pyspark==3.3.1
+ $ pip install spark-nlp==5.1.1 pyspark==3.3.1
```

In Python console or Jupyter `Python3` kernel:
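
The snippet that follows at this point in the README is collapsed in the diff view; a minimal equivalent, assuming the pip install above succeeded, is:

```python
import sparknlp

# Start a Spark session preconfigured for Spark NLP
spark = sparknlp.start()

print("Spark NLP version:", sparknlp.version())
print("Apache Spark version:", spark.version)
```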
@@ -231,7 +231,7 @@ For more examples, you can visit our dedicated [examples](https://github.com/Joh

## Apache Spark Support

- Spark NLP *5.1.0* has been built on top of Apache Spark 3.4 while fully supporting Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x
+ Spark NLP *5.1.1* has been built on top of Apache Spark 3.4 while fully supporting Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x

| Spark NLP | Apache Spark 2.3.x | Apache Spark 2.4.x | Apache Spark 3.0.x | Apache Spark 3.1.x | Apache Spark 3.2.x | Apache Spark 3.3.x | Apache Spark 3.4.x |
|-----------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|
@@ -270,7 +270,7 @@ Find out more about `Spark NLP` versions from our [release notes](https://github

## Databricks Support

- Spark NLP 5.1.0 has been tested and is compatible with the following runtimes:
+ Spark NLP 5.1.1 has been tested and is compatible with the following runtimes:

**CPU:**

@@ -331,7 +331,7 @@

## EMR Support

- Spark NLP 5.1.0 has been tested and is compatible with the following EMR releases:
+ Spark NLP 5.1.1 has been tested and is compatible with the following EMR releases:

- emr-6.2.0
- emr-6.3.0
@@ -376,11 +376,11 @@ Spark NLP supports all major releases of Apache Spark 3.0.x, Apache Spark 3.1.x,
```sh
# CPU

- spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.0
+ spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.1

- pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.0
+ pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.1

- spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.0
+ spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.1
```

The `spark-nlp` has been published to the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp).

@@ -389,11 +389,11 @@
```sh
# GPU

- spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.1.0
+ spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.1.1

- pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.1.0
+ pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.1.1

- spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.1.0
+ spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.1.1

```

@@ -403,11 +403,11 @@ the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/s
```sh
# AArch64

- spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.1.0
+ spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.1.1

- pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.1.0
+ pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.1.1

- spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.1.0
+ spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.1.1

```

@@ -417,11 +417,11 @@ the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/s
```sh
# M1/M2 (Apple Silicon)

- spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.1.0
+ spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.1.1

- pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.1.0
+ pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.1.1

- spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.1.0
+ spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.1.1

```

@@ -435,7 +435,7 @@ set in your SparkSession:
spark-shell \
--driver-memory 16g \
--conf spark.kryoserializer.buffer.max=2000M \
- --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.0
+ --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.1
```

## Scala
@@ -453,7 +453,7 @@ coordinates:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp_2.12</artifactId>
- <version>5.1.0</version>
+ <version>5.1.1</version>
</dependency>
```

@@ -464,7 +464,7 @@ coordinates:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu_2.12</artifactId>
- <version>5.1.0</version>
+ <version>5.1.1</version>
</dependency>
```

@@ -475,7 +475,7 @@ coordinates:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-aarch64_2.12</artifactId>
- <version>5.1.0</version>
+ <version>5.1.1</version>
</dependency>
```

@@ -486,7 +486,7 @@ coordinates:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-silicon_2.12</artifactId>
- <version>5.1.0</version>
+ <version>5.1.1</version>
</dependency>
```

Expand All @@ -496,28 +496,28 @@ coordinates:

```sbtshell
// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp
libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "5.1.0"
libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "5.1.1"
```

**spark-nlp-gpu:**

```sbtshell
// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-gpu
libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-gpu" % "5.1.0"
libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-gpu" % "5.1.1"
```

**spark-nlp-aarch64:**

```sbtshell
// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-aarch64
libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-aarch64" % "5.1.0"
libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-aarch64" % "5.1.1"
```

**spark-nlp-silicon:**

```sbtshell
// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-silicon
libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-silicon" % "5.1.0"
libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-silicon" % "5.1.1"
```

Maven
@@ -539,7 +539,7 @@ If you installed pyspark through pip/conda, you can install `spark-nlp` through
Pip:

```bash
- pip install spark-nlp==5.1.0
+ pip install spark-nlp==5.1.1
```

Conda:
@@ -568,7 +568,7 @@ spark = SparkSession.builder
.config("spark.driver.memory", "16G")
.config("spark.driver.maxResultSize", "0")
.config("spark.kryoserializer.buffer.max", "2000M")
.config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.0")
.config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.1")
.getOrCreate()
```

@@ -639,7 +639,7 @@ Use either one of the following options
- Add the following Maven Coordinates to the interpreter's library list

```bash
- com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.0
+ com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.1
```

- Add a path to pre-built jar from [here](#compiled-jars) in the interpreter's library list making sure the jar is
@@ -650,7 +650,7 @@ com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.0
Apart from the previous step, install the python module through pip

```bash
- pip install spark-nlp==5.1.0
+ pip install spark-nlp==5.1.1
```

Or you can install `spark-nlp` from inside Zeppelin by using Conda:
@@ -678,7 +678,7 @@ launch the Jupyter from the same Python environment:
$ conda create -n sparknlp python=3.8 -y
$ conda activate sparknlp
# spark-nlp by default is based on pyspark 3.x
- $ pip install spark-nlp==5.1.0 pyspark==3.3.1 jupyter
+ $ pip install spark-nlp==5.1.1 pyspark==3.3.1 jupyter
$ jupyter notebook
```

@@ -695,7 +695,7 @@ export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook

- pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.0
+ pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.1
```

Alternatively, you can combine the `--jars` option for pyspark with `pip install spark-nlp`
@@ -722,7 +722,7 @@ This script comes with the two options to define `pyspark` and `spark-nlp` versi
# -s is for spark-nlp
# -g will enable upgrading libcudnn8 to 8.1.0 on Google Colab for GPU usage
# by default they are set to the latest
- !wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.1.0
+ !wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.1.1
```

[Spark NLP quick start on Google Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/quick_start_google_colab.ipynb)
@@ -745,7 +745,7 @@ This script comes with the two options to define `pyspark` and `spark-nlp` versi
# -s is for spark-nlp
# -g will enable upgrading libcudnn8 to 8.1.0 on Kaggle for GPU usage
# by default they are set to the latest
- !wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.1.0
+ !wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.1.1
```

[Spark NLP quick start on Kaggle Kernel](https://www.kaggle.com/mozzie/spark-nlp-named-entity-recognition) is a live demo on Kaggle Kernel that performs named entity recognition by using Spark NLP.

@@ -764,9 +764,9 @@
3. In `Libraries` tab inside your cluster you need to follow these steps:

- 3.1. Install New -> PyPI -> `spark-nlp==5.1.0` -> Install
+ 3.1. Install New -> PyPI -> `spark-nlp==5.1.1` -> Install

- 3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.0` -> Install
+ 3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.1` -> Install

4. Now you can attach your notebook to the cluster and use Spark NLP!

@@ -817,7 +817,7 @@ A sample of your software configuration in JSON on S3 (must be public access):
"spark.kryoserializer.buffer.max": "2000M",
"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.driver.maxResultSize": "0",
"spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.0"
"spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.1"
}
}]
```
@@ -826,7 +826,7 @@ A sample of AWS CLI to launch EMR cluster:
```.sh
aws emr create-cluster \
--name "Spark NLP 5.1.0" \
--name "Spark NLP 5.1.1" \
--release-label emr-6.2.0 \
--applications Name=Hadoop Name=Spark Name=Hive \
--instance-type m4.4xlarge \
@@ -890,7 +890,7 @@ gcloud dataproc clusters create ${CLUSTER_NAME} \
--enable-component-gateway \
--metadata 'PIP_PACKAGES=spark-nlp spark-nlp-display google-cloud-bigquery google-cloud-storage' \
--initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh \
- --properties spark:spark.serializer=org.apache.spark.serializer.KryoSerializer,spark:spark.driver.maxResultSize=0,spark:spark.kryoserializer.buffer.max=2000M,spark:spark.jars.packages=com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.0
+ --properties spark:spark.serializer=org.apache.spark.serializer.KryoSerializer,spark:spark.driver.maxResultSize=0,spark:spark.kryoserializer.buffer.max=2000M,spark:spark.jars.packages=com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.1
```
2. On an existing one, you need to install spark-nlp and spark-nlp-display packages from PyPI.
@@ -929,7 +929,7 @@ spark = SparkSession.builder
.config("spark.kryoserializer.buffer.max", "2000m")
.config("spark.jsl.settings.pretrained.cache_folder", "sample_data/pretrained")
.config("spark.jsl.settings.storage.cluster_tmp_dir", "sample_data/storage")
.config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.0")
.config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.1")
.getOrCreate()
```
@@ -943,7 +943,7 @@ spark-shell \
--conf spark.kryoserializer.buffer.max=2000M \
--conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \
--conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \
- --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.0
+ --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.1
```
**pyspark:**
@@ -956,7 +956,7 @@ pyspark \
--conf spark.kryoserializer.buffer.max=2000M \
--conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \
--conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \
- --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.0
+ --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.1
```
**Databricks:**
@@ -1228,7 +1228,7 @@ spark = SparkSession.builder
.config("spark.driver.memory", "16G")
.config("spark.driver.maxResultSize", "0")
.config("spark.kryoserializer.buffer.max", "2000M")
.config("spark.jars", "/tmp/spark-nlp-assembly-5.1.0.jar")
.config("spark.jars", "/tmp/spark-nlp-assembly-5.1.1.jar")
.getOrCreate()
```
@@ -1237,7 +1237,7 @@
version (3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x)
- If you are local, you can load the Fat JAR from your local FileSystem; however, in a cluster setup you need to
put the Fat JAR on a distributed FileSystem such as HDFS, DBFS, S3, etc.
- i.e., `hdfs:///tmp/spark-nlp-assembly-5.1.0.jar`
+ i.e., `hdfs:///tmp/spark-nlp-assembly-5.1.1.jar`
Example of using pretrained Models and Pipelines offline:
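The offline example itself is collapsed in this view; a hedged sketch of the pattern, assuming a model archive was downloaded and extracted beforehand (the folder name is hypothetical):

```python
from sparknlp.annotator import WordEmbeddingsModel

# load() reads the extracted model folder directly; no internet access
# is needed at load time
embeddings = WordEmbeddingsModel.load("/tmp/glove_100d_en_extracted") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")
```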
2 changes: 1 addition & 1 deletion build.sbt
@@ -6,7 +6,7 @@ name := getPackageName(is_silicon, is_gpu, is_aarch64)

organization := "com.johnsnowlabs.nlp"

version := "5.1.0"
version := "5.1.1"

(ThisBuild / scalaVersion) := scalaVer

4 changes: 2 additions & 2 deletions conda/meta.yaml
@@ -1,13 +1,13 @@
{% set name = "spark-nlp" %}
{% set version = "5.1.0" %}
{% set version = "5.1.1" %}

package:
name: {{ name|lower }}
version: {{ version }}

source:
url: https://pypi.io/packages/source/{{ name[0] }}/{{ name }}/spark-nlp-{{ version }}.tar.gz
- sha256: fed71757358c40d57d5d12bd9511e5e0112925cd00ddbf788ceb935db7aaac20
+ sha256: 7dc90ff99334614018e3846f5057d90858b6ab5708f065016570d9a03ee935e7

build:
noarch: python
