release/511-release-candidate #13961

Merged: 13 commits, Sep 11, 2023

19 changes: 19 additions & 0 deletions CHANGELOG
@@ -1,3 +1,22 @@
========
5.1.1
========
----------------
New Features & Enhancements
----------------
* **NEW:** Introducing support for ONNX Runtime in MPNet embedding annotator
* **NEW:** Introducing support for ONNX Runtime in AlbertForTokenClassification annotator
* **NEW:** Introducing support for ONNX Runtime in AlbertForSequenceClassification annotator
* **NEW:** Introducing support for ONNX Runtime in AlbertForQuestionAnswering annotator
* Implement `getVectors` feature in Word2VecModel, Doc2VecModel, and WordEmbeddingsModel annotators. This new feature allows access to the entire tokens and their vectors in the loaded model.

----------------
Bug Fixes
----------------
* Fix how to save and load `Whisper` models
* Fix saving ONNX model on Windows operating system


========
5.1.0
========
88 changes: 44 additions & 44 deletions README.md
@@ -170,7 +170,7 @@ To use Spark NLP you need the following requirements:

**GPU (optional):**

Spark NLP 5.1.0 is built with ONNX 1.15.1 and TensorFlow 2.7.1 deep learning engines. The minimum following NVIDIA® software are only required for GPU support:
Spark NLP 5.1.1 is built with ONNX 1.15.1 and TensorFlow 2.7.1 deep learning engines. The minimum following NVIDIA® software are only required for GPU support:

- NVIDIA® GPU drivers version 450.80.02 or higher
- CUDA® Toolkit 11.2
@@ -186,7 +186,7 @@ $ java -version
$ conda create -n sparknlp python=3.7 -y
$ conda activate sparknlp
# spark-nlp by default is based on pyspark 3.x
$ pip install spark-nlp==5.1.0 pyspark==3.3.1
$ pip install spark-nlp==5.1.1 pyspark==3.3.1
```

In Python console or Jupyter `Python3` kernel:
@@ -231,7 +231,7 @@ For more examples, you can visit our dedicated [examples](https://github.com/Joh

## Apache Spark Support

Spark NLP *5.1.0* has been built on top of Apache Spark 3.4 while fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x
Spark NLP *5.1.1* has been built on top of Apache Spark 3.4 while fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x

| Spark NLP | Apache Spark 2.3.x | Apache Spark 2.4.x | Apache Spark 3.0.x | Apache Spark 3.1.x | Apache Spark 3.2.x | Apache Spark 3.3.x | Apache Spark 3.4.x |
|-----------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|
@@ -270,7 +270,7 @@ Find out more about `Spark NLP` versions from our [release notes](https://github

## Databricks Support

Spark NLP 5.1.0 has been tested and is compatible with the following runtimes:
Spark NLP 5.1.1 has been tested and is compatible with the following runtimes:

**CPU:**

@@ -331,7 +331,7 @@ Spark NLP 5.1.0 has been tested and is compatible with the following runtimes:

## EMR Support

Spark NLP 5.1.0 has been tested and is compatible with the following EMR releases:
Spark NLP 5.1.1 has been tested and is compatible with the following EMR releases:

- emr-6.2.0
- emr-6.3.0
@@ -376,11 +376,11 @@ Spark NLP supports all major releases of Apache Spark 3.0.x, Apache Spark 3.1.x,
```sh
# CPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.0
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.1

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.1

spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.0
spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.1
```

The `spark-nlp` has been published to
@@ -389,11 +389,11 @@ the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/s
```sh
# GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.1.0
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.1.1

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.1.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.1.1

spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.1.0
spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.1.1

```

@@ -403,11 +403,11 @@ the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/s
```sh
# AArch64

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.1.0
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.1.1

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.1.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.1.1

spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.1.0
spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.1.1

```

@@ -417,11 +417,11 @@ the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/s
```sh
# M1/M2 (Apple Silicon)

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.1.0
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.1.1

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.1.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.1.1

spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.1.0
spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.1.1

```

@@ -435,7 +435,7 @@ set in your SparkSession:
spark-shell \
--driver-memory 16g \
--conf spark.kryoserializer.buffer.max=2000M \
--packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.0
--packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.1
```
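
The `--packages` values in the hunks above all follow one pattern: the group ID, a platform-specific artifact name with the Scala suffix `_2.12`, and the release version. A minimal sketch of that pattern, assuming only the four artifact names that appear in this diff; `maven_coordinate` and its parameter names are hypothetical and not part of Spark NLP:

```python
# Hypothetical helper for illustration only; it is not part of Spark NLP.
GROUP_ID = "com.johnsnowlabs.nlp"

# Artifact names as they appear in the README hunks of this diff.
ARTIFACTS = {
    "cpu": "spark-nlp",
    "gpu": "spark-nlp-gpu",
    "aarch64": "spark-nlp-aarch64",
    "silicon": "spark-nlp-silicon",
}


def maven_coordinate(platform: str = "cpu", version: str = "5.1.1", scala: str = "2.12") -> str:
    """Build a 'group:artifact_scala:version' string for --packages."""
    if platform not in ARTIFACTS:
        raise ValueError(f"unknown platform: {platform!r}")
    return f"{GROUP_ID}:{ARTIFACTS[platform]}_{scala}:{version}"


if __name__ == "__main__":
    # Print one pyspark invocation per platform.
    for platform in ARTIFACTS:
        print(f"pyspark --packages {maven_coordinate(platform)}")
```

Keeping the version in one place like this is one way to avoid editing every coordinate by hand on the next release.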

## Scala
@@ -453,7 +453,7 @@ coordinates:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp_2.12</artifactId>
<version>5.1.0</version>
<version>5.1.1</version>
</dependency>
```

@@ -464,7 +464,7 @@ coordinates:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu_2.12</artifactId>
<version>5.1.0</version>
<version>5.1.1</version>
</dependency>
```

@@ -475,7 +475,7 @@ coordinates:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-aarch64_2.12</artifactId>
<version>5.1.0</version>
<version>5.1.1</version>
</dependency>
```

@@ -486,7 +486,7 @@ coordinates:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-silicon_2.12</artifactId>
<version>5.1.0</version>
<version>5.1.1</version>
</dependency>
```

@@ -496,28 +496,28 @@ coordinates:

```sbtshell
// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp
libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "5.1.0"
libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "5.1.1"
```

**spark-nlp-gpu:**

```sbtshell
// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-gpu
libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-gpu" % "5.1.0"
libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-gpu" % "5.1.1"
```

**spark-nlp-aarch64:**

```sbtshell
// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-aarch64
libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-aarch64" % "5.1.0"
libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-aarch64" % "5.1.1"
```

**spark-nlp-silicon:**

```sbtshell
// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-silicon
libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-silicon" % "5.1.0"
libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-silicon" % "5.1.1"
```

Maven
@@ -539,7 +539,7 @@ If you installed pyspark through pip/conda, you can install `spark-nlp` through
Pip:

```bash
pip install spark-nlp==5.1.0
pip install spark-nlp==5.1.1
```

Conda:
@@ -568,7 +568,7 @@ spark = SparkSession.builder
.config("spark.driver.memory", "16G")
.config("spark.driver.maxResultSize", "0")
.config("spark.kryoserializer.buffer.max", "2000M")
.config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.0")
.config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.1")
.getOrCreate()
```

@@ -639,7 +639,7 @@ Use either one of the following options
- Add the following Maven Coordinates to the interpreter's library list

```bash
com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.0
com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.1
```

- Add a path to pre-built jar from [here](#compiled-jars) in the interpreter's library list making sure the jar is
@@ -650,7 +650,7 @@ com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.0
Apart from the previous step, install the python module through pip

```bash
pip install spark-nlp==5.1.0
pip install spark-nlp==5.1.1
```

Or you can install `spark-nlp` from inside Zeppelin by using Conda:
@@ -678,7 +678,7 @@ launch the Jupyter from the same Python environment:
$ conda create -n sparknlp python=3.8 -y
$ conda activate sparknlp
# spark-nlp by default is based on pyspark 3.x
$ pip install spark-nlp==5.1.0 pyspark==3.3.1 jupyter
$ pip install spark-nlp==5.1.1 pyspark==3.3.1 jupyter
$ jupyter notebook
```

@@ -695,7 +695,7 @@ export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.1
```

Alternatively, you can mix in using `--jars` option for pyspark + `pip install spark-nlp`
@@ -722,7 +722,7 @@ This script comes with the two options to define `pyspark` and `spark-nlp` versi
# -s is for spark-nlp
# -g will enable upgrading libcudnn8 to 8.1.0 on Google Colab for GPU usage
# by default they are set to the latest
!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.1.0
!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.1.1
```

[Spark NLP quick start on Google Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/quick_start_google_colab.ipynb)
@@ -745,7 +745,7 @@ This script comes with the two options to define `pyspark` and `spark-nlp` versi
# -s is for spark-nlp
# -g will enable upgrading libcudnn8 to 8.1.0 on Kaggle for GPU usage
# by default they are set to the latest
!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.1.0
!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.1.1
```

[Spark NLP quick start on Kaggle Kernel](https://www.kaggle.com/mozzie/spark-nlp-named-entity-recognition) is a live
@@ -764,9 +764,9 @@ demo on Kaggle Kernel that performs named entity recognitions by using Spark NLP

3. In `Libraries` tab inside your cluster you need to follow these steps:

3.1. Install New -> PyPI -> `spark-nlp==5.1.0` -> Install
3.1. Install New -> PyPI -> `spark-nlp==5.1.1` -> Install

3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.0` -> Install
3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.1` -> Install

4. Now you can attach your notebook to the cluster and use Spark NLP!

@@ -817,7 +817,7 @@ A sample of your software configuration in JSON on S3 (must be public access):
"spark.kryoserializer.buffer.max": "2000M",
"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.driver.maxResultSize": "0",
"spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.0"
"spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.1"
}
}]
```
@@ -826,7 +826,7 @@ A sample of AWS CLI to launch EMR cluster:

```.sh
aws emr create-cluster \
--name "Spark NLP 5.1.0" \
--name "Spark NLP 5.1.1" \
--release-label emr-6.2.0 \
--applications Name=Hadoop Name=Spark Name=Hive \
--instance-type m4.4xlarge \
@@ -890,7 +890,7 @@ gcloud dataproc clusters create ${CLUSTER_NAME} \
--enable-component-gateway \
--metadata 'PIP_PACKAGES=spark-nlp spark-nlp-display google-cloud-bigquery google-cloud-storage' \
--initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh \
--properties spark:spark.serializer=org.apache.spark.serializer.KryoSerializer,spark:spark.driver.maxResultSize=0,spark:spark.kryoserializer.buffer.max=2000M,spark:spark.jars.packages=com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.0
--properties spark:spark.serializer=org.apache.spark.serializer.KryoSerializer,spark:spark.driver.maxResultSize=0,spark:spark.kryoserializer.buffer.max=2000M,spark:spark.jars.packages=com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.1
```

2. On an existing one, you need to install spark-nlp and spark-nlp-display packages from PyPI.
@@ -929,7 +929,7 @@ spark = SparkSession.builder
.config("spark.kryoserializer.buffer.max", "2000m")
.config("spark.jsl.settings.pretrained.cache_folder", "sample_data/pretrained")
.config("spark.jsl.settings.storage.cluster_tmp_dir", "sample_data/storage")
.config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.0")
.config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.1")
.getOrCreate()
```

@@ -943,7 +943,7 @@ spark-shell \
--conf spark.kryoserializer.buffer.max=2000M \
--conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \
--conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \
--packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.0
--packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.1
```

**pyspark:**
@@ -956,7 +956,7 @@ pyspark \
--conf spark.kryoserializer.buffer.max=2000M \
--conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \
--conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \
--packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.0
--packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.1
```

**Databricks:**
@@ -1228,7 +1228,7 @@ spark = SparkSession.builder
.config("spark.driver.memory", "16G")
.config("spark.driver.maxResultSize", "0")
.config("spark.kryoserializer.buffer.max", "2000M")
.config("spark.jars", "/tmp/spark-nlp-assembly-5.1.0.jar")
.config("spark.jars", "/tmp/spark-nlp-assembly-5.1.1.jar")
.getOrCreate()
```

@@ -1237,7 +1237,7 @@ spark = SparkSession.builder
version (3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x)
- If you are local, you can load the Fat JAR from your local FileSystem, however, if you are in a cluster setup you need
to put the Fat JAR on a distributed FileSystem such as HDFS, DBFS, S3, etc. (
i.e., `hdfs:///tmp/spark-nlp-assembly-5.1.0.jar`)
i.e., `hdfs:///tmp/spark-nlp-assembly-5.1.1.jar`)

Example of using pretrained Models and Pipelines in offline:

2 changes: 1 addition & 1 deletion build.sbt
@@ -6,7 +6,7 @@ name := getPackageName(is_silicon, is_gpu, is_aarch64)

organization := "com.johnsnowlabs.nlp"

version := "5.1.0"
version := "5.1.1"

(ThisBuild / scalaVersion) := scalaVer
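
Nearly every hunk in this PR replaces `5.1.0` with `5.1.1`, so a leftover old version string is the most likely mistake in a release bump of this kind. A minimal sketch of a check for that, assuming plain-text input; `find_stale_versions` is a hypothetical helper, not part of the repository's tooling:

```python
import re
from typing import List, Tuple


def find_stale_versions(text: str, old: str = "5.1.0") -> List[Tuple[int, str]]:
    """Return (line_number, stripped_line) for each line that still mentions `old`.

    The lookbehind rejects a preceding digit or dot, so "15.1.0" is not a hit.
    The lookahead rejects only a following digit, so "5.1.0.jar" is a hit.
    """
    pattern = re.compile(rf"(?<![\d.]){re.escape(old)}(?!\d)")
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if pattern.search(line)
    ]


if __name__ == "__main__":
    sample = "\n".join([
        'version := "5.1.1"',
        "pip install spark-nlp==5.1.0",
        '.config("spark.jars", "/tmp/spark-nlp-assembly-5.1.0.jar")',
    ])
    for number, line in find_stale_versions(sample):
        print(f"line {number}: {line}")
```

Run against the post-merge README, `build.sbt`, and `conda/meta.yaml`, such a check would be expected to report nothing apart from intentional historical mentions such as the `5.1.0` heading in the CHANGELOG.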

4 changes: 2 additions & 2 deletions conda/meta.yaml
@@ -1,13 +1,13 @@
{% set name = "spark-nlp" %}
{% set version = "5.1.0" %}
{% set version = "5.1.1" %}

package:
name: {{ name|lower }}
version: {{ version }}

source:
url: https://pypi.io/packages/source/{{ name[0] }}/{{ name }}/spark-nlp-{{ version }}.tar.gz
sha256: fed71757358c40d57d5d12bd9511e5e0112925cd00ddbf788ceb935db7aaac20
sha256: 7dc90ff99334614018e3846f5057d90858b6ab5708f065016570d9a03ee935e7

build:
noarch: python
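
The recipe pins the source tarball by URL and SHA-256, and both change with each release. A minimal sketch of how the two fields relate, assuming the tarball bytes have already been downloaded by some other means; `pypi_sdist_url`, `sha256_hex`, and `matches_recipe` are hypothetical names, and the URL function only mirrors the template string shown in the hunk above:

```python
import hashlib


def pypi_sdist_url(name: str, version: str) -> str:
    """Render the recipe's source URL template for a given name and version."""
    return f"https://pypi.io/packages/source/{name[0]}/{name}/spark-nlp-{version}.tar.gz"


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest, the format used by the recipe's sha256 field."""
    return hashlib.sha256(data).hexdigest()


def matches_recipe(data: bytes, expected: str) -> bool:
    """Compare downloaded bytes against the digest pinned in meta.yaml."""
    return sha256_hex(data) == expected.strip().lower()


if __name__ == "__main__":
    print(pypi_sdist_url("spark-nlp", "5.1.1"))
    # With real tarball bytes, pass the sha256 value from meta.yaml as `expected`.
    print(matches_recipe(b"abc", sha256_hex(b"abc")))
```

A mismatch after a version bump usually means the `sha256` line was not regenerated for the new tarball.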