Release/522 release candidate #14117

Merged
merged 10 commits into from
Jan 1, 2024
17 changes: 16 additions & 1 deletion CHANGELOG
@@ -1,3 +1,19 @@
========
5.2.2
========
----------------
Enhancements
----------------
* Update `aws-java-sdk-bundle` dependency to a version without any CVEs

----------------
Bug Fixes
----------------
* Fix the missing `BGEEmbeddings` from annotator in Python
* Add a new BGE notebook to import models into Spark NLP
* Upload the new true `BGE` models to Spark NLP for text embeddings


========
5.2.1
========
@@ -14,7 +30,6 @@ New Features & Enhancements
* Add a new notebook to show how to import any model from `T5` family into Spark NLP with ONNX format
* Add a new notebook to show how to import any model from `MarianNMT` family into Spark NLP with ONNX format


----------------
Bug Fixes
----------------
88 changes: 44 additions & 44 deletions README.md
@@ -173,7 +173,7 @@ To use Spark NLP you need the following requirements:

**GPU (optional):**

Spark NLP 5.2.1 is built with ONNX 1.16.3 and TensorFlow 2.7.1 deep learning engines. The minimum following NVIDIA® software are only required for GPU support:
Spark NLP 5.2.2 is built with ONNX 1.16.3 and TensorFlow 2.7.1 deep learning engines. The minimum following NVIDIA® software are only required for GPU support:

- NVIDIA® GPU drivers version 450.80.02 or higher
- CUDA® Toolkit 11.2
@@ -189,7 +189,7 @@ $ java -version
$ conda create -n sparknlp python=3.7 -y
$ conda activate sparknlp
# spark-nlp by default is based on pyspark 3.x
$ pip install spark-nlp==5.2.1 pyspark==3.3.1
$ pip install spark-nlp==5.2.2 pyspark==3.3.1
```

In Python console or Jupyter `Python3` kernel:
@@ -234,7 +234,7 @@ For more examples, you can visit our dedicated [examples](https://github.com/Joh

## Apache Spark Support

Spark NLP *5.2.1* has been built on top of Apache Spark 3.4 while fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x
Spark NLP *5.2.2* has been built on top of Apache Spark 3.4 while fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x

| Spark NLP | Apache Spark 3.5.x | Apache Spark 3.4.x | Apache Spark 3.3.x | Apache Spark 3.2.x | Apache Spark 3.1.x | Apache Spark 3.0.x | Apache Spark 2.4.x | Apache Spark 2.3.x |
|-----------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|
@@ -276,7 +276,7 @@ Find out more about `Spark NLP` versions from our [release notes](https://github

## Databricks Support

Spark NLP 5.2.1 has been tested and is compatible with the following runtimes:
Spark NLP 5.2.2 has been tested and is compatible with the following runtimes:

**CPU:**

@@ -343,7 +343,7 @@ Spark NLP 5.2.1 has been tested and is compatible with the following runtimes:

## EMR Support

Spark NLP 5.2.1 has been tested and is compatible with the following EMR releases:
Spark NLP 5.2.2 has been tested and is compatible with the following EMR releases:

- emr-6.2.0
- emr-6.3.0
@@ -390,11 +390,11 @@ Spark NLP supports all major releases of Apache Spark 3.0.x, Apache Spark 3.1.x,
```sh
# CPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.1
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.2

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.2

spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.1
spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.2
```

The `spark-nlp` has been published to
@@ -403,11 +403,11 @@ the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/s
```sh
# GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.2.1
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.2.2

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.2.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.2.2

spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.2.1
spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.2.2

```

@@ -417,11 +417,11 @@ the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/s
```sh
# AArch64

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.2.1
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.2.2

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.2.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.2.2

spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.2.1
spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.2.2

```

@@ -431,11 +431,11 @@ the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/s
```sh
# M1/M2 (Apple Silicon)

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.2.1
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.2.2

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.2.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.2.2

spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.2.1
spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.2.2

```
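
The four `--packages` blocks above differ only in the artifact name and the version. A minimal Python sketch of that naming pattern follows; `maven_coordinate` and `launch_command` are hypothetical helpers written for illustration and are not part of Spark NLP. The group ID, artifact names, and Scala suffix are copied from the commands in this diff.

```python
# Hypothetical helpers (not part of Spark NLP): they only reproduce the
# coordinate pattern visible in the README commands above.
GROUP_ID = "com.johnsnowlabs.nlp"
SCALA_BINARY = "2.12"

# Artifact names as they appear in the README hunks.
VARIANT_ARTIFACTS = {
    "cpu": "spark-nlp",
    "gpu": "spark-nlp-gpu",
    "aarch64": "spark-nlp-aarch64",
    "silicon": "spark-nlp-silicon",
}


def maven_coordinate(variant, version):
    """Return group:artifact_scala:version for one of the four variants."""
    if variant not in VARIANT_ARTIFACTS:
        raise ValueError(
            "unknown variant %r; expected one of %s"
            % (variant, sorted(VARIANT_ARTIFACTS))
        )
    return "%s:%s_%s:%s" % (
        GROUP_ID, VARIANT_ARTIFACTS[variant], SCALA_BINARY, version
    )


def launch_command(tool, variant, version):
    """Build a launch line for spark-shell, pyspark, or spark-submit."""
    return "%s --packages %s" % (tool, maven_coordinate(variant, version))


if __name__ == "__main__":
    for variant in VARIANT_ARTIFACTS:
        print(launch_command("pyspark", variant, "5.2.2"))
```

Keeping the version in one argument means a release bump like this one changes a single value instead of twelve command lines.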

@@ -449,7 +449,7 @@ set in your SparkSession:
spark-shell \
--driver-memory 16g \
--conf spark.kryoserializer.buffer.max=2000M \
--packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.1
--packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.2
```

## Scala
@@ -467,7 +467,7 @@ coordinates:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp_2.12</artifactId>
<version>5.2.1</version>
<version>5.2.2</version>
</dependency>
```

@@ -478,7 +478,7 @@ coordinates:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu_2.12</artifactId>
<version>5.2.1</version>
<version>5.2.2</version>
</dependency>
```

@@ -489,7 +489,7 @@ coordinates:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-aarch64_2.12</artifactId>
<version>5.2.1</version>
<version>5.2.2</version>
</dependency>
```

@@ -500,7 +500,7 @@ coordinates:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-silicon_2.12</artifactId>
<version>5.2.1</version>
<version>5.2.2</version>
</dependency>
```

@@ -510,28 +510,28 @@ coordinates:

```sbtshell
// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp
libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "5.2.1"
libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "5.2.2"
```

**spark-nlp-gpu:**

```sbtshell
// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-gpu
libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-gpu" % "5.2.1"
libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-gpu" % "5.2.2"
```

**spark-nlp-aarch64:**

```sbtshell
// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-aarch64
libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-aarch64" % "5.2.1"
libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-aarch64" % "5.2.2"
```

**spark-nlp-silicon:**

```sbtshell
// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-silicon
libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-silicon" % "5.2.1"
libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-silicon" % "5.2.2"
```

Maven
@@ -553,7 +553,7 @@ If you installed pyspark through pip/conda, you can install `spark-nlp` through
Pip:

```bash
pip install spark-nlp==5.2.1
pip install spark-nlp==5.2.2
```

Conda:
@@ -582,7 +582,7 @@ spark = SparkSession.builder
.config("spark.driver.memory", "16G")
.config("spark.driver.maxResultSize", "0")
.config("spark.kryoserializer.buffer.max", "2000M")
.config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.1")
.config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.2")
.getOrCreate()
```

@@ -653,7 +653,7 @@ Use either one of the following options
- Add the following Maven Coordinates to the interpreter's library list

```bash
com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.1
com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.2
```

- Add a path to pre-built jar from [here](#compiled-jars) in the interpreter's library list making sure the jar is
@@ -664,7 +664,7 @@ com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.1
Apart from the previous step, install the python module through pip

```bash
pip install spark-nlp==5.2.1
pip install spark-nlp==5.2.2
```

Or you can install `spark-nlp` from inside Zeppelin by using Conda:
@@ -692,7 +692,7 @@ launch the Jupyter from the same Python environment:
$ conda create -n sparknlp python=3.8 -y
$ conda activate sparknlp
# spark-nlp by default is based on pyspark 3.x
$ pip install spark-nlp==5.2.1 pyspark==3.3.1 jupyter
$ pip install spark-nlp==5.2.2 pyspark==3.3.1 jupyter
$ jupyter notebook
```

@@ -709,7 +709,7 @@ export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.2
```

Alternatively, you can mix in using `--jars` option for pyspark + `pip install spark-nlp`
@@ -736,7 +736,7 @@ This script comes with the two options to define `pyspark` and `spark-nlp` versi
# -s is for spark-nlp
# -g will enable upgrading libcudnn8 to 8.1.0 on Google Colab for GPU usage
# by default they are set to the latest
!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.2.1
!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.2.2
```

[Spark NLP quick start on Google Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/quick_start_google_colab.ipynb)
@@ -759,7 +759,7 @@ This script comes with the two options to define `pyspark` and `spark-nlp` versi
# -s is for spark-nlp
# -g will enable upgrading libcudnn8 to 8.1.0 on Kaggle for GPU usage
# by default they are set to the latest
!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.2.1
!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.2.2
```

[Spark NLP quick start on Kaggle Kernel](https://www.kaggle.com/mozzie/spark-nlp-named-entity-recognition) is a live
@@ -778,9 +778,9 @@ demo on Kaggle Kernel that performs named entity recognitions by using Spark NLP

3. In `Libraries` tab inside your cluster you need to follow these steps:

3.1. Install New -> PyPI -> `spark-nlp==5.2.1` -> Install
3.1. Install New -> PyPI -> `spark-nlp==5.2.2` -> Install

3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.1` -> Install
3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.2` -> Install

4. Now you can attach your notebook to the cluster and use Spark NLP!

@@ -831,7 +831,7 @@ A sample of your software configuration in JSON on S3 (must be public access):
"spark.kryoserializer.buffer.max": "2000M",
"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.driver.maxResultSize": "0",
"spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.1"
"spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.2"
}
}]
```
@@ -840,7 +840,7 @@ A sample of AWS CLI to launch EMR cluster:

```.sh
aws emr create-cluster \
--name "Spark NLP 5.2.1" \
--name "Spark NLP 5.2.2" \
--release-label emr-6.2.0 \
--applications Name=Hadoop Name=Spark Name=Hive \
--instance-type m4.4xlarge \
@@ -904,7 +904,7 @@ gcloud dataproc clusters create ${CLUSTER_NAME} \
--enable-component-gateway \
--metadata 'PIP_PACKAGES=spark-nlp spark-nlp-display google-cloud-bigquery google-cloud-storage' \
--initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh \
--properties spark:spark.serializer=org.apache.spark.serializer.KryoSerializer,spark:spark.driver.maxResultSize=0,spark:spark.kryoserializer.buffer.max=2000M,spark:spark.jars.packages=com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.1
--properties spark:spark.serializer=org.apache.spark.serializer.KryoSerializer,spark:spark.driver.maxResultSize=0,spark:spark.kryoserializer.buffer.max=2000M,spark:spark.jars.packages=com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.2
```
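
The `--properties` value above is one long comma-separated string, which is easy to mistype during a version bump. As a rough sketch, it can be assembled from a mapping of Spark settings; `dataproc_properties` is a hypothetical helper name, and the `spark:` prefix and setting names are taken from the command in this hunk.

```python
from typing import Mapping


def dataproc_properties(spark_conf: Mapping[str, str]) -> str:
    """Join Spark settings into the 'spark:key=value,...' form shown above.

    Hypothetical helper for illustration. It rejects commas because this
    simple form uses the comma as the separator between entries.
    """
    parts = []
    for key, value in spark_conf.items():
        if "," in key or "," in value:
            raise ValueError("comma not supported in %r=%r" % (key, value))
        parts.append("spark:%s=%s" % (key, value))
    return ",".join(parts)


if __name__ == "__main__":
    version = "5.2.2"
    conf = {
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        "spark.driver.maxResultSize": "0",
        "spark.kryoserializer.buffer.max": "2000M",
        "spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:" + version,
    }
    print("--properties " + dataproc_properties(conf))
```

With the settings listed here, the output matches the updated `--properties` line in this diff.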

2. On an existing one, you need to install spark-nlp and spark-nlp-display packages from PyPI.
@@ -947,7 +947,7 @@ spark = SparkSession.builder
.config("spark.kryoserializer.buffer.max", "2000m")
.config("spark.jsl.settings.pretrained.cache_folder", "sample_data/pretrained")
.config("spark.jsl.settings.storage.cluster_tmp_dir", "sample_data/storage")
.config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.1")
.config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.2")
.getOrCreate()
```

@@ -961,7 +961,7 @@ spark-shell \
--conf spark.kryoserializer.buffer.max=2000M \
--conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \
--conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \
--packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.1
--packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.2
```

**pyspark:**
@@ -974,7 +974,7 @@ pyspark \
--conf spark.kryoserializer.buffer.max=2000M \
--conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \
--conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \
--packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.1
--packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.2
```

**Databricks:**
@@ -1246,7 +1246,7 @@ spark = SparkSession.builder
.config("spark.driver.memory", "16G")
.config("spark.driver.maxResultSize", "0")
.config("spark.kryoserializer.buffer.max", "2000M")
.config("spark.jars", "/tmp/spark-nlp-assembly-5.2.1.jar")
.config("spark.jars", "/tmp/spark-nlp-assembly-5.2.2.jar")
.getOrCreate()
```

@@ -1255,7 +1255,7 @@ spark = SparkSession.builder
version (3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x)
- If you are local, you can load the Fat JAR from your local FileSystem, however, if you are in a cluster setup you need
to put the Fat JAR on a distributed FileSystem such as HDFS, DBFS, S3, etc. (
i.e., `hdfs:///tmp/spark-nlp-assembly-5.2.1.jar`)
i.e., `hdfs:///tmp/spark-nlp-assembly-5.2.2.jar`)

Example of using pretrained Models and Pipelines in offline:

5 changes: 2 additions & 3 deletions build.sbt
@@ -6,7 +6,7 @@ name := getPackageName(is_silicon, is_gpu, is_aarch64)

organization := "com.johnsnowlabs.nlp"

version := "5.2.1"
version := "5.2.2"

(ThisBuild / scalaVersion) := scalaVer

@@ -153,8 +153,7 @@ lazy val utilDependencies = Seq(
gcpStorage,
greex,
azureIdentity,
azureStorage
)
azureStorage)

lazy val typedDependencyParserDependencies = Seq(junit)
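
Most of this change set is the same `5.2.1` to `5.2.2` substitution repeated across README.md and build.sbt. A small sketch of a check for leftover references to the old version follows; `find_stale_versions` is a hypothetical helper, not part of the repository's tooling, and it uses only the standard library.

```python
import re


def find_stale_versions(text, old_version):
    """Return (line_number, line) pairs that still mention old_version.

    The lookarounds keep '5.2.1' from matching inside longer version
    strings such as '5.2.10' or '15.2.1', while still matching file
    names like 'spark-nlp-assembly-5.2.1.jar'.
    """
    pattern = re.compile(
        r"(?<!\d)(?<!\d\.)" + re.escape(old_version) + r"(?!\.?\d)"
    )
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if pattern.search(line):
            hits.append((lineno, line.strip()))
    return hits


if __name__ == "__main__":
    sample = "\n".join([
        'version := "5.2.2"',
        "pip install spark-nlp==5.2.1",
        "spark-nlp-assembly-5.2.1.jar",
        "release 5.2.10 is unrelated",
    ])
    for lineno, line in find_stale_versions(sample, "5.2.1"):
        print("%d: %s" % (lineno, line))
```

A check like this fits README.md and build.sbt. The CHANGELOG is expected to keep its older release headings, so hits there are not errors.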
