540 Release Candidate #14247

Merged · 37 commits · Jul 1, 2024
Changes from 29 commits

Commits (37):
deb0a45
closed the connection (#14233)
mehmetbutgul May 14, 2024
b3d491b
Fix missing java distribution for setup-java step
DevinTDHa May 14, 2024
d8a42c0
Lock macos version for runner
DevinTDHa May 14, 2024
5967577
Add missing sbt setup
DevinTDHa May 14, 2024
fcd4e9c
Add openvino dependency (#14255)
DevinTDHa May 21, 2024
4419a70
[SPARKNLP-1037] Adding addFile changes to replace broadcast in all…
danilojsl May 21, 2024
fabc4ab
Integrating OpenVINO Runtime in Spark NLP (#14200)
rajatkrishna May 21, 2024
9430402
Fixing colab notebook bugs (#14249)
ahmedlone127 May 24, 2024
3f59375
adding model hub cards + updating readme + small typo fix on M2M100Te…
ahmedlone127 May 24, 2024
d083420
adding padded tokens (#14276)
ahmedlone127 May 24, 2024
e0e28e8
Sparknlp 1035 test all notebooks to import TensorFlow models to spar…
ahmedlone127 May 24, 2024
262b802
Adding caching to streamlit demos (#14232)
AbdullahMubeenAnwar May 24, 2024
74f5151
Disable OpenVINO FastTest
DevinTDHa May 24, 2024
c2048be
Add openvino GPU dependency (#14309)
DevinTDHa Jun 3, 2024
adc193e
Fix incorrect LLAMA2 position ID (#14308)
rajatkrishna Jun 3, 2024
7274281
bump version to 5.4.0-rc1 [skip test]
maziyarpanahi Jun 5, 2024
9c075f8
Sparknlp 1016 implement mp net for token classification (#14322)
ahmedlone127 Jun 10, 2024
0ea5898
Uploading OpenVINO example notebooks (#14313)
rajatkrishna Jun 10, 2024
4583ccf
SparkNLP - 995 Introducing MistralAI LLMs (#14318)
prabod Jun 10, 2024
cdb031a
SparkNLP 1043 integrate new causal lm annotators to use open vino (#1…
prabod Jun 10, 2024
3054d4c
Fixed LLAMA generation bug (#14320)
prabod Jun 10, 2024
b4000d3
Fix compilation error
maziyarpanahi Jun 10, 2024
85c90dd
Bump to 5.4.0-rc2
maziyarpanahi Jun 10, 2024
1cba7e3
Add Pooling Average to Broken XXXForSentenceEmbedding annotators (#1…
ahmedlone127 Jun 12, 2024
903e780
Fix compilation error and formatting
maziyarpanahi Jun 12, 2024
54027a4
revert changes to BERT backend
maziyarpanahi Jun 12, 2024
4356794
Fix models link on FAQ (#14333)
dcecchini Jun 21, 2024
5a86b70
adding onnx support and average pooling (#14330)
ahmedlone127 Jun 21, 2024
ac9de09
uploading UAEEmbeddings notebook (#14324)
AbdullahMubeenAnwar Jun 21, 2024
e88682c
Bump version to 5.4.0 [skip test]
maziyarpanahi Jun 26, 2024
09dc500
Refactor OpenAIEmbeddings (#14334)
mehmetbutgul Jun 28, 2024
9d235e0
Update CHANGELOG [run doc]
maziyarpanahi Jun 28, 2024
036fc50
Update Scala and Python APIs
actions-user Jun 28, 2024
86e6725
Update ORT and Azure deps
maziyarpanahi Jun 28, 2024
a5b88ad
add the missing OpenVINO coordinates
maziyarpanahi Jun 29, 2024
1502757
set ORT to 1.18.0
maziyarpanahi Jun 29, 2024
595b8f4
Update jsl-openvino to GA
maziyarpanahi Jun 29, 2024
10 changes: 5 additions & 5 deletions .github/workflows/build_and_test.yml
@@ -33,17 +33,17 @@ on:
 jobs:
   spark34:
     if: "! contains(toJSON(github.event.commits.*.message), '[skip test]')"
-    runs-on: macos-latest
+    runs-on: macos-13
     env:
       TF_CPP_MIN_LOG_LEVEL: 3
       JAVA_OPTS: "-Xmx4096m -XX:+UseG1GC"
     name: Build and Test on Apache Spark 3.4.x
 
     steps:
       - uses: actions/checkout@v3
-      - uses: actions/setup-java@v3
+      - uses: actions/setup-java@v4
         with:
-          distribution: 'adopt'
+          distribution: 'temurin'
           java-version: '8'
           cache: 'sbt'
       - name: Install Python 3.7
@@ -73,7 +73,7 @@ jobs:
         python3.7 -m pytest -v -m fast
   spark35:
     if: "! contains(toJSON(github.event.commits.*.message), '[skip test]')"
-    runs-on: macos-latest
+    runs-on: macos-13
     env:
       TF_CPP_MIN_LOG_LEVEL: 3
       JAVA_OPTS: "-Xmx4096m -XX:+UseG1GC"
@@ -109,7 +109,7 @@ jobs:
 
   spark33:
     if: "! contains(toJSON(github.event.commits.*.message), '[skip test]')"
-    runs-on: macos-latest
+    runs-on: macos-13
     env:
       TF_CPP_MIN_LOG_LEVEL: 3
       JAVA_OPTS: "-Xmx4096m -XX:+UseG1GC"
88 changes: 44 additions & 44 deletions README.md
@@ -166,7 +166,7 @@ To use Spark NLP you need the following requirements:

**GPU (optional):**

-Spark NLP 5.3.3 is built with ONNX 1.17.0 and TensorFlow 2.7.1 deep learning engines. The minimum following NVIDIA® software are only required for GPU support:
+Spark NLP 5.4.0-rc2 is built with ONNX 1.17.0 and TensorFlow 2.7.1 deep learning engines. The following minimum NVIDIA® software is required only for GPU support:

- NVIDIA® GPU drivers version 450.80.02 or higher
- CUDA® Toolkit 11.2
@@ -182,7 +182,7 @@ $ java -version
$ conda create -n sparknlp python=3.7 -y
$ conda activate sparknlp
# spark-nlp by default is based on pyspark 3.x
-$ pip install spark-nlp==5.3.3 pyspark==3.3.1
+$ pip install spark-nlp==5.4.0-rc2 pyspark==3.3.1
```

In Python console or Jupyter `Python3` kernel:
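The quick-start snippet that belongs here is collapsed by the diff view; a minimal sketch of the documented pattern (the `sparknlp.start()` helper and the `explain_document_dl` pipeline name come from Spark NLP's public API rather than from this diff):

```python
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

# Start a Spark session preconfigured for Spark NLP.
spark = sparknlp.start()

# Download a pretrained pipeline and annotate a single string.
pipeline = PretrainedPipeline("explain_document_dl", lang="en")
result = pipeline.annotate("Spark NLP ships a 5.4.0 release candidate.")
print(result["entities"])
```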
@@ -227,7 +227,7 @@ For more examples, you can visit our dedicated [examples](https://github.com/Joh

## Apache Spark Support

-Spark NLP *5.3.3* has been built on top of Apache Spark 3.4 while fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x
+Spark NLP *5.4.0-rc2* has been built on top of Apache Spark 3.4 while fully supporting Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x

| Spark NLP | Apache Spark 3.5.x | Apache Spark 3.4.x | Apache Spark 3.3.x | Apache Spark 3.2.x | Apache Spark 3.1.x | Apache Spark 3.0.x | Apache Spark 2.4.x | Apache Spark 2.3.x |
|-----------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|
@@ -271,7 +271,7 @@ Find out more about `Spark NLP` versions from our [release notes](https://github

## Databricks Support

-Spark NLP 5.3.3 has been tested and is compatible with the following runtimes:
+Spark NLP 5.4.0-rc2 has been tested and is compatible with the following runtimes:

**CPU:**

@@ -344,7 +344,7 @@

## EMR Support

-Spark NLP 5.3.3 has been tested and is compatible with the following EMR releases:
+Spark NLP 5.4.0-rc2 has been tested and is compatible with the following EMR releases:

- emr-6.2.0
- emr-6.3.0
@@ -394,11 +394,11 @@ Spark NLP supports all major releases of Apache Spark 3.0.x, Apache Spark 3.1.x,
```sh
# CPU

-spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.3
+spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0-rc2

-pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.3
+pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0-rc2

-spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.3
+spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0-rc2
```

The `spark-nlp` has been published to
@@ -407,11 +407,11 @@ the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/s
```sh
# GPU

-spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.3.3
+spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.4.0-rc2

-pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.3.3
+pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.4.0-rc2

-spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.3.3
+spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.4.0-rc2


```

@@ -421,11 +421,11 @@ the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/s
```sh
# AArch64

-spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.3.3
+spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.4.0-rc2

-pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.3.3
+pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.4.0-rc2

-spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.3.3
+spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.4.0-rc2


```

@@ -435,11 +435,11 @@ the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/s
```sh
# M1/M2 (Apple Silicon)

-spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.3.3
+spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.4.0-rc2

-pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.3.3
+pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.4.0-rc2

-spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.3.3
+spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.4.0-rc2


```

@@ -453,7 +453,7 @@ set in your SparkSession:
spark-shell \
--driver-memory 16g \
--conf spark.kryoserializer.buffer.max=2000M \
---packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.3
+--packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0-rc2
```

## Scala
@@ -471,7 +471,7 @@ coordinates:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp_2.12</artifactId>
-<version>5.3.3</version>
+<version>5.4.0-rc2</version>
</dependency>
```

@@ -482,7 +482,7 @@ coordinates:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu_2.12</artifactId>
-<version>5.3.3</version>
+<version>5.4.0-rc2</version>
</dependency>
```

@@ -493,7 +493,7 @@ coordinates:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-aarch64_2.12</artifactId>
-<version>5.3.3</version>
+<version>5.4.0-rc2</version>
</dependency>
```

@@ -504,7 +504,7 @@ coordinates:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-silicon_2.12</artifactId>
-<version>5.3.3</version>
+<version>5.4.0-rc2</version>
</dependency>
```

@@ -514,28 +514,28 @@ coordinates:

```sbtshell
// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp
-libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "5.3.3"
+libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "5.4.0-rc2"
```

**spark-nlp-gpu:**

```sbtshell
// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-gpu
-libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-gpu" % "5.3.3"
+libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-gpu" % "5.4.0-rc2"
```

**spark-nlp-aarch64:**

```sbtshell
// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-aarch64
-libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-aarch64" % "5.3.3"
+libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-aarch64" % "5.4.0-rc2"
```

**spark-nlp-silicon:**

```sbtshell
// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-silicon
-libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-silicon" % "5.3.3"
+libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-silicon" % "5.4.0-rc2"
```

Maven
@@ -557,7 +557,7 @@ If you installed pyspark through pip/conda, you can install `spark-nlp` through
Pip:

```bash
-pip install spark-nlp==5.3.3
+pip install spark-nlp==5.4.0-rc2
```

Conda:
@@ -586,7 +586,7 @@ spark = SparkSession.builder
.config("spark.driver.memory", "16G")
.config("spark.driver.maxResultSize", "0")
.config("spark.kryoserializer.buffer.max", "2000M")
.config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.3")
.config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0-rc2")
.getOrCreate()
```

@@ -657,7 +657,7 @@ Use either one of the following options
- Add the following Maven Coordinates to the interpreter's library list

```bash
-com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.3
+com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0-rc2
```

- Add a path to pre-built jar from [here](#compiled-jars) in the interpreter's library list making sure the jar is
@@ -668,7 +668,7 @@
Apart from the previous step, install the python module through pip

```bash
-pip install spark-nlp==5.3.3
+pip install spark-nlp==5.4.0-rc2
```

Or you can install `spark-nlp` from inside Zeppelin by using Conda:
@@ -696,7 +696,7 @@ launch the Jupyter from the same Python environment:
$ conda create -n sparknlp python=3.8 -y
$ conda activate sparknlp
# spark-nlp by default is based on pyspark 3.x
-$ pip install spark-nlp==5.3.3 pyspark==3.3.1 jupyter
+$ pip install spark-nlp==5.4.0-rc2 pyspark==3.3.1 jupyter
$ jupyter notebook
```

@@ -713,7 +713,7 @@
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook

-pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.3
+pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0-rc2
```

Alternatively, you can mix in using `--jars` option for pyspark + `pip install spark-nlp`
@@ -740,7 +740,7 @@ This script comes with the two options to define `pyspark` and `spark-nlp` versi
# -s is for spark-nlp
# -g will enable upgrading libcudnn8 to 8.1.0 on Google Colab for GPU usage
# by default they are set to the latest
-!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.3.3
+!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.4.0-rc2
```

[Spark NLP quick start on Google Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/quick_start_google_colab.ipynb)
@@ -763,7 +763,7 @@ This script comes with the two options to define `pyspark` and `spark-nlp` versi
# -s is for spark-nlp
# -g will enable upgrading libcudnn8 to 8.1.0 on Kaggle for GPU usage
# by default they are set to the latest
-!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.3.3
+!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.4.0-rc2
```

[Spark NLP quick start on Kaggle Kernel](https://www.kaggle.com/mozzie/spark-nlp-named-entity-recognition) is a live
@@ -782,9 +782,9 @@ demo on Kaggle Kernel that performs named entity recognitions by using Spark NLP

3. In `Libraries` tab inside your cluster you need to follow these steps:

-3.1. Install New -> PyPI -> `spark-nlp==5.3.3` -> Install
+3.1. Install New -> PyPI -> `spark-nlp==5.4.0-rc2` -> Install

-3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.3` -> Install
+3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0-rc2` -> Install

4. Now you can attach your notebook to the cluster and use Spark NLP!
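As a quick sanity check after step 4 (a minimal sketch, not from the README; it assumes the notebook is attached to the cluster configured above, where Databricks already provides a `spark` session):

```python
import sparknlp

# Both installs above must be present: the PyPI package provides the Python
# API, and the Maven coordinate provides the JVM side on the executors.
print(sparknlp.version())  # Spark NLP Python package version
print(spark.version)       # Spark runtime version on the cluster
```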

@@ -835,7 +835,7 @@ A sample of your software configuration in JSON on S3 (must be public access):
"spark.kryoserializer.buffer.max": "2000M",
"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.driver.maxResultSize": "0",
"spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.3"
"spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0-rc2"
}
}]
```
@@ -844,7 +844,7 @@ A sample of AWS CLI to launch EMR cluster:

```.sh
aws emr create-cluster \
--name "Spark NLP 5.3.3" \
--name "Spark NLP 5.4.0-rc2" \
--release-label emr-6.2.0 \
--applications Name=Hadoop Name=Spark Name=Hive \
--instance-type m4.4xlarge \
@@ -908,7 +908,7 @@ gcloud dataproc clusters create ${CLUSTER_NAME} \
--enable-component-gateway \
--metadata 'PIP_PACKAGES=spark-nlp spark-nlp-display google-cloud-bigquery google-cloud-storage' \
--initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh \
---properties spark:spark.serializer=org.apache.spark.serializer.KryoSerializer,spark:spark.driver.maxResultSize=0,spark:spark.kryoserializer.buffer.max=2000M,spark:spark.jars.packages=com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.3
+--properties spark:spark.serializer=org.apache.spark.serializer.KryoSerializer,spark:spark.driver.maxResultSize=0,spark:spark.kryoserializer.buffer.max=2000M,spark:spark.jars.packages=com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0-rc2
```

2. On an existing one, you need to install spark-nlp and spark-nlp-display packages from PyPI.
@@ -951,7 +951,7 @@ spark = SparkSession.builder
.config("spark.kryoserializer.buffer.max", "2000m")
.config("spark.jsl.settings.pretrained.cache_folder", "sample_data/pretrained")
.config("spark.jsl.settings.storage.cluster_tmp_dir", "sample_data/storage")
.config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.3")
.config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0-rc2")
.getOrCreate()
```

@@ -965,7 +965,7 @@
--conf spark.kryoserializer.buffer.max=2000M \
--conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \
--conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \
---packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.3
+--packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0-rc2
```

**pyspark:**
@@ -978,7 +978,7 @@
--conf spark.kryoserializer.buffer.max=2000M \
--conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \
--conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \
---packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.3
+--packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0-rc2
```

**Databricks:**
@@ -1250,7 +1250,7 @@ spark = SparkSession.builder
.config("spark.driver.memory", "16G")
.config("spark.driver.maxResultSize", "0")
.config("spark.kryoserializer.buffer.max", "2000M")
.config("spark.jars", "/tmp/spark-nlp-assembly-5.3.3.jar")
.config("spark.jars", "/tmp/spark-nlp-assembly-5.4.0-rc2.jar")
.getOrCreate()
```

@@ -1259,7 +1259,7 @@
version (3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x)
- If you are local, you can load the Fat JAR from your local FileSystem, however, if you are in a cluster setup you need
to put the Fat JAR on a distributed FileSystem such as HDFS, DBFS, S3, etc. (
-  i.e., `hdfs:///tmp/spark-nlp-assembly-5.3.3.jar`)
+  i.e., `hdfs:///tmp/spark-nlp-assembly-5.4.0-rc2.jar`)

Example of using pretrained Models and Pipelines in offline:
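The README's offline example is collapsed at this point in the diff; a minimal sketch of the general pattern (the pipeline name and local path below are hypothetical — any model or pipeline downloaded ahead of time from the Models Hub and unpacked on local disk or HDFS loads the same way):

```python
import sparknlp
from pyspark.ml import PipelineModel

spark = sparknlp.start()

# Load a pipeline that was downloaded and unpacked beforehand; no internet
# access is needed at load time, since pretrained pipelines are stored on
# disk as regular Spark ML PipelineModels.
pipeline = PipelineModel.load("/tmp/explain_document_dl_en_5.4.0")

df = spark.createDataFrame([("Offline loading works the same way.",)], ["text"])
pipeline.transform(df).show(truncate=False)
```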
