diff --git a/CHANGELOG b/CHANGELOG
index 2e0eac24162c26..f327548e3f7e80 100644
--- a/CHANGELOG
+++ b/CHANGELOG
@@ -1,3 +1,36 @@
+========
+5.3.0
+========
+----------------
+New Features & Enhancements
+----------------
+* **NEW:** Introducing Llama-2 and all the models fine-tuned on this architecture. This is our very first CausalLM annotator in ONNX, and it comes with support for INT4 and INT8 quantization on CPUs.
+* **NEW:** Introducing `MPNetForSequenceClassification` annotator for sequence classification tasks. This annotator is based on the MPNet architecture and is designed to classify sequences of text into a set of predefined classes.
+* **NEW:** Introducing `MPNetForQuestionAnswering` annotator for question answering tasks. This annotator is based on the MPNet architecture and is designed to answer questions based on a given context.
+* **NEW:** Introducing `M2M100`, a state-of-the-art multilingual translation model. M2M100 is a multilingual encoder-decoder (seq-to-seq) model trained for Many-to-Many multilingual translation. The model can directly translate between the 9,900 directions of 100 languages.
+* **NEW:** Introducing a new `DeBertaForZeroShotClassification` annotator for zero-shot classification tasks. This annotator is based on the DeBERTa architecture and is designed to classify sequences of text into a set of predefined classes.
+* **NEW:** Implement the retrieval feature in our `DocumentSimilarity` annotator. The new DocumentSimilarity ranker is a powerful tool for ranking documents by their similarity to a given query document. It is designed to be efficient and scalable, making it ideal for a variety of RAG applications.
+* Add ONNX support for `BertForZeroShotClassification` annotator.
+* Add support for in-memory use of the `WordEmbeddingsModel` annotator in serverless clusters. We initially introduced the in-memory feature for users inside Kubernetes clusters without any `HDFS`; today, it also runs without issue locally and on Google `Colab`, `Kaggle`, `Databricks`, `AWS EMR`, `GCP`, and `AWS Glue`.
+* New Whisper Large and Distil models.
+* Update ONNX Runtime to 1.17.0
+* Support new Databricks runtimes: 14.2, 14.3, 14.2 ML, 14.3 ML, 14.2 GPU, and 14.3 GPU
+* Support new EMR 6.15.0 and 7.0.0 versions
+* Add notebook to fine-tune BERT for Sentence Embeddings in Hugging Face and import it into Spark NLP
+* Add notebook to import BERT for Zero-Shot classification from Hugging Face
+* Add notebook to import DeBERTa for Zero-Shot classification from Hugging Face
+* Update EntityRuler documentation
+* Improve SBT project and resolve warnings (almost!)
+
+----------------
+Bug Fixes
+----------------
+* Fix Spark NLP configuration to set `cluster_tmp_dir` on Databricks' DBFS via `spark.jsl.settings.storage.cluster_tmp_dir` https://github.com/JohnSnowLabs/spark-nlp/issues/14129
+* Fix score calculation in `RoBertaForQuestionAnswering` annotator https://github.com/JohnSnowLabs/spark-nlp/pull/14147
+* Fix optional input column validations https://github.com/JohnSnowLabs/spark-nlp/pull/14153
+* Fix notebooks for importing DeBERTa classifiers https://github.com/JohnSnowLabs/spark-nlp/pull/14154
+* Fix GPT2 deserialization over the cluster (Databricks) https://github.com/JohnSnowLabs/spark-nlp/pull/14177

========
5.2.3
========
diff --git a/README.md b/README.md
index 54e3dacc8cb620..5f4c9637cd8926 100644
--- a/README.md
+++ b/README.md
@@ -22,7 +22,7 @@ environment.
Spark NLP comes with **36000+** pretrained **pipelines** and **models** in more than **200+** languages. It also offers tasks such as **Tokenization**, **Word Segmentation**, **Part-of-Speech Tagging**, Word and Sentence **Embeddings**, **Named Entity Recognition**, **Dependency Parsing**, **Spell Checking**, **Text Classification**, **Sentiment Analysis**, **Token Classification**, **Machine Translation** (+180 languages), **Summarization**, **Question Answering**, **Table Question Answering**, **Text Generation**, **Image Classification**, **Image to Text (captioning)**, **Automatic Speech Recognition**, **Zero-Shot Learning**, and many more [NLP tasks](#features).

-**Spark NLP** is the only open-source NLP library in **production** that offers state-of-the-art transformers such as **BERT**, **CamemBERT**, **ALBERT**, **ELECTRA**, **XLNet**, **DistilBERT**, **RoBERTa**, **DeBERTa**, **XLM-RoBERTa**, **Longformer**, **ELMO**, **Universal Sentence Encoder**, **Facebook BART**, **Instructor**, **E5**, **Google T5**, **MarianMT**, **OpenAI GPT2**, **Vision Transformers (ViT)**, **OpenAI Whisper**, and many more not only to **Python** and **R**, but also to **JVM** ecosystem (**Java**, **Scala**, and **Kotlin**) at **scale** by extending **Apache Spark** natively.
+**Spark NLP** is the only open-source NLP library in **production** that offers state-of-the-art transformers such as **BERT**, **CamemBERT**, **ALBERT**, **ELECTRA**, **XLNet**, **DistilBERT**, **RoBERTa**, **DeBERTa**, **XLM-RoBERTa**, **Longformer**, **ELMO**, **Universal Sentence Encoder**, **Llama-2**, **M2M100**, **BART**, **Instructor**, **E5**, **Google T5**, **MarianMT**, **OpenAI GPT2**, **Vision Transformers (ViT)**, **OpenAI Whisper**, and many more not only to **Python** and **R**, but also to the **JVM** ecosystem (**Java**, **Scala**, and **Kotlin**) at **scale** by extending **Apache Spark** natively.
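Of the annotators added in this release, the M2M100 translator is the one most likely to need a usage illustration. The following is a minimal, hedged sketch: the `M2M100Transformer` class name, its no-argument `pretrained()` default, and the `setSrcLang`/`setTgtLang` setters follow Spark NLP's usual conventions but are not confirmed by this diff — check the 5.3.0 API reference before relying on them.

```python
# Hedged sketch: translate French to English with the new M2M100 annotator.
# Class name, pretrained default, and language setters are assumptions based
# on Spark NLP naming conventions -- verify against the 5.3.0 docs.
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import M2M100Transformer
from pyspark.ml import Pipeline

spark = sparknlp.start()

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("documents")

m2m100 = M2M100Transformer.pretrained() \
    .setInputCols(["documents"]) \
    .setOutputCol("translation") \
    .setSrcLang("fr") \
    .setTgtLang("en")

pipeline = Pipeline(stages=[document_assembler, m2m100])

data = spark.createDataFrame([["La vie est belle."]]).toDF("text")
pipeline.fit(data).transform(data) \
    .select("translation.result").show(truncate=False)
```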
## Project's website

@@ -111,42 +111,34 @@ documentation and examples
- BERT Sentence Embeddings (TF Hub & HuggingFace models)
- RoBerta Sentence Embeddings (HuggingFace models)
- XLM-RoBerta Sentence Embeddings (HuggingFace models)
-- Instructor Embeddings (HuggingFace models)
+- INSTRUCTOR Embeddings (HuggingFace models)
- E5 Embeddings (HuggingFace models)
- MPNet Embeddings (HuggingFace models)
- OpenAI Embeddings
-- Sentence Embeddings
-- Chunk Embeddings
+- Sentence & Chunk Embeddings
- Unsupervised keywords extraction
- Language Detection & Identification (up to 375 languages)
-- Multi-class Sentiment analysis (Deep learning)
-- Multi-label Sentiment analysis (Deep learning)
+- Multi-class & Multi-label Sentiment analysis (Deep learning)
- Multi-class Text Classification (Deep learning)
-- BERT for Token & Sequence Classification
-- DistilBERT for Token & Sequence Classification
-- CamemBERT for Token & Sequence Classification
-- ALBERT for Token & Sequence Classification
-- RoBERTa for Token & Sequence Classification
-- DeBERTa for Token & Sequence Classification
-- XLM-RoBERTa for Token & Sequence Classification
+- BERT for Token & Sequence Classification & Question Answering
+- DistilBERT for Token & Sequence Classification & Question Answering
+- CamemBERT for Token & Sequence Classification & Question Answering
+- ALBERT for Token & Sequence Classification & Question Answering
+- RoBERTa for Token & Sequence Classification & Question Answering
+- DeBERTa for Token & Sequence Classification & Question Answering
+- XLM-RoBERTa for Token & Sequence Classification & Question Answering
+- Longformer for Token & Sequence Classification & Question Answering
+- MPNet for Token & Sequence Classification & Question Answering
- XLNet for Token & Sequence Classification
-- Longformer for Token & Sequence Classification
-- BERT for Token & Sequence Classification
-- BERT for Question Answering
-- CamemBERT for Question Answering
-- DistilBERT for Question Answering
-- ALBERT for Question Answering
-- RoBERTa for Question Answering
-- DeBERTa for Question Answering
-- XLM-RoBERTa for Question Answering
-- Longformer for Question Answering
-- Table Question Answering (TAPAS)
- Zero-Shot NER Model
- Zero-Shot Text Classification by Transformers (ZSL)
- Neural Machine Translation (MarianMT)
+- Many-to-Many multilingual translation model (Facebook M2M100)
+- Table Question Answering (TAPAS)
- Text-To-Text Transfer Transformer (Google T5)
- Generative Pre-trained Transformer 2 (OpenAI GPT2)
- Seq2Seq for NLG, Translation, and Comprehension (Facebook BART)
+- Chat and Conversational LLMs (Facebook Llama-2)
- Vision Transformer (Google ViT)
- Swin Image Classification (Microsoft Swin Transformer)
- ConvNext Image Classification (Facebook ConvNext)
@@ -173,7 +165,7 @@ To use Spark NLP you need the following requirements:

**GPU (optional):**

-Spark NLP 5.2.3 is built with ONNX 1.16.3 and TensorFlow 2.7.1 deep learning engines. The minimum following NVIDIA® software are only required for GPU support:
+Spark NLP 5.3.0 is built with ONNX 1.17.0 and TensorFlow 2.7.1 deep learning engines.
The following NVIDIA® software is required only for GPU support:

- NVIDIA® GPU drivers version 450.80.02 or higher
- CUDA® Toolkit 11.2
@@ -189,7 +181,7 @@ $ java -version
$ conda create -n sparknlp python=3.7 -y
$ conda activate sparknlp
# spark-nlp by default is based on pyspark 3.x
-$ pip install spark-nlp==5.2.3 pyspark==3.3.1
+$ pip install spark-nlp==5.3.0 pyspark==3.3.1
```

In Python console or Jupyter `Python3` kernel:

@@ -234,11 +226,12 @@ For more examples, you can visit our dedicated [examples](https://github.com/Joh

## Apache Spark Support

-Spark NLP *5.2.3* has been built on top of Apache Spark 3.4 while fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x
+Spark NLP *5.3.0* has been built on top of Apache Spark 3.4 while fully supporting Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x

| Spark NLP | Apache Spark 3.5.x | Apache Spark 3.4.x | Apache Spark 3.3.x | Apache Spark 3.2.x | Apache Spark 3.1.x | Apache Spark 3.0.x | Apache Spark 2.4.x | Apache Spark 2.3.x |
|-----------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|
-| 5.2.x | YES | YES | YES | YES | YES | YES | NO | NO |
+| 5.3.x | YES | YES | YES | YES | YES | YES | NO | NO |
+| 5.2.x | YES | YES | YES | YES | YES | YES | NO | NO |
| 5.1.x | Partially | YES | YES | YES | YES | YES | NO | NO |
| 5.0.x | YES | YES | YES | YES | YES | YES | NO | NO |
| 4.4.x | YES | YES | YES | YES | YES | YES | NO | NO |
@@ -259,6 +252,7 @@ Find out more about `Spark NLP` versions from our [release notes](https://github

| Spark NLP | Python 3.6 | Python 3.7 | Python 3.8 | Python 3.9 | Python 3.10| Scala 2.11 | Scala 2.12 |
|-----------|------------|------------|------------|------------|------------|------------|------------|
+| 5.3.x | NO | YES | YES | YES | YES | NO | YES |
| 5.2.x | NO | YES | YES | YES | YES | NO | YES |
| 5.1.x | NO | YES | YES | YES | YES | NO | YES |
| 5.0.x | NO | YES | YES | YES | YES | NO | YES |
@@ -276,7 +270,7 @@ Find out more about `Spark NLP` versions from our [release notes](https://github

## Databricks Support

-Spark NLP 5.2.3 has been tested and is compatible with the following runtimes:
+Spark NLP 5.3.0 has been tested and is compatible with the following runtimes:

**CPU:**

@@ -318,6 +312,10 @@ Spark NLP 5.2.3 has been tested and is compatible with the following runtimes:
- 14.0 ML
- 14.1
- 14.1 ML
+- 14.2
+- 14.2 ML
+- 14.3
+- 14.3 ML

**GPU:**

@@ -340,10 +338,12 @@ Spark NLP 5.2.3 has been tested and is compatible with the following runtimes:
- 13.3 ML & GPU
- 14.0 ML & GPU
- 14.1 ML & GPU
+- 14.2 ML & GPU
+- 14.3 ML & GPU

## EMR Support

-Spark NLP 5.2.3 has been tested and is compatible with the following EMR releases:
+Spark NLP 5.3.0 has been tested and is compatible with the following EMR releases:

- emr-6.2.0
- emr-6.3.0
@@ -359,8 +359,11 @@ Spark NLP 5.2.3 has been tested and is compatible with the following EMR release
- emr-6.12.0
- emr-6.13.0
- emr-6.14.0
+- emr-6.15.0
+- emr-7.0.0

Full list of [Amazon EMR 6.x releases](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-6x.html)
+Full list of [Amazon EMR 7.x releases](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-7x.html)

NOTE: The EMR 6.1.0 and 6.1.1 are not supported.
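Several of the environments listed above (serverless Databricks, EMR, Glue) lack HDFS, which is exactly the case the changelog's in-memory mode for `WordEmbeddingsModel` targets. A minimal sketch follows; the `setEnableInMemoryStorage` parameter name is an assumption inferred from the release notes and should be verified against the 5.3.0 reference.

```python
# Hedged sketch of in-memory storage for WordEmbeddingsModel on clusters
# without HDFS (serverless, EMR, Glue). The setEnableInMemoryStorage setter
# is assumed from the release notes -- confirm it in the 5.3.0 API docs.
from sparknlp.annotator import WordEmbeddingsModel

embeddings = WordEmbeddingsModel.pretrained("glove_100d") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setEnableInMemoryStorage(True)  # keep the embeddings index in memory instead of on disk
```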
@@ -390,11 +393,11 @@ Spark NLP supports all major releases of Apache Spark 3.0.x, Apache Spark 3.1.x, ```sh # CPU -spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3 +spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.0 -pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3 +pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.0 -spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3 +spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.0 ``` The `spark-nlp` has been published to @@ -403,11 +406,11 @@ the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/s ```sh # GPU -spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.2.3 +spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.3.0 -pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.2.3 +pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.3.0 -spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.2.3 +spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.3.0 ``` @@ -417,11 +420,11 @@ the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/s ```sh # AArch64 -spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.2.3 +spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.3.0 -pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.2.3 +pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.3.0 -spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.2.3 +spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.3.0 ``` @@ -431,11 +434,11 @@ the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/s ```sh # M1/M2 (Apple Silicon) -spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.2.3 +spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.3.0 -pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.2.3 +pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.3.0 -spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.2.3 +spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.3.0 ``` @@ -449,7 +452,7 @@ set in your SparkSession: spark-shell \ --driver-memory 16g \ --conf spark.kryoserializer.buffer.max=2000M \ - --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3 + --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.0 ``` ## Scala @@ -467,7 +470,7 @@ coordinates: com.johnsnowlabs.nlp spark-nlp_2.12 - 5.2.3 + 5.3.0 ``` @@ -478,7 +481,7 @@ coordinates: com.johnsnowlabs.nlp spark-nlp-gpu_2.12 - 5.2.3 + 5.3.0 ``` @@ -489,7 +492,7 @@ coordinates: com.johnsnowlabs.nlp spark-nlp-aarch64_2.12 - 5.2.3 + 5.3.0 ``` @@ -500,7 +503,7 @@ coordinates: com.johnsnowlabs.nlp spark-nlp-silicon_2.12 - 5.2.3 + 5.3.0 ``` @@ -510,28 +513,28 @@ coordinates: ```sbtshell // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp -libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "5.2.3" +libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "5.3.0" ``` **spark-nlp-gpu:** ```sbtshell // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-gpu -libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-gpu" % "5.2.3" +libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-gpu" % "5.3.0" ``` **spark-nlp-aarch64:** ```sbtshell // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-aarch64 -libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-aarch64" % 
"5.2.3" +libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-aarch64" % "5.3.0" ``` **spark-nlp-silicon:** ```sbtshell // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-silicon -libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-silicon" % "5.2.3" +libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-silicon" % "5.3.0" ``` Maven @@ -553,7 +556,7 @@ If you installed pyspark through pip/conda, you can install `spark-nlp` through Pip: ```bash -pip install spark-nlp==5.2.3 +pip install spark-nlp==5.3.0 ``` Conda: @@ -582,7 +585,7 @@ spark = SparkSession.builder .config("spark.driver.memory", "16G") .config("spark.driver.maxResultSize", "0") .config("spark.kryoserializer.buffer.max", "2000M") - .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3") + .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.0") .getOrCreate() ``` @@ -653,7 +656,7 @@ Use either one of the following options - Add the following Maven Coordinates to the interpreter's library list ```bash -com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3 +com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.0 ``` - Add a path to pre-built jar from [here](#compiled-jars) in the interpreter's library list making sure the jar is @@ -664,7 +667,7 @@ com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3 Apart from the previous step, install the python module through pip ```bash -pip install spark-nlp==5.2.3 +pip install spark-nlp==5.3.0 ``` Or you can install `spark-nlp` from inside Zeppelin by using Conda: @@ -692,7 +695,7 @@ launch the Jupyter from the same Python environment: $ conda create -n sparknlp python=3.8 -y $ conda activate sparknlp # spark-nlp by default is based on pyspark 3.x -$ pip install spark-nlp==5.2.3 pyspark==3.3.1 jupyter +$ pip install spark-nlp==5.3.0 pyspark==3.3.1 jupyter $ jupyter notebook ``` @@ -709,7 +712,7 @@ export PYSPARK_PYTHON=python3 export PYSPARK_DRIVER_PYTHON=jupyter export PYSPARK_DRIVER_PYTHON_OPTS=notebook -pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3 +pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.0 ``` Alternatively, you can mix in using `--jars` option for pyspark + `pip install spark-nlp` @@ -736,7 +739,7 @@ This script comes with the two options to define `pyspark` and `spark-nlp` versi # -s is for spark-nlp # -g will enable upgrading libcudnn8 to 8.1.0 on Google Colab for GPU usage # by default they are set to the latest -!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.2.3 +!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.3.0 ``` [Spark NLP quick start on Google Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/quick_start_google_colab.ipynb) @@ -759,7 +762,7 @@ This script comes with the two options to define `pyspark` and `spark-nlp` versi # -s is for spark-nlp # -g will enable upgrading libcudnn8 to 8.1.0 on Kaggle for GPU usage # by default they are set to the latest -!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.2.3 +!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.3.0 ``` [Spark NLP quick start on Kaggle Kernel](https://www.kaggle.com/mozzie/spark-nlp-named-entity-recognition) is a live @@ -778,9 +781,9 @@ demo on Kaggle Kernel that performs named entity recognitions by using Spark NLP 3. In `Libraries` tab inside your cluster you need to follow these steps: - 3.1. Install New -> PyPI -> `spark-nlp==5.2.3` -> Install + 3.1. 
Install New -> PyPI -> `spark-nlp==5.3.0` -> Install - 3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3` -> Install + 3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.0` -> Install 4. Now you can attach your notebook to the cluster and use Spark NLP! @@ -831,7 +834,7 @@ A sample of your software configuration in JSON on S3 (must be public access): "spark.kryoserializer.buffer.max": "2000M", "spark.serializer": "org.apache.spark.serializer.KryoSerializer", "spark.driver.maxResultSize": "0", - "spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3" + "spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.0" } }] ``` @@ -840,7 +843,7 @@ A sample of AWS CLI to launch EMR cluster: ```.sh aws emr create-cluster \ ---name "Spark NLP 5.2.3" \ +--name "Spark NLP 5.3.0" \ --release-label emr-6.2.0 \ --applications Name=Hadoop Name=Spark Name=Hive \ --instance-type m4.4xlarge \ @@ -904,7 +907,7 @@ gcloud dataproc clusters create ${CLUSTER_NAME} \ --enable-component-gateway \ --metadata 'PIP_PACKAGES=spark-nlp spark-nlp-display google-cloud-bigquery google-cloud-storage' \ --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh \ - --properties spark:spark.serializer=org.apache.spark.serializer.KryoSerializer,spark:spark.driver.maxResultSize=0,spark:spark.kryoserializer.buffer.max=2000M,spark:spark.jars.packages=com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3 + --properties spark:spark.serializer=org.apache.spark.serializer.KryoSerializer,spark:spark.driver.maxResultSize=0,spark:spark.kryoserializer.buffer.max=2000M,spark:spark.jars.packages=com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.0 ``` 2. On an existing one, you need to install spark-nlp and spark-nlp-display packages from PyPI. 
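Once the notebook is attached, a quick smoke test of one of the new 5.3.0 annotators confirms the installation. The sketch below assumes `MPNetForSequenceClassification.pretrained()` resolves to a sensible default model; treat it as illustrative rather than canonical, and note that `spark` is predefined in Databricks notebooks.

```python
# Hedged smoke test for the new MPNetForSequenceClassification annotator;
# the default pretrained() model name is an assumption.
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, MPNetForSequenceClassification
from pyspark.ml import Pipeline

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

classifier = MPNetForSequenceClassification.pretrained() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("label")

pipeline = Pipeline(stages=[document_assembler, tokenizer, classifier])

df = spark.createDataFrame([["Spark NLP 5.3.0 looks like a great release!"]]).toDF("text")
pipeline.fit(df).transform(df).select("label.result").show(truncate=False)
```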
@@ -947,7 +950,7 @@ spark = SparkSession.builder .config("spark.kryoserializer.buffer.max", "2000m") .config("spark.jsl.settings.pretrained.cache_folder", "sample_data/pretrained") .config("spark.jsl.settings.storage.cluster_tmp_dir", "sample_data/storage") - .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3") + .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.0") .getOrCreate() ``` @@ -961,7 +964,7 @@ spark-shell \ --conf spark.kryoserializer.buffer.max=2000M \ --conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \ --conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \ - --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3 + --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.0 ``` **pyspark:** @@ -974,7 +977,7 @@ pyspark \ --conf spark.kryoserializer.buffer.max=2000M \ --conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \ --conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \ - --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3 + --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.0 ``` **Databricks:** @@ -1246,7 +1249,7 @@ spark = SparkSession.builder .config("spark.driver.memory", "16G") .config("spark.driver.maxResultSize", "0") .config("spark.kryoserializer.buffer.max", "2000M") - .config("spark.jars", "/tmp/spark-nlp-assembly-5.2.3.jar") + .config("spark.jars", "/tmp/spark-nlp-assembly-5.3.0.jar") .getOrCreate() ``` @@ -1255,7 +1258,7 @@ spark = SparkSession.builder version (3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x) - If you are local, you can load the Fat JAR from your local FileSystem, however, if you are in a cluster setup you need to put the Fat JAR on a distributed FileSystem such as HDFS, DBFS, S3, etc. ( - i.e., `hdfs:///tmp/spark-nlp-assembly-5.2.3.jar`) + i.e., `hdfs:///tmp/spark-nlp-assembly-5.3.0.jar`) Example of using pretrained Models and Pipelines in offline: diff --git a/build.sbt b/build.sbt index f9d51a971aa17d..3f45d5ee14d6c8 100644 --- a/build.sbt +++ b/build.sbt @@ -6,7 +6,7 @@ name := getPackageName(is_silicon, is_gpu, is_aarch64) organization := "com.johnsnowlabs.nlp" -version := "5.2.3" +version := "5.3.0" (ThisBuild / scalaVersion) := scalaVer @@ -144,13 +144,17 @@ lazy val utilDependencies = Seq( exclude ("com.fasterxml.jackson.core", "jackson-annotations") exclude ("com.fasterxml.jackson.core", "jackson-databind") exclude ("com.fasterxml.jackson.core", "jackson-core") + exclude ("com.fasterxml.jackson.dataformat", "jackson-dataformat-cbor") exclude ("commons-configuration", "commons-configuration"), liblevenshtein exclude ("com.google.guava", "guava") exclude ("org.apache.commons", "commons-lang3") exclude ("com.google.code.findbugs", "annotations") exclude ("org.slf4j", "slf4j-api"), - gcpStorage, + gcpStorage + exclude ("com.fasterxml.jackson.core", "jackson-core") + exclude ("com.fasterxml.jackson.dataformat", "jackson-dataformat-cbor") + , greex, azureIdentity, azureStorage) diff --git a/conda/meta.yaml b/conda/meta.yaml index 0ffbacb84c96a4..0e3ce22fa34877 100644 --- a/conda/meta.yaml +++ b/conda/meta.yaml @@ -1,5 +1,5 @@ {% set name = "spark-nlp" %} -{% set version = "5.2.3" %} +{% set version = "5.3.0" %} package: name: {{ name|lower }} @@ -7,7 +7,7 @@ package: source: url: https://pypi.io/packages/source/{{ name[0] }}/{{ name }}/spark-nlp-{{ version }}.tar.gz - sha256: bdad9912c6f4fa36aef2169a4d7e4c33cd32d79d6ff0c628c04876d9354252e9 + sha256: 2fa182f1850026fa7f9d5fbb7b92939856f78ddcc2cb2d87d56af5e2e90b97f0 build: 
noarch: python

diff --git a/docs/README.md b/docs/README.md
index f80ec476ce179f..3db879d41867c4 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -173,7 +173,7 @@ To use Spark NLP you need the following requirements:

**GPU (optional):**

-Spark NLP 5.2.3 is built with ONNX 1.16.3 and TensorFlow 2.7.1 deep learning engines. The minimum following NVIDIA® software are only required for GPU support:
+Spark NLP 5.3.0 is built with ONNX 1.17.0 and TensorFlow 2.7.1 deep learning engines. The following NVIDIA® software is required only for GPU support:

- NVIDIA® GPU drivers version 450.80.02 or higher
- CUDA® Toolkit 11.2
@@ -189,7 +189,7 @@ $ java -version
$ conda create -n sparknlp python=3.7 -y
$ conda activate sparknlp
# spark-nlp by default is based on pyspark 3.x
-$ pip install spark-nlp==5.2.3 pyspark==3.3.1
+$ pip install spark-nlp==5.3.0 pyspark==3.3.1
```

In Python console or Jupyter `Python3` kernel:

@@ -234,7 +234,7 @@ For more examples, you can visit our dedicated [examples](https://github.com/Joh

## Apache Spark Support

-Spark NLP *5.2.3* has been built on top of Apache Spark 3.4 while fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x
+Spark NLP *5.3.0* has been built on top of Apache Spark 3.4 while fully supporting Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x

| Spark NLP | Apache Spark 3.5.x | Apache Spark 3.4.x | Apache Spark 3.3.x | Apache Spark 3.2.x | Apache Spark 3.1.x | Apache Spark 3.0.x | Apache Spark 2.4.x | Apache Spark 2.3.x |
|-----------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|
@@ -276,7 +276,7 @@ Find out more about `Spark NLP` versions from our [release notes](https://github

## Databricks Support

-Spark NLP 5.2.3 has been tested and is compatible with the following runtimes:
+Spark NLP 5.3.0 has been tested and is compatible with the following runtimes:

**CPU:**

@@ -343,7 +343,7 @@ Spark NLP 5.2.3 has been tested and is compatible with the following runtimes:

## EMR Support

-Spark NLP 5.2.3 has been tested and is compatible with the following EMR releases:
+Spark NLP 5.3.0 has been tested and is compatible with the following EMR releases:

- emr-6.2.0
- emr-6.3.0
@@ -390,11 +390,11 @@ Spark NLP supports all major releases of Apache Spark 3.0.x, Apache Spark 3.1.x,

```sh
# CPU
-spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3
+spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.0

-pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3
+pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.0

-spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3
+spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.0
```

The `spark-nlp` has been published to
@@ -403,11 +403,11 @@ the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/s

```sh
# GPU
-spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.2.3
+spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.3.0

-pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.2.3
+pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.3.0

-spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.2.3
+spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.3.0
```

@@ -417,11 +417,11 @@ the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/s

```sh
# AArch64
-spark-shell --packages
com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.2.3 +spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.3.0 -pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.2.3 +pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.3.0 -spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.2.3 +spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.3.0 ``` @@ -431,11 +431,11 @@ the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/s ```sh # M1/M2 (Apple Silicon) -spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.2.3 +spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.3.0 -pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.2.3 +pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.3.0 -spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.2.3 +spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.3.0 ``` @@ -449,7 +449,7 @@ set in your SparkSession: spark-shell \ --driver-memory 16g \ --conf spark.kryoserializer.buffer.max=2000M \ - --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3 + --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.0 ``` ## Scala @@ -467,7 +467,7 @@ coordinates: com.johnsnowlabs.nlp spark-nlp_2.12 - 5.2.3 + 5.3.0 ``` @@ -478,7 +478,7 @@ coordinates: com.johnsnowlabs.nlp spark-nlp-gpu_2.12 - 5.2.3 + 5.3.0 ``` @@ -489,7 +489,7 @@ coordinates: com.johnsnowlabs.nlp spark-nlp-aarch64_2.12 - 5.2.3 + 5.3.0 ``` @@ -500,7 +500,7 @@ coordinates: com.johnsnowlabs.nlp spark-nlp-silicon_2.12 - 5.2.3 + 5.3.0 ``` @@ -510,28 +510,28 @@ coordinates: ```sbtshell // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp -libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "5.2.3" +libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "5.3.0" ``` **spark-nlp-gpu:** ```sbtshell // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-gpu -libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-gpu" % "5.2.3" +libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-gpu" % "5.3.0" ``` **spark-nlp-aarch64:** ```sbtshell // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-aarch64 -libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-aarch64" % "5.2.3" +libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-aarch64" % "5.3.0" ``` **spark-nlp-silicon:** ```sbtshell // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-silicon -libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-silicon" % "5.2.3" +libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-silicon" % "5.3.0" ``` Maven @@ -553,7 +553,7 @@ If you installed pyspark through pip/conda, you can install `spark-nlp` through Pip: ```bash -pip install spark-nlp==5.2.3 +pip install spark-nlp==5.3.0 ``` Conda: @@ -582,7 +582,7 @@ spark = SparkSession.builder .config("spark.driver.memory", "16G") .config("spark.driver.maxResultSize", "0") .config("spark.kryoserializer.buffer.max", "2000M") - .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3") + .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.0") .getOrCreate() ``` @@ -653,7 +653,7 @@ Use either one of the following options - Add the following Maven Coordinates to the interpreter's library list ```bash -com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3 +com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.0 ``` - Add a path to pre-built jar from [here](#compiled-jars) in the interpreter's 
library list making sure the jar is @@ -664,7 +664,7 @@ com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3 Apart from the previous step, install the python module through pip ```bash -pip install spark-nlp==5.2.3 +pip install spark-nlp==5.3.0 ``` Or you can install `spark-nlp` from inside Zeppelin by using Conda: @@ -692,7 +692,7 @@ launch the Jupyter from the same Python environment: $ conda create -n sparknlp python=3.8 -y $ conda activate sparknlp # spark-nlp by default is based on pyspark 3.x -$ pip install spark-nlp==5.2.3 pyspark==3.3.1 jupyter +$ pip install spark-nlp==5.3.0 pyspark==3.3.1 jupyter $ jupyter notebook ``` @@ -709,7 +709,7 @@ export PYSPARK_PYTHON=python3 export PYSPARK_DRIVER_PYTHON=jupyter export PYSPARK_DRIVER_PYTHON_OPTS=notebook -pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3 +pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.0 ``` Alternatively, you can mix in using `--jars` option for pyspark + `pip install spark-nlp` @@ -736,7 +736,7 @@ This script comes with the two options to define `pyspark` and `spark-nlp` versi # -s is for spark-nlp # -g will enable upgrading libcudnn8 to 8.1.0 on Google Colab for GPU usage # by default they are set to the latest -!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.2.3 +!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.3.0 ``` [Spark NLP quick start on Google Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/quick_start_google_colab.ipynb) @@ -759,7 +759,7 @@ This script comes with the two options to define `pyspark` and `spark-nlp` versi # -s is for spark-nlp # -g will enable upgrading libcudnn8 to 8.1.0 on Kaggle for GPU usage # by default they are set to the latest -!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.2.3 +!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.3.0 ``` [Spark NLP quick start on Kaggle Kernel](https://www.kaggle.com/mozzie/spark-nlp-named-entity-recognition) is a live @@ -778,9 +778,9 @@ demo on Kaggle Kernel that performs named entity recognitions by using Spark NLP 3. In `Libraries` tab inside your cluster you need to follow these steps: - 3.1. Install New -> PyPI -> `spark-nlp==5.2.3` -> Install + 3.1. Install New -> PyPI -> `spark-nlp==5.3.0` -> Install - 3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3` -> Install + 3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.0` -> Install 4. Now you can attach your notebook to the cluster and use Spark NLP! 
@@ -831,7 +831,7 @@ A sample of your software configuration in JSON on S3 (must be public access): "spark.kryoserializer.buffer.max": "2000M", "spark.serializer": "org.apache.spark.serializer.KryoSerializer", "spark.driver.maxResultSize": "0", - "spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3" + "spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.0" } }] ``` @@ -840,7 +840,7 @@ A sample of AWS CLI to launch EMR cluster: ```.sh aws emr create-cluster \ ---name "Spark NLP 5.2.3" \ +--name "Spark NLP 5.3.0" \ --release-label emr-6.2.0 \ --applications Name=Hadoop Name=Spark Name=Hive \ --instance-type m4.4xlarge \ @@ -904,7 +904,7 @@ gcloud dataproc clusters create ${CLUSTER_NAME} \ --enable-component-gateway \ --metadata 'PIP_PACKAGES=spark-nlp spark-nlp-display google-cloud-bigquery google-cloud-storage' \ --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh \ - --properties spark:spark.serializer=org.apache.spark.serializer.KryoSerializer,spark:spark.driver.maxResultSize=0,spark:spark.kryoserializer.buffer.max=2000M,spark:spark.jars.packages=com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3 + --properties spark:spark.serializer=org.apache.spark.serializer.KryoSerializer,spark:spark.driver.maxResultSize=0,spark:spark.kryoserializer.buffer.max=2000M,spark:spark.jars.packages=com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.0 ``` 2. On an existing one, you need to install spark-nlp and spark-nlp-display packages from PyPI. @@ -947,7 +947,7 @@ spark = SparkSession.builder .config("spark.kryoserializer.buffer.max", "2000m") .config("spark.jsl.settings.pretrained.cache_folder", "sample_data/pretrained") .config("spark.jsl.settings.storage.cluster_tmp_dir", "sample_data/storage") - .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3") + .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.0") .getOrCreate() ``` @@ -961,7 +961,7 @@ spark-shell \ --conf spark.kryoserializer.buffer.max=2000M \ --conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \ --conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \ - --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3 + --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.0 ``` **pyspark:** @@ -974,7 +974,7 @@ pyspark \ --conf spark.kryoserializer.buffer.max=2000M \ --conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \ --conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \ - --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.3 + --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.0 ``` **Databricks:** @@ -1246,7 +1246,7 @@ spark = SparkSession.builder .config("spark.driver.memory", "16G") .config("spark.driver.maxResultSize", "0") .config("spark.kryoserializer.buffer.max", "2000M") - .config("spark.jars", "/tmp/spark-nlp-assembly-5.2.3.jar") + .config("spark.jars", "/tmp/spark-nlp-assembly-5.3.0.jar") .getOrCreate() ``` @@ -1255,7 +1255,7 @@ spark = SparkSession.builder version (3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x) - If you are local, you can load the Fat JAR from your local FileSystem, however, if you are in a cluster setup you need to put the Fat JAR on a distributed FileSystem such as HDFS, DBFS, S3, etc. 
( - i.e., `hdfs:///tmp/spark-nlp-assembly-5.2.3.jar`) + i.e., `hdfs:///tmp/spark-nlp-assembly-5.3.0.jar`) Example of using pretrained Models and Pipelines in offline: diff --git a/docs/_layouts/landing.html b/docs/_layouts/landing.html index 3c112d55a7eaf6..93316984ba5c7e 100755 --- a/docs/_layouts/landing.html +++ b/docs/_layouts/landing.html @@ -201,7 +201,7 @@

{{ _section.title }}

{% highlight bash %} # Using PyPI - $ pip install spark-nlp==5.2.3 + $ pip install spark-nlp==5.3.0 # Using Anaconda/Conda $ conda install -c johnsnowlabs spark-nlp @@ -314,8 +314,8 @@

NLP Features

  • Table Question Answering (TAPAS)
  • Unsupervised keywords extraction
  • Language Detection & Identification (up to 375 languages)
  • -
  • Multi-class Text Classification (DL model)
  • -
  • Multi-label Text Classification (DL model)
  • +
  • Multi-class / Multi-label Text Classification (DL model)
  • +
  • Text Classification (DL model)
  • Multi-class Sentiment Analysis (DL model)
  • BERT for Token & Sequence Classification
  • DistilBERT for Token & Sequence Classification
  • @@ -331,8 +331,10 @@

    NLP Features

  • Facebook BART NLG, Translation, and Comprehension
  • Zero-Shot NER & Text Classification (ZSL)
  • Neural Machine Translation (MarianMT)
  • +
  • Many-to-Many multilingual translation (Facebook M2M100)
  • Text-To-Text Transfer Transformer (Google T5)
  • Generative Pre-trained Transformer 2 (OpenAI GPT-2)
  • +
  • Chat and Conversational LLMs (Facebook Llama-2)
  • Vision Transformer (Google ViT) Image Classification
  • Microsoft Swin Transformer Image Classification
  • Facebook ConvNext Image Classification
  • diff --git a/docs/api/com/index.html b/docs/api/com/index.html index 3329a4d165063d..4c1d3a2ceb34c5 100644 --- a/docs/api/com/index.html +++ b/docs/api/com/index.html @@ -3,9 +3,9 @@ - Spark NLP 5.2.3 ScalaDoc - com - - + Spark NLP 5.3.0 ScalaDoc - com + + @@ -28,7 +28,7 @@