Merge pull request #14389 from JohnSnowLabs/release/550-release-candidate

release/550-release-candidate
maziyarpanahi authored Sep 25, 2024
2 parents 1336da0 + cc38757 commit d3d287b
Showing 1,799 changed files with 237,922 additions and 14,950 deletions.
24 changes: 23 additions & 1 deletion CHANGELOG
@@ -1,3 +1,26 @@
========
5.5.0
========
----------------
New Features & Enhancements
----------------
* Introduced QWEN2Transformer (#14188)
* Introduced MiniCPM (#14205)
* Introduced NLLB (#14209)
* Implemented Nomic embeddings (#14217)
* Introduced CamemBertForZeroShotClassification annotator (#14354)
* Implemented Mxbai Embeddings (#14355)
* Introduced AlbertForZeroShotClassification (#14361)
* Introduced Phi-3 (#14373)
* Implemented Starcoder2 for causal language modeling (#14358)
* Integrated llama.cpp (#14364)
* Implemented SnowFlake (#14353)
* Introduced ONNX support to vision annotators (#14356)
* Introduced ONNX and OpenVINO support to Missing Annotators (#14359)
* Added OpenVINO install instructions (#14382)
* Exported notebooks for release candidate (#14393)


========
5.4.2
========
@@ -9,7 +32,6 @@ New Features & Enhancements
* Added aggressiveMatching parameter to DocumentSimilarityRanker annotator



========
5.4.1
========
53 changes: 38 additions & 15 deletions README.md
@@ -17,19 +17,31 @@
<img src="https://static.pepy.tech/personalized-badge/spark-nlp?period=total&units=international_system&left_color=grey&right_color=orange&left_text=pip%20downloads" /></a>
</p>

Spark NLP is a state-of-the-art Natural Language Processing library built on top of Apache Spark. It provides **simple**, **performant** & **accurate** NLP annotations for machine learning pipelines that **scale** easily in a distributed
environment.
Spark NLP comes with **36000+** pretrained **pipelines** and **models** in more than **200+** languages.
Spark NLP is a state-of-the-art Natural Language Processing library built on top of Apache Spark. It provides **simple**, **performant** & **accurate** NLP annotations for machine learning pipelines that **scale** easily in a distributed environment.

Spark NLP comes with **83000+** pretrained **pipelines** and **models** in **200+** languages.
It also offers tasks such as **Tokenization**, **Word Segmentation**, **Part-of-Speech Tagging**, Word and Sentence **Embeddings**, **Named Entity Recognition**, **Dependency Parsing**, **Spell Checking**, **Text Classification**, **Sentiment Analysis**, **Token Classification**, **Machine Translation** (+180 languages), **Summarization**, **Question Answering**, **Table Question Answering**, **Text Generation**, **Image Classification**, **Image to Text (captioning)**, **Automatic Speech Recognition**, **Zero-Shot Learning**, and many more [NLP tasks](#features).

**Spark NLP** is the only open-source NLP library in **production** that offers state-of-the-art transformers such as **BERT**, **CamemBERT**, **ALBERT**, **ELECTRA**, **XLNet**, **DistilBERT**, **RoBERTa**, **DeBERTa**, **XLM-RoBERTa**, **Longformer**, **ELMO**, **Universal Sentence Encoder**, **Llama-2**, **M2M100**, **BART**, **Instructor**, **E5**, **Google T5**, **MarianMT**, **OpenAI GPT2**, **Vision Transformers (ViT)**, **OpenAI Whisper**, and many more not only to **Python** and **R**, but also to **JVM** ecosystem (**Java**, **Scala**, and **Kotlin**) at **scale** by extending **Apache Spark** natively.
**Spark NLP** is the only open-source NLP library in **production** that offers state-of-the-art transformers such as **BERT**, **CamemBERT**, **ALBERT**, **ELECTRA**, **XLNet**, **DistilBERT**, **RoBERTa**, **DeBERTa**, **XLM-RoBERTa**, **Longformer**, **ELMO**, **Universal Sentence Encoder**, **Llama-2**, **M2M100**, **BART**, **Instructor**, **E5**, **Google T5**, **MarianMT**, **OpenAI GPT2**, **Vision Transformers (ViT)**, **OpenAI Whisper**, **Llama**, **Mistral**, **Phi**, **Qwen2**, and many more not only to **Python** and **R**, but also to **JVM** ecosystem (**Java**, **Scala**, and **Kotlin**) at **scale** by extending **Apache Spark** natively.

## Model Importing Support

Spark NLP provides easy support for importing models from various popular frameworks:

- **TensorFlow**
- **ONNX**
- **OpenVINO**
- **Llama.cpp (GGUF)**

This wide range of support allows you to seamlessly integrate models from different sources into your Spark NLP workflows, enhancing flexibility and compatibility with existing machine learning ecosystems.
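
For illustration, here is a minimal sketch of what importing an externally exported model can look like in Python. It assumes a BERT checkpoint that has already been exported to ONNX into a local folder (the path is a placeholder) and uses the `loadSavedModel` helper exposed by annotators such as `BertEmbeddings`; the exact export steps are covered in the import notebooks in the documentation.

```python
import sparknlp
from sparknlp.annotator import BertEmbeddings

spark = sparknlp.start()

# Placeholder path: a BERT model previously exported to ONNX (e.g. with Hugging Face tooling).
exported_path = "/tmp/bert_base_cased_onnx"

# Wrap the exported weights in a Spark NLP annotator and persist it
# as a regular Spark NLP model that can be reused later, even fully offline.
embeddings = (
    BertEmbeddings.loadSavedModel(exported_path, spark)
    .setInputCols(["document", "token"])
    .setOutputCol("embeddings")
)
embeddings.write().overwrite().save("/tmp/bert_base_cased_spark_nlp")
```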

## Project's website

Take a look at our official Spark NLP page: [https://sparknlp.org/](https://sparknlp.org/) for user
documentation and examples

## Features

- [Text Preprocessing](https://sparknlp.org/docs/en/features#text-preproccesing)
- [Parsing and Analysis](https://sparknlp.org/docs/en/features#parsing-and-analysis)
- [Sentiment and Classification](https://sparknlp.org/docs/en/features#sentiment-and-classification)
@@ -51,7 +63,7 @@ $ java -version
$ conda create -n sparknlp python=3.7 -y
$ conda activate sparknlp
# spark-nlp by default is based on pyspark 3.x
$ pip install spark-nlp==5.4.0 pyspark==3.3.1
$ pip install spark-nlp==5.5.0 pyspark==3.3.1
```

In Python console or Jupyter `Python3` kernel:
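
The quick-start snippet elided by the hunk below typically looks like the following sketch; it assumes internet access so that the `explain_document_dl` pretrained pipeline can be downloaded on first use.

```python
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

# Start a SparkSession with Spark NLP attached (downloads the matching JAR if needed).
spark = sparknlp.start()

# Download a pretrained pipeline and annotate a single sentence.
pipeline = PretrainedPipeline("explain_document_dl", lang="en")
annotations = pipeline.annotate("Spark NLP ships thousands of pretrained pipelines and models.")
print(annotations["token"])
```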
@@ -108,6 +120,7 @@ community and we had to build most of the dependencies by ourselves to make them
architectures, however, they may not work in some environments.

## Pipelines and Models

For a quick example of using pipelines and models take a look at our official [documentation](https://sparknlp.org/docs/en/install#pipelines-and-models)
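
As an inline sketch of the same idea, Spark NLP annotators compose like any other Spark ML stage; the column names below are purely illustrative.

```python
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer
from pyspark.ml import Pipeline

spark = sparknlp.start()

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

data = spark.createDataFrame([["Spark NLP annotators plug into regular Spark ML pipelines."]]).toDF("text")

pipeline = Pipeline(stages=[document_assembler, tokenizer])
result = pipeline.fit(data).transform(data)
result.selectExpr("token.result").show(truncate=False)
```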

#### Please check out our Models Hub for the full list of [pre-trained models](https://sparknlp.org/models) with examples, demos, benchmarks, and more
@@ -116,10 +129,11 @@ For a quick example of using pipelines and models take a look at our official [d

### Apache Spark Support

Spark NLP *5.4.0* has been built on top of Apache Spark 3.4 and fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x
Spark NLP *5.5.0* has been built on top of Apache Spark 3.4 and fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x

| Spark NLP | Apache Spark 3.5.x | Apache Spark 3.4.x | Apache Spark 3.3.x | Apache Spark 3.2.x | Apache Spark 3.1.x | Apache Spark 3.0.x | Apache Spark 2.4.x | Apache Spark 2.3.x |
|-----------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|
| 5.5.x | YES | YES | YES | YES | YES | YES | NO | NO |
| 5.4.x | YES | YES | YES | YES | YES | YES | NO | NO |
| 5.3.x | YES | YES | YES | YES | YES | YES | NO | NO |
| 5.2.x | YES | YES | YES | YES | YES | YES | NO | NO |
@@ -132,6 +146,8 @@ Find out more about `Spark NLP` versions from our [release notes](https://github

| Spark NLP | Python 3.6 | Python 3.7 | Python 3.8 | Python 3.9 | Python 3.10| Scala 2.11 | Scala 2.12 |
|-----------|------------|------------|------------|------------|------------|------------|------------|
| 5.5.x | NO | YES | YES | YES | YES | NO | YES |
| 5.4.x | NO | YES | YES | YES | YES | NO | YES |
| 5.3.x | NO | YES | YES | YES | YES | NO | YES |
| 5.2.x | NO | YES | YES | YES | YES | NO | YES |
| 5.1.x | NO | YES | YES | YES | YES | NO | YES |
@@ -141,38 +157,45 @@ Find out more about 4.x `SparkNLP` versions in our official [documentation](http

### Databricks Support

Spark NLP 5.4.0 has been tested and is compatible with the following runtimes:
Spark NLP 5.5.0 has been tested and is compatible with the following runtimes:

| **CPU** | **GPU** |
|--------------------|--------------------|
| 14.0 / 14.0 ML | 14.0 ML & GPU |
| 14.1 / 14.1 ML | 14.1 ML & GPU |
| 14.2 / 14.2 ML | 14.2 ML & GPU |
| 14.3 / 14.3 ML | 14.3 ML & GPU |
| 15.0 / 15.0 ML | 15.0 ML & GPU |
| 15.1 / 15.1 ML | 15.1 ML & GPU |
| 15.2 / 15.2 ML | 15.2 ML & GPU |
| 15.3 / 15.3 ML | 15.3 ML & GPU |
| 15.4 / 15.4 ML | 15.4 ML & GPU |

We are compatible with older runtimes. For a full list, check Databricks support in our official [documentation](https://sparknlp.org/docs/en/install#databricks-support).

### EMR Support

Spark NLP 5.4.0 has been tested and is compatible with the following EMR releases:
Spark NLP 5.5.0 has been tested and is compatible with the following EMR releases:

| **EMR Release** |
|--------------------|
| emr-6.13.0 |
| emr-6.14.0 |
| emr-6.15.0 |
| emr-7.0.0 |
| emr-7.1.0 |
| emr-7.2.0 |

We are compatible with older EMR releases. For a full list, check EMR support in our official [documentation](https://sparknlp.org/docs/en/install#emr-support).

Full list of [Amazon EMR 6.x releases](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-6x.html)
Full list of [Amazon EMR 7.x releases](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-7x.html)

NOTE: EMR releases 6.1.0 and 6.1.1 are not supported.

## Installation

### Command line (requires internet connection)

To install spark-nlp packages through the command line, follow [these instructions](https://sparknlp.org/docs/en/install#command-line) from our official documentation.

### Scala
@@ -182,18 +205,19 @@ deployed to Maven central. To add any of our packages as a dependency in your ap
from our official documentation.

If you are interested, there is a simple SBT project for Spark NLP to guide you on how to use it in your
projects [Spark NLP SBT Starter](https://github.com/maziyarpanahi/spark-nlp-starter)

### Python

Spark NLP supports Python 3.7.x and above depending on your major PySpark version.
Check all available installations for Python in our official [documentation](https://sparknlp.org/docs/en/install#python)


### Compiled JARs

To compile the jars from source, follow [these instructions](https://sparknlp.org/docs/en/compiled#jars) from our official documentation.

## Platform-Specific Instructions

For detailed instructions on how to use Spark NLP on supported platforms, please refer to our official documentation:

| Platform | Supported Language(s) |
@@ -206,7 +230,6 @@ For detailed instructions on how to use Spark NLP on supported platforms, please
| [EMR Cluster](https://sparknlp.org/docs/en/install#emr-cluster) | Scala, Python |
| [GCP Dataproc Cluster](https://sparknlp.org/docs/en/install#gcp-dataproc) | Scala, Python |


### Offline

Spark NLP library and all the pre-trained models/pipelines can be used entirely offline with no access to the Internet.
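
Since every pretrained pipeline is a regular Spark ML `PipelineModel` on disk, offline use boils down to loading it from a local path. A minimal sketch, assuming the pipeline archive was downloaded and unzipped beforehand (the folder name is a placeholder):

```python
import sparknlp
from pyspark.ml import PipelineModel

spark = sparknlp.start()

# Placeholder path: a pretrained pipeline downloaded earlier from the Models Hub and unzipped locally.
offline_pipeline = PipelineModel.load("/models/explain_document_dl_en")

df = spark.createDataFrame([["Offline inference needs no internet access."]]).toDF("text")
offline_pipeline.transform(df).select("token.result").show(truncate=False)
```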
@@ -227,7 +250,7 @@ In Spark NLP we can define S3 locations to:

Please check [these instructions](https://sparknlp.org/docs/en/install#s3-integration) from our official documentation.

## Documentation

### Examples

@@ -260,7 +283,7 @@ the Spark NLP library:
keywords = {Spark, Natural language processing, Deep learning, Tensorflow, Cluster},
abstract = {Spark NLP is a Natural Language Processing (NLP) library built on top of Apache Spark ML. It provides simple, performant & accurate NLP annotations for machine learning pipelines that can scale easily in a distributed environment. Spark NLP comes with 1100+ pretrained pipelines and models in more than 192+ languages. It supports nearly all the NLP tasks and modules that can be used seamlessly in a cluster. Downloaded more than 2.7 million times and experiencing 9x growth since January 2020, Spark NLP is used by 54% of healthcare organizations as the world’s most widely used NLP library in the enterprise.}
}
```

## Community support
13 changes: 12 additions & 1 deletion build.sbt
@@ -6,7 +6,7 @@ name := getPackageName(is_silicon, is_gpu, is_aarch64)

organization := "com.johnsnowlabs.nlp"

version := "5.4.2"
version := "5.5.0"

(ThisBuild / scalaVersion) := scalaVer

@@ -180,6 +180,16 @@ val onnxDependencies: Seq[sbt.ModuleID] =
else
Seq(onnxCPU)

val llamaCppDependencies =
if (is_gpu.equals("true"))
Seq(llamaCppGPU)
else if (is_silicon.equals("true"))
Seq(llamaCppSilicon)
// else if (is_aarch64.equals("true"))
// Seq(openVinoCPU)
else
Seq(llamaCppCPU)

val openVinoDependencies: Seq[sbt.ModuleID] =
if (is_gpu.equals("true"))
Seq(openVinoGPU)
@@ -202,6 +212,7 @@ lazy val root = (project in file("."))
utilDependencies ++
tensorflowDependencies ++
onnxDependencies ++
llamaCppDependencies ++
openVinoDependencies ++
typedDependencyParserDependencies,
// TODO potentially improve this?
2 changes: 1 addition & 1 deletion docs/_layouts/landing.html
@@ -201,7 +201,7 @@ <h3 class="grey h3_title">{{ _section.title }}</h3>
<div class="highlight-box">
{% highlight bash %}
# Using PyPI
$ pip install spark-nlp==5.4.2
$ pip install spark-nlp==5.5.0

# Using Anaconda/Conda
$ conda install -c johnsnowlabs spark-nlp
8 changes: 4 additions & 4 deletions docs/api/com/index.html
@@ -3,9 +3,9 @@
<head>
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no" />
<title>Spark NLP 5.4.2 ScalaDoc - com</title>
<meta name="description" content="Spark NLP 5.4.2 ScalaDoc - com" />
<meta name="keywords" content="Spark NLP 5.4.2 ScalaDoc com" />
<title>Spark NLP 5.5.0 ScalaDoc - com</title>
<meta name="description" content="Spark NLP 5.5.0 ScalaDoc - com" />
<meta name="keywords" content="Spark NLP 5.5.0 ScalaDoc com" />
<meta http-equiv="content-type" content="text/html; charset=UTF-8" />


@@ -28,7 +28,7 @@
</head>
<body>
<div id="search">
<span id="doc-title">Spark NLP 5.4.2 ScalaDoc<span id="doc-version"></span></span>
<span id="doc-title">Spark NLP 5.5.0 ScalaDoc<span id="doc-version"></span></span>
<span class="close-results"><span class="left">&lt;</span> Back</span>
<div id="textfilter">
<span class="input">
8 changes: 4 additions & 4 deletions docs/api/com/johnsnowlabs/client/CloudClient.html
@@ -3,9 +3,9 @@
<head>
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no" />
<title>Spark NLP 5.4.2 ScalaDoc - com.johnsnowlabs.client.CloudClient</title>
<meta name="description" content="Spark NLP 5.4.2 ScalaDoc - com.johnsnowlabs.client.CloudClient" />
<meta name="keywords" content="Spark NLP 5.4.2 ScalaDoc com.johnsnowlabs.client.CloudClient" />
<title>Spark NLP 5.5.0 ScalaDoc - com.johnsnowlabs.client.CloudClient</title>
<meta name="description" content="Spark NLP 5.5.0 ScalaDoc - com.johnsnowlabs.client.CloudClient" />
<meta name="keywords" content="Spark NLP 5.5.0 ScalaDoc com.johnsnowlabs.client.CloudClient" />
<meta http-equiv="content-type" content="text/html; charset=UTF-8" />


@@ -28,7 +28,7 @@
</head>
<body>
<div id="search">
<span id="doc-title">Spark NLP 5.4.2 ScalaDoc<span id="doc-version"></span></span>
<span id="doc-title">Spark NLP 5.5.0 ScalaDoc<span id="doc-version"></span></span>
<span class="close-results"><span class="left">&lt;</span> Back</span>
<div id="textfilter">
<span class="input">
8 changes: 4 additions & 4 deletions docs/api/com/johnsnowlabs/client/CloudManager.html
@@ -3,9 +3,9 @@
<head>
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no" />
<title>Spark NLP 5.4.2 ScalaDoc - com.johnsnowlabs.client.CloudManager</title>
<meta name="description" content="Spark NLP 5.4.2 ScalaDoc - com.johnsnowlabs.client.CloudManager" />
<meta name="keywords" content="Spark NLP 5.4.2 ScalaDoc com.johnsnowlabs.client.CloudManager" />
<title>Spark NLP 5.5.0 ScalaDoc - com.johnsnowlabs.client.CloudManager</title>
<meta name="description" content="Spark NLP 5.5.0 ScalaDoc - com.johnsnowlabs.client.CloudManager" />
<meta name="keywords" content="Spark NLP 5.5.0 ScalaDoc com.johnsnowlabs.client.CloudManager" />
<meta http-equiv="content-type" content="text/html; charset=UTF-8" />


@@ -28,7 +28,7 @@
</head>
<body>
<div id="search">
<span id="doc-title">Spark NLP 5.4.2 ScalaDoc<span id="doc-version"></span></span>
<span id="doc-title">Spark NLP 5.5.0 ScalaDoc<span id="doc-version"></span></span>
<span class="close-results"><span class="left">&lt;</span> Back</span>
<div id="textfilter">
<span class="input">
8 changes: 4 additions & 4 deletions docs/api/com/johnsnowlabs/client/CloudResources$.html
@@ -3,9 +3,9 @@
<head>
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no" />
<title>Spark NLP 5.4.2 ScalaDoc - com.johnsnowlabs.client.CloudResources</title>
<meta name="description" content="Spark NLP 5.4.2 ScalaDoc - com.johnsnowlabs.client.CloudResources" />
<meta name="keywords" content="Spark NLP 5.4.2 ScalaDoc com.johnsnowlabs.client.CloudResources" />
<title>Spark NLP 5.5.0 ScalaDoc - com.johnsnowlabs.client.CloudResources</title>
<meta name="description" content="Spark NLP 5.5.0 ScalaDoc - com.johnsnowlabs.client.CloudResources" />
<meta name="keywords" content="Spark NLP 5.5.0 ScalaDoc com.johnsnowlabs.client.CloudResources" />
<meta http-equiv="content-type" content="text/html; charset=UTF-8" />


@@ -28,7 +28,7 @@
</head>
<body>
<div id="search">
<span id="doc-title">Spark NLP 5.4.2 ScalaDoc<span id="doc-version"></span></span>
<span id="doc-title">Spark NLP 5.5.0 ScalaDoc<span id="doc-version"></span></span>
<span class="close-results"><span class="left">&lt;</span> Back</span>
<div id="textfilter">
<span class="input">
8 changes: 4 additions & 4 deletions docs/api/com/johnsnowlabs/client/CloudStorage.html
@@ -3,9 +3,9 @@
<head>
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no" />
<title>Spark NLP 5.4.2 ScalaDoc - com.johnsnowlabs.client.CloudStorage</title>
<meta name="description" content="Spark NLP 5.4.2 ScalaDoc - com.johnsnowlabs.client.CloudStorage" />
<meta name="keywords" content="Spark NLP 5.4.2 ScalaDoc com.johnsnowlabs.client.CloudStorage" />
<title>Spark NLP 5.5.0 ScalaDoc - com.johnsnowlabs.client.CloudStorage</title>
<meta name="description" content="Spark NLP 5.5.0 ScalaDoc - com.johnsnowlabs.client.CloudStorage" />
<meta name="keywords" content="Spark NLP 5.5.0 ScalaDoc com.johnsnowlabs.client.CloudStorage" />
<meta http-equiv="content-type" content="text/html; charset=UTF-8" />


@@ -28,7 +28,7 @@
</head>
<body>
<div id="search">
<span id="doc-title">Spark NLP 5.4.2 ScalaDoc<span id="doc-version"></span></span>
<span id="doc-title">Spark NLP 5.5.0 ScalaDoc<span id="doc-version"></span></span>
<span class="close-results"><span class="left">&lt;</span> Back</span>
<div id="textfilter">
<span class="input">