Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update Spark NLP version #14478

Merged
merged 19 commits into from
Dec 15, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
The table of contents is too big for display.
Diff view
Diff view
  •  
  •  
  •  
The diff you're trying to view is too large. We only load the first 3000 changed files.
33 changes: 33 additions & 0 deletions CHANGELOG
Original file line number Diff line number Diff line change
@@ -1,3 +1,36 @@
========
5.5.1
========
----------------
New Features & Enhancements
----------------
* `BertForMultipleChoice` Transformer Added. Enhanced BERT’s capabilities to handle multiple-choice tasks such as standardized test questions and survey or quiz automation.
* Integrated New Tasks and Documentation:
* Added support and documentation for the following tasks:
* Automatic Speech Recognition
* Dependency Parsing
* Image Captioning
* Image Classification
* Landing Page
* Question Answering
* Summarization
* Table Question Answering
* Text Classification
* Text Generation
* Text Preprocessing
* Token Classification
* Translation
* Zero-Shot Classification
* Zero-Shot Image Classification
* `PromptAssembler` Annotator Introduced. Introduced a new annotator that constructs prompts for LLMs using a chat template and a sequence of messages. Accepts an array of tuples with roles (“system”, “user”, “assistant”) and message texts. Utilizes llama.cpp as a backend for template parsing, supporting basic template applications.

----------------
Bug Fixes
----------------
* Resolved Pretrained Model Loading Issue on DBFS Systems.
* Fixed a bug where pretrained models were not found when running AutoGGUF model pipelines on Databricks due to incorrect path handling of gguf files.


========
5.5.0
========
Expand Down
14 changes: 7 additions & 7 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -63,7 +63,7 @@ $ java -version
$ conda create -n sparknlp python=3.7 -y
$ conda activate sparknlp
# spark-nlp by default is based on pyspark 3.x
$ pip install spark-nlp==5.5.0 pyspark==3.3.1
$ pip install spark-nlp==5.5.1 pyspark==3.3.1
```

In Python console or Jupyter `Python3` kernel:
Expand Down Expand Up @@ -129,7 +129,7 @@ For a quick example of using pipelines and models take a look at our official [d

### Apache Spark Support

Spark NLP *5.5.0* has been built on top of Apache Spark 3.4 while fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x
Spark NLP *5.5.1* has been built on top of Apache Spark 3.4 while fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x

| Spark NLP | Apache Spark 3.5.x | Apache Spark 3.4.x | Apache Spark 3.3.x | Apache Spark 3.2.x | Apache Spark 3.1.x | Apache Spark 3.0.x | Apache Spark 2.4.x | Apache Spark 2.3.x |
|-----------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|
Expand Down Expand Up @@ -157,7 +157,7 @@ Find out more about 4.x `SparkNLP` versions in our official [documentation](http

### Databricks Support

Spark NLP 5.5.0 has been tested and is compatible with the following runtimes:
Spark NLP 5.5.1 has been tested and is compatible with the following runtimes:

| **CPU** | **GPU** |
|--------------------|--------------------|
Expand All @@ -174,7 +174,7 @@ We are compatible with older runtimes. For a full list check databricks support

### EMR Support

Spark NLP 5.5.0 has been tested and is compatible with the following EMR releases:
Spark NLP 5.5.1 has been tested and is compatible with the following EMR releases:

| **EMR Release** |
|--------------------|
Expand Down Expand Up @@ -205,7 +205,7 @@ deployed to Maven central. To add any of our packages as a dependency in your ap
from our official documentation.

If you are interested, there is a simple SBT project for Spark NLP to guide you on how to use it in your
projects [Spark NLP SBT S5.5.0r](https://github.com/maziyarpanahi/spark-nlp-starter)
projects [Spark NLP SBT S5.5.1r](https://github.com/maziyarpanahi/spark-nlp-starter)

### Python

Expand Down Expand Up @@ -250,7 +250,7 @@ In Spark NLP we can define S3 locations to:

Please check [these instructions](https://sparknlp.org/docs/en/install#s3-integration) from our official documentation.

## Document5.5.0
## Document5.5.1

### Examples

Expand Down Expand Up @@ -283,7 +283,7 @@ the Spark NLP library:
keywords = {Spark, Natural language processing, Deep learning, Tensorflow, Cluster},
abstract = {Spark NLP is a Natural Language Processing (NLP) library built on top of Apache Spark ML. It provides simple, performant & accurate NLP annotations for machine learning pipelines that can scale easily in a distributed environment. Spark NLP comes with 1100+ pretrained pipelines and models in more than 192+ languages. It supports nearly all the NLP tasks and modules that can be used seamlessly in a cluster. Downloaded more than 2.7 million times and experiencing 9x growth since January 2020, Spark NLP is used by 54% of healthcare organizations as the world’s most widely used NLP library in the enterprise.}
}
}5.5.0
}5.5.1
```

## Community support
Expand Down
9 changes: 5 additions & 4 deletions build.sbt
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@ name := getPackageName(is_silicon, is_gpu, is_aarch64)

organization := "com.johnsnowlabs.nlp"

version := "5.5.0"
version := "5.5.1"

(ThisBuild / scalaVersion) := scalaVer

Expand Down Expand Up @@ -156,7 +156,8 @@ lazy val utilDependencies = Seq(
exclude ("com.fasterxml.jackson.dataformat", "jackson-dataformat-cbor"),
greex,
azureIdentity,
azureStorage)
azureStorage,
jsoup)

lazy val typedDependencyParserDependencies = Seq(junit)

Expand Down Expand Up @@ -185,8 +186,8 @@ val llamaCppDependencies =
Seq(llamaCppGPU)
else if (is_silicon.equals("true"))
Seq(llamaCppSilicon)
// else if (is_aarch64.equals("true"))
// Seq(openVinoCPU)
else if (is_aarch64.equals("true"))
Seq(llamaCppAarch64)
else
Seq(llamaCppCPU)

Expand Down
4 changes: 2 additions & 2 deletions conda/meta.yaml
Original file line number Diff line number Diff line change
@@ -1,13 +1,13 @@
{% set name = "spark-nlp" %}
{% set version = "5.5.0" %}
{% set version = "5.5.1" %}

package:
name: {{ name|lower }}
version: {{ version }}

source:
url: https://pypi.io/packages/source/{{ name[0] }}/{{ name }}/spark-nlp-{{ version }}.tar.gz
sha256: edc71585f462f548770bd13899686f10d88fa4a4a6e201bc1bf9c7711e398dc0
sha256: e8ddaf939a1b0acbe0d7b6d6a67f7fa0c5a73339d9e4563e3c1aba1cf0039409

build:
noarch: python
Expand Down
2 changes: 2 additions & 0 deletions docs/_data/navigation.yml
Original file line number Diff line number Diff line change
Expand Up @@ -44,6 +44,8 @@ sparknlp:
url: /docs/en/pipelines
- title: General Concepts
url: /docs/en/concepts
- title: Tasks
url: /docs/en/tasks/landing_page
- title: Annotators
url: /docs/en/annotators
- title: Transformers
Expand Down
2 changes: 1 addition & 1 deletion docs/_layouts/landing.html
Original file line number Diff line number Diff line change
Expand Up @@ -201,7 +201,7 @@ <h3 class="grey h3_title">{{ _section.title }}</h3>
<div class="highlight-box">
{% highlight bash %}
# Using PyPI
$ pip install spark-nlp==5.5.0
$ pip install spark-nlp==5.5.1

# Using Anaconda/Conda
$ conda install -c johnsnowlabs spark-nlp
Expand Down
101 changes: 101 additions & 0 deletions docs/_posts/Cabir40/2024-10-21-bge_medembed_base_v0_1_en.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,101 @@
---
layout: model
title: English bge_medembed_base_v0_1 BGEEmbeddings from abhinand
author: John Snow Labs
name: bge_medembed_base_v0_1
date: 2024-10-21
tags: [embedding, en, open_source, bge, medical, onnx]
task: Embeddings
language: en
edition: Spark NLP 5.5.0
spark_version: 3.0
supported: true
engine: onnx
annotator: BGEEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained BGEEmbeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.
`bge_medembed_base_v0_1` is a English model originally trained by abhinand

{:.btn-box}
<button class="button button-orange" disabled>Live Demo</button>
<button class="button button-orange" disabled>Open in Colab</button>
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bge_medembed_base_v0_1_en_5.5.0_3.0_1729515433167.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bge_medembed_base_v0_1_en_5.5.0_3.0_1729515433167.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use



<div class="tabs-box" markdown="1">
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python

document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")

embeddings = BGEEmbeddings.pretrained("bge_medembed_base_v0_1","en")\
.setInputCols(["document"])\
.setOutputCol("embeddings")

pipeline = Pipeline(
stages = [
document_assembler,
embeddings
])

data = spark.createDataFrame([["I love spark-nlp"]]).toDF("text")

result = pipeline.fit(data).transform(data)

```
```scala

val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")

val embeddings = BGEEmbeddings.pretrained("bge_medembed_base_v0_1","en")
.setInputCols(Array("document"))
.setOutputCol("embeddings")

val pipeline = new Pipeline().setStages(Array(document_assembler, embeddings))

val data = Seq("I love spark-nlp").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)

```
</div>

## Results

```bash

+----------------------------------------------------------------------------------------------------+
| bge_embedding|
+----------------------------------------------------------------------------------------------------+
|[{sentence_embeddings, 0, 15, I love spark-nlp, {sentence -> 0}, [-0.018065551, -0.032784615, 0.0...|
+----------------------------------------------------------------------------------------------------+

```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bge_medembed_base_v0_1|
|Compatibility:|Spark NLP 5.5.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document]|
|Output Labels:|[bge]|
|Language:|en|
|Size:|389.7 MB|
101 changes: 101 additions & 0 deletions docs/_posts/Cabir40/2024-10-21-bge_medembed_large_v0_1_en.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,101 @@
---
layout: model
title: English bge_medembed_large_v0_1 BGEEmbeddings from abhinand
author: John Snow Labs
name: bge_medembed_large_v0_1
date: 2024-10-21
tags: [embedding, en, open_source, bge, medical, onnx]
task: Embeddings
language: en
edition: Spark NLP 5.5.0
spark_version: 3.0
supported: true
engine: onnx
annotator: BGEEmbeddings
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained BGEEmbeddings model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP.
`bge_medembed_large_v0_1` is a English model originally trained by abhinand

{:.btn-box}
<button class="button button-orange" disabled>Live Demo</button>
<button class="button button-orange" disabled>Open in Colab</button>
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/bge_medembed_large_v0_1_en_5.5.0_3.0_1729515260623.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/bge_medembed_large_v0_1_en_5.5.0_3.0_1729515260623.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use



<div class="tabs-box" markdown="1">
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python

document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")

embeddings = BGEEmbeddings.pretrained("bge_medembed_large_v0_1","en")\
.setInputCols(["document"])\
.setOutputCol("embeddings")

pipeline = Pipeline(
stages = [
document_assembler,
embeddings
])

data = spark.createDataFrame([["I love spark-nlp"]]).toDF("text")

result = pipeline.fit(data).transform(data)

```
```scala

val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")

val embeddings = BGEEmbeddings.pretrained("bge_medembed_large_v0_1","en")
.setInputCols(Array("document"))
.setOutputCol("embeddings")

val pipeline = new Pipeline().setStages(Array(document_assembler, embeddings))

val data = Seq("I love spark-nlp").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)

```
</div>

## Results

```bash

+----------------------------------------------------------------------------------------------------+
| bge_embedding|
+----------------------------------------------------------------------------------------------------+
|[{sentence_embeddings, 0, 15, I love spark-nlp, {sentence -> 0}, [-0.018065551, -0.032784615, 0.0...|
+----------------------------------------------------------------------------------------------------+

```

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|bge_medembed_large_v0_1|
|Compatibility:|Spark NLP 5.5.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[document]|
|Output Labels:|[bge]|
|Language:|en|
|Size:|1.2 GB|
Loading