release/550-release-candidate #14389

Merged: 27 commits, Sep 25, 2024

Commits (27)
625f933  SparkNLP 997 Introducing QWEN2Transformer (#14188) (prabod, Sep 1, 2024)
2102e2d  SparkNLP 1004 - Introducing MiniCPM (#14205) (prabod, Sep 1, 2024)
c68be6a  SparkNLP 1018 - Introducing NLLB (#14209) (prabod, Sep 1, 2024)
803edf6  SparkNLP 1005 implement nomic embeddings (#14217) (prabod, Sep 1, 2024)
50a6966  implementing SnowFlake (#14353) (ahmedlone127, Sep 1, 2024)
707bb16  adding Mxbai (#14355) (ahmedlone127, Sep 1, 2024)
276a84b  Introducing onnx support to vision annotators (#14356) (ahmedlone127, Sep 1, 2024)
f47ee50  Introducing onnx and OpenVino support to Missing Annotators (#14359) (ahmedlone127, Sep 1, 2024)
9d94b9a  [SPARKNLP-855] Introducing AlbertForZeroShotClassification (#14361) (danilojsl, Sep 1, 2024)
1caf296  SparkNLP introducing Phi-3 (#14373) (prabod, Sep 1, 2024)
82d37a6  OpenVINO install instructions (#14382) (DevinTDHa, Sep 1, 2024)
66d94a4  SPARKNLP 1034 implement starcoder2 for causal lm (#14358) (prabod, Sep 2, 2024)
9285df8  fix missing Optional input in the signature (maziyarpanahi, Sep 2, 2024)
f4fd4e7  SPARKNLP Introducing LLAMA 3 (#14379) (prabod, Sep 3, 2024)
8d4dc21  550 rc export notebooks (#14393) (prabod, Sep 5, 2024)
c2c0e48  [SPARKNLP-1027] llama.cpp integration (#14364) (DevinTDHa, Sep 5, 2024)
9e4b1ad  Merge branch 'master' into release/550-release-candidate (maziyarpanahi, Sep 6, 2024)
ba67de3  bump version (maziyarpanahi, Sep 6, 2024)
73fe300  Merge branch 'release/550-release-candidate' of https://github.com/Jo… (maziyarpanahi, Sep 6, 2024)
22e4e78  upgrade onnxruntime to 1.19.2 (maziyarpanahi, Sep 6, 2024)
141d38c  tested + updated ipynb notebooks (ahmedlone127, Sep 9, 2024)
7588484  Update install.md (maziyarpanahi, Sep 10, 2024)
83a6e7e  Adding openvino support to missing annotators (#14390) (ahmedlone127, Sep 22, 2024)
ba18698  [SPARKNLP-1027] Change Default AutoGGUF pretrained model (#14411) (DevinTDHa, Sep 24, 2024)
7f05669  Bump version to 5.5.0 [run doc] (maziyarpanahi, Sep 24, 2024)
1a112cd  Update Scala and Python APIs (actions-user, Sep 24, 2024)
cc38757  [SPARKNLP-1027] Fix issue with pretrained model (#14413) (DevinTDHa, Sep 24, 2024)
Files changed (showing changes from 20 of 27 commits)
README.md (8 additions, 8 deletions)

@@ -51,7 +51,7 @@ $ java -version
$ conda create -n sparknlp python=3.7 -y
$ conda activate sparknlp
# spark-nlp by default is based on pyspark 3.x
-$ pip install spark-nlp==5.4.0 pyspark==3.3.1
+$ pip install spark-nlp==5.5.0-rc1 pyspark==3.3.1
```

In Python console or Jupyter `Python3` kernel:
@@ -116,7 +116,7 @@ For a quick example of using pipelines and models take a look at our official [d

### Apache Spark Support

-Spark NLP *5.4.0* has been built on top of Apache Spark 3.4 while fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x
+Spark NLP *5.5.0-rc1* has been built on top of Apache Spark 3.4 while fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x

| Spark NLP | Apache Spark 3.5.x | Apache Spark 3.4.x | Apache Spark 3.3.x | Apache Spark 3.2.x | Apache Spark 3.1.x | Apache Spark 3.0.x | Apache Spark 2.4.x | Apache Spark 2.3.x |
|-----------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|
@@ -141,7 +141,7 @@ Find out more about 4.x `SparkNLP` versions in our official [documentation](http

### Databricks Support

-Spark NLP 5.4.0 has been tested and is compatible with the following runtimes:
+Spark NLP 5.5.0-rc1 has been tested and is compatible with the following runtimes:

| **CPU** | **GPU** |
|--------------------|--------------------|
@@ -154,7 +154,7 @@ We are compatible with older runtimes. For a full list check databricks support

### EMR Support

-Spark NLP 5.4.0 has been tested and is compatible with the following EMR releases:
+Spark NLP 5.5.0-rc1 has been tested and is compatible with the following EMR releases:

| **EMR Release** |
|--------------------|
@@ -166,7 +166,7 @@ Spark NLP 5.4.0 has been tested and is compatible with the following EMR release
We are compatible with older EMR releases. For a full list check EMR support in our official [documentation](https://sparknlp.org/docs/en/install#emr-support)

Full list of [Amazon EMR 6.x releases](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-6x.html)
-Full list 5.4.2mazon EMR 7.x releases](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-7x.html)
+Full list 5.5.0-rc1mazon EMR 7.x releases](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-7x.html)

NOTE: The EMR 6.1.0 and 6.1.1 are not supported.

@@ -182,7 +182,7 @@ deployed to Maven central. To add any of our packages as a dependency in your ap
from our official documentation.

If you are interested, there is a simple SBT project for Spark NLP to guide you on how to use it in your
-projects [Spark NLP SBT S5.4.2r](https://github.com/maziyarpanahi/spark-nlp-starter)
+projects [Spark NLP SBT S5.5.0-rc1r](https://github.com/maziyarpanahi/spark-nlp-starter)

### Python

@@ -227,7 +227,7 @@ In Spark NLP we can define S3 locations to:

Please check [these instructions](https://sparknlp.org/docs/en/install#s3-integration) from our official documentation.

-## Document5.4.2
+## Document5.5.0-rc1

### Examples

@@ -260,7 +260,7 @@ the Spark NLP library:
keywords = {Spark, Natural language processing, Deep learning, Tensorflow, Cluster},
abstract = {Spark NLP is a Natural Language Processing (NLP) library built on top of Apache Spark ML. It provides simple, performant & accurate NLP annotations for machine learning pipelines that can scale easily in a distributed environment. Spark NLP comes with 1100+ pretrained pipelines and models in more than 192+ languages. It supports nearly all the NLP tasks and modules that can be used seamlessly in a cluster. Downloaded more than 2.7 million times and experiencing 9x growth since January 2020, Spark NLP is used by 54% of healthcare organizations as the world’s most widely used NLP library in the enterprise.}
}
-}5.4.2
+}5.5.0-rc1
```

## Community support
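A quick way to sanity-check the pinned RC from the README instructions above (a minimal sketch; assumes `pyspark` is installed alongside the `spark-nlp==5.5.0-rc1` pin, and uses the standard `sparknlp.start()` entry point):

```python
# Sketch: confirm the pinned RC resolves and starts a Spark session.
import sparknlp

spark = sparknlp.start()
print(sparknlp.version())  # should report the pinned 5.5.0-rc1 release
```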
build.sbt (12 additions, 1 deletion)

@@ -6,7 +6,7 @@ name := getPackageName(is_silicon, is_gpu, is_aarch64)

organization := "com.johnsnowlabs.nlp"

version := "5.4.2"
version := "5.5.0-rc1"

(ThisBuild / scalaVersion) := scalaVer

@@ -180,6 +180,16 @@ val onnxDependencies: Seq[sbt.ModuleID] =
else
Seq(onnxCPU)

+val llamaCppDependencies =
+  if (is_gpu.equals("true"))
+    Seq(llamaCppGPU)
+  else if (is_silicon.equals("true"))
+    Seq(llamaCppSilicon)
+//  else if (is_aarch64.equals("true"))
+//    Seq(openVinoCPU)
+  else
+    Seq(llamaCppCPU)

val openVinoDependencies: Seq[sbt.ModuleID] =
if (is_gpu.equals("true"))
Seq(openVinoGPU)
@@ -202,6 +212,7 @@ lazy val root = (project in file("."))
utilDependencies ++
tensorflowDependencies ++
onnxDependencies ++
+      llamaCppDependencies ++
openVinoDependencies ++
typedDependencyParserDependencies,
// TODO potentially improve this?
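The new `llamaCppDependencies` block follows the same flag convention as the existing ONNX and OpenVINO blocks, so the llama.cpp artifact variant is picked at build time. Assuming `is_gpu` and `is_silicon` are the usual JVM system properties read at the top of `build.sbt`, selecting a variant would look something like:

```bash
# Sketch: build the GPU assembly so Seq(llamaCppGPU) is selected.
# Assumes is_gpu/is_silicon are -D system properties, as used elsewhere in build.sbt.
sbt -Dis_gpu=true assembly

# Apple Silicon variant, selecting Seq(llamaCppSilicon) instead:
sbt -Dis_silicon=true assembly
```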
2 changes: 1 addition & 1 deletion docs/_layouts/landing.html
Original file line number Diff line number Diff line change
Expand Up @@ -201,7 +201,7 @@ <h3 class="grey h3_title">{{ _section.title }}</h3>
<div class="highlight-box">
{% highlight bash %}
# Using PyPI
-$ pip install spark-nlp==5.4.2
+$ pip install spark-nlp==5.5.0-rc1

# Using Anaconda/Conda
$ conda install -c johnsnowlabs spark-nlp
docs/en/advanced_settings.md (3 additions, 3 deletions)

@@ -52,7 +52,7 @@ spark = SparkSession.builder
.config("spark.kryoserializer.buffer.max", "2000m")
.config("spark.jsl.settings.pretrained.cache_folder", "sample_data/pretrained")
.config("spark.jsl.settings.storage.cluster_tmp_dir", "sample_data/storage")
.config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0")
.config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.5.0-rc1")
.getOrCreate()
```

@@ -66,7 +66,7 @@ spark-shell \
--conf spark.kryoserializer.buffer.max=2000M \
--conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \
--conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \
---packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0
+--packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.5.0-rc1
```

**pyspark:**
@@ -79,7 +79,7 @@ pyspark \
--conf spark.kryoserializer.buffer.max=2000M \
--conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \
--conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \
---packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0
+--packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.5.0-rc1
```

**Databricks:**
docs/en/annotator_entries/AutoGGUF.md (new file, 135 additions)

@@ -0,0 +1,135 @@
{%- capture title -%}
AutoGGUFModel
{%- endcapture -%}

{%- capture description -%}
Annotator that uses the llama.cpp library to generate text completions with large language
models.

For settable parameters, and their explanations, see [HasLlamaCppProperties](https://github.com/JohnSnowLabs/spark-nlp/tree/master/src/main/scala/com/johnsnowlabs/nlp/HasLlamaCppProperties.scala) and refer to
the llama.cpp documentation of
[server.cpp](https://github.com/ggerganov/llama.cpp/tree/7d5e8777ae1d21af99d4f95be10db4870720da91/examples/server)
for more information.

If the parameters are not set, the annotator will default to the parameters provided by the
model.

Pretrained models can be loaded with `pretrained` of the companion object:

```scala
val autoGGUFModel = AutoGGUFModel.pretrained()
.setInputCols("document")
.setOutputCol("completions")
```

The default model is `"gguf-phi3-mini-4k-instruct-q4"`, if no name is provided.

For available pretrained models please see the [Models Hub](https://sparknlp.org/models).

For extended examples of usage, see the
[AutoGGUFModelTest](https://github.com/JohnSnowLabs/spark-nlp/tree/master/src/test/scala/com/johnsnowlabs/nlp/annotators/seq2seq/AutoGGUFModelTest.scala)
and the
[example notebook](https://github.com/JohnSnowLabs/spark-nlp/tree/master/examples/python/llama.cpp/llama.cpp_in_Spark_NLP_AutoGGUFModel.ipynb).

**Note**: To use GPU inference with this annotator, make sure to use the Spark NLP GPU package and set
the number of GPU layers with the `setNGpuLayers` method.

When using larger models, we recommend adjusting GPU usage with `setNCtx` and `setNGpuLayers`
according to your hardware to avoid out-of-memory errors.
{%- endcapture -%}

{%- capture input_anno -%}
DOCUMENT
{%- endcapture -%}

{%- capture output_anno -%}
DOCUMENT
{%- endcapture -%}

{%- capture python_example -%}
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> document = DocumentAssembler() \
... .setInputCol("text") \
... .setOutputCol("document")
>>> autoGGUFModel = AutoGGUFModel.pretrained() \
... .setInputCols(["document"]) \
... .setOutputCol("completions") \
... .setBatchSize(4) \
... .setNPredict(20) \
... .setNGpuLayers(99) \
... .setTemperature(0.4) \
... .setTopK(40) \
... .setTopP(0.9) \
... .setPenalizeNl(True)
>>> pipeline = Pipeline().setStages([document, autoGGUFModel])
>>> data = spark.createDataFrame([["Hello, I am a"]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.select("completions").show(truncate = False)
+-----------------------------------------------------------------------------------------------------------------------------------+
|completions |
+-----------------------------------------------------------------------------------------------------------------------------------+
|[{document, 0, 78, new user. I am currently working on a project and I need to create a list of , {prompt -> Hello, I am a}, []}]|
+-----------------------------------------------------------------------------------------------------------------------------------+
{%- endcapture -%}

{%- capture scala_example -%}
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
import spark.implicits._

val document = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")

val autoGGUFModel = AutoGGUFModel
.pretrained()
.setInputCols("document")
.setOutputCol("completions")
.setBatchSize(4)
.setNPredict(20)
.setNGpuLayers(99)
.setTemperature(0.4f)
.setTopK(40)
.setTopP(0.9f)
.setPenalizeNl(true)

val pipeline = new Pipeline().setStages(Array(document, autoGGUFModel))

val data = Seq("Hello, I am a").toDF("text")
val result = pipeline.fit(data).transform(data)
result.select("completions").show(truncate = false)
+-----------------------------------------------------------------------------------------------------------------------------------+
|completions |
+-----------------------------------------------------------------------------------------------------------------------------------+
|[{document, 0, 78, new user. I am currently working on a project and I need to create a list of , {prompt -> Hello, I am a}, []}]|
+-----------------------------------------------------------------------------------------------------------------------------------+

{%- endcapture -%}

{%- capture api_link -%}
[AutoGGUFModel](/api/com/johnsnowlabs/nlp/annotators/seq2seq/AutoGGUFModel)
{%- endcapture -%}

{%- capture python_api_link -%}
[AutoGGUFModel](/api/python/reference/autosummary/sparknlp/annotator/seq2seq/auto_gguf_model/index.html)
{%- endcapture -%}

{%- capture source_link -%}
[AutoGGUFModel](https://github.com/JohnSnowLabs/spark-nlp/tree/master/src/main/scala/com/johnsnowlabs/nlp/annotators/seq2seq/AutoGGUFModel.scala)
{%- endcapture -%}

{% include templates/anno_template.md
title=title
description=description
input_anno=input_anno
output_anno=output_anno
python_example=python_example
scala_example=scala_example
api_link=api_link
python_api_link=python_api_link
source_link=source_link
%}
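The out-of-memory note in the new AutoGGUF.md above is the part most likely to need tuning in practice. A minimal sketch, assuming the Spark NLP GPU package and the `setNCtx`/`setNGpuLayers` properties the description references (the concrete values are illustrative, not recommendations):

```python
# Sketch: reduce GPU memory pressure for a larger GGUF model.
# Assumes the Spark NLP GPU package; values below are illustrative only.
from sparknlp.annotator import AutoGGUFModel

autoGGUF = (
    AutoGGUFModel.pretrained()
    .setInputCols(["document"])
    .setOutputCol("completions")
    # A smaller context window lowers VRAM use.
    .setNCtx(2048)
    # Offload only part of the model when VRAM is tight.
    .setNGpuLayers(50)
)
```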
docs/en/annotators.md (1 addition)

@@ -45,6 +45,7 @@ There are two types of Annotators:
{:.table-model-big}
|Annotator|Description|Version |
|---|---|---|
{% include templates/anno_table_entry.md path="" name="AutoGGUFModel" summary="Annotator that uses the llama.cpp library to generate text completions with large language models."%}
{% include templates/anno_table_entry.md path="" name="BGEEmbeddings" summary="Sentence embeddings using BGE."%}
{% include templates/anno_table_entry.md path="" name="BigTextMatcher" summary="Annotator to match exact phrases (by token) provided in a file against a Document."%}
{% include templates/anno_table_entry.md path="" name="Chunk2Doc" summary="Converts a `CHUNK` type column back into `DOCUMENT`. Useful when trying to re-tokenize or do further analysis on a `CHUNK` result."%}
2 changes: 1 addition & 1 deletion docs/en/concepts.md
Original file line number Diff line number Diff line change
Expand Up @@ -66,7 +66,7 @@ $ java -version
$ conda create -n sparknlp python=3.7 -y
$ conda activate sparknlp
# spark-nlp by default is based on pyspark 3.x
-$ pip install spark-nlp==5.4.2 pyspark==3.3.1 jupyter
+$ pip install spark-nlp==5.5.0-rc1 pyspark==3.3.1 jupyter
$ jupyter notebook
```

docs/en/examples.md (2 additions, 2 deletions)

@@ -18,7 +18,7 @@ $ java -version
# should be Java 8 (Oracle or OpenJDK)
$ conda create -n sparknlp python=3.7 -y
$ conda activate sparknlp
-$ pip install spark-nlp==5.4.2 pyspark==3.3.1
+$ pip install spark-nlp==5.5.0-rc1 pyspark==3.3.1
```

</div><div class="h3-box" markdown="1">
@@ -40,7 +40,7 @@ This script comes with the two options to define `pyspark` and `spark-nlp` versi
# -p is for pyspark
# -s is for spark-nlp
# by default they are set to the latest
-!bash colab.sh -p 3.2.3 -s 5.4.2
+!bash colab.sh -p 3.2.3 -s 5.5.0-rc1
```

[Spark NLP quick start on Google Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/quick_start_google_colab.ipynb) is a live demo on Google Colab that performs named entity recognitions and sentiment analysis by using Spark NLP pretrained pipelines.
2 changes: 1 addition & 1 deletion docs/en/hardware_acceleration.md
Original file line number Diff line number Diff line change
Expand Up @@ -50,7 +50,7 @@ Since the new Transformer models such as BERT for Word and Sentence embeddings a
| DeBERTa Large | +477%(5.8x) |
| Longformer Base | +52%(1.5x) |

-Spark NLP 5.4.2 is built with TensorFlow 2.7.1 and the following NVIDIA® software are only required for GPU support:
+Spark NLP 5.5.0-rc1 is built with TensorFlow 2.7.1 and the following NVIDIA® software are only required for GPU support:

- NVIDIA® GPU drivers version 450.80.02 or higher
- CUDA® Toolkit 11.2