Spark NLP 5.3.0: Introducing Llama-2 for CausalLM, M2M100 for Multilingual Translation, MPNet & DeBERTa Enhancements, New Document Similarity Features, Expanded ONNX & In-Memory Support, Updated Runtimes, Essential Bug Fixes, and More!

🎉 Celebrating 91 Million Downloads on PyPI - A Spark NLP Milestone! 🚀

We're thrilled to announce the release of Spark NLP 5.3.0, a monumental update that brings cutting-edge advancements and enhancements to the forefront of Natural Language Processing (NLP). This release underscores our commitment to providing the NLP community with state-of-the-art tools and models, furthering our mission to democratize NLP technologies.

This release also addresses critical bug fixes, enhancing the stability and reliability of Spark NLP. Fixes include Spark NLP configuration adjustments, score calculation corrections, input validation, notebook improvements, and serialization issues.

We invite the community to explore these new features and enhancements, and we look forward to seeing the innovative applications that Spark NLP 5.3.0 will enable. 🌟

🔥 New Features & Enhancements

Llama-2 Integration: We're introducing Llama-2 along with models fine-tuned on this architecture, marking our first foray into CausalLM annotators in ONNX. This addition supports INT4 and INT8 quantization on CPUs, reducing memory use and improving inference speed.

In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closed-source models. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama 2-Chat in order to enable the community to build on our work and contribute to the responsible development of LLMs. - https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/

We have made the LLAMA2Transformer annotator compatible with ONNX exports and the following quantizations:

16 bit (CUDA only)
8 bit (CPU or CUDA)
4 bit (CPU or CUDA)

As always, we made this feature super easy and scalable:

doc_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("documents")

llama2 = LLAMA2Transformer \
    .pretrained() \
    .setMaxOutputLength(50) \
    .setDoSample(False) \
    .setInputCols(["documents"]) \
    .setOutputCol("generation")

We will continue improving this annotator and import more models in the future.
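To give a feel for what 8-bit quantization does to model weights, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It only illustrates the general idea and is not how ONNX Runtime or Spark NLP quantize Llama-2; the function names `quantize_int8` and `dequantize_int8` and the sample weights are hypothetical.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero tensor, where any scale works.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


# Made-up weights, chosen only for illustration.
weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

# Rounding to the nearest integer bounds each error by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight is stored as one small integer plus a shared scale, which is where the memory savings come from. The price is a rounding error of at most half a quantization step per weight.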

Multilingual Translation with M2M100: The M2M100 model sets a new benchmark for multilingual translation, supporting direct translation across 9,900 language pairs from 100 languages. This feature represents a significant leap in breaking down language barriers in global communication.

Existing work in translation demonstrated the potential of massively multilingual machine translation by training a single model able to translate between any pair of languages. However, much of this work is English-Centric by training only on data which was translated from or to English. While this is supported by large sources of training data, it does not reflect translation needs worldwide. In this work, we create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages. We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining. Then, we explore how to effectively increase model capacity through a combination of dense scaling and language-specific sparse parameters to create high quality models. Our focus on non-English-Centric models brings gains of more than 10 BLEU when directly translating between non-English directions while performing competitively to the best single systems of WMT. We open-source our scripts so that others may reproduce the data, evaluation, and final M2M-100 model. - https://arxiv.org/pdf/2010.11125.pdf

m2m100 = M2M100Transformer.pretrained() \
    .setInputCols(["documents"]) \
    .setMaxOutputLength(50) \
    .setOutputCol("generation") \
    .setSrcLang("zh") \
    .setTgtLang("en")
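The 9,900 figure follows from counting ordered (source, target) pairs among 100 languages, since translating zh to en is a different direction from en to zh. A short sketch of that arithmetic, using placeholder language identifiers rather than the real M2M100 language codes:

```python
from itertools import permutations

NUM_LANGUAGES = 100  # figure quoted in the release notes

# Placeholder identifiers; real M2M100 codes look like "zh" or "en".
languages = [f"lang_{i}" for i in range(NUM_LANGUAGES)]

# A translation direction is an ordered (source, target) pair with source != target.
directions = list(permutations(languages, 2))

print(len(directions))  # 100 * 99 = 9900
```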

Document Similarity and Retrieval: We've implemented a retrieval feature in our DocumentSimilarity annotator, offering an efficient and scalable solution for ranking documents based on similarity, ideal for retrieval-augmented generation (RAG) applications.

query = "Florence in Italy, is among the most beautiful cities in Europe."

# Similarity method: "brp" for BucketedRandomProjectionLSH, "mh" for MinHashLSH
doc_similarity_ranker = DocumentSimilarityRankerApproach() \
    .setInputCols("sentence_embeddings") \
    .setOutputCol("doc_similarity_rankings") \
    .setSimilarityMethod("brp") \
    .setNumberOfNeighbours(3) \
    .setVisibleDistances(True) \
    .setIdentityRanking(True) \
    .asRetriever(query)
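Conceptually, a retriever ranks candidate documents by how close their embeddings are to the query embedding. The sketch below shows that idea with exact cosine similarity over toy vectors in plain Python. It is not the annotator's implementation, which relies on approximate LSH methods (BucketedRandomProjectionLSH or MinHashLSH) over real sentence embeddings; the helper names `cosine_similarity` and `rank_documents` and the vectors themselves are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs, top_k=3):
    """Return the top_k (doc_id, score) pairs, most similar first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 3-dimensional "embeddings", invented for this example.
doc_vecs = {
    "florence": [0.9, 0.1, 0.0],
    "rome": [0.7, 0.3, 0.1],
    "databases": [0.0, 0.2, 0.9],
    "paris": [0.6, 0.4, 0.2],
}
query_vec = [1.0, 0.1, 0.0]

ranking = rank_documents(query_vec, doc_vecs, top_k=3)
```

Here `top_k` plays the same role as the number of neighbours in the annotator configuration: only the closest few documents are passed on, for example as context for a RAG prompt.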

NEW: Introducing the MPNetForSequenceClassification annotator for sequence classification tasks. Based on the MPNet architecture, it extends our sequence classification capabilities with more precise, context-aware processing.
NEW: Introducing the MPNetForQuestionAnswering annotator for question answering tasks. Based on the MPNet architecture, it extends our question answering capabilities with more precise, context-aware processing.
NEW: Introducing the DeBertaForZeroShotClassification annotator. Built on the DeBERTa architecture, it adds zero-shot classification, which assigns text to predefined classes without training on labeled examples of those classes.
NEW: Add support for in-memory use of WordEmbeddingsModel annotator in serverless clusters. We initially introduced the in-memory feature for this annotator for users inside Kubernetes clusters without any HDFS. However, today it runs without any issue locally, on Google Colab, Kaggle, Databricks, AWS EMR, GCP, and AWS Glue.
Add ONNX support for BertForZeroShotClassification annotator
Introduce new Whisper Large and Distil models.
Support new Databricks Runtimes of 14.2, 14.3, 14.2 ML, 14.3 ML, 14.2 GPU, and 14.3 GPU.
Support new EMR versions 6.15.0 and 7.0.0.
Add a notebook to fine-tune a BERT for Sentence Embeddings in Hugging Face and import it into Spark NLP.
Add a notebook to import BERT for Zero-Shot classification from Hugging Face.
Add a notebook to import DeBERTa for Zero-Shot classification from Hugging Face.
Update EntityRuler documentation.
Improve SBT project and resolve warnings (almost!).
Update ONNX Runtime to 1.17.0 to enjoy the following features in upcoming releases:
- Support for CUDA 12.1
- Enhanced security for Linux binaries to comply with BinSkim, added Windows ARM64X source build support, removed Windows ARM32 binaries, and introduced AMD GPU packages.
- Optimized graph inlining, added custom logger support at the session level, and introduced new logging and tracing features for session and execution provider options.
- Added 4bit quantization support for NVIDIA GPU and ARM64.

🐛 Bug Fixes

Fix Spark NLP Configuration to set cluster_tmp_dir on Databricks' DBFS via spark.jsl.settings.storage.cluster_tmp_dir #14129
Fix score calculation in RoBertaForQuestionAnswering annotator #14147
Fix optional input col validations #14153
Fix notebooks for importing DeBERTa classifiers #14154
Fix GPT2 deserialization over the cluster (Databricks) #14177

ℹ️ Known Issues

Llama-2, M2M100, and Whisper Large do not work in a cluster. We are working on how best to share these large models over a cluster and will provide a fix in a future release.
Some ONNX models previously did not work on CUDA 12.x, a problem we reported earlier. We have not tested this yet, but it should be resolved by onnxruntime 1.17.0 in Spark NLP 5.3.0.

💾 Models

The complete list of all 37000+ models & pipelines in 230+ languages is available on Models Hub

📓 New Notebooks

You can visit Import Transformers in Spark NLP
You can visit Spark NLP Examples for 100+ examples

📖 Documentation

❤️ Community support

Slack For live discussion with the Spark NLP community and the team
GitHub Bug reports, feature requests, and contributions
Discussions Engage with other community members, share ideas, and show off how you use Spark NLP!
Medium Spark NLP articles
JohnSnowLabs official Medium
YouTube Spark NLP video tutorials

Installation

Python

#PyPI

pip install spark-nlp==5.3.0

Spark Packages

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x (Scala 2.12):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.0

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.3.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.3.0

Apple Silicon (M1 & M2)

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.3.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.3.0

AArch64

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.3.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.3.0
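The `--packages` coordinates above share one pattern: group ID, artifact name with the Scala version suffix, then the release version. The sketch below spells that pattern out. The helper `spark_nlp_coordinate` and the flavor labels used as dictionary keys are hypothetical and are not part of Spark NLP.

```python
SCALA_VERSION = "2.12"
SPARK_NLP_VERSION = "5.3.0"

# Artifact names as listed in these release notes, keyed by a hypothetical flavor label.
ARTIFACTS = {
    "cpu": "spark-nlp",
    "gpu": "spark-nlp-gpu",
    "silicon": "spark-nlp-silicon",
    "aarch64": "spark-nlp-aarch64",
}


def spark_nlp_coordinate(flavor="cpu"):
    """Build the Maven coordinate passed to --packages for a given flavor."""
    artifact = ARTIFACTS[flavor]
    return f"com.johnsnowlabs.nlp:{artifact}_{SCALA_VERSION}:{SPARK_NLP_VERSION}"


print(spark_nlp_coordinate("gpu"))
```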

Maven

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>5.3.0</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>5.3.0</version>
</dependency>

spark-nlp-silicon:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-silicon_2.12</artifactId>
    <version>5.3.0</version>
</dependency>

spark-nlp-aarch64:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-aarch64_2.12</artifactId>
    <version>5.3.0</version>
</dependency>

FAT JARs

CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-5.3.0.jar
GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-5.3.0.jar
M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-silicon-assembly-5.3.0.jar
AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-5.3.0.jar

Pull Requests:

What's Changed

Update 2023-02-08-zero_shot_ner_roberta_en.md by @maziyarpanahi in #14161
[Issue#14129] Fix for spark.jsl.settings.storage.cluster_tmp_dir configuration by @jiamaozheng in #14132
SPARKNLP-942: MPNet Classifiers by @DevinTDHa in #14147
adding import notebook + changing default model + adding onnx support by @ahmedlone127 in #14158
Sparknlp 876: Introducing LLAMA2 by @prabod in #14148
Doc sim rank as retriever by @wolliq in #14149
812 implement de berta for zero shot classification annotator by @ahmedlone127 in #14151
SPARKNLP-886: Add Fine tuned sentence bert notebook by @DevinTDHa in #14152
[SPARKNLP-986] Fixing optional input col validations by @danilojsl in #14153
[SPARKNLP-984] Fixing Deberta notebooks URIs by @danilojsl in #14154
SparkNLP 933: Introducing M2M100 : multilingual translation model by @prabod in #14155
SPARKNLP-985: Make Whisper compatible with onnx_data files by @DevinTDHa in #14165
Fixed a bug with models that has 'onnx_data' file not working in dbfs/hdfs by @prabod in #14169
[SPARKNLP-940] Adding changes to correctly copy cluster index storage… by @danilojsl in #14167
[SPARKNLP-988] Updating EntityRuler documentation by @danilojsl in #14168
SPARKNLP-1000: Fix No Operation named [init_all_tables] for GPT2 by @DevinTDHa in #14177
fixes python documentation by @ahmedlone127 in #14172
fixed all sbt warnings by @ahmedlone127 in #14156
Replace hard exception with soft logs by @maziyarpanahi in #14179
Models hub by @maziyarpanahi in #14183
release/530-release-candidate by @maziyarpanahi in #14164

New Contributors

@jiamaozheng made their first contribution in #14132

Full Changelog: 5.2.3...5.3.0
