Spark NLP 5.3.0: Introducing Llama-2 for CausalLM, M2M100 for Multilingual Translation, MPNet & DeBERTa Enhancements, New Document Similarity Features, Expanded ONNX & In-Memory Support, Updated Runtimes, Essential Bug Fixes, and More!

@maziyarpanahi maziyarpanahi released this 27 Feb 13:48
· 157 commits to master since this release
ad5a4ea

🎉 Celebrating 91 Million Downloads on PyPI - A Spark NLP Milestone! 🚀

91,000,000 Downloads

We're thrilled to announce the release of Spark NLP 5.3.0, a monumental update that brings cutting-edge advancements and enhancements to the forefront of Natural Language Processing (NLP). This release underscores our commitment to providing the NLP community with state-of-the-art tools and models, furthering our mission to democratize NLP technologies.

This release also fixes several critical bugs, improving the stability and reliability of Spark NLP. The fixes cover Spark NLP configuration on Databricks, score calculation, input column validation, the import notebooks, and deserialization over a cluster.

We invite the community to explore these new features and enhancements, and we look forward to seeing the innovative applications that Spark NLP 5.3.0 will enable. 🌟


🔥 New Features & Enhancements

  • Llama-2 Integration: We're introducing Llama-2 along with models fine-tuned on this architecture, marking our first foray into CausalLM annotators in ONNX. This addition supports INT4 and INT8 quantization on CPUs, which reduces memory use and speeds up inference.

In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closed-source models. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama 2-Chat in order to enable the community to build on our work and contribute to the responsible development of LLMs. - https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/

We have made the LLAMA2Transformer annotator compatible with ONNX exports at the following precisions:

  • 16 bit (CUDA only)
  • 8 bit (CPU or CUDA)
  • 4 bit (CPU or CUDA)
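To give a rough sense of what storing weights in 8 bits involves, here is a minimal sketch of symmetric per-tensor quantization in plain Python. It is a generic illustration only: the function names `quantize_int8` and `dequantize` and the sample weights are made up for this example, and it does not reproduce the scheme ONNX Runtime or Spark NLP actually applies to the Llama-2 exports.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to signed 8-bit integers.

    Illustrative helper (hypothetical name), not part of Spark NLP.
    """
    max_abs = max(abs(v) for v in values)
    # Map the largest magnitude to 127; fall back to 1.0 for an all-zero input.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the 8-bit integers."""
    return [q * scale for q in quantized]


weights = [0.42, -1.27, 0.05, 0.9]  # made-up sample values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Each restored value differs from the original by at most half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight is stored as one byte plus a shared scale, so the trade-off is a smaller model against a bounded rounding error per value.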

As always, we made this feature super easy and scalable:

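from sparknlp.base import DocumentAssembler
from sparknlp.annotator import LLAMA2Transformer

# Turn the raw "text" column into document annotations
doc_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("documents")

# Load the default pretrained Llama-2 model and generate up to 50 tokens greedily
llama2 = LLAMA2Transformer \
    .pretrained() \
    .setMaxOutputLength(50) \
    .setDoSample(False) \
    .setInputCols(["documents"]) \
    .setOutputCol("generation")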

We will continue improving this annotator and import more models in the future.


  • Multilingual Translation with M2M100: The M2M100 model sets a new benchmark for multilingual translation, supporting direct translation across 9,900 language pairs from 100 languages. This feature represents a significant leap in breaking down language barriers in global communication.

Existing work in translation demonstrated the potential of massively multilingual machine translation by training a single model able to translate between any pair of languages. However, much of this work is English-Centric by training only on data which was translated from or to English. While this is supported by large sources of training data, it does not reflect translation needs worldwide. In this work, we create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages. We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining. Then, we explore how to effectively increase model capacity through a combination of dense scaling and language-specific sparse parameters to create high quality models. Our focus on non-English-Centric models brings gains of more than 10 BLEU when directly translating between non-English directions while performing competitively to the best single systems of WMT. We open-source our scripts so that others may reproduce the data, evaluation, and final M2M-100 model. - https://arxiv.org/pdf/2010.11125.pdf

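from sparknlp.annotator import M2M100Transformer

# Translate Chinese ("zh") input documents into English ("en")
m2m100 = M2M100Transformer.pretrained() \
    .setInputCols(["documents"]) \
    .setMaxOutputLength(50) \
    .setOutputCol("generation") \
    .setSrcLang("zh") \
    .setTgtLang("en")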
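The 9,900 language pairs quoted for M2M100 follow from counting translation directions: each of the 100 languages can be a source for any of the other 99. The short check below works this out, using placeholder integer IDs in place of real language codes.

```python
from itertools import permutations

num_languages = 100

# A translation direction is an ordered (source, target) pair of distinct languages.
directions = num_languages * (num_languages - 1)

# Cross-check by enumerating ordered pairs over placeholder language IDs.
language_ids = range(num_languages)
enumerated = sum(1 for _ in permutations(language_ids, 2))

print(directions, enumerated)
```

Because the pairs are ordered, zh to en and en to zh count as two separate directions.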


  • Document Similarity and Retrieval: We've implemented a retrieval feature in our DocumentSimilarity annotator, offering an efficient and scalable solution for ranking documents based on similarity, ideal for retrieval-augmented generation (RAG) applications.

query = "Florence in Italy, is among the most beautiful cities in Europe."

# Similarity method: "brp" for BucketedRandomProjectionLSH, "mh" for MinHashLSH
doc_similarity_ranker = DocumentSimilarityRankerApproach()\
    .setInputCols("sentence_embeddings")\
    .setOutputCol("doc_similarity_rankings")\
    .setSimilarityMethod("brp")\
    .setNumberOfNeighbours(3)\
    .setVisibleDistances(True)\
    .setIdentityRanking(True)\
    .asRetriever(query)
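For readers unfamiliar with the "brp" option, the sketch below illustrates the general idea behind bucketed random projection hashing: project a vector onto a random unit vector and divide the result into fixed-width buckets, so that nearby vectors tend to land in the same or adjacent buckets. This is a plain-Python illustration, not Spark NLP's or Spark ML's internal code; the helper names, the bucket length of 2.0, and the sample vectors are all assumptions made for the example.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a random direction by normalizing a Gaussian vector."""
    vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def bucket_hash(vector, projection, bucket_length):
    """Hash = floor(dot(vector, projection) / bucket_length)."""
    dot = sum(a * b for a, b in zip(vector, projection))
    return math.floor(dot / bucket_length)


rng = random.Random(42)
# Several independent projections act as several hash tables.
projections = [random_unit_vector(3, rng) for _ in range(4)]


def signature(vector, bucket_length=2.0):
    """One bucket index per projection."""
    return [bucket_hash(vector, p, bucket_length) for p in projections]


def shared_buckets(sig_a, sig_b):
    """Number of hash tables in which two signatures collide."""
    return sum(1 for a, b in zip(sig_a, sig_b) if a == b)


doc_a = [1.0, 2.0, 3.0]    # made-up embedding
doc_b = [1.1, 2.0, 2.9]    # close to doc_a
doc_c = [-8.0, 5.0, -9.0]  # far from doc_a

print(shared_buckets(signature(doc_a), signature(doc_b)))
print(shared_buckets(signature(doc_a), signature(doc_c)))
```

Candidate neighbours are the documents that collide with the query in at least one table, which is what lets the search avoid comparing the query against every document.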
  • NEW: Introducing the MPNetForSequenceClassification annotator for sequence classification tasks. Based on the MPNet architecture, this annotator enhances our sequence classification capabilities with more precise and context-aware processing.
  • NEW: Introducing the MPNetForQuestionAnswering annotator for question answering tasks. Based on the MPNet architecture, this annotator enhances our question answering capabilities with more precise and context-aware processing.
  • NEW: Introducing the DeBertaForZeroShotClassification annotator. Built on the DeBERTa architecture, it adds zero-shot classification, which assigns text to predefined classes without training on labeled examples of those classes.
  • NEW: Add support for in-memory use of WordEmbeddingsModel annotator in serverless clusters. We initially introduced the in-memory feature for this annotator for users inside Kubernetes clusters without any HDFS. However, today it runs without any issue locally, on Google Colab, Kaggle, Databricks, AWS EMR, GCP, and AWS Glue.
  • Add ONNX support for BertForZeroShotClassification annotator
  • Introduce new Whisper Large and Distil models.
  • Support new Databricks Runtimes of 14.2, 14.3, 14.2 ML, 14.3 ML, 14.2 GPU, and 14.3 GPU.
  • Support new EMR versions 6.15.0 and 7.0.0.
  • Add a notebook to fine-tune a BERT for Sentence Embeddings in Hugging Face and import it into Spark NLP.
  • Add a notebook to import BERT for Zero-Shot classification from Hugging Face.
  • Add a notebook to import DeBERTa for Zero-Shot classification from Hugging Face.
  • Update EntityRuler documentation.
  • Improve SBT project and resolve warnings (almost!).
  • Update ONNX Runtime to 1.17.0 to enjoy the following features in upcoming releases:
    • Support for CUDA 12.1
    • Enhanced security for Linux binaries to comply with BinSkim, added Windows ARM64X source build support, removed Windows ARM32 binaries, and introduced AMD GPU packages.
    • Optimized graph inlining, added custom logger support at the session level, and introduced new logging and tracing features for session and execution provider options.
    • Added 4bit quantization support for NVIDIA GPU and ARM64.

πŸ› Bug Fixes

  • Fix Spark NLP Configuration to set cluster_tmp_dir on Databricks' DBFS via spark.jsl.settings.storage.cluster_tmp_dir #14129
  • Fix score calculation in RoBertaForQuestionAnswering annotator #14147
  • Fix optional input col validations #14153
  • Fix notebooks for importing DeBERTa classifiers #14154
  • Fix GPT2 deserialization over the cluster (Databricks) #14177

ℹ️ Known Issues

  • Llama-2, M2M100, and Whisper Large do not work in a cluster. We are working on how best to share these large models over a cluster and will provide a fix in future releases.
  • Some ONNX models previously failed on CUDA 12.x, a problem we have reported. We have not tested this yet, but the upgrade to onnxruntime 1.17.0 in Spark NLP 5.3.0 should resolve it.

💾 Models

The complete list of all 37000+ models & pipelines in 230+ languages is available on Models Hub

📓 New Notebooks


📖 Documentation


❤️ Community support

  • Slack For live discussion with the Spark NLP community and the team
  • GitHub Bug reports, feature requests, and contributions
  • Discussions Engage with other community members, share ideas, and show off how you use Spark NLP!
  • Medium Spark NLP articles
  • JohnSnowLabs official Medium
  • YouTube Spark NLP video tutorials

Installation

Python

#PyPI

pip install spark-nlp==5.3.0

Spark Packages

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x (Scala 2.12):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.0

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.3.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.3.0

Apple Silicon (M1 & M2)

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.3.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.3.0

AArch64

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.3.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.3.0

Maven

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>5.3.0</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>5.3.0</version>
</dependency>

spark-nlp-silicon:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-silicon_2.12</artifactId>
    <version>5.3.0</version>
</dependency>

spark-nlp-aarch64:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-aarch64_2.12</artifactId>
    <version>5.3.0</version>
</dependency>

FAT JARs

Pull Requests:

What's Changed

New Contributors

Full Changelog: 5.2.3...5.3.0