Spark NLP 5.3.0: Introducing Llama-2 for CausalLM, M2M100 for Multilingual Translation, MPNet & DeBERTa Enhancements, New Document Similarity Features, Expanded ONNX & In-Memory Support, Updated Runtimes, Essential Bug Fixes, and More!
Celebrating 91 Million Downloads on PyPI - A Spark NLP Milestone!
We're thrilled to announce the release of Spark NLP 5.3.0, a monumental update that brings cutting-edge advancements and enhancements to the forefront of Natural Language Processing (NLP). This release underscores our commitment to providing the NLP community with state-of-the-art tools and models, furthering our mission to democratize NLP technologies.
This release also addresses critical bug fixes, enhancing the stability and reliability of Spark NLP. Fixes include Spark NLP configuration adjustments, score calculation corrections, input validation, notebook improvements, and serialization issues.
We invite the community to explore these new features and enhancements, and we look forward to seeing the innovative applications that Spark NLP 5.3.0 will enable.
New Features & Enhancements
- Llama-2 Integration: We're introducing Llama-2 along with models fine-tuned on this architecture, marking our first foray into CausalLM annotators in ONNX. This addition supports quantization in INT4 and INT8 for CPUs, optimizing performance and efficiency.
In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closed-source models. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama 2-Chat in order to enable the community to build on our work and contribute to the responsible development of LLMs. - https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/
We have made the LLAMA2Transformer annotator compatible with ONNX exports and quantizations:
- 16 bit (CUDA only)
- 8 bit (CPU or CUDA)
- 4 bit (CPU or CUDA)
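As a rough illustration of why the lower bit widths matter, the sketch below estimates the weight-only footprint of a 7-billion-parameter model (the smallest size named in the abstract above) at each precision. It assumes every parameter is stored at the stated width, which real exports only approximate; the helper name `weight_memory_gb` is ours, not part of Spark NLP.

```python
# Back-of-the-envelope estimate of weight storage at different precisions.
# Assumption: all parameters use the stated bit width. Real ONNX exports keep
# some tensors at higher precision and need extra memory at runtime.

def weight_memory_gb(num_parameters: int, bits_per_weight: int) -> float:
    """Approximate size of the weights in gigabytes (1 GB = 10**9 bytes)."""
    total_bytes = num_parameters * bits_per_weight / 8
    return total_bytes / 1e9

PARAMS_7B = 7_000_000_000  # 7 billion parameters

# Bit widths taken from the list above: 16, 8, and 4 bit.
estimates = {bits: weight_memory_gb(PARAMS_7B, bits) for bits in (16, 8, 4)}

for bits, gigabytes in estimates.items():
    print(f"{bits}-bit weights: ~{gigabytes:.1f} GB")
```

Under these assumptions the weights shrink from about 14 GB at 16 bit to about 3.5 GB at 4 bit, which is what makes CPU inference practical for the smaller models.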
As always, we made this feature super easy and scalable:
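from sparknlp.base import DocumentAssembler
from sparknlp.annotator import LLAMA2Transformer

# Turn the raw "text" column into Spark NLP documents
doc_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("documents")

# Load the default pretrained Llama-2 model; greedy decoding, at most 50 tokens
llama2 = LLAMA2Transformer \
    .pretrained() \
    .setMaxOutputLength(50) \
    .setDoSample(False) \
    .setInputCols(["documents"]) \
    .setOutputCol("generation")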
We will continue improving this annotator and import more models in the future.
- Multilingual Translation with M2M100: The M2M100 model sets a new benchmark for multilingual translation, supporting direct translation across 9,900 language pairs from 100 languages. This feature represents a significant leap in breaking down language barriers in global communication.
Existing work in translation demonstrated the potential of massively multilingual machine translation by training a single model able to translate between any pair of languages. However, much of this work is English-Centric by training only on data which was translated from or to English. While this is supported by large sources of training data, it does not reflect translation needs worldwide. In this work, we create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages. We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining. Then, we explore how to effectively increase model capacity through a combination of dense scaling and language-specific sparse parameters to create high quality models. Our focus on non-English-Centric models brings gains of more than 10 BLEU when directly translating between non-English directions while performing competitively to the best single systems of WMT. We open-source our scripts so that others may reproduce the data, evaluation, and final M2M-100 model. - https://arxiv.org/pdf/2010.11125.pdf
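from sparknlp.annotator import M2M100Transformer

# Translate Chinese ("zh") input documents into English ("en")
m2m100 = M2M100Transformer.pretrained() \
    .setInputCols(["documents"]) \
    .setMaxOutputLength(50) \
    .setOutputCol("generation") \
    .setSrcLang("zh") \
    .setTgtLang("en")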
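The 9,900 figure quoted above is simply the number of ordered (source, target) pairs among 100 languages. The short sketch below checks that arithmetic and enumerates the pairs for a three-language sample; the helper name `directed_pairs` and the extra language code `fr` are illustrative assumptions.

```python
from itertools import permutations

def directed_pairs(num_languages: int) -> int:
    """Number of ordered (source, target) pairs with source != target."""
    return num_languages * (num_languages - 1)

# 100 languages give 100 * 99 directed translation pairs.
total_pairs = directed_pairs(100)
print(total_pairs)

# Enumerate the directions for a small sample of language codes.
sample_languages = ["zh", "en", "fr"]
sample_pairs = list(permutations(sample_languages, 2))
for src, tgt in sample_pairs:
    print(f"{src} -> {tgt}")
```

Each printed direction corresponds to one `setSrcLang` / `setTgtLang` combination in the snippet above.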
- Document Similarity and Retrieval: We've implemented a retrieval feature in our DocumentSimilarity annotator, offering an efficient and scalable solution for ranking documents based on similarity, ideal for retrieval-augmented generation (RAG) applications.
query = "Florence in Italy, is among the most beautiful cities in Europe."
# Similarity method: "brp" for BucketedRandomProjectionLSH, "mh" for MinHashLSH
doc_similarity_ranker = DocumentSimilarityRankerApproach() \
    .setInputCols("sentence_embeddings") \
    .setOutputCol("doc_similarity_rankings") \
    .setSimilarityMethod("brp") \
    .setNumberOfNeighbours(3) \
    .setVisibleDistances(True) \
    .setIdentityRanking(True) \
    .asRetriever(query)
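To make the ranking step concrete, here is a minimal pure-Python sketch of what a nearest-neighbour retriever returns. It is not the DocumentSimilarityRankerApproach implementation: it ranks exhaustively by Euclidean distance, which is the distance that BucketedRandomProjectionLSH approximates, and the toy three-dimensional vectors, document ids, and `rank_neighbours` helper are all assumptions made for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_neighbours(query_vec, doc_vecs, k):
    """Return the k closest (doc_id, distance) pairs, nearest first."""
    distances = [(doc_id, euclidean(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    distances.sort(key=lambda pair: pair[1])
    return distances[:k]

# Hypothetical sentence embeddings; real ones have hundreds of dimensions.
doc_embeddings = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [0.8, 0.2, 0.1],
    "doc_d": [-1.0, 0.0, 0.0],
}
query_embedding = [1.0, 0.0, 0.0]

# k=3 mirrors setNumberOfNeighbours(3) in the snippet above.
top_matches = rank_neighbours(query_embedding, doc_embeddings, k=3)
for doc_id, distance in top_matches:
    print(f"{doc_id}: {distance:.3f}")
```

An LSH-based method trades this exhaustive scan for hashing into buckets, so it scales to large collections at the cost of approximate results.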
- NEW: Introducing the MPNetForSequenceClassification annotator for sequence classification tasks. Based on the MPNet architecture, it enhances our sequence classification capabilities, offering more precise and context-aware processing.
- NEW: Introducing the MPNetForQuestionAnswering annotator for question answering tasks. Based on the MPNet architecture, it enhances our question answering capabilities, offering more precise and context-aware processing.
- NEW: Introducing a new DeBertaForZeroShotClassification annotator. Leveraging the DeBERTa architecture, it adds zero-shot classification, which classifies text into predefined classes without training on labeled examples of those classes.
- NEW: Add support for in-memory use of the WordEmbeddingsModel annotator in serverless clusters. We initially introduced the in-memory feature for this annotator for users inside Kubernetes clusters without any HDFS. However, today it runs without any issue locally, on Google Colab, Kaggle, Databricks, AWS EMR, GCP, and AWS Glue.
- Add ONNX support for BertForZeroShotClassification annotator
- Introduce new Whisper Large and Distil models.
- Support new Databricks Runtimes of 14.2, 14.3, 14.2 ML, 14.3 ML, 14.2 GPU, and 14.3 GPU.
- Support new EMR versions 6.15.0 and 7.0.0.
. - Add a notebook to fine-tune a BERT for Sentence Embeddings in Hugging Face and import it into Spark NLP.
- Add a notebook to import BERT for Zero-Shot classification from Hugging Face.
- Add a notebook to import DeBERTa for Zero-Shot classification from Hugging Face.
- Update EntityRuler documentation.
- Improve SBT project and resolve warnings (almost!).
- Update ONNX Runtime to 1.17.0 to enjoy the following features in upcoming releases:
  - Support for CUDA 12.1
  - Enhanced security for Linux binaries to comply with BinSkim, added Windows ARM64X source build support, removed Windows ARM32 binaries, and introduced AMD GPU packages.
  - Optimized graph inlining, added custom logger support at the session level, and introduced new logging and tracing features for session and execution provider options.
  - Added 4bit quantization support for NVIDIA GPU and ARM64.
Bug Fixes
- Fix Spark NLP Configuration to set cluster_tmp_dir on Databricks' DBFS via spark.jsl.settings.storage.cluster_tmp_dir #14129
- Fix score calculation in RoBertaForQuestionAnswering annotator #14147
- Fix optional input col validations #14153
- Fix notebooks for importing DeBERTa classifiers #14154
- Fix GPT2 deserialization over the cluster (Databricks) #14177
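For the cluster_tmp_dir fix above, the following sketch shows one way the setting might be passed to Spark. The property key comes from the release notes; the DBFS path and the `to_conf_flags` helper are hypothetical, so substitute a directory your cluster can actually write to.

```python
# Hypothetical DBFS location, used purely for illustration.
CLUSTER_TMP_DIR = "dbfs:/tmp/spark-nlp"

# The property key is the one named in the bug fix above.
spark_nlp_settings = {
    "spark.jsl.settings.storage.cluster_tmp_dir": CLUSTER_TMP_DIR,
}

def to_conf_flags(settings):
    """Render settings as `--conf key=value` arguments for spark-submit or pyspark."""
    flags = []
    for key, value in sorted(settings.items()):
        flags.extend(["--conf", f"{key}={value}"])
    return flags

conf_flags = to_conf_flags(spark_nlp_settings)
print(" ".join(conf_flags))
```

The same key/value pairs can instead be supplied when the session is created, for example through `SparkSession.builder.config(key, value)` or the Spark config box of a Databricks cluster.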
Known Issues
- Llama-2, M2M100, and Whisper Large do not work in a cluster. We are working on how best to share these large models over a cluster and will provide a fix in a future release.
- Previously, some ONNX models did not work on CUDA 12.x, a problem we have already reported. We have not tested this yet, but it should be resolved by onnxruntime 1.17.0 in Spark NLP 5.3.0.
Models
The complete list of all 37000+ models & pipelines in 230+ languages is available on Models Hub
New Notebooks
- You can visit Import Transformers in Spark NLP
- You can visit Spark NLP Examples for 100+ examples
Documentation
- Import models from TF Hub & HuggingFace
- Spark NLP Notebooks
- Models Hub with new models
- Spark NLP Articles
- Spark NLP in Action
- Spark NLP Documentation
- Spark NLP Scala APIs
- Spark NLP Python APIs
Community support
- Slack For live discussion with the Spark NLP community and the team
- GitHub Bug reports, feature requests, and contributions
- Discussions Engage with other community members, share ideas, and show off how you use Spark NLP!
- Medium Spark NLP articles
- JohnSnowLabs official Medium
- YouTube Spark NLP video tutorials
Installation
Python
#PyPI
pip install spark-nlp==5.3.0
Spark Packages
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x (Scala 2.12):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.0
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.3.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.3.0
Apple Silicon (M1 & M2)
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.3.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.3.0
AArch64
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.3.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.3.0
Maven
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp_2.12</artifactId>
<version>5.3.0</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu_2.12</artifactId>
<version>5.3.0</version>
</dependency>
spark-nlp-silicon:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-silicon_2.12</artifactId>
<version>5.3.0</version>
</dependency>
spark-nlp-aarch64:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-aarch64_2.12</artifactId>
<version>5.3.0</version>
</dependency>
FAT JARs
- CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-5.3.0.jar
- GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-5.3.0.jar
- M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-silicon-assembly-5.3.0.jar
- AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-5.3.0.jar
Pull Requests:
- #14132
- #14147
- #14148
- #14149
- #14158
- #14151
- #14152
- #14153
- #14154
- #14155
- #14165
- #14169
- #14167
- #14168
- #14177
- #14172
- #14156
What's Changed
- Update 2023-02-08-zero_shot_ner_roberta_en.md by @maziyarpanahi in #14161
- [Issue#14129] Fix for spark.jsl.settings.storage.cluster_tmp_dir configuration by @jiamaozheng in #14132
- SPARKNLP-942: MPNet Classifiers by @DevinTDHa in #14147
- adding import notebook + changing default model + adding onnx support by @ahmedlone127 in #14158
- Sparknlp 876: Introducing LLAMA2 by @prabod in #14148
- Doc sim rank as retriever by @wolliq in #14149
- 812 implement de berta for zero shot classification annotator by @ahmedlone127 in #14151
- SPARKNLP-886: Add Fine tuned sentence bert notebook by @DevinTDHa in #14152
- [SPARKNLP-986] Fixing optional input col validations by @danilojsl in #14153
- [SPARKNLP-984] Fixing Deberta notebooks URIs by @danilojsl in #14154
- SparkNLP 933: Introducing M2M100 : multilingual translation model by @prabod in #14155
- SPARKNLP-985: Make Whisper compatible with onnx_data files by @DevinTDHa in #14165
- Fixed a bug with models that has 'onnx_data' file not working in dbfs/hdfs by @prabod in #14169
- [SPARKNLP-940] Adding changes to correctly copy cluster index storage… by @danilojsl in #14167
- [SPARKNLP-988] Updating EntityRuler documentation by @danilojsl in #14168
- SPARKNLP-1000: Fix No Operation named [init_all_tables] for GPT2 by @DevinTDHa in #14177
- fixes python documentation by @ahmedlone127 in #14172
- fixed all sbt warnings by @ahmedlone127 in #14156
- Replace hard exception with soft logs by @maziyarpanahi in #14179
- Models hub by @maziyarpanahi in #14183
- release/530-release-candidate by @maziyarpanahi in #14164
New Contributors
- @jiamaozheng made their first contribution in #14132
Full Changelog: 5.2.3...5.3.0