Skip to content

Spark NLP 4.4.0: New BART for Text Translation & Summarization, new ConvNeXT Transformer for Image Classification, new Zero-Shot Text Classification by BERT, more than 4000+ state-of-the-art models, and many more!

Compare
Choose a tag to compare
@maziyarpanahi maziyarpanahi released this 11 Apr 07:15
· 483 commits to master since this release

πŸ“’ Overview

We are thrilled to announce the release of Spark NLP πŸš€ 4.4.0! This release includes new features such as a New BART for NLG, translation, and comprehension; a new ConvNeXT Transformer for Image Classification, a new Zero-Shot Text Classification by BERT, 4000+ new state-of-the-art models, and more enhancements and bug fixes.

We want to thank our community for their valuable feedback, feature requests, and contributions. Our Models Hub now contains over 17,000+ free and truly open-source models & pipelines. πŸŽ‰

Spark NLP has a new home! https://sparknlp.org is where you can find all the documentation, models, and demos for Spark NLP. It aims to provide valuable resources to anyone interested in 100% open-source NLP solutions by using Spark NLP πŸš€.


πŸ”₯ New Features

ConvNeXT Image Classification (By Facebook)

NEW: Introducing ConvNextForImageClassification annotator in Spark NLP πŸš€. ConvNextForImageClassification can load ConvNeXT models that compete favorably with Transformers in terms of accuracy and scalability, achieving 87.8% ImageNet top-1 accuracy and outperforming Swin Transformers on COCO detection and ADE20K segmentation, while maintaining the simplicity and efficiency of standard ConvNets.

This annotator is compatible with all the models trained/fine-tuned by using ConvNextForImageClassification for PyTorch or TFConvNextForImageClassification for TensorFlow models in HuggingFace πŸ€—

A ConvNet: ImageNet-1K classification results for β€’ ConvNets and β—¦ vision Transformers. Each bubble’s area is proportional to FLOPs of a variant in a model family. by Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie.

BART for NLG, Translation, and Comprehension (By Facebook)

NEW: Introducing BartTransformer annotator in Spark NLP πŸš€. BartTransformer can load BART models fine-tuned for tasks like summarizations.

This annotator is compatible with all the models trained/fine-tuned by using BartForConditionalGeneration for PyTorch or TFBartForConditionalGeneration for TensorFlow models in HuggingFace πŸ€—

The abstract explains that Bart uses a standard seq2seq/machine translation architecture, similar to BERT's bidirectional encoder and GPT's left-to-right decoder. The pretraining task involves randomly shuffling the original sentences and replacing text spans with a single mask token. BART is effective for text generation and comprehension tasks, matching RoBERTa's performance with similar training resources on GLUE and SQuAD. It also achieves new state-of-the-art results on various summarization, dialogue, and question-answering tasks with gains of up to 6 ROUGE.

The Bart model was proposed in BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer

Zero-Shot for Text Classification by BERT

NEW: Introducing BertForZeroShotClassification annotator for Zero-Shot Text Classification in Spark NLP πŸš€. You can use the BertForZeroShotClassification annotator for text classification with your labels! πŸ’―

Zero-Shot Learning (ZSL): Traditionally, ZSL most often referred to a fairly specific type of task: learning a classifier on one set of labels and then evaluating on a different set of labels that the classifier has never seen before. Recently, especially in NLP, it's been used much more broadly to get a model to do something it wasn't explicitly trained to do. A well-known example of this is in the GPT-2 paper where the authors evaluate a language model on downstream tasks like machine translation without fine-tuning on these tasks directly.

Let's see how easy it is to just use any set of labels our trained model has never seen via the setCandidateLabels() param:

zero_shot_classifier = BertForZeroShotClassification \
    .pretrained() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("class") \
    .setCandidateLabels(["urgent", "mobile", "travel", "movie", "music", "sport", "weather", "technology"])

For Zero-Short Multi-class Text Classification:

+----------------------------------------------------------------------------------------------------------------+--------+
|result                                                                                                          |result  |
+----------------------------------------------------------------------------------------------------------------+--------+
|[I have a problem with my iPhone that needs to be resolved asap!!]                                              |[mobile]|
|[Last week I upgraded my iOS version and ever since then my phone has been overheating whenever I use your app.]|[mobile]|
|[I have a phone and I love it!]                                                                                 |[mobile]|
|[I want to visit Germany and I am planning to go there next year.]                                              |[travel]|
|[Let's watch some movies tonight! I am in the mood for a horror movie.]                                         |[movie] |
|[Have you watched the match yesterday? It was a great game!]                                                    |[sport] |
|[We need to hurry up and get to the airport. We are going to miss our flight!]                                  |[urgent]|
+----------------------------------------------------------------------------------------------------------------+--------+

For Zero-Short Multi-class Text Classification:

+----------------------------------------------------------------------------------------------------------------+-----------------------------------+
|result                                                                                                          |result                             |
+----------------------------------------------------------------------------------------------------------------+-----------------------------------+
|[I have a problem with my iPhone that needs to be resolved asap!!]                                              |[urgent, mobile, movie, technology]|
|[Last week I upgraded my iOS version and ever since then my phone has been overheating whenever I use your app.]|[urgent, technology]               |
|[I have a phone and I love it!]                                                                                 |[mobile]                           |
|[I want to visit Germany and I am planning to go there next year.]                                              |[travel]                           |
|[Let's watch some movies tonight! I am in the mood for a horror movie.]                                         |[movie]                            |
|[Have you watched the match yesterday? It was a great game!]                                                    |[sport]                            |
|[We need to hurry up and get to the airport. We are going to miss our flight!]                                  |[urgent, travel]                   |
+----------------------------------------------------------------------------------------------------------------+-----------------------------------+

β­πŸ› Improvements & Bug Fixes

  • Add a new nerHasNoSchema param for NerConverter when labels coming from NerDLMOdel and NerCrfModel don't have any schema
  • Set custom entity name in Data2Chunk via setEntityName param
  • Fix loading WordEmbeddingsModel bug when loading a model from S3 via the cache_folder config
  • Fix the WordEmbeddingsModel bug failing when it's used with setEnableInMemoryStorage set to True and LightPipeline
  • Remove deprecated parameter enablePatternRegex from EntityRulerApproach & EntityRulerModel
  • Welcoming 3 new Databricks runtimes to our Spark NLP family:
    • Databricks 12.2 LTS
    • Databricks 12.2 LTS ML
    • Databricks 12.2 LTS ML GPU
  • Deprecate Python 3.6 in Spark NLP 4.4.0

πŸ’Ύ Models

Spark NLP 4.4.0 comes with more than 4300+ new state-of-the-art pre-trained transformer models in multi-languages.

Featured Models

Model Name Lang
BertForZeroShotClassification bert_base_cased_zero_shot_classifier_xnli en
ConvNextForImageClassification image_classifier_convnext_tiny_224_local en
BartTransformer distilbart_xsum_12_6 en
BartTransformer bart_large_cnn en
BertForQuestionAnswering bert_qa_case_base en
HubertForCTC asr_swin_exp_w2v2t_nl_hubert_s319 nl
BertForTokenClassification bert_token_classifier_base_chinese_ner zh

The complete list of all 17000+ models & pipelines in 230+ languages is available on Models Hub

πŸ““ New Notebooks

Notebooks Colab
Zero-Shot Text Classification Open In Colab
Document Summarization with BART Open In Colab

πŸ“– Documentation

Community support

  • Slack For live discussion with the Spark NLP community and the team
  • GitHub Bug reports, feature requests, and contributions
  • Discussions Engage with other community members, share ideas,
    and show off how you use Spark NLP!
  • Medium Spark NLP articles
  • YouTube Spark NLP video tutorials

Installation

Python

#PyPI

pip install spark-nlp==4.4.0

Spark Packages

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x (Scala 2.12):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.0

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.4.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.4.0

Apple Silicon (M1 & M2)

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:4.4.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:4.4.0

AArch64

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.4.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.4.0

Maven

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>4.4.0</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>4.4.0</version>
</dependency>

spark-nlp-silicon:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-silicon_2.12</artifactId>
    <version>4.4.0</version>
</dependency>

spark-nlp-aarch64:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-aarch64_2.12</artifactId>
    <version>4.4.0</version>
</dependency>

FAT JARs

What's Changed

Contributors

@Cabir40 @bunyamin-polat @danilojsl @dcecchini @Meryem1425 @C-K-Loan @agsfer @maziyarpanahi @jfernandrezj @jsl-builder @DevinTDHa @josejuanmartinez @aymanechilah

Full Changelog: 4.3.2...4.4.0