
Spark NLP 4.3.0: New HuBERT for speech recognition, new Swin Transformer for Image Classification, new Zero-shot annotator for Entity Recognition, CamemBERT for question answering, new Databricks and EMR with support for Spark 3.3, 1000+ state-of-the-art models and many more!

@maziyarpanahi maziyarpanahi released this 09 Feb 19:45
· 558 commits to master since this release

📢 Overview

We are very excited to release Spark NLP 🚀 4.3.0! This has been one of the biggest releases we have ever done and we are so proud to share this with our community! 🎉

This release adds Swin Transformer as another Image Classification annotator, extends speech recognition support with the HuBERT annotator, and introduces a new extractive transformer-based Question Answering (QA) annotator based on the CamemBERT architecture for tasks like SQuAD. It also brings new Databricks & EMR runtimes with support for Spark 3.3, 1000+ state-of-the-art models, and many more enhancements and bug fixes!

We are also celebrating crossing 12600+ free and open-source models & pipelines in our Models Hub. 🎉 As always, we would like to thank our community for their feedback, questions, and feature requests.


🔥 New Features

HuBERT

NEW: Introducing HubertForCTC annotator in Spark NLP 🚀. HubertForCTC can load HuBERT models that match or surpass the SOTA approaches for speech representation learning for speech recognition, generation, and compression. The Hidden-Unit BERT (HuBERT) approach was proposed for self-supervised speech representation learning, which utilizes an offline clustering step to provide aligned target labels for a BERT-like prediction loss. This annotator is compatible with all the models trained/fine-tuned by using HubertForCTC for PyTorch or TFHubertForCTC for TensorFlow models in HuggingFace 🤗


HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.
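The "CTC" in HubertForCTC refers to connectionist temporal classification: the model emits one label per audio frame, including a special blank label, and a transcript is obtained by collapsing repeated labels and dropping blanks. Below is a minimal sketch of greedy CTC decoding in plain Python. The helper name `ctc_greedy_decode`, the toy vocabulary, and the frame ids are assumptions made for illustration; this is not the annotator's actual code.

```python
def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse repeated frame labels, then drop blanks (greedy CTC decoding).

    Hypothetical helper for illustration only.
    """
    chars = []
    prev = None
    for idx in frame_ids:
        # Emit a character only when the label changes and is not the blank.
        if idx != prev and idx != blank_id:
            chars.append(id_to_char[idx])
        prev = idx
    return "".join(chars)


# Toy vocabulary and per-frame argmax ids (0 is the blank label).
vocab = {1: "h", 2: "e", 3: "l", 4: "o"}
frames = [1, 1, 0, 2, 2, 3, 0, 3, 3, 4, 0]
text = ctc_greedy_decode(frames, vocab)
print(text)
```

The blank between the two runs of `3` is what lets the decoder keep a doubled letter instead of merging it into one.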

Swin Transformer

NEW: Introducing SwinForImageClassification annotator in Spark NLP 🚀. SwinForImageClassification can load transformer-based deep learning models with state-of-the-art performance in vision tasks. Swin Transformer follows Vision Transformer (ViT) (Dosovitskiy et al., 2020) and offers great accuracy and efficiency. This annotator is compatible with all the models trained/fine-tuned by using SwinForImageClassification for PyTorch or TFSwinForImageClassification for TensorFlow models in HuggingFace 🤗


Swin Transformer: Hierarchical Vision Transformer using Shifted Windows by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.

Zero-Shot for Named Entity Recognition

Zero-Shot Learning refers to the process by which a model learns how to recognize objects (image, text, any features) without any labeled training data to help in the classification.

NEW: Introducing ZeroShotNerModel annotator in Spark NLP 🚀. You can use the ZeroShotNerModel annotator to construct simple questions/answers mapped to NER labels like PERSON, NORP, etc. We use the RoBERTa for Question Answering architecture under the hood, which allows you to use any of the 460 models available on Models Hub to build your zero-shot Entity Recognition without any training dataset!

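# Assumes Spark NLP 4.3.0 is installed and a session has been started
from sparknlp.annotator import ZeroShotNerModel

# Each NER label maps to one or more questions answered by the QA model
zero_shot_ner = ZeroShotNerModel.pretrained("roberta_base_qa_squad2", "en") \
    .setEntityDefinitions(
        {
            "NAME": ["What is his name?", "What is my name?", "What is her name?"],
            "CITY": ["Which city?", "Which is the city?"]
        }) \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("zero_shot_ner")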

This powerful annotator with such simple rules can detect those entities from the following input: "My name is Clara, I live in New York and Hellen lives in Paris."

+-----------------------------------------------------------------+------+------+----------+------------------+
|result                                                           |result|word  |confidence|question          |
+-----------------------------------------------------------------+------+------+----------+------------------+
|[My name is Clara, I live in New York and Hellen lives in Paris.]|B-CITY|Paris |0.5328949 |Which is the city?|
|[My name is Clara, I live in New York and Hellen lives in Paris.]|B-NAME|Clara |0.9360068 |What is my name?  |
|[My name is Clara, I live in New York and Hellen lives in Paris.]|B-CITY|New   |0.83294415|Which city?       |
|[My name is Clara, I live in New York and Hellen lives in Paris.]|I-CITY|York  |0.83294415|Which city?       |
|[My name is Clara, I live in New York and Hellen lives in Paris.]|B-NAME|Hellen|0.45366877|What is her name? |
+-----------------------------------------------------------------+------+------+----------+------------------+
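The B- and I- prefixes in the table above follow the usual IOB convention: the first token of a detected span gets B-<label> and every following token of the same span gets I-<label>. The sketch below illustrates that mapping in plain Python. The helper `span_to_iob` and the token positions are assumptions made for illustration; this is not how ZeroShotNerModel is implemented internally.

```python
def span_to_iob(tokens, span_start, span_end, label):
    """Tag tokens[span_start..span_end] with B-/I- labels and the rest with 'O'.

    Hypothetical helper for illustration; indices are inclusive token positions.
    """
    tags = ["O"] * len(tokens)
    for i in range(span_start, span_end + 1):
        prefix = "B" if i == span_start else "I"
        tags[i] = f"{prefix}-{label}"
    return tags


tokens = ["I", "live", "in", "New", "York"]
# Suppose the question "Which city?" was answered with tokens 3..4.
tags = span_to_iob(tokens, 3, 4, "CITY")
print(list(zip(tokens, tags)))
```

This is why "New" and "York" appear as two rows with the same question and confidence: they belong to one answer span.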

CamemBERT for Question Answering

NEW: Introducing CamemBertForQuestionAnswering annotator in Spark NLP 🚀. CamemBertForQuestionAnswering can load CamemBERT Models with a span classification head on top for extractive question-answering tasks like SQuAD (linear layers on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using CamembertForQuestionAnswering for PyTorch or TFCamembertForQuestionAnswering for TensorFlow in HuggingFace 🤗
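To make the "span start logits and span end logits" concrete, here is a minimal sketch of how an answer span can be chosen from two per-token score lists. The function `best_span`, the `max_answer_len` parameter, and the toy logits are assumptions made for illustration; the annotator's own post-processing may differ.

```python
def best_span(start_logits, end_logits, max_answer_len=15):
    """Return (start, end, score) maximizing start_logits[s] + end_logits[e].

    Only spans with s <= e and at most max_answer_len tokens are considered.
    Hypothetical helper for illustration only.
    """
    best = None
    for s, s_score in enumerate(start_logits):
        last = min(len(end_logits), s + max_answer_len)
        for e in range(s, last):
            score = s_score + end_logits[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best


tokens = ["Clara", "lives", "in", "New", "York"]
# Toy logits: one start score and one end score per token.
start_logits = [0.1, -1.0, -0.5, 2.0, 0.3]
end_logits = [0.0, -1.2, -0.7, 0.4, 2.5]

start, end, score = best_span(start_logits, end_logits)
answer = " ".join(tokens[start:end + 1])
print(answer)
```

The extracted answer is always a contiguous slice of the input context, which is what "extractive" means here.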

Models Hub

The Models Hub introduces a new filter by annotator, which should make it easier to navigate and find models.


⭐🐛 Improvements & Bug Fixes

  • New Date2Chunk annotator to convert DATE outputs coming from DateMatcher and MultiDateMatcher annotators to CHUNK that is acceptable by a wider range of annotators
  • Spark NLP 4.3.0 supports Apple Silicon M1 and M2 (still under experimental status until GitHub supports Apple Silicon officially). We have refactored the name m1 to silicon and apple_silicon in our code for better clarity
  • Add new templates for issues, docs, and feature requests on GitHub
  • Add new log4j2 properties for Spark 3.3.x, which comes with Log4j 2.x, to control the logs on Apache Spark
  • Cross compatibility for all saved pipelines for all major releases of Apache Spark and PySpark
  • Relocate Spark NLP examples to the examples directory in our main repository. We will update them on each release, keep a history of the changes for each version, and add more languages, especially more use cases with Java and Scala
  • Add PyDoc documentation for ResourceDownloader in Python (clearCache(), showPublicModels(), showPublicPipelines(), and showAvailableAnnotators() )
  • Fix calculating delimiter id in CamemBERT annotators. The delimiter id is actually correct and doesn't need any offset
  • Fix handling of AnalysisException, which requires catching a different message for Spark 3.3
  • Fix copying existing models & pipelines on S3 before unzipping when cache_pretrained is defined as S3 bucket
  • Fix copying existing models & pipelines on GCP before unzipping when cache_pretrained is defined as GCP bucket
  • Fix loadSavedModel() trying to load external models for private buckets on S3 with better error handling and warnings
  • Enable the params argument in the Spark NLP start function. You can create a params = {} with all Spark NLP and Apache Spark configs and pass it when starting the Spark NLP session
  • Add support for doc id in CoNLL() class when trying to read CoNLL files with id inside each document's header
  • Welcoming 6 new Databricks runtimes to our Spark NLP family:
    • Databricks 12.0
    • Databricks 12.0 ML
    • Databricks 12.0 ML GPU
    • Databricks 12.1
    • Databricks 12.1 ML
    • Databricks 12.1 ML GPU
  • Welcoming 2 new EMR 6.x series to our Spark NLP family:
    • EMR 6.8.0 (Apache Spark 3.3.0 / Hadoop 3.2.1)
    • EMR 6.9.0 (Apache Spark 3.3.0 / Hadoop 3.3.3)
  • New article for semantic similarity with Spark NLP on Play/API/Swagger/ https://medium.com/spark-nlp/semantic-similarity-with-sparknlp-da148fafa3d8
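The params argument mentioned in the list above can be sketched as follows. The config keys are ordinary Apache Spark settings picked for illustration, the values are assumptions, and `start_session` is a hypothetical wrapper. The import is deferred inside the function so that the dictionary can be built and inspected even where Spark is not installed.

```python
# Spark NLP and Apache Spark settings collected in one dictionary.
# Keys are standard Spark configs; the values here are illustrative only.
params = {
    "spark.driver.memory": "16G",
    "spark.kryoserializer.buffer.max": "2000M",
    "spark.driver.maxResultSize": "2G",
}


def start_session(conf):
    """Start a Spark NLP session with the given configs (hypothetical wrapper)."""
    import sparknlp  # needs spark-nlp==4.3.0 and a compatible pyspark

    return sparknlp.start(params=conf)


# Show what would be passed to the session.
for key, value in sorted(params.items()):
    print(f"{key}={value}")
```

Calling `start_session(params)` in an environment with Spark NLP installed would start the session with these settings applied.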

Dependencies & Code Changes

  • Update Apache Spark to 3.3.1 (not shipped with Spark NLP)
  • Update GCP to 2.16.0
  • Update ScalaTest to 3.2.14
  • Start publishing spark-nlp-m1 Maven package as spark-nlp-silicon
  • Rename all read model traits to a generic name. A new ai module paving a path to another DL engine
  • Rename TF backends to more generic DL names
  • Refactor more duplicate code in transformer embeddings

💾 Models

Spark NLP 4.3.0 comes with 1000+ state-of-the-art pre-trained transformer models in many languages.

Featured Models

Model | Name | Lang
DistilBertForQuestionAnswering | distilbert_qa_en_de_vi_zh_es_model | xx
DistilBertForQuestionAnswering | distilbert_qa_extractive | en
DistilBertForQuestionAnswering | distilbert_qa_base_cased_squadv2 | xx
RoBertaForQuestionAnswering | roberta_qa_roberta | en
RoBertaForQuestionAnswering | roberta_qa_ca_v2_squac_ca_catalan | ca
T5Transformer | t5_flan_small | xx
T5Transformer | t5_t2t_adex_prompt | en
T5Transformer | t5_punctuation | fr
T5Transformer | t5_jainu | ja
T5Transformer | t5_diversiformer | de
T5Transformer | t5_small_summarization | ro
T5Transformer | t5_uk_summarizer | uk
T5Transformer | t5_vi_small | vi
T5Transformer | t5_mini_nl8 | fi
HubertForCTC | asr_hubert_large_ls960 | en
SwinForImageClassification | image_classifier_swin_tiny_patch4_window7_224 | en
CamemBertForQuestionAnswering | camembert_base_qa_fquad | fr

Spark NLP covers the following languages:

English ,Multilingual ,Afrikaans ,Afro-Asiatic languages ,Albanian ,Altaic languages ,American Sign Language ,Amharic ,Arabic ,Argentine Sign Language ,Armenian ,Artificial languages ,Atlantic-Congo languages ,Austro-Asiatic languages ,Austronesian languages ,Azerbaijani ,Baltic languages ,Bantu languages ,Basque ,Basque (family) ,Belarusian ,Bemba (Zambia) ,Bengali, Bangla ,Berber languages ,Bihari ,Bislama ,Bosnian ,Brazilian Sign Language ,Breton ,Bulgarian ,Catalan ,Caucasian languages ,Cebuano ,Celtic languages ,Central Bikol ,Chichewa, Chewa, Nyanja ,Chilean Sign Language ,Chinese ,Chuukese ,Colombian Sign Language ,Congo Swahili ,Croatian ,Cushitic languages ,Czech ,Danish ,Dholuo, Luo (Kenya and Tanzania) ,Dravidian languages ,Dutch ,East Slavic languages ,Eastern Malayo-Polynesian languages ,Efik ,Esperanto ,Estonian ,Ewe ,Fijian ,Finnish ,Finnish Sign Language ,Finno-Ugrian languages ,French ,French-based creoles and pidgins ,Ga ,Galician ,Ganda ,Georgian ,German ,Germanic languages ,Gilbertese ,Greek (modern) ,Greek languages ,Gujarati ,Gun ,Haitian, Haitian Creole ,Hausa ,Hebrew (modern) ,Hiligaynon ,Hindi ,Hiri Motu ,Hungarian ,Icelandic ,Igbo ,Iloko ,Indic languages ,Indo-European languages ,Indo-Iranian languages ,Indonesian ,Irish ,Isoko ,Isthmus Zapotec ,Italian ,Italic languages ,Japanese ,Kabyle ,Kalaallisut, Greenlandic ,Kannada ,Kaonde ,Kinyarwanda ,Kirundi ,Kongo ,Korean ,Kwangali ,Kwanyama, Kuanyama ,Latin ,Latvian ,Lingala ,Lithuanian ,Louisiana Creole ,Lozi ,Luba-Katanga ,Luba-Lulua ,Lunda ,Lushai ,Luvale ,Macedonian ,Malagasy ,Malay ,Malayalam ,Malayo-Polynesian languages ,Maltese ,Manx ,Marathi (Marāṭhī) ,Marshallese ,Mexican Sign Language ,Mon-Khmer languages ,Morisyen ,Mossi ,Multiple languages ,Ndonga ,Nepali ,Niger-Kordofanian languages ,Nigerian Pidgin ,Niuean ,North Germanic languages ,Northern Sotho, Pedi, Sepedi ,Norwegian ,Norwegian Bokmål ,Norwegian Nynorsk ,Nyaneka ,Oromo ,Pangasinan ,Papiamento ,Persian (Farsi)
,Peruvian Sign Language ,Philippine languages ,Pijin ,Pohnpeian ,Polish ,Portuguese ,Portuguese-based creoles and pidgins ,Punjabi (Eastern) ,Romance languages ,Romanian ,Rundi ,Russian ,Ruund ,Salishan languages ,Samoan ,San Salvador Kongo ,Sango ,Semitic languages ,Serbo-Croatian ,Seselwa Creole French ,Shona ,Sindhi ,Sino-Tibetan languages ,Slavic languages ,Slovak ,Slovene ,Somali ,South Caucasian languages ,South Slavic languages ,Southern Sotho ,Spanish ,Spanish Sign Language ,Sranan Tongo ,Swahili ,Swati ,Swedish ,Tagalog ,Tahitian ,Tai ,Tamil ,Telugu ,Tetela ,Tetun Dili ,Thai ,Tigrinya ,Tiv ,Tok Pisin ,Tonga (Tonga Islands) ,Tonga (Zambia) ,Tsonga ,Tswana ,Tumbuka ,Turkic languages ,Turkish ,Tuvalu ,Tzotzil ,Ukrainian ,Umbundu ,Uralic languages ,Urdu ,Venda ,Venezuelan Sign Language ,Vietnamese ,Wallisian ,Walloon ,Waray (Philippines) ,Welsh ,West Germanic languages ,West Slavic languages ,Western Malayo-Polynesian languages ,Wolaitta, Wolaytta ,Wolof ,Xhosa ,Yapese ,Yiddish ,Yoruba ,Yucatec Maya, Yucateco ,Zande (individual language) ,Zulu

The complete list of all 12600+ models & pipelines in 230+ languages is available on Models Hub

📓 New Notebooks

Notebooks
New params{} in Spark NLP start()
CamemBertForQuestionAnswering
Zero-shot NER

📖 Documentation

Community support

  • Slack For live discussion with the Spark NLP community and the team
  • GitHub Bug reports, feature requests, and contributions
  • Discussions Engage with other community members, share ideas,
    and show off how you use Spark NLP!
  • Medium Spark NLP articles
  • YouTube Spark NLP video tutorials

Installation

Python

#PyPI

pip install spark-nlp==4.3.0

Spark Packages

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x (Scala 2.12):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.0

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.3.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.3.0

Apple Silicon (M1 & M2)

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:4.3.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:4.3.0

AArch64

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.3.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.3.0

Maven

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>4.3.0</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>4.3.0</version>
</dependency>

spark-nlp-silicon:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-silicon_2.12</artifactId>
    <version>4.3.0</version>
</dependency>

spark-nlp-aarch64:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-aarch64_2.12</artifactId>
    <version>4.3.0</version>
</dependency>

FAT JARs

What's Changed

Contributors

@Cabir40 @bunyamin-polat @danilojsl @dcecchini @Meryem1425 @C-K-Loan @agsfer @maziyarpanahi @jfernandrezj @jsl-builder @DevinTDHa @josejuanmartinez @aymanechilah

Full Changelog: 4.2.8...4.3.0