Release Spark NLP 4.3.0: New HuBERT for speech recognition, new Swin Transformer for Image Classification, new Zero-shot annotator for Entity Recognition, CamemBERT for question answering, new Databricks and EMR with support for Spark 3.3, 1000+ state-of-the-art models and many more! · JohnSnowLabs/spark-nlp

📢 Overview

We are very excited to release Spark NLP 🚀 4.3.0! This has been one of the biggest releases we have ever done and we are so proud to share this with our community! 🎉

This release adds another Image Classification architecture with Swin Transformer, extends speech recognition support with the new HuBERT annotator, and introduces a new extractive transformer-based Question Answering (QA) annotator based on the CamemBERT architecture for tasks like SQuAD. It also brings new Databricks & EMR runtimes with support for Spark 3.3, 1000+ state-of-the-art models, and many more enhancements and bug fixes!

We are also celebrating crossing 12600+ free and open-source models & pipelines in our Models Hub. 🎉 As always, we would like to thank our community for their feedback, questions, and feature requests.

🔥 New Features

HuBERT

NEW: Introducing HubertForCTC annotator in Spark NLP 🚀. HubertForCTC can load HuBERT models that match or surpass the SOTA approaches for speech representation learning for speech recognition, generation, and compression. The Hidden-Unit BERT (HuBERT) approach was proposed for self-supervised speech representation learning; it utilizes an offline clustering step to provide aligned target labels for a BERT-like prediction loss. This annotator is compatible with all the models trained/fine-tuned by using HubertForCTC for PyTorch or TFHubertForCTC for TensorFlow models in HuggingFace 🤗

HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.
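
As a rough illustration of how the annotator could slot into a pipeline, the sketch below wires an AudioAssembler into HubertForCTC. It assumes a Spark NLP session is already running and that the input DataFrame holds float audio samples in a column named audio_content (a column name chosen here for illustration). The model name comes from the Featured Models table in these notes; the helper name build_asr_pipeline is hypothetical.

```python
# Model name taken from the Featured Models table in these notes.
HUBERT_MODEL = "asr_hubert_large_ls960"


def build_asr_pipeline(model_name=HUBERT_MODEL, lang="en"):
    """Hypothetical helper: AudioAssembler -> HubertForCTC.

    Assumes a Spark session was already started (e.g. with sparknlp.start()).
    Imports are local so that defining the helper does not require Spark.
    """
    from pyspark.ml import Pipeline
    from sparknlp.base import AudioAssembler
    from sparknlp.annotator import HubertForCTC

    # Wrap raw audio samples (array of floats) into an audio annotation.
    # "audio_content" is an assumed input column name.
    audio_assembler = AudioAssembler() \
        .setInputCol("audio_content") \
        .setOutputCol("audio_assembler")

    # Transcribe the audio annotation into text.
    speech_to_text = HubertForCTC.pretrained(model_name, lang) \
        .setInputCols(["audio_assembler"]) \
        .setOutputCol("text")

    return Pipeline(stages=[audio_assembler, speech_to_text])
```

Calling build_asr_pipeline().fit(df).transform(df) on such a DataFrame should then add a text column with the transcription.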

Swin Transformer

NEW: Introducing SwinForImageClassification annotator in Spark NLP 🚀. SwinForImageClassification can load transformer-based deep learning models with state-of-the-art performance in vision tasks. Swin Transformer follows Vision Transformer (ViT) (Dosovitskiy et al., 2020) and offers great accuracy and efficiency. This annotator is compatible with all the models trained/fine-tuned by using SwinForImageClassification for PyTorch or TFSwinForImageClassification for TensorFlow models in HuggingFace 🤗

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.
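
A minimal sketch of an image classification pipeline built around this annotator might look as follows. It assumes a running Spark NLP session and a DataFrame read with Spark's image data source, which exposes an image column. The model name is the one listed under Featured Models; the helper name build_image_pipeline is hypothetical.

```python
# Model name taken from the Featured Models table in these notes.
SWIN_MODEL = "image_classifier_swin_tiny_patch4_window7_224"


def build_image_pipeline(model_name=SWIN_MODEL, lang="en"):
    """Hypothetical helper: ImageAssembler -> SwinForImageClassification.

    Assumes a Spark session was already started (e.g. with sparknlp.start()).
    Imports are local so that defining the helper does not require Spark.
    """
    from pyspark.ml import Pipeline
    from sparknlp.base import ImageAssembler
    from sparknlp.annotator import SwinForImageClassification

    # Convert Spark's "image" struct column into an image annotation.
    image_assembler = ImageAssembler() \
        .setInputCol("image") \
        .setOutputCol("image_assembler")

    # Predict a class label for each image.
    image_classifier = SwinForImageClassification.pretrained(model_name, lang) \
        .setInputCols(["image_assembler"]) \
        .setOutputCol("class")

    return Pipeline(stages=[image_assembler, image_classifier])
```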

Zero-Shot for Named Entity Recognition

Zero-Shot Learning refers to the process by which a model learns how to recognize objects (image, text, any features) without any labeled training data to help in the classification.

NEW: Introducing ZeroShotNerModel annotator in Spark NLP 🚀. You can use the ZeroShotNerModel annotator to construct simple questions/answers mapped to NER labels like PERSON, NORP, etc. We use the RoBERTa for Question Answering architecture under the hood, which allows you to use any of the 460 models available on Models Hub to build your zero-shot Entity Recognition without any training dataset!

from sparknlp.annotator import ZeroShotNerModel

# Each NER label maps to the questions whose answers should receive that label.
zero_shot_ner = ZeroShotNerModel.pretrained("roberta_base_qa_squad2", "en") \
    .setEntityDefinitions(
        {
            "NAME": ["What is his name?", "What is my name?", "What is her name?"],
            "CITY": ["Which city?", "Which is the city?"]
        }) \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("zero_shot_ner")

This powerful annotator with such simple rules can detect those entities from the following input: "My name is Clara, I live in New York and Hellen lives in Paris."

+-----------------------------------------------------------------+------+------+----------+------------------+
|result                                                           |result|word  |confidence|question          |
+-----------------------------------------------------------------+------+------+----------+------------------+
|[My name is Clara, I live in New York and Hellen lives in Paris.]|B-CITY|Paris |0.5328949 |Which is the city?|
|[My name is Clara, I live in New York and Hellen lives in Paris.]|B-NAME|Clara |0.9360068 |What is my name?  |
|[My name is Clara, I live in New York and Hellen lives in Paris.]|B-CITY|New   |0.83294415|Which city?       |
|[My name is Clara, I live in New York and Hellen lives in Paris.]|I-CITY|York  |0.83294415|Which city?       |
|[My name is Clara, I live in New York and Hellen lives in Paris.]|B-NAME|Hellen|0.45366877|What is her name? |
+-----------------------------------------------------------------+------+------+----------+------------------+
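
The tags in the table use the B-/I- prefix convention, so a multi-word entity such as "New York" spans two rows. The plain-Python sketch below shows one way to merge such rows into whole entities. The function name group_entities is hypothetical, and the sketch assumes each I- row directly follows the row it continues, as in the table. Inside a pipeline, Spark NLP's NerConverter annotator is the usual tool for this step.

```python
def group_entities(rows):
    """Merge (tag, word) pairs with B-/I- prefixes into (label, text) entities."""
    entities = []
    for tag, word in rows:
        prefix, _, label = tag.partition("-")
        if prefix == "I" and entities and entities[-1][0] == label:
            # Continuation of the previous entity: append the word.
            entities[-1] = (label, entities[-1][1] + " " + word)
        elif prefix in ("B", "I"):
            # Start of a new entity (a stray I- tag is treated as a start).
            entities.append((label, word))
        # Any other tag (e.g. "O") is ignored.
    return entities


# (tag, word) pairs copied from the table, in the same order.
rows = [
    ("B-CITY", "Paris"),
    ("B-NAME", "Clara"),
    ("B-CITY", "New"),
    ("I-CITY", "York"),
    ("B-NAME", "Hellen"),
]
print(group_entities(rows))
# [('CITY', 'Paris'), ('NAME', 'Clara'), ('CITY', 'New York'), ('NAME', 'Hellen')]
```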

CamemBERT for Question Answering

NEW: Introducing CamemBertForQuestionAnswering annotator in Spark NLP 🚀. CamemBertForQuestionAnswering can load CamemBERT Models with a span classification head on top for extractive question-answering tasks like SQuAD (linear layers on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using CamembertForQuestionAnswering for PyTorch or TFCamembertForQuestionAnswering for TensorFlow in HuggingFace 🤗
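
To make the span start and span end logits more concrete, here is a small plain-Python sketch of how an answer span can be picked from them: choose the pair (start, end) with start <= end that maximizes the sum of the two logits. This is a common decoding scheme for extractive QA, shown for illustration only and not necessarily how the annotator decodes internally. The function name best_span, the max_answer_len parameter, and the toy logits are all assumptions.

```python
def best_span(start_logits, end_logits, max_answer_len=15):
    """Return (start, end, score) maximizing start_logits[s] + end_logits[e], s <= e."""
    best = None
    for s, s_score in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit.
        last = min(len(end_logits), s + max_answer_len)
        for e in range(s, last):
            score = s_score + end_logits[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best


# Toy example: one logit per token of the context.
tokens = ["Clara", "aime", "Paris"]
start_logits = [0.1, 0.2, 2.5]
end_logits = [0.0, 0.1, 3.0]

s, e, _ = best_span(start_logits, end_logits)
print(" ".join(tokens[s:e + 1]))  # prints "Paris"
```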

Models Hub

Models Hub introduces a new filter by annotator, which should make it easier to navigate and find models.

⭐🐛 Improvements & Bug Fixes

New Date2Chunk annotator to convert DATE outputs coming from DateMatcher and MultiDateMatcher annotators to CHUNK, which is accepted by a wider range of annotators
Spark NLP 4.3.0 supports Apple Silicon M1 and M2 (still under experimental status until GitHub supports Apple Silicon officially). We have refactored the name m1 to silicon and apple_silicon in our code for better clarity
Add new templates for issues, docs, and feature requests on GitHub
Add new log4j2 properties for Spark 3.3.x, which ships with Log4j 2.x, to control the logs on Apache Spark
Cross compatibility for all saved pipelines for all major releases of Apache Spark and PySpark
Relocate Spark NLP examples to the examples directory in our main repository. We will update them on each release, keep a history of the changes for each version, and add more languages, especially more use cases with Java and Scala
Add PyDoc documentation for ResourceDownloader in Python (clearCache(), showPublicModels(), showPublicPipelines(), and showAvailableAnnotators() )
Fix calculating delimiter id in CamemBERT annotators. The delimiter id is actually correct and doesn't need any offset
Fix AnalysisException handling, which requires catching a different message on Spark 3.3
Fix copying existing models & pipelines on S3 before unzipping when cache_pretrained is defined as S3 bucket
Fix copying existing models & pipelines on GCP before unzipping when cache_pretrained is defined as GCP bucket
Fix loadSavedModel() trying to load external models for private buckets on S3 with better error handling and warnings
Enable the params argument in the Spark NLP start function. You can create a params = {} with all Spark NLP and Apache Spark configs and pass it when starting the Spark NLP session
Add support for doc id in CoNLL() class when trying to read CoNLL files with id inside each document's header
Welcoming 6 new Databricks runtimes to our Spark NLP family:
- Databricks 12.0
- Databricks 12.0 ML
- Databricks 12.0 ML GPU
- Databricks 12.1
- Databricks 12.1 ML
- Databricks 12.1 ML GPU
Welcoming 2 new EMR 6.x series to our Spark NLP family:
- EMR 6.8.0 (Apache Spark 3.3.0 / Hadoop 3.2.1)
- EMR 6.9.0 (Apache Spark 3.3.0 / Hadoop 3.3.3)
New article for semantic similarity with Spark NLP on Play/API/Swagger/ https://medium.com/spark-nlp/semantic-similarity-with-sparknlp-da148fafa3d8
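
The params argument of the start function mentioned in the list above can be used roughly as sketched below. The keys are standard Apache Spark properties, the values are illustrative rather than recommendations, and the wrapper name start_session is hypothetical.

```python
# Spark and Spark NLP settings collected in one dictionary.
# Keys are standard Apache Spark properties; values are illustrative only.
params = {
    "spark.driver.memory": "16G",
    "spark.kryoserializer.buffer.max": "2000M",
    "spark.driver.maxResultSize": "2000M",
}


def start_session(conf):
    """Hypothetical wrapper: start a Spark NLP session with the given configs."""
    # Imported lazily so the dictionary above can be built without Spark NLP.
    import sparknlp

    return sparknlp.start(params=conf)
```

With spark-nlp installed, spark = start_session(params) should return a SparkSession configured with these values.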

Dependencies & Code Changes

Update Apache Spark to 3.3.1 (not shipped with Spark NLP)
Update GCP to 2.16.0
Update Scala test to 3.2.14
Start publishing spark-nlp-m1 Maven package as spark-nlp-silicon
Rename all read model traits to a generic name. A new ai module paving a path to another DL engine
Rename TF backends to more generic DL names
Refactor more duplicated code in transformer embeddings

💾 Models

Spark NLP 4.3.0 comes with 1000+ state-of-the-art pre-trained transformer models in many languages.

Featured Models

Model	Name	Lang
DistilBertForQuestionAnswering	distilbert_qa_en_de_vi_zh_es_model	`xx`
DistilBertForQuestionAnswering	distilbert_qa_extractive	`en`
DistilBertForQuestionAnswering	distilbert_qa_base_cased_squadv2	`xx`
RoBertaForQuestionAnswering	roberta_qa_roberta	`en`
RoBertaForQuestionAnswering	roberta_qa_ca_v2_squac_ca_catalan	`ca`
T5Transformer	t5_flan_small	`xx`
T5Transformer	t5_t2t_adex_prompt	`en`
T5Transformer	t5_punctuation	`fr`
T5Transformer	t5_jainu	`ja`
T5Transformer	t5_diversiformer	`de`
T5Transformer	t5_small_summarization	`ro`
T5Transformer	t5_uk_summarizer	`uk`
T5Transformer	t5_vi_small	`vi`
T5Transformer	t5_mini_nl8	`fi`
HubertForCTC	asr_hubert_large_ls960	`en`
SwinForImageClassification	image_classifier_swin_tiny_patch4_window7_224	`en`
CamemBertForQuestionAnswering	camembert_base_qa_fquad	`fr`

Spark NLP covers the following languages:

English; Multilingual; Afrikaans; Afro-Asiatic languages; Albanian; Altaic languages; American Sign Language; Amharic; Arabic; Argentine Sign Language; Armenian; Artificial languages; Atlantic-Congo languages; Austro-Asiatic languages; Austronesian languages; Azerbaijani; Baltic languages; Bantu languages; Basque; Basque (family); Belarusian; Bemba (Zambia); Bengali, Bangla; Berber languages; Bihari; Bislama; Bosnian; Brazilian Sign Language; Breton; Bulgarian; Catalan; Caucasian languages; Cebuano; Celtic languages; Central Bikol; Chichewa, Chewa, Nyanja; Chilean Sign Language; Chinese; Chuukese; Colombian Sign Language; Congo Swahili; Croatian; Cushitic languages; Czech; Danish; Dholuo, Luo (Kenya and Tanzania); Dravidian languages; Dutch; East Slavic languages; Eastern Malayo-Polynesian languages; Efik; Esperanto; Estonian; Ewe; Fijian; Finnish; Finnish Sign Language; Finno-Ugrian languages; French; French-based creoles and pidgins; Ga; Galician; Ganda; Georgian; German; Germanic languages; Gilbertese; Greek (modern); Greek languages; Gujarati; Gun; Haitian, Haitian Creole; Hausa; Hebrew (modern); Hiligaynon; Hindi; Hiri Motu; Hungarian; Icelandic; Igbo; Iloko; Indic languages; Indo-European languages; Indo-Iranian languages; Indonesian; Irish; Isoko; Isthmus Zapotec; Italian; Italic languages; Japanese; Kabyle; Kalaallisut, Greenlandic; Kannada; Kaonde; Kinyarwanda; Kirundi; Kongo; Korean; Kwangali; Kwanyama, Kuanyama; Latin; Latvian; Lingala; Lithuanian; Louisiana Creole; Lozi; Luba-Katanga; Luba-Lulua; Lunda; Lushai; Luvale; Macedonian; Malagasy; Malay; Malayalam; Malayo-Polynesian languages; Maltese; Manx; Marathi (Marāṭhī); Marshallese; Mexican Sign Language; Mon-Khmer languages; Morisyen; Mossi; Multiple languages; Ndonga; Nepali; Niger-Kordofanian languages; Nigerian Pidgin; Niuean; North Germanic languages; Northern Sotho, Pedi, Sepedi; Norwegian; Norwegian Bokmål; Norwegian Nynorsk; Nyaneka; Oromo; Pangasinan; Papiamento; Persian (Farsi); Peruvian Sign Language; Philippine languages; Pijin; Pohnpeian; Polish; Portuguese; Portuguese-based creoles and pidgins; Punjabi (Eastern); Romance languages; Romanian; Rundi; Russian; Ruund; Salishan languages; Samoan; San Salvador Kongo; Sango; Semitic languages; Serbo-Croatian; Seselwa Creole French; Shona; Sindhi; Sino-Tibetan languages; Slavic languages; Slovak; Slovene; Somali; South Caucasian languages; South Slavic languages; Southern Sotho; Spanish; Spanish Sign Language; Sranan Tongo; Swahili; Swati; Swedish; Tagalog; Tahitian; Tai; Tamil; Telugu; Tetela; Tetun Dili; Thai; Tigrinya; Tiv; Tok Pisin; Tonga (Tonga Islands); Tonga (Zambia); Tsonga; Tswana; Tumbuka; Turkic languages; Turkish; Tuvalu; Tzotzil; Ukrainian; Umbundu; Uralic languages; Urdu; Venda; Venezuelan Sign Language; Vietnamese; Wallisian; Walloon; Waray (Philippines); Welsh; West Germanic languages; West Slavic languages; Western Malayo-Polynesian languages; Wolaitta, Wolaytta; Wolof; Xhosa; Yapese; Yiddish; Yoruba; Yucatec Maya, Yucateco; Zande (individual language); Zulu

The complete list of all 12600+ models & pipelines in 230+ languages is available on Models Hub

📓 New Notebooks

Notebooks
New params{} in Spark NLP start()
CamemBertForQuestionAnswering
Zero-shot NER

You can visit Import Transformers in Spark NLP
You can visit Spark NLP Examples for 100+ examples

📖 Documentation

Community support

Slack: For live discussion with the Spark NLP community and the team
GitHub: Bug reports, feature requests, and contributions
Discussions: Engage with other community members, share ideas, and show off how you use Spark NLP!
Medium: Spark NLP articles
YouTube: Spark NLP video tutorials

Installation

Python

#PyPI

pip install spark-nlp==4.3.0

Spark Packages

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x (Scala 2.12):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.0

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.3.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.3.0

Apple Silicon (M1 & M2)

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:4.3.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:4.3.0

AArch64

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.3.0

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.3.0

Maven

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>4.3.0</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>4.3.0</version>
</dependency>

spark-nlp-silicon:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-silicon_2.12</artifactId>
    <version>4.3.0</version>
</dependency>

spark-nlp-aarch64:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-aarch64_2.12</artifactId>
    <version>4.3.0</version>
</dependency>

FAT JARs

CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-4.3.0.jar
GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-4.3.0.jar
Apple Silicon (M1 & M2) on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-silicon-assembly-4.3.0.jar
AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-4.3.0.jar

What's Changed

Contributors

@Cabir40 @bunyamin-polat @danilojsl @dcecchini @Meryem1425 @C-K-Loan @agsfer @maziyarpanahi @jfernandrezj @jsl-builder @DevinTDHa @josejuanmartinez @aymanechilah

Full Changelog: 4.2.8...4.3.0
