Spark NLP 4.3.0: New HuBERT for speech recognition, new Swin Transformer for Image Classification, new Zero-shot annotator for Entity Recognition, CamemBERT for question answering, new Databricks and EMR with support for Spark 3.3, 1000+ state-of-the-art models and many more!
📢 Overview
We are very excited to release Spark NLP 4.3.0! This has been one of the biggest releases we have ever done, and we are proud to share it with our community!
This release extends image classification support with the new Swin Transformer annotator and speech recognition support with the new HuBERT annotator. It also brings a new extractive, transformer-based question answering (QA) annotator for tasks like SQuAD based on the CamemBERT architecture, new Databricks & EMR releases with support for Spark 3.3, 1000+ state-of-the-art models, and many more enhancements and bug fixes!
We are also celebrating crossing 12600+ free and open-source models & pipelines in our Models Hub. As always, we would like to thank our community for their feedback, questions, and feature requests.
🔥 New Features
HuBERT
NEW: Introducing HubertForCTC annotator in Spark NLP. HubertForCTC can load HuBERT models that match or surpass the SOTA approaches for speech representation learning for speech recognition, generation, and compression. The Hidden-Unit BERT (HuBERT) approach was proposed for self-supervised speech representation learning, which utilizes an offline clustering step to provide aligned target labels for a BERT-like prediction loss. This annotator is compatible with all the models trained/fine-tuned by using HubertForCTC for PyTorch or TFHubertForCTC for TensorFlow models in HuggingFace 🤗
HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.
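The CTC in HubertForCTC stands for connectionist temporal classification: the model emits one label per audio frame, and a transcript is recovered by collapsing repeated labels and removing a blank symbol. Below is a minimal sketch of that greedy decoding step. The vocabulary and frame ids are made up for illustration, and this shows the general technique rather than Spark NLP's internal code.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse consecutive duplicate frame labels, then drop the blank symbol."""
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    return "".join(id_to_char[i] for i in collapsed if i != blank_id)


# Hypothetical toy vocabulary; real models ship their own vocabulary.
vocab = {0: "<blank>", 1: "h", 2: "i"}

# One predicted label id per audio frame (made-up values).
frames = [1, 1, 0, 2, 2, 2, 0, 0]
decoded = ctc_greedy_decode(frames, vocab)
print(decoded)
```

The blank symbol is what lets the same character appear twice in a row: the frames `[1, 0, 1]` decode to two separate characters, while `[1, 1]` collapses to one.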
Swin Transformer
NEW: Introducing SwinForImageClassification annotator in Spark NLP. SwinForImageClassification can load transformer-based deep learning models with state-of-the-art performance in vision tasks. Swin Transformer follows Vision Transformer (ViT) (Dosovitskiy et al., 2020) and delivers strong accuracy and efficiency. This annotator is compatible with all the models trained/fine-tuned by using SwinForImageClassification for PyTorch or TFSwinForImageClassification for TensorFlow models in HuggingFace 🤗
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.
Zero-Shot for Named Entity Recognition
Zero-Shot Learning refers to the process by which a model learns how to recognize objects (image, text, any features) without any labeled training data to help in the classification.
NEW: Introducing ZeroShotNerModel annotator in Spark NLP. You can use the ZeroShotNerModel annotator to construct simple questions/answers mapped to NER labels like PERSON, NORP, etc. We use the RoBERTa for Question Answering architecture under the hood, which allows you to use any of the 460 models available on Models Hub to build your zero-shot entity recognition without any training dataset!
from sparknlp.annotator import ZeroShotNerModel

# Each NER label maps to the questions whose answers should receive that label
zero_shot_ner = ZeroShotNerModel.pretrained("roberta_base_qa_squad2", "en") \
    .setEntityDefinitions(
        {
            "NAME": ["What is his name?", "What is my name?", "What is her name?"],
            "CITY": ["Which city?", "Which is the city?"]
        }) \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("zero_shot_ner")
This powerful annotator with such simple rules can detect those entities from the following input: "My name is Clara, I live in New York and Hellen lives in Paris."
+-----------------------------------------------------------------+------+------+----------+------------------+
|result |result|word |confidence|question |
+-----------------------------------------------------------------+------+------+----------+------------------+
|[My name is Clara, I live in New York and Hellen lives in Paris.]|B-CITY|Paris |0.5328949 |Which is the city?|
|[My name is Clara, I live in New York and Hellen lives in Paris.]|B-NAME|Clara |0.9360068 |What is my name? |
|[My name is Clara, I live in New York and Hellen lives in Paris.]|B-CITY|New |0.83294415|Which city? |
|[My name is Clara, I live in New York and Hellen lives in Paris.]|I-CITY|York |0.83294415|Which city? |
|[My name is Clara, I live in New York and Hellen lives in Paris.]|B-NAME|Hellen|0.45366877|What is her name? |
+-----------------------------------------------------------------+------+------+----------+------------------+
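In the output above, each answer found by a question is turned into B-/I- tagged tokens (for example, New and York become B-CITY and I-CITY). Below is a minimal sketch of that mapping, assuming the answer has already been extracted as a list of tokens. `answer_to_bio` is a hypothetical helper written for illustration and is not part of the Spark NLP API.

```python
def answer_to_bio(tokens, answer_tokens, label):
    """Tag the first occurrence of answer_tokens inside tokens with B-/I- labels."""
    tags = ["O"] * len(tokens)
    n = len(answer_tokens)
    if n == 0:
        return tags
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == answer_tokens:
            tags[start] = "B-" + label  # first token of the entity
            for i in range(start + 1, start + n):
                tags[i] = "I-" + label  # remaining tokens of the entity
            break
    return tags


tokens = ["I", "live", "in", "New", "York"]
tags = answer_to_bio(tokens, ["New", "York"], "CITY")
print(list(zip(tokens, tags)))
```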
CamemBERT for Question Answering
NEW: Introducing CamemBertForQuestionAnswering annotator in Spark NLP. CamemBertForQuestionAnswering can load CamemBERT models with a span classification head on top for extractive question-answering tasks like SQuAD (linear layers on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using CamembertForQuestionAnswering for PyTorch or TFCamembertForQuestionAnswering for TensorFlow in HuggingFace 🤗
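To make the span start and span end logits more concrete, here is a minimal sketch of how an answer span can be picked from them: choose the pair (start, end) with start <= end that maximizes the sum of the two logits. The logits and the `max_answer_len` parameter below are made-up illustration values, and the annotator's actual post-processing may differ.

```python
def best_span(start_logits, end_logits, max_answer_len=15):
    """Return (start, end, score) maximizing start_logit + end_logit with start <= end."""
    best = None
    for s, s_logit in enumerate(start_logits):
        # Only consider spans of at most max_answer_len tokens.
        last = min(s + max_answer_len, len(end_logits))
        for e in range(s, last):
            score = s_logit + end_logits[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best


tokens = ["Paris", "est", "la", "capitale", "de", "la", "France"]
start_logits = [0.1, 0.0, 0.2, 0.3, 0.1, 1.5, 2.0]  # one score per token
end_logits = [0.2, 0.0, 0.1, 0.4, 0.0, 0.3, 2.5]    # one score per token

start, end, score = best_span(start_logits, end_logits)
answer = " ".join(tokens[start:end + 1])
print(answer)
```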
Models Hub
Models Hub now offers a new filter by annotator, which should make it easier to navigate and find models.
Improvements & Bug Fixes
- New Date2Chunk annotator to convert DATE outputs coming from DateMatcher and MultiDateMatcher annotators to CHUNK, which is accepted by a wider range of annotators
- Spark NLP 4.3.0 supports Apple Silicon M1 and M2 (still under experimental status until GitHub supports Apple Silicon officially). We have refactored the name m1 to silicon and apple_silicon in our code for better clarity
- Add new templates for issues, docs, and feature requests on GitHub
- Add a new log4j2 properties file for Spark 3.3.x, which comes with Log4j 2.x, to control the logs on Apache Spark
- Cross compatibility for all saved pipelines for all major releases of Apache Spark and PySpark
- Relocate Spark NLP examples to the examples directory in our main repository. We will update them on each release, keep a history of the changes for each version, and add more languages, especially more use cases with Java and Scala
- Add PyDoc documentation for ResourceDownloader in Python (clearCache(), showPublicModels(), showPublicPipelines(), and showAvailableAnnotators())
- Fix calculating delimiter id in CamemBERT annotators. The delimiter id is actually correct and doesn't need any offset
- Fix AnalysisException handling, which requires a different caught message for Spark 3.3
- Fix copying existing models & pipelines on S3 before unzipping when cache_pretrained is defined as an S3 bucket
- Fix copying existing models & pipelines on GCP before unzipping when cache_pretrained is defined as a GCP bucket
- Fix loadSavedModel() trying to load external models for private buckets on S3 with better error handling and warnings
- Enable the params argument in the Spark NLP start function. You can create a params = {} with all Spark NLP and Apache Spark configs and pass it when starting the Spark NLP session
- Add support for doc id in the CoNLL() class when trying to read CoNLL files with id inside each document's header
- Welcoming 6 new Databricks runtimes to our Spark NLP family:
  - Databricks 12.0
  - Databricks 12.0 ML
  - Databricks 12.0 ML GPU
  - Databricks 12.1
  - Databricks 12.1 ML
  - Databricks 12.1 ML GPU
- Welcoming 2 new EMR 6.x series to our Spark NLP family:
  - EMR 6.8.0 (Apache Spark 3.3.0 / Hadoop 3.2.1)
  - EMR 6.9.0 (Apache Spark 3.3.0 / Hadoop 3.3.3)
- New article for semantic similarity with Spark NLP on Play/API/Swagger/ https://medium.com/spark-nlp/semantic-similarity-with-sparknlp-da148fafa3d8
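The params argument of the Spark NLP start function, listed among the improvements above, can be used roughly as sketched here. `build_params` and `start_session` are hypothetical helper names, and the values are only examples. The keys are standard Apache Spark properties. Calling `start_session` requires the spark-nlp and pyspark packages to be installed.

```python
def build_params(driver_memory="16G", max_result_size="2000M"):
    """Collect Apache Spark settings in one dict for sparknlp.start(params=...)."""
    return {
        "spark.driver.memory": driver_memory,
        "spark.driver.maxResultSize": max_result_size,
        "spark.kryoserializer.buffer.max": "2000M",
    }


def start_session(params):
    """Start a Spark NLP session with the given configs (needs spark-nlp and pyspark)."""
    import sparknlp  # imported lazily so the dict can be built without Spark installed
    return sparknlp.start(params=params)


params = build_params(driver_memory="8G")
print(params)
# spark = start_session(params)  # uncomment where Spark NLP is installed
```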
Dependencies & Code Changes
- Update Apache Spark to 3.3.1 (not shipped with Spark NLP)
- Update GCP to 2.16.0
- Update Scala test to 3.2.14
- Start publishing the spark-nlp-m1 Maven package as spark-nlp-silicon
- Rename all read model traits to a generic name. A new ai module paves a path to another DL engine
- Rename TF backends to more generic DL names
- Refactor more duplicated code in transformer embeddings
💾 Models
Spark NLP 4.3.0 comes with 1000+ state-of-the-art pre-trained transformer models in many languages.
Featured Models
Model | Name | Lang |
---|---|---|
DistilBertForQuestionAnswering | distilbert_qa_en_de_vi_zh_es_model | xx |
DistilBertForQuestionAnswering | distilbert_qa_extractive | en |
DistilBertForQuestionAnswering | distilbert_qa_base_cased_squadv2 | xx |
RoBertaForQuestionAnswering | roberta_qa_roberta | en |
RoBertaForQuestionAnswering | roberta_qa_ca_v2_squac_ca_catalan | ca |
T5Transformer | t5_flan_small | xx |
T5Transformer | t5_t2t_adex_prompt | en |
T5Transformer | t5_punctuation | fr |
T5Transformer | t5_jainu | ja |
T5Transformer | t5_diversiformer | de |
T5Transformer | t5_small_summarization | ro |
T5Transformer | t5_uk_summarizer | uk |
T5Transformer | t5_vi_small | vi |
T5Transformer | t5_mini_nl8 | fi |
HubertForCTC | asr_hubert_large_ls960 | en |
SwinForImageClassification | image_classifier_swin_tiny_patch4_window7_224 | en |
CamemBertForQuestionAnswering | camembert_base_qa_fquad | fr |
Spark NLP covers the following languages:
English; Multilingual; Afrikaans; Afro-Asiatic languages; Albanian; Altaic languages; American Sign Language; Amharic; Arabic; Argentine Sign Language; Armenian; Artificial languages; Atlantic-Congo languages; Austro-Asiatic languages; Austronesian languages; Azerbaijani; Baltic languages; Bantu languages; Basque; Basque (family); Belarusian; Bemba (Zambia); Bengali, Bangla; Berber languages; Bihari; Bislama; Bosnian; Brazilian Sign Language; Breton; Bulgarian; Catalan; Caucasian languages; Cebuano; Celtic languages; Central Bikol; Chichewa, Chewa, Nyanja; Chilean Sign Language; Chinese; Chuukese; Colombian Sign Language; Congo Swahili; Croatian; Cushitic languages; Czech; Danish; Dholuo, Luo (Kenya and Tanzania); Dravidian languages; Dutch; East Slavic languages; Eastern Malayo-Polynesian languages; Efik; Esperanto; Estonian; Ewe; Fijian; Finnish; Finnish Sign Language; Finno-Ugrian languages; French; French-based creoles and pidgins; Ga; Galician; Ganda; Georgian; German; Germanic languages; Gilbertese; Greek (modern); Greek languages; Gujarati; Gun; Haitian, Haitian Creole; Hausa; Hebrew (modern); Hiligaynon; Hindi; Hiri Motu; Hungarian; Icelandic; Igbo; Iloko; Indic languages; Indo-European languages; Indo-Iranian languages; Indonesian; Irish; Isoko; Isthmus Zapotec; Italian; Italic languages; Japanese; Kabyle; Kalaallisut, Greenlandic; Kannada; Kaonde; Kinyarwanda; Kirundi; Kongo; Korean; Kwangali; Kwanyama, Kuanyama; Latin; Latvian; Lingala; Lithuanian; Louisiana Creole; Lozi; Luba-Katanga; Luba-Lulua; Lunda; Lushai; Luvale; Macedonian; Malagasy; Malay; Malayalam; Malayo-Polynesian languages; Maltese; Manx; Marathi (Marāṭhī); Marshallese; Mexican Sign Language; Mon-Khmer languages; Morisyen; Mossi; Multiple languages; Ndonga; Nepali; Niger-Kordofanian languages; Nigerian Pidgin; Niuean; North Germanic languages; Northern Sotho, Pedi, Sepedi; Norwegian; Norwegian Bokmål; Norwegian Nynorsk; Nyaneka; Oromo; Pangasinan; Papiamento; Persian (Farsi); Peruvian Sign Language; Philippine languages; Pijin; Pohnpeian; Polish; Portuguese; Portuguese-based creoles and pidgins; Punjabi (Eastern); Romance languages; Romanian; Rundi; Russian; Ruund; Salishan languages; Samoan; San Salvador Kongo; Sango; Semitic languages; Serbo-Croatian; Seselwa Creole French; Shona; Sindhi; Sino-Tibetan languages; Slavic languages; Slovak; Slovene; Somali; South Caucasian languages; South Slavic languages; Southern Sotho; Spanish; Spanish Sign Language; Sranan Tongo; Swahili; Swati; Swedish; Tagalog; Tahitian; Tai; Tamil; Telugu; Tetela; Tetun Dili; Thai; Tigrinya; Tiv; Tok Pisin; Tonga (Tonga Islands); Tonga (Zambia); Tsonga; Tswana; Tumbuka; Turkic languages; Turkish; Tuvalu; Tzotzil; Ukrainian; Umbundu; Uralic languages; Urdu; Venda; Venezuelan Sign Language; Vietnamese; Wallisian; Walloon; Waray (Philippines); Welsh; West Germanic languages; West Slavic languages; Western Malayo-Polynesian languages; Wolaitta, Wolaytta; Wolof; Xhosa; Yapese; Yiddish; Yoruba; Yucatec Maya, Yucateco; Zande (individual language); Zulu
The complete list of all 12600+ models & pipelines in 230+ languages is available on Models Hub
New Notebooks
Notebooks |
---|
New params{} in Spark NLP start() |
CamemBertForQuestionAnswering |
Zero-shot NER |
- You can visit Import Transformers in Spark NLP
- You can visit Spark NLP Examples for 100+ examples
Documentation
- Import models from TF Hub & HuggingFace
- Spark NLP Notebooks
- Models Hub with new models
- Spark NLP Articles
- Spark NLP in Action
- Spark NLP Documentation
- Spark NLP Scala APIs
- Spark NLP Python APIs
Community support
- Slack For live discussion with the Spark NLP community and the team
- GitHub Bug reports, feature requests, and contributions
- Discussions Engage with other community members, share ideas, and show off how you use Spark NLP!
- Medium Spark NLP articles
- YouTube Spark NLP video tutorials
Installation
Python
#PyPI
pip install spark-nlp==4.3.0
Spark Packages
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x (Scala 2.12):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.0
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.3.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.3.0
Apple Silicon (M1 & M2)
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:4.3.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:4.3.0
AArch64
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.3.0
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.3.0
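The four package coordinates above follow one naming pattern: an optional variant suffix on the artifact name, then the Scala version, then the release version. A small sketch that builds the coordinate for a given variant is shown here. `spark_nlp_package` is a hypothetical helper for illustration and is not part of Spark NLP.

```python
def spark_nlp_package(variant="cpu", version="4.3.0", scala="2.12"):
    """Build the Maven coordinate passed to --packages for a Spark NLP variant."""
    suffixes = {"cpu": "", "gpu": "-gpu", "silicon": "-silicon", "aarch64": "-aarch64"}
    if variant not in suffixes:
        raise ValueError(f"unknown variant: {variant!r}")
    return f"com.johnsnowlabs.nlp:spark-nlp{suffixes[variant]}_{scala}:{version}"


for name in ("cpu", "gpu", "silicon", "aarch64"):
    print(f"pyspark --packages {spark_nlp_package(name)}")
```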
Maven
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp_2.12</artifactId>
<version>4.3.0</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu_2.12</artifactId>
<version>4.3.0</version>
</dependency>
spark-nlp-silicon:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-silicon_2.12</artifactId>
<version>4.3.0</version>
</dependency>
spark-nlp-aarch64:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-aarch64_2.12</artifactId>
<version>4.3.0</version>
</dependency>
FAT JARs
- CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-4.3.0.jar
- GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-4.3.0.jar
- M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-silicon-assembly-4.3.0.jar
- AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-4.3.0.jar
What's Changed
Contributors
@Cabir40 @bunyamin-polat @danilojsl @dcecchini @Meryem1425 @C-K-Loan @agsfer @maziyarpanahi @jfernandrezj @jsl-builder @DevinTDHa @josejuanmartinez @aymanechilah
Full Changelog: 4.2.8...4.3.0