John Snow Labs Spark-NLP 4.3.0: New HuBERT for speech recognition, new Swin Transformer for Image Classification, new Zero-shot annotator for Entity Recognition, CamemBERT for question answering, new Databricks and EMR with support for Spark 3.3, 1000+ state-of-the-art models and many more! #13494
maziyarpanahi announced in Announcement
📢 Overview
We are very excited to release Spark NLP 🚀 4.3.0! This has been one of the biggest releases we have ever done and we are so proud to share this with our community! 🎉
This release extends support for Image Classification by introducing the Swin Transformer, extends support for speech recognition by introducing the HuBERT annotator, and adds a brand new extractive transformer-based Question Answering (QA) annotator for tasks like SQuAD, based on the CamemBERT architecture. It also brings new Databricks & EMR runtimes with support for Spark 3.3, 1000+ state-of-the-art models, and many more enhancements and bug fixes!
We are also celebrating crossing 12600+ free and open-source models & pipelines in our Models Hub. 🎉 As always, we would like to thank our community for their feedback, questions, and feature requests.
🔥 New Features
HuBERT
NEW: Introducing HubertForCTC annotator in Spark NLP 🚀.
HubertForCTC can load HuBERT models that match or surpass the SOTA approaches for speech representation learning for speech recognition, generation, and compression. The Hidden-Unit BERT (HuBERT) approach was proposed for self-supervised speech representation learning; it uses an offline clustering step to provide aligned target labels for a BERT-like prediction loss. This annotator is compatible with all the models trained/fine-tuned by using HubertForCTC for PyTorch or TFHubertForCTC for TensorFlow models in HuggingFace 🤗
Swin Transformer
NEW: Introducing SwinForImageClassification annotator in Spark NLP 🚀.
SwinForImageClassification can load transformer-based deep learning models with state-of-the-art performance in vision tasks. Swin Transformer builds on the Vision Transformer (ViT) (Dosovitskiy et al., 2020) and delivers great accuracy and efficiency. This annotator is compatible with all the models trained/fine-tuned by using SwinForImageClassification for PyTorch or TFSwinForImageClassification for TensorFlow models in HuggingFace 🤗
Zero-Shot for Named Entity Recognition
NEW: Introducing ZeroShotNerModel annotator in Spark NLP 🚀. You can use the ZeroShotNerModel annotator to construct simple questions/answers mapped to NER labels like PERSON, NORP, etc. We use the RoBERTa for Question Answering architecture under the hood, which allows you to use any of the 460 models available on Models Hub to build your zero-shot Entity Recognition with no training dataset!
With simple rules of this kind, this powerful annotator can detect those entities from the following input:
"My name is Clara, I live in New York and Hellen lives in Paris."
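As a minimal sketch of such rules, each entity label can be mapped to a few plain-language questions. The labels (`NAME`, `CITY`), the questions, the column names, and the helper `build_zero_shot_ner` are illustrative assumptions, not part of the release. The helper relies on `ZeroShotNerModel.pretrained()` with its default model and on `setEntityDefinitions`, and it needs Spark NLP installed plus a running Spark session.

```python
# Illustrative label -> questions mapping (assumed, not taken from the release notes).
entity_definitions = {
    "NAME": ["What is his name?", "What is my name?", "What is her name?"],
    "CITY": ["Which city?", "Which is the city?"],
}


def build_zero_shot_ner(definitions):
    """Return a ZeroShotNerModel configured with the given questions.

    Spark NLP is imported lazily so the mapping above can be inspected
    without the library installed. Calling this function downloads the
    default pretrained model and requires an active Spark session.
    """
    from sparknlp.annotator import ZeroShotNerModel

    return (
        ZeroShotNerModel.pretrained()
        .setInputCols(["sentence", "token"])  # assumed upstream column names
        .setOutputCol("zero_shot_ner")
        .setEntityDefinitions(definitions)
    )
```

The returned annotator would sit after a document assembler, sentence detector, and tokenizer in a regular pipeline.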
CamemBERT for Question Answering
NEW: Introducing CamemBertForQuestionAnswering annotator in Spark NLP 🚀.
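Extractive QA heads of this kind score every token as a possible answer start and as a possible answer end. The sketch below shows, in plain Python, one common way to turn those two score lists into an answer span. It is a generic illustration under stated assumptions, not Spark NLP's internal code: the function name `best_span`, the `max_answer_len` limit, and the logit values are all made up for the example.

```python
from typing import List, Tuple


def best_span(
    start_logits: List[float],
    end_logits: List[float],
    max_answer_len: int = 30,
) -> Tuple[int, int]:
    """Pick the (start, end) token pair with the highest summed logits.

    Only spans with start <= end and at most max_answer_len tokens
    are considered.
    """
    best = (0, 0)
    best_score = float("-inf")
    for start, start_score in enumerate(start_logits):
        last = min(len(end_logits), start + max_answer_len)
        for end in range(start, last):
            score = start_score + end_logits[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best


# Made-up logits for a five-token context.
tokens = ["Clara", "lives", "in", "New", "York"]
start_logits = [0.1, -1.0, -0.5, 2.3, 0.4]
end_logits = [0.0, -1.2, -0.7, 0.9, 2.8]

start, end = best_span(start_logits, end_logits)
answer = " ".join(tokens[start:end + 1])
print(answer)  # New York
```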
CamemBertForQuestionAnswering can load CamemBERT Models with a span classification head on top for extractive question-answering tasks like SQuAD (linear layers on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using CamembertForQuestionAnswering for PyTorch or TFCamembertForQuestionAnswering for TensorFlow in HuggingFace 🤗
Models Hub
Introduces a new filter by annotator, which should make it easier to navigate and find models.
⭐🐛 Improvements & Bug Fixes
Date2Chunk annotator to convert DATE outputs coming from DateMatcher and MultiDateMatcher annotators to CHUNK that is acceptable by a wider range of annotators
m1 to silicon and apple_silicon in our code for better clarity
ResourceDownloader in Python (clearCache(), showPublicModels(), showPublicPipelines(), and showAvailableAnnotators())
delimiter id in CamemBERT annotators. The delimiter id is actually correct and doesn't need any offset
cache_pretrained is defined as S3 bucket
cache_pretrained is defined as GCP bucket
loadSavedModel() trying to load external models for private buckets on S3 with better error handling and warnings
params argument in the Spark NLP start function. You can create a params = {} with all Spark NLP and Apache Spark configs and pass it when starting the Spark NLP session
doc id in CoNLL() class when trying to read CoNLL files with id inside each document's header
Dependencies & Code Changes
spark-nlp-m1 Maven package as spark-nlp-silicon
ai module paving a path to another DL engine
💾 Models
Spark NLP 4.3.0 comes with 1000+ state-of-the-art pre-trained transformer models in many languages.
Spark NLP covers the following languages:
English; Multilingual; Afrikaans; Afro-Asiatic languages; Albanian; Altaic languages; American Sign Language; Amharic; Arabic; Argentine Sign Language; Armenian; Artificial languages; Atlantic-Congo languages; Austro-Asiatic languages; Austronesian languages; Azerbaijani; Baltic languages; Bantu languages; Basque; Basque (family); Belarusian; Bemba (Zambia); Bengali, Bangla; Berber languages; Bihari; Bislama; Bosnian; Brazilian Sign Language; Breton; Bulgarian; Catalan; Caucasian languages; Cebuano; Celtic languages; Central Bikol; Chichewa, Chewa, Nyanja; Chilean Sign Language; Chinese; Chuukese; Colombian Sign Language; Congo Swahili; Croatian; Cushitic languages; Czech; Danish; Dholuo, Luo (Kenya and Tanzania); Dravidian languages; Dutch; East Slavic languages; Eastern Malayo-Polynesian languages; Efik; Esperanto; Estonian; Ewe; Fijian; Finnish; Finnish Sign Language; Finno-Ugrian languages; French; French-based creoles and pidgins; Ga; Galician; Ganda; Georgian; German; Germanic languages; Gilbertese; Greek (modern); Greek languages; Gujarati; Gun; Haitian, Haitian Creole; Hausa; Hebrew (modern); Hiligaynon; Hindi; Hiri Motu; Hungarian; Icelandic; Igbo; Iloko; Indic languages; Indo-European languages; Indo-Iranian languages; Indonesian; Irish; Isoko; Isthmus Zapotec; Italian; Italic languages; Japanese; Kabyle; Kalaallisut, Greenlandic; Kannada; Kaonde; Kinyarwanda; Kirundi; Kongo; Korean; Kwangali; Kwanyama, Kuanyama; Latin; Latvian; Lingala; Lithuanian; Louisiana Creole; Lozi; Luba-Katanga; Luba-Lulua; Lunda; Lushai; Luvale; Macedonian; Malagasy; Malay; Malayalam; Malayo-Polynesian languages; Maltese; Manx; Marathi (Marāṭhī); Marshallese; Mexican Sign Language; Mon-Khmer languages; Morisyen; Mossi; Multiple languages; Ndonga; Nepali; Niger-Kordofanian languages; Nigerian Pidgin; Niuean; North Germanic languages; Northern Sotho, Pedi, Sepedi; Norwegian; Norwegian Bokmål; Norwegian Nynorsk; Nyaneka; Oromo; Pangasinan; Papiamento; Persian (Farsi); Peruvian Sign Language; Philippine languages; Pijin; Pohnpeian; Polish; Portuguese; Portuguese-based creoles and pidgins; Punjabi (Eastern); Romance languages; Romanian; Rundi; Russian; Ruund; Salishan languages; Samoan; San Salvador Kongo; Sango; Semitic languages; Serbo-Croatian; Seselwa Creole French; Shona; Sindhi; Sino-Tibetan languages; Slavic languages; Slovak; Slovene; Somali; South Caucasian languages; South Slavic languages; Southern Sotho; Spanish; Spanish Sign Language; Sranan Tongo; Swahili; Swati; Swedish; Tagalog; Tahitian; Tai; Tamil; Telugu; Tetela; Tetun Dili; Thai; Tigrinya; Tiv; Tok Pisin; Tonga (Tonga Islands); Tonga (Zambia); Tsonga; Tswana; Tumbuka; Turkic languages; Turkish; Tuvalu; Tzotzil; Ukrainian; Umbundu; Uralic languages; Urdu; Venda; Venezuelan Sign Language; Vietnamese; Wallisian; Walloon; Waray (Philippines); Welsh; West Germanic languages; West Slavic languages; Western Malayo-Polynesian languages; Wolaitta, Wolaytta; Wolof; Xhosa; Yapese; Yiddish; Yoruba; Yucatec Maya, Yucateco; Zande (individual language); Zulu
The complete list of all 12600+ models & pipelines in 230+ languages is available on Models Hub
Installation
Python
# PyPI
pip install spark-nlp==4.3.0
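Once the package is installed, a session can be started from Python. The sketch below uses the `params` argument of `sparknlp.start` described under the improvements in this release. The helper names `build_params` and `start_session` and the specific config values are assumptions chosen for illustration and should be adjusted to the cluster at hand.

```python
def build_params(driver_memory="16G", max_result_size="2000M"):
    """Collect Apache Spark configs in one dictionary (values are examples)."""
    return {
        "spark.driver.memory": driver_memory,
        "spark.kryoserializer.buffer.max": "2000M",
        "spark.driver.maxResultSize": max_result_size,
    }


def start_session(params):
    """Start a Spark NLP session with the given configs.

    Spark NLP is imported lazily so this file can be loaded on a
    machine where the library is not installed yet.
    """
    import sparknlp

    return sparknlp.start(params=params)


params = build_params()
# spark = start_session(params)  # uncomment once spark-nlp and pyspark are installed
```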
Spark Packages
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x (Scala 2.12):
GPU
Apple Silicon (M1 & M2)
AArch64
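The four variants above differ only in the artifact name passed to `--packages`. As a rough sketch, the helper below assembles the coordinate for each variant. It assumes the `com.johnsnowlabs.nlp` group ID and the `_2.12` Scala suffix; the function and dictionary names are hypothetical and exist only for this example.

```python
# Artifact names per variant; the short keys are illustrative.
ARTIFACTS = {
    "cpu": "spark-nlp",
    "gpu": "spark-nlp-gpu",
    "silicon": "spark-nlp-silicon",
    "aarch64": "spark-nlp-aarch64",
}


def package_coordinate(variant="cpu", version="4.3.0", scala="2.12"):
    """Build a Maven-style coordinate such as group:artifact_scala:version."""
    artifact = ARTIFACTS[variant]
    return f"com.johnsnowlabs.nlp:{artifact}_{scala}:{version}"


def spark_shell_command(variant="cpu"):
    """Return a spark-shell invocation that pulls the chosen variant."""
    return f"spark-shell --packages {package_coordinate(variant)}"


for name in ARTIFACTS:
    print(spark_shell_command(name))
```

The same coordinate string can be passed to `pyspark` or `spark-submit` through their `--packages` option.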
Maven
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x:
spark-nlp-gpu:
spark-nlp-silicon:
spark-nlp-aarch64:
FAT JARs
CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-4.3.0.jar
GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-4.3.0.jar
Apple Silicon (M1 & M2) on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-silicon-assembly-4.3.0.jar
AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-4.3.0.jar
What's Changed
Contributors
@Cabir40 @bunyamin-polat @danilojsl @dcecchini @Meryem1425 @C-K-Loan @agsfer @maziyarpanahi @jfernandrezj @jsl-builder @DevinTDHa @josejuanmartinez @aymanechilah
Full Changelog: 4.2.8...4.3.0