
Releases: JohnSnowLabs/spark-nlp

Spark NLP 4.0.2: Over 620 new state-of-the-art models in 21 languages, full support for Apache Spark 3.3.0, new Databricks runtime 11.1, and bug fixes

19 Jul 17:19

Overview

We are pleased to release Spark NLP 🚀 4.0.2! This release comes with full compatibility with the newly released Apache Spark 3.3.0 and official support for Databricks' new runtime 11.1 Beta (which includes Apache Spark 3.3.0 and Scala 2.12).

As always, we would like to thank our community for their feedback, questions, and feature requests.


New Features

  • Welcoming new Databricks runtimes based on Spark/PySpark 3.3.0 to our Spark NLP family:
    • Databricks 11.1 Beta
    • Databricks 11.1 ML Beta
    • Databricks 11.1 ML Beta GPU
  • SentenceDetector now comes with a new parameter customBoundsStrategy for returning custom bounds #10567

Example

Given setCustomBounds([r"\.", ";"]) and the input

This is a sentence. This one uses custom bounds; As is this one;

the matched bounds are removed by default, so without the new flag the result is

["This is a sentence", "This one uses custom bounds", "As is this one"]

With the new flag:

.setCustomBounds([r"\.", ";"])
.setCustomBoundsStrategy("append")

the result will be

["This is a sentence.", "This one uses custom bounds;", "As is this one;"]

Similarly, with the prepend strategy, given the following input and settings:

1. This is a list
1.1 This is a subpoint
2. Second thing
2.2 Second subthing
.setCustomBounds([r"\n[\d\.]+"])
.setCustomBoundsStrategy("prepend")

the result will be

[
    "1. This is a list",
    "1.1 This is a subpoint",
    "2. Second thing",
    "2.2 Second subthing"
]
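
Putting it together, a minimal PySpark sketch of the new parameter is shown below. The pipeline wiring is the standard DocumentAssembler + SentenceDetector setup; the setCustomBoundsStrategy setter name follows the usual set<Param> convention, and the sample text is taken from the example above:

import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector
from pyspark.ml import Pipeline

spark = sparknlp.start()

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Keep the matched bound at the end of each sentence instead of dropping it
sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence") \
    .setCustomBounds([r"\.", ";"]) \
    .setCustomBoundsStrategy("append")

pipeline = Pipeline(stages=[document_assembler, sentence_detector])

data = spark.createDataFrame(
    [["This is a sentence. This one uses custom bounds; As is this one;"]]
).toDF("text")

pipeline.fit(data).transform(data).select("sentence.result").show(truncate=False)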

Bug Fixes

  • Fix a bug where GraphExtraction attempted to create a Spark session on executors when running on Spark/PySpark 3.3 #9905

Models and Pipelines

Spark NLP 4.0.2 comes with 620+ state-of-the-art pre-trained transformer models in 21 languages including multi-lingual models.

Featured Models

Model | Name | Lang
BertForQuestionAnswering | electra_qa_BioM_Base_SQuAD2_BioASQ8B | en
BertForQuestionAnswering | bert_qa_multilingual_base_cased_chines | zh
BertForQuestionAnswering | bert_qa_deep_pavlov_full | ru
BertForQuestionAnswering | bert_qa_firmanindolanguagemodel | id
BertForQuestionAnswering | bert_qa_kcbert_base_finetuned_squad | ko
BertForQuestionAnswering | bert_qa_mbert_finetuned_mlqa_de_hi_dev | xx
BertForQuestionAnswering | bert_qa_modelontquad | tr
BertForQuestionAnswering | bert_qa_newsqa_el_4 | el
BertForQuestionAnswering | bert_qa_testpersianqa | fa
BertForQuestionAnswering | bert_qa_arabert_finetuned_arcd | ar
BertForTokenClassification | bert_ner_NER_legal_de_Sahajtomar | de
BertForTokenClassification | bert_ner_NER_en_vi_it_es_tinparadox | xx
BertForTokenClassification | bert_ner_NER_CAMELBERT | ar
BertForTokenClassification | bert_ner_Swedish_NER | sv
BertForTokenClassification | bert_ner_bert_base_chinese_ner | zh
BertForTokenClassification | bert_ner_bert_base_hu_cased_ner | hu
BertForTokenClassification | bert_ner_bert_base_indonesian_NER | id
BertForTokenClassification | bert_ner_bert_base_irish_cased_v1_finetuned_ner | ga
BertForTokenClassification | bert_ner_bert_base_pt_archive | pt
BertForTokenClassification | bert_ner_bert_base_spanish_wwm_uncased_finetuned_NER_medical | es

The complete list of all 6900+ models & pipelines in 230+ languages is available on Models Hub


📖 Documentation & Articles


Installation

Python

#PyPI

pip install spark-nlp==4.0.2

Spark Packages

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x (Scala 2.12):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.0.2

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.0.2

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.0.2

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.0.2

M1

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.0.2

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.0.2

Maven

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x (Scala 2.12):

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>4.0.2</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>4.0.2</version>
</dependency>

spark-nlp-m1:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-m1_2.12</artifactId>
    <version>4.0.2</version>
</dependency>

FAT JARs

What's Changed

Contributors

@gadde5300 @danilojsl @hsaglamlar @Cabir40 @ahmedlone127 @muhammetsnts @KshitizGIT @maziyarpanahi @albertoandreottiATgmail @DevinTDHa @luca-martial @Damla-Gurbaz @jsl-models @Meryem1425

New Contributors

Full Changelog: 4.0.1...4.0.2

Spark NLP 4.0.1: Full support for Apache Spark 3.3.0, new Databricks runtime 11, enhancements, and other bug fixes!

01 Jul 15:19

Overview

We are pleased to release Spark NLP 🚀 4.0.1! This release comes with support for the newly released Apache Spark 3.3.0, which improves join query performance via Bloom filters, increases the Pandas API coverage, and brings many other improvements. In addition, Spark NLP 4.0.1 comes with official support for Databricks runtime 11, other enhancements, and bug fixes.

As always, we would like to thank our community for their feedback, questions, and feature requests.


Features & Enhancements

  • Full support for Apache Spark & PySpark 3.3.0
  • Add Apache Spark 3.3.0 to Google Colab and Kaggle setup scripts
  • New -g option for the Google Colab and Kaggle setup scripts on GPU devices to upgrade libcudnn8 to 8.1.0 and resolve a cuDNN issue on GPU
  • Welcoming new Databricks runtimes based on Spark/PySpark 3.3.0 to our Spark NLP family:
    • Databricks 11.0 LTS
    • Databricks 11.0 LTS ML
    • Databricks 11.0 LTS ML GPU

Bug Fixes

  • Fix the error caused by PySpark 3.3.0 in CoNLL, CoNLLU, POS, and PubTator annotators as training helpers
  • Fix and re-upload Dependency and Type Dependency parser pre-trained models
  • Update pre-trained pipelines with issues on PySpark 3.2 and 3.3

Documentation


Installation

Python

#PyPI

pip install spark-nlp==4.0.1

Spark Packages

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x (Scala 2.12):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.0.1

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.0.1

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.0.1

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.0.1

M1

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.0.1

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.0.1

Maven

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x (Scala 2.12):

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>4.0.1</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>4.0.1</version>
</dependency>

spark-nlp-m1:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-m1_2.12</artifactId>
    <version>4.0.1</version>
</dependency>

FAT JARs

What's Changed

Contributors

@muhammetsnts @jsl-models @Meryem1425 @Damla-Gurbaz @jsl-builder @rpranab @danilojsl @josejuanmartinez @Cabir40 @DevinTDHa @agsfer @suvrat-joshi @ahmedlone127 @albertoandreottiATgmail @KshitizGIT @mahmoodbayeshi @maziyarpanahi

New Contributors

Full Changelog: 4.0.0...4.0.1

Spark NLP 4.0.0: New modern extractive Question answering (QA) annotators for ALBERT, BERT, DistilBERT, DeBERTa, RoBERTa, Longformer, and XLM-RoBERTa, official support for Apple silicon M1, support oneDNN to improve CPU up to 97%, improved transformers on GPU up to +700%, 1000+ state-of-the-art models, and lots more!

15 Jun 17:38

Overview

We are very excited to release Spark NLP 4.0.0! This has been one of the biggest releases we have ever done and we are so proud to share this with our community! 🎉

This release comes with official support for the Apple silicon M1 chip (for the first time), official support for Spark/PySpark 3.2, support for the oneAPI Deep Neural Network Library (oneDNN) to improve TensorFlow on CPU by up to 97%, optimized transformer-based embeddings on GPU that increase performance by up to +700%, brand-new modern extractive transformer-based Question Answering (QA) annotators for tasks like SQuAD based on the ALBERT, BERT, DistilBERT, DeBERTa, RoBERTa, Longformer, and XLM-RoBERTa architectures, 1000+ state-of-the-art models, a WordEmbeddingsModel that now works in clusters without HDFS/DBFS/S3 such as Kubernetes, new Databricks and EMR support, new NER models achieving the highest F1 scores in Spark NLP, and many more enhancements and bug fixes!

We would like to mention that Spark NLP 4.0.0 drops support for Spark 2.3 and 2.4 (Scala 2.11). Starting with 4.0.0, we only support Spark/PySpark 3.x on Scala 2.12.

As always, we would like to thank our community for their feedback, questions, and feature requests.


Major features and improvements

  • NEW: Support for the oneAPI Deep Neural Network Library (oneDNN) optimizations to improve TensorFlow on CPU. Enabling oneDNN can improve some transformer-based models by up to 97%. By default, the oneDNN optimizations are turned off. To enable them, set the environment variable TF_ENABLE_ONEDNN_OPTS. On Linux systems, for instance: export TF_ENABLE_ONEDNN_OPTS=1
  • NEW: Optimizing batch processing for transformer-based Word Embeddings on a GPU device. These optimizations can result in performance improvements up to +700% (more details in the Benchmarks section)
  • NEW: Official support for Apple silicon M1 on macOS devices. You can use the spark-nlp-m1 package that supports Apple silicon M1 on your macOS machine in Spark NLP 4.0.0
  • NEW: Introducing AlbertForQuestionAnswering annotator in Spark NLP 🚀. AlbertForQuestionAnswering can load ALBERT Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using AlbertForQuestionAnswering for PyTorch or TFAlbertForQuestionAnswering for TensorFlow models in HuggingFace 🤗
  • NEW: Introducing BertForQuestionAnswering annotator in Spark NLP 🚀. BertForQuestionAnswering can load BERT & ELECTRA Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using BertForQuestionAnswering and ElectraForQuestionAnswering for PyTorch or TFBertForQuestionAnswering and TFElectraForQuestionAnswering for TensorFlow models in HuggingFace 🤗
  • NEW: Introducing DeBertaForQuestionAnswering annotator in Spark NLP 🚀. DeBertaForQuestionAnswering can load DeBERTa v2&v3 Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using DebertaV2ForQuestionAnswering for PyTorch or TFDebertaV2ForQuestionAnswering for TensorFlow models in HuggingFace 🤗
  • NEW: Introducing DistilBertForQuestionAnswering annotator in Spark NLP 🚀. DistilBertForQuestionAnswering can load DistilBERT Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using DistilBertForQuestionAnswering for PyTorch or TFDistilBertForQuestionAnswering for TensorFlow models in HuggingFace 🤗
  • NEW: Introducing LongformerForQuestionAnswering annotator in Spark NLP 🚀. LongformerForQuestionAnswering can load Longformer Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using LongformerForQuestionAnswering for PyTorch or TFLongformerForQuestionAnswering for TensorFlow models in HuggingFace 🤗
  • NEW: Introducing RoBertaForQuestionAnswering annotator in Spark NLP 🚀. RoBertaForQuestionAnswering can load RoBERTa Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using RobertaForQuestionAnswering for PyTorch or TFRobertaForQuestionAnswering for TensorFlow models in HuggingFace 🤗
  • NEW: Introducing XlmRoBertaForQuestionAnswering annotator in Spark NLP 🚀. XlmRoBertaForQuestionAnswering can load XLM-RoBERTa Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using XLMRobertaForQuestionAnswering for PyTorch or TFXLMRobertaForQuestionAnswering for TensorFlow models in HuggingFace 🤗
  • NEW: Introducing MultiDocumentAssembler annotator for cases where multiple inputs need to be converted to DOCUMENT, such as the XXXForQuestionAnswering annotators (a minimal usage sketch follows this list)
  • NEW: Introducing SpanBertCorefModel annotator for Coreference Resolution on BERT and SpanBERT models based on BERT for Coreference Resolution: Baselines and Analysis paper. An implementation of a SpanBert-based coreference resolution model.
  • NEW: Introducing enableInMemoryStorage parameter in WordEmbeddingsModel annotator. By enabling this parameter the annotator will no longer require a distributed storage to unpack indices and will perform everything in-memory.
  • Official support for Apache Spark and PySpark 3.2.x on Scala 2.12. Spark NLP is shipped for Spark 3.2.x by default and supports Spark/PySpark 3.0.x and 3.1.x in addition
  • Unifying all supported Apache Spark packages on Maven into spark-nlp for CPU, spark-nlp-gpu for GPU, and spark-nlp-m1 for new Apple silicon M1 on macOS. The need for Apache Spark specific packages like spark-nlp-spark32 has been removed.
  • Adding a new param to sparknlp.start() function in Python and Scala for Apple silicon M1 on macOS (m1=True)
  • Upgrade TensorFlow to 2.7.1 and start supporting Apple silicon M1
  • Upgrade RocksDB with new enhancements and support for Apple silicon M1
  • Upgrade SentencePiece tokenizer TF ops to 2.7.1
  • Upgrade SentencePiece JNI to v0.1.96 and provide support for Apple silicon M1 on macOS support
  • Upgrade to Scala 2.12.15
  • Update Colab, Kaggle, and SageMaker scripts
  • Refactor the entire Python module in Spark NLP to make the development and maintenance easier
  • Refactor unit tests in Python and migrate to pytest
  • Welcoming 6x new Databricks runtimes to our Spark NLP family:
    • Databricks 10.4 LTS
    • Databricks 10.4 LTS ML
    • Databricks 10.4 LTS ML GPU
    • Databricks 10.5
    • Databricks 10.5 ML
    • Databricks 10.5 ML GPU
  • Welcoming a new EMR 6.x series to our Spark NLP family:
    • EMR 6.6.0 (Apache Spark 3.2.0 / Hadoop 3.2.1)
  • Migrate T5Transformer to TensorFlow v2 architecture by re-uploading all the existing models
  • Support for 2 inputs in LightPipeline with MultiDocumentAssembler
  • Add new default NerDL graph for xsmall DeBERTa embeddings model (384 dimensions)
  • Adding annotateJava method to PretrainedPipeline class in Java to facilitate the use of LightPipelines
  • Allow changing case sensitivity on pretrained models. Previously, users could not set the setCaseSensitive param; now they can change this value if a model was saved/uploaded with the wrong case sensitivity parameter (BERT, ALBERT, DistilBERT, RoBERTa, DeBERTa, XLM-RoBERTa, and Longformer for XXXForSequenceClassification and XXXForTokenClassification)
  • Keep accuracy in ClassifierDL and SentimentDL during the training between 0.0 and 1.0
  • Preserve the original form of the token in BPE Tokenizer used in RoBERTa annotators (used in embeddings, sequence and token classification)
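
To make the new QA wiring concrete, here is a minimal sketch combining MultiDocumentAssembler with one of the new span-classification annotators. The default model resolved by .pretrained(), the sample data, and the column names are illustrative assumptions rather than a prescribed setup:

import sparknlp
from sparknlp.base import MultiDocumentAssembler
from sparknlp.annotator import BertForQuestionAnswering
from pyspark.ml import Pipeline

spark = sparknlp.start()

# Question and context each become a DOCUMENT column
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

# The span classification head predicts the answer span inside the context
span_classifier = BertForQuestionAnswering.pretrained() \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(False)

pipeline = Pipeline(stages=[document_assembler, span_classifier])

data = spark.createDataFrame(
    [["What is my name?", "My name is Clara and I live in Berkeley."]]
).toDF("question", "context")

pipeline.fit(data).transform(data).select("answer.result").show(truncate=False)

The other XXXForQuestionAnswering annotators follow the same pattern; only the annotator class and the pretrained model name change.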

Performance Improvements (Benchmarks)

We have introduced two major performance improvements for GPU and CPU devices in Spark NLP 4.0.0 release.

The following benchmarks have been done by using a single Dell Server with the following specs:

  • GPU: Tesla P100 PCIe 12GB - CUDA Version: 11.3 - Driver Version: 465.19.01
  • CPU: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz - 40 Cores
  • Memory: 80G

GPU

We have improved our batch processing approach for transformer-based Word Embeddings to improve their performance on a GPU device. These optimizations result in performance improvements up to +700%. The detailed list of improved transformer models on GPU in comparison to Spark NLP 3.4.x:

Model on GPU | Spark NLP 3.4.3 vs. 4.0.0
RoBERTa base | +560% (6.6x)
RoBERTa Large | +332% (4.3x)
Albert Base | +587% (6.9x...
Read more

Spark NLP 3.4.4: New DeBERTa for Token Classification, new CamemBERT embeddings, speed improvements for Tokenizer and UniversalSentenceEncoder annotators, over 160 new state-of-the-art models, and other improvements!

06 May 13:49

Overview

We are very excited to release Spark NLP 🚀 3.4.4! This release comes with a new DeBERTa for Token Classification annotator compatible with existing or fine-tuned models on HuggingFace 🤗, a new annotator for CamemBERT embeddings models, up to 18x speed improvements for UniversalSentenceEncoder on GPU devices, up to 400% speed improvements in Tokenizer with a list of exceptions, and new state-of-the-art NER, French embeddings, DistilBERT embeddings, and ALBERT embeddings models!

As always, we would like to thank our community for their feedback, questions, and feature requests.


New Features

  • NEW: Introducing DeBertaForTokenClassification annotator in Spark NLP 🚀. DeBertaForTokenClassification can load DeBERTa v2&v3 models with a token classification head on top (a linear layer on top of the hidden-states output), e.g. for Named-Entity-Recognition (NER) tasks. This annotator is compatible with all the models trained/fine-tuned by using DebertaV2ForTokenClassification for PyTorch or TFDebertaV2ForTokenClassification for TensorFlow models in HuggingFace #8082
  • NEW: Introducing CamemBertEmbeddings annotator in Spark NLP 🚀 #8237. CamemBERT is a state-of-the-art language model for French based on the RoBERTa architecture, pretrained on the French subcorpus of the newly available multilingual corpus OSCAR. For further information or requests, please go to the CamemBERT website (a usage sketch follows this list)
  • Add support for batching rows to improve UniversalSentenceEncoder on GPU devices. This new feature will increase GPU speed between 2x and 18x depending on the distribution of sentences #8234
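
As a quick illustration of the new CamemBertEmbeddings annotator, a minimal embedding pipeline might look like the sketch below. camembert_base is one of the models listed in the Models section; the column names are illustrative:

from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, CamemBertEmbeddings
from pyspark.ml import Pipeline

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

# Token-level CamemBERT embeddings for French text
embeddings = CamemBertEmbeddings.pretrained("camembert_base", "fr") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

pipeline = Pipeline(stages=[document_assembler, tokenizer, embeddings])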

Bug Fixes & Enhancements

  • Optimize Tokenizer performance by up to 400% when an exceptions list is used. The exceptions list now scales to a large number of exceptions without impacting overall performance #7881
  • Support latest PySpark releases in Colab, Kaggle, and SageMaker scripts #8028
  • Fix bug that caused get input/output/LazyAnnotator to return None #8043
  • Fix DeBertaForSequenceClassification in Python failing to load pretrained models #8060
  • Fix missing Lemma and POS models from 3.4.3 release

Dependencies

  • Removing outdated trove4j dependency in favour of native Java modules #8236
  • Upgrade the base Apache Spark to 2.4.8, 3.0.3, and 3.2.1
  • Upgrade Typesafe Config to 1.4.2
  • Upgrade sbt to 1.6.2

Models

Spark NLP 3.4.4 comes with 160+ state-of-the-art multi-lingual pretrained models. Some of the featured models:

New DeBERTa Token Classification Models

New fine-tuned DeBERTa v3 models for token classifications over CoNLL03 and OntoNotes datasets that reach state-of-the-art metrics.

Model | Name | Lang | F1 Dev
DeBertaForTokenClassification | deberta_v3_large_token_classifier_conll03 | en | 0.97
DeBertaForTokenClassification | deberta_v3_base_token_classifier_conll03 | en | 0.96
DeBertaForTokenClassification | deberta_v3_small_token_classifier_conll03 | en | 0.95
DeBertaForTokenClassification | deberta_v3_xsmall_token_classifier_conll03 | en | 0.93
DeBertaForTokenClassification | deberta_v3_large_token_classifier_ontonotes | en | 0.89
DeBertaForTokenClassification | deberta_v3_base_token_classifier_ontonotes | en | 0.88
DeBertaForTokenClassification | deberta_v3_small_token_classifier_ontonotes | en | 0.87
DeBertaForTokenClassification | deberta_v3_xsmall_token_classifier_ontonotes | en | 0.86

New CamemBERT Models

Model | Name | Lang
CamemBertEmbeddings | camembert_large | fr
CamemBertEmbeddings | camembert_base | fr
CamemBertEmbeddings | camembert_base_ccnet_4gb | fr
CamemBertEmbeddings | camembert_base_ccnet | fr
CamemBertEmbeddings | camembert_base_oscar_4gb | fr
CamemBertEmbeddings | camembert_base_wikipedia_4gb | fr

New DistilBERT Embeddings Models

Model | Name | Lang
DistilBertEmbeddings | distilbert_embeddings_distilbert_base_fr_cased | fr
DistilBertEmbeddings | distilbert_embeddings_marathi_distilbert | mr
DistilBertEmbeddings | distilbert_embeddings_distilbert_base_indonesian | id
DistilBertEmbeddings | distilbert_embeddings_javanese_distilbert_small | jv
DistilBertEmbeddings | distilbert_embeddings_malaysian_distilbert_small | ms
DistilBertEmbeddings | distilbert_embeddings_distilbert_base_ar_cased | ar

New ALBERT Embeddings Models

Model | Name | Lang
AlbertEmbeddings | albert_embeddings_fralbert_base | fr
AlbertEmbeddings | albert_embeddings_albert_base_arabic | ar
AlbertEmbeddings | albert_embeddings_marathi_albert_v2 | mr
AlbertEmbeddings | albert_embeddings_albert_fa_base_v2 | fa
AlbertEmbeddings | albert_embeddings_albert_large_bahasa_cased | ms
AlbertEmbeddings | albert_embeddings_marathi_albert | mr

The complete list of all 5000+ models & pipelines in 200+ languages is available on Models Hub.

New Notebooks

Import CamemBERT models to Spark NLP 🚀

Spark NLP | HuggingFace Notebooks | Colab
CamemBertEmbeddings | HuggingFace in Spark NLP - CamemBERT | Open In Colab

You can visit Import Transformers in Spark NLP for more info


Documentation

Read more

John Snow Labs Spark-NLP 3.4.3: New DeBERTa for Sequence Classification, sigmoid activation for sequence classifiers, new features for SentenceDetectorDL, over 600 new multi-lingual models, and other improvements!

12 Apr 20:08

Overview

We are very excited to release Spark NLP 🚀 3.4.3! This release comes with a new DeBERTa for Sequence Classification annotator compatible with existing or fine-tuned models on HuggingFace 🤗, a new sigmoid activation function in addition to softmax to support multi-label models in all ForSequenceClassification annotators, new features added to SentenceDetectorDL, new features added to CoNLLU and Lemmatizer, and more than 600 new multi-lingual models for DeBERTa, BERT, DistilBERT, fastText, Lemmatizer and Part of Speech, and other improvements!

As always, we would like to thank our community for their feedback, questions, and feature requests.


New Features

  • NEW: Introducing DeBertaForSequenceClassification annotator in Spark NLP 🚀. DeBertaForSequenceClassification can load DeBERTa v2&v3 models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using DebertaForSequenceClassification for PyTorch or TFDebertaForSequenceClassification for TensorFlow models in HuggingFace #7713
  • New multi-label feature in all ForSequenceClassification annotators. The following annotators now have the option to switch to a sigmoid activation function instead of softmax for the output layer (see the sketch after this list): AlbertForSequenceClassification, BertForSequenceClassification, DeBertaForSequenceClassification, DistilBertForSequenceClassification, LongformerForSequenceClassification, RoBertaForSequenceClassification, XlmRoBertaForSequenceClassification, and XlnetForSequenceClassification #7479
  • New minLength, maxLength, splitLength, customBounds, and useCustomBoundsOnly parameters in SentenceDetectorDL #7214
  • New impossiblePenultimates in SentenceDetectorDLModel #7685
  • New feature to set names for columns in CoNLLU class: textCol, documentCol, sentenceCol, formCol, uposCol, xposCol, and lemmaCol #7344
  • New formCol and lemmaCol parameters in Lemmatizer annotator #7344
  • Add new functionality to download and extract models from S3 via direct link #7682
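
As a sketch of the new multi-label option, the snippet below loads one of the DeBERTa classifiers listed under Models and switches its output layer to sigmoid. The setActivation setter name is an assumption based on the activation parameter; the column names are illustrative:

from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, DeBertaForSequenceClassification
from pyspark.ml import Pipeline

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

# Sigmoid instead of softmax on the output layer for multi-label scoring
# (setter name assumed from the `activation` parameter)
classifier = DeBertaForSequenceClassification \
    .pretrained("deberta_v3_base_sequence_classifier_imdb", "en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("class") \
    .setActivation("sigmoid")

pipeline = Pipeline(stages=[document_assembler, tokenizer, classifier])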

Enhancements


Models

New DeBERTa Classification Models

New fine-tuned DeBERTa v3 models for text classifications over IMDB reviews in English and Urdu, AG News categories in English, and Allocine French reviews.

Model | Name | Lang
DeBertaForSequenceClassification | mdeberta_v3_base_sequence_classifier_imdb | ur
DeBertaForSequenceClassification | mdeberta_v3_base_sequence_classifier_allocine | fr
DeBertaForSequenceClassification | deberta_v3_xsmall_sequence_classifier_imdb | en
DeBertaForSequenceClassification | deberta_v3_small_sequence_classifier_imdb | en
DeBertaForSequenceClassification | deberta_v3_base_sequence_classifier_imdb | en
DeBertaForSequenceClassification | deberta_v3_large_sequence_classifier_imdb | en
DeBertaForSequenceClassification | deberta_v3_xsmall_sequence_classifier_ag_news | en
DeBertaForSequenceClassification | deberta_v3_small_sequence_classifier_ag_news | en

New BERT Models

Spark NLP now has up to 250 state-of-the-art BERT models in 27 languages including Arabic, Bengali, Chinese, Dutch, English, Finnish, French, German, Greek, Hindi, Italian, Japanese, Javanese, Korean, Marathi, Panjabi, Portuguese, Romanian, Russian, Spanish, Swedish, Tamil, Telugu, Turkish, Urdu, Vietnamese, and Multi-lingual.

Model | Name | Lang
BertEmbeddings | bert_embeddings_ARBERT | ar
BertEmbeddings | bert_embeddings_German_MedBERT | de
BertEmbeddings | bert_embeddings_bangla_bert_base | bn
BertEmbeddings | bert_embeddings_bert_base_5lang_cased | zh
BertEmbeddings | bert_embeddings_bert_base_5lang_cased | fr
BertEmbeddings | bert_embeddings_bert_base_hi_cased | hi
BertEmbeddings | bert_embeddings_bert_base_it_cased | it
BertEmbeddings | bert_embeddings_bert_base | ko
BertEmbeddings | bert_embeddings_bert_base_tr_cased | tr
BertEmbeddings | bert_embeddings_bert_base_ur_cased | ur
BertEmbeddings | bert_embeddings_bert_base_vi_cased | vi

New fastText Models

Over 128 new Word2Vec models in 128 languages built from fastText word embeddings.

Model | Name | Lang
WordEmbeddingsModel | w2v_cc_300d | hi
WordEmbeddingsModel | w2v_cc_300d | azb
WordEmbeddingsModel | w2v_cc_300d | bo
WordEmbeddingsModel | w2v_cc_300d | diq
WordEmbeddingsModel | w2v_cc_300d | cy
WordEmbeddingsModel | w2v_cc_300d | ckb
WordEmbeddingsModel | w2v_cc_300d | el
WordEmbeddingsModel | w2v_cc_300d | es

New Lemmatizer and Part of Speech Models

234 new Lemmatizer and Part of Speech models in 62 languages based on the new Universal Dependency treebank 2.9 release.

Model | Name | Lang
LemmatizerModel | lemma_afribooms | af
LemmatizerModel | lemma_alksnis | lt
LemmatizerModel | lemma_alpino | nl
LemmatizerModel | lemma_arcosg | gd
LemmatizerModel | lemma_ancora | es
LemmatizerModel | lemma_ancora | ca
PerceptronModel | pos_mtg | te
PerceptronModel | pos_ttb | ta
PerceptronModel | pos_vtb | vi
PerceptronModel | pos_cac | cs
PerceptronModel | pos_btb | bg
PerceptronModel | pos_afribooms | af

The complete list of all 4800+ models & pipelines in 200+ languages is available on Models Hub.


Documentation

  • [T...
Read more

John Snow Labs Spark-NLP 3.4.2: DeBERTa embeddings, new caching in Word2Vec and Doc2Vec, new state-of-the-art models, and bug fixes!

10 Mar 15:33

Overview

We are pleased to release Spark NLP 🚀 3.4.2! This release comes with a new DeBERTa transformer for word embeddings, new caching to speed up training Word2Vec and Doc2Vec, new English and multi-lingual state-of-the-art models, and bug fixes!

As always, we would like to thank our community for their feedback, questions, and feature requests.


New Features

  • Introducing DeBertaEmbeddings annotator. DeBERTa (Decoding-enhanced BERT with disentangled attention) improves the BERT and RoBERTa models using two novel techniques. Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). This annotator is compatible with all the models trained/fine-tuned by using DebertaV2Model for PyTorch or TFDebertaV2Model for TensorFlow models (DeBERTa-v2 & DeBERTa-v3) in HuggingFace
  • Introducing a new param enableCaching in Doc2VecApproach to speed up the training
  • Introducing a new param enableCaching in Word2VecApproach to speed up the training
  • Support Databricks runtime 10.3, 10.3 ML, and 10.3 ML & GPU
  • Support EMR emr-5.34.0 and emr-6.5.0
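
For reference, here is a minimal sketch of the new DeBertaEmbeddings annotator introduced above. deberta_v3_base is one of the models listed below; the column names are illustrative:

from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, DeBertaEmbeddings
from pyspark.ml import Pipeline

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

# Token-level DeBERTa v3 embeddings
embeddings = DeBertaEmbeddings.pretrained("deberta_v3_base", "en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

pipeline = Pipeline(stages=[document_assembler, tokenizer, embeddings])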

Bug Fixes

  • Fix bestModelMetric param when the set value was ignored #6978

New Notebooks

Import DeBERTa models to Spark NLP 🚀

Spark NLP | HuggingFace Notebooks | Colab
DeBertaEmbeddings | HuggingFace in Spark NLP - DeBERTa | Open In Colab

You can visit Import Transformers in Spark NLP for more info


Models

New state-of-the-art DeBERTa models:

Model | Name | Lang
DeBertaEmbeddings | deberta_v3_xsmall | en
DeBertaEmbeddings | deberta_v3_small | en
DeBertaEmbeddings | deberta_v3_base | en
DeBertaEmbeddings | deberta_v3_large | en
DeBertaEmbeddings | mdeberta_v3_base | xx

Documentation


Installation

Python

#PyPI

pip install spark-nlp==3.4.2

Spark Packages

spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.2

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.2

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.4.2

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.4.2

spark-nlp on Apache Spark 3.2.x (Scala 2.12 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.2

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.2

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark32_2.12:3.4.2

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark32_2.12:3.4.2

spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.4.2

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.4.2

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.4.2

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.4.2

spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.4.2

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.4.2

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.4.2

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.4.2

Maven

spark-nlp on Apache Spark 3.0.x and 3.1.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>3.4.2</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>3.4.2</version>
</dependency>

spark-nlp on Apache Spark 3.2.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-spark32_2.12</artifactId>
    <version>3.4.2</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu-spark32_2.12</artifactId>
    <version>3.4.2</version>
</dependency>

spark-nlp on Apache Spark 2.4.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-spark24_2.11</artifactId>
    <version>3.4.2</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu-spark24_2.11</artifactId>
    <version>3.4.2</version>
</dependency>

spark-nlp on Apache Spark 2.3.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-spark23_2.11</artifactId>
    <version>3.4.2</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu-spark23_2.11</artifactId>
    <version>3.4.2</version>
</dependency>

FAT JARs

What's Changed

Full Changelog: 3.4.1...3.4.2

New Contributors

@agsfer @KshitizGIT @gadde5300 @kolia1985 @jsl-models @rpranab @josejuanmartinez @bunyamin-polat @maziyarpanahi @jsl-builder @Damla-Gurbaz @xusliebana @mahmoodbayeshi @luca-martial @dependabot @muhammetsnts @albertoandreottiATgmai

John Snow Labs Spark-NLP 3.4.1: TF session warmup, a new F1 metric to track to save the best model in NerDL, new T5 models like WikiSQL or grammar corrector, other new multi-lingual state-of-the-art models, and bug fixes!

08 Feb 18:05

Overview

We are pleased to release Spark NLP 🚀 3.4.1! This release comes with a TF session warmup for three annotators whose first inference was slower than the rest, a new param to choose which F1 metric to track when saving the best model during NerDL training, new T5 models such as text-to-SQL and grammar correction, new multi-lingual state-of-the-art models, and other bug fixes!

As always, we would like to thank our community for their feedback, questions, and feature requests.


New Features & Enhancements

  • Implement TF session warmup for the MarianTransformer, T5Transformer, and GPT2Transformer annotators. The first inference for these annotators used to take 15-20 seconds; with the warmup session, all inferences, including the first one, now take the same time #6773
  • Add bestModelMetric param to choose between Micro-average and Macro-average F1 for saving the best model (a hedged usage sketch follows this list) #6749
  • Add trimWhitespace and preservePosition params to RegexTokenizer #6806
  • Add a new setSentenceMatch param to EntityRuler to match entities across documents/sentences and not just tokens #6841
  • Add support for using the spark32 and real_time_output flags in the sparknlp.start() function at the same time #6822
  • Allow users to set tasks in the T5Transformer annotator
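
A hedged sketch of how the new param fits into NerDL training follows; the accepted metric value string and the surrounding column names are assumptions for illustration:

from sparknlp.annotator import NerDLApproach

# Keep the checkpoint with the best tracked F1 during training.
# "micro-average" is assumed here as the metric value string.
ner_tagger = NerDLApproach() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label") \
    .setOutputCol("ner") \
    .setMaxEpochs(10) \
    .setUseBestModel(True) \
    .setBestModelMetric("micro-average")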

Bug Fixes

  • Fix random NullPointerException when using TensorFlow models without Kryo serialization #6741
  • Fix RecursiveTokenizerModel not being readable in a saved Pipeline #6748
  • Fix ContextSpellCheckerApproach not being trained on Databricks #6750
  • Fix ContextSpellCheckerModel returning tokens in the wrong order when it's used with sentence detectors #6799
  • Fix GraphExtraction when fullAnnotate and document are used at the same time #6845
  • Fix Word2VecModel being cast to Doc2VecModel by mistake #6849
  • Fix broken sentence indexing in BertEmbeddings that impacted SentenceEmbeddings for text classification #6867
  • Fix missing setExceptionsPath param in Tokenizer when it's used in Python #6868
  • Fix the wrong metric being mentioned when useBestModel was enabled. The documentation said Micro-averaged F1 but in fact it was Macro-average F1 (the option to choose which metric to track is now available as well)
  • Update broken slow unit tests #6767

Models

New state-of-the-art models in English, French, Vietnamese, Dutch, and Indian languages (Assamese, Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu)

Featured Pretrained Models

Model | Name | Lang
T5Transformer | t5_informal_to_formal_styletransfer | en
T5Transformer | t5_formal_to_informal_styletransfer | en
T5Transformer | t5_passive_to_active_styletransfer | en
T5Transformer | t5_active_to_passive_styletransfer | en
T5Transformer | t5_grammar_error_corrector | en
T5Transformer | t5_small_wikiSQL | en
LongformerEmbeddings | clinical_longformer | en
AlbertEmbeddings | albert_indic | xx
DistilBertEmbeddings | distilbert_base_cased | vi
BertForSequenceClassification | bert_sequence_classifier_news_sentiment | de
BertForSequenceClassification | bert_sequence_classifier_emotion | en
DistilBertForTokenClassification | distilbert_token_classifier_typo_detector | en
DistilBertForTokenClassification | distilbert_base_token_classifier_masakhaner | xx
WordEmbeddingsModel | word2vec_wiki_1000 | fr
WordEmbeddingsModel | word2vec_wac_200 | fr
WordEmbeddingsModel | w2v_cc_300d | fr

Documentation


Installation

Python

#PyPI

pip install spark-nlp==3.4.1

Spark Packages

spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.1

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.1

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.4.1

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.4.1

spark-nlp on Apache Spark 3.2.x (Scala 2.12 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.1

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.1

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark32_2.12:3.4.1

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark32_2.12:3.4.1

spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.4.1

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.4.1

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.4.1

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.4.1

spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.4.1

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.4.1

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.4.1

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.4.1

Maven

spark-nlp on Apache Spark 3.0.x and 3.1.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>3.4.1</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>3.4.1</version>
</dependency>

spark-nlp on Apache Spark 3.2.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-spark32_2.12</artifactId>
    <version>3.4.1</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu-spark32_2.12</artifactId>
    <version>3.4.1</version>
</dependency>

spark-nlp on Apache Spark 2.4.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-spark24_2.11</artifactId>
    <version>3.4.1</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu-spark24_2.11</artifactId>
    <version>3.4.1</version>
</dependency>

spark-nlp on Apache Spark 2.3.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-spark23_2.11</artifactId>
    <version>3.4.1</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu-spark23_2.11...
Read more

John Snow Labs Spark-NLP 3.4.0: New OpenAI GPT-2, new ALBERT, XLNet, RoBERTa, XLM-RoBERTa, and Longformer for Sequence Classification, support for Spark 3.2, new distributed Word2Vec, extend support to more Databricks & EMR runtimes, new state-of-the-art transformer models, bug fixes, and lots more!

05 Jan 15:25

Overview

We are very excited to release Spark NLP 3.4.0! This has been one of the biggest releases we have ever done and we are so proud to share this with our community at the dawn of 2022! 🎉

Spark NLP 3.4.0 extends the support for Apache Spark 3.2.x major releases on Scala 2.12. We now support all 5 major Apache Spark and PySpark releases of 2.3.x, 2.4.x, 3.0.x, 3.1.x, and 3.2.x at once helping our community to migrate from earlier Apache Spark versions to newer releases without being worried about Spark NLP end of life support. We also extend support for new Databricks and EMR instances on Spark 3.2.x clusters.

This release also comes with a brand new GPT2Transformer using OpenAI GPT-2 models for prediction at scale, new ALBERT, XLNet, RoBERTa, XLM-RoBERTa, and Longformer annotators to use existing or fine-tuned models for Sequence Classification, new distributed and trainable Word2Vec annotators, new state-of-the-art transformer models in many languages, a new param to useBestModel in NerDL during training, bug fixes, and lots more!

As always, we would like to thank our community for their feedback, questions, and feature requests.


Major features and improvements

  • NEW: Introducing GPT2Transformer annotator in Spark NLP 🚀 for Text Generation purposes. GPT2Transformer uses OpenAI GPT-2 models from HuggingFace 🤗 for prediction at scale in Spark NLP 🚀. GPT-2 is a transformer model trained on a very large corpus of English data in a self-supervised fashion. This means it was trained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts. More precisely, it was trained to guess the next word in sentences
  • NEW: Introducing RoBertaForSequenceClassification annotator in Spark NLP 🚀. RoBertaForSequenceClassification can load RoBERTa Models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using RobertaForSequenceClassification for PyTorch or TFRobertaForSequenceClassification for TensorFlow models in HuggingFace 🤗
  • NEW: Introducing XlmRoBertaForSequenceClassification annotator in Spark NLP 🚀. XlmRoBertaForSequenceClassification can load XLM-RoBERTa Models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using XLMRobertaForSequenceClassification for PyTorch or TFXLMRobertaForSequenceClassification for TensorFlow models in HuggingFace 🤗
  • NEW: Introducing LongformerForSequenceClassification annotator in Spark NLP 🚀. LongformerForSequenceClassification can load Longformer Models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using LongformerForSequenceClassification for PyTorch or TFLongformerForSequenceClassification for TensorFlow models in HuggingFace 🤗
  • NEW: Introducing AlbertForSequenceClassification annotator in Spark NLP 🚀. AlbertForSequenceClassification can load ALBERT Models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using AlbertForSequenceClassification for PyTorch or TFAlbertForSequenceClassification for TensorFlow models in HuggingFace 🤗
  • NEW: Introducing XlnetForSequenceClassification annotator in Spark NLP 🚀. XlnetForSequenceClassification can load XLNet Models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using XLNetForSequenceClassification for PyTorch or TFXLNetForSequenceClassification for TensorFlow models in HuggingFace 🤗
  • NEW: Introducing trainable and distributed Word2Vec annotators based on Word2Vec in Spark ML. You can train Word2Vec in a cluster on multiple machines to handle large-scale datasets and use the trained model for token-level classifications such as NerDL (a training sketch follows this list)
  • Introducing useBestModel param in NerDLApproach annotator. This param in the NerDLApproach preserves and restores the model that has achieved the best performance at the end of the training. The priority is metrics from testDataset (micro F1), metrics from validationSplit (micro F1), and if none is set it will keep track of loss during the training
  • Support Apache Spark and PySpark 3.2.x on Scala 2.12. Spark NLP by default is shipped for Spark 3.0.x/3.1.x, but now you have spark-nlp-spark32 and spark-nlp-gpu-spark32 packages
  • Adding a new param to sparknlp.start() function in Python for Apache Spark 3.2.x (spark32=True)
  • Update Colab and Kaggle scripts for faster setup. We no longer need to remove Java 11 in order to install Java 8 since Spark NLP works on Java 11. This makes the installation of Spark NLP on Colab and Kaggle as fast as pip install spark-nlp pyspark==3.1.2
  • Add new scripts/notebook to generate custom TensorFlow graphs for the ContextSpellCheckerApproach annotator
  • Add a new graphFolder param to the ContextSpellCheckerApproach annotator. This param allows training ContextSpellChecker from a custom-made TensorFlow graph
  • Support DBFS file system in graphFolder param. Starting Spark NLP 3.4.0 you can point NerDLApproach or ContextSpellCheckerApproach to a TF graph hosted on Databricks
  • Add a new feature to all classifiers (ForTokenClassification and ForSequenceClassification) to retrieve classes from the pretrained models
sequenceClassifier = XlmRoBertaForSequenceClassification \
      .pretrained('xlm_roberta_base_sequence_classifier_ag_news', 'en') \
      .setInputCols(['token', 'document']) \
      .setOutputCol('class')

print(sequenceClassifier.getClasses())

#Sports, Business, World, Sci/Tech
  • Add inputFormats param to DateMatcher and MultiDateMatcher annotators. DateMatcher and MultiDateMatcher can now define a list of acceptable input date patterns to search for in the text, while the output format defines the single pattern used for all produced dates.
date_matcher = DateMatcher() \
    .setInputCols(['document']) \
    .setOutputCol("date") \
    .setInputFormats(["yyyy", "yyyy/dd/MM", "MM/yyyy"]) \
    .setOutputFormat("yyyyMM") \
    .setSourceLanguage("en")  # setOutputFormat was previously called setDateFormat
  • Enable batch processing in T5Transformer and MarianTransformer annotators
  • Add Schema to readDataset in CoNLL() class
  • Welcoming 6x new Databricks runtimes to our Spark NLP family:
    • Databricks 10.0
    • Databricks 10.0 ML GPU
    • Databricks 10.1
    • Databricks 10.1 ML GPU
    • Databricks 10.2
    • Databricks 10.2 ML GPU
  • Welcoming 3x new EMR releases to our Spark NLP family:
    • EMR 5.33.1 (Apache Spark 2.4.7 / Hadoop 2.10.1)
    • EMR 6.3.1 (Apache Spark 3.1.1 / Hadoop 3.2.1)
    • EMR 6.4.0 (Apache Spark 3.1.2 / Hadoop 3.2.1)
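
As an illustration of the new trainable Word2Vec annotator, a minimal training sketch follows. The vector size, min count, and column names are illustrative choices, not prescribed defaults:

from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, Word2VecApproach
from pyspark.ml import Pipeline

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

# Trains distributed word vectors on the token column
word2vec = Word2VecApproach() \
    .setInputCols(["token"]) \
    .setOutputCol("embeddings") \
    .setVectorSize(100) \
    .setMinCount(2)

pipeline = Pipeline(stages=[document_assembler, tokenizer, word2vec])
# word2vec_model = pipeline.fit(training_df)  # training_df needs a "text" column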

Bug Fixes

  • Fix a race condition in cluster mode where the TF session was accessed as many times as the number of available cores on the Driver machine for the very first time. Loading a model multiple times at once results in higher disk usage, and IO may become a bottleneck for larger models, especially on a machine with slower disks. Thanks to @jerrychenhf for finding this issue and offering a solution #6575
  • Fix a performance issue introduced in the 3.3.3 release for the T5Transformer and MarianTransformer annotators. While adding support for ignored tokens, we accidentally introduced a bug that degraded the performance of these two annotators (sometimes up to 2x slower). Please update to 3.4.0 if you are using either of these annotators #6605
  • Fix a bug in model resolution by not filtering based on the timestamp
  • Fix configProtoBytes param type in Python #6549
  • Fix missing DefaultParamsReadable in RegexTokenizer annotator #6653
  • Fix missing models lemma_antbnc, sentiment_vivekn, and spellcheck_norvig for Spark 3.x
  • Fix missing pipelines clean_slang, check_spelling, match_chunks, and match_datetime for Spark 3.x
  • Fix saveModel in TrainingHelper
  • Fix Keyword/Yake module naming in Scala #6562

Models Hub

Models Hub now comes with new features to easily filter and find your desired models & pipelines by:

  • NLP Task
  • Natural Language
  • Spark NLP version


In addition, you can also filter models & pipelines by:

  • Models or Pipelines (finally! 😃)
  • Tags used inside Model's card
  • Or even by predicted entities (which labels/classes a model can predict)


As always, you can host your own pre-trained models & pipelines easily accessible to you for free & forever! 🚀


Models and Pipelines
--------------...

Read more

John Snow Labs Spark-NLP 3.3.4: Patch release

25 Nov 15:16

Patch release

  • Fix ClassCastException error in pretrained function for DistilBertForSequenceClassification in Python #6513

Documentation


Installation

Python

#PyPI

pip install spark-nlp==3.3.4

Spark Packages

spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.3.4

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.3.4

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.3.4

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.3.4

spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.3.4

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.3.4

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.3.4

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.3.4

spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.3.4

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.3.4

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.3.4

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.3.4

Maven

spark-nlp on Apache Spark 3.0.x and 3.1.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>3.3.4</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>3.3.4</version>
</dependency>

spark-nlp on Apache Spark 2.4.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-spark24_2.11</artifactId>
    <version>3.3.4</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu-spark24_2.11</artifactId>
    <version>3.3.4</version>
</dependency>

spark-nlp on Apache Spark 2.3.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-spark23_2.11</artifactId>
    <version>3.3.4</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu-spark23_2.11</artifactId>
    <version>3.3.4</version>
</dependency>

FAT JARs

What's Changed

Full Changelog: 3.3.3...3.3.4

John Snow Labs Spark-NLP 3.3.3: New DistilBERT for Sequence Classification, new trainable and distributed Doc2Vec, BERT improvements on GPU, new state-of-the-art DistilBERT models for topic and sentiment detection, enhancements, and bug fixes!

22 Nov 18:37

Overview

(knock, knock, knock) Penny? Yes, this is a very special release if you are obsessed with the number 3 as much as we are! So we are pleased to announce the Spark NLP 🚀 3.3.3 release! 🎉 🎊 🎈

This release comes with a new DistilBertForSequenceClassification annotator for existing or fine-tuned DistilBERT Text Classification models on HuggingFace, a new distributed and trainable Doc2Vec annotator based on the Word2Vec implementation in Spark ML, improved BertEmbeddings and BertSentenceEmbeddings on a single machine with a GPU device when the DataFrame has one sentence per row or the input column is set to document, new state-of-the-art fine-tuned DistilBERT models for Sequence Classification, enhancements, bug fixes, and more!

As always, we would like to thank our community for their feedback, questions, and feature requests.


New Features and Enhancements

  • NEW: Introducing DistilBertForSequenceClassification annotator in Spark NLP 🚀. DistilBertForSequenceClassification can load DistilBERT Models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using DistilBertForSequenceClassification or TFDistilBertForSequenceClassification in HuggingFace 🤗 (a usage sketch follows this list)
  • NEW: Introducing trainable and distributed Doc2Vec annotators based on Word2Vec in Spark ML
  • Improving BertEmbeddings for single document/sentence DataFrame per row on a single machine with a GPU device
  • Improving BertSentenceEmbeddings for single document/sentence DataFrame per row on a single machine with a GPU device
  • Add a new feature to the CoNLL() class, allowing it to read multiple CoNLL files at the same time into a single DataFrame
  • Add support for Long type in label column for ClassifierDLApproach and SentimentDLApproach
  • Add a script to set up AWS SageMaker, thanks to @xegulon
  • Add instructions to set up Amazon Linux 2
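
As a quick sketch of the new annotator, the snippet below loads one of the fine-tuned models listed under Models; the column names are illustrative:

from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, DistilBertForSequenceClassification
from pyspark.ml import Pipeline

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

# Sequence classification with a fine-tuned DistilBERT model (SST-2 sentiment)
classifier = DistilBertForSequenceClassification \
    .pretrained("distilbert_sequence_classifier_sst2", "en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("class")

pipeline = Pipeline(stages=[document_assembler, tokenizer, classifier])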

Bug Fixes

  • Improve model and pipeline resolution in Spark NLP to avoid downloading the wrong models/pipelines for a given Apache Spark version
  • Fix MarianTransformer bug on empty sequences
  • Fix TFInvalidArgumentException in MarianTransformer for sequences longer than 512
  • Fix MarianTransformer multi-lingual models and pipelines such as opus_mt_mul_en
  • Fix a bug in DateMatcher and MultiDateMatcher when detecting month from subwords by mistake
  • Add the missing lemma_antbnc model to Models Hub
  • Add the missing sentiment_vivekn model to Models Hub
  • Add the missing spellcheck_norvig model to Models Hub

Models

New state-of-the-art fine-tuned DistilBERT models for Sequence Classification:

Featured Pretrained Models

Model | Name | Lang | Build
DistilBertForSequenceClassification | distilbert_sequence_classifier_sst2 | en | 3.3.3
DistilBertForSequenceClassification | distilbert_sequence_classifier_policy | en | 3.3.3
DistilBertForSequenceClassification | distilbert_sequence_classifier_industry | en | 3.3.3
DistilBertForSequenceClassification | distilbert_sequence_classifier_emotion | en | 3.3.3
DistilBertForSequenceClassification | distilbert_sequence_classifier_banking77 | en | 3.3.3
DistilBertForSequenceClassification | distilbert_multilingual_sequence_classifier_allocine | fr | 3.3.3
DistilBertForSequenceClassification | distilbert_base_sequence_classifier_imdb | ur | 3.3.3
DistilBertForSequenceClassification | distilbert_base_sequence_classifier_imdb | en | 3.3.3
DistilBertForSequenceClassification | distilbert_base_sequence_classifier_amazon_polarity | en | 3.3.3
DistilBertForSequenceClassification | distilbert_base_sequence_classifier_ag_news | en | 3.3.3
Doc2VecModel | doc2vec_gigaword_300 | en | 3.3.3
Doc2VecModel | doc2vec_gigaword_wiki_300 | en | 3.3.3

The complete list of all 4000+ models & pipelines in 200+ languages is available on Models Hub.

New Notebooks

Spark NLP | Notebooks | Colab
DistilBertForSequenceClassification | HuggingFace in Spark NLP - DistilBertForSequenceClassification | Open In Colab
Doc2Vec | Train Doc2Vec for Text Classification | Open In Colab

Documentation


Installation

Python

#PyPI

pip install spark-nlp==3.3.3

Spark Packages

spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.3.3

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.3.3

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.3.3

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.3.3

spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.3.3

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.3.3

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.3.3

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.3.3

spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.3.3

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.3.3

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.3.3

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.3.3

Maven

spark-nlp on Apache Spark 3.0.x and 3.1.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>3.3.3</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>3.3.3</version>
</dependency>

spark-nlp on Apache Spark 2.4.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-spark24_2.11</artifactId>
    <version>3.3.3</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu-spark24_2.11</artifactId>
    <version>3.3.3</version>
</dependency>

spark-nlp on Apache Spark 2.3.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-spark23_2.11</artifactId>
    <version>3....
Read more