Spark NLP 4.3.1: Patch release
## Overview
Spark NLP 4.3.1 comes with a new `SpacyToAnnotation` feature to import documents, sentences, and tokens from spaCy and similar libraries into Spark NLP pipelines. We have also made other improvements in this patch release.

As always, we would like to thank our community for their feedback, questions, and feature requests.
## New Features & Enhancements
- Easily use sentences and tokens from external libraries such as spaCy in Spark NLP pipelines
This is how your exported file from spaCy would look:

```sh
cat ./multi_doc_tokens.json
```

```json
[
  {
    "tokens": ["John", "went", "to", "the", "store", "last", "night", ".", "He", "bought", "some", "bread", "."],
    "token_spaces": [true, true, true, true, true, true, false, true, true, true, true, false, false],
    "sentence_ends": [7, 12]
  },
  {
    "tokens": ["Hello", "world", "!", "How", "are", "you", "today", "?", "I", "'m", "fine", "thanks", "."],
    "token_spaces": [true, false, true, true, true, true, false, true, false, true, true, false, false],
    "sentence_ends": [2, 7, 12]
  }
]
```
We can now prepare these documents, sentences, and tokens for Spark NLP:

```python
from sparknlp.training import SpacyToAnnotation

nlp_reader = SpacyToAnnotation()
result = nlp_reader.readJsonFile(spark, "./multi_doc_tokens.json")
result.printSchema()
```

The result contains all the annotations for documents, sentences, and tokens needed in Spark NLP:
```
root
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- sentence: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- token: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
```
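If your tokens come from somewhere other than spaCy, the expected JSON layout can be produced with plain Python. The helper below is a hypothetical sketch (not part of Spark NLP): it takes pre-tokenized documents, where each document is a list of sentences and each sentence a list of tokens. Its whitespace rule (a space after every token except a sentence's last one) is a simplification; a real spaCy export would derive `token_spaces` from each token's trailing whitespace.

```python
import json

def to_annotation_json(docs):
    """Build the JSON layout consumed by SpacyToAnnotation.readJsonFile."""
    payload = []
    for sentences in docs:
        tokens, spaces, ends = [], [], []
        for sent in sentences:
            tokens.extend(sent)
            # Simplified rule: space after every token except the sentence-final one
            spaces.extend([True] * (len(sent) - 1) + [False])
            # sentence_ends holds the document-level index of each sentence's last token
            ends.append(len(tokens) - 1)
        payload.append(
            {"tokens": tokens, "token_spaces": spaces, "sentence_ends": ends}
        )
    return payload

docs = [[["Hello", "world", "!"], ["How", "are", "you", "?"]]]
payload = to_annotation_json(docs)
print(json.dumps(payload, indent=2))
# sentence_ends for this document: [2, 6]
```

The resulting list can be dumped with `json.dump` to a file and read back with `readJsonFile` as shown above.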
- Implement the `params` parameter, which can supply custom configurations to the SparkSession in Scala (to be in sync with Python)
```scala
val hadoopAwsVersion: String = "3.3.1"
val awsJavaSdkVersion: String = "1.11.901"

val extraParams: Map[String, String] = Map(
  "spark.jars.packages" -> ("org.apache.hadoop:hadoop-aws:" + hadoopAwsVersion + ",com.amazonaws:aws-java-sdk:" + awsJavaSdkVersion),
  "spark.hadoop.fs.s3a.path.style.access" -> "true")

val spark = SparkNLP.start(params = extraParams)
```
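For comparison, the same configuration can be sketched on the Python side. This is a hedged example: the dictionary below just mirrors the Scala map above, and passing it via a `params` argument assumes your installed `sparknlp` version exposes that argument on `sparknlp.start` (the call itself is shown only as a comment).

```python
# Illustrative config map mirroring the Scala example above.
hadoop_aws_version = "3.3.1"
aws_java_sdk_version = "1.11.901"

extra_params = {
    "spark.jars.packages": (
        f"org.apache.hadoop:hadoop-aws:{hadoop_aws_version},"
        f"com.amazonaws:aws-java-sdk:{aws_java_sdk_version}"
    ),
    "spark.hadoop.fs.s3a.path.style.access": "true",
}

# Assuming the params argument is available in your sparknlp version:
# spark = sparknlp.start(params=extra_params)
print(extra_params["spark.jars.packages"])
```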
- Add `entity` field to the metadata in `Date2Chunk`
- Fix ViT models & pipelines examples in Models Hub
## New Notebooks
| Spark NLP |
|---|
| Import Tokens from spaCy or a JSON file |
- You can visit Import Transformers in Spark NLP
- You can visit Spark NLP Examples for 100+ examples
## Documentation
- Import models from TF Hub & HuggingFace
- Spark NLP Notebooks
- Models Hub with new models
- Spark NLP Articles
- Spark NLP in Action
- Spark NLP Documentation
- Spark NLP Scala APIs
- Spark NLP Python APIs
## Community support
- Slack For live discussion with the Spark NLP community and the team
- GitHub Bug reports, feature requests, and contributions
- Discussions Engage with other community members, share ideas, and show off how you use Spark NLP!
- Medium Spark NLP articles
- YouTube Spark NLP video tutorials
## Installation

### Python

```bash
# PyPI
pip install spark-nlp==4.3.1
```
### Spark Packages

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x (Scala 2.12):

```bash
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.1

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.1
```
### GPU

```bash
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.3.1

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.3.1
```

### Apple Silicon (M1 & M2)

```bash
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:4.3.1

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:4.3.1
```

### AArch64

```bash
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.3.1

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.3.1
```
### Maven

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x:

```xml
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>4.3.1</version>
</dependency>
```

spark-nlp-gpu:

```xml
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>4.3.1</version>
</dependency>
```

spark-nlp-silicon:

```xml
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-silicon_2.12</artifactId>
    <version>4.3.1</version>
</dependency>
```

spark-nlp-aarch64:

```xml
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-aarch64_2.12</artifactId>
    <version>4.3.1</version>
</dependency>
```
### FAT JARs

- CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-4.3.1.jar
- GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-4.3.1.jar
- M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-silicon-assembly-4.3.1.jar
- AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-4.3.1.jar
## What's Changed
- Update Index and footer date by @agsfer in #13492
- Release notes for 4.6.3 and 4.6.5 by @rpranab in #13455
- Docs/playground prompts by @diatrambitas in #13498
- BUGFIX NMH-155: The generated JSON files should not be in the repo by @pabla in #13499
- deid utility module added by @Ahmetemintek in #13500
- Models hub internal by @Cabir40 in #13502
- deid module update by @Ahmetemintek in #13506
- 4.3.0 released by @Cabir40 in #13511
- RN updated by @Cabir40 in #13512
- Models hub internal by @Cabir40 in #13509
- Cc 2 13 update by @Cabir40 in #13513
- Update licensed docs by @dcecchini in #13405
- Fixed links after the change in example to examples by @dcecchini in #13517
- release notes for ocr 4.3.1 by @albertoandreottiATgmail in #13533
- Ocr release notes 4.3.1 by @albertoandreottiATgmail in #13534
- Legal NLP 1.8.0 by @josejuanmartinez in #13536
- [skip ci] Create PR 4.3.0-healthcare-docs-1c4423e4384b49066a7a73d412e2fb01155fde6e-15 by @jsl-builder in #13505
- Finance NLP 1.8.0 by @josejuanmartinez in #13537
- Update 2023-02-02-legpipe_ner_contract_doc_parties_alias_former_en.md by @gadde5300 in #13490
- Finance NLP 1.8.0 by @josejuanmartinez in #13538
- Docs/prepaid lib product by @diatrambitas in #13545
- FEATURE NMH-152: Update tooltips for small, medium, large, and xlarge models by @pabla in #13549
- Removed all mention to lawinsider in md cards by @Mary-Sci in #13553
- Docs/nlp lab release4.7.1 by @rpranab in #13555
- [SPARKNLP-741] Fix for Scala Examples by @DevinTDHa in #13501
- SPARKNLP-743: Add parameter to SparkNLP.start by @DevinTDHa in #13510
- SPARKNLP-735 Adding SpacyToAnnotation component by @danilojsl in #13515
- Models hub by @maziyarpanahi in #13559
- 431-release-candidate by @maziyarpanahi in #13558
- SPARKNLP 735 adding notebook examples by @danilojsl in #13560
**Full Changelog**: 4.3.0...4.3.1