Spark NLP 4.3.1: Patch release

@maziyarpanahi maziyarpanahi released this 24 Feb 18:36

📢 Overview

Spark NLP 4.3.1 🚀 comes with a new SpacyToAnnotation feature to import documents, sentences, and tokens from spaCy and similar libraries into Spark NLP pipelines. We have also made other improvements in this patch release.

As always, we would like to thank our community for their feedback, questions, and feature requests. 🎉


⭐ New Features & Enhancements

  • Easily import sentences and tokens from external libraries such as spaCy into a Spark NLP pipeline
# this is what your file exported from spaCy would look like
! cat ./multi_doc_tokens.json

[
  {
    "tokens": ["John", "went", "to", "the", "store", "last", "night", ".", "He", "bought", "some", "bread", "."],
    "token_spaces": [true, true, true, true, true, true, false, true, true, true, true, false, false],
    "sentence_ends": [7, 12]
  },
  {
    "tokens": ["Hello", "world", "!", "How", "are", "you", "today", "?", "I", "'m", "fine", "thanks", "."],
    "token_spaces": [true, false, true, true, true, true, false, true, false, true, true, false, false],
    "sentence_ends": [2, 7, 12]
  }
]

# we are now going to prepare these documents, sentences, and tokens for Spark NLP
from sparknlp.training import SpacyToAnnotation

nlp_reader = SpacyToAnnotation()
result = nlp_reader.readJsonFile(spark, "./multi_doc_tokens.json")

result.printSchema()
# now you have all the annotations for documents, sentences, and tokens needed in Spark NLP
root
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- sentence: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- token: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
  • Implement a params parameter in SparkNLP.start() that supplies custom configurations to the SparkSession in Scala (to stay in sync with Python)
val hadoopAwsVersion: String = "3.3.1"
val awsJavaSdkVersion: String = "1.11.901"

val extraParams: Map[String, String] = Map(
  "spark.jars.packages" -> ("org.apache.hadoop:hadoop-aws:" + hadoopAwsVersion + ",com.amazonaws:aws-java-sdk:" + awsJavaSdkVersion),
  "spark.hadoop.fs.s3a.path.style.access" -> "true")

val spark = SparkNLP.start(params = extraParams)
  • Add entity field to the metadata in Date2Chunk
  • Fix ViT models & pipelines examples in Models Hub
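Since the params item above brings Scala in sync with Python, the Python counterpart of the Scala snippet might look roughly like the sketch below. It assumes that sparknlp.start() accepts a params dictionary, as the release note implies; the helper names build_extra_params and start_session are hypothetical and exist only for illustration.

```python
def build_extra_params(hadoop_aws_version="3.3.1", aws_java_sdk_version="1.11.901"):
    """Build the same Spark configuration map as the Scala example."""
    packages = ",".join([
        f"org.apache.hadoop:hadoop-aws:{hadoop_aws_version}",
        f"com.amazonaws:aws-java-sdk:{aws_java_sdk_version}",
    ])
    return {
        "spark.jars.packages": packages,
        "spark.hadoop.fs.s3a.path.style.access": "true",
    }


def start_session():
    """Start Spark NLP with the extra configuration.

    Needs spark-nlp and pyspark installed; assumes start() takes a params dict.
    """
    import sparknlp  # imported lazily so build_extra_params works without Spark
    return sparknlp.start(params=build_extra_params())


extra_params = build_extra_params()
print(extra_params["spark.jars.packages"])
```

Calling start_session() is left to the reader because it launches a real SparkSession and downloads the listed packages.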

📓 New Notebooks

Spark NLP
Import Tokens from spaCy or a JSON file
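To make the JSON layout from the SpacyToAnnotation example earlier in these notes easier to read, here is a small plain-Python sketch that rebuilds the text and sentences from one document. It rests on two assumptions about the format that the notes do not spell out: each token_spaces flag says whether the token is followed by a space, and each sentence_ends entry is the index of the last token of a sentence. The helper names rebuild_text and split_sentences are hypothetical and are not part of Spark NLP.

```python
# One document copied from the example file shown in these release notes.
doc = {
    "tokens": ["John", "went", "to", "the", "store", "last", "night", ".",
               "He", "bought", "some", "bread", "."],
    "token_spaces": [True, True, True, True, True, True, False, True,
                     True, True, True, False, False],
    "sentence_ends": [7, 12],
}


def rebuild_text(tokens, token_spaces):
    """Join tokens, appending a space after each token whose flag is True (assumed meaning)."""
    return "".join(tok + (" " if space else "") for tok, space in zip(tokens, token_spaces))


def split_sentences(tokens, token_spaces, sentence_ends):
    """Slice the tokens at each sentence-final index (assumed inclusive) and rebuild each slice."""
    sentences, start = [], 0
    for end in sentence_ends:
        text = rebuild_text(tokens[start:end + 1], token_spaces[start:end + 1])
        sentences.append(text.strip())
        start = end + 1
    return sentences


text = rebuild_text(doc["tokens"], doc["token_spaces"])
sentences = split_sentences(doc["tokens"], doc["token_spaces"], doc["sentence_ends"])
print(text)       # John went to the store last night. He bought some bread.
print(sentences)  # ['John went to the store last night.', 'He bought some bread.']
```

This only illustrates what the three fields appear to encode; SpacyToAnnotation itself does the conversion to document, sentence, and token annotations.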

📖 Documentation

Community support

  • Slack: live discussion with the Spark NLP community and the team
  • GitHub: bug reports, feature requests, and contributions
  • Discussions: engage with other community members, share ideas, and show off how you use Spark NLP!
  • Medium: Spark NLP articles
  • YouTube: Spark NLP video tutorials

Installation

Python

#PyPI

pip install spark-nlp==4.3.1

Spark Packages

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x (Scala 2.12):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.1

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.1

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.3.1

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.3.1

Apple Silicon (M1 & M2)

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:4.3.1

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:4.3.1

AArch64

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.3.1

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.3.1

Maven

spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>4.3.1</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>4.3.1</version>
</dependency>

spark-nlp-silicon:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-silicon_2.12</artifactId>
    <version>4.3.1</version>
</dependency>

spark-nlp-aarch64:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-aarch64_2.12</artifactId>
    <version>4.3.1</version>
</dependency>

FAT JARs

What's Changed

Full Changelog: 4.3.0...4.3.1