From 6957c9d8ed070068550fd3a0ff8cf57eca495bce Mon Sep 17 00:00:00 2001 From: Danilo Burbano Date: Fri, 5 Jul 2024 10:58:06 -0500 Subject: [PATCH] [SPARKNLP-1015] Restructuring Readme and Documentation --- README.md | 1227 +++------------------------------- docs/_data/navigation.yml | 6 + docs/en/advanced_settings.md | 142 ++++ docs/en/features.md | 120 ++++ docs/en/install.md | 435 +++++++++++- docs/en/pipelines.md | 1035 +++------------------------- 6 files changed, 906 insertions(+), 2059 deletions(-) create mode 100644 docs/en/advanced_settings.md create mode 100644 docs/en/features.md diff --git a/README.md b/README.md index cb7c32736e8638..e75f09e6210138 100644 --- a/README.md +++ b/README.md @@ -29,148 +29,17 @@ It also offers tasks such as **Tokenization**, **Word Segmentation**, **Part-of- Take a look at our official Spark NLP page: [https://sparknlp.org/](https://sparknlp.org/) for user documentation and examples -## Community support - -- [Slack](https://join.slack.com/t/spark-nlp/shared_invite/zt-198dipu77-L3UWNe_AJ8xqDk0ivmih5Q) For live discussion with the Spark NLP community and the team -- [GitHub](https://github.com/JohnSnowLabs/spark-nlp) Bug reports, feature requests, and contributions -- [Discussions](https://github.com/JohnSnowLabs/spark-nlp/discussions) Engage with other community members, share ideas, - and show off how you use Spark NLP! 
-- [Medium](https://medium.com/spark-nlp) Spark NLP articles -- [YouTube](https://www.youtube.com/channel/UCmFOjlpYEhxf_wJUDuz6xxQ/videos) Spark NLP video tutorials - -## Table of contents - -- [Features](#features) -- [Requirements](#requirements) -- [Quick Start](#quick-start) -- [Apache Spark Support](#apache-spark-support) -- [Scala & Python Support](#scala-and-python-support) -- [Databricks Support](#databricks-support) -- [EMR Support](#emr-support) -- [Using Spark NLP](#usage) - - [Packages Cheatsheet](#packages-cheatsheet) - - [Spark Packages](#spark-packages) - - [Scala](#scala) - - [Maven](#maven) - - [SBT](#sbt) - - [Python](#python) - - [Pip/Conda](#pipconda) - - [Compiled JARs](#compiled-jars) - - [Apache Zeppelin](#apache-zeppelin) - - [Jupyter Notebook](#jupyter-notebook-python) - - [Google Colab Notebook](#google-colab-notebook) - - [Kaggle Kernel](#kaggle-kernel) - - [Databricks Cluster](#databricks-cluster) - - [EMR Cluster](#emr-cluster) - - [GCP Dataproc](#gcp-dataproc) - - [Spark NLP Configuration](#spark-nlp-configuration) -- [Pipelines & Models](#pipelines-and-models) - - [Pipelines](#pipelines) - - [Models](#models) -- [Offline](#offline) -- [Examples](#examples) -- [FAQ](#faq) -- [Citation](#citation) -- [Contributing](#contributing) - ## Features - -- Tokenization -- Trainable Word Segmentation -- Stop Words Removal -- Token Normalizer -- Document Normalizer -- Document & Text Splitter -- Stemmer -- Lemmatizer -- NGrams -- Regex Matching -- Text Matching -- Chunking -- Date Matcher -- Sentence Detector -- Deep Sentence Detector (Deep learning) -- Dependency parsing (Labeled/unlabeled) -- SpanBertCorefModel (Coreference Resolution) -- Part-of-speech tagging -- Sentiment Detection (ML models) -- Spell Checker (ML and DL models) -- Word Embeddings (GloVe and Word2Vec) -- Doc2Vec (based on Word2Vec) -- BERT Embeddings (TF Hub & HuggingFace models) -- DistilBERT Embeddings (HuggingFace models) -- CamemBERT Embeddings (HuggingFace models) -- 
RoBERTa Embeddings (HuggingFace models) -- DeBERTa Embeddings (HuggingFace v2 & v3 models) -- XLM-RoBERTa Embeddings (HuggingFace models) -- Longformer Embeddings (HuggingFace models) -- ALBERT Embeddings (TF Hub & HuggingFace models) -- XLNet Embeddings -- ELMO Embeddings (TF Hub models) -- Universal Sentence Encoder (TF Hub models) -- BERT Sentence Embeddings (TF Hub & HuggingFace models) -- RoBerta Sentence Embeddings (HuggingFace models) -- XLM-RoBerta Sentence Embeddings (HuggingFace models) -- INSTRUCTOR Embeddings (HuggingFace models) -- E5 Embeddings (HuggingFace models) -- MPNet Embeddings (HuggingFace models) -- UAE Embeddings (HuggingFace models) -- OpenAI Embeddings -- Sentence & Chunk Embeddings -- Unsupervised keywords extraction -- Language Detection & Identification (up to 375 languages) -- Multi-class & Multi-labe Sentiment analysis (Deep learning) -- Multi-class Text Classification (Deep learning) -- BERT for Token & Sequence Classification & Question Answering -- DistilBERT for Token & Sequence Classification & Question Answering -- CamemBERT for Token & Sequence Classification & Question Answering -- ALBERT for Token & Sequence Classification & Question Answering -- RoBERTa for Token & Sequence Classification & Question Answering -- DeBERTa for Token & Sequence Classification & Question Answering -- XLM-RoBERTa for Token & Sequence Classification & Question Answering -- Longformer for Token & Sequence Classification & Question Answering -- MPnet for Token & Sequence Classification & Question Answering -- XLNet for Token & Sequence Classification -- Zero-Shot NER Model -- Zero-Shot Text Classification by Transformers (ZSL) -- Neural Machine Translation (MarianMT) -- Many-to-Many multilingual translation model (Facebook M2M100) -- Table Question Answering (TAPAS) -- Text-To-Text Transfer Transformer (Google T5) -- Generative Pre-trained Transformer 2 (OpenAI GPT2) -- Seq2Seq for NLG, Translation, and Comprehension (Facebook BART) -- Chat and 
Conversational LLMs (Facebook Llama-2) -- Vision Transformer (Google ViT) -- Swin Image Classification (Microsoft Swin Transformer) -- ConvNext Image Classification (Facebook ConvNext) -- Vision Encoder Decoder for image-to-text like captioning -- Zero-Shot Image Classification by OpenAI's CLIP -- Automatic Speech Recognition (Wav2Vec2) -- Automatic Speech Recognition (HuBERT) -- Automatic Speech Recognition (OpenAI Whisper) -- Named entity recognition (Deep learning) -- Easy ONNX, OpenVINO, and TensorFlow integrations -- GPU Support -- Full integration with Spark ML functions -- +31000 pre-trained models in +200 languages! -- +6000 pre-trained pipelines in +200 languages! -- Multi-lingual NER models: Arabic, Bengali, Chinese, Danish, Dutch, English, Finnish, French, German, Hebrew, Italian, - Japanese, Korean, Norwegian, Persian, Polish, Portuguese, Russian, Spanish, Swedish, Urdu, and more. - -## Requirements - -To use Spark NLP you need the following requirements: - -- Java 8 and 11 -- Apache Spark 3.5.x, 3.4.x, 3.3.x, 3.2.x, 3.1.x, 3.0.x - -**GPU (optional):** - -Spark NLP 5.4.0 is built with ONNX 1.17.0 and TensorFlow 2.7.1 deep learning engines. 
The minimum following NVIDIA® software are only required for GPU support: - -- NVIDIA® GPU drivers version 450.80.02 or higher -- CUDA® Toolkit 11.2 -- cuDNN SDK 8.1.0 +- [Text Preprocessing](https://sparknlp.org/docs/en/features#text-preproccesing) +- [Parsing and Analysis](https://sparknlp.org/docs/en/features#parsing-and-analysis) +- [Sentiment and Classification](https://sparknlp.org/docs/en/features#sentiment-and-classification) +- [Embeddings](https://sparknlp.org/docs/en/features#embeddings) +- [Classification and Question Answering Models](https://sparknlp.org/docs/en/features#classification-and-question-answering-models) +- [Machine Translation and Generation](https://sparknlp.org/docs/en/features#machine-translation-and-generation) +- [Image and Speech](https://sparknlp.org/docs/en/features#image-and-speech) +- [Integration and Interoperability (ONNX, OpenVINO)](https://sparknlp.org/docs/en/features#integration-and-interoperability) +- [Pre-trained Models (36000+ in +200 languages)](https://sparknlp.org/docs/en/features#pre-trained-models) +- [Multi-lingual Support](https://sparknlp.org/docs/en/features#multi-lingual-support) ## Quick Start @@ -225,7 +94,27 @@ Output: ['Mona Lisa', 'Leonardo', 'Louvre', 'Paris'] For more examples, you can visit our dedicated [examples](https://github.com/JohnSnowLabs/spark-nlp/tree/master/examples) to showcase all Spark NLP use cases! 
-## Apache Spark Support +### Packages Cheatsheet + +This is a cheatsheet for corresponding Spark NLP Maven package to Apache Spark / PySpark major version: + +| Apache Spark | Spark NLP on CPU | Spark NLP on GPU | Spark NLP on AArch64 (linux) | Spark NLP on Apple Silicon | +|-------------------------|--------------------|----------------------------|--------------------------------|--------------------------------------| +| 3.0/3.1/3.2/3.3/3.4/3.5 | `spark-nlp` | `spark-nlp-gpu` | `spark-nlp-aarch64` | `spark-nlp-silicon` | +| Start Function | `sparknlp.start()` | `sparknlp.start(gpu=True)` | `sparknlp.start(aarch64=True)` | `sparknlp.start(apple_silicon=True)` | + +NOTE: `M1/M2` and `AArch64` are under `experimental` support. Access and support to these architectures are limited by the +community and we had to build most of the dependencies by ourselves to make them compatible. We support these two +architectures, however, they may not work in some environments. + +## Pipelines and Models +For a quick example of using pipelines and models take a look at our official [documentation](https://sparknlp.org/docs/en/install#pipelines-and-models) + +#### Please check out our Models Hub for the full list of [pre-trained models](https://sparknlp.org/models) with examples, demo, benchmark, and more + +## Platform and Ecosystem Support + +### Apache Spark Support Spark NLP *5.4.0* has been built on top of Apache Spark 3.4 while fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x @@ -236,15 +125,10 @@ Spark NLP *5.4.0* has been built on top of Apache Spark 3.4 while fully supports | 5.2.x | YES | YES | YES | YES | YES | YES | NO | NO | | 5.1.x | Partially | YES | YES | YES | YES | YES | NO | NO | | 5.0.x | YES | YES | YES | YES | YES | YES | NO | NO | -| 4.4.x | YES | YES | YES | YES | YES | YES | NO | NO | -| 4.3.x | NO | NO | YES | YES | YES | YES | NO | NO | -| 4.2.x | NO | NO | YES | YES | YES | YES | NO | NO | -| 4.1.x | NO | NO | YES | YES | YES | 
YES | NO | NO | -| 4.0.x | NO | NO | YES | YES | YES | YES | NO | NO | Find out more about `Spark NLP` versions from our [release notes](https://github.com/JohnSnowLabs/spark-nlp/releases). -## Scala and Python Support +### Scala and Python Support | Spark NLP | Python 3.6 | Python 3.7 | Python 3.8 | Python 3.9 | Python 3.10| Scala 2.11 | Scala 2.12 | |-----------|------------|------------|------------|------------|------------|------------|------------| @@ -252,737 +136,87 @@ Find out more about `Spark NLP` versions from our [release notes](https://github | 5.2.x | NO | YES | YES | YES | YES | NO | YES | | 5.1.x | NO | YES | YES | YES | YES | NO | YES | | 5.0.x | NO | YES | YES | YES | YES | NO | YES | -| 4.4.x | NO | YES | YES | YES | YES | NO | YES | -| 4.3.x | YES | YES | YES | YES | YES | NO | YES | -| 4.2.x | YES | YES | YES | YES | YES | NO | YES | -| 4.1.x | YES | YES | YES | YES | NO | NO | YES | -| 4.0.x | YES | YES | YES | YES | NO | NO | YES | -## Databricks Support +Find out more about 4.x `SparkNLP` versions in our official [documentation](https://sparknlp.org/docs/en/install#apache-spark-support) + +### Databricks Support Spark NLP 5.4.0 has been tested and is compatible with the following runtimes: -**CPU:** - -- 9.1 -- 9.1 ML -- 10.1 -- 10.1 ML -- 10.2 -- 10.2 ML -- 10.3 -- 10.3 ML -- 10.4 -- 10.4 ML -- 10.5 -- 10.5 ML -- 11.0 -- 11.0 ML -- 11.1 -- 11.1 ML -- 11.2 -- 11.2 ML -- 11.3 -- 11.3 ML -- 12.0 -- 12.0 ML -- 12.1 -- 12.1 ML -- 12.2 -- 12.2 ML -- 13.0 -- 13.0 ML -- 13.1 -- 13.1 ML -- 13.2 -- 13.2 ML -- 13.3 -- 13.3 ML -- 14.0 -- 14.0 ML -- 14.1 -- 14.1 ML -- 14.2 -- 14.2 ML -- 14.3 -- 14.3 ML - -**GPU:** - -- 9.1 ML & GPU -- 10.1 ML & GPU -- 10.2 ML & GPU -- 10.3 ML & GPU -- 10.4 ML & GPU -- 10.5 ML & GPU -- 11.0 ML & GPU -- 11.1 ML & GPU -- 11.2 ML & GPU -- 11.3 ML & GPU -- 12.0 ML & GPU -- 12.1 ML & GPU -- 12.2 ML & GPU -- 13.0 ML & GPU -- 13.1 ML & GPU -- 13.2 ML & GPU -- 13.3 ML & GPU -- 14.0 ML & GPU -- 14.1 ML & GPU -- 14.2 ML & GPU -- 
14.3 ML & GPU - -## EMR Support +| **CPU** | **GPU** | +|--------------------|--------------------| +| 14.0 / 14.0 ML | 14.0 ML & GPU | +| 14.1 / 14.1 ML | 14.1 ML & GPU | +| 14.2 / 14.2 ML | 14.2 ML & GPU | +| 14.3 / 14.3 ML | 14.3 ML & GPU | + +We are compatible with older runtimes. For a full list check databricks support in our official [documentation](https://sparknlp.org/docs/en/install#databricks-support) + +### EMR Support Spark NLP 5.4.0 has been tested and is compatible with the following EMR releases: -- emr-6.2.0 -- emr-6.3.0 -- emr-6.3.1 -- emr-6.4.0 -- emr-6.5.0 -- emr-6.6.0 -- emr-6.7.0 -- emr-6.8.0 -- emr-6.9.0 -- emr-6.10.0 -- emr-6.11.0 -- emr-6.12.0 -- emr-6.13.0 -- emr-6.14.0 -- emr-6.15.0 -- emr-7.0.0 +| **EMR Release** | +|--------------------| +| emr-6.13.0 | +| emr-6.14.0 | +| emr-6.15.0 | +| emr-7.0.0 | + +We are compatible with older EMR releases. For a full list check EMR support in our official [documentation](https://sparknlp.org/docs/en/install#emr-support) Full list of [Amazon EMR 6.x releases](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-6x.html) Full list of [Amazon EMR 7.x releases](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-7x.html) NOTE: The EMR 6.1.0 and 6.1.1 are not supported. -## Usage - -## Packages Cheatsheet - -This is a cheatsheet for corresponding Spark NLP Maven package to Apache Spark / PySpark major version: - -| Apache Spark | Spark NLP on CPU | Spark NLP on GPU | Spark NLP on AArch64 (linux) | Spark NLP on Apple Silicon | -|-------------------------|--------------------|----------------------------|--------------------------------|--------------------------------------| -| 3.0/3.1/3.2/3.3/3.4/3.5 | `spark-nlp` | `spark-nlp-gpu` | `spark-nlp-aarch64` | `spark-nlp-silicon` | -| Start Function | `sparknlp.start()` | `sparknlp.start(gpu=True)` | `sparknlp.start(aarch64=True)` | `sparknlp.start(apple_silicon=True)` | - -NOTE: `M1/M2` and `AArch64` are under `experimental` support. 
Access and support to these architectures are limited by the -community and we had to build most of the dependencies by ourselves to make them compatible. We support these two -architectures, however, they may not work in some environments. - -## Spark Packages +## Installation ### Command line (requires internet connection) +To install spark-nlp packages through command line follow [these instructions](https://sparknlp.org/docs/en/install#command-line) from our official documentation -Spark NLP supports all major releases of Apache Spark 3.0.x, Apache Spark 3.1.x, Apache Spark 3.2.x, Apache Spark 3.3.x, Apache Spark 3.4.x, and Apache Spark 3.5.x - -#### Apache Spark 3.x (3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x - Scala 2.12) - -```sh -# CPU - -spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0 - -pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0 - -spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0 -``` - -The `spark-nlp` has been published to -the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp). - -```sh -# GPU - -spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.4.0 - -pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.4.0 - -spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.4.0 - -``` - -The `spark-nlp-gpu` has been published to -the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-gpu). - -```sh -# AArch64 - -spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.4.0 - -pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.4.0 - -spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.4.0 - -``` - -The `spark-nlp-aarch64` has been published to -the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-aarch64). 
- -```sh -# M1/M2 (Apple Silicon) - -spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.4.0 - -pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.4.0 - -spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.4.0 - -``` - -The `spark-nlp-silicon` has been published to -the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-silicon). - -**NOTE**: In case you are using large pretrained models like UniversalSentenceEncoder, you need to have the following -set in your SparkSession: - -```sh -spark-shell \ - --driver-memory 16g \ - --conf spark.kryoserializer.buffer.max=2000M \ - --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0 -``` - -## Scala +### Scala Spark NLP supports Scala 2.12.15 if you are using Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x versions. Our packages are -deployed to Maven central. To add any of our packages as a dependency in your application you can follow these -coordinates: - -### Maven - -**spark-nlp** on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x: - -```xml - - - com.johnsnowlabs.nlp - spark-nlp_2.12 - 5.4.0 - -``` - -**spark-nlp-gpu:** - -```xml - - - com.johnsnowlabs.nlp - spark-nlp-gpu_2.12 - 5.4.0 - -``` - -**spark-nlp-aarch64:** - -```xml - - - com.johnsnowlabs.nlp - spark-nlp-aarch64_2.12 - 5.4.0 - -``` - -**spark-nlp-silicon:** - -```xml - - - com.johnsnowlabs.nlp - spark-nlp-silicon_2.12 - 5.4.0 - -``` - -### SBT - -**spark-nlp** on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x: - -```sbtshell -// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp -libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "5.4.0" -``` - -**spark-nlp-gpu:** - -```sbtshell -// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-gpu -libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-gpu" % "5.4.0" -``` - -**spark-nlp-aarch64:** - -```sbtshell -// 
https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-aarch64 -libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-aarch64" % "5.4.0" -``` - -**spark-nlp-silicon:** - -```sbtshell -// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-silicon -libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-silicon" % "5.4.0" -``` - -Maven -Central: [https://mvnrepository.com/artifact/com.johnsnowlabs.nlp](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp) +deployed to Maven central. To add any of our packages as a dependency in your application you can follow [these instructions](https://sparknlp.org/docs/en/install#scala-and-java) +from our official documentation. If you are interested, there is a simple SBT project for Spark NLP to guide you on how to use it in your projects [Spark NLP SBT Starter](https://github.com/maziyarpanahi/spark-nlp-starter) -## Python - -Spark NLP supports Python 3.6.x and above depending on your major PySpark version. - -### Python without explicit Pyspark installation - -### Pip/Conda - -If you installed pyspark through pip/conda, you can install `spark-nlp` through the same channel. - -Pip: - -```bash -pip install spark-nlp==5.4.0 -``` - -Conda: - -```bash -conda install -c johnsnowlabs spark-nlp -``` - -PyPI [spark-nlp package](https://pypi.org/project/spark-nlp/) / -Anaconda [spark-nlp package](https://anaconda.org/JohnSnowLabs/spark-nlp) - -Then you'll have to create a SparkSession either from Spark NLP: - -```python -import sparknlp - -spark = sparknlp.start() -``` - -or manually: - -```python -spark = SparkSession.builder - .appName("Spark NLP") - .master("local[*]") - .config("spark.driver.memory", "16G") - .config("spark.driver.maxResultSize", "0") - .config("spark.kryoserializer.buffer.max", "2000M") - .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0") - .getOrCreate() -``` - -If using local jars, you can use `spark.jars` instead for comma-delimited jar files. 
For cluster setups, of course, -you'll have to put the jars in a reachable location for all driver and executor nodes. - -**Quick example:** - -```python -import sparknlp -from sparknlp.pretrained import PretrainedPipeline - -# create or get Spark Session - -spark = sparknlp.start() - -sparknlp.version() -spark.version - -# download, load and annotate a text by pre-trained pipeline - -pipeline = PretrainedPipeline('recognize_entities_dl', 'en') -result = pipeline.annotate('The Mona Lisa is a 16th century oil painting created by Leonardo') -``` - -## Compiled JARs - -### Build from source - -#### spark-nlp - -- FAT-JAR for CPU on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x - -```bash -sbt assembly -``` - -- FAT-JAR for GPU on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x - -```bash -sbt -Dis_gpu=true assembly -``` - -- FAT-JAR for M! on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x - -```bash -sbt -Dis_silicon=true assembly -``` - -### Using the jar manually - -If for some reason you need to use the JAR, you can either download the Fat JARs provided here or download it -from [Maven Central](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp). - -To add JARs to spark programs use the `--jars` option: - -```sh -spark-shell --jars spark-nlp.jar -``` - -The preferred way to use the library when running spark programs is using the `--packages` option as specified in -the `spark-packages` section. 
- -## Apache Zeppelin - -Use either one of the following options - -- Add the following Maven Coordinates to the interpreter's library list - -```bash -com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0 -``` - -- Add a path to pre-built jar from [here](#compiled-jars) in the interpreter's library list making sure the jar is - available to driver path - -### Python in Zeppelin - -Apart from the previous step, install the python module through pip - -```bash -pip install spark-nlp==5.4.0 -``` - -Or you can install `spark-nlp` from inside Zeppelin by using Conda: - -```bash -python.conda install -c johnsnowlabs spark-nlp -``` - -Configure Zeppelin properly, use cells with %spark.pyspark or any interpreter name you chose. - -Finally, in Zeppelin interpreter settings, make sure you set properly zeppelin.python to the python you want to use and -install the pip library with (e.g. `python3`). - -An alternative option would be to set `SPARK_SUBMIT_OPTIONS` (zeppelin-env.sh) and make sure `--packages` is there as -shown earlier since it includes both scala and python side installation. - -## Jupyter Notebook (Python) - -**Recommended:** - -The easiest way to get this done on Linux and macOS is to simply install `spark-nlp` and `pyspark` PyPI packages and -launch the Jupyter from the same Python environment: - -```sh -$ conda create -n sparknlp python=3.8 -y -$ conda activate sparknlp -# spark-nlp by default is based on pyspark 3.x -$ pip install spark-nlp==5.4.0 pyspark==3.3.1 jupyter -$ jupyter notebook -``` - -Then you can use `python3` kernel to run your code with creating SparkSession via `spark = sparknlp.start()`. 
- -**Optional:** - -If you are in different operating systems and require to make Jupyter Notebook run by using pyspark, you can follow -these steps: - -```bash -export SPARK_HOME=/path/to/your/spark/folder -export PYSPARK_PYTHON=python3 -export PYSPARK_DRIVER_PYTHON=jupyter -export PYSPARK_DRIVER_PYTHON_OPTS=notebook - -pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0 -``` - -Alternatively, you can mix in using `--jars` option for pyspark + `pip install spark-nlp` - -If not using pyspark at all, you'll have to run the instructions -pointed [here](#python-without-explicit-pyspark-installation) - -## Google Colab Notebook - -Google Colab is perhaps the easiest way to get started with spark-nlp. It requires no installation or setup other than -having a Google account. - -Run the following code in Google Colab notebook and start using spark-nlp right away. - -```sh -# This is only to setup PySpark and Spark NLP on Colab -!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash -``` - -This script comes with the two options to define `pyspark` and `spark-nlp` versions via options: - -```sh -# -p is for pyspark -# -s is for spark-nlp -# -g will enable upgrading libcudnn8 to 8.1.0 on Google Colab for GPU usage -# by default they are set to the latest -!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.4.0 -``` - -[Spark NLP quick start on Google Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/quick_start_google_colab.ipynb) -is a live demo on Google Colab that performs named entity recognitions and sentiment analysis by using Spark NLP -pretrained pipelines. - -## Kaggle Kernel - -Run the following code in Kaggle Kernel and start using spark-nlp right away. 
- -```sh -# Let's setup Kaggle for Spark NLP and PySpark -!wget https://setup.johnsnowlabs.com/kaggle.sh -O - | bash -``` - -This script comes with the two options to define `pyspark` and `spark-nlp` versions via options: - -```sh -# -p is for pyspark -# -s is for spark-nlp -# -g will enable upgrading libcudnn8 to 8.1.0 on Kaggle for GPU usage -# by default they are set to the latest -!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.4.0 -``` - -[Spark NLP quick start on Kaggle Kernel](https://www.kaggle.com/mozzie/spark-nlp-named-entity-recognition) is a live -demo on Kaggle Kernel that performs named entity recognitions by using Spark NLP pretrained pipeline. - -## Databricks Cluster - -1. Create a cluster if you don't have one already - -2. On a new cluster or existing one you need to add the following to the `Advanced Options -> Spark` tab: - - ```bash - spark.kryoserializer.buffer.max 2000M - spark.serializer org.apache.spark.serializer.KryoSerializer - ``` - -3. In `Libraries` tab inside your cluster you need to follow these steps: - - 3.1. Install New -> PyPI -> `spark-nlp==5.4.0` -> Install - - 3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0` -> Install - -4. Now you can attach your notebook to the cluster and use Spark NLP! - -NOTE: Databricks' runtimes support different Apache Spark major releases. Please make sure you choose the correct Spark -NLP Maven package name (Maven Coordinate) for your runtime from -our [Packages Cheatsheet](https://github.com/JohnSnowLabs/spark-nlp#packages-cheatsheet) - -## EMR Cluster - -To launch EMR clusters with Apache Spark/PySpark and Spark NLP correctly you need to have bootstrap and software -configuration. 
- -A sample of your bootstrap script - -```.sh -#!/bin/bash -set -x -e - -echo -e 'export PYSPARK_PYTHON=/usr/bin/python3 -export HADOOP_CONF_DIR=/etc/hadoop/conf -export SPARK_JARS_DIR=/usr/lib/spark/jars -export SPARK_HOME=/usr/lib/spark' >> $HOME/.bashrc && source $HOME/.bashrc - -sudo python3 -m pip install awscli boto spark-nlp - -set +x -exit 0 - -``` - -A sample of your software configuration in JSON on S3 (must be public access): - -```.json -[{ - "Classification": "spark-env", - "Configurations": [{ - "Classification": "export", - "Properties": { - "PYSPARK_PYTHON": "/usr/bin/python3" - } - }] -}, -{ - "Classification": "spark-defaults", - "Properties": { - "spark.yarn.stagingDir": "hdfs:///tmp", - "spark.yarn.preserve.staging.files": "true", - "spark.kryoserializer.buffer.max": "2000M", - "spark.serializer": "org.apache.spark.serializer.KryoSerializer", - "spark.driver.maxResultSize": "0", - "spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0" - } -}] -``` - -A sample of AWS CLI to launch EMR cluster: - -```.sh -aws emr create-cluster \ ---name "Spark NLP 5.4.0" \ ---release-label emr-6.2.0 \ ---applications Name=Hadoop Name=Spark Name=Hive \ ---instance-type m4.4xlarge \ ---instance-count 3 \ ---use-default-roles \ ---log-uri "s3:///" \ ---bootstrap-actions Path=s3:///emr-bootstrap.sh,Name=custome \ ---configurations "https:///sparknlp-config.json" \ ---ec2-attributes KeyName=,EmrManagedMasterSecurityGroup=,EmrManagedSlaveSecurityGroup= \ ---profile -``` - -## GCP Dataproc - -1. Create a cluster if you don't have one already as follows. 
- -At gcloud shell: - -```bash -gcloud services enable dataproc.googleapis.com \ - compute.googleapis.com \ - storage-component.googleapis.com \ - bigquery.googleapis.com \ - bigquerystorage.googleapis.com -``` - -```bash -REGION= -``` - -```bash -BUCKET_NAME= -gsutil mb -c standard -l ${REGION} gs://${BUCKET_NAME} -``` - -```bash -REGION= -ZONE= -CLUSTER_NAME= -BUCKET_NAME= -``` - -You can set image-version, master-machine-type, worker-machine-type, -master-boot-disk-size, worker-boot-disk-size, num-workers as your needs. -If you use the previous image-version from 2.0, you should also add ANACONDA to optional-components. -And, you should enable gateway. -Don't forget to set the maven coordinates for the jar in properties. - -```bash -gcloud dataproc clusters create ${CLUSTER_NAME} \ - --region=${REGION} \ - --zone=${ZONE} \ - --image-version=2.0 \ - --master-machine-type=n1-standard-4 \ - --worker-machine-type=n1-standard-2 \ - --master-boot-disk-size=128GB \ - --worker-boot-disk-size=128GB \ - --num-workers=2 \ - --bucket=${BUCKET_NAME} \ - --optional-components=JUPYTER \ - --enable-component-gateway \ - --metadata 'PIP_PACKAGES=spark-nlp spark-nlp-display google-cloud-bigquery google-cloud-storage' \ - --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh \ - --properties spark:spark.serializer=org.apache.spark.serializer.KryoSerializer,spark:spark.driver.maxResultSize=0,spark:spark.kryoserializer.buffer.max=2000M,spark:spark.jars.packages=com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0 -``` - -2. On an existing one, you need to install spark-nlp and spark-nlp-display packages from PyPI. +### Python -3. Now, you can attach your notebook to the cluster and use the Spark NLP! +Spark NLP supports Python 3.7.x and above depending on your major PySpark version. 
+Check all available installations for Python in our official [documentation](https://sparknlp.org/docs/en/install#python) -## Spark NLP Configuration -You can change the following Spark NLP configurations via Spark Configuration: +### Compiled JARs +To compile the jars from source follow [these instructions](https://sparknlp.org/docs/en/compiled#jars) from our official documentation -| Property Name | Default | Meaning | -|---------------------------------------------------------|----------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| `spark.jsl.settings.pretrained.cache_folder` | `~/cache_pretrained` | The location to download and extract pretrained `Models` and `Pipelines`. By default, it will be in User's Home directory under `cache_pretrained` directory | -| `spark.jsl.settings.storage.cluster_tmp_dir` | `hadoop.tmp.dir` | The location to use on a cluster for temporarily files such as unpacking indexes for WordEmbeddings. By default, this locations is the location of `hadoop.tmp.dir` set via Hadoop configuration for Apache Spark. NOTE: `S3` is not supported and it must be local, HDFS, or DBFS | -| `spark.jsl.settings.annotator.log_folder` | `~/annotator_logs` | The location to save logs from annotators during training such as `NerDLApproach`, `ClassifierDLApproach`, `SentimentDLApproach`, `MultiClassifierDLApproach`, etc.
By default, it will be in User's Home directory under `annotator_logs` directory | -| `spark.jsl.settings.aws.credentials.access_key_id` | `None` | Your AWS access key to use your S3 bucket to store log files of training models or access tensorflow graphs used in `NerDLApproach` | -| `spark.jsl.settings.aws.credentials.secret_access_key` | `None` | Your AWS secret access key to use your S3 bucket to store log files of training models or access tensorflow graphs used in `NerDLApproach` | -| `spark.jsl.settings.aws.credentials.session_token` | `None` | Your AWS MFA session token to use your S3 bucket to store log files of training models or access tensorflow graphs used in `NerDLApproach` | -| `spark.jsl.settings.aws.s3_bucket` | `None` | Your AWS S3 bucket to store log files of training models or access tensorflow graphs used in `NerDLApproach` | -| `spark.jsl.settings.aws.region` | `None` | Your AWS region to use your S3 bucket to store log files of training models or access tensorflow graphs used in `NerDLApproach` | -| `spark.jsl.settings.onnx.gpuDeviceId` | `0` | Constructs CUDA execution provider options for the specified non-negative device id. | -| `spark.jsl.settings.onnx.intraOpNumThreads` | `6` | Sets the size of the CPU thread pool used for executing a single graph, if executing on a CPU. | -| `spark.jsl.settings.onnx.optimizationLevel` | `ALL_OPT` | Sets the optimization level of this options object, overriding the old setting. | -| `spark.jsl.settings.onnx.executionMode` | `SEQUENTIAL` | Sets the execution mode of this options object, overriding the old setting. 
| +## Platform-Specific Instructions +For detailed instructions on how to use Spark NLP on supported platforms, please refer to our official documentation: -### How to set Spark NLP Configuration +| Platform | Supported Language(s) | +|-------------------------|-----------------------| +| [Apache Zeppelin](https://sparknlp.org/docs/en/install#apache-zeppelin) | Scala, Python | +| [Jupyter Notebook](https://sparknlp.org/docs/en/install#jupter-notebook) | Python | +| [Google Colab Notebook](https://sparknlp.org/docs/en/install#google-colab-notebook) | Python | +| [Kaggle Kernel](https://sparknlp.org/docs/en/install#kaggle-kernel) | Python | +| [Databricks Cluster](https://sparknlp.org/docs/en/install#databricks-cluster) | Scala, Python | +| [EMR Cluster](https://sparknlp.org/docs/en/install#emr-cluster) | Scala, Python | +| [GCP Dataproc Cluster](https://sparknlp.org/docs/en/install#gcp-dataproc) | Scala, Python | -**SparkSession:** - -You can use `.config()` during SparkSession creation to set Spark NLP configurations. 
- -```python -from pyspark.sql import SparkSession - -spark = SparkSession.builder - .master("local[*]") - .config("spark.driver.memory", "16G") - .config("spark.driver.maxResultSize", "0") - .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") - .config("spark.kryoserializer.buffer.max", "2000m") - .config("spark.jsl.settings.pretrained.cache_folder", "sample_data/pretrained") - .config("spark.jsl.settings.storage.cluster_tmp_dir", "sample_data/storage") - .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0") - .getOrCreate() -``` - -**spark-shell:** - -```sh -spark-shell \ - --driver-memory 16g \ - --conf spark.driver.maxResultSize=0 \ - --conf spark.serializer=org.apache.spark.serializer.KryoSerializer - --conf spark.kryoserializer.buffer.max=2000M \ - --conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \ - --conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \ - --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0 -``` -**pyspark:** +### Offline -```sh -pyspark \ - --driver-memory 16g \ - --conf spark.driver.maxResultSize=0 \ - --conf spark.serializer=org.apache.spark.serializer.KryoSerializer - --conf spark.kryoserializer.buffer.max=2000M \ - --conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \ - --conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \ - --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0 -``` - -**Databricks:** - -On a new cluster or existing one you need to add the following to the `Advanced Options -> Spark` tab: +Spark NLP library and all the pre-trained models/pipelines can be used entirely offline with no access to the Internet. 
+Please check [these instructions](https://sparknlp.org/docs/en/install#s3-integration) from our official documentation +to use Spark NLP offline -```bash -spark.kryoserializer.buffer.max 2000M -spark.serializer org.apache.spark.serializer.KryoSerializer -spark.jsl.settings.pretrained.cache_folder dbfs:/PATH_TO_CACHE -spark.jsl.settings.storage.cluster_tmp_dir dbfs:/PATH_TO_STORAGE -spark.jsl.settings.annotator.log_folder dbfs:/PATH_TO_LOGS -``` +## Advanced Settings -NOTE: If this is an existing cluster, after adding new configs or changing existing properties you need to restart it. +You can change Spark NLP configurations via Spark properties configuration. +Please check [these instructions](https://sparknlp.org/docs/en/install#sparknlp-properties) from our official documentation. ### S3 Integration @@ -991,302 +225,24 @@ In Spark NLP we can define S3 locations to: - Export log files of training models - Store tensorflow graphs used in `NerDLApproach` -**Logging:** - -To configure S3 path for logging while training models. We need to set up AWS credentials as well as an S3 path - -```bash -spark.conf.set("spark.jsl.settings.annotator.log_folder", "s3://my/s3/path/logs") -spark.conf.set("spark.jsl.settings.aws.credentials.access_key_id", "MY_KEY_ID") -spark.conf.set("spark.jsl.settings.aws.credentials.secret_access_key", "MY_SECRET_ACCESS_KEY") -spark.conf.set("spark.jsl.settings.aws.s3_bucket", "my.bucket") -spark.conf.set("spark.jsl.settings.aws.region", "my-region") -``` - -Now you can check the log on your S3 path defined in *spark.jsl.settings.annotator.log_folder* property. -Make sure to use the prefix *s3://*, otherwise it will use the default configuration. - -**Tensorflow Graphs:** - -To reference S3 location for downloading graphs. 
We need to set up AWS credentials - -```bash -spark.conf.set("spark.jsl.settings.aws.credentials.access_key_id", "MY_KEY_ID") -spark.conf.set("spark.jsl.settings.aws.credentials.secret_access_key", "MY_SECRET_ACCESS_KEY") -spark.conf.set("spark.jsl.settings.aws.region", "my-region") -``` - -**MFA Configuration:** - -In case your AWS account is configured with MFA. You will need first to get temporal credentials and add session token -to the configuration as shown in the examples below -For logging: - -```bash -spark.conf.set("spark.jsl.settings.aws.credentials.session_token", "MY_TOKEN") -``` - -An example of a bash script that gets temporal AWS credentials can be -found [here](https://github.com/JohnSnowLabs/spark-nlp/blob/master/scripts/aws_tmp_credentials.sh) -This script requires three arguments: - -```bash -./aws_tmp_credentials.sh iam_user duration serial_number -``` - -## Pipelines and Models - -### Pipelines - -**Quick example:** - -```scala -import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline -import com.johnsnowlabs.nlp.SparkNLP - -SparkNLP.version() - -val testData = spark.createDataFrame(Seq( - (1, "Google has announced the release of a beta version of the popular TensorFlow machine learning library"), - (2, "Donald John Trump (born June 14, 1946) is the 45th and current president of the United States") -)).toDF("id", "text") - -val pipeline = PretrainedPipeline("explain_document_dl", lang = "en") - -val annotation = pipeline.transform(testData) - -annotation.show() -/* -import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline -import com.johnsnowlabs.nlp.SparkNLP -2.5.0 -testData: org.apache.spark.sql.DataFrame = [id: int, text: string] -pipeline: com.johnsnowlabs.nlp.pretrained.PretrainedPipeline = PretrainedPipeline(explain_document_dl,en,public/models) -annotation: org.apache.spark.sql.DataFrame = [id: int, text: string ... 
10 more fields] -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -| id| text| document| token| sentence| checked| lemma| stem| pos| embeddings| ner| entities| -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -| 1|Google has announ...|[[document, 0, 10...|[[token, 0, 5, Go...|[[document, 0, 10...|[[token, 0, 5, Go...|[[token, 0, 5, Go...|[[token, 0, 5, go...|[[pos, 0, 5, NNP,...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 5, Go...| -| 2|The Paris metro w...|[[document, 0, 11...|[[token, 0, 2, Th...|[[document, 0, 11...|[[token, 0, 2, Th...|[[token, 0, 2, Th...|[[token, 0, 2, th...|[[pos, 0, 2, DT, ...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 4, 8, Pa...| -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -*/ - -annotation.select("entities.result").show(false) - -/* -+----------------------------------+ -|result | -+----------------------------------+ -|[Google, TensorFlow] | -|[Donald John Trump, United States]| -+----------------------------------+ -*/ -``` - -#### Showing Available Pipelines - -There are functions in Spark NLP that will list all the available Pipelines -of a particular language for you: - -```scala -import com.johnsnowlabs.nlp.pretrained.ResourceDownloader - -ResourceDownloader.showPublicPipelines(lang = "en") -/* -+--------------------------------------------+------+---------+ -| Pipeline | lang | version | -+--------------------------------------------+------+---------+ 
-| dependency_parse | en | 2.0.2 | -| analyze_sentiment_ml | en | 2.0.2 | -| check_spelling | en | 2.1.0 | -| match_datetime | en | 2.1.0 | - ... -| explain_document_ml | en | 3.1.3 | -+--------------------------------------------+------+---------+ -*/ -``` - -Or if we want to check for a particular version: - -```scala -import com.johnsnowlabs.nlp.pretrained.ResourceDownloader - -ResourceDownloader.showPublicPipelines(lang = "en", version = "3.1.0") -/* -+---------------------------------------+------+---------+ -| Pipeline | lang | version | -+---------------------------------------+------+---------+ -| dependency_parse | en | 2.0.2 | - ... -| clean_slang | en | 3.0.0 | -| clean_pattern | en | 3.0.0 | -| check_spelling | en | 3.0.0 | -| dependency_parse | en | 3.0.0 | -+---------------------------------------+------+---------+ -*/ -``` - -#### Please check out our Models Hub for the full list of [pre-trained pipelines](https://sparknlp.org/models) with examples, demos, benchmarks, and more - -### Models +Please check [these instructions](https://sparknlp.org/docs/en/install#s3-integration) from our official documentation. 
-**Some selected languages: -** `Afrikaans, Arabic, Armenian, Basque, Bengali, Breton, Bulgarian, Catalan, Czech, Dutch, English, Esperanto, Finnish, French, Galician, German, Greek, Hausa, Hebrew, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Latin, Latvian, Marathi, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Somali, Southern Sotho, Spanish, Swahili, Swedish, Tswana, Turkish, Ukrainian, Zulu` +## Documentation -**Quick online example:** - -```python -# load NER model trained by deep learning approach and GloVe word embeddings -ner_dl = NerDLModel.pretrained('ner_dl') -# load NER model trained by deep learning approach and BERT word embeddings -ner_bert = NerDLModel.pretrained('ner_dl_bert') -``` - -```scala -// load French POS tagger model trained by Universal Dependencies -val french_pos = PerceptronModel.pretrained("pos_ud_gsd", lang = "fr") -// load Italian LemmatizerModel -val italian_lemma = LemmatizerModel.pretrained("lemma_dxc", lang = "it") -```` - -**Quick offline example:** - -- Loading `PerceptronModel` annotator model inside Spark NLP Pipeline - -```scala -val french_pos = PerceptronModel.load("/tmp/pos_ud_gsd_fr_2.0.2_2.4_1556531457346/") - .setInputCols("document", "token") - .setOutputCol("pos") -``` - -#### Showing Available Models - -There are functions in Spark NLP that will list all the available Models -of a particular Annotator and language for you: - -```scala -import com.johnsnowlabs.nlp.pretrained.ResourceDownloader - -ResourceDownloader.showPublicModels(annotator = "NerDLModel", lang = "en") -/* -+---------------------------------------------+------+---------+ -| Model | lang | version | -+---------------------------------------------+------+---------+ -| onto_100 | en | 2.1.0 | -| onto_300 | en | 2.1.0 | -| ner_dl_bert | en | 2.2.0 | -| onto_100 | en | 2.4.0 | -| ner_conll_elmo | en | 3.2.2 | -+---------------------------------------------+------+---------+ -*/ -``` - -Or if we want to check 
for a particular version: - -```scala -import com.johnsnowlabs.nlp.pretrained.ResourceDownloader - -ResourceDownloader.showPublicModels(annotator = "NerDLModel", lang = "en", version = "3.1.0") -/* -+----------------------------+------+---------+ -| Model | lang | version | -+----------------------------+------+---------+ -| onto_100 | en | 2.1.0 | -| ner_aspect_based_sentiment | en | 2.6.2 | -| ner_weibo_glove_840B_300d | en | 2.6.2 | -| nerdl_atis_840b_300d | en | 2.7.1 | -| nerdl_snips_100d | en | 2.7.3 | -+----------------------------+------+---------+ -*/ -``` - -And to see a list of available annotators, you can use: - -```scala -import com.johnsnowlabs.nlp.pretrained.ResourceDownloader - -ResourceDownloader.showAvailableAnnotators() -/* -AlbertEmbeddings -AlbertForTokenClassification -AssertionDLModel -... -XlmRoBertaSentenceEmbeddings -XlnetEmbeddings -*/ -``` - -#### Please check out our Models Hub for the full list of [pre-trained models](https://sparknlp.org/models) with examples, demo, benchmark, and more - -## Offline - -Spark NLP library and all the pre-trained models/pipelines can be used entirely offline with no access to the Internet. -If you are behind a proxy or a firewall with no access to the Maven repository (to download packages) or/and no access -to S3 (to automatically download models and pipelines), you can simply follow the instructions to have Spark NLP without -any limitations offline: - -- Instead of using the Maven package, you need to load our Fat JAR -- Instead of using PretrainedPipeline for pretrained pipelines or the `.pretrained()` function to download pretrained - models, you will need to manually download your pipeline/model from [Models Hub](https://sparknlp.org/models), - extract it, and load it. 
- -Example of `SparkSession` with Fat JAR to have Spark NLP offline: - -```python -spark = SparkSession.builder - .appName("Spark NLP") - .master("local[*]") - .config("spark.driver.memory", "16G") - .config("spark.driver.maxResultSize", "0") - .config("spark.kryoserializer.buffer.max", "2000M") - .config("spark.jars", "/tmp/spark-nlp-assembly-5.4.0.jar") - .getOrCreate() -``` - -- You can download provided Fat JARs from each [release notes](https://github.com/JohnSnowLabs/spark-nlp/releases), - please pay attention to pick the one that suits your environment depending on the device (CPU/GPU) and Apache Spark - version (3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x) -- If you are local, you can load the Fat JAR from your local FileSystem, however, if you are in a cluster setup you need - to put the Fat JAR on a distributed FileSystem such as HDFS, DBFS, S3, etc. ( - i.e., `hdfs:///tmp/spark-nlp-assembly-5.4.0.jar`) - -Example of using pretrained Models and Pipelines in offline: - -```python -# instead of using pretrained() for online: -# french_pos = PerceptronModel.pretrained("pos_ud_gsd", lang="fr") -# you download this model, extract it, and use .load -french_pos = PerceptronModel.load("/tmp/pos_ud_gsd_fr_2.0.2_2.4_1556531457346/") - .setInputCols("document", "token") - .setOutputCol("pos") - -# example for pipelines -# instead of using PretrainedPipeline -# pipeline = PretrainedPipeline('explain_document_dl', lang='en') -# you download this pipeline, extract it, and use PipelineModel -PipelineModel.load("/tmp/explain_document_dl_en_2.0.2_2.4_1556530585689/") -``` - -- Since you are downloading and loading models/pipelines manually, this means Spark NLP is not downloading the most - recent and compatible models/pipelines for you. 
Choosing the right model/pipeline is on you -- If you are local, you can load the model/pipeline from your local FileSystem, however, if you are in a cluster setup - you need to put the model/pipeline on a distributed FileSystem such as HDFS, DBFS, S3, etc. ( - i.e., `hdfs:///tmp/explain_document_dl_en_2.0.2_2.4_1556530585689/`) - -## Examples +### Examples Need more **examples**? Check out our dedicated [Spark NLP Examples](https://github.com/JohnSnowLabs/spark-nlp/tree/master/examples) repository to showcase all Spark NLP use cases! Also, don't forget to check [Spark NLP in Action](https://sparknlp.org/demo) built by Streamlit. -### All examples: [spark-nlp/examples](https://github.com/JohnSnowLabs/spark-nlp/tree/master/examples) +#### All examples: [spark-nlp/examples](https://github.com/JohnSnowLabs/spark-nlp/tree/master/examples) -## FAQ +### FAQ [Check our Articles and Videos page here](https://sparknlp.org/learn) -## Citation +### Citation We have published a [paper](https://www.sciencedirect.com/science/article/pii/S2665963821000063) that you can cite for the Spark NLP library: @@ -1307,6 +263,15 @@ the Spark NLP library: } ``` +## Community support + +- [Slack](https://join.slack.com/t/spark-nlp/shared_invite/zt-198dipu77-L3UWNe_AJ8xqDk0ivmih5Q) For live discussion with the Spark NLP community and the team +- [GitHub](https://github.com/JohnSnowLabs/spark-nlp) Bug reports, feature requests, and contributions +- [Discussions](https://github.com/JohnSnowLabs/spark-nlp/discussions) Engage with other community members, share ideas, + and show off how you use Spark NLP! 
+- [Medium](https://medium.com/spark-nlp) Spark NLP articles +- [YouTube](https://www.youtube.com/channel/UCmFOjlpYEhxf_wJUDuz6xxQ/videos) Spark NLP video tutorials + ## Contributing We appreciate any sort of contributions: diff --git a/docs/_data/navigation.yml b/docs/_data/navigation.yml index 21b4f372614dd6..c6e75a2a846237 100755 --- a/docs/_data/navigation.yml +++ b/docs/_data/navigation.yml @@ -36,6 +36,12 @@ sparknlp: url: /docs/en/quickstart - title: Install Spark NLP url: /docs/en/install + - title: Advanced Settings + url: /docs/en/advanced_settings + - title: Features + url: /docs/en/features + - title: Pipelines and Models + url: /docs/en/pipelines - title: General Concepts url: /docs/en/concepts - title: Annotators diff --git a/docs/en/advanced_settings.md b/docs/en/advanced_settings.md new file mode 100644 index 00000000000000..84c8dc5751187e --- /dev/null +++ b/docs/en/advanced_settings.md @@ -0,0 +1,142 @@ +--- +layout: docs +header: true +seotitle: Spark NLP - Advanced Settings +title: Spark NLP - Advanced Settings +permalink: /docs/en/advanced_settings +key: docs-install +modify_date: "2024-07-04" +show_nav: true +sidebar: + nav: sparknlp +--- + +
+ +## SparkNLP Properties + +You can change the following Spark NLP configurations via Spark Configuration: + +| Property Name | Default | Meaning | +|---------------------------------------------------------|----------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| `spark.jsl.settings.pretrained.cache_folder` | `~/cache_pretrained` | The location to download and extract pretrained `Models` and `Pipelines`. By default, it will be in User's Home directory under `cache_pretrained` directory | +| `spark.jsl.settings.storage.cluster_tmp_dir` | `hadoop.tmp.dir` | The location to use on a cluster for temporary files such as unpacking indexes for WordEmbeddings. By default, this location is the value of `hadoop.tmp.dir` set via Hadoop configuration for Apache Spark. NOTE: `S3` is not supported and it must be local, HDFS, or DBFS | +| `spark.jsl.settings.annotator.log_folder` | `~/annotator_logs` | The location to save logs from annotators during training such as `NerDLApproach`, `ClassifierDLApproach`, `SentimentDLApproach`, `MultiClassifierDLApproach`, etc.
By default, it will be in User's Home directory under `annotator_logs` directory | +| `spark.jsl.settings.aws.credentials.access_key_id` | `None` | Your AWS access key to use your S3 bucket to store log files of training models or access tensorflow graphs used in `NerDLApproach` | +| `spark.jsl.settings.aws.credentials.secret_access_key` | `None` | Your AWS secret access key to use your S3 bucket to store log files of training models or access tensorflow graphs used in `NerDLApproach` | +| `spark.jsl.settings.aws.credentials.session_token` | `None` | Your AWS MFA session token to use your S3 bucket to store log files of training models or access tensorflow graphs used in `NerDLApproach` | +| `spark.jsl.settings.aws.s3_bucket` | `None` | Your AWS S3 bucket to store log files of training models or access tensorflow graphs used in `NerDLApproach` | +| `spark.jsl.settings.aws.region` | `None` | Your AWS region to use your S3 bucket to store log files of training models or access tensorflow graphs used in `NerDLApproach` | +| `spark.jsl.settings.onnx.gpuDeviceId` | `0` | Constructs CUDA execution provider options for the specified non-negative device id. | +| `spark.jsl.settings.onnx.intraOpNumThreads` | `6` | Sets the size of the CPU thread pool used for executing a single graph, if executing on a CPU. | +| `spark.jsl.settings.onnx.optimizationLevel` | `ALL_OPT` | Sets the optimization level of this options object, overriding the old setting. | +| `spark.jsl.settings.onnx.executionMode` | `SEQUENTIAL` | Sets the execution mode of this options object, overriding the old setting. | + +### How to set Spark NLP Configuration + +**SparkSession:** + +You can use `.config()` during SparkSession creation to set Spark NLP configurations. 
+ +```python +from pyspark.sql import SparkSession + +spark = SparkSession.builder \ + .master("local[*]") \ + .config("spark.driver.memory", "16G") \ + .config("spark.driver.maxResultSize", "0") \ + .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \ + .config("spark.kryoserializer.buffer.max", "2000m") \ + .config("spark.jsl.settings.pretrained.cache_folder", "sample_data/pretrained") \ + .config("spark.jsl.settings.storage.cluster_tmp_dir", "sample_data/storage") \ + .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0") \ + .getOrCreate() +``` + +**spark-shell:** + +```sh +spark-shell \ + --driver-memory 16g \ + --conf spark.driver.maxResultSize=0 \ + --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \ + --conf spark.kryoserializer.buffer.max=2000M \ + --conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \ + --conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \ + --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0 +``` + +**pyspark:** + +```sh +pyspark \ + --driver-memory 16g \ + --conf spark.driver.maxResultSize=0 \ + --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \ + --conf spark.kryoserializer.buffer.max=2000M \ + --conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \ + --conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \ + --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0 +``` + +**Databricks:** + +On a new cluster or existing one you need to add the following to the `Advanced Options -> Spark` tab: + +```bash +spark.kryoserializer.buffer.max 2000M +spark.serializer org.apache.spark.serializer.KryoSerializer +spark.jsl.settings.pretrained.cache_folder dbfs:/PATH_TO_CACHE +spark.jsl.settings.storage.cluster_tmp_dir dbfs:/PATH_TO_STORAGE +spark.jsl.settings.annotator.log_folder dbfs:/PATH_TO_LOGS +``` + +NOTE: If this is an existing cluster, after adding new configs or changing existing properties you need to
restart it. + + +### S3 Integration + +**Logging:** + +To configure an S3 path for logging while training models, we need to set up AWS credentials as well as an S3 path: + +```bash +spark.conf.set("spark.jsl.settings.annotator.log_folder", "s3://my/s3/path/logs") +spark.conf.set("spark.jsl.settings.aws.credentials.access_key_id", "MY_KEY_ID") +spark.conf.set("spark.jsl.settings.aws.credentials.secret_access_key", "MY_SECRET_ACCESS_KEY") +spark.conf.set("spark.jsl.settings.aws.s3_bucket", "my.bucket") +spark.conf.set("spark.jsl.settings.aws.region", "my-region") +``` + +Now you can check the logs on the S3 path defined in the *spark.jsl.settings.annotator.log_folder* property. +Make sure to use the prefix *s3://*, otherwise it will use the default configuration. + +**Tensorflow Graphs:** + +To reference an S3 location for downloading graphs, we need to set up AWS credentials: + +```bash +spark.conf.set("spark.jsl.settings.aws.credentials.access_key_id", "MY_KEY_ID") +spark.conf.set("spark.jsl.settings.aws.credentials.secret_access_key", "MY_SECRET_ACCESS_KEY") +spark.conf.set("spark.jsl.settings.aws.region", "my-region") +``` + +**MFA Configuration:** + +If your AWS account is configured with MFA, you will first need to get temporary credentials and add the session token +to the configuration, as shown in the example below. +For logging: + +```bash +spark.conf.set("spark.jsl.settings.aws.credentials.session_token", "MY_TOKEN") +``` + +An example of a bash script that gets temporary AWS credentials can be +found [here](https://github.com/JohnSnowLabs/spark-nlp/blob/master/scripts/aws_tmp_credentials.sh). +This script requires three arguments: + +```bash +./aws_tmp_credentials.sh iam_user duration serial_number +``` + +
\ No newline at end of file diff --git a/docs/en/features.md b/docs/en/features.md new file mode 100644 index 00000000000000..7bd0d06ef8d71a --- /dev/null +++ b/docs/en/features.md @@ -0,0 +1,120 @@ +--- +layout: docs +header: true +seotitle: Spark NLP - Features +title: Spark NLP - Features +permalink: /docs/en/features +key: docs-install +modify_date: "2024-07-03" +show_nav: true +sidebar: + nav: sparknlp +--- + + +
+ +## Text Preprocessing +- Tokenization +- Trainable Word Segmentation +- Stop Words Removal +- Token Normalizer +- Document Normalizer +- Document & Text Splitter +- Stemmer +- Lemmatizer +- NGrams +- Regex Matching +- Text Matching +- Spell Checker (ML and DL models) + +## Parsing and Analysis +- Chunking +- Date Matcher +- Sentence Detector +- Deep Sentence Detector (Deep learning) +- Dependency parsing (Labeled/unlabeled) +- SpanBertCorefModel (Coreference Resolution) +- Part-of-speech tagging +- Named entity recognition (Deep learning) +- Unsupervised keywords extraction +- Language Detection & Identification (up to 375 languages) + +## Sentiment and Classification +- Sentiment Detection (ML models) +- Multi-class & Multi-label Sentiment analysis (Deep learning) +- Multi-class Text Classification (Deep learning) +- Zero-Shot NER Model +- Zero-Shot Text Classification by Transformers (ZSL) + +## Embeddings +- Word Embeddings (GloVe and Word2Vec) +- Doc2Vec (based on Word2Vec) +- BERT Embeddings (TF Hub & HuggingFace models) +- DistilBERT Embeddings (HuggingFace models) +- CamemBERT Embeddings (HuggingFace models) +- RoBERTa Embeddings (HuggingFace models) +- DeBERTa Embeddings (HuggingFace v2 & v3 models) +- XLM-RoBERTa Embeddings (HuggingFace models) +- Longformer Embeddings (HuggingFace models) +- ALBERT Embeddings (TF Hub & HuggingFace models) +- XLNet Embeddings +- ELMO Embeddings (TF Hub models) +- Universal Sentence Encoder (TF Hub models) +- BERT Sentence Embeddings (TF Hub & HuggingFace models) +- RoBerta Sentence Embeddings (HuggingFace models) +- XLM-RoBerta Sentence Embeddings (HuggingFace models) +- INSTRUCTOR Embeddings (HuggingFace models) +- E5 Embeddings (HuggingFace models) +- MPNet Embeddings (HuggingFace models) +- UAE Embeddings (HuggingFace models) +- OpenAI Embeddings +- Sentence & Chunk Embeddings + +## Classification and Question Answering Models +- BERT for Token & Sequence Classification & Question Answering +- DistilBERT for Token & 
Sequence Classification & Question Answering +- CamemBERT for Token & Sequence Classification & Question Answering +- ALBERT for Token & Sequence Classification & Question Answering +- RoBERTa for Token & Sequence Classification & Question Answering +- DeBERTa for Token & Sequence Classification & Question Answering +- XLM-RoBERTa for Token & Sequence Classification & Question Answering +- Longformer for Token & Sequence Classification & Question Answering +- MPnet for Token & Sequence Classification & Question Answering +- XLNet for Token & Sequence Classification + +## Machine Translation and Generation +- Neural Machine Translation (MarianMT) +- Many-to-Many multilingual translation model (Facebook M2M100) +- Table Question Answering (TAPAS) +- Text-To-Text Transfer Transformer (Google T5) +- Generative Pre-trained Transformer 2 (OpenAI GPT2) +- Seq2Seq for NLG, Translation, and Comprehension (Facebook BART) +- Chat and Conversational LLMs (Facebook Llama-2) + +## Image and Speech +- Vision Transformer (Google ViT) +- Swin Image Classification (Microsoft Swin Transformer) +- ConvNext Image Classification (Facebook ConvNext) +- Vision Encoder Decoder for image-to-text like captioning +- Zero-Shot Image Classification by OpenAI's CLIP +- Automatic Speech Recognition (Wav2Vec2) +- Automatic Speech Recognition (HuBERT) +- Automatic Speech Recognition (OpenAI Whisper) + +## Integration and Interoperability +- Easy ONNX, OpenVINO, and TensorFlow integrations +- Full integration with Spark ML functions +- GPU Support + +## Pre-trained Models +- +31000 pre-trained models in +200 languages! +- +6000 pre-trained pipelines in +200 languages! 
+ +#### Please check out our Models Hub for the full list of [pre-trained models](https://sparknlp.org/models) with examples, demo, benchmark, and more + +## Multi-lingual Support +- Multi-lingual NER models: Arabic, Bengali, Chinese, Danish, Dutch, English, Finnish, French, German, Hebrew, Italian, + Japanese, Korean, Norwegian, Persian, Polish, Portuguese, Russian, Spanish, Swedish, Urdu, and more. + +
\ No newline at end of file diff --git a/docs/en/install.md b/docs/en/install.md index 4bc861a2c0d496..3d32683830df96 100644 --- a/docs/en/install.md +++ b/docs/en/install.md @@ -5,7 +5,7 @@ seotitle: Spark NLP - Installation title: Spark NLP - Installation permalink: /docs/en/install key: docs-install -modify_date: "2023-05-10" +modify_date: "2024-07-04" show_nav: true sidebar: nav: sparknlp @@ -35,6 +35,14 @@ spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0 spark-shell --jars spark-nlp-assembly-5.4.0.jar ``` +**GPU (optional):** + +Spark NLP 5.4.0 is built with ONNX 1.17.0 and TensorFlow 2.7.1 deep learning engines. The following NVIDIA® software, at these minimum versions, is required only for GPU support: + +- NVIDIA® GPU drivers version 450.80.02 or higher +- CUDA® Toolkit 11.2 +- cuDNN SDK 8.1.0 +
## Python @@ -95,15 +103,73 @@ spark = SparkSession.builder \ .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0") \ .getOrCreate() ``` +If using local jars, you can use `spark.jars` instead, with a comma-delimited list of jar files. For cluster setups, +you'll have to put the jars in a location reachable by all driver and executor nodes. + +### Python without explicit Pyspark installation + +### Pip/Conda + +If you installed pyspark through pip/conda, you can install `spark-nlp` through the same channel. + +Pip: + +```bash +pip install spark-nlp==5.4.0 +``` + +Conda: + +```bash +conda install -c johnsnowlabs spark-nlp +``` + +PyPI [spark-nlp package](https://pypi.org/project/spark-nlp/) / +Anaconda [spark-nlp package](https://anaconda.org/JohnSnowLabs/spark-nlp) + +Then you'll have to create a SparkSession from Spark NLP: + +```python +import sparknlp + +spark = sparknlp.start() +``` + +**Quick example:** + +```python +import sparknlp +from sparknlp.pretrained import PretrainedPipeline + +# create or get Spark Session + +spark = sparknlp.start() + +sparknlp.version() +spark.version + +# download, load and annotate a text by pre-trained pipeline + +pipeline = PretrainedPipeline('recognize_entities_dl', 'en') +result = pipeline.annotate('The Mona Lisa is a 16th century oil painting created by Leonardo') +```
## Scala and Java +Spark NLP has the following requirements: + +- Java 8 or 11 +- Apache Spark 3.5.x, 3.4.x, 3.3.x, 3.2.x, 3.1.x, 3.0.x + #### Maven **spark-nlp** on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x +The `spark-nlp` package has been published to +the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp). + ```xml @@ -240,6 +306,81 @@ as expected.
+ +## Command line + +Spark NLP supports all major releases of Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x. +These steps require an internet connection. + +#### Apache Spark 3.x (3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x - Scala 2.12) + +```sh +# CPU + +spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0 + +pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0 + +spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0 +``` + +The `spark-nlp` package has been published to +the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp). + +```sh +# GPU + +spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.4.0 + +pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.4.0 + +spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.4.0 + +``` + +The `spark-nlp-gpu` package has been published to +the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-gpu). + +```sh +# AArch64 + +spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.4.0 + +pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.4.0 + +spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.4.0 + +``` + +The `spark-nlp-aarch64` package has been published to +the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-aarch64). + +```sh +# M1/M2 (Apple Silicon) + +spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.4.0 + +pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.4.0 + +spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.4.0 + +``` + +The `spark-nlp-silicon` package has been published to +the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-silicon).
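The four command groups above differ only in the artifact name, so the `--packages` coordinate can be derived rather than copied by hand. The following is a minimal sketch, not part of Spark NLP itself: the helper names `maven_coordinate` and `launch_command` and the `ARTIFACTS` mapping are hypothetical, and the Scala version (2.12) and release (5.4.0) defaults are simply the values used in the commands above.

```python
import shlex

# Artifact names as they appear in the commands above.
# The mapping keys and helper names are hypothetical, chosen for illustration.
ARTIFACTS = {
    "cpu": "spark-nlp",
    "gpu": "spark-nlp-gpu",
    "aarch64": "spark-nlp-aarch64",
    "silicon": "spark-nlp-silicon",
}


def maven_coordinate(variant="cpu", scala="2.12", version="5.4.0"):
    """Return the group:artifact_scala:version coordinate for one build variant."""
    artifact = ARTIFACTS[variant]  # raises KeyError for an unknown variant
    return f"com.johnsnowlabs.nlp:{artifact}_{scala}:{version}"


def launch_command(tool="spark-shell", variant="cpu", conf=None):
    """Build the argument list for spark-shell, pyspark, or spark-submit."""
    args = [tool]
    # Each Spark property becomes its own --conf key=value pair.
    for key, value in (conf or {}).items():
        args += ["--conf", f"{key}={value}"]
    args += ["--packages", maven_coordinate(variant)]
    return args


if __name__ == "__main__":
    cmd = launch_command("pyspark", "gpu", {"spark.kryoserializer.buffer.max": "2000M"})
    # shlex.join (Python 3.8+) renders the list as a single shell-safe line.
    print(shlex.join(cmd))
```

Building the command as a list and joining it at the end avoids the multi-line continuation mistakes that hand-written shell snippets are prone to.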
+ +**NOTE**: In case you are using large pretrained models like UniversalSentenceEncoder, you need to have the following +set in your SparkSession: + +```sh +spark-shell \ + --driver-memory 16g \ + --conf spark.kryoserializer.buffer.max=2000M \ + --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0 +``` + +## Installation for M1 & M2 Chips + ### Scala and Java for M1 Adding Spark NLP to your Scala or Java project is easy: @@ -370,6 +511,258 @@ Run the following code in Kaggle Kernel and start using spark-nlp right away.
+## Apache Zeppelin + +Use either one of the following options: + +- Add the following Maven coordinates to the interpreter's library list: + +```bash +com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0 +``` + +- Add the path to a pre-built JAR from [here](#compiled-jars) to the interpreter's library list, making sure the JAR is + available on the driver path + +## Python in Zeppelin + +Apart from the previous step, install the Python module through pip: + +```bash +pip install spark-nlp==5.4.0 +``` + +Or you can install `spark-nlp` from inside Zeppelin by using Conda: + +```bash +python.conda install -c johnsnowlabs spark-nlp +``` + +Configure Zeppelin properly and use cells with %spark.pyspark or any interpreter name you chose. + +Finally, in the Zeppelin interpreter settings, make sure `zeppelin.python` is set to the Python you want to use and +install the pip library with (e.g. `python3`). + +An alternative option would be to set `SPARK_SUBMIT_OPTIONS` (zeppelin-env.sh) and make sure `--packages` is there as +shown earlier, since it covers both the Scala and Python side of the installation. + +## Jupyter Notebook + +**Recommended:** + +The easiest way to get this done on Linux and macOS is to simply install the `spark-nlp` and `pyspark` PyPI packages and +launch Jupyter from the same Python environment: + +```sh +$ conda create -n sparknlp python=3.8 -y +$ conda activate sparknlp +# spark-nlp by default is based on pyspark 3.x +$ pip install spark-nlp==5.4.0 pyspark==3.3.1 jupyter +$ jupyter notebook +``` + +Then you can use the `python3` kernel to run your code, creating a SparkSession via `spark = sparknlp.start()`. 
+ +**Optional:** + +If you are on a different operating system and need to run Jupyter Notebook through pyspark, you can follow +these steps: + +```bash +export SPARK_HOME=/path/to/your/spark/folder +export PYSPARK_PYTHON=python3 +export PYSPARK_DRIVER_PYTHON=jupyter +export PYSPARK_DRIVER_PYTHON_OPTS=notebook + +pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0 +``` + +Alternatively, you can combine the `--jars` option for pyspark with `pip install spark-nlp`. + +If you are not using pyspark at all, you'll have to follow the instructions +described [here](#python-without-explicit-pyspark-installation). + +## Databricks Cluster + +1. Create a cluster if you don't have one already + +2. On a new or existing cluster, you need to add the following to the `Advanced Options -> Spark` tab: + + ```bash + spark.kryoserializer.buffer.max 2000M + spark.serializer org.apache.spark.serializer.KryoSerializer + ``` + +3. In the `Libraries` tab inside your cluster you need to follow these steps: + + 3.1. Install New -> PyPI -> `spark-nlp==5.4.0` -> Install + + 3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0` -> Install + +4. Now you can attach your notebook to the cluster and use Spark NLP! + +NOTE: Databricks runtimes support different Apache Spark major releases. Please make sure you choose the correct Spark +NLP Maven package name (Maven Coordinate) for your runtime from +our [Packages Cheatsheet](https://github.com/JohnSnowLabs/spark-nlp#packages-cheatsheet) + +## EMR Cluster + +To launch EMR clusters with Apache Spark/PySpark and Spark NLP correctly you need a bootstrap script and a software +configuration. 
+ +A sample of your bootstrap script + +```.sh +#!/bin/bash +set -x -e + +echo -e 'export PYSPARK_PYTHON=/usr/bin/python3 +export HADOOP_CONF_DIR=/etc/hadoop/conf +export SPARK_JARS_DIR=/usr/lib/spark/jars +export SPARK_HOME=/usr/lib/spark' >> $HOME/.bashrc && source $HOME/.bashrc + +sudo python3 -m pip install awscli boto spark-nlp + +set +x +exit 0 + +``` + +A sample of your software configuration in JSON on S3 (must be public access): + +```.json +[{ + "Classification": "spark-env", + "Configurations": [{ + "Classification": "export", + "Properties": { + "PYSPARK_PYTHON": "/usr/bin/python3" + } + }] +}, +{ + "Classification": "spark-defaults", + "Properties": { + "spark.yarn.stagingDir": "hdfs:///tmp", + "spark.yarn.preserve.staging.files": "true", + "spark.kryoserializer.buffer.max": "2000M", + "spark.serializer": "org.apache.spark.serializer.KryoSerializer", + "spark.driver.maxResultSize": "0", + "spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0" + } +}] +``` + +A sample of AWS CLI to launch EMR cluster: + +```.sh +aws emr create-cluster \ +--name "Spark NLP 5.4.0" \ +--release-label emr-6.2.0 \ +--applications Name=Hadoop Name=Spark Name=Hive \ +--instance-type m4.4xlarge \ +--instance-count 3 \ +--use-default-roles \ +--log-uri "s3:///" \ +--bootstrap-actions Path=s3:///emr-bootstrap.sh,Name=custome \ +--configurations "https:///sparknlp-config.json" \ +--ec2-attributes KeyName=,EmrManagedMasterSecurityGroup=,EmrManagedSlaveSecurityGroup= \ +--profile +``` + +## GCP Dataproc + +1. Create a cluster if you don't have one already as follows. 
+ +In the gcloud shell: + +```bash +gcloud services enable dataproc.googleapis.com \ + compute.googleapis.com \ + storage-component.googleapis.com \ + bigquery.googleapis.com \ + bigquerystorage.googleapis.com +``` + +```bash +REGION= +``` + +```bash +BUCKET_NAME= +gsutil mb -c standard -l ${REGION} gs://${BUCKET_NAME} +``` + +```bash +REGION= +ZONE= +CLUSTER_NAME= +BUCKET_NAME= +``` + +You can set image-version, master-machine-type, worker-machine-type, +master-boot-disk-size, worker-boot-disk-size, and num-workers according to your needs. +If you use an image-version older than 2.0, you should also add ANACONDA to optional-components. +You should also enable the component gateway. +Don't forget to set the Maven coordinates for the JAR in properties. + +```bash +gcloud dataproc clusters create ${CLUSTER_NAME} \ + --region=${REGION} \ + --zone=${ZONE} \ + --image-version=2.0 \ + --master-machine-type=n1-standard-4 \ + --worker-machine-type=n1-standard-2 \ + --master-boot-disk-size=128GB \ + --worker-boot-disk-size=128GB \ + --num-workers=2 \ + --bucket=${BUCKET_NAME} \ + --optional-components=JUPYTER \ + --enable-component-gateway \ + --metadata 'PIP_PACKAGES=spark-nlp spark-nlp-display google-cloud-bigquery google-cloud-storage' \ + --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh \ + --properties spark:spark.serializer=org.apache.spark.serializer.KryoSerializer,spark:spark.driver.maxResultSize=0,spark:spark.kryoserializer.buffer.max=2000M,spark:spark.jars.packages=com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0 +``` + +2. On an existing cluster, you need to install the `spark-nlp` and `spark-nlp-display` packages from PyPI. + +3. Now you can attach your notebook to the cluster and use Spark NLP! 
+ + +## Apache Spark Support + +Spark NLP *5.4.0* has been built on top of Apache Spark 3.4 while fully supporting Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x + +| Spark NLP | Apache Spark 3.5.x | Apache Spark 3.4.x | Apache Spark 3.3.x | Apache Spark 3.2.x | Apache Spark 3.1.x | Apache Spark 3.0.x | Apache Spark 2.4.x | Apache Spark 2.3.x | +|-----------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------| +| 5.4.x | YES | YES | YES | YES | YES | YES | NO | NO | +| 5.3.x | YES | YES | YES | YES | YES | YES | NO | NO | +| 5.2.x | YES | YES | YES | YES | YES | YES | NO | NO | +| 5.1.x | Partially | YES | YES | YES | YES | YES | NO | NO | +| 5.0.x | YES | YES | YES | YES | YES | YES | NO | NO | +| 4.4.x | YES | YES | YES | YES | YES | YES | NO | NO | +| 4.3.x | NO | NO | YES | YES | YES | YES | NO | NO | +| 4.2.x | NO | NO | YES | YES | YES | YES | NO | NO | +| 4.1.x | NO | NO | YES | YES | YES | YES | NO | NO | +| 4.0.x | NO | NO | YES | YES | YES | YES | NO | NO | + +Find out more about `Spark NLP` versions from our [release notes](https://github.com/JohnSnowLabs/spark-nlp/releases). 
+ +## Scala and Python Support + +| Spark NLP | Python 3.6 | Python 3.7 | Python 3.8 | Python 3.9 | Python 3.10| Scala 2.11 | Scala 2.12 | +|-----------|------------|------------|------------|------------|------------|------------|------------| +| 5.3.x | NO | YES | YES | YES | YES | NO | YES | +| 5.2.x | NO | YES | YES | YES | YES | NO | YES | +| 5.1.x | NO | YES | YES | YES | YES | NO | YES | +| 5.0.x | NO | YES | YES | YES | YES | NO | YES | +| 4.4.x | NO | YES | YES | YES | YES | NO | YES | +| 4.3.x | YES | YES | YES | YES | YES | NO | YES | +| 4.2.x | YES | YES | YES | YES | YES | NO | YES | +| 4.1.x | YES | YES | YES | YES | NO | NO | YES | +| 4.0.x | YES | YES | YES | YES | NO | NO | YES | + + ## Databricks Support Spark NLP 5.4.0 has been tested and is compatible with the following runtimes: @@ -867,4 +1260,44 @@ PipelineModel.load("/tmp/explain_document_dl_en_2.0.2_2.4_1556530585689/") - Since you are downloading and loading models/pipelines manually, this means Spark NLP is not downloading the most recent and compatible models/pipelines for you. Choosing the right model/pipeline is on you - If you are local, you can load the model/pipeline from your local FileSystem, however, if you are in a cluster setup you need to put the model/pipeline on a distributed FileSystem such as HDFS, DBFS, S3, etc. 
(i.e., `hdfs:///tmp/explain_document_dl_en_2.0.2_2.4_1556530585689/`) + +## Compiled JARs + +### Build from source + +#### spark-nlp + +- FAT-JAR for CPU on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x + +```bash +sbt assembly +``` + +- FAT-JAR for GPU on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x + +```bash +sbt -Dis_gpu=true assembly +``` + +- FAT-JAR for M1 
 on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x + +```bash +sbt -Dis_silicon=true assembly +``` + +### Using the jar manually + +If for some reason you need to use the JAR, you can either download the Fat JARs provided here or download them +from [Maven Central](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp). + +To add JARs to Spark programs, use the `--jars` option: + +```sh +spark-shell --jars spark-nlp.jar +``` + +The preferred way to use the library when running Spark programs is the `--packages` option, as specified in +the `spark-packages` section. + +
diff --git a/docs/en/pipelines.md b/docs/en/pipelines.md index 43728d43863270..0204f8c62b88f9 100644 --- a/docs/en/pipelines.md +++ b/docs/en/pipelines.md @@ -5,7 +5,7 @@ seotitle: Spark NLP - Pipelines title: Spark NLP - Pipelines permalink: /docs/en/pipelines key: docs-pipelines -modify_date: "2021-11-20" +modify_date: "2024-07-04" show_nav: true sidebar: nav: sparknlp @@ -13,96 +13,24 @@ sidebar:
-Pretrained Pipelines have moved to Models Hub. -Please follow this link for the updated list of all models and pipelines: -[Models Hub](https://sparknlp.org/models) -{:.success} - -
- -## English - -**NOTE:** -`noncontrib` pipelines are compatible with `Windows` operating systems. - -{:.table-model-big} -| Pipelines | Name | -| -------------------- | ---------------------- | -| [Explain Document ML](#explaindocumentml) | `explain_document_ml` -| [Explain Document DL](#explaindocumentdl) | `explain_document_dl` -| [Explain Document DL Win]() | `explain_document_dl_noncontrib` -| Explain Document DL Fast | `explain_document_dl_fast` -| Explain Document DL Fast Win | `explain_document_dl_fast_noncontrib` | -| [Recognize Entities DL](#recognizeentitiesdl) | `recognize_entities_dl` | -| Recognize Entities DL Win | `recognize_entities_dl_noncontrib` | -| [OntoNotes Entities Small](#ontorecognizeentitiessm) | `onto_recognize_entities_sm` | -| [OntoNotes Entities Large](#ontorecognizeentitieslg) | `onto_recognize_entities_lg` | -| [Match Datetime](#matchdatetime) | `match_datetime` | -| [Match Pattern](#matchpattern) | `match_pattern` | -| [Match Chunk](#matchchunks) | `match_chunks` | -| Match Phrases | `match_phrases`| -| Clean Stop | `clean_stop`| -| Clean Pattern | `clean_pattern`| -| Clean Slang | `clean_slang`| -| Check Spelling | `check_spelling`| -| Analyze Sentiment | `analyze_sentiment` | -| Analyze Sentiment DL | `analyze_sentimentdl_use_imdb` | -| Analyze Sentiment DL | `analyze_sentimentdl_use_twitter` | -| Dependency Parse | `dependency_parse` | - -
- -### explain_document_ml - -{% highlight scala %} -import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline -import com.johnsnowlabs.nlp.SparkNLP - -SparkNLP.version() - -val testData = spark.createDataFrame(Seq( -(1, "Google has announced the release of a beta version of the popular TensorFlow machine learning library"), -(2, "The Paris metro will soon enter the 21st century, ditching single-use paper tickets for rechargeable electronic cards.") -)).toDF("id", "text") +## Pipelines and Models -val pipeline = PretrainedPipeline("explain_document_ml", lang="en") +### Pipelines -val annotation = pipeline.transform(testData) - -annotation.show() - -/* -2.0.8 -testData: org.apache.spark.sql.DataFrame = [id: int, text: string] -pipeline: com.johnsnowlabs.nlp.pretrained.PretrainedPipeline = PretrainedPipeline(explain_document_ml,en,public/models) -annotation: org.apache.spark.sql.DataFrame = [id: int, text: string ... 7 more fields] -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -| id| text| document| sentence| token| checked| lemmas| stems| pos| -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -| 1|Google has announ...|[[document, 0, 10...|[[document, 0, 10...|[[token, 0, 5, Go...|[[token, 0, 5, Go...|[[token, 0, 5, Go...|[[token, 0, 5, go...|[[pos, 0, 5, NNP,...| -| 2|The Paris metro w...|[[document, 0, 11...|[[document, 0, 11...|[[token, 0, 2, Th...|[[token, 0, 2, Th...|[[token, 0, 2, Th...|[[token, 0, 2, th...|[[pos, 0, 2, DT, ...| -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -*/ - -{% endhighlight %} - -
- -### explain_document_dl - -{% highlight scala %} +**Quick example:** +```scala import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline import com.johnsnowlabs.nlp.SparkNLP SparkNLP.version() val testData = spark.createDataFrame(Seq( -(1, "Google has announced the release of a beta version of the popular TensorFlow machine learning library"), -(2, "Donald John Trump (born June 14, 1946) is the 45th and current president of the United States") + (1, "Google has announced the release of a beta version of the popular TensorFlow machine learning library"), + (2, "Donald John Trump (born June 14, 1946) is the 45th and current president of the United States") )).toDF("id", "text") -val pipeline = PretrainedPipeline("explain_document_dl", lang="en") +val pipeline = PretrainedPipeline("explain_document_dl", lang = "en") val annotation = pipeline.transform(testData) @@ -110,7 +38,7 @@ annotation.show() /* import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline import com.johnsnowlabs.nlp.SparkNLP -2.0.8 +2.5.0 testData: org.apache.spark.sql.DataFrame = [id: int, text: string] pipeline: com.johnsnowlabs.nlp.pretrained.PretrainedPipeline = PretrainedPipeline(explain_document_dl,en,public/models) annotation: org.apache.spark.sql.DataFrame = [id: int, text: string ... 10 more fields] @@ -132,888 +60,141 @@ annotation.select("entities.result").show(false) |[Donald John Trump, United States]| +----------------------------------+ */ +``` -{% endhighlight %} +#### Showing Available Pipelines -
+There are functions in Spark NLP that will list all the available Pipelines +of a particular language for you: -### recognize_entities_dl - -{% highlight scala %} - -import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline -import com.johnsnowlabs.nlp.SparkNLP - -SparkNLP.version() - -val testData = spark.createDataFrame(Seq( -(1, "Google has announced the release of a beta version of the popular TensorFlow machine learning library"), -(2, "Donald John Trump (born June 14, 1946) is the 45th and current president of the United States") -)).toDF("id", "text") - -val pipeline = PretrainedPipeline("recognize_entities_dl", lang="en") - -val annotation = pipeline.transform(testData) - -annotation.show() - -/* -import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline -import com.johnsnowlabs.nlp.SparkNLP -2.0.8 -testData: org.apache.spark.sql.DataFrame = [id: int, text: string] -pipeline: com.johnsnowlabs.nlp.pretrained.PretrainedPipeline = PretrainedPipeline(entity_recognizer_dl,en,public/models) -annotation: org.apache.spark.sql.DataFrame = [id: int, text: string ... 
6 more fields] -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -| id| text| document| sentence| token| embeddings| ner| ner_converter| -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -| 1|Google has announ...|[[document, 0, 10...|[[document, 0, 10...|[[token, 0, 5, Go...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 5, Go...| -| 2|Donald John Trump...|[[document, 0, 92...|[[document, 0, 92...|[[token, 0, 5, Do...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 16, D...| -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -*/ - -annotation.select("entities.result").show(false) +```scala +import com.johnsnowlabs.nlp.pretrained.ResourceDownloader +ResourceDownloader.showPublicPipelines(lang = "en") /* -+----------------------------------+ -|result | -+----------------------------------+ -|[Google, TensorFlow] | -|[Donald John Trump, United States]| -+----------------------------------+ ++--------------------------------------------+------+---------+ +| Pipeline | lang | version | ++--------------------------------------------+------+---------+ +| dependency_parse | en | 2.0.2 | +| analyze_sentiment_ml | en | 2.0.2 | +| check_spelling | en | 2.1.0 | +| match_datetime | en | 2.1.0 | + ... +| explain_document_ml | en | 3.1.3 | ++--------------------------------------------+------+---------+ */ +``` -{% endhighlight %} - -
+Or if we want to check for a particular version: -### onto_recognize_entities_sm - -Trained by **NerDLApproach** annotator with **Char CNNs - BiLSTM - CRF** and **GloVe Embeddings** on the **OntoNotes** corpus and supports the identification of 18 entities. - -{% highlight scala %} - -import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline -import com.johnsnowlabs.nlp.SparkNLP - -SparkNLP.version() - -val testData = spark.createDataFrame(Seq( -(1, "Johnson first entered politics when elected in 2001 as a member of Parliament. He then served eight years as the mayor of London, from 2008 to 2016, before rejoining Parliament. "), -(2, "A little less than a decade later, dozens of self-driving startups have cropped up while automakers around the world clamor, wallet in hand, to secure their place in the fast-moving world of fully automated transportation.") -)).toDF("id", "text") - -val pipeline = PretrainedPipeline("onto_recognize_entities_sm", lang="en") - -val annotation = pipeline.transform(testData) - -annotation.show() - -/* -import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline -import com.johnsnowlabs.nlp.SparkNLP -2.1.0 -testData: org.apache.spark.sql.DataFrame = [id: int, text: string] -pipeline: com.johnsnowlabs.nlp.pretrained.PretrainedPipeline = PretrainedPipeline(onto_recognize_entities_sm,en,public/models) -annotation: org.apache.spark.sql.DataFrame = [id: int, text: string ... 
6 more fields] -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -| id| text| document| token| embeddings| ner| entities| -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -| 1|Johnson first ent...|[[document, 0, 17...|[[token, 0, 6, Jo...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 6, Jo...| -| 2|A little less tha...|[[document, 0, 22...|[[token, 0, 0, A,...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 32, A...| -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -*/ - -annotation.select("entities.result").show(false) +```scala +import com.johnsnowlabs.nlp.pretrained.ResourceDownloader +ResourceDownloader.showPublicPipelines(lang = "en", version = "3.1.0") /* -+---------------------------------------------------------------------------------+ -|result | -+---------------------------------------------------------------------------------+ -|[Johnson, first, 2001, Parliament, eight years, London, 2008 to 2016, Parliament]| -|[A little less than a decade later, dozens] | -+---------------------------------------------------------------------------------+ ++---------------------------------------+------+---------+ +| Pipeline | lang | version | ++---------------------------------------+------+---------+ +| dependency_parse | en | 2.0.2 | + ... +| clean_slang | en | 3.0.0 | +| clean_pattern | en | 3.0.0 | +| check_spelling | en | 3.0.0 | +| dependency_parse | en | 3.0.0 | ++---------------------------------------+------+---------+ */ +``` -{% endhighlight %} +#### Please check out our Models Hub for the full list of [pre-trained pipelines](https://sparknlp.org/models) with examples, demos, benchmarks, and more -
+### Models -### onto_recognize_entities_lg +**Some selected languages:** +`Afrikaans, Arabic, Armenian, Basque, Bengali, Breton, Bulgarian, Catalan, Czech, Dutch, English, Esperanto, Finnish, French, Galician, German, Greek, Hausa, Hebrew, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Latin, Latvian, Marathi, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Somali, Southern Sotho, Spanish, Swahili, Swedish, Tswana, Turkish, Ukrainian, Zulu` -Trained by **NerDLApproach** annotator with **Char CNNs - BiLSTM - CRF** and **GloVe Embeddings** on the **OntoNotes** corpus and supports the identification of 18 entities. +**Quick online example:** -{% highlight scala %} +```python +# load NER model trained by deep learning approach and GloVe word embeddings +ner_dl = NerDLModel.pretrained('ner_dl') +# load NER model trained by deep learning approach and BERT word embeddings +ner_bert = NerDLModel.pretrained('ner_dl_bert') +``` -import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline -import com.johnsnowlabs.nlp.SparkNLP +```scala +// load French POS tagger model trained on Universal Dependencies +val french_pos = PerceptronModel.pretrained("pos_ud_gsd", lang = "fr") +// load Italian LemmatizerModel +val italian_lemma = LemmatizerModel.pretrained("lemma_dxc", lang = "it") +``` -SparkNLP.version() +**Quick offline example:** -val testData = spark.createDataFrame(Seq( -(1, "Johnson first entered politics when elected in 2001 as a member of Parliament. He then served eight years as the mayor of London, from 2008 to 2016, before rejoining Parliament. 
"), -(2, "A little less than a decade later, dozens of self-driving startups have cropped up while automakers around the world clamor, wallet in hand, to secure their place in the fast-moving world of fully automated transportation.") -)).toDF("id", "text") - -val pipeline = PretrainedPipeline("onto_recognize_entities_lg", lang="en") +- Loading `PerceptronModel` annotator model inside Spark NLP Pipeline -val annotation = pipeline.transform(testData) +```scala +val french_pos = PerceptronModel.load("/tmp/pos_ud_gsd_fr_2.0.2_2.4_1556531457346/") + .setInputCols("document", "token") + .setOutputCol("pos") +``` -annotation.show() +#### Showing Available Models -/* -import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline -import com.johnsnowlabs.nlp.SparkNLP -2.1.0 -testData: org.apache.spark.sql.DataFrame = [id: int, text: string] -pipeline: com.johnsnowlabs.nlp.pretrained.PretrainedPipeline = PretrainedPipeline(onto_recognize_entities_lg,en,public/models) -annotation: org.apache.spark.sql.DataFrame = [id: int, text: string ... 
6 more fields] -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -| id| text| document| token| embeddings| ner| entities| -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -| 1|Johnson first ent...|[[document, 0, 17...|[[token, 0, 6, Jo...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 6, Jo...| -| 2|A little less tha...|[[document, 0, 22...|[[token, 0, 0, A,...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 32, A...| -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -*/ +There are functions in Spark NLP that will list all the available Models +of a particular Annotator and language for you: -annotation.select("entities.result").show(false) +```scala +import com.johnsnowlabs.nlp.pretrained.ResourceDownloader +ResourceDownloader.showPublicModels(annotator = "NerDLModel", lang = "en") /* -+-------------------------------------------------------------------------------+ -|result | -+-------------------------------------------------------------------------------+ -|[Johnson, first, 2001, Parliament, eight years, London, 2008, 2016, Parliament]| -|[A little less than a decade later, dozens] | -+-------------------------------------------------------------------------------+ ++---------------------------------------------+------+---------+ +| Model | lang | version | ++---------------------------------------------+------+---------+ +| onto_100 | en | 2.1.0 | +| onto_300 | en | 2.1.0 | +| ner_dl_bert | en | 2.2.0 | +| onto_100 | en | 2.4.0 | +| ner_conll_elmo | en | 3.2.2 | ++---------------------------------------------+------+---------+ */ +``` -{% endhighlight %} - -
- -### match_datetime - -#### DateMatcher yyyy/MM/dd - -{% highlight scala %} - -import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline -import com.johnsnowlabs.nlp.SparkNLP - -SparkNLP.version() - -val testData = spark.createDataFrame(Seq( -(1, "I would like to come over and see you in 01/02/2019."), -(2, "Donald John Trump (born June 14, 1946) is the 45th and current president of the United States") -)).toDF("id", "text") +Or if we want to check for a particular version: -val pipeline = PretrainedPipeline("match_datetime", lang="en") - -val annotation = pipeline.transform(testData) - -annotation.show() +```scala +import com.johnsnowlabs.nlp.pretrained.ResourceDownloader +ResourceDownloader.showPublicModels(annotator = "NerDLModel", lang = "en", version = "3.1.0") /* -import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline -import com.johnsnowlabs.nlp.SparkNLP -2.0.8 -testData: org.apache.spark.sql.DataFrame = [id: int, text: string] -pipeline: com.johnsnowlabs.nlp.pretrained.PretrainedPipeline = PretrainedPipeline(match_datetime,en,public/models) -annotation: org.apache.spark.sql.DataFrame = [id: int, text: string ... 
4 more fields] -+---+--------------------+--------------------+--------------------+--------------------+--------------------+ -| id| text| document| sentence| token| date| -+---+--------------------+--------------------+--------------------+--------------------+--------------------+ -| 1|I would like to c...|[[document, 0, 51...|[[document, 0, 51...|[[token, 0, 0, I,...|[[date, 41, 50, 2...| -| 2|Donald John Trump...|[[document, 0, 92...|[[document, 0, 92...|[[token, 0, 5, Do...|[[date, 24, 36, 1...| -+---+--------------------+--------------------+--------------------+--------------------+--------------------+ ++----------------------------+------+---------+ +| Model | lang | version | ++----------------------------+------+---------+ +| onto_100 | en | 2.1.0 | +| ner_aspect_based_sentiment | en | 2.6.2 | +| ner_weibo_glove_840B_300d | en | 2.6.2 | +| nerdl_atis_840b_300d | en | 2.7.1 | +| nerdl_snips_100d | en | 2.7.3 | ++----------------------------+------+---------+ */ +``` -annotation.select("date.result").show(false) +And to see a list of available annotators, you can use: -/* -+------------+ -|result | -+------------+ -|[2019/01/02]| -|[1946/06/14]| -+------------+ -*/ - -{% endhighlight %} - -
- -### match_pattern - -RegexMatcher (match phone numbers) - -{% highlight scala %} - -import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline -import com.johnsnowlabs.nlp.SparkNLP - -SparkNLP.version() - -val testData = spark.createDataFrame(Seq( -(1, "You should call Mr. Jon Doe at +33 1 79 01 22 89") -)).toDF("id", "text") - -val pipeline = PretrainedPipeline("match_pattern", lang="en") - -val annotation = pipeline.transform(testData) - -annotation.show() - -/* -import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline -import com.johnsnowlabs.nlp.SparkNLP -2.0.8 -testData: org.apache.spark.sql.DataFrame = [id: int, text: string] -pipeline: com.johnsnowlabs.nlp.pretrained.PretrainedPipeline = PretrainedPipeline(match_pattern,en,public/models) -annotation: org.apache.spark.sql.DataFrame = [id: int, text: string ... 4 more fields] -+---+--------------------+--------------------+--------------------+--------------------+--------------------+ -| id| text| document| sentence| token| regex| -+---+--------------------+--------------------+--------------------+--------------------+--------------------+ -| 1|You should call M...|[[document, 0, 47...|[[document, 0, 47...|[[token, 0, 2, Yo...|[[chunk, 31, 47, ...| -+---+--------------------+--------------------+--------------------+--------------------+--------------------+ -*/ - -annotation.select("regex.result").show(false) - -/* -+-------------------+ -|result | -+-------------------+ -|[+33 1 79 01 22 89]| -+-------------------+ -*/ - -{% endhighlight %} - -
- -### match_chunks - -The pipeline uses regex `
?/*+` - -{% highlight scala %} - -import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline -import com.johnsnowlabs.nlp.SparkNLP - -SparkNLP.version() - -val testData = spark.createDataFrame(Seq( -(1, "The book has many chapters"), -(2, "the little yellow dog barked at the cat") -)).toDF("id", "text") - -val pipeline = PretrainedPipeline("match_chunks", lang="en") - -val annotation = pipeline.transform(testData) - -annotation.show() - -/* -import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline -import com.johnsnowlabs.nlp.SparkNLP -2.0.8 -testData: org.apache.spark.sql.DataFrame = [id: int, text: string] -pipeline: com.johnsnowlabs.nlp.pretrained.PretrainedPipeline = PretrainedPipeline(match_chunks,en,public/models) -annotation: org.apache.spark.sql.DataFrame = [id: int, text: string ... 5 more fields] -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -| id| text| document| sentence| token| pos| chunk| -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -| 1|The book has many...|[[document, 0, 25...|[[document, 0, 25...|[[token, 0, 2, Th...|[[pos, 0, 2, DT, ...|[[chunk, 0, 7, Th...| -| 2|the little yellow...|[[document, 0, 38...|[[document, 0, 38...|[[token, 0, 2, th...|[[pos, 0, 2, DT, ...|[[chunk, 0, 20, t...| -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -*/ - -annotation.select("chunk.result").show(false) - -/* -+--------------------------------+ -|result | -+--------------------------------+ -|[The book] | -|[the little yellow dog, the cat]| -+--------------------------------+ -*/ - -{% endhighlight %} - -
- -## French - -{:.table-model-big} -| Pipelines | Name | -| ----------------------- | --------------------- | -| [Explain Document Large](#french-explain_document_lg) | `explain_document_lg` | -| [Explain Document Medium](#french-explain_document_md) | `explain_document_md` | -| [Entity Recognizer Large](#french-entity_recognizer_lg) | `entity_recognizer_lg` | -| [Entity Recognizer Medium](#french-entity_recognizer_md) | `entity_recognizer_md` | - -{:.table-model-big} -|Feature | Description| -|---|----| -|**NER**|Trained by **NerDLApproach** annotator with **Char CNNs - BiLSTM - CRF** and **GloVe Embeddings** on the **WikiNER** corpus and supports the identification of `PER`, `LOC`, `ORG` and `MISC` entities -|**Lemma**|Trained by **Lemmatizer** annotator on **lemmatization-lists** by `Michal Měchura` -|**POS**| Trained by **PerceptronApproach** annotator on the [Universal Dependencies](https://universaldependencies.org/treebanks/fr_gsd/index.html) -|**Size**| Model size indicator, **md** and **lg**. The large pipeline uses **glove_840B_300** and the medium uses **glove_6B_300** WordEmbeddings - -
-
-### French explain_document_lg
-
-{% highlight scala %}
-
-import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
-import com.johnsnowlabs.nlp.SparkNLP
-
-SparkNLP.version()
-
-val pipeline = PretrainedPipeline("explain_document_lg", lang="fr")
-
-val testData = spark.createDataFrame(Seq(
-(1, "Contrairement à Quentin Tarantino, le cinéma français ne repart pas les mains vides de la compétition cannoise."),
-(2, "Emmanuel Jean-Michel Frédéric Macron est le fils de Jean-Michel Macron, né en 1950, médecin, professeur de neurologie au CHU d'Amiens4 et responsable d'enseignement à la faculté de médecine de cette même ville5, et de Françoise Noguès, médecin conseil à la Sécurité sociale")
-)).toDF("id", "text")
-
-val annotation = pipeline.transform(testData)
-
-annotation.show()
+```scala
+import com.johnsnowlabs.nlp.pretrained.ResourceDownloader
+ResourceDownloader.showAvailableAnnotators()
 /*
-import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
-import com.johnsnowlabs.nlp.SparkNLP
-2.0.8
-pipeline: com.johnsnowlabs.nlp.pretrained.PretrainedPipeline = PretrainedPipeline(explain_document_lg,fr,public/models)
-testData: org.apache.spark.sql.DataFrame = [id: bigint, text: string]
-annotation: org.apache.spark.sql.DataFrame = [id: bigint, text: string ... 8 more fields]
-+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
-| id| text| document| token| sentence| lemma| pos| embeddings| ner| entities|
-+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
-| 0|Contrairement à Q...|[[document, 0, 11...|[[token, 0, 12, C...|[[document, 0, 11...|[[token, 0, 12, C...|[[pos, 0, 12, ADV...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 16, 32, ...|
-| 1|Emmanuel Jean-Mic...|[[document, 0, 27...|[[token, 0, 7, Em...|[[document, 0, 27...|[[token, 0, 7, Em...|[[pos, 0, 7, PROP...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 35, E...|
-+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
+AlbertEmbeddings
+AlbertForTokenClassification
+AssertionDLModel
+...
+XlmRoBertaSentenceEmbeddings
+XlnetEmbeddings
 */
+```
-annotation.select("entities.result").show(false)
-
-/*+-------------------------------------------------------------------------------------------------------------+
-|result |
-+-------------------------------------------------------------------------------------------------------------+
-|[Quentin Tarantino] |
-|[Emmanuel Jean-Michel Frédéric Macron, Jean-Michel Macron, CHU d'Amiens4, Françoise Noguès, Sécurité sociale]|
-+-------------------------------------------------------------------------------------------------------------+
-*/
-
-{% endhighlight %}
-
-
- -### French explain_document_md - -{% highlight scala %} - -import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline -import com.johnsnowlabs.nlp.SparkNLP - -SparkNLP.version() - -val pipeline = PretrainedPipeline("explain_document_md", lang="fr") - -val testData = spark.createDataFrame(Seq( -(1, "Contrairement à Quentin Tarantino, le cinéma français ne repart pas les mains vides de la compétition cannoise."), -(2, "Emmanuel Jean-Michel Frédéric Macron est le fils de Jean-Michel Macron, né en 1950, médecin, professeur de neurologie au CHU d'Amiens4 et responsable d'enseignement à la faculté de médecine de cette même ville5, et de Françoise Noguès, médecin conseil à la Sécurité sociale") -)).toDF("id", "text") - -val annotation = pipeline.transform(testData) - -annotation.show() - -/* -import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline -import com.johnsnowlabs.nlp.SparkNLP -2.0.8 -pipeline: com.johnsnowlabs.nlp.pretrained.PretrainedPipeline = PretrainedPipeline(explain_document_md,fr,public/models) -testData: org.apache.spark.sql.DataFrame = [id: bigint, text: string] -annotation: org.apache.spark.sql.DataFrame = [id: bigint, text: string ... 
8 more fields] -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -| id| text| document| token| sentence| lemma| pos| embeddings| ner| entities| -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -| 0|Contrairement à Q...|[[document, 0, 11...|[[token, 0, 12, C...|[[document, 0, 11...|[[token, 0, 12, C...|[[pos, 0, 12, ADV...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 16, 32, ...| -| 1|Emmanuel Jean-Mic...|[[document, 0, 27...|[[token, 0, 7, Em...|[[document, 0, 27...|[[token, 0, 7, Em...|[[pos, 0, 7, PROP...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 35, E...| -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -*/ - -annotation.select("entities.result").show(false) - -/* -|result | -+----------------------------------------------------------------------------------------------------------------+ -|[Quentin Tarantino] | -|[Emmanuel Jean-Michel Frédéric Macron, Jean-Michel Macron, au CHU d'Amiens4, Françoise Noguès, Sécurité sociale]| -+----------------------------------------------------------------------------------------------------------------+ -*/ - -{% endhighlight %} - -
- -### French entity_recognizer_lg - -{% highlight scala %} - -import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline -import com.johnsnowlabs.nlp.SparkNLP - -SparkNLP.version() - -val pipeline = PretrainedPipeline("entity_recognizer_lg", lang="fr") - -val testData = spark.createDataFrame(Seq( -(1, "Contrairement à Quentin Tarantino, le cinéma français ne repart pas les mains vides de la compétition cannoise."), -(2, "Emmanuel Jean-Michel Frédéric Macron est le fils de Jean-Michel Macron, né en 1950, médecin, professeur de neurologie au CHU d'Amiens4 et responsable d'enseignement à la faculté de médecine de cette même ville5, et de Françoise Noguès, médecin conseil à la Sécurité sociale") -)).toDF("id", "text") - -val annotation = pipeline.transform(testData) - -annotation.show() - -/* -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -| id| text| document| token| sentence| embeddings| ner| entities| -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -| 0|Contrairement à Q...|[[document, 0, 11...|[[token, 0, 12, C...|[[document, 0, 11...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 16, 32, ...| -| 1|Emmanuel Jean-Mic...|[[document, 0, 27...|[[token, 0, 7, Em...|[[document, 0, 27...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 35, E...| -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -*/ - -annotation.select("entities.result").show(false) - -/* -+-------------------------------------------------------------------------------------------------------------+ -|result | -+-------------------------------------------------------------------------------------------------------------+ -|[Quentin Tarantino] | -|[Emmanuel Jean-Michel Frédéric Macron, 
Jean-Michel Macron, CHU d'Amiens4, Françoise Noguès, Sécurité sociale]| -+-------------------------------------------------------------------------------------------------------------+ -*/ - -{% endhighlight %} - -
- -### French entity_recognizer_md - -{% highlight scala %} - -import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline -import com.johnsnowlabs.nlp.SparkNLP - -SparkNLP.version() - -val pipeline = PretrainedPipeline("entity_recognizer_md", lang="fr") - -val testData = spark.createDataFrame(Seq( -(1, "Contrairement à Quentin Tarantino, le cinéma français ne repart pas les mains vides de la compétition cannoise."), -(2, "Emmanuel Jean-Michel Frédéric Macron est le fils de Jean-Michel Macron, né en 1950, médecin, professeur de neurologie au CHU d'Amiens4 et responsable d'enseignement à la faculté de médecine de cette même ville5, et de Françoise Noguès, médecin conseil à la Sécurité sociale") -)).toDF("id", "text") - -val annotation = pipeline.transform(testData) - -annotation.show() - -/* -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -| id| text| document| token| sentence| embeddings| ner| entities| -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -| 0|Contrairement à Q...|[[document, 0, 11...|[[token, 0, 12, C...|[[document, 0, 11...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 16, 32, ...| -| 1|Emmanuel Jean-Mic...|[[document, 0, 27...|[[token, 0, 7, Em...|[[document, 0, 27...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 35, E...| -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -*/ - -annotation.select("entities.result").show(false) - -/*+-------------------------------------------------------------------------------------------------------------+ -|result | -+----------------------------------------------------------------------------------------------------------------+ -|[Quentin Tarantino] | -|[Emmanuel Jean-Michel Frédéric Macron, 
Jean-Michel Macron, au CHU d'Amiens4, Françoise Noguès, Sécurité sociale]| -+----------------------------------------------------------------------------------------------------------------+ -*/ - -{% endhighlight %} - -
- -## Italian - -{:.table-model-big} -| Pipelines | Name | -| ----------------------- | --------------------- | -| [Explain Document Large](#italian-explain_document_lg) | `explain_document_lg` | -| [Explain Document Medium](#italian-explain_document_md) | `explain_document_md` | -| [Entity Recognizer Large](#italian-entity_recognizer_lg) | `entity_recognizer_lg` | -| [Entity Recognizer Medium](#italian-entity_recognizer_md) | `entity_recognizer_md` | - -{:.table-model-big} -|Feature | Description| -|---|----| -|**NER**|Trained by **NerDLApproach** annotator with **Char CNNs - BiLSTM - CRF** and **GloVe Embeddings** on the **WikiNER** corpus and supports the identification of `PER`, `LOC`, `ORG` and `MISC` entities -|**Lemma**|Trained by **Lemmatizer** annotator on **DXC Technology** dataset -|**POS**| Trained by **PerceptronApproach** annotator on the [Universal Dependencies](https://universaldependencies.org/treebanks/it_isdt/index.html) -|**Size**| Model size indicator, **md** and **lg**. The large pipeline uses **glove_840B_300** and the medium uses **glove_6B_300** WordEmbeddings - -
- -### Italian explain_document_lg - -{% highlight scala %} - -import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline -import com.johnsnowlabs.nlp.SparkNLP - -SparkNLP.version() - -val pipeline = PretrainedPipeline("explain_document_lg", lang="it") - -val testData = spark.createDataFrame(Seq( -(1, "La FIFA ha deciso: tre giornate a Zidane, due a Materazzi"), -(2, "Reims, 13 giugno 2019 – Domani può essere la giornata decisiva per il passaggio agli ottavi di finale dei Mondiali femminili.") -)).toDF("id", "text") - -val annotation = pipeline.transform(testData) - -annotation.show() - -/* -import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline -import com.johnsnowlabs.nlp.SparkNLP -2.0.8 -pipeline: com.johnsnowlabs.nlp.pretrained.PretrainedPipeline = PretrainedPipeline(explain_document_lg,it,public/models) -testData: org.apache.spark.sql.DataFrame = [id: int, text: string] -annotation: org.apache.spark.sql.DataFrame = [id: int, text: string ... 8 more fields] -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -| id| text| document| token| sentence| lemma| pos| embeddings| ner| entities| -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -| 1|La FIFA ha deciso...|[[document, 0, 56...|[[token, 0, 1, La...|[[document, 0, 56...|[[token, 0, 1, La...|[[pos, 0, 1, DET,...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 3, 6, FI...| -| 2|Reims, 13 giugno ...|[[document, 0, 12...|[[token, 0, 4, Re...|[[document, 0, 12...|[[token, 0, 4, Re...|[[pos, 0, 4, PROP...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 4, Re...| 
-+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -*/ - -annotation.select("entities.result").show(false) - -/* -+-----------------------------------+ -|result | -+-----------------------------------+ -|[FIFA, Zidane, Materazzi] | -|[Reims, Domani, Mondiali femminili]| -+-----------------------------------+ -*/ - -{% endhighlight %} - -
- -### Italian explain_document_md - -{% highlight scala %} - -import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline -import com.johnsnowlabs.nlp.SparkNLP - -SparkNLP.version() - -val pipeline = PretrainedPipeline("explain_document_md", lang="it") - -val testData = spark.createDataFrame(Seq( -(1, "La FIFA ha deciso: tre giornate a Zidane, due a Materazzi"), -(2, "Reims, 13 giugno 2019 – Domani può essere la giornata decisiva per il passaggio agli ottavi di finale dei Mondiali femminili.") -)).toDF("id", "text") - -val annotation = pipeline.transform(testData) - -annotation.show() - -/* -import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline -import com.johnsnowlabs.nlp.SparkNLP -2.0.8 -pipeline: com.johnsnowlabs.nlp.pretrained.PretrainedPipeline = PretrainedPipeline(explain_document_lg,it,public/models) -testData: org.apache.spark.sql.DataFrame = [id: int, text: string] -annotation: org.apache.spark.sql.DataFrame = [id: int, text: string ... 8 more fields] -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -| id| text| document| token| sentence| lemma| pos| embeddings| ner| entities| -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -| 1|La FIFA ha deciso...|[[document, 0, 56...|[[token, 0, 1, La...|[[document, 0, 56...|[[token, 0, 1, La...|[[pos, 0, 1, DET,...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 9, La...| -| 2|Reims, 13 giugno ...|[[document, 0, 12...|[[token, 0, 4, Re...|[[document, 0, 12...|[[token, 0, 4, Re...|[[pos, 0, 4, PROP...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 4, Re...| 
-+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -*/ - -annotation.select("entities.result").show(false) - -/* -+-------------------------------+ -|result | -+-------------------------------+ -|[La FIFA, Zidane, Materazzi]| -|[Reims, Domani, Mondiali] | -+-------------------------------+ -*/ - -{% endhighlight %} - -
- -### Italian entity_recognizer_lg - -{% highlight scala %} - -import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline -import com.johnsnowlabs.nlp.SparkNLP - -SparkNLP.version() - -val pipeline = PretrainedPipeline("entity_recognizer_lg", lang="it") - -val testData = spark.createDataFrame(Seq( -(1, "La FIFA ha deciso: tre giornate a Zidane, due a Materazzi"), -(2, "Reims, 13 giugno 2019 – Domani può essere la giornata decisiva per il passaggio agli ottavi di finale dei Mondiali femminili.") -)).toDF("id", "text") - -val annotation = pipeline.transform(testData) - -annotation.show() - -/* -import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline -import com.johnsnowlabs.nlp.SparkNLP -2.0.8 -pipeline: com.johnsnowlabs.nlp.pretrained.PretrainedPipeline = PretrainedPipeline(explain_document_lg,it,public/models) -testData: org.apache.spark.sql.DataFrame = [id: int, text: string] -annotation: org.apache.spark.sql.DataFrame = [id: int, text: string ... 8 more fields] -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -| id| text| document| token| sentence| embeddings| ner| entities| -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -| 1|La FIFA ha deciso...|[[document, 0, 56...|[[token, 0, 1, La...|[[document, 0, 56...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 3, 6, FI...| -| 2|Reims, 13 giugno ...|[[document, 0, 12...|[[token, 0, 4, Re...|[[document, 0, 12...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 4, Re...| -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -*/ - -annotation.select("entities.result").show(false) - -/* -+-----------------------------------+ -|result | -+-----------------------------------+ -|[FIFA, Zidane, Materazzi] | 
-|[Reims, Domani, Mondiali femminili]| -+-----------------------------------+ -*/ - -{% endhighlight %} - -
- -### Italian entity_recognizer_md - -{% highlight scala %} - -import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline -import com.johnsnowlabs.nlp.SparkNLP - -SparkNLP.version() - -val pipeline = PretrainedPipeline("entity_recognizer_md", lang="it") - -val testData = spark.createDataFrame(Seq( -(1, "La FIFA ha deciso: tre giornate a Zidane, due a Materazzi"), -(2, "Reims, 13 giugno 2019 – Domani può essere la giornata decisiva per il passaggio agli ottavi di finale dei Mondiali femminili.") -)).toDF("id", "text") - -val annotation = pipeline.transform(testData) - -annotation.show() - -/* -import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline -import com.johnsnowlabs.nlp.SparkNLP -2.0.8 -pipeline: com.johnsnowlabs.nlp.pretrained.PretrainedPipeline = PretrainedPipeline(explain_document_lg,it,public/models) -testData: org.apache.spark.sql.DataFrame = [id: int, text: string] -annotation: org.apache.spark.sql.DataFrame = [id: int, text: string ... 8 more fields] -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -| id| text| document| token| sentence| embeddings| ner| entities| -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -| 1|La FIFA ha deciso...|[[document, 0, 56...|[[token, 0, 1, La...|[[document, 0, 56...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 9, La...| -| 2|Reims, 13 giugno ...|[[document, 0, 12...|[[token, 0, 4, Re...|[[document, 0, 12...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 4, Re...| -+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+ -*/ - -annotation.select("entities.result").show(false) - -/* -+-------------------------------+ -|result | -+-------------------------------+ -|[La FIFA, Zidane, Materazzi]| -|[Reims, 
Domani, Mondiali] | -+-------------------------------+ -*/ - -{% endhighlight %} - -
- -## Spanish - -{:.table-model-big} -| Pipeline | Name | Build | lang | Description | Offline | -|:-------------------------|:-----------------------|:-------|:-------|:----------|:----------| -| Explain Document Small | `explain_document_sm` | 2.4.0 | `es` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_sm_es_2.4.0_2.4_1581977077084.zip) | -| Explain Document Medium | `explain_document_md` | 2.4.0 | `es` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_md_es_2.4.0_2.4_1581976836224.zip) | -| Explain Document Large | `explain_document_lg` | 2.4.0 | `es` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_lg_es_2.4.0_2.4_1581975536033.zip) | -| Entity Recognizer Small | `entity_recognizer_sm` | 2.4.0 | `es` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_sm_es_2.4.0_2.4_1581978479912.zip) | -| Entity Recognizer Medium | `entity_recognizer_md` | 2.4.0 | `es` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_md_es_2.4.0_2.4_1581978260094.zip) | -| Entity Recognizer Large | `entity_recognizer_lg` | 2.4.0 | `es` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_lg_es_2.4.0_2.4_1581977172660.zip) | - -{:.table-model-big} -| Feature | Description | -|:----------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| **Lemma** | Trained by **Lemmatizer** annotator on **lemmatization-lists** by `Michal Měchura` | -| **POS** | Trained by **PerceptronApproach** annotator on the [Universal Dependencies](https://universaldependencies.org/treebanks/es_gsd/index.html) | -| **NER** | Trained by **NerDLApproach** annotator with **Char CNNs - BiLSTM - CRF** 
and **GloVe Embeddings** on the **WikiNER** corpus and supports the identification of `PER`, `LOC`, `ORG` and `MISC` entities | -|**Size**| Model size indicator, **sm**, **md**, and **lg**. The small pipelines use **glove_100d**, the medium pipelines use **glove_6B_300**, and large pipelines use **glove_840B_300** WordEmbeddings - -
- -## Russian - -{:.table-model-big} -| Pipeline | Name | Build | lang | Description | Offline | -|:-------------------------|:-----------------------|:-------|:-------|:----------|:----------| -| Explain Document Small | `explain_document_sm` | 2.4.4 | `ru` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_sm_ru_2.4.4_2.4_1584017142719.zip) | -| Explain Document Medium | `explain_document_md` | 2.4.4 | `ru` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_md_ru_2.4.4_2.4_1584016917220.zip) | -| Explain Document Large | `explain_document_lg` | 2.4.4 | `ru` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_lg_ru_2.4.4_2.4_1584015824836.zip) | -| Entity Recognizer Small | `entity_recognizer_sm` | 2.4.4 | `ru` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_sm_ru_2.4.4_2.4_1584018543619.zip) | -| Entity Recognizer Medium | `entity_recognizer_md` | 2.4.4 | `ru` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_md_ru_2.4.4_2.4_1584018332357.zip) | -| Entity Recognizer Large | `entity_recognizer_lg` | 2.4.4 | `ru` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_lg_ru_2.4.4_2.4_1584017227871.zip) | - -{:.table-model-big} -| Feature | Description | -|:----------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| **Lemma** | Trained by **Lemmatizer** annotator on the [Universal Dependencies](https://universaldependencies.org/treebanks/ru_gsd/index.html)| -| **POS** | Trained by **PerceptronApproach** annotator on the [Universal Dependencies](https://universaldependencies.org/treebanks/ru_gsd/index.html) | -| **NER** | Trained by 
**NerDLApproach** annotator with **Char CNNs - BiLSTM - CRF** and **GloVe Embeddings** on the **WikiNER** corpus and supports the identification of `PER`, `LOC`, `ORG` and `MISC` entities | - -
- -## Dutch - -{:.table-model-big} -| Pipeline | Name | Build | lang | Description | Offline | -|:-------------------------|:-----------------------|:-------|:-------|:----------|:----------| -| Explain Document Small | `explain_document_sm` | 2.5.0 | `nl` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_sm_nl_2.5.0_2.4_1588546621618.zip) | -| Explain Document Medium | `explain_document_md` | 2.5.0 | `nl` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_md_nl_2.5.0_2.4_1588546605329.zip) | -| Explain Document Large | `explain_document_lg` | 2.5.0 | `nl` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_lg_nl_2.5.0_2.4_1588612556770.zip) | -| Entity Recognizer Small | `entity_recognizer_sm` | 2.5.0 | `nl` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_sm_nl_2.5.0_2.4_1588546655907.zip) | -| Entity Recognizer Medium | `entity_recognizer_md` | 2.5.0 | `nl` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_md_nl_2.5.0_2.4_1588546645304.zip) | -| Entity Recognizer Large | `entity_recognizer_lg` | 2.5.0 | `nl` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_lg_nl_2.5.0_2.4_1588612569958.zip) | - -
- -## Norwegian - -{:.table-model-big} -| Pipeline | Name | Build | lang | Description | Offline | -|:-------------------------|:-----------------------|:-------|:-------|:----------|:----------| -| Explain Document Small | `explain_document_sm` | 2.5.0 | `no` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_sm_no_2.5.0_2.4_1588784132955.zip) | -| Explain Document Medium | `explain_document_md` | 2.5.0 | `no` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_md_no_2.5.0_2.4_1588783879809.zip) | -| Explain Document Large | `explain_document_lg` | 2.5.0 | `no` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_lg_no_2.5.0_2.4_1588782610672.zip) | -| Entity Recognizer Small | `entity_recognizer_sm` | 2.5.0 | `no` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_sm_no_2.5.0_2.4_1588794567766.zip) | -| Entity Recognizer Medium | `entity_recognizer_md` | 2.5.0 | `no` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_md_no_2.5.0_2.4_1588794357614.zip) | -| Entity Recognizer Large | `entity_recognizer_lg` | 2.5.0 | `no` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_lg_no_2.5.0_2.4_1588793261642.zip) | - -
- -## Polish - -{:.table-model-big} -| Pipeline | Name | Build | lang | Description | Offline | -|:-------------------------|:-----------------------|:-------|:-------|:----------|:----------| -| Explain Document Small | `explain_document_sm` | 2.5.0 | `pl` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_sm_pl_2.5.0_2.4_1588531081173.zip) | -| Explain Document Medium | `explain_document_md` | 2.5.0 | `pl` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_md_pl_2.5.0_2.4_1588530841737.zip) | -| Explain Document Large | `explain_document_lg` | 2.5.0 | `pl` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_lg_pl_2.5.0_2.4_1588529695577.zip) | -| Entity Recognizer Small | `entity_recognizer_sm` | 2.5.0 | `pl` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_sm_pl_2.5.0_2.4_1588532616080.zip) | -| Entity Recognizer Medium | `entity_recognizer_md` | 2.5.0 | `pl` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_md_pl_2.5.0_2.4_1588532376753.zip) | -| Entity Recognizer Large | `entity_recognizer_lg` | 2.5.0 | `pl` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_lg_pl_2.5.0_2.4_1588531171903.zip) | - -
- -## Portuguese - -{:.table-model-big} -| Pipeline | Name | Build | lang | Description | Offline | -|:-------------------------|:-----------------------|:-------|:-------|:----------|:----------| -| Explain Document Small | `explain_document_sm` | 2.5.0 | `pt` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_sm_pt_2.5.0_2.4_1588501423743.zip) | -| Explain Document Medium | `explain_document_md` | 2.5.0 | `pt` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_md_pt_2.5.0_2.4_1588501189804.zip) | -| Explain Document Large | `explain_document_lg` | 2.5.0 | `pt` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_lg_pt_2.5.0_2.4_1588500056427.zip) | -| Entity Recognizer Small | `entity_recognizer_sm` | 2.5.0 | `pt` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_sm_pt_2.5.0_2.4_1588502815900.zip) | -| Entity Recognizer Medium | `entity_recognizer_md` | 2.5.0 | `pt` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_md_pt_2.5.0_2.4_1588502606198.zip) | -| Entity Recognizer Large | `entity_recognizer_lg` | 2.5.0 | `pt` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_lg_pt_2.5.0_2.4_1588501526324.zip) | - -
-
-## Multi-language
-
-{:.table-model-big}
-| Pipeline | Name | Build | lang | Description | Offline |
-|:-------------------------|:-----------------------|:-------|:-------|:----------|:----------|
-| LanguageDetectorDL | `detect_language_7` | 2.5.2 | `xx` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/detect_language_7_xx_2.5.0_2.4_1591875676774.zip) |
-| LanguageDetectorDL | `detect_language_20` | 2.5.2 | `xx` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/detect_language_20_xx_2.5.0_2.4_1591875683182.zip) |
-
-* The model with 7 languages: Czech, German, English, Spanish, French, Italy, and Slovak
-* The model with 20 languages: Bulgarian, Czech, German, Greek, English, Spanish, Finnish, French, Croatian, Hungarian, Italy, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Swedish, Turkish, and Ukrainian
-
-
-
-## How to use
-
-### Online
-
-To use Spark NLP pretrained pipelines, you can call `PretrainedPipeline` with pipeline's name and its language (default is `en`):
-
-{% highlight python %}
-
-pipeline = PretrainedPipeline('explain_document_dl', lang='en')
-
-{% endhighlight %}
-
-Same in Scala
-
-{% highlight scala %}
-
-val pipeline = PretrainedPipeline("explain_document_dl", lang="en")
-
-{% endhighlight %}
-
-
-
-### Offline
-
-If you have any trouble using online pipelines or models in your environment (maybe it's air-gapped), you can directly download them for `offline` use.
-
-After downloading offline models/pipelines and extracting them, here is how you can use them iside your code (the path could be a shared storage like HDFS in a cluster):
-
-{% highlight scala %}
-val advancedPipeline = PipelineModel.load("/tmp/explain_document_dl_en_2.0.2_2.4_1556530585689/")
-// To use the loaded Pipeline for prediction
-advancedPipeline.transform(predictionDF)
-
-{% endhighlight %}
+#### Please check out our Models Hub for the full list of [pre-trained models](https://sparknlp.org/models) with examples, demo, benchmark, and more
\ No newline at end of file