-```
-
-You can set image-version, master-machine-type, worker-machine-type,
-master-boot-disk-size, worker-boot-disk-size, num-workers as your needs.
-If you use the previous image-version from 2.0, you should also add ANACONDA to optional-components.
-And, you should enable gateway.
-Don't forget to set the maven coordinates for the jar in properties.
-
-```bash
-gcloud dataproc clusters create ${CLUSTER_NAME} \
- --region=${REGION} \
- --zone=${ZONE} \
- --image-version=2.0 \
- --master-machine-type=n1-standard-4 \
- --worker-machine-type=n1-standard-2 \
- --master-boot-disk-size=128GB \
- --worker-boot-disk-size=128GB \
- --num-workers=2 \
- --bucket=${BUCKET_NAME} \
- --optional-components=JUPYTER \
- --enable-component-gateway \
- --metadata 'PIP_PACKAGES=spark-nlp spark-nlp-display google-cloud-bigquery google-cloud-storage' \
- --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh \
- --properties spark:spark.serializer=org.apache.spark.serializer.KryoSerializer,spark:spark.driver.maxResultSize=0,spark:spark.kryoserializer.buffer.max=2000M,spark:spark.jars.packages=com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0
-```
-
-2. On an existing one, you need to install spark-nlp and spark-nlp-display packages from PyPI.
+### Python
-3. Now, you can attach your notebook to the cluster and use the Spark NLP!
+Spark NLP supports Python 3.7.x and above depending on your major PySpark version.
+Check all available installation options for Python in our official [documentation](https://sparknlp.org/docs/en/install#python).
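+
+As a minimal sketch, installing from PyPI (reusing the PySpark version shown later in the install docs):
+
+```bash
+# install Spark NLP together with a compatible PySpark
+pip install spark-nlp==5.4.0 pyspark==3.3.1
+```
+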
-## Spark NLP Configuration
-You can change the following Spark NLP configurations via Spark Configuration:
+### Compiled JARs
+To compile the JARs from source, follow [these instructions](https://sparknlp.org/docs/en/compiled#jars) from our official documentation.
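+
+As a quick sketch, the source build uses `sbt` from a checkout of the repository:
+
+```bash
+git clone https://github.com/JohnSnowLabs/spark-nlp.git
+cd spark-nlp
+
+# CPU fat JAR (add -Dis_gpu=true for the GPU build)
+sbt assembly
+```
+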
-| Property Name | Default | Meaning |
-|---------------------------------------------------------|----------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| `spark.jsl.settings.pretrained.cache_folder` | `~/cache_pretrained` | The location to download and extract pretrained `Models` and `Pipelines`. By default, it will be in User's Home directory under `cache_pretrained` directory |
-| `spark.jsl.settings.storage.cluster_tmp_dir` | `hadoop.tmp.dir` | The location to use on a cluster for temporarily files such as unpacking indexes for WordEmbeddings. By default, this locations is the location of `hadoop.tmp.dir` set via Hadoop configuration for Apache Spark. NOTE: `S3` is not supported and it must be local, HDFS, or DBFS |
-| `spark.jsl.settings.annotator.log_folder` | `~/annotator_logs` | The location to save logs from annotators during training such as `NerDLApproach`, `ClassifierDLApproach`, `SentimentDLApproach`, `MultiClassifierDLApproach`, etc. By default, it will be in User's Home directory under `annotator_logs` directory |
-| `spark.jsl.settings.aws.credentials.access_key_id` | `None` | Your AWS access key to use your S3 bucket to store log files of training models or access tensorflow graphs used in `NerDLApproach` |
-| `spark.jsl.settings.aws.credentials.secret_access_key` | `None` | Your AWS secret access key to use your S3 bucket to store log files of training models or access tensorflow graphs used in `NerDLApproach` |
-| `spark.jsl.settings.aws.credentials.session_token` | `None` | Your AWS MFA session token to use your S3 bucket to store log files of training models or access tensorflow graphs used in `NerDLApproach` |
-| `spark.jsl.settings.aws.s3_bucket` | `None` | Your AWS S3 bucket to store log files of training models or access tensorflow graphs used in `NerDLApproach` |
-| `spark.jsl.settings.aws.region` | `None` | Your AWS region to use your S3 bucket to store log files of training models or access tensorflow graphs used in `NerDLApproach` |
-| `spark.jsl.settings.onnx.gpuDeviceId` | `0` | Constructs CUDA execution provider options for the specified non-negative device id. |
-| `spark.jsl.settings.onnx.intraOpNumThreads` | `6` | Sets the size of the CPU thread pool used for executing a single graph, if executing on a CPU. |
-| `spark.jsl.settings.onnx.optimizationLevel` | `ALL_OPT` | Sets the optimization level of this options object, overriding the old setting. |
-| `spark.jsl.settings.onnx.executionMode` | `SEQUENTIAL` | Sets the execution mode of this options object, overriding the old setting. |
+## Platform-Specific Instructions
+For detailed instructions on how to use Spark NLP on supported platforms, please refer to our official documentation:
-### How to set Spark NLP Configuration
+| Platform | Supported Language(s) |
+|-------------------------|-----------------------|
+| [Apache Zeppelin](https://sparknlp.org/docs/en/install#apache-zeppelin) | Scala, Python |
+| [Jupyter Notebook](https://sparknlp.org/docs/en/install#jupyter-notebook) | Python |
+| [Google Colab Notebook](https://sparknlp.org/docs/en/install#google-colab-notebook) | Python |
+| [Kaggle Kernel](https://sparknlp.org/docs/en/install#kaggle-kernel) | Python |
+| [Databricks Cluster](https://sparknlp.org/docs/en/install#databricks-cluster) | Scala, Python |
+| [EMR Cluster](https://sparknlp.org/docs/en/install#emr-cluster) | Scala, Python |
+| [GCP Dataproc Cluster](https://sparknlp.org/docs/en/install#gcp-dataproc) | Scala, Python |
-**SparkSession:**
-
-You can use `.config()` during SparkSession creation to set Spark NLP configurations.
-
-```python
-from pyspark.sql import SparkSession
-
-spark = SparkSession.builder
- .master("local[*]")
- .config("spark.driver.memory", "16G")
- .config("spark.driver.maxResultSize", "0")
- .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
- .config("spark.kryoserializer.buffer.max", "2000m")
- .config("spark.jsl.settings.pretrained.cache_folder", "sample_data/pretrained")
- .config("spark.jsl.settings.storage.cluster_tmp_dir", "sample_data/storage")
- .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0")
- .getOrCreate()
-```
-
-**spark-shell:**
-
-```sh
-spark-shell \
- --driver-memory 16g \
- --conf spark.driver.maxResultSize=0 \
- --conf spark.serializer=org.apache.spark.serializer.KryoSerializer
- --conf spark.kryoserializer.buffer.max=2000M \
- --conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \
- --conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \
- --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0
-```
-**pyspark:**
+### Offline
-```sh
-pyspark \
- --driver-memory 16g \
- --conf spark.driver.maxResultSize=0 \
- --conf spark.serializer=org.apache.spark.serializer.KryoSerializer
- --conf spark.kryoserializer.buffer.max=2000M \
- --conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \
- --conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \
- --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0
-```
-
-**Databricks:**
-
-On a new cluster or existing one you need to add the following to the `Advanced Options -> Spark` tab:
+The Spark NLP library and all the pre-trained models/pipelines can be used entirely offline, with no access to the Internet.
+Please check [these instructions](https://sparknlp.org/docs/en/install#offline) from our official documentation
+to use Spark NLP offline.
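+
+As a minimal offline sketch (assuming you have already downloaded and extracted a model from the [Models Hub](https://sparknlp.org/models) to a local path):
+
+```python
+from sparknlp.annotator import PerceptronModel
+
+# load a locally extracted model instead of downloading it with .pretrained()
+french_pos = PerceptronModel.load("/tmp/pos_ud_gsd_fr_2.0.2_2.4_1556531457346/") \
+    .setInputCols("document", "token") \
+    .setOutputCol("pos")
+```
+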
-```bash
-spark.kryoserializer.buffer.max 2000M
-spark.serializer org.apache.spark.serializer.KryoSerializer
-spark.jsl.settings.pretrained.cache_folder dbfs:/PATH_TO_CACHE
-spark.jsl.settings.storage.cluster_tmp_dir dbfs:/PATH_TO_STORAGE
-spark.jsl.settings.annotator.log_folder dbfs:/PATH_TO_LOGS
-```
+## Advanced Settings
-NOTE: If this is an existing cluster, after adding new configs or changing existing properties you need to restart it.
+You can change Spark NLP configurations via Spark properties.
+Please check [these instructions](https://sparknlp.org/docs/en/install#sparknlp-properties) from our official documentation.
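+
+For example, a sketch of overriding the pretrained cache folder at SparkSession creation (the path is illustrative):
+
+```python
+from pyspark.sql import SparkSession
+
+spark = SparkSession.builder \
+    .config("spark.jsl.settings.pretrained.cache_folder", "sample_data/pretrained") \
+    .getOrCreate()
+```
+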
### S3 Integration
@@ -991,302 +225,24 @@ In Spark NLP we can define S3 locations to:
- Export log files of training models
- Store tensorflow graphs used in `NerDLApproach`
-**Logging:**
-
-To configure S3 path for logging while training models. We need to set up AWS credentials as well as an S3 path
-
-```bash
-spark.conf.set("spark.jsl.settings.annotator.log_folder", "s3://my/s3/path/logs")
-spark.conf.set("spark.jsl.settings.aws.credentials.access_key_id", "MY_KEY_ID")
-spark.conf.set("spark.jsl.settings.aws.credentials.secret_access_key", "MY_SECRET_ACCESS_KEY")
-spark.conf.set("spark.jsl.settings.aws.s3_bucket", "my.bucket")
-spark.conf.set("spark.jsl.settings.aws.region", "my-region")
-```
-
-Now you can check the log on your S3 path defined in *spark.jsl.settings.annotator.log_folder* property.
-Make sure to use the prefix *s3://*, otherwise it will use the default configuration.
-
-**Tensorflow Graphs:**
-
-To reference S3 location for downloading graphs. We need to set up AWS credentials
-
-```bash
-spark.conf.set("spark.jsl.settings.aws.credentials.access_key_id", "MY_KEY_ID")
-spark.conf.set("spark.jsl.settings.aws.credentials.secret_access_key", "MY_SECRET_ACCESS_KEY")
-spark.conf.set("spark.jsl.settings.aws.region", "my-region")
-```
-
-**MFA Configuration:**
-
-In case your AWS account is configured with MFA. You will need first to get temporal credentials and add session token
-to the configuration as shown in the examples below
-For logging:
-
-```bash
-spark.conf.set("spark.jsl.settings.aws.credentials.session_token", "MY_TOKEN")
-```
-
-An example of a bash script that gets temporal AWS credentials can be
-found [here](https://github.com/JohnSnowLabs/spark-nlp/blob/master/scripts/aws_tmp_credentials.sh)
-This script requires three arguments:
-
-```bash
-./aws_tmp_credentials.sh iam_user duration serial_number
-```
-
-## Pipelines and Models
-
-### Pipelines
-
-**Quick example:**
-
-```scala
-import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
-import com.johnsnowlabs.nlp.SparkNLP
-
-SparkNLP.version()
-
-val testData = spark.createDataFrame(Seq(
- (1, "Google has announced the release of a beta version of the popular TensorFlow machine learning library"),
- (2, "Donald John Trump (born June 14, 1946) is the 45th and current president of the United States")
-)).toDF("id", "text")
-
-val pipeline = PretrainedPipeline("explain_document_dl", lang = "en")
-
-val annotation = pipeline.transform(testData)
-
-annotation.show()
-/*
-import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
-import com.johnsnowlabs.nlp.SparkNLP
-2.5.0
-testData: org.apache.spark.sql.DataFrame = [id: int, text: string]
-pipeline: com.johnsnowlabs.nlp.pretrained.PretrainedPipeline = PretrainedPipeline(explain_document_dl,en,public/models)
-annotation: org.apache.spark.sql.DataFrame = [id: int, text: string ... 10 more fields]
-+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
-| id| text| document| token| sentence| checked| lemma| stem| pos| embeddings| ner| entities|
-+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
-| 1|Google has announ...|[[document, 0, 10...|[[token, 0, 5, Go...|[[document, 0, 10...|[[token, 0, 5, Go...|[[token, 0, 5, Go...|[[token, 0, 5, go...|[[pos, 0, 5, NNP,...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 5, Go...|
-| 2|The Paris metro w...|[[document, 0, 11...|[[token, 0, 2, Th...|[[document, 0, 11...|[[token, 0, 2, Th...|[[token, 0, 2, Th...|[[token, 0, 2, th...|[[pos, 0, 2, DT, ...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 4, 8, Pa...|
-+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
-*/
-
-annotation.select("entities.result").show(false)
-
-/*
-+----------------------------------+
-|result |
-+----------------------------------+
-|[Google, TensorFlow] |
-|[Donald John Trump, United States]|
-+----------------------------------+
-*/
-```
-
-#### Showing Available Pipelines
-
-There are functions in Spark NLP that will list all the available Pipelines
-of a particular language for you:
-
-```scala
-import com.johnsnowlabs.nlp.pretrained.ResourceDownloader
-
-ResourceDownloader.showPublicPipelines(lang = "en")
-/*
-+--------------------------------------------+------+---------+
-| Pipeline | lang | version |
-+--------------------------------------------+------+---------+
-| dependency_parse | en | 2.0.2 |
-| analyze_sentiment_ml | en | 2.0.2 |
-| check_spelling | en | 2.1.0 |
-| match_datetime | en | 2.1.0 |
- ...
-| explain_document_ml | en | 3.1.3 |
-+--------------------------------------------+------+---------+
-*/
-```
-
-Or if we want to check for a particular version:
-
-```scala
-import com.johnsnowlabs.nlp.pretrained.ResourceDownloader
-
-ResourceDownloader.showPublicPipelines(lang = "en", version = "3.1.0")
-/*
-+---------------------------------------+------+---------+
-| Pipeline | lang | version |
-+---------------------------------------+------+---------+
-| dependency_parse | en | 2.0.2 |
- ...
-| clean_slang | en | 3.0.0 |
-| clean_pattern | en | 3.0.0 |
-| check_spelling | en | 3.0.0 |
-| dependency_parse | en | 3.0.0 |
-+---------------------------------------+------+---------+
-*/
-```
-
-#### Please check out our Models Hub for the full list of [pre-trained pipelines](https://sparknlp.org/models) with examples, demos, benchmarks, and more
-
-### Models
+Please check [these instructions](https://sparknlp.org/docs/en/install#s3-integration) from our official documentation.
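+
+As a minimal sketch of the S3 logging setup (bucket, region, and credential values are placeholders):
+
+```python
+spark.conf.set("spark.jsl.settings.annotator.log_folder", "s3://my/s3/path/logs")
+spark.conf.set("spark.jsl.settings.aws.credentials.access_key_id", "MY_KEY_ID")
+spark.conf.set("spark.jsl.settings.aws.credentials.secret_access_key", "MY_SECRET_ACCESS_KEY")
+spark.conf.set("spark.jsl.settings.aws.s3_bucket", "my.bucket")
+spark.conf.set("spark.jsl.settings.aws.region", "my-region")
+```
+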
-**Some selected languages:
-** `Afrikaans, Arabic, Armenian, Basque, Bengali, Breton, Bulgarian, Catalan, Czech, Dutch, English, Esperanto, Finnish, French, Galician, German, Greek, Hausa, Hebrew, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Latin, Latvian, Marathi, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Somali, Southern Sotho, Spanish, Swahili, Swedish, Tswana, Turkish, Ukrainian, Zulu`
+## Documentation
-**Quick online example:**
-
-```python
-# load NER model trained by deep learning approach and GloVe word embeddings
-ner_dl = NerDLModel.pretrained('ner_dl')
-# load NER model trained by deep learning approach and BERT word embeddings
-ner_bert = NerDLModel.pretrained('ner_dl_bert')
-```
-
-```scala
-// load French POS tagger model trained by Universal Dependencies
-val french_pos = PerceptronModel.pretrained("pos_ud_gsd", lang = "fr")
-// load Italian LemmatizerModel
-val italian_lemma = LemmatizerModel.pretrained("lemma_dxc", lang = "it")
-````
-
-**Quick offline example:**
-
-- Loading `PerceptronModel` annotator model inside Spark NLP Pipeline
-
-```scala
-val french_pos = PerceptronModel.load("/tmp/pos_ud_gsd_fr_2.0.2_2.4_1556531457346/")
- .setInputCols("document", "token")
- .setOutputCol("pos")
-```
-
-#### Showing Available Models
-
-There are functions in Spark NLP that will list all the available Models
-of a particular Annotator and language for you:
-
-```scala
-import com.johnsnowlabs.nlp.pretrained.ResourceDownloader
-
-ResourceDownloader.showPublicModels(annotator = "NerDLModel", lang = "en")
-/*
-+---------------------------------------------+------+---------+
-| Model | lang | version |
-+---------------------------------------------+------+---------+
-| onto_100 | en | 2.1.0 |
-| onto_300 | en | 2.1.0 |
-| ner_dl_bert | en | 2.2.0 |
-| onto_100 | en | 2.4.0 |
-| ner_conll_elmo | en | 3.2.2 |
-+---------------------------------------------+------+---------+
-*/
-```
-
-Or if we want to check for a particular version:
-
-```scala
-import com.johnsnowlabs.nlp.pretrained.ResourceDownloader
-
-ResourceDownloader.showPublicModels(annotator = "NerDLModel", lang = "en", version = "3.1.0")
-/*
-+----------------------------+------+---------+
-| Model | lang | version |
-+----------------------------+------+---------+
-| onto_100 | en | 2.1.0 |
-| ner_aspect_based_sentiment | en | 2.6.2 |
-| ner_weibo_glove_840B_300d | en | 2.6.2 |
-| nerdl_atis_840b_300d | en | 2.7.1 |
-| nerdl_snips_100d | en | 2.7.3 |
-+----------------------------+------+---------+
-*/
-```
-
-And to see a list of available annotators, you can use:
-
-```scala
-import com.johnsnowlabs.nlp.pretrained.ResourceDownloader
-
-ResourceDownloader.showAvailableAnnotators()
-/*
-AlbertEmbeddings
-AlbertForTokenClassification
-AssertionDLModel
-...
-XlmRoBertaSentenceEmbeddings
-XlnetEmbeddings
-*/
-```
-
-#### Please check out our Models Hub for the full list of [pre-trained models](https://sparknlp.org/models) with examples, demo, benchmark, and more
-
-## Offline
-
-Spark NLP library and all the pre-trained models/pipelines can be used entirely offline with no access to the Internet.
-If you are behind a proxy or a firewall with no access to the Maven repository (to download packages) or/and no access
-to S3 (to automatically download models and pipelines), you can simply follow the instructions to have Spark NLP without
-any limitations offline:
-
-- Instead of using the Maven package, you need to load our Fat JAR
-- Instead of using PretrainedPipeline for pretrained pipelines or the `.pretrained()` function to download pretrained
- models, you will need to manually download your pipeline/model from [Models Hub](https://sparknlp.org/models),
- extract it, and load it.
-
-Example of `SparkSession` with Fat JAR to have Spark NLP offline:
-
-```python
-spark = SparkSession.builder
- .appName("Spark NLP")
- .master("local[*]")
- .config("spark.driver.memory", "16G")
- .config("spark.driver.maxResultSize", "0")
- .config("spark.kryoserializer.buffer.max", "2000M")
- .config("spark.jars", "/tmp/spark-nlp-assembly-5.4.0.jar")
- .getOrCreate()
-```
-
-- You can download provided Fat JARs from each [release notes](https://github.com/JohnSnowLabs/spark-nlp/releases),
- please pay attention to pick the one that suits your environment depending on the device (CPU/GPU) and Apache Spark
- version (3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x)
-- If you are local, you can load the Fat JAR from your local FileSystem, however, if you are in a cluster setup you need
- to put the Fat JAR on a distributed FileSystem such as HDFS, DBFS, S3, etc. (
- i.e., `hdfs:///tmp/spark-nlp-assembly-5.4.0.jar`)
-
-Example of using pretrained Models and Pipelines in offline:
-
-```python
-# instead of using pretrained() for online:
-# french_pos = PerceptronModel.pretrained("pos_ud_gsd", lang="fr")
-# you download this model, extract it, and use .load
-french_pos = PerceptronModel.load("/tmp/pos_ud_gsd_fr_2.0.2_2.4_1556531457346/")
- .setInputCols("document", "token")
- .setOutputCol("pos")
-
-# example for pipelines
-# instead of using PretrainedPipeline
-# pipeline = PretrainedPipeline('explain_document_dl', lang='en')
-# you download this pipeline, extract it, and use PipelineModel
-PipelineModel.load("/tmp/explain_document_dl_en_2.0.2_2.4_1556530585689/")
-```
-
-- Since you are downloading and loading models/pipelines manually, this means Spark NLP is not downloading the most
- recent and compatible models/pipelines for you. Choosing the right model/pipeline is on you
-- If you are local, you can load the model/pipeline from your local FileSystem, however, if you are in a cluster setup
- you need to put the model/pipeline on a distributed FileSystem such as HDFS, DBFS, S3, etc. (
- i.e., `hdfs:///tmp/explain_document_dl_en_2.0.2_2.4_1556530585689/`)
-
-## Examples
+### Examples
Need more **examples**? Check out our dedicated [Spark NLP Examples](https://github.com/JohnSnowLabs/spark-nlp/tree/master/examples)
repository, which showcases all Spark NLP use cases!
Also, don't forget to check [Spark NLP in Action](https://sparknlp.org/demo), built with Streamlit.
-### All examples: [spark-nlp/examples](https://github.com/JohnSnowLabs/spark-nlp/tree/master/examples)
+#### All examples: [spark-nlp/examples](https://github.com/JohnSnowLabs/spark-nlp/tree/master/examples)
-## FAQ
+### FAQ
[Check our Articles and Videos page here](https://sparknlp.org/learn)
-## Citation
+### Citation
We have published a [paper](https://www.sciencedirect.com/science/article/pii/S2665963821000063) that you can cite for
the Spark NLP library:
@@ -1307,6 +263,15 @@ the Spark NLP library:
}
```
+## Community support
+
+- [Slack](https://join.slack.com/t/spark-nlp/shared_invite/zt-198dipu77-L3UWNe_AJ8xqDk0ivmih5Q) For live discussion with the Spark NLP community and the team
+- [GitHub](https://github.com/JohnSnowLabs/spark-nlp) Bug reports, feature requests, and contributions
+- [Discussions](https://github.com/JohnSnowLabs/spark-nlp/discussions) Engage with other community members, share ideas,
+ and show off how you use Spark NLP!
+- [Medium](https://medium.com/spark-nlp) Spark NLP articles
+- [YouTube](https://www.youtube.com/channel/UCmFOjlpYEhxf_wJUDuz6xxQ/videos) Spark NLP video tutorials
+
## Contributing
We appreciate any sort of contributions:
diff --git a/docs/_data/navigation.yml b/docs/_data/navigation.yml
index 21b4f372614dd6..c6e75a2a846237 100755
--- a/docs/_data/navigation.yml
+++ b/docs/_data/navigation.yml
@@ -36,6 +36,12 @@ sparknlp:
url: /docs/en/quickstart
- title: Install Spark NLP
url: /docs/en/install
+ - title: Advanced Settings
+ url: /docs/en/advanced_settings
+ - title: Features
+ url: /docs/en/features
+ - title: Pipelines and Models
+ url: /docs/en/pipelines
- title: General Concepts
url: /docs/en/concepts
- title: Annotators
diff --git a/docs/en/advanced_settings.md b/docs/en/advanced_settings.md
new file mode 100644
index 00000000000000..84c8dc5751187e
--- /dev/null
+++ b/docs/en/advanced_settings.md
@@ -0,0 +1,142 @@
+---
+layout: docs
+header: true
+seotitle: Spark NLP - Advanced Settings
+title: Spark NLP - Advanced Settings
+permalink: /docs/en/advanced_settings
+key: docs-install
+modify_date: "2024-07-04"
+show_nav: true
+sidebar:
+ nav: sparknlp
+---
+
+
+
+## SparkNLP Properties
+
+You can change the following Spark NLP configurations via Spark Configuration:
+
+| Property Name | Default | Meaning |
+|---------------------------------------------------------|----------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| `spark.jsl.settings.pretrained.cache_folder` | `~/cache_pretrained` | The location to download and extract pretrained `Models` and `Pipelines`. By default, it will be in User's Home directory under `cache_pretrained` directory |
+| `spark.jsl.settings.storage.cluster_tmp_dir` | `hadoop.tmp.dir` | The location to use on a cluster for temporary files such as unpacking indexes for WordEmbeddings. By default, this location is the location of `hadoop.tmp.dir` set via Hadoop configuration for Apache Spark. NOTE: `S3` is not supported and it must be local, HDFS, or DBFS |
+| `spark.jsl.settings.annotator.log_folder` | `~/annotator_logs` | The location to save logs from annotators during training such as `NerDLApproach`, `ClassifierDLApproach`, `SentimentDLApproach`, `MultiClassifierDLApproach`, etc. By default, it will be in User's Home directory under `annotator_logs` directory |
+| `spark.jsl.settings.aws.credentials.access_key_id` | `None` | Your AWS access key to use your S3 bucket to store log files of training models or access tensorflow graphs used in `NerDLApproach` |
+| `spark.jsl.settings.aws.credentials.secret_access_key` | `None` | Your AWS secret access key to use your S3 bucket to store log files of training models or access tensorflow graphs used in `NerDLApproach` |
+| `spark.jsl.settings.aws.credentials.session_token` | `None` | Your AWS MFA session token to use your S3 bucket to store log files of training models or access tensorflow graphs used in `NerDLApproach` |
+| `spark.jsl.settings.aws.s3_bucket` | `None` | Your AWS S3 bucket to store log files of training models or access tensorflow graphs used in `NerDLApproach` |
+| `spark.jsl.settings.aws.region` | `None` | Your AWS region to use your S3 bucket to store log files of training models or access tensorflow graphs used in `NerDLApproach` |
+| `spark.jsl.settings.onnx.gpuDeviceId` | `0` | Constructs CUDA execution provider options for the specified non-negative device id. |
+| `spark.jsl.settings.onnx.intraOpNumThreads` | `6` | Sets the size of the CPU thread pool used for executing a single graph, if executing on a CPU. |
+| `spark.jsl.settings.onnx.optimizationLevel` | `ALL_OPT` | Sets the optimization level of this options object, overriding the old setting. |
+| `spark.jsl.settings.onnx.executionMode` | `SEQUENTIAL` | Sets the execution mode of this options object, overriding the old setting. |
+
+### How to set Spark NLP Configuration
+
+**SparkSession:**
+
+You can use `.config()` during SparkSession creation to set Spark NLP configurations.
+
+```python
+from pyspark.sql import SparkSession
+
+spark = SparkSession.builder \
+    .master("local[*]") \
+    .config("spark.driver.memory", "16G") \
+    .config("spark.driver.maxResultSize", "0") \
+    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
+    .config("spark.kryoserializer.buffer.max", "2000m") \
+    .config("spark.jsl.settings.pretrained.cache_folder", "sample_data/pretrained") \
+    .config("spark.jsl.settings.storage.cluster_tmp_dir", "sample_data/storage") \
+    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0") \
+    .getOrCreate()
+```
+
+**spark-shell:**
+
+```sh
+spark-shell \
+ --driver-memory 16g \
+ --conf spark.driver.maxResultSize=0 \
+ --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
+ --conf spark.kryoserializer.buffer.max=2000M \
+ --conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \
+ --conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \
+ --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0
+```
+
+**pyspark:**
+
+```sh
+pyspark \
+ --driver-memory 16g \
+ --conf spark.driver.maxResultSize=0 \
+ --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
+ --conf spark.kryoserializer.buffer.max=2000M \
+ --conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \
+ --conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \
+ --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0
+```
+
+**Databricks:**
+
+On a new or existing cluster, you need to add the following to the `Advanced Options -> Spark` tab:
+
+```bash
+spark.kryoserializer.buffer.max 2000M
+spark.serializer org.apache.spark.serializer.KryoSerializer
+spark.jsl.settings.pretrained.cache_folder dbfs:/PATH_TO_CACHE
+spark.jsl.settings.storage.cluster_tmp_dir dbfs:/PATH_TO_STORAGE
+spark.jsl.settings.annotator.log_folder dbfs:/PATH_TO_LOGS
+```
+
+NOTE: If this is an existing cluster, after adding new configs or changing existing properties you need to restart it.
+
+
+### S3 Integration
+
+**Logging:**
+
+To configure an S3 path for logging while training models, we need to set up AWS credentials as well as an S3 path:
+
+```python
+spark.conf.set("spark.jsl.settings.annotator.log_folder", "s3://my/s3/path/logs")
+spark.conf.set("spark.jsl.settings.aws.credentials.access_key_id", "MY_KEY_ID")
+spark.conf.set("spark.jsl.settings.aws.credentials.secret_access_key", "MY_SECRET_ACCESS_KEY")
+spark.conf.set("spark.jsl.settings.aws.s3_bucket", "my.bucket")
+spark.conf.set("spark.jsl.settings.aws.region", "my-region")
+```
+
+Now you can check the logs on the S3 path defined in the *spark.jsl.settings.annotator.log_folder* property.
+Make sure to use the *s3://* prefix; otherwise, the default configuration will be used.
+
+**TensorFlow Graphs:**
+
+To reference an S3 location for downloading graphs, we need to set up AWS credentials:
+
+```python
+spark.conf.set("spark.jsl.settings.aws.credentials.access_key_id", "MY_KEY_ID")
+spark.conf.set("spark.jsl.settings.aws.credentials.secret_access_key", "MY_SECRET_ACCESS_KEY")
+spark.conf.set("spark.jsl.settings.aws.region", "my-region")
+```
+
+**MFA Configuration:**
+
+If your AWS account is configured with MFA, you will first need to get temporary credentials and add the session token
+to the configuration, as shown in the example below.
+For logging:
+
+```python
+spark.conf.set("spark.jsl.settings.aws.credentials.session_token", "MY_TOKEN")
+```
+
+An example of a bash script that gets temporary AWS credentials can be
+found [here](https://github.com/JohnSnowLabs/spark-nlp/blob/master/scripts/aws_tmp_credentials.sh).
+This script requires three arguments:
+
+```bash
+./aws_tmp_credentials.sh iam_user duration serial_number
+```
+
+
\ No newline at end of file
diff --git a/docs/en/features.md b/docs/en/features.md
new file mode 100644
index 00000000000000..1a9a5b80470828
--- /dev/null
+++ b/docs/en/features.md
@@ -0,0 +1,120 @@
+---
+layout: docs
+header: true
+seotitle: Spark NLP - Features
+title: Spark NLP - Features
+permalink: /docs/en/features
+key: docs-install
+modify_date: "2024-07-03"
+show_nav: true
+sidebar:
+ nav: sparknlp
+---
+
+
+
+
+## Text Preprocessing
+- Tokenization
+- Trainable Word Segmentation
+- Stop Words Removal
+- Token Normalizer
+- Document Normalizer
+- Document & Text Splitter
+- Stemmer
+- Lemmatizer
+- NGrams
+- Regex Matching
+- Text Matching
+- Spell Checker (ML and DL models)
+
+## Parsing and Analysis
+- Chunking
+- Date Matcher
+- Sentence Detector
+- Deep Sentence Detector (Deep learning)
+- Dependency parsing (Labeled/unlabeled)
+- SpanBertCorefModel (Coreference Resolution)
+- Part-of-speech tagging
+- Named entity recognition (Deep learning)
+- Unsupervised keywords extraction
+- Language Detection & Identification (up to 375 languages)
+
+## Sentiment and Classification
+- Sentiment Detection (ML models)
+- Multi-class & Multi-label Sentiment analysis (Deep learning)
+- Multi-class Text Classification (Deep learning)
+- Zero-Shot NER Model
+- Zero-Shot Text Classification by Transformers (ZSL)
+
+## Embeddings
+- Word Embeddings (GloVe and Word2Vec)
+- Doc2Vec (based on Word2Vec)
+- BERT Embeddings (TF Hub & HuggingFace models)
+- DistilBERT Embeddings (HuggingFace models)
+- CamemBERT Embeddings (HuggingFace models)
+- RoBERTa Embeddings (HuggingFace models)
+- DeBERTa Embeddings (HuggingFace v2 & v3 models)
+- XLM-RoBERTa Embeddings (HuggingFace models)
+- Longformer Embeddings (HuggingFace models)
+- ALBERT Embeddings (TF Hub & HuggingFace models)
+- XLNet Embeddings
+- ELMO Embeddings (TF Hub models)
+- Universal Sentence Encoder (TF Hub models)
+- BERT Sentence Embeddings (TF Hub & HuggingFace models)
+- RoBerta Sentence Embeddings (HuggingFace models)
+- XLM-RoBerta Sentence Embeddings (HuggingFace models)
+- INSTRUCTOR Embeddings (HuggingFace models)
+- E5 Embeddings (HuggingFace models)
+- MPNet Embeddings (HuggingFace models)
+- UAE Embeddings (HuggingFace models)
+- OpenAI Embeddings
+- Sentence & Chunk Embeddings
+
+## Classification and Question Answering Models
+- BERT for Token & Sequence Classification & Question Answering
+- DistilBERT for Token & Sequence Classification & Question Answering
+- CamemBERT for Token & Sequence Classification & Question Answering
+- ALBERT for Token & Sequence Classification & Question Answering
+- RoBERTa for Token & Sequence Classification & Question Answering
+- DeBERTa for Token & Sequence Classification & Question Answering
+- XLM-RoBERTa for Token & Sequence Classification & Question Answering
+- Longformer for Token & Sequence Classification & Question Answering
+- MPnet for Token & Sequence Classification & Question Answering
+- XLNet for Token & Sequence Classification
+
+## Machine Translation and Generation
+- Neural Machine Translation (MarianMT)
+- Many-to-Many multilingual translation model (Facebook M2M100)
+- Table Question Answering (TAPAS)
+- Text-To-Text Transfer Transformer (Google T5)
+- Generative Pre-trained Transformer 2 (OpenAI GPT2)
+- Seq2Seq for NLG, Translation, and Comprehension (Facebook BART)
+- Chat and Conversational LLMs (Facebook Llama-2)
+
+## Image and Speech
+- Vision Transformer (Google ViT)
+- Swin Image Classification (Microsoft Swin Transformer)
+- ConvNext Image Classification (Facebook ConvNext)
+- Vision Encoder Decoder for image-to-text like captioning
+- Zero-Shot Image Classification by OpenAI's CLIP
+- Automatic Speech Recognition (Wav2Vec2)
+- Automatic Speech Recognition (HuBERT)
+- Automatic Speech Recognition (OpenAI Whisper)
+
+## Integration and Interoperability
+- Easy [ONNX](https://github.com/JohnSnowLabs/spark-nlp/tree/feature/SPARKNLP-1015-Modernizing-GitHub-repo/examples/python/transformers/onnx), [OpenVINO](https://github.com/JohnSnowLabs/spark-nlp/tree/feature/SPARKNLP-1015-Modernizing-GitHub-repo/examples/python/transformers/openvino), and [TensorFlow](https://github.com/JohnSnowLabs/spark-nlp/tree/feature/SPARKNLP-1015-Modernizing-GitHub-repo/examples/python/transformers) integrations
+- Full integration with Spark ML functions
+- GPU Support
+
+## Pre-trained Models
+- +31000 pre-trained models in +200 languages!
+- +6000 pre-trained pipelines in +200 languages!
+
+#### Please check out our Models Hub for the full list of [pre-trained models](https://sparknlp.org/models) with examples, demos, benchmarks, and more
+
+## Multi-lingual Support
+- Multi-lingual NER models: Arabic, Bengali, Chinese, Danish, Dutch, English, Finnish, French, German, Hebrew, Italian,
+ Japanese, Korean, Norwegian, Persian, Polish, Portuguese, Russian, Spanish, Swedish, Urdu, and more.
+
+
\ No newline at end of file
diff --git a/docs/en/install.md b/docs/en/install.md
index 4bc861a2c0d496..3d32683830df96 100644
--- a/docs/en/install.md
+++ b/docs/en/install.md
@@ -5,7 +5,7 @@ seotitle: Spark NLP - Installation
title: Spark NLP - Installation
permalink: /docs/en/install
key: docs-install
-modify_date: "2023-05-10"
+modify_date: "2024-07-04"
show_nav: true
sidebar:
nav: sparknlp
@@ -35,6 +35,14 @@ spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0
spark-shell --jars spark-nlp-assembly-5.4.0.jar
```
+**GPU (optional):**
+
+Spark NLP 5.4.0 is built with the ONNX 1.17.0 and TensorFlow 2.7.1 deep learning engines. The following NVIDIA® software is required for GPU support:
+
+- NVIDIA® GPU drivers version 450.80.02 or higher
+- CUDA® Toolkit 11.2
+- cuDNN SDK 8.1.0
+
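+With those in place, here is a sketch of starting Spark NLP on a GPU machine from Python (`sparknlp.start` pulls the `spark-nlp-gpu` package when `gpu=True`):
+
+```python
+import sparknlp
+
+# gpu=True selects the spark-nlp-gpu Maven package instead of the CPU one
+spark = sparknlp.start(gpu=True)
+```
+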
## Python
@@ -95,15 +103,73 @@ spark = SparkSession.builder \
.config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0") \
.getOrCreate()
```
+If using local JARs, you can use `spark.jars` instead, with comma-delimited jar files. For cluster setups, of course,
+you'll have to put the JARs in a location reachable by all driver and executor nodes.
+
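+A sketch with a local fat JAR (the path is illustrative; in a cluster setup, use a distributed filesystem such as HDFS, DBFS, or S3):
+
+```python
+from pyspark.sql import SparkSession
+
+spark = SparkSession.builder \
+    .appName("Spark NLP") \
+    .master("local[*]") \
+    .config("spark.jars", "/tmp/spark-nlp-assembly-5.4.0.jar") \
+    .getOrCreate()
+```
+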
+### Python without explicit PySpark installation
+
+#### Pip/Conda
+
+If you installed pyspark through pip/conda, you can install `spark-nlp` through the same channel.
+
+Pip:
+
+```bash
+pip install spark-nlp==5.4.0
+```
+
+Conda:
+
+```bash
+conda install -c johnsnowlabs spark-nlp
+```
+
+PyPI [spark-nlp package](https://pypi.org/project/spark-nlp/) /
+Anaconda [spark-nlp package](https://anaconda.org/JohnSnowLabs/spark-nlp)
+
+Then you'll have to create a SparkSession from Spark NLP:
+
+```python
+import sparknlp
+
+spark = sparknlp.start()
+```
+
+**Quick example:**
+
+```python
+import sparknlp
+from sparknlp.pretrained import PretrainedPipeline
+
+# create or get Spark Session
+
+spark = sparknlp.start()
+
+sparknlp.version()
+spark.version
+
+# download, load and annotate a text by pre-trained pipeline
+
+pipeline = PretrainedPipeline('recognize_entities_dl', 'en')
+result = pipeline.annotate('The Mona Lisa is a 16th century oil painting created by Leonardo')
+```
## Scala and Java
+To use Spark NLP, you need the following:
+
+- Java 8 and 11
+- Apache Spark 3.5.x, 3.4.x, 3.3.x, 3.2.x, 3.1.x, 3.0.x
+
#### Maven
**spark-nlp** on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x
+The `spark-nlp` package has been published to
+the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp).
+
```xml
@@ -240,6 +306,81 @@ as expected.
+
+## Command line
+
+Spark NLP supports all major releases of Apache Spark: 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x.
+These steps require an internet connection.
+
+#### Apache Spark 3.x (3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x - Scala 2.12)
+
+```sh
+# CPU
+
+spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0
+
+pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0
+
+spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0
+```
+
+The `spark-nlp` package has been published to
+the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp).
+
+```sh
+# GPU
+
+spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.4.0
+
+pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.4.0
+
+spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.4.0
+
+```
+
+The `spark-nlp-gpu` package has been published to
+the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-gpu).
+
+```sh
+# AArch64
+
+spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.4.0
+
+pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.4.0
+
+spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.4.0
+
+```
+
+The `spark-nlp-aarch64` package has been published to
+the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-aarch64).
+
+```sh
+# M1/M2 (Apple Silicon)
+
+spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.4.0
+
+pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.4.0
+
+spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.4.0
+
+```
+
+The `spark-nlp-silicon` package has been published to
+the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-silicon).
+
+**NOTE**: In case you are using large pretrained models like UniversalSentenceEncoder, you need to have the following
+set in your SparkSession:
+
+```sh
+spark-shell \
+ --driver-memory 16g \
+ --conf spark.kryoserializer.buffer.max=2000M \
+ --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0
+```
+
+## Installation for M1 & M2 Chips
+
### Scala and Java for M1
Adding Spark NLP to your Scala or Java project is easy:
@@ -370,6 +511,258 @@ Run the following code in Kaggle Kernel and start using spark-nlp right away.
+## Apache Zeppelin
+
+Use either one of the following options:
+
+- Add the following Maven Coordinates to the interpreter's library list
+
+```bash
+com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0
+```
+
+- Add a path to a pre-built jar from [here](#compiled-jars) to the interpreter's library list, making sure the jar is
+  available on the driver path
+
+## Python in Zeppelin
+
+Apart from the previous step, install the Python module through pip:
+
+```bash
+pip install spark-nlp==5.4.0
+```
+
+Or you can install `spark-nlp` from inside Zeppelin by using Conda:
+
+```bash
+python.conda install -c johnsnowlabs spark-nlp
+```
+
+Configure Zeppelin properly and use cells with `%spark.pyspark` or whichever interpreter name you chose.
+
+Finally, in the Zeppelin interpreter settings, make sure you set `zeppelin.python` to the Python binary you want to use
+and that you installed the pip library with it (e.g. `python3`).
+
+An alternative option is to set `SPARK_SUBMIT_OPTIONS` (in `zeppelin-env.sh`) and make sure `--packages` is there, as
+shown earlier, since it covers both the Scala and Python sides of the installation.
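+
+A sketch of that `zeppelin-env.sh` entry:
+
+```bash
+# make the Spark NLP package available to both the Scala and Python interpreters
+export SPARK_SUBMIT_OPTIONS="--packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0"
+```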
+
+## Jupyter Notebook
+
+**Recommended:**
+
+The easiest way to get this done on Linux and macOS is to simply install `spark-nlp` and `pyspark` PyPI packages and
+launch Jupyter from the same Python environment:
+
+```sh
+$ conda create -n sparknlp python=3.8 -y
+$ conda activate sparknlp
+# spark-nlp by default is based on pyspark 3.x
+$ pip install spark-nlp==5.4.0 pyspark==3.3.1 jupyter
+$ jupyter notebook
+```
+
+Then you can use the `python3` kernel to run your code, creating the SparkSession via `spark = sparknlp.start()`.
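+
+For example, in a notebook cell (a sketch; `sparknlp.start()` creates or returns the SparkSession):
+
+```python
+import sparknlp
+
+spark = sparknlp.start()
+print(sparknlp.version(), spark.version)
+```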
+
+**Optional:**
+
+If you are on a different operating system and need to run Jupyter Notebook via pyspark, you can follow
+these steps:
+
+```bash
+export SPARK_HOME=/path/to/your/spark/folder
+export PYSPARK_PYTHON=python3
+export PYSPARK_DRIVER_PYTHON=jupyter
+export PYSPARK_DRIVER_PYTHON_OPTS=notebook
+
+pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0
+```
+
+Alternatively, you can combine the `--jars` option for pyspark with `pip install spark-nlp`.
+
+If you are not using pyspark at all, you'll have to follow the instructions
+pointed out [here](#python-without-explicit-pyspark-installation).
+
+## Databricks Cluster
+
+1. Create a cluster, if you don't have one already.
+
+2. On a new or existing cluster, add the following to the `Advanced Options -> Spark` tab:
+
+ ```bash
+ spark.kryoserializer.buffer.max 2000M
+ spark.serializer org.apache.spark.serializer.KryoSerializer
+ ```
+
+3. In the `Libraries` tab of your cluster, follow these steps:
+
+ 3.1. Install New -> PyPI -> `spark-nlp==5.4.0` -> Install
+
+ 3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0` -> Install
+
+4. Now you can attach your notebook to the cluster and use Spark NLP!
+
+NOTE: Databricks' runtimes support different Apache Spark major releases. Please make sure you choose the correct Spark
+NLP Maven package name (Maven Coordinate) for your runtime from
+our [Packages Cheatsheet](https://github.com/JohnSnowLabs/spark-nlp#packages-cheatsheet)
+
+## EMR Cluster
+
+To launch EMR clusters with Apache Spark/PySpark and Spark NLP correctly, you need a bootstrap script and a software
+configuration.
+
+A sample bootstrap script:
+
+```sh
+#!/bin/bash
+set -x -e
+
+echo -e 'export PYSPARK_PYTHON=/usr/bin/python3
+export HADOOP_CONF_DIR=/etc/hadoop/conf
+export SPARK_JARS_DIR=/usr/lib/spark/jars
+export SPARK_HOME=/usr/lib/spark' >> $HOME/.bashrc && source $HOME/.bashrc
+
+sudo python3 -m pip install awscli boto spark-nlp
+
+set +x
+exit 0
+
+```
+
+A sample software configuration in JSON on S3 (must be publicly accessible):
+
+```json
+[{
+ "Classification": "spark-env",
+ "Configurations": [{
+ "Classification": "export",
+ "Properties": {
+ "PYSPARK_PYTHON": "/usr/bin/python3"
+ }
+ }]
+},
+{
+ "Classification": "spark-defaults",
+ "Properties": {
+ "spark.yarn.stagingDir": "hdfs:///tmp",
+ "spark.yarn.preserve.staging.files": "true",
+ "spark.kryoserializer.buffer.max": "2000M",
+ "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
+ "spark.driver.maxResultSize": "0",
+ "spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0"
+ }
+}]
+```
+
+A sample AWS CLI command to launch an EMR cluster:
+
+```sh
+aws emr create-cluster \
+--name "Spark NLP 5.4.0" \
+--release-label emr-6.2.0 \
+--applications Name=Hadoop Name=Spark Name=Hive \
+--instance-type m4.4xlarge \
+--instance-count 3 \
+--use-default-roles \
+--log-uri "s3://
/" \
+--bootstrap-actions Path=s3:///emr-bootstrap.sh,Name=custome \
+--configurations "https:///sparknlp-config.json" \
+--ec2-attributes KeyName=,EmrManagedMasterSecurityGroup=,EmrManagedSlaveSecurityGroup= \
+--profile
+```
+
+## GCP Dataproc
+
+1. Create a cluster, if you don't have one already, as follows.
+
+In the gcloud shell:
+
+```bash
+gcloud services enable dataproc.googleapis.com \
+ compute.googleapis.com \
+ storage-component.googleapis.com \
+ bigquery.googleapis.com \
+ bigquerystorage.googleapis.com
+```
+
+```bash
+REGION=<region>
+```
+
+```bash
+BUCKET_NAME=<bucket_name>
+gsutil mb -c standard -l ${REGION} gs://${BUCKET_NAME}
+```
+
+```bash
+REGION=<region>
+ZONE=<zone>
+CLUSTER_NAME=<cluster_name>
+BUCKET_NAME=<bucket_name>
+```
+
+You can set `image-version`, `master-machine-type`, `worker-machine-type`,
+`master-boot-disk-size`, `worker-boot-disk-size`, and `num-workers` as needed.
+If you use an image-version earlier than 2.0, you should also add `ANACONDA` to `optional-components`,
+and you should enable the component gateway.
+Don't forget to set the Maven coordinates for the jar in `properties`.
+
+```bash
+gcloud dataproc clusters create ${CLUSTER_NAME} \
+ --region=${REGION} \
+ --zone=${ZONE} \
+ --image-version=2.0 \
+ --master-machine-type=n1-standard-4 \
+ --worker-machine-type=n1-standard-2 \
+ --master-boot-disk-size=128GB \
+ --worker-boot-disk-size=128GB \
+ --num-workers=2 \
+ --bucket=${BUCKET_NAME} \
+ --optional-components=JUPYTER \
+ --enable-component-gateway \
+ --metadata 'PIP_PACKAGES=spark-nlp spark-nlp-display google-cloud-bigquery google-cloud-storage' \
+ --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh \
+ --properties spark:spark.serializer=org.apache.spark.serializer.KryoSerializer,spark:spark.driver.maxResultSize=0,spark:spark.kryoserializer.buffer.max=2000M,spark:spark.jars.packages=com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0
+```
+
+2. On an existing cluster, install the `spark-nlp` and `spark-nlp-display` packages from PyPI (see the sketch after this list).
+
+3. Now you can attach your notebook to the cluster and use Spark NLP!
+
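+For step 2, a sketch of the PyPI install (run on the cluster, e.g. through an initialization action or over SSH):
+
+```bash
+pip install spark-nlp==5.4.0 spark-nlp-display
+```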
+
+## Apache Spark Support
+
+Spark NLP *5.4.0* has been built on top of Apache Spark 3.4 and fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x.
+
+| Spark NLP | Apache Spark 3.5.x | Apache Spark 3.4.x | Apache Spark 3.3.x | Apache Spark 3.2.x | Apache Spark 3.1.x | Apache Spark 3.0.x | Apache Spark 2.4.x | Apache Spark 2.3.x |
+|-----------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|
+| 5.4.x | YES | YES | YES | YES | YES | YES | NO | NO |
+| 5.3.x | YES | YES | YES | YES | YES | YES | NO | NO |
+| 5.2.x | YES | YES | YES | YES | YES | YES | NO | NO |
+| 5.1.x | Partially | YES | YES | YES | YES | YES | NO | NO |
+| 5.0.x | YES | YES | YES | YES | YES | YES | NO | NO |
+| 4.4.x | YES | YES | YES | YES | YES | YES | NO | NO |
+| 4.3.x | NO | NO | YES | YES | YES | YES | NO | NO |
+| 4.2.x | NO | NO | YES | YES | YES | YES | NO | NO |
+| 4.1.x | NO | NO | YES | YES | YES | YES | NO | NO |
+| 4.0.x | NO | NO | YES | YES | YES | YES | NO | NO |
+
+Find out more about `Spark NLP` versions from our [release notes](https://github.com/JohnSnowLabs/spark-nlp/releases).
+
+## Scala and Python Support
+
+| Spark NLP | Python 3.6 | Python 3.7 | Python 3.8 | Python 3.9 | Python 3.10| Scala 2.11 | Scala 2.12 |
+|-----------|------------|------------|------------|------------|------------|------------|------------|
+| 5.3.x | NO | YES | YES | YES | YES | NO | YES |
+| 5.2.x | NO | YES | YES | YES | YES | NO | YES |
+| 5.1.x | NO | YES | YES | YES | YES | NO | YES |
+| 5.0.x | NO | YES | YES | YES | YES | NO | YES |
+| 4.4.x | NO | YES | YES | YES | YES | NO | YES |
+| 4.3.x | YES | YES | YES | YES | YES | NO | YES |
+| 4.2.x | YES | YES | YES | YES | YES | NO | YES |
+| 4.1.x | YES | YES | YES | YES | NO | NO | YES |
+| 4.0.x | YES | YES | YES | YES | NO | NO | YES |
+
+
## Databricks Support
Spark NLP 5.4.0 has been tested and is compatible with the following runtimes:
@@ -867,4 +1260,44 @@ PipelineModel.load("/tmp/explain_document_dl_en_2.0.2_2.4_1556530585689/")
- Since you are downloading and loading models/pipelines manually, this means Spark NLP is not downloading the most recent and compatible models/pipelines for you. Choosing the right model/pipeline is on you
- If you are local, you can load the model/pipeline from your local FileSystem, however, if you are in a cluster setup you need to put the model/pipeline on a distributed FileSystem such as HDFS, DBFS, S3, etc. (i.e., `hdfs:///tmp/explain_document_dl_en_2.0.2_2.4_1556530585689/`)
+
+## Compiled JARs
+
+### Build from source
+
+#### spark-nlp
+
+- FAT-JAR for CPU on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x
+
+```bash
+sbt assembly
+```
+
+- FAT-JAR for GPU on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x
+
+```bash
+sbt -Dis_gpu=true assembly
+```
+
+- FAT-JAR for M1 (Apple Silicon) on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x
+
+```bash
+sbt -Dis_silicon=true assembly
+```
+
+### Using the jar manually
+
+If for some reason you need to use the JAR directly, you can either download the provided Fat JARs or get them
+from [Maven Central](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp).
+
+To add JARs to Spark programs, use the `--jars` option:
+
+```sh
+spark-shell --jars spark-nlp.jar
+```
+
+The preferred way to use the library when running Spark programs is the `--packages` option, as specified in
+the `spark-packages` section.
+
+
diff --git a/docs/en/pipelines.md b/docs/en/pipelines.md
index 43728d43863270..0204f8c62b88f9 100644
--- a/docs/en/pipelines.md
+++ b/docs/en/pipelines.md
@@ -5,7 +5,7 @@ seotitle: Spark NLP - Pipelines
title: Spark NLP - Pipelines
permalink: /docs/en/pipelines
key: docs-pipelines
-modify_date: "2021-11-20"
+modify_date: "2024-07-04"
show_nav: true
sidebar:
nav: sparknlp
@@ -13,96 +13,24 @@ sidebar:
-Pretrained Pipelines have moved to Models Hub.
-Please follow this link for the updated list of all models and pipelines:
-[Models Hub](https://sparknlp.org/models)
-{:.success}
-
-
-
-## English
-
-**NOTE:**
-`noncontrib` pipelines are compatible with `Windows` operating systems.
-
-{:.table-model-big}
-| Pipelines | Name |
-| -------------------- | ---------------------- |
-| [Explain Document ML](#explaindocumentml) | `explain_document_ml`
-| [Explain Document DL](#explaindocumentdl) | `explain_document_dl`
-| [Explain Document DL Win]() | `explain_document_dl_noncontrib`
-| Explain Document DL Fast | `explain_document_dl_fast`
-| Explain Document DL Fast Win | `explain_document_dl_fast_noncontrib` |
-| [Recognize Entities DL](#recognizeentitiesdl) | `recognize_entities_dl` |
-| Recognize Entities DL Win | `recognize_entities_dl_noncontrib` |
-| [OntoNotes Entities Small](#ontorecognizeentitiessm) | `onto_recognize_entities_sm` |
-| [OntoNotes Entities Large](#ontorecognizeentitieslg) | `onto_recognize_entities_lg` |
-| [Match Datetime](#matchdatetime) | `match_datetime` |
-| [Match Pattern](#matchpattern) | `match_pattern` |
-| [Match Chunk](#matchchunks) | `match_chunks` |
-| Match Phrases | `match_phrases`|
-| Clean Stop | `clean_stop`|
-| Clean Pattern | `clean_pattern`|
-| Clean Slang | `clean_slang`|
-| Check Spelling | `check_spelling`|
-| Analyze Sentiment | `analyze_sentiment` |
-| Analyze Sentiment DL | `analyze_sentimentdl_use_imdb` |
-| Analyze Sentiment DL | `analyze_sentimentdl_use_twitter` |
-| Dependency Parse | `dependency_parse` |
-
-
-
-### explain_document_ml
-
-{% highlight scala %}
-import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
-import com.johnsnowlabs.nlp.SparkNLP
-
-SparkNLP.version()
-
-val testData = spark.createDataFrame(Seq(
-(1, "Google has announced the release of a beta version of the popular TensorFlow machine learning library"),
-(2, "The Paris metro will soon enter the 21st century, ditching single-use paper tickets for rechargeable electronic cards.")
-)).toDF("id", "text")
+## Pipelines and Models
-val pipeline = PretrainedPipeline("explain_document_ml", lang="en")
+### Pipelines
-val annotation = pipeline.transform(testData)
-
-annotation.show()
-
-/*
-2.0.8
-testData: org.apache.spark.sql.DataFrame = [id: int, text: string]
-pipeline: com.johnsnowlabs.nlp.pretrained.PretrainedPipeline = PretrainedPipeline(explain_document_ml,en,public/models)
-annotation: org.apache.spark.sql.DataFrame = [id: int, text: string ... 7 more fields]
-+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
-| id| text| document| sentence| token| checked| lemmas| stems| pos|
-+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
-| 1|Google has announ...|[[document, 0, 10...|[[document, 0, 10...|[[token, 0, 5, Go...|[[token, 0, 5, Go...|[[token, 0, 5, Go...|[[token, 0, 5, go...|[[pos, 0, 5, NNP,...|
-| 2|The Paris metro w...|[[document, 0, 11...|[[document, 0, 11...|[[token, 0, 2, Th...|[[token, 0, 2, Th...|[[token, 0, 2, Th...|[[token, 0, 2, th...|[[pos, 0, 2, DT, ...|
-+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
-*/
-
-{% endhighlight %}
-
-
-
-### explain_document_dl
-
-{% highlight scala %}
+**Quick example:**
+```scala
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.SparkNLP
SparkNLP.version()
val testData = spark.createDataFrame(Seq(
-(1, "Google has announced the release of a beta version of the popular TensorFlow machine learning library"),
-(2, "Donald John Trump (born June 14, 1946) is the 45th and current president of the United States")
+ (1, "Google has announced the release of a beta version of the popular TensorFlow machine learning library"),
+ (2, "Donald John Trump (born June 14, 1946) is the 45th and current president of the United States")
)).toDF("id", "text")
-val pipeline = PretrainedPipeline("explain_document_dl", lang="en")
+val pipeline = PretrainedPipeline("explain_document_dl", lang = "en")
val annotation = pipeline.transform(testData)
@@ -110,7 +38,7 @@ annotation.show()
/*
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.SparkNLP
-2.0.8
+2.5.0
testData: org.apache.spark.sql.DataFrame = [id: int, text: string]
pipeline: com.johnsnowlabs.nlp.pretrained.PretrainedPipeline = PretrainedPipeline(explain_document_dl,en,public/models)
annotation: org.apache.spark.sql.DataFrame = [id: int, text: string ... 10 more fields]
@@ -132,888 +60,141 @@ annotation.select("entities.result").show(false)
|[Donald John Trump, United States]|
+----------------------------------+
*/
+```
-{% endhighlight %}
+#### Showing Available Pipelines
-
+There are functions in Spark NLP that will list all the available Pipelines
+of a particular language for you:
-### recognize_entities_dl
-
-{% highlight scala %}
-
-import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
-import com.johnsnowlabs.nlp.SparkNLP
-
-SparkNLP.version()
-
-val testData = spark.createDataFrame(Seq(
-(1, "Google has announced the release of a beta version of the popular TensorFlow machine learning library"),
-(2, "Donald John Trump (born June 14, 1946) is the 45th and current president of the United States")
-)).toDF("id", "text")
-
-val pipeline = PretrainedPipeline("recognize_entities_dl", lang="en")
-
-val annotation = pipeline.transform(testData)
-
-annotation.show()
-
-/*
-import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
-import com.johnsnowlabs.nlp.SparkNLP
-2.0.8
-testData: org.apache.spark.sql.DataFrame = [id: int, text: string]
-pipeline: com.johnsnowlabs.nlp.pretrained.PretrainedPipeline = PretrainedPipeline(entity_recognizer_dl,en,public/models)
-annotation: org.apache.spark.sql.DataFrame = [id: int, text: string ... 6 more fields]
-+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
-| id| text| document| sentence| token| embeddings| ner| ner_converter|
-+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
-| 1|Google has announ...|[[document, 0, 10...|[[document, 0, 10...|[[token, 0, 5, Go...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 5, Go...|
-| 2|Donald John Trump...|[[document, 0, 92...|[[document, 0, 92...|[[token, 0, 5, Do...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 16, D...|
-+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
-*/
-
-annotation.select("entities.result").show(false)
+```scala
+import com.johnsnowlabs.nlp.pretrained.ResourceDownloader
+ResourceDownloader.showPublicPipelines(lang = "en")
/*
-+----------------------------------+
-|result |
-+----------------------------------+
-|[Google, TensorFlow] |
-|[Donald John Trump, United States]|
-+----------------------------------+
++--------------------------------------------+------+---------+
+| Pipeline | lang | version |
++--------------------------------------------+------+---------+
+| dependency_parse | en | 2.0.2 |
+| analyze_sentiment_ml | en | 2.0.2 |
+| check_spelling | en | 2.1.0 |
+| match_datetime | en | 2.1.0 |
+ ...
+| explain_document_ml | en | 3.1.3 |
++--------------------------------------------+------+---------+
*/
+```
-{% endhighlight %}
-
-
+Or if we want to check for a particular version:
-### onto_recognize_entities_sm
-
-Trained by **NerDLApproach** annotator with **Char CNNs - BiLSTM - CRF** and **GloVe Embeddings** on the **OntoNotes** corpus and supports the identification of 18 entities.
-
-{% highlight scala %}
-
-import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
-import com.johnsnowlabs.nlp.SparkNLP
-
-SparkNLP.version()
-
-val testData = spark.createDataFrame(Seq(
-(1, "Johnson first entered politics when elected in 2001 as a member of Parliament. He then served eight years as the mayor of London, from 2008 to 2016, before rejoining Parliament. "),
-(2, "A little less than a decade later, dozens of self-driving startups have cropped up while automakers around the world clamor, wallet in hand, to secure their place in the fast-moving world of fully automated transportation.")
-)).toDF("id", "text")
-
-val pipeline = PretrainedPipeline("onto_recognize_entities_sm", lang="en")
-
-val annotation = pipeline.transform(testData)
-
-annotation.show()
-
-/*
-import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
-import com.johnsnowlabs.nlp.SparkNLP
-2.1.0
-testData: org.apache.spark.sql.DataFrame = [id: int, text: string]
-pipeline: com.johnsnowlabs.nlp.pretrained.PretrainedPipeline = PretrainedPipeline(onto_recognize_entities_sm,en,public/models)
-annotation: org.apache.spark.sql.DataFrame = [id: int, text: string ... 6 more fields]
-+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
-| id| text| document| token| embeddings| ner| entities|
-+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
-| 1|Johnson first ent...|[[document, 0, 17...|[[token, 0, 6, Jo...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 6, Jo...|
-| 2|A little less tha...|[[document, 0, 22...|[[token, 0, 0, A,...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 32, A...|
-+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
-*/
-
-annotation.select("entities.result").show(false)
+```scala
+import com.johnsnowlabs.nlp.pretrained.ResourceDownloader
+ResourceDownloader.showPublicPipelines(lang = "en", version = "3.1.0")
/*
-+---------------------------------------------------------------------------------+
-|result |
-+---------------------------------------------------------------------------------+
-|[Johnson, first, 2001, Parliament, eight years, London, 2008 to 2016, Parliament]|
-|[A little less than a decade later, dozens] |
-+---------------------------------------------------------------------------------+
++---------------------------------------+------+---------+
+| Pipeline | lang | version |
++---------------------------------------+------+---------+
+| dependency_parse | en | 2.0.2 |
+ ...
+| clean_slang | en | 3.0.0 |
+| clean_pattern | en | 3.0.0 |
+| check_spelling | en | 3.0.0 |
+| dependency_parse | en | 3.0.0 |
++---------------------------------------+------+---------+
*/
+```
-{% endhighlight %}
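+Any name from these listings can be passed straight to `PretrainedPipeline` and used with `transform` or `annotate` as shown earlier; for instance, picking `clean_slang` from the listing above:
+
+```scala
+import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
+
+// download and run a pipeline straight from the listing
+val slangPipeline = PretrainedPipeline("clean_slang", lang = "en")
+slangPipeline.annotate("yo, what's up bro!")
+```
+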
+#### Please check out our Models Hub for the full list of [pre-trained pipelines](https://sparknlp.org/models) with examples, demos, benchmarks, and more
-
+### Models
-### onto_recognize_entities_lg
+**Some selected languages:** `Afrikaans, Arabic, Armenian, Basque, Bengali, Breton, Bulgarian, Catalan, Czech, Dutch, English, Esperanto, Finnish, French, Galician, German, Greek, Hausa, Hebrew, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Latin, Latvian, Marathi, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Somali, Southern Sotho, Spanish, Swahili, Swedish, Tswana, Turkish, Ukrainian, Zulu`
-Trained by **NerDLApproach** annotator with **Char CNNs - BiLSTM - CRF** and **GloVe Embeddings** on the **OntoNotes** corpus and supports the identification of 18 entities.
+**Quick online example:**
-{% highlight scala %}
+```python
+from sparknlp.annotator import NerDLModel
+
+# load NER model trained by deep learning approach and GloVe word embeddings
+ner_dl = NerDLModel.pretrained('ner_dl')
+# load NER model trained by deep learning approach and BERT word embeddings
+ner_bert = NerDLModel.pretrained('ner_dl_bert')
+```
-import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
-import com.johnsnowlabs.nlp.SparkNLP
+```scala
+import com.johnsnowlabs.nlp.annotator._
+
+// load a French POS tagger model trained on Universal Dependencies
+val french_pos = PerceptronModel.pretrained("pos_ud_gsd", lang = "fr")
+// load an Italian LemmatizerModel
+val italian_lemma = LemmatizerModel.pretrained("lemma_dxc", lang = "it")
+```
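+These loaded annotators are regular Spark ML stages, so they slot into a `Pipeline` together with the usual upstream annotators. A minimal sketch, not part of the original example (the stage wiring and the pairing of `ner_dl` with `glove_100d` embeddings are spelled out here as assumptions):
+
+```scala
+import com.johnsnowlabs.nlp.DocumentAssembler
+import com.johnsnowlabs.nlp.annotator._
+import org.apache.spark.ml.Pipeline
+
+// turn raw text into the `document` annotation expected downstream
+val documentAssembler = new DocumentAssembler()
+  .setInputCol("text")
+  .setOutputCol("document")
+
+val tokenizer = new Tokenizer()
+  .setInputCols("document")
+  .setOutputCol("token")
+
+// ner_dl was trained on GloVe 100d vectors, so pair it with the matching embeddings model
+val embeddings = WordEmbeddingsModel.pretrained("glove_100d")
+  .setInputCols("document", "token")
+  .setOutputCol("embeddings")
+
+val ner = NerDLModel.pretrained("ner_dl")
+  .setInputCols("document", "token", "embeddings")
+  .setOutputCol("ner")
+
+val nerPipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings, ner))
+```
+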
-SparkNLP.version()
+**Quick offline example:**
-val testData = spark.createDataFrame(Seq(
-(1, "Johnson first entered politics when elected in 2001 as a member of Parliament. He then served eight years as the mayor of London, from 2008 to 2016, before rejoining Parliament. "),
-(2, "A little less than a decade later, dozens of self-driving startups have cropped up while automakers around the world clamor, wallet in hand, to secure their place in the fast-moving world of fully automated transportation.")
-)).toDF("id", "text")
-
-val pipeline = PretrainedPipeline("onto_recognize_entities_lg", lang="en")
+- Loading a `PerceptronModel` annotator model inside a Spark NLP pipeline
-val annotation = pipeline.transform(testData)
+```scala
+import com.johnsnowlabs.nlp.annotator._
+
+val french_pos = PerceptronModel.load("/tmp/pos_ud_gsd_fr_2.0.2_2.4_1556531457346/")
+  .setInputCols("document", "token")
+  .setOutputCol("pos")
+```
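+The offline-loaded annotator then behaves exactly like its `pretrained()` counterpart. A quick sketch of running it, assuming the `documentAssembler` and `tokenizer` stages from the sketch above and a DataFrame `df` with a `text` column:
+
+```scala
+import org.apache.spark.ml.Pipeline
+
+// the loaded stage drops into a Pipeline like any downloaded model
+val posPipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, french_pos))
+posPipeline.fit(df).transform(df).select("pos.result").show(false)
+```
+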
-annotation.show()
+#### Showing Available Models
-/*
-import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
-import com.johnsnowlabs.nlp.SparkNLP
-2.1.0
-testData: org.apache.spark.sql.DataFrame = [id: int, text: string]
-pipeline: com.johnsnowlabs.nlp.pretrained.PretrainedPipeline = PretrainedPipeline(onto_recognize_entities_lg,en,public/models)
-annotation: org.apache.spark.sql.DataFrame = [id: int, text: string ... 6 more fields]
-+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
-| id| text| document| token| embeddings| ner| entities|
-+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
-| 1|Johnson first ent...|[[document, 0, 17...|[[token, 0, 6, Jo...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 6, Jo...|
-| 2|A little less tha...|[[document, 0, 22...|[[token, 0, 0, A,...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 32, A...|
-+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
-*/
+There are functions in Spark NLP that will list all the available Models
+of a particular Annotator and language for you:
-annotation.select("entities.result").show(false)
+```scala
+import com.johnsnowlabs.nlp.pretrained.ResourceDownloader
+ResourceDownloader.showPublicModels(annotator = "NerDLModel", lang = "en")
/*
-+-------------------------------------------------------------------------------+
-|result |
-+-------------------------------------------------------------------------------+
-|[Johnson, first, 2001, Parliament, eight years, London, 2008, 2016, Parliament]|
-|[A little less than a decade later, dozens] |
-+-------------------------------------------------------------------------------+
++---------------------------------------------+------+---------+
+| Model | lang | version |
++---------------------------------------------+------+---------+
+| onto_100 | en | 2.1.0 |
+| onto_300 | en | 2.1.0 |
+| ner_dl_bert | en | 2.2.0 |
+| onto_100 | en | 2.4.0 |
+| ner_conll_elmo | en | 3.2.2 |
++---------------------------------------------+------+---------+
*/
+```
-{% endhighlight %}
-
-
-
-### match_datetime
-
-#### DateMatcher yyyy/MM/dd
-
-{% highlight scala %}
-
-import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
-import com.johnsnowlabs.nlp.SparkNLP
-
-SparkNLP.version()
-
-val testData = spark.createDataFrame(Seq(
-(1, "I would like to come over and see you in 01/02/2019."),
-(2, "Donald John Trump (born June 14, 1946) is the 45th and current president of the United States")
-)).toDF("id", "text")
+Or if we want to check for a particular version:
-val pipeline = PretrainedPipeline("match_datetime", lang="en")
-
-val annotation = pipeline.transform(testData)
-
-annotation.show()
+```scala
+import com.johnsnowlabs.nlp.pretrained.ResourceDownloader
+ResourceDownloader.showPublicModels(annotator = "NerDLModel", lang = "en", version = "3.1.0")
/*
-import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
-import com.johnsnowlabs.nlp.SparkNLP
-2.0.8
-testData: org.apache.spark.sql.DataFrame = [id: int, text: string]
-pipeline: com.johnsnowlabs.nlp.pretrained.PretrainedPipeline = PretrainedPipeline(match_datetime,en,public/models)
-annotation: org.apache.spark.sql.DataFrame = [id: int, text: string ... 4 more fields]
-+---+--------------------+--------------------+--------------------+--------------------+--------------------+
-| id| text| document| sentence| token| date|
-+---+--------------------+--------------------+--------------------+--------------------+--------------------+
-| 1|I would like to c...|[[document, 0, 51...|[[document, 0, 51...|[[token, 0, 0, I,...|[[date, 41, 50, 2...|
-| 2|Donald John Trump...|[[document, 0, 92...|[[document, 0, 92...|[[token, 0, 5, Do...|[[date, 24, 36, 1...|
-+---+--------------------+--------------------+--------------------+--------------------+--------------------+
++----------------------------+------+---------+
+| Model | lang | version |
++----------------------------+------+---------+
+| onto_100 | en | 2.1.0 |
+| ner_aspect_based_sentiment | en | 2.6.2 |
+| ner_weibo_glove_840B_300d | en | 2.6.2 |
+| nerdl_atis_840b_300d | en | 2.7.1 |
+| nerdl_snips_100d | en | 2.7.3 |
++----------------------------+------+---------+
*/
+```
-annotation.select("date.result").show(false)
+And to see a list of available annotators, you can use:
-/*
-+------------+
-|result |
-+------------+
-|[2019/01/02]|
-|[1946/06/14]|
-+------------+
-*/
-
-{% endhighlight %}
-
-
-
-### match_pattern
-
-RegexMatcher (match phone numbers)
-
-{% highlight scala %}
-
-import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
-import com.johnsnowlabs.nlp.SparkNLP
-
-SparkNLP.version()
-
-val testData = spark.createDataFrame(Seq(
-(1, "You should call Mr. Jon Doe at +33 1 79 01 22 89")
-)).toDF("id", "text")
-
-val pipeline = PretrainedPipeline("match_pattern", lang="en")
-
-val annotation = pipeline.transform(testData)
-
-annotation.show()
-
-/*
-import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
-import com.johnsnowlabs.nlp.SparkNLP
-2.0.8
-testData: org.apache.spark.sql.DataFrame = [id: int, text: string]
-pipeline: com.johnsnowlabs.nlp.pretrained.PretrainedPipeline = PretrainedPipeline(match_pattern,en,public/models)
-annotation: org.apache.spark.sql.DataFrame = [id: int, text: string ... 4 more fields]
-+---+--------------------+--------------------+--------------------+--------------------+--------------------+
-| id| text| document| sentence| token| regex|
-+---+--------------------+--------------------+--------------------+--------------------+--------------------+
-| 1|You should call M...|[[document, 0, 47...|[[document, 0, 47...|[[token, 0, 2, Yo...|[[chunk, 31, 47, ...|
-+---+--------------------+--------------------+--------------------+--------------------+--------------------+
-*/
-
-annotation.select("regex.result").show(false)
-
-/*
-+-------------------+
-|result |
-+-------------------+
-|[+33 1 79 01 22 89]|
-+-------------------+
-*/
-
-{% endhighlight %}
-
-
-
-### match_chunks
-
-The pipeline uses regex `<DT>?<JJ>*<NN>+`
-
-{% highlight scala %}
-
-import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
-import com.johnsnowlabs.nlp.SparkNLP
-
-SparkNLP.version()
-
-val testData = spark.createDataFrame(Seq(
-(1, "The book has many chapters"),
-(2, "the little yellow dog barked at the cat")
-)).toDF("id", "text")
-
-val pipeline = PretrainedPipeline("match_chunks", lang="en")
-
-val annotation = pipeline.transform(testData)
-
-annotation.show()
-
-/*
-import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
-import com.johnsnowlabs.nlp.SparkNLP
-2.0.8
-testData: org.apache.spark.sql.DataFrame = [id: int, text: string]
-pipeline: com.johnsnowlabs.nlp.pretrained.PretrainedPipeline = PretrainedPipeline(match_chunks,en,public/models)
-annotation: org.apache.spark.sql.DataFrame = [id: int, text: string ... 5 more fields]
-+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
-| id| text| document| sentence| token| pos| chunk|
-+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
-| 1|The book has many...|[[document, 0, 25...|[[document, 0, 25...|[[token, 0, 2, Th...|[[pos, 0, 2, DT, ...|[[chunk, 0, 7, Th...|
-| 2|the little yellow...|[[document, 0, 38...|[[document, 0, 38...|[[token, 0, 2, th...|[[pos, 0, 2, DT, ...|[[chunk, 0, 20, t...|
-+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
-*/
-
-annotation.select("chunk.result").show(false)
-
-/*
-+--------------------------------+
-|result |
-+--------------------------------+
-|[The book] |
-|[the little yellow dog, the cat]|
-+--------------------------------+
-*/
-
-{% endhighlight %}
-
-
-
-## French
-
-{:.table-model-big}
-| Pipelines | Name |
-| ----------------------- | --------------------- |
-| [Explain Document Large](#french-explain_document_lg) | `explain_document_lg` |
-| [Explain Document Medium](#french-explain_document_md) | `explain_document_md` |
-| [Entity Recognizer Large](#french-entity_recognizer_lg) | `entity_recognizer_lg` |
-| [Entity Recognizer Medium](#french-entity_recognizer_md) | `entity_recognizer_md` |
-
-{:.table-model-big}
-|Feature | Description|
-|---|----|
-|**NER**|Trained by **NerDLApproach** annotator with **Char CNNs - BiLSTM - CRF** and **GloVe Embeddings** on the **WikiNER** corpus and supports the identification of `PER`, `LOC`, `ORG` and `MISC` entities
-|**Lemma**|Trained by **Lemmatizer** annotator on **lemmatization-lists** by `Michal Měchura`
-|**POS**| Trained by **PerceptronApproach** annotator on the [Universal Dependencies](https://universaldependencies.org/treebanks/fr_gsd/index.html)
-|**Size**| Model size indicator, **md** and **lg**. The large pipeline uses **glove_840B_300** and the medium uses **glove_6B_300** WordEmbeddings
-
-
-
-### French explain_document_lg
-
-{% highlight scala %}
-
-import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
-import com.johnsnowlabs.nlp.SparkNLP
-
-SparkNLP.version()
-
-val pipeline = PretrainedPipeline("explain_document_lg", lang="fr")
-
-val testData = spark.createDataFrame(Seq(
-(1, "Contrairement à Quentin Tarantino, le cinéma français ne repart pas les mains vides de la compétition cannoise."),
-(2, "Emmanuel Jean-Michel Frédéric Macron est le fils de Jean-Michel Macron, né en 1950, médecin, professeur de neurologie au CHU d'Amiens4 et responsable d'enseignement à la faculté de médecine de cette même ville5, et de Françoise Noguès, médecin conseil à la Sécurité sociale")
-)).toDF("id", "text")
-
-val annotation = pipeline.transform(testData)
-
-annotation.show()
+```scala
+import com.johnsnowlabs.nlp.pretrained.ResourceDownloader
+ResourceDownloader.showAvailableAnnotators()
/*
-import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
-import com.johnsnowlabs.nlp.SparkNLP
-2.0.8
-pipeline: com.johnsnowlabs.nlp.pretrained.PretrainedPipeline = PretrainedPipeline(explain_document_lg,fr,public/models)
-testData: org.apache.spark.sql.DataFrame = [id: bigint, text: string]
-annotation: org.apache.spark.sql.DataFrame = [id: bigint, text: string ... 8 more fields]
-+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
-| id| text| document| token| sentence| lemma| pos| embeddings| ner| entities|
-+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
-| 0|Contrairement à Q...|[[document, 0, 11...|[[token, 0, 12, C...|[[document, 0, 11...|[[token, 0, 12, C...|[[pos, 0, 12, ADV...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 16, 32, ...|
-| 1|Emmanuel Jean-Mic...|[[document, 0, 27...|[[token, 0, 7, Em...|[[document, 0, 27...|[[token, 0, 7, Em...|[[pos, 0, 7, PROP...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 35, E...|
-+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
+AlbertEmbeddings
+AlbertForTokenClassification
+AssertionDLModel
+...
+XlmRoBertaSentenceEmbeddings
+XlnetEmbeddings
*/
+```
-annotation.select("entities.result").show(false)
-
-/*
-+-------------------------------------------------------------------------------------------------------------+
-|result |
-+-------------------------------------------------------------------------------------------------------------+
-|[Quentin Tarantino] |
-|[Emmanuel Jean-Michel Frédéric Macron, Jean-Michel Macron, CHU d'Amiens4, Françoise Noguès, Sécurité sociale]|
-+-------------------------------------------------------------------------------------------------------------+
-*/
-
-{% endhighlight %}
-
-
-
-### French explain_document_md
-
-{% highlight scala %}
-
-import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
-import com.johnsnowlabs.nlp.SparkNLP
-
-SparkNLP.version()
-
-val pipeline = PretrainedPipeline("explain_document_md", lang="fr")
-
-val testData = spark.createDataFrame(Seq(
-(1, "Contrairement à Quentin Tarantino, le cinéma français ne repart pas les mains vides de la compétition cannoise."),
-(2, "Emmanuel Jean-Michel Frédéric Macron est le fils de Jean-Michel Macron, né en 1950, médecin, professeur de neurologie au CHU d'Amiens4 et responsable d'enseignement à la faculté de médecine de cette même ville5, et de Françoise Noguès, médecin conseil à la Sécurité sociale")
-)).toDF("id", "text")
-
-val annotation = pipeline.transform(testData)
-
-annotation.show()
-
-/*
-import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
-import com.johnsnowlabs.nlp.SparkNLP
-2.0.8
-pipeline: com.johnsnowlabs.nlp.pretrained.PretrainedPipeline = PretrainedPipeline(explain_document_md,fr,public/models)
-testData: org.apache.spark.sql.DataFrame = [id: bigint, text: string]
-annotation: org.apache.spark.sql.DataFrame = [id: bigint, text: string ... 8 more fields]
-+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
-| id| text| document| token| sentence| lemma| pos| embeddings| ner| entities|
-+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
-| 0|Contrairement à Q...|[[document, 0, 11...|[[token, 0, 12, C...|[[document, 0, 11...|[[token, 0, 12, C...|[[pos, 0, 12, ADV...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 16, 32, ...|
-| 1|Emmanuel Jean-Mic...|[[document, 0, 27...|[[token, 0, 7, Em...|[[document, 0, 27...|[[token, 0, 7, Em...|[[pos, 0, 7, PROP...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 35, E...|
-+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
-*/
-
-annotation.select("entities.result").show(false)
-
-/*
-+----------------------------------------------------------------------------------------------------------------+
-|result |
-+----------------------------------------------------------------------------------------------------------------+
-|[Quentin Tarantino] |
-|[Emmanuel Jean-Michel Frédéric Macron, Jean-Michel Macron, au CHU d'Amiens4, Françoise Noguès, Sécurité sociale]|
-+----------------------------------------------------------------------------------------------------------------+
-*/
-
-{% endhighlight %}
-
-
-
-### French entity_recognizer_lg
-
-{% highlight scala %}
-
-import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
-import com.johnsnowlabs.nlp.SparkNLP
-
-SparkNLP.version()
-
-val pipeline = PretrainedPipeline("entity_recognizer_lg", lang="fr")
-
-val testData = spark.createDataFrame(Seq(
-(1, "Contrairement à Quentin Tarantino, le cinéma français ne repart pas les mains vides de la compétition cannoise."),
-(2, "Emmanuel Jean-Michel Frédéric Macron est le fils de Jean-Michel Macron, né en 1950, médecin, professeur de neurologie au CHU d'Amiens4 et responsable d'enseignement à la faculté de médecine de cette même ville5, et de Françoise Noguès, médecin conseil à la Sécurité sociale")
-)).toDF("id", "text")
-
-val annotation = pipeline.transform(testData)
-
-annotation.show()
-
-/*
-+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
-| id| text| document| token| sentence| embeddings| ner| entities|
-+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
-| 0|Contrairement à Q...|[[document, 0, 11...|[[token, 0, 12, C...|[[document, 0, 11...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 16, 32, ...|
-| 1|Emmanuel Jean-Mic...|[[document, 0, 27...|[[token, 0, 7, Em...|[[document, 0, 27...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 35, E...|
-+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
-*/
-
-annotation.select("entities.result").show(false)
-
-/*
-+-------------------------------------------------------------------------------------------------------------+
-|result |
-+-------------------------------------------------------------------------------------------------------------+
-|[Quentin Tarantino] |
-|[Emmanuel Jean-Michel Frédéric Macron, Jean-Michel Macron, CHU d'Amiens4, Françoise Noguès, Sécurité sociale]|
-+-------------------------------------------------------------------------------------------------------------+
-*/
-
-{% endhighlight %}
-
-
-
-### French entity_recognizer_md
-
-{% highlight scala %}
-
-import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
-import com.johnsnowlabs.nlp.SparkNLP
-
-SparkNLP.version()
-
-val pipeline = PretrainedPipeline("entity_recognizer_md", lang="fr")
-
-val testData = spark.createDataFrame(Seq(
-(1, "Contrairement à Quentin Tarantino, le cinéma français ne repart pas les mains vides de la compétition cannoise."),
-(2, "Emmanuel Jean-Michel Frédéric Macron est le fils de Jean-Michel Macron, né en 1950, médecin, professeur de neurologie au CHU d'Amiens4 et responsable d'enseignement à la faculté de médecine de cette même ville5, et de Françoise Noguès, médecin conseil à la Sécurité sociale")
-)).toDF("id", "text")
-
-val annotation = pipeline.transform(testData)
-
-annotation.show()
-
-/*
-+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
-| id| text| document| token| sentence| embeddings| ner| entities|
-+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
-| 0|Contrairement à Q...|[[document, 0, 11...|[[token, 0, 12, C...|[[document, 0, 11...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 16, 32, ...|
-| 1|Emmanuel Jean-Mic...|[[document, 0, 27...|[[token, 0, 7, Em...|[[document, 0, 27...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 35, E...|
-+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
-*/
-
-annotation.select("entities.result").show(false)
-
-/*
-+----------------------------------------------------------------------------------------------------------------+
-|result |
-+----------------------------------------------------------------------------------------------------------------+
-|[Quentin Tarantino] |
-|[Emmanuel Jean-Michel Frédéric Macron, Jean-Michel Macron, au CHU d'Amiens4, Françoise Noguès, Sécurité sociale]|
-+----------------------------------------------------------------------------------------------------------------+
-*/
-
-{% endhighlight %}
-
-
-
-## Italian
-
-{:.table-model-big}
-| Pipelines | Name |
-| ----------------------- | --------------------- |
-| [Explain Document Large](#italian-explain_document_lg) | `explain_document_lg` |
-| [Explain Document Medium](#italian-explain_document_md) | `explain_document_md` |
-| [Entity Recognizer Large](#italian-entity_recognizer_lg) | `entity_recognizer_lg` |
-| [Entity Recognizer Medium](#italian-entity_recognizer_md) | `entity_recognizer_md` |
-
-{:.table-model-big}
-|Feature | Description|
-|---|----|
-|**NER**|Trained by **NerDLApproach** annotator with **Char CNNs - BiLSTM - CRF** and **GloVe Embeddings** on the **WikiNER** corpus and supports the identification of `PER`, `LOC`, `ORG` and `MISC` entities
-|**Lemma**|Trained by **Lemmatizer** annotator on **DXC Technology** dataset
-|**POS**| Trained by **PerceptronApproach** annotator on the [Universal Dependencies](https://universaldependencies.org/treebanks/it_isdt/index.html)
-|**Size**| Model size indicator, **md** and **lg**. The large pipeline uses **glove_840B_300** and the medium uses **glove_6B_300** WordEmbeddings
-
-
-
-### Italian explain_document_lg
-
-{% highlight scala %}
-
-import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
-import com.johnsnowlabs.nlp.SparkNLP
-
-SparkNLP.version()
-
-val pipeline = PretrainedPipeline("explain_document_lg", lang="it")
-
-val testData = spark.createDataFrame(Seq(
-(1, "La FIFA ha deciso: tre giornate a Zidane, due a Materazzi"),
-(2, "Reims, 13 giugno 2019 – Domani può essere la giornata decisiva per il passaggio agli ottavi di finale dei Mondiali femminili.")
-)).toDF("id", "text")
-
-val annotation = pipeline.transform(testData)
-
-annotation.show()
-
-/*
-import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
-import com.johnsnowlabs.nlp.SparkNLP
-2.0.8
-pipeline: com.johnsnowlabs.nlp.pretrained.PretrainedPipeline = PretrainedPipeline(explain_document_lg,it,public/models)
-testData: org.apache.spark.sql.DataFrame = [id: int, text: string]
-annotation: org.apache.spark.sql.DataFrame = [id: int, text: string ... 8 more fields]
-+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
-| id| text| document| token| sentence| lemma| pos| embeddings| ner| entities|
-+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
-| 1|La FIFA ha deciso...|[[document, 0, 56...|[[token, 0, 1, La...|[[document, 0, 56...|[[token, 0, 1, La...|[[pos, 0, 1, DET,...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 3, 6, FI...|
-| 2|Reims, 13 giugno ...|[[document, 0, 12...|[[token, 0, 4, Re...|[[document, 0, 12...|[[token, 0, 4, Re...|[[pos, 0, 4, PROP...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 4, Re...|
-+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
-*/
-
-annotation.select("entities.result").show(false)
-
-/*
-+-----------------------------------+
-|result |
-+-----------------------------------+
-|[FIFA, Zidane, Materazzi] |
-|[Reims, Domani, Mondiali femminili]|
-+-----------------------------------+
-*/
-
-{% endhighlight %}
-
-
-
-### Italian explain_document_md
-
-{% highlight scala %}
-
-import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
-import com.johnsnowlabs.nlp.SparkNLP
-
-SparkNLP.version()
-
-val pipeline = PretrainedPipeline("explain_document_md", lang="it")
-
-val testData = spark.createDataFrame(Seq(
-(1, "La FIFA ha deciso: tre giornate a Zidane, due a Materazzi"),
-(2, "Reims, 13 giugno 2019 – Domani può essere la giornata decisiva per il passaggio agli ottavi di finale dei Mondiali femminili.")
-)).toDF("id", "text")
-
-val annotation = pipeline.transform(testData)
-
-annotation.show()
-
-/*
-import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
-import com.johnsnowlabs.nlp.SparkNLP
-2.0.8
-pipeline: com.johnsnowlabs.nlp.pretrained.PretrainedPipeline = PretrainedPipeline(explain_document_md,it,public/models)
-testData: org.apache.spark.sql.DataFrame = [id: int, text: string]
-annotation: org.apache.spark.sql.DataFrame = [id: int, text: string ... 8 more fields]
-+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
-| id| text| document| token| sentence| lemma| pos| embeddings| ner| entities|
-+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
-| 1|La FIFA ha deciso...|[[document, 0, 56...|[[token, 0, 1, La...|[[document, 0, 56...|[[token, 0, 1, La...|[[pos, 0, 1, DET,...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 9, La...|
-| 2|Reims, 13 giugno ...|[[document, 0, 12...|[[token, 0, 4, Re...|[[document, 0, 12...|[[token, 0, 4, Re...|[[pos, 0, 4, PROP...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 4, Re...|
-+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
-*/
-
-annotation.select("entities.result").show(false)
-
-/*
-+----------------------------+
-|result                      |
-+----------------------------+
-|[La FIFA, Zidane, Materazzi]|
-|[Reims, Domani, Mondiali]   |
-+----------------------------+
-*/
-
-{% endhighlight %}
-
-
-
-### Italian entity_recognizer_lg
-
-{% highlight scala %}
-
-import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
-import com.johnsnowlabs.nlp.SparkNLP
-
-SparkNLP.version()
-
-val pipeline = PretrainedPipeline("entity_recognizer_lg", lang="it")
-
-val testData = spark.createDataFrame(Seq(
-(1, "La FIFA ha deciso: tre giornate a Zidane, due a Materazzi"),
-(2, "Reims, 13 giugno 2019 – Domani può essere la giornata decisiva per il passaggio agli ottavi di finale dei Mondiali femminili.")
-)).toDF("id", "text")
-
-val annotation = pipeline.transform(testData)
-
-annotation.show()
-
-/*
-import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
-import com.johnsnowlabs.nlp.SparkNLP
-2.0.8
-pipeline: com.johnsnowlabs.nlp.pretrained.PretrainedPipeline = PretrainedPipeline(entity_recognizer_lg,it,public/models)
-testData: org.apache.spark.sql.DataFrame = [id: int, text: string]
-annotation: org.apache.spark.sql.DataFrame = [id: int, text: string ... 6 more fields]
-+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
-| id| text| document| token| sentence| embeddings| ner| entities|
-+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
-| 1|La FIFA ha deciso...|[[document, 0, 56...|[[token, 0, 1, La...|[[document, 0, 56...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 3, 6, FI...|
-| 2|Reims, 13 giugno ...|[[document, 0, 12...|[[token, 0, 4, Re...|[[document, 0, 12...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 4, Re...|
-+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
-*/
-
-annotation.select("entities.result").show(false)
-
-/*
-+-----------------------------------+
-|result |
-+-----------------------------------+
-|[FIFA, Zidane, Materazzi] |
-|[Reims, Domani, Mondiali femminili]|
-+-----------------------------------+
-*/
-
-{% endhighlight %}
-
-
-
-### Italian entity_recognizer_md
-
-{% highlight scala %}
-
-import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
-import com.johnsnowlabs.nlp.SparkNLP
-
-SparkNLP.version()
-
-val pipeline = PretrainedPipeline("entity_recognizer_md", lang="it")
-
-val testData = spark.createDataFrame(Seq(
-(1, "La FIFA ha deciso: tre giornate a Zidane, due a Materazzi"),
-(2, "Reims, 13 giugno 2019 – Domani può essere la giornata decisiva per il passaggio agli ottavi di finale dei Mondiali femminili.")
-)).toDF("id", "text")
-
-val annotation = pipeline.transform(testData)
-
-annotation.show()
-
-/*
-import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
-import com.johnsnowlabs.nlp.SparkNLP
-2.0.8
-pipeline: com.johnsnowlabs.nlp.pretrained.PretrainedPipeline = PretrainedPipeline(entity_recognizer_md,it,public/models)
-testData: org.apache.spark.sql.DataFrame = [id: int, text: string]
-annotation: org.apache.spark.sql.DataFrame = [id: int, text: string ... 6 more fields]
-+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
-| id| text| document| token| sentence| embeddings| ner| entities|
-+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
-| 1|La FIFA ha deciso...|[[document, 0, 56...|[[token, 0, 1, La...|[[document, 0, 56...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 9, La...|
-| 2|Reims, 13 giugno ...|[[document, 0, 12...|[[token, 0, 4, Re...|[[document, 0, 12...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 4, Re...|
-+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
-*/
-
-annotation.select("entities.result").show(false)
-
-/*
-+----------------------------+
-|result                      |
-+----------------------------+
-|[La FIFA, Zidane, Materazzi]|
-|[Reims, Domani, Mondiali]   |
-+----------------------------+
-*/
-
-{% endhighlight %}
-
-
-
-## Spanish
-
-{:.table-model-big}
-| Pipeline | Name | Build | lang | Description | Offline |
-|:-------------------------|:-----------------------|:-------|:-------|:----------|:----------|
-| Explain Document Small | `explain_document_sm` | 2.4.0 | `es` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_sm_es_2.4.0_2.4_1581977077084.zip) |
-| Explain Document Medium | `explain_document_md` | 2.4.0 | `es` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_md_es_2.4.0_2.4_1581976836224.zip) |
-| Explain Document Large | `explain_document_lg` | 2.4.0 | `es` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_lg_es_2.4.0_2.4_1581975536033.zip) |
-| Entity Recognizer Small | `entity_recognizer_sm` | 2.4.0 | `es` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_sm_es_2.4.0_2.4_1581978479912.zip) |
-| Entity Recognizer Medium | `entity_recognizer_md` | 2.4.0 | `es` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_md_es_2.4.0_2.4_1581978260094.zip) |
-| Entity Recognizer Large | `entity_recognizer_lg` | 2.4.0 | `es` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_lg_es_2.4.0_2.4_1581977172660.zip) |
-
-{:.table-model-big}
-| Feature | Description |
-|:----------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| **Lemma** | Trained by **Lemmatizer** annotator on **lemmatization-lists** by `Michal Měchura` |
-| **POS** | Trained by **PerceptronApproach** annotator on the [Universal Dependencies](https://universaldependencies.org/treebanks/es_gsd/index.html) |
-| **NER** | Trained by **NerDLApproach** annotator with **Char CNNs - BiLSTM - CRF** and **GloVe Embeddings** on the **WikiNER** corpus and supports the identification of `PER`, `LOC`, `ORG` and `MISC` entities |
-|**Size**| Model size indicator, **sm**, **md**, and **lg**. The small pipelines use **glove_100d**, the medium pipelines use **glove_6B_300**, and large pipelines use **glove_840B_300** WordEmbeddings
-
-
-
-## Russian
-
-{:.table-model-big}
-| Pipeline | Name | Build | lang | Description | Offline |
-|:-------------------------|:-----------------------|:-------|:-------|:----------|:----------|
-| Explain Document Small | `explain_document_sm` | 2.4.4 | `ru` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_sm_ru_2.4.4_2.4_1584017142719.zip) |
-| Explain Document Medium | `explain_document_md` | 2.4.4 | `ru` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_md_ru_2.4.4_2.4_1584016917220.zip) |
-| Explain Document Large | `explain_document_lg` | 2.4.4 | `ru` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_lg_ru_2.4.4_2.4_1584015824836.zip) |
-| Entity Recognizer Small | `entity_recognizer_sm` | 2.4.4 | `ru` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_sm_ru_2.4.4_2.4_1584018543619.zip) |
-| Entity Recognizer Medium | `entity_recognizer_md` | 2.4.4 | `ru` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_md_ru_2.4.4_2.4_1584018332357.zip) |
-| Entity Recognizer Large | `entity_recognizer_lg` | 2.4.4 | `ru` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_lg_ru_2.4.4_2.4_1584017227871.zip) |
-
-{:.table-model-big}
-| Feature | Description |
-|:----------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| **Lemma** | Trained by **Lemmatizer** annotator on the [Universal Dependencies](https://universaldependencies.org/treebanks/ru_gsd/index.html)|
-| **POS** | Trained by **PerceptronApproach** annotator on the [Universal Dependencies](https://universaldependencies.org/treebanks/ru_gsd/index.html) |
-| **NER** | Trained by **NerDLApproach** annotator with **Char CNNs - BiLSTM - CRF** and **GloVe Embeddings** on the **WikiNER** corpus and supports the identification of `PER`, `LOC`, `ORG` and `MISC` entities |
-
-
-
-## Dutch
-
-{:.table-model-big}
-| Pipeline | Name | Build | lang | Description | Offline |
-|:-------------------------|:-----------------------|:-------|:-------|:----------|:----------|
-| Explain Document Small | `explain_document_sm` | 2.5.0 | `nl` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_sm_nl_2.5.0_2.4_1588546621618.zip) |
-| Explain Document Medium | `explain_document_md` | 2.5.0 | `nl` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_md_nl_2.5.0_2.4_1588546605329.zip) |
-| Explain Document Large | `explain_document_lg` | 2.5.0 | `nl` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_lg_nl_2.5.0_2.4_1588612556770.zip) |
-| Entity Recognizer Small | `entity_recognizer_sm` | 2.5.0 | `nl` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_sm_nl_2.5.0_2.4_1588546655907.zip) |
-| Entity Recognizer Medium | `entity_recognizer_md` | 2.5.0 | `nl` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_md_nl_2.5.0_2.4_1588546645304.zip) |
-| Entity Recognizer Large | `entity_recognizer_lg` | 2.5.0 | `nl` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_lg_nl_2.5.0_2.4_1588612569958.zip) |
-
-
-
-## Norwegian
-
-{:.table-model-big}
-| Pipeline | Name | Build | lang | Description | Offline |
-|:-------------------------|:-----------------------|:-------|:-------|:----------|:----------|
-| Explain Document Small | `explain_document_sm` | 2.5.0 | `no` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_sm_no_2.5.0_2.4_1588784132955.zip) |
-| Explain Document Medium | `explain_document_md` | 2.5.0 | `no` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_md_no_2.5.0_2.4_1588783879809.zip) |
-| Explain Document Large | `explain_document_lg` | 2.5.0 | `no` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_lg_no_2.5.0_2.4_1588782610672.zip) |
-| Entity Recognizer Small | `entity_recognizer_sm` | 2.5.0 | `no` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_sm_no_2.5.0_2.4_1588794567766.zip) |
-| Entity Recognizer Medium | `entity_recognizer_md` | 2.5.0 | `no` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_md_no_2.5.0_2.4_1588794357614.zip) |
-| Entity Recognizer Large | `entity_recognizer_lg` | 2.5.0 | `no` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_lg_no_2.5.0_2.4_1588793261642.zip) |
-
-
-
-## Polish
-
-{:.table-model-big}
-| Pipeline | Name | Build | lang | Description | Offline |
-|:-------------------------|:-----------------------|:-------|:-------|:----------|:----------|
-| Explain Document Small | `explain_document_sm` | 2.5.0 | `pl` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_sm_pl_2.5.0_2.4_1588531081173.zip) |
-| Explain Document Medium | `explain_document_md` | 2.5.0 | `pl` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_md_pl_2.5.0_2.4_1588530841737.zip) |
-| Explain Document Large | `explain_document_lg` | 2.5.0 | `pl` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_lg_pl_2.5.0_2.4_1588529695577.zip) |
-| Entity Recognizer Small | `entity_recognizer_sm` | 2.5.0 | `pl` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_sm_pl_2.5.0_2.4_1588532616080.zip) |
-| Entity Recognizer Medium | `entity_recognizer_md` | 2.5.0 | `pl` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_md_pl_2.5.0_2.4_1588532376753.zip) |
-| Entity Recognizer Large | `entity_recognizer_lg` | 2.5.0 | `pl` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_lg_pl_2.5.0_2.4_1588531171903.zip) |
-
-
-
-## Portuguese
-
-{:.table-model-big}
-| Pipeline | Name | Build | lang | Description | Offline |
-|:-------------------------|:-----------------------|:-------|:-------|:----------|:----------|
-| Explain Document Small | `explain_document_sm` | 2.5.0 | `pt` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_sm_pt_2.5.0_2.4_1588501423743.zip) |
-| Explain Document Medium | `explain_document_md` | 2.5.0 | `pt` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_md_pt_2.5.0_2.4_1588501189804.zip) |
-| Explain Document Large | `explain_document_lg` | 2.5.0 | `pt` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/explain_document_lg_pt_2.5.0_2.4_1588500056427.zip) |
-| Entity Recognizer Small | `entity_recognizer_sm` | 2.5.0 | `pt` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_sm_pt_2.5.0_2.4_1588502815900.zip) |
-| Entity Recognizer Medium | `entity_recognizer_md` | 2.5.0 | `pt` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_md_pt_2.5.0_2.4_1588502606198.zip) |
-| Entity Recognizer Large | `entity_recognizer_lg` | 2.5.0 | `pt` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/entity_recognizer_lg_pt_2.5.0_2.4_1588501526324.zip) |
-
-
-
-## Multi-language
-
-{:.table-model-big}
-| Pipeline | Name | Build | lang | Description | Offline |
-|:-------------------------|:-----------------------|:-------|:-------|:----------|:----------|
-| LanguageDetectorDL | `detect_language_7` | 2.5.2 | `xx` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/detect_language_7_xx_2.5.0_2.4_1591875676774.zip) |
-| LanguageDetectorDL | `detect_language_20` | 2.5.2 | `xx` | | [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/detect_language_20_xx_2.5.0_2.4_1591875683182.zip) |
-
-* The model with 7 languages: Czech, German, English, Spanish, French, Italian, and Slovak
-* The model with 20 languages: Bulgarian, Czech, German, Greek, English, Spanish, Finnish, French, Croatian, Hungarian, Italian, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Swedish, Turkish, and Ukrainian
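+For reference, a hedged sketch of running one of these detectors as a pretrained pipeline (the `"language"` output key is our assumption about this pipeline's output column):
+
+```scala
+import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
+
+// detect the language of a snippet with the 20-language model
+val langPipeline = PretrainedPipeline("detect_language_20", lang = "xx")
+langPipeline.annotate("Spark NLP est une bibliothèque open source pour le NLP")("language")
+```
+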
-
-
-
-## How to use
-
-### Online
-
-To use Spark NLP pretrained pipelines, you can call `PretrainedPipeline` with the pipeline's name and its language (the default is `en`):
-
-{% highlight python %}
-
-from sparknlp.pretrained import PretrainedPipeline
-
-pipeline = PretrainedPipeline('explain_document_dl', lang='en')
-
-{% endhighlight %}
-
-The same in Scala:
-
-{% highlight scala %}
-
-import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
-
-val pipeline = PretrainedPipeline("explain_document_dl", lang="en")
-
-{% endhighlight %}
-
-
-
-### Offline
-
-If you have any trouble using online pipelines or models in your environment (maybe it's air-gapped), you can directly download them for `offline` use.
-
-After downloading offline models/pipelines and extracting them, here is how you can use them inside your code (the path could be shared storage like HDFS in a cluster):
-
-{% highlight scala %}
-import org.apache.spark.ml.PipelineModel
-
-val advancedPipeline = PipelineModel.load("/tmp/explain_document_dl_en_2.0.2_2.4_1556530585689/")
-// To use the loaded Pipeline for prediction
-advancedPipeline.transform(predictionDF)
-
-{% endhighlight %}
+#### Please check out our Models Hub for the full list of [pre-trained models](https://sparknlp.org/models) with examples, demos, benchmarks, and more
\ No newline at end of file