SPARKNLP-742: Improve Examples Folder (#13575)

* SPARKNLP-742: Improve Examples Folder
  - Improved README
  - Added Docker Examples
  - Fixed notebook colab links
  - Cleaned metadata of notebooks
* Fix Lift Warnings
JohnSnowLabs · Mar 14, 2023 · fc5a313
1 parent c129251
Showing 137 changed files with 7,859 additions and 34,783 deletions.
diff --git a/examples/README.md b/examples/README.md
@@ -1,10 +1,71 @@
 # Spark NLP Examples
 
-Under construction
+This directory contains examples of how to use Spark NLP in various environments.
 
-Required maintained examples
+These include examples for Python, Scala, Java and Docker.
 
-- Python
-- Scala
-- Java
-- Docker
+For an introduction to using Spark NLP, take a look at the [Quick
+Start](python/quick_start.ipynb). If you are planning to use Spark NLP on Google Colab,
+see [Quick Start on Google Colab](python/quick_start_google_colab.ipynb). The notebook
+[Spark NLP Basics](python/annotation/text/english/spark-nlp-basics) covers the basics of
+Spark NLP.
+
+For more use-cases and advanced examples, take a look at the following table of contents.
+
+## Table Of Contents
+
+- [Python Examples](python)
+  - [Using Annotators](python/annotation)
+    - [Audio Processing](python/annotation/audio)
+    - [Image Processing](python/annotation/image)
+    - [Text Processing](python/annotation/text)
+      - [Chinese](python/annotation/text/chinese)
+      - [English](python/annotation/text/english)
+        - [Assembling Documents](python/annotation/text/english/document-assembler)
+        - [Assembling Tokens to Documents](python/annotation/text/english/token-assembler)
+        - [Chunking](python/annotation/text/english/chunking)
+        - [Co-reference Resolution](python/annotation/text/english/coreference-resolution)
+        - [Document Normalization](python/annotation/text/english/document-normalizer)
+        - [Embeddings](python/annotation/text/english/embeddings)
+        - [Graph Extraction](python/annotation/text/english/graph-extraction)
+        - [Keyword Extraction](python/annotation/text/english/keyword-extraction)
+        - [Language Detection](python/annotation/text/english/language-detection)
+        - [Matching text using Regex](python/annotation/text/english/regex-matcher)
+        - [Model Downloader](python/annotation/text/english/model-downloader)
+        - [Named Entity Recognition](python/annotation/text/english/named-entity-recognition)
+        - [Pretrained Pipelines](python/annotation/text/english/pretrained-pipelines)
+        - [Question Answering](python/annotation/text/english/question-answering)
+        - [Sentence Detection](python/annotation/text/english/sentence-detection)
+        - [Sentiment Detection](python/annotation/text/english/sentiment-detection)
+        - [Stemming](python/annotation/text/english/stemmer)
+        - [Stop Words Cleaning](python/annotation/text/english/stop-words)
+        - [Text Matching](python/annotation/text/english/text-matcher-pipeline)
+        - [Text Similarity](python/annotation/text/english/text-similarity)
+        - [Tokenization Using Regex](python/annotation/text/english/regex-tokenizer)
+      - [French](python/annotation/text/french)
+      - [German](python/annotation/text/german)
+      - [Italian](python/annotation/text/italian)
+      - [Multilingual](python/annotation/text/multilingual)
+      - [Portuguese](python/annotation/text/portuguese)
+      - [Spanish](python/annotation/text/spanish)
+  - [Training Annotators](python/training)
+    - [Chinese](python/training/chinese)
+    - [English](python/training/english)
+      - [Document Embeddings with Doc2Vec](python/training/english/doc2vec)
+      - [Matching Entities with EntityRuler](python/training/english/entity-ruler)
+      - [Named Entity Recognition with CRF](python/training/english/crf-ner)
+      - [Named Entity Recognition with Deep Learning](python/training/english/dl-ner)
+        - [Creating NerDL Graphs](python/training/english/dl-ner/nerdl-graph)
+      - [Sentiment Analysis](python/training/english/sentiment-detection)
+      - [Text Classification](python/training/english/classification)
+      - [Word embeddings with Word2Vec](python/training/english/word2vec)
+    - [French](python/training/french)
+    - [Italian](python/training/italian)
+  - [Transformers in Spark NLP](python/transformers)
+  - [Logging](python/logging)
+- [Scala Examples](scala)
+  - [Training Annotators](scala/training)
+  - [Using Annotators](scala/annotation)
+- [Java Examples](java)
+- [Spark NLP Setup with Docker](docker)
+- [Utilities](util)
diff --git a/examples/docker/README.md b/examples/docker/README.md
@@ -0,0 +1,95 @@
+# Running Spark NLP in Docker
+
+These example Dockerfiles get you started with using Spark NLP in a Docker
+container.
+
+The following examples set up Jupyter and Scala shells. If you want to run a shell
+inside the containers instead, you can specify `bash` at the end of the `docker run`
+commands.
+
+## Jupyter Notebook (CPU)
+
+The Dockerfile [SparkNLP-CPU.Dockerfile](SparkNLP-CPU.Dockerfile) sets up a Docker
+container with Jupyter Notebook. It is based on the official [Jupyter Docker
+Images](https://jupyter-docker-stacks.readthedocs.io/en/latest/). To run the notebook on
+the default port 8888, we can run:
+
+```bash
+# Build the Docker Image
+docker build -f SparkNLP-CPU.Dockerfile -t sparknlp:latest .
+
+# Run the container and mount the current directory
+docker run -it --name sparknlp-container \
+    -p 8888:8888 \
+    -v "${PWD}":/home/johnsnow/work \
+    sparknlp:latest
+```
+
+### With GPU Support
+
+If you have a compatible NVIDIA GPU, you can use it to speed up our machine learning
+models. Docker supports GPU-accelerated containers through
+[nvidia-docker](https://github.com/NVIDIA/nvidia-docker). The linked repository contains
+instructions on how to set it up for your system. (Note that on Windows, using WSL 2
+with Docker is
+[recommended](https://www.docker.com/blog/wsl-2-gpu-support-for-docker-desktop-on-nvidia-gpus/).)
+
+After setting it up, we can use the Dockerfile
+[SparkNLP-GPU.Dockerfile](SparkNLP-GPU.Dockerfile) to create an image with CUDA
+support. Containers based on this image will then have access to Spark NLP with GPU
+acceleration.
+
+The commands to set it up could look like this:
+
+```bash
+# Build the image
+docker build -f SparkNLP-GPU.Dockerfile -t sparknlp-gpu:latest .
+
+# Start a container with GPU support and mount the current folder
+docker run -it --init --name sparknlp-gpu-container \
+  -p 8888:8888 \
+  -v "${PWD}":/home/johnsnow/work \
+  --gpus all \
+  --ipc=host \
+  sparknlp-gpu:latest
+```
+
+*NOTE*: After starting the container, don't forget to start Spark NLP with
+`sparknlp.start(gpu=True)`. This sets up the right dependencies in Spark.
+
+## Scala Spark Shell
+
+To run Spark NLP in a Scala Spark Shell, we can use the image built in the section
+[Jupyter Notebook (CPU)](#jupyter-notebook-cpu). However, instead of using the default
+entrypoint, we can specify the spark-shell as the command:
+
+```bash
+# Run the container, mount the current directory and run spark-shell with Spark NLP
+docker run -it --name sparknlp-shell-container \
+    -v "${PWD}":/home/johnsnow/work \
+    sparknlp:latest \
+    /usr/local/spark/bin/spark-shell \
+    --conf "spark.driver.memory"="4g" \
+    --conf "spark.serializer"="org.apache.spark.serializer.KryoSerializer" \
+    --conf "spark.kryoserializer.buffer.max"="2000M" \
+    --conf "spark.driver.maxResultSize"="0" \
+    --packages "com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.1"
+```
+
+To run the shell with GPU support, we use the image from [Jupyter Notebook with GPU
+support](#with-gpu-support) and specify the correct package (`spark-nlp-gpu`):
+
+```bash
+# Run the container, mount the current directory and run spark-shell with Spark NLP GPU
+docker run -it --name sparknlp-gpu-shell-container \
+    -v "${PWD}":/home/johnsnow/work \
+    --gpus all \
+    --ipc=host \
+    sparknlp-gpu:latest \
+    /usr/local/bin/spark-shell \
+    --conf "spark.driver.memory"="4g" \
+    --conf "spark.serializer"="org.apache.spark.serializer.KryoSerializer" \
+    --conf "spark.kryoserializer.buffer.max"="2000M" \
+    --conf "spark.driver.maxResultSize"="0" \
+    --packages "com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.3.1"
+```
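In the two `spark-shell` commands of `examples/docker/README.md`, the CPU and GPU variants differ only in the image and in the artifact name passed to `--packages`. The sketch below shows one way to derive that coordinate from a single place. The function name `spark_nlp_package` and its argument order are hypothetical and not part of this commit; the default version `4.3.1` and Scala suffix `2.12` are taken from the README.

```shell
# Hypothetical helper (not part of the commit): print the Maven coordinate
# that the README passes to spark-shell via --packages.
# Usage: spark_nlp_package [cpu|gpu] [spark_nlp_version] [scala_version]
spark_nlp_package() {
    local flavor="${1:-cpu}"
    local version="${2:-4.3.1}"   # version used in the README
    local scala="${3:-2.12}"      # Scala suffix used in the README
    local artifact
    case "$flavor" in
        cpu) artifact="spark-nlp" ;;
        gpu) artifact="spark-nlp-gpu" ;;
        *)
            echo "unknown flavor: $flavor (expected cpu or gpu)" >&2
            return 1
            ;;
    esac
    printf 'com.johnsnowlabs.nlp:%s_%s:%s\n' "$artifact" "$scala" "$version"
}

spark_nlp_package cpu   # prints com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.1
spark_nlp_package gpu   # prints com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.3.1
```

The result could then replace the literal string in the README commands, for example `--packages "$(spark_nlp_package gpu)"`.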
diff --git a/examples/docker/SparkNLP-CPU.Dockerfile b/examples/docker/SparkNLP-CPU.Dockerfile
@@ -0,0 +1,11 @@
+FROM jupyter/pyspark-notebook:java-11.0.15
+
+ARG SPARKNLP_VERSION=4.3.1
+RUN pip install --no-cache-dir spark-nlp==${SPARKNLP_VERSION}
+
+# Create a new user
+ENV NB_USER=johnsnow
+ENV CHOWN_HOME=yes
+ENV CHOWN_HOME_OPTS="-R"
+
+WORKDIR /home/${NB_USER}
diff --git a/examples/docker/SparkNLP-GPU.Dockerfile b/examples/docker/SparkNLP-GPU.Dockerfile
@@ -0,0 +1,39 @@
+FROM tensorflow/tensorflow:2.7.4-gpu
+
+# Fetch keys for apt
+RUN rm /etc/apt/sources.list.d/cuda.list && \
+    apt-key del 7fa2af80 && \
+    apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub
+
+# Install Java Dependency
+RUN apt-get update && \
+    apt-get -y --no-install-recommends install openjdk-8-jre \
+    && apt-get clean \
+    && rm -rf /var/lib/apt/lists/*
+
+# Install Spark NLP and dependencies
+ARG SPARKNLP_VERSION=4.3.1
+ARG PYSPARK_VERSION=3.3.0
+RUN pip install --no-cache-dir \
+    pyspark==${PYSPARK_VERSION} spark-nlp==${SPARKNLP_VERSION} pandas numpy jupyterlab
+
+# Create Local User
+ENV NB_USER=johnsnow
+ENV NB_UID=1000
+
+RUN adduser --disabled-password \
+    --gecos "Default user" \
+    --uid ${NB_UID} \
+    ${NB_USER}
+
+ENV HOME=/home/${NB_USER}
+RUN chown -R ${NB_UID} ${HOME}
+
+ENV PYSPARK_PYTHON=python3
+ENV PYSPARK_DRIVER_PYTHON=python3
+
+USER ${NB_USER}
+WORKDIR ${HOME}
+
+EXPOSE 8888
+CMD ["jupyter", "lab", "--ip", "0.0.0.0"]
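`SparkNLP-GPU.Dockerfile` declares `SPARKNLP_VERSION` and `PYSPARK_VERSION` as `ARG`s, so both can be overridden at build time with `docker build --build-arg`. A minimal sketch follows, assuming a hypothetical helper `gpu_build_command` that only prints the command so it can be reviewed before running. The versioned tag `sparknlp-gpu:<version>` is also an assumption; the README uses `sparknlp-gpu:latest`.

```shell
# Hypothetical helper (not part of the commit): print, but do not run, a
# docker build command that overrides the ARG defaults in SparkNLP-GPU.Dockerfile.
# Usage: gpu_build_command [spark_nlp_version] [pyspark_version]
gpu_build_command() {
    local sparknlp_version="${1:-4.3.1}"   # default from the Dockerfile
    local pyspark_version="${2:-3.3.0}"    # default from the Dockerfile
    printf 'docker build -f SparkNLP-GPU.Dockerfile'
    printf ' --build-arg SPARKNLP_VERSION=%s' "$sparknlp_version"
    printf ' --build-arg PYSPARK_VERSION=%s' "$pyspark_version"
    printf ' -t sparknlp-gpu:%s .\n' "$sparknlp_version"
}

# Show the command for the Dockerfile defaults.
gpu_build_command
```

Run from `examples/docker`, the printed command can be executed with `eval "$(gpu_build_command)"`. Whether a given pair of Spark NLP and PySpark versions works together is not checked here.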