SPARKNLP-742: Improve Examples Folder #13575

Merged
73 changes: 67 additions & 6 deletions examples/README.md
Original file line number Diff line number Diff line change
@@ -1,10 +1,71 @@
# Spark NLP Examples

Under construction
This is the directory for examples on how to use Spark NLP in various environments.

Required maintained examples
These include examples for Python, Scala, Java and Docker.

- Python
- Scala
- Java
- Docker
For an introduction to using Spark NLP, take a look at the [Quick
Start](python/quick_start.ipynb). If you are planning to use Spark NLP on Google Colab,
see [Quick Start on Google Colab](python/quick_start_google_colab.ipynb). The notebook
[Spark NLP Basics](python/annotation/text/english/spark-nlp-basics) covers the basics of
Spark NLP.

For more use-cases and advanced examples, take a look at the following table of contents.

## Table Of Contents

- [Python Examples](python)
  - [Using Annotators](python/annotation)
    - [Audio Processing](python/annotation/audio)
    - [Image Processing](python/annotation/image)
    - [Text Processing](python/annotation/text)
      - [Chinese](python/annotation/text/chinese)
      - [English](python/annotation/text/english)
        - [Assembling Documents](python/annotation/text/english/document-assembler)
        - [Assembling Tokens to Documents](python/annotation/text/english/token-assembler)
        - [Chunking](python/annotation/text/english/chunking)
        - [Co-reference Resolution](python/annotation/text/english/coreference-resolution)
        - [Document Normalization](python/annotation/text/english/document-normalizer)
        - [Embeddings](python/annotation/text/english/embeddings)
        - [Graph Extraction](python/annotation/text/english/graph-extraction)
        - [Keyword Extraction](python/annotation/text/english/keyword-extraction)
        - [Language Detection](python/annotation/text/english/language-detection)
        - [Matching text using Regex](python/annotation/text/english/regex-matcher)
        - [Model Downloader](python/annotation/text/english/model-downloader)
        - [Named Entity Recognition](python/annotation/text/english/named-entity-recognition)
        - [Pretrained Pipelines](python/annotation/text/english/pretrained-pipelines)
        - [Question Answering](python/annotation/text/english/question-answering)
        - [Sentence Detection](python/annotation/text/english/sentence-detection)
        - [Sentiment Detection](python/annotation/text/english/sentiment-detection)
        - [Stemming](python/annotation/text/english/stemmer)
        - [Stop Words Cleaning](python/annotation/text/english/stop-words)
        - [Text Matching](python/annotation/text/english/text-matcher-pipeline)
        - [Text Similarity](python/annotation/text/english/text-similarity)
        - [Tokenization Using Regex](python/annotation/text/english/regex-tokenizer)
      - [French](python/annotation/text/french)
      - [German](python/annotation/text/german)
      - [Italian](python/annotation/text/italian)
      - [Multilingual](python/annotation/text/multilingual)
      - [Portuguese](python/annotation/text/portuguese)
      - [Spanish](python/annotation/text/spanish)
  - [Training Annotators](python/training)
    - [Chinese](python/training/chinese)
    - [English](python/training/english)
      - [Document Embeddings with Doc2Vec](python/training/english/doc2vec)
      - [Matching Entities with EntityRuler](python/training/english/entity-ruler)
      - [Named Entity Recognition with CRF](python/training/english/crf-ner)
      - [Named Entity Recognition with Deep Learning](python/training/english/dl-ner)
        - [Creating NerDL Graphs](python/training/english/dl-ner/nerdl-graph)
      - [Sentiment Analysis](python/training/english/sentiment-detection)
      - [Text Classification](python/training/english/classification)
      - [Word embeddings with Word2Vec](python/training/english/word2vec)
    - [French](python/training/french)
    - [Italian](python/training/italian)
  - [Transformers in Spark NLP](python/transformers)
  - [Logging](python/logging)
- [Scala Examples](scala)
  - [Training Annotators](scala/training)
  - [Using Annotators](scala/annotation)
- [Java Examples](java)
- [SparkNLP Setup with Docker](docker)
- [Utilities](util)
95 changes: 95 additions & 0 deletions examples/docker/README.md
@@ -0,0 +1,95 @@
# Running Spark NLP in Docker

These example Dockerfiles get you started with using Spark NLP in a Docker
container.

The following examples set up Jupyter and Scala shells. If you want to run a shell
inside the containers instead, you can specify `bash` at the end of the `docker run`
commands.
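
As a rough sketch of the `bash` variant described above: the snippet assumes the CPU
image has already been built and tagged `sparknlp:latest` (as in the Jupyter Notebook
section below). The `--rm` flag and the image existence check are additions for
illustration and are not part of the original commands.

```bash
# Image tag assumed from the build command in the Jupyter Notebook (CPU) section
IMAGE="sparknlp:latest"

# Only start the container if Docker is installed and the image exists locally
if command -v docker >/dev/null 2>&1 && docker image inspect "${IMAGE}" >/dev/null 2>&1; then
    # Passing `bash` as the last argument replaces the image's default command
    # with an interactive shell; --rm removes the container on exit
    docker run -it --rm \
        -v "${PWD}":/home/johnsnow/work \
        "${IMAGE}" \
        bash
else
    echo "Image ${IMAGE} not found; build it first." >&2
fi
```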

## Jupyter Notebook (CPU)

The Dockerfile [SparkNLP-CPU.Dockerfile](SparkNLP-CPU.Dockerfile) sets up a Docker
container with Jupyter Notebook. It is based on the official [Jupyter Docker
Images](https://jupyter-docker-stacks.readthedocs.io/en/latest/). To run the notebook on
the default port 8888, we can run:

```bash
# Build the Docker Image
docker build -f SparkNLP-CPU.Dockerfile -t sparknlp:latest .

# Run the container and mount the current directory
docker run -it --name sparknlp-container \
-p 8888:8888 \
-v "${PWD}":/home/johnsnow/work \
sparknlp:latest
```
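
The Dockerfile declares a build argument `SPARKNLP_VERSION` (default `4.3.1`), so the
installed Spark NLP release can be chosen at build time. The following is a minimal
sketch of that; the version-specific image tag and the guard around `docker build` are
assumptions made for illustration.

```bash
# Spark NLP release to install; 4.3.1 is the default declared in the Dockerfile
SPARKNLP_VERSION="4.3.1"

# Hypothetical tag scheme: tag the image with the release it contains
IMAGE_TAG="sparknlp:${SPARKNLP_VERSION}"

# Only build if Docker is installed and the Dockerfile is in the current directory
if command -v docker >/dev/null 2>&1 && [ -f SparkNLP-CPU.Dockerfile ]; then
    # --build-arg overrides the ARG default inside the Dockerfile
    docker build -f SparkNLP-CPU.Dockerfile \
        --build-arg SPARKNLP_VERSION="${SPARKNLP_VERSION}" \
        -t "${IMAGE_TAG}" .
else
    echo "Docker or SparkNLP-CPU.Dockerfile not found; skipping build." >&2
fi
```

If the image is tagged this way, the tag in the `docker run` command has to be changed
accordingly.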

### With GPU Support

If you have a compatible NVIDIA GPU, you can use it to speed up our machine learning
models. Docker supports GPU-accelerated containers through
[nvidia-docker](https://github.com/NVIDIA/nvidia-docker). The linked repository contains
instructions on how to set it up for your system. (Note that on Windows, using WSL 2
with Docker is
[recommended](https://www.docker.com/blog/wsl-2-gpu-support-for-docker-desktop-on-nvidia-gpus/).)

After setting it up, we can use the Dockerfile
[SparkNLP-GPU.Dockerfile](SparkNLP-GPU.Dockerfile) to create an image with CUDA
support. Containers based on this image will then have access to Spark NLP with GPU
acceleration.

The commands to set it up could look like this:

```bash
# Build the image
docker build -f SparkNLP-GPU.Dockerfile -t sparknlp-gpu:latest .

# Start a container with GPU support and mount the current folder
docker run -it --init --name sparknlp-gpu-container \
-p 8888:8888 \
-v "${PWD}":/home/johnsnow/work \
--gpus all \
--ipc=host \
sparknlp-gpu:latest
```

*NOTE*: After starting the container, don't forget to start Spark NLP with
`sparknlp.start(gpu=True)`! This selects the GPU package (`spark-nlp-gpu`) when the
Spark session is created.
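
To check that the GPU setup works before opening a notebook, something along these
lines may help. It is only a sketch: it assumes the image was tagged
`sparknlp-gpu:latest` as above, and that `nvidia-smi` is made available inside the
container by the NVIDIA container runtime, which depends on the host setup.

```bash
# Image tag assumed from the GPU build command above
IMAGE="sparknlp-gpu:latest"

# Only run the checks if Docker is installed and the image exists locally
if command -v docker >/dev/null 2>&1 && docker image inspect "${IMAGE}" >/dev/null 2>&1; then
    # nvidia-smi should list the GPUs that the container can see
    docker run --rm --gpus all "${IMAGE}" nvidia-smi

    # Start a GPU-enabled Spark NLP session and print the library and Spark versions
    docker run --rm --gpus all --ipc=host "${IMAGE}" \
        python3 -c 'import sparknlp; spark = sparknlp.start(gpu=True); print(sparknlp.version(), spark.version)'
else
    echo "Image ${IMAGE} not found; build it first." >&2
fi
```

The second command downloads the Spark NLP package the first time it runs, so it needs
network access and can take a while.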

## Scala Spark Shell

To run Spark NLP in a Scala Spark Shell, we can use the same Dockerfile from Section
[Jupyter Notebook (CPU)](#jupyter-notebook-cpu). However, instead of using the default
entrypoint, we can specify the spark-shell as the command:

```bash
# Run the container, mount the current directory and run spark-shell with Spark NLP
docker run -it --name sparknlp-container \
-v "${PWD}":/home/johnsnow/work \
sparknlp:latest \
/usr/local/spark/bin/spark-shell \
--conf "spark.driver.memory"="4g" \
--conf "spark.serializer"="org.apache.spark.serializer.KryoSerializer" \
--conf "spark.kryoserializer.buffer.max"="2000M" \
--conf "spark.driver.maxResultSize"="0" \
--packages "com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.1"
```
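
Note that this command reuses the container name `sparknlp-container` from the Jupyter
Notebook section. Docker does not allow two containers with the same name, so if the
earlier container still exists, `docker run` reports a name conflict. One possible way
to handle this, sketched below, is to remove the old container first; choosing a
different `--name` works as well.

```bash
# Container name used by the docker run commands in this README
CONTAINER="sparknlp-container"

if command -v docker >/dev/null 2>&1; then
    # -f stops the container first if it is still running;
    # errors (for example, no such container) are ignored
    docker rm -f "${CONTAINER}" >/dev/null 2>&1 || true
fi
```

Removing the container discards anything stored inside it that is not in the mounted
`work` directory.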

To run the shell with GPU support, we use the image from [Jupyter Notebook with GPU
support](#with-gpu-support) and specify the correct package (`spark-nlp-gpu`).

```bash
# Run the container, mount the current directory and run spark-shell with Spark NLP GPU
docker run -it --name sparknlp-container \
-v "${PWD}":/home/johnsnow/work \
--gpus all \
--ipc=host \
sparknlp-gpu:latest \
/usr/local/bin/spark-shell \
--conf "spark.driver.memory"="4g" \
--conf "spark.serializer"="org.apache.spark.serializer.KryoSerializer" \
--conf "spark.kryoserializer.buffer.max"="2000M" \
--conf "spark.driver.maxResultSize"="0" \
--packages "com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.3.1"
```
11 changes: 11 additions & 0 deletions examples/docker/SparkNLP-CPU.Dockerfile
@@ -0,0 +1,11 @@
FROM jupyter/pyspark-notebook:java-11.0.15

ARG SPARKNLP_VERSION=4.3.1
RUN pip install --no-cache-dir spark-nlp==${SPARKNLP_VERSION}

# Create a new user
ENV NB_USER=johnsnow
ENV CHOWN_HOME=yes
ENV CHOWN_HOME_OPTS="-R"

WORKDIR /home/${NB_USER}
39 changes: 39 additions & 0 deletions examples/docker/SparkNLP-GPU.Dockerfile
@@ -0,0 +1,39 @@
FROM tensorflow/tensorflow:2.7.4-gpu

# Fetch keys for apt
RUN rm /etc/apt/sources.list.d/cuda.list && \
apt-key del 7fa2af80 && \
apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub

# Install Java Dependency
RUN apt-get update && \
apt-get -y --no-install-recommends install openjdk-8-jre \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*

# Install Spark NLP and dependencies
ARG SPARKNLP_VERSION=4.3.1
ARG PYSPARK_VERSION=3.3.0
RUN pip install --no-cache-dir \
pyspark==${PYSPARK_VERSION} spark-nlp==${SPARKNLP_VERSION} pandas numpy jupyterlab

# Create Local User
ENV NB_USER=johnsnow
ENV NB_UID=1000

RUN adduser --disabled-password \
--gecos "Default user" \
--uid ${NB_UID} \
${NB_USER}

ENV HOME=/home/${NB_USER}
RUN chown -R ${NB_UID} ${HOME}

ENV PYSPARK_PYTHON=python3
ENV PYSPARK_DRIVER_PYTHON=python3

USER ${NB_USER}
WORKDIR ${HOME}

EXPOSE 8888
CMD ["jupyter", "lab", "--ip", "0.0.0.0"]