Commit
SPARKNLP-742: Improve Examples Folder (#13575)
* SPARKNLP-742: Improve Examples Folder
  - Improved README
  - Added Docker Examples
  - Fixed notebook colab links
  - Cleaned metadata of notebooks
* Fix Lift Warnings
Showing 137 changed files with 7,859 additions and 34,783 deletions.
@@ -1,10 +1,71 @@
# Spark NLP Examples

Under construction
This is the directory for examples on how to use Spark NLP in various environments.

Required maintained examples
These include examples for Python, Scala, Java and Docker.

- Python
- Scala
- Java
- Docker
For an introduction to using Spark NLP, take a look at the [Quick
Start](python/quick_start.ipynb). If you are planning to use Spark NLP on Google Colab,
see [Quick Start on Google Colab](python/quick_start_google_colab.ipynb). The notebook
[Spark NLP Basics](python/annotation/text/english/spark-nlp-basics) covers the basics of
Spark NLP.

For more use-cases and advanced examples, take a look at the following table of contents.

## Table Of Contents

- [Python Examples](python)
  - [Using Annotators](python/annotation)
    - [Audio Processing](python/annotation/audio)
    - [Image Processing](python/annotation/image)
    - [Text Processing](python/annotation/text)
      - [Chinese](python/annotation/text/chinese)
      - [English](python/annotation/text/english)
        - [Assembling Documents](python/annotation/text/english/document-assembler)
        - [Assembling Tokens to Documents](python/annotation/text/english/token-assembler)
        - [Chunking](python/annotation/text/english/chunking)
        - [Co-reference Resolution](python/annotation/text/english/coreference-resolution)
        - [Document Normalization](python/annotation/text/english/document-normalizer)
        - [Embeddings](python/annotation/text/english/embeddings)
        - [Graph Extraction](python/annotation/text/english/graph-extraction)
        - [Keyword Extraction](python/annotation/text/english/keyword-extraction)
        - [Language Detection](python/annotation/text/english/language-detection)
        - [Matching text using Regex](python/annotation/text/english/regex-matcher)
        - [Model Downloader](python/annotation/text/english/model-downloader)
        - [Named Entity Recognition](python/annotation/text/english/named-entity-recognition)
        - [Pretrained Pipelines](python/annotation/text/english/pretrained-pipelines)
        - [Question Answering](python/annotation/text/english/question-answering)
        - [Sentence Detection](python/annotation/text/english/sentence-detection)
        - [Sentiment Detection](python/annotation/text/english/sentiment-detection)
        - [Stemming](python/annotation/text/english/stemmer)
        - [Stop Words Cleaning](python/annotation/text/english/stop-words)
        - [Text Matching](python/annotation/text/english/text-matcher-pipeline)
        - [Text Similarity](python/annotation/text/english/text-similarity)
        - [Tokenization Using Regex](python/annotation/text/english/regex-tokenizer)
      - [French](python/annotation/text/french)
      - [German](python/annotation/text/german)
      - [Italian](python/annotation/text/italian)
      - [Multilingual](python/annotation/text/multilingual)
      - [Portuguese](python/annotation/text/portuguese)
      - [Spanish](python/annotation/text/spanish)
  - [Training Annotators](python/training)
    - [Chinese](python/training/chinese)
    - [English](python/training/english)
      - [Document Embeddings with Doc2Vec](python/training/english/doc2vec)
      - [Matching Entities with EntityRuler](python/training/english/entity-ruler)
      - [Named Entity Recognition with CRF](python/training/english/crf-ner)
      - [Named Entity Recognition with Deep Learning](python/training/english/dl-ner)
        - [Creating NerDL Graphs](python/training/english/dl-ner/nerdl-graph)
      - [Sentiment Analysis](python/training/english/sentiment-detection)
      - [Text Classification](python/training/english/classification)
      - [Word embeddings with Word2Vec](python/training/english/word2vec)
    - [French](python/training/french)
    - [Italian](python/training/italian)
  - [Transformers in Spark NLP](python/transformers)
  - [Logging](python/logging)
- [Scala Examples](scala)
  - [Training Annotators](scala/training)
  - [Using Annotators](scala/annotation)
- [Java Examples](java)
- [SparkNLP Setup with Docker](docker)
- [Utilities](util)
@@ -0,0 +1,95 @@
# Running Spark NLP in Docker

These example Dockerfiles get you started with using Spark NLP in a Docker
container.

The following examples set up Jupyter and Scala shells. If you want a plain Bash shell
inside the containers instead, you can specify `bash` at the end of the `docker run`
commands.

## Jupyter Notebook (CPU)

The Dockerfile [SparkNLP-CPU.Dockerfile](SparkNLP-CPU.Dockerfile) sets up a Docker
container with Jupyter Notebook. It is based on the official [Jupyter Docker
Images](https://jupyter-docker-stacks.readthedocs.io/en/latest/). To run the notebook on
the default port 8888, we can run:

```bash
# Build the Docker image
docker build -f SparkNLP-CPU.Dockerfile -t sparknlp:latest .

# Run the container and mount the current directory
docker run -it --name sparknlp-container \
    -p 8888:8888 \
    -v "${PWD}":/home/johnsnow/work \
    sparknlp:latest
```
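As a minimal sketch of the `bash` variant mentioned in the introduction: the command below reuses the image tag and mount path from the example above and appends `bash` as the command. The `--rm` flag and the `command -v docker` guard are additions for illustration and not part of the original examples; `--rm` deletes the container on exit, and the guard only makes the snippet fail cleanly when Docker is not installed.

```shell
# Image tag and mount target taken from the build/run example above.
IMAGE="sparknlp:latest"
WORKDIR_IN_CONTAINER="/home/johnsnow/work"

if command -v docker >/dev/null 2>&1; then
    # `bash` at the end replaces the image's default Jupyter command.
    # --rm (an addition in this sketch) removes the container when the shell exits.
    docker run -it --rm \
        -v "${PWD}":"${WORKDIR_IN_CONTAINER}" \
        "${IMAGE}" \
        bash
else
    echo "docker not found on PATH" >&2
fi
```

No port is published here because Jupyter is not started in this variant.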

### With GPU Support

If you have a compatible NVIDIA GPU, you can use it to get better performance from our
machine learning models. Docker provides support for GPU accelerated containers with
[nvidia-docker](https://github.com/NVIDIA/nvidia-docker). The linked repository contains
instructions on how to set it up for your system. (Note that on Windows, using WSL 2
with Docker is
[recommended](https://www.docker.com/blog/wsl-2-gpu-support-for-docker-desktop-on-nvidia-gpus/).)

After setting it up, we can use the Dockerfile
[SparkNLP-GPU.Dockerfile](SparkNLP-GPU.Dockerfile) to create an image with CUDA
support. Containers based on this image will then have access to Spark NLP with GPU
acceleration.

The commands to set it up could look like this:

```bash
# Build the image
docker build -f SparkNLP-GPU.Dockerfile -t sparknlp-gpu:latest .

# Start a container with GPU support and mount the current folder
docker run -it --init --name sparknlp-gpu-container \
    -p 8888:8888 \
    -v "${PWD}":/home/johnsnow/work \
    --gpus all \
    --ipc=host \
    sparknlp-gpu:latest
```

*NOTE*: After running the container, don't forget to start Spark NLP with
`sparknlp.start(gpu=True)`! This will set up the right dependencies in Spark.
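
One way to smoke-test this from the host, sketched under a few assumptions: the container from the example above is still running under the name `sparknlp-gpu-container`, `python3` is on its `PATH`, and it has network access so Spark can fetch the GPU package on first start. The check only confirms that a session starts with `gpu=True`; it does not prove that the GPU is actually being used.

```shell
# Container name taken from the `docker run` example above.
CONTAINER="sparknlp-gpu-container"

# Inline Python: start Spark NLP with GPU support and print both versions.
CHECK='import sparknlp; spark = sparknlp.start(gpu=True); print("Spark NLP", sparknlp.version(), "on Spark", spark.version)'

if command -v docker >/dev/null 2>&1; then
    # Run from a second terminal while the container is up.
    docker exec "${CONTAINER}" python3 -c "${CHECK}"
else
    echo "docker not found on PATH" >&2
fi
```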

## Scala Spark Shell

To run Spark NLP in a Scala Spark Shell, we can use the image built from the same
Dockerfile as in Section [Jupyter Notebook (CPU)](#jupyter-notebook-cpu). However, instead
of using the default command, we can specify `spark-shell` as the command:

```bash
# Run the container, mount the current directory and run spark-shell with Spark NLP
docker run -it --name sparknlp-container \
    -v "${PWD}":/home/johnsnow/work \
    sparknlp:latest \
    /usr/local/spark/bin/spark-shell \
    --conf "spark.driver.memory"="4g" \
    --conf "spark.serializer"="org.apache.spark.serializer.KryoSerializer" \
    --conf "spark.kryoserializer.buffer.max"="2000M" \
    --conf "spark.driver.maxResultSize"="0" \
    --packages "com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.1"
```
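Note that this command uses the same `--name sparknlp-container` as the Jupyter example, and Docker does not allow two containers with the same name. If a container from an earlier run still exists, a sketch like the following removes it first; the `command -v docker` guard is an addition for illustration.

```shell
# Name shared by the Jupyter and spark-shell examples in this README.
CONTAINER="sparknlp-container"

if command -v docker >/dev/null 2>&1; then
    # -f also stops the container if it is still running.
    docker rm -f "${CONTAINER}"
else
    echo "docker not found on PATH" >&2
fi
```

Alternatively, passing `--rm` to `docker run` removes the container automatically when it exits, so the name is free for the next run.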

To run the shell with GPU support, we use the image from [Jupyter Notebook with GPU
support](#with-gpu-support) and specify the correct package (`spark-nlp-gpu`).

```bash
# Run the container, mount the current directory and run spark-shell with Spark NLP GPU
docker run -it --name sparknlp-container \
    -v "${PWD}":/home/johnsnow/work \
    --gpus all \
    --ipc=host \
    sparknlp-gpu:latest \
    /usr/local/bin/spark-shell \
    --conf "spark.driver.memory"="4g" \
    --conf "spark.serializer"="org.apache.spark.serializer.KryoSerializer" \
    --conf "spark.kryoserializer.buffer.max"="2000M" \
    --conf "spark.driver.maxResultSize"="0" \
    --packages "com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.3.1"
```
@@ -0,0 +1,11 @@
FROM jupyter/pyspark-notebook:java-11.0.15

ARG SPARKNLP_VERSION=4.3.1
RUN pip install --no-cache-dir spark-nlp==${SPARKNLP_VERSION}

# Create a new user
ENV NB_USER=johnsnow
ENV CHOWN_HOME=yes
ENV CHOWN_HOME_OPTS="-R"

WORKDIR /home/${NB_USER}
@@ -0,0 +1,39 @@
FROM tensorflow/tensorflow:2.7.4-gpu

# Fetch keys for apt
RUN rm /etc/apt/sources.list.d/cuda.list && \
    apt-key del 7fa2af80 && \
    apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub

# Install Java Dependency
RUN apt-get update && \
    apt-get -y --no-install-recommends install openjdk-8-jre \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# Install Spark NLP and dependencies
ARG SPARKNLP_VERSION=4.3.1
ARG PYSPARK_VERSION=3.3.0
RUN pip install --no-cache-dir \
    pyspark==${PYSPARK_VERSION} spark-nlp==${SPARKNLP_VERSION} pandas numpy jupyterlab

# Create Local User
ENV NB_USER johnsnow
ENV NB_UID 1000

RUN adduser --disabled-password \
    --gecos "Default user" \
    --uid ${NB_UID} \
    ${NB_USER}

ENV HOME /home/${NB_USER}
RUN chown -R ${NB_UID} ${HOME}

ENV PYSPARK_PYTHON=python3
ENV PYSPARK_DRIVER_PYTHON=python3

USER ${NB_USER}
WORKDIR ${HOME}

EXPOSE 8888
CMD ["jupyter", "lab", "--ip", "0.0.0.0"]