diff --git a/docs/en/transformer_entries/ZeroShotNer.md b/docs/en/transformer_entries/ZeroShotNer.md index f9623d9330980c..1d7a96e1528921 100644 --- a/docs/en/transformer_entries/ZeroShotNer.md +++ b/docs/en/transformer_entries/ZeroShotNer.md @@ -11,6 +11,9 @@ used to recognize entities. The definitions of entities is given by a dictionary specifying a set of questions for each entity. The model is based on RoBertaForQuestionAnswering. +For more extended examples see the +[Examples](https://github.com/JohnSnowLabs/spark-nlp/blob/master/examples/python/annotation/text/english/named-entity-recognition/ZeroShot_NER.ipynb). + Pretrained models can be loaded with `pretrained` of the companion object: ```scala diff --git a/examples/bak2.Text_Preprocessing_with_SparkNLP_Annotators_Transformers.ipynb b/examples/bak2.Text_Preprocessing_with_SparkNLP_Annotators_Transformers.ipynb deleted file mode 100644 index cd9972a1a0bb43..00000000000000 --- a/examples/bak2.Text_Preprocessing_with_SparkNLP_Annotators_Transformers.ipynb +++ /dev/null @@ -1,7131 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "id": "sXatvRX899i0" - }, - "source": [ - "![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "9XsAEBYVxeB-" - }, - "source": [ - "\n", - "\n", - "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/2.Text_Preprocessing_with_SparkNLP_Annotators_Transformers.ipynb)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Xj5fx5ir-wMt" - }, - "source": [ - "# **Text Preprocessing with Spark NLP**" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "H_SG0VCrix5p" - }, - "source": [ - "**Note** Read this article if you want to understand the basic concepts in Spark NLP.\n", - "\n", - 
"https://towardsdatascience.com/introduction-to-spark-nlp-foundations-and-basic-components-part-i-c83b7629ed59" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "MfkkKkbVF309" - }, - "source": [ - "## **0. Colab Setup**" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "iMkMQtZNF2n-" - }, - "outputs": [], - "source": [ - "!pip install -q pyspark==3.3.0 spark-nlp==4.3.0" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "SS07N80gEtSt" - }, - "source": [ - "### **1. Annotators and Transformer Concepts**" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "g3_ic8K7E0sy" - }, - "source": [ - "In Spark NLP, all Annotators are either Estimators or Transformers as we see in Spark ML. An Estimator in Spark ML is an algorithm which can be fit on a DataFrame to produce a Transformer. E.g., a learning algorithm is an Estimator which trains on a DataFrame and produces a model. A Transformer is an algorithm which can transform one DataFrame into another DataFrame. E.g., an ML model is a Transformer that transforms a DataFrame with features into a DataFrame with predictions.\n", - "In Spark NLP, there are two types of annotators: AnnotatorApproach and AnnotatorModel\n", - "AnnotatorApproach extends Estimators from Spark ML, which are meant to be trained through fit(), and AnnotatorModel extends Transformers which are meant to transform data frames through transform().\n", - "Some of Spark NLP annotators have a Model suffix and some do not. The model suffix is explicitly stated when the annotator is the result of a training process. Some annotators, such as Tokenizer are transformers but do not contain the suffix Model since they are not trained, annotators. 
Model annotators have a pre-trained() on its static object, to retrieve the public pre-trained version of a model.\n", - "Long story short, if it trains on a DataFrame and produces a model, it’s an AnnotatorApproach; and if it transforms one DataFrame into another DataFrame through some models, it’s an AnnotatorModel (e.g. WordEmbeddingsModel) and it doesn’t take Model suffix if it doesn’t rely on a pre-trained annotator while transforming a DataFrame (e.g. Tokenizer)." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "x6SaPXwtFBM-" - }, - "source": [ - "By convention, there are three possible names:\n", - "\n", - "**Approach** — Trainable annotator\n", - "\n", - "**Model** — Trained annotator\n", - "\n", - "**nothing** — Either a non-trainable annotator with pre-processing\n", - "step or shorthand for a model\n", - "\n", - "So for example, Stemmer doesn’t say Approach nor Model, however, it is a Model. On the other hand, Tokenizer doesn’t say Approach nor Model, but it has a TokenizerModel(). Because it is not “training” anything, but it is doing some preprocessing before converting into a Model.\n", - "When in doubt, please refer to official documentation and API reference.\n", - "Even though we will do many hands-on practices in the following articles, let us give you a glimpse to let you understand the difference between AnnotatorApproach and AnnotatorModel.\n", - "As stated above, Tokenizer is an AnnotatorModel. So we need to call fit() and then transform()." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ALiQ2TsOFNyc" - }, - "source": [ - "Now let’s see how this can be done in Spark NLP using Annotators and Transformers. 
Assume that we have the following steps that need to be applied one by one on a data frame.\n", - "\n", - "- Split text into sentences\n", - "- Tokenize\n", - "- Normalize\n", - "- Get word embeddings" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "K0Yy8L-pFb27" - }, - "source": [ - "![image.png]()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "suLa96N7Fijt" - }, - "source": [ - "**What’s actually happening under the hood?**\n", - "\n", - "When we fit() on the pipeline with Spark data frame (df), its text column is fed into DocumentAssembler() transformer at first and then a new column “document” is created in Document type (AnnotatorType). As we mentioned before, this transformer is basically the initial entry point to Spark NLP for any Spark data frame. Then its document column is fed into SentenceDetector() (AnnotatorApproach) and the text is split into an array of sentences and a new column “sentences” in Document type is created. Then “sentences” column is fed into Tokenizer() (AnnotatorModel) and each sentence is tokenized and a new column “token” in Token type is created. And so on. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 254 - }, - "executionInfo": { - "elapsed": 25398, - "status": "ok", - "timestamp": 1664906807242, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "SDasO3DbKu2Z", - "outputId": "41f67d0d-9012-4c34-f111-b57a8109c482" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Spark NLP version: 4.3.0\n", - "Apache Spark version: 3.3.0\n" - ] - }, - { - "data": { - "text/html": [ - "\n", - "
\n", - "

SparkSession - in-memory

\n", - " \n", - "
\n", - "

SparkContext

\n", - "\n", - "

Spark UI

\n", - "\n", - "
\n", - "
Version
\n", - "
v3.3.0
\n", - "
Master
\n", - "
local[*]
\n", - "
AppName
\n", - "
Spark NLP
\n", - "
\n", - "
\n", - " \n", - "
\n", - " " - ], - "text/plain": [ - "" - ] - }, - "execution_count": 2, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "import sparknlp\n", - "\n", - "spark = sparknlp.start()\n", - "\n", - "print(\"Spark NLP version: \", sparknlp.version())\n", - "print(\"Apache Spark version: \", spark.version)\n", - "\n", - "spark" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Ab6V51l_nPyR" - }, - "source": [ - "### **Create Spark Dataframe**" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 7168, - "status": "ok", - "timestamp": 1664906873787, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "qj-Q4LzMGCtQ", - "outputId": "19f8857d-f625-4bf0-b9ba-d39a8927bd30" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+------------------------------------------------+\n", - "|text |\n", - "+------------------------------------------------+\n", - "|Peter Parker is a nice guy and lives in New York|\n", - "+------------------------------------------------+\n", - "\n" - ] - } - ], - "source": [ - "text = 'Peter Parker is a nice guy and lives in New York'\n", - "\n", - "spark_df = spark.createDataFrame([[text]]).toDF(\"text\")\n", - "\n", - "spark_df.show(truncate=False)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 384, - "status": "ok", - "timestamp": 1664906880132, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "SGpOOlBHkPxP", - "outputId": "910c6eab-6d75-4602-ab45-61161cd96efd" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - 
"+--------------------------------------------------------+\n", - "| text|\n", - "+--------------------------------------------------------+\n", - "| Peter Parker is a nice guy and lives in New York.|\n", - "|Bruce Wayne is also a nice guy and lives in Gotham City.|\n", - "+--------------------------------------------------------+\n", - "\n" - ] - } - ], - "source": [ - "from pyspark.sql.types import StringType, IntegerType\n", - "\n", - "# if you want to create a spark datafarme from a list of strings\n", - "\n", - "text_list = ['Peter Parker is a nice guy and lives in New York.', 'Bruce Wayne is also a nice guy and lives in Gotham City.']\n", - "\n", - "spark.createDataFrame(text_list, StringType()).toDF(\"text\").show(truncate=80)\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 875, - "status": "ok", - "timestamp": 1664906889797, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "Hz6bw9a6t_Br", - "outputId": "db84014a-c753-43e1-b04f-71df6f27206b" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+--------------------------------------------------------+\n", - "| text|\n", - "+--------------------------------------------------------+\n", - "| Peter Parker is a nice guy and lives in New York.|\n", - "|Bruce Wayne is also a nice guy and lives in Gotham City.|\n", - "+--------------------------------------------------------+\n", - "\n" - ] - } - ], - "source": [ - "from pyspark.sql import Row\n", - "\n", - "spark.createDataFrame(list(map(lambda x: Row(text=x), text_list))).show(truncate=80)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "ggnHbf1rGn1H" - }, - "outputs": [], - "source": [ - "!wget -q 
https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/jupyter/annotation/english/spark-nlp-basics/sample-sentences-en.txt" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 6, - "status": "ok", - "timestamp": 1664906901964, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "GaOe_G-OnBVg", - "outputId": "9e7ba134-c689-47a9-f415-6a4c0fd06459" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Peter is a very good person.\n", - "My life in Russia is very interesting.\n", - "John and Peter are brothers. However they don't support each other that much.\n", - "Lucas Nogal Dunbercker is no longer happy. He has a good car though.\n", - "Europe is very culture rich. There are huge churches! and big houses!\n" - ] - } - ], - "source": [ - "with open('./sample-sentences-en.txt') as f:\n", - " print (f.read())" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 809, - "status": "ok", - "timestamp": 1664906904383, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "wdrkFmcVGW-o", - "outputId": "5adabd53-3d5c-4e69-f585-ffad494ee1e1" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+-----------------------------------------------------------------------------+\n", - "|text |\n", - "+-----------------------------------------------------------------------------+\n", - "|Peter is a very good person. |\n", - "|My life in Russia is very interesting. |\n", - "|John and Peter are brothers. However they don't support each other that much.|\n", - "|Lucas Nogal Dunbercker is no longer happy. He has a good car though. 
|\n", - "|Europe is very culture rich. There are huge churches! and big houses! |\n", - "+-----------------------------------------------------------------------------+\n", - "\n" - ] - } - ], - "source": [ - "spark_df = spark.read.text('./sample-sentences-en.txt').toDF('text')\n", - "\n", - "spark_df.show(truncate=False)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 320, - "status": "ok", - "timestamp": 1664906907432, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "IzBvpIZtGrLX", - "outputId": "14e297b9-f3d1-4a33-93cd-edb0f03d90ea" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+-----------------------------------------------------------------------------+\n", - "|text |\n", - "+-----------------------------------------------------------------------------+\n", - "|Peter is a very good person. |\n", - "|My life in Russia is very interesting. |\n", - "|John and Peter are brothers. However they don't support each other that much.|\n", - "|Lucas Nogal Dunbercker is no longer happy. He has a good car though. |\n", - "|Europe is very culture rich. There are huge churches! and big houses! 
|\n", - "+-----------------------------------------------------------------------------+\n", - "\n" - ] - } - ], - "source": [ - "spark_df.select('text').show(truncate=False)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 749, - "status": "ok", - "timestamp": 1664906917726, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "s66VfRkXK9l3", - "outputId": "1dd0843c-f272-45c0-900e-91a048fe0cd4" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+------------------------------+------------------------------+\n", - "| path| text|\n", - "+------------------------------+------------------------------+\n", - "|file:/content/sample-senten...|Peter is a very good person...|\n", - "+------------------------------+------------------------------+\n", - "\n" - ] - } - ], - "source": [ - "textFiles = spark.sparkContext.wholeTextFiles(\"./*.txt\",4)\n", - " \n", - "spark_df_folder = textFiles.toDF(schema=['path','text'])\n", - "\n", - "spark_df_folder.show(truncate=30)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 779, - "status": "ok", - "timestamp": 1664906921013, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "myuUF6vVV1cJ", - "outputId": "4d47df4a-7d98-4820-bd50-7b7d55f02287" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "[Row(text=\"Peter is a very good person.\\nMy life in Russia is very interesting.\\nJohn and Peter are brothers. However they don't support each other that much.\\nLucas Nogal Dunbercker is no longer happy. He has a good car though.\\nEurope is very culture rich. There are huge churches! 
and big houses!\")]" - ] - }, - "execution_count": 11, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "spark_df_folder.select('text').take(1)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 325, - "status": "ok", - "timestamp": 1664906923956, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "s3K_q2ptmFnH", - "outputId": "a61f23b5-1525-40c0-ddcc-d451e3b34ad0" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "[Row(text=\"Peter is a very good person.\\nMy life in Russia is very interesting.\\nJohn and Peter are brothers. However they don't support each other that much.\\nLucas Nogal Dunbercker is no longer happy. He has a good car though.\\nEurope is very culture rich. There are huge churches! and big houses!\")]" - ] - }, - "execution_count": 12, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "spark_df_folder.select('text').collect()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ZTcOJCrieNQK" - }, - "source": [ - "### **Transformers**" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "AwbTXq-keP7V" - }, - "source": [ - "What are we going to do if our DataFrame doesn’t have columns in those type? Here comes transformers. In Spark NLP, we have five different transformers that are mainly used for getting the data in or transform the data from one AnnotatorType to another. Here is the list of transformers:\n", - "\n", - "| **Transformers** | **Description** |\n", - "| - | - |\n", - "|**DocumentAssembler** |To get through the NLP process, we need to get raw data annotated. 
This is a special transformer that does this for us; it creates the first annotation of type Document which may be used by annotators down the road.\n", - "|**TokenAssembler** |This transformer reconstructs a Document type annotation from tokens, usually after these have been, lemmatized, normalized, spell checked, etc, to use this document annotation in further annotators.\n", - "|**Doc2Chunk** | Converts DOCUMENT type annotations into CHUNK type with the contents of a chunkCol.\n", - "|**Chunk2Doc** |Converts a CHUNK type column back into DOCUMENT. Useful when trying to re-tokenize or do further analysis on a CHUNK result.\n", - "|**Finisher** |Once we have our NLP pipeline ready to go, we might want to use our annotation results somewhere else where it is easy to use. The Finisher outputs annotation(s) values into a string.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "tOBEa23Odr-C" - }, - "source": [ - "Each annotator accepts certain types of columns and outputs new columns in another type (we call this AnnotatorType).\n", - "\n", - "In Spark NLP, we have the following types: \n", - "\n", - ">`Document`, `token`, `chunk`, `pos`, `word_embeddings`, `date`, `entity`, `sentiment`, `named_entity`, `dependency`, `labeled_dependency`. \n", - "\n", - "That is, the DataFrame you have needs to have a column from one of these types if that column will be fed into an annotator; otherwise, you’d need to use one of the Spark NLP transformers." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ss59Lk4ULNRT" - }, - "source": [ - "## **Document Assembler**" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "74S30BktM8p-" - }, - "source": [ - "In Spark NLP, we have five different transformers that are mainly used for getting the data in or transform the data from one AnnotatorType to another." 
- ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "TtyKgt_iM_5C" - }, - "source": [ - "That is, the DataFrame you have needs to have a column from one of these types if that column will be fed into an annotator; otherwise, you’d need to use one of the Spark NLP transformers. Here is the list of transformers: DocumentAssembler, TokenAssembler, Doc2Chunk, Chunk2Doc, and the Finisher.\n", - "\n", - "So, let’s start with DocumentAssembler(), an entry point to Spark NLP annotators." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "GLiXAucCLWS3" - }, - "source": [ - "To get through the process in Spark NLP, we need to get raw data transformed into Document type at first. \n", - "\n", - "DocumentAssembler() is a special transformer that does this for us; it creates the first annotation of type Document which may be used by annotators down the road.\n", - "\n", - "DocumentAssembler() comes from sparknlp.base class and has the following settable parameters. See the full list here and the source code here.\n", - "\n", - "\n", - "| Parametre | Value | Description |\n", - "| - | - | - |\n", - "|**setInputCol()*** |String |The name of the column that will be converted. We can specify only one column here. It can read either a String column or an Array.|\n", - "|**setOutputCol()*** |optional|The name of the column in Document type that is generated. We can specify only one column here. Default is '**document**'.|\n", - "|**setIdCol()*** |optional|String type column with id information|\n", - "|**setMetadataCol()*** |optional|Map type column with metadata information.|\n", - "|**setCleanupMode()***|optional| Cleaning up options|\n", - "\n", - "\n", - "possible values for setCleanupMode :\n", - " ```\n", - " disabled: Source kept as original. This is a default.\n", - " inplace: removes new lines and tabs.\n", - " inplace_full: removes new lines and tabs but also those which were converted to strings (i.e. 
\\n)\n", - " shrink: removes new lines and tabs, plus merging multiple spaces and blank lines to a single space.\n", - " shrink_full: remove new lines and tabs, including stringified values, plus shrinking spaces and blank lines.\n", - " ```" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 320, - "status": "ok", - "timestamp": 1664906952096, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "7aV9M823mWXF", - "outputId": "59905c91-1204-43b9-cee0-78ce859dfa15" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+--------------------+\n", - "| text|\n", - "+--------------------+\n", - "|Peter is a very g...|\n", - "|My life in Russia...|\n", - "|John and Peter ar...|\n", - "|Lucas Nogal Dunbe...|\n", - "|Europe is very cu...|\n", - "+--------------------+\n", - "\n" - ] - } - ], - "source": [ - "spark_df.show()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 314, - "status": "ok", - "timestamp": 1664906954191, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "p8kMdclKHr-O", - "outputId": "fdfc7e98-f73b-4cde-bba9-c5f88650572b" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+-----------------------------------------------------------------------------+\n", - "|text |\n", - "+-----------------------------------------------------------------------------+\n", - "|Peter is a very good person. |\n", - "|My life in Russia is very interesting. |\n", - "|John and Peter are brothers. However they don't support each other that much.|\n", - "|Lucas Nogal Dunbercker is no longer happy. He has a good car though. 
|\n", - "|Europe is very culture rich. There are huge churches! and big houses! |\n", - "+-----------------------------------------------------------------------------+\n", - "\n" - ] - } - ], - "source": [ - "spark_df.show(truncate=False)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 1621, - "status": "ok", - "timestamp": 1664906957934, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "E76Z1SCPLOyy", - "outputId": "e42f3a1d-aac2-44b8-9246-a5cc3f324d4d" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+-----------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+\n", - "|text |document |\n", - "+-----------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+\n", - "|Peter is a very good person. |[{document, 0, 27, Peter is a very good person., {sentence -> 0}, []}] |\n", - "|My life in Russia is very interesting. |[{document, 0, 37, My life in Russia is very interesting., {sentence -> 0}, []}] |\n", - "|John and Peter are brothers. However they don't support each other that much.|[{document, 0, 76, John and Peter are brothers. However they don't support each other that much., {sentence -> 0}, []}]|\n", - "|Lucas Nogal Dunbercker is no longer happy. He has a good car though. |[{document, 0, 67, Lucas Nogal Dunbercker is no longer happy. He has a good car though., {sentence -> 0}, []}] |\n", - "|Europe is very culture rich. There are huge churches! and big houses! |[{document, 0, 68, Europe is very culture rich. There are huge churches! 
and big houses!, {sentence -> 0}, []}] |\n", - "+-----------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+\n", - "\n" - ] - } - ], - "source": [ - "from sparknlp.base import *\n", - "\n", - "documentAssembler = DocumentAssembler()\\\n", - " .setInputCol(\"text\")\\\n", - " .setOutputCol(\"document\")\\\n", - " .setCleanupMode(\"shrink\")\n", - "\n", - "doc_df = documentAssembler.transform(spark_df)\n", - "\n", - "doc_df.show(truncate=False)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "4gAyd6D0MSDF" - }, - "source": [ - "At first, we define DocumentAssembler with desired parameters and then transform the data frame with it. The most important point to pay attention to here is that you need to use a String or String[Array] type column in .setInputCol(). So it doesn’t have to be named as text. You just use the column name as it is." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 350, - "status": "ok", - "timestamp": 1664906960636, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "Ui6Ufm_fMS5h", - "outputId": "e5aabf5c-00e6-45ad-f202-7d936db11cf3" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "root\n", - " |-- text: string (nullable = true)\n", - " |-- document: array (nullable = true)\n", - " | |-- element: struct (containsNull = true)\n", - " | | |-- annotatorType: string (nullable = true)\n", - " | | |-- begin: integer (nullable = false)\n", - " | | |-- end: integer (nullable = false)\n", - " | | |-- result: string (nullable = true)\n", - " | | |-- metadata: map (nullable = true)\n", - " | | | |-- key: string\n", - " | | | |-- value: string (valueContainsNull = true)\n", - " | | |-- embeddings: array (nullable = true)\n", - " | | | |-- element: float (containsNull = false)\n", - "\n" - ] - } - ], - "source": [ - "doc_df.printSchema()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 494, - "status": "ok", - "timestamp": 1664906963110, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "UHX6sM47NIVP", - "outputId": "670364b6-ee51-4b22-d8dc-690aa019c913" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+-------------------------------------------------------------------------------+-----+----+\n", - "|result |begin|end |\n", - "+-------------------------------------------------------------------------------+-----+----+\n", - "|[Peter is a very good person.] |[0] |[27]|\n", - "|[My life in Russia is very interesting.] 
|[0] |[37]|\n", - "|[John and Peter are brothers. However they don't support each other that much.]|[0] |[76]|\n", - "|[Lucas Nogal Dunbercker is no longer happy. He has a good car though.] |[0] |[67]|\n", - "|[Europe is very culture rich. There are huge churches! and big houses!] |[0] |[68]|\n", - "+-------------------------------------------------------------------------------+-----+----+\n", - "\n" - ] - } - ], - "source": [ - "doc_df.select('document.result','document.begin','document.end').show(truncate=False)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "1zb-mdaNMbS5" - }, - "source": [ - "The new column is in an array of struct type and has the parameters shown above. The annotators and transformers all come with universal metadata that would be filled down the road depending on the annotators being used. Unless you want to append other Spark NLP annotators to DocumentAssembler(), you don’t need to know what all these parameters mean for now. So we will talk about them in the following articles. You can access all these parameters with {column name}.{parameter name}.\n", - "\n", - "Let’s print out the first item’s result." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 397, - "status": "ok", - "timestamp": 1664906966360, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "r-EWE7TIMb69", - "outputId": "ec346baf-8012-450c-d468-bfc391ce7dca" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "[Row(result=['Peter is a very good person.'])]" - ] - }, - "execution_count": 18, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "doc_df.select(\"document.result\").take(1)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "SiogyzI-MjsI" - }, - "source": [ - "If we would like to flatten the document column, we can do as follows.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 332, - "status": "ok", - "timestamp": 1664906968073, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "ocMbESMGMeJA", - "outputId": "6f3898cd-04d6-4e17-d5c7-1332519d9680" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+-------------+-----+---+-----------------------------------------------------------------------------+---------------+----------+\n", - "|annotatorType|begin|end|result |metadata |embeddings|\n", - "+-------------+-----+---+-----------------------------------------------------------------------------+---------------+----------+\n", - "|document |0 |27 |Peter is a very good person. |{sentence -> 0}|[] |\n", - "|document |0 |37 |My life in Russia is very interesting. |{sentence -> 0}|[] |\n", - "|document |0 |76 |John and Peter are brothers. 
However they don't support each other that much.|{sentence -> 0}|[] |\n", - "|document |0 |67 |Lucas Nogal Dunbercker is no longer happy. He has a good car though. |{sentence -> 0}|[] |\n", - "|document |0 |68 |Europe is very culture rich. There are huge churches! and big houses! |{sentence -> 0}|[] |\n", - "+-------------+-----+---+-----------------------------------------------------------------------------+---------------+----------+\n", - "\n" - ] - } - ], - "source": [ - "import pyspark.sql.functions as F\n", - "\n", - "doc_df.withColumn(\n", - " \"tmp\", \n", - " F.explode(\"document\"))\\\n", - " .select(\"tmp.*\")\\\n", - " .show(truncate=False)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "yYxUBF6vMl3o" - }, - "source": [ - "## **Sentence Detector**" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "H8YL-VNMcfQx" - }, - "source": [ - "Finds sentence bounds in raw text. " - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "NeFCrak7chlb" - }, - "source": [ - "| Parameter | Value | Description |\n", - "| - | - | - |\n", - "|**setCustomBounds()** |List[String] |Custom sentence separator text e.g. `[\"\\n\"]`|\n", - "|**setUseCustomBoundsOnly()** |Bool|Use only custom bounds without considering those of Pragmatic Segmenter. Defaults to false. Needs customBounds.|\n", - "|**setUseAbbreviations()** |Bool| Whether to consider abbreviation strategies for better accuracy but slower performance. Defaults to true.|\n", - "|**setExplodeSentences()** |Bool|Whether to split sentences into different Dataset rows. Useful for higher parallelism in fat rows. 
Defaults to false.|\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "hfMNS_fXb3mx" - }, - "outputs": [], - "source": [ - "from sparknlp.annotator import *\n", - "\n", - "# we feed the document column coming from Document Assembler\n", - "\n", - "sentenceDetector = SentenceDetector()\\\n", - " .setInputCols(['document'])\\\n", - " .setOutputCol('sentences')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 11, - "status": "ok", - "timestamp": 1664906972136, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "atHSGu_oIK6C", - "outputId": "7195ac7d-557f-45f3-df05-19541626e1ce" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "{Param(parent='SentenceDetector_176a5418c081', name='lazyAnnotator', doc='Whether this AnnotatorModel acts as lazy in RecursivePipelines'): False,\n", - " Param(parent='SentenceDetector_176a5418c081', name='useAbbreviations', doc='whether to apply abbreviations at sentence detection'): True,\n", - " Param(parent='SentenceDetector_176a5418c081', name='detectLists', doc='whether detect lists during sentence detection'): True,\n", - " Param(parent='SentenceDetector_176a5418c081', name='useCustomBoundsOnly', doc='Only utilize custom bounds in sentence detection'): False,\n", - " Param(parent='SentenceDetector_176a5418c081', name='customBounds', doc='characters used to explicitly mark sentence bounds'): [],\n", - " Param(parent='SentenceDetector_176a5418c081', name='customBoundsStrategy', doc='How to return matched custom bounds'): 'none',\n", - " Param(parent='SentenceDetector_176a5418c081', name='explodeSentences', doc='whether to explode each sentence into a different row, for better parallelization. 
Defaults to false.'): False,\n", - " Param(parent='SentenceDetector_176a5418c081', name='minLength', doc='Set the minimum allowed length for each sentence.'): 0,\n", - " Param(parent='SentenceDetector_176a5418c081', name='maxLength', doc='Set the maximum allowed length for each sentence'): 99999,\n", - " Param(parent='SentenceDetector_176a5418c081', name='inputCols', doc='previous annotations columns, if renamed'): ['document'],\n", - " Param(parent='SentenceDetector_176a5418c081', name='outputCol', doc='output annotation column. can be left default.'): 'sentences'}" - ] - }, - "execution_count": 21, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "sentenceDetector.extractParamMap()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 6, - "status": "ok", - "timestamp": 1664906974637, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "uNGl8ndXnTKA", - "outputId": "e932e19b-b368-429f-d502-54e0a14f5ac6" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+-----------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+\n", - "|text |document |\n", - "+-----------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+\n", - "|Peter is a very good person. |[{document, 0, 27, Peter is a very good person., {sentence -> 0}, []}] |\n", - "|My life in Russia is very interesting. |[{document, 0, 37, My life in Russia is very interesting., {sentence -> 0}, []}] |\n", - "|John and Peter are brothers. 
However they don't support each other that much.|[{document, 0, 76, John and Peter are brothers. However they don't support each other that much., {sentence -> 0}, []}]|\n", - "|Lucas Nogal Dunbercker is no longer happy. He has a good car though. |[{document, 0, 67, Lucas Nogal Dunbercker is no longer happy. He has a good car though., {sentence -> 0}, []}] |\n", - "|Europe is very culture rich. There are huge churches! and big houses! |[{document, 0, 68, Europe is very culture rich. There are huge churches! and big houses!, {sentence -> 0}, []}] |\n", - "+-----------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+\n", - "\n" - ] - } - ], - "source": [ - "doc_df.show(truncate=False)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 372, - "status": "ok", - "timestamp": 1664906976633, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "yMcJmxwyeii3", - "outputId": "2ab9888a-d14c-4238-a2c9-b33a4da961cb" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+-----------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n", - "|text |document |sentences |\n", - 
"+-----------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n", - "|Peter is a very good person. |[{document, 0, 27, Peter is a very good person., {sentence -> 0}, []}] |[{document, 0, 27, Peter is a very good person., {sentence -> 0}, []}] |\n", - "|My life in Russia is very interesting. |[{document, 0, 37, My life in Russia is very interesting., {sentence -> 0}, []}] |[{document, 0, 37, My life in Russia is very interesting., {sentence -> 0}, []}] |\n", - "|John and Peter are brothers. However they don't support each other that much.|[{document, 0, 76, John and Peter are brothers. However they don't support each other that much., {sentence -> 0}, []}]|[{document, 0, 27, John and Peter are brothers., {sentence -> 0}, []}, {document, 29, 76, However they don't support each other that much., {sentence -> 1}, []}] |\n", - "|Lucas Nogal Dunbercker is no longer happy. He has a good car though. |[{document, 0, 67, Lucas Nogal Dunbercker is no longer happy. He has a good car though., {sentence -> 0}, []}] |[{document, 0, 41, Lucas Nogal Dunbercker is no longer happy., {sentence -> 0}, []}, {document, 43, 67, He has a good car though., {sentence -> 1}, []}] |\n", - "|Europe is very culture rich. There are huge churches! and big houses! |[{document, 0, 68, Europe is very culture rich. There are huge churches! 
and big houses!, {sentence -> 0}, []}] |[{document, 0, 27, Europe is very culture rich., {sentence -> 0}, []}, {document, 29, 52, There are huge churches!, {sentence -> 1}, []}, {document, 54, 68, and big houses!, {sentence -> 2}, []}]|\n", - "+-----------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n", - "\n" - ] - } - ], - "source": [ - "sent_df = sentenceDetector.transform(doc_df)\n", - "\n", - "sent_df.show(truncate=False)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 375, - "status": "ok", - "timestamp": 1664906980245, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "uJrXlWNWfSs2", - "outputId": "0bad6f54-787e-4da5-e0be-b9d97ae881d2" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "[Row(sentences=[Row(annotatorType='document', begin=0, end=27, result='Peter is a very good person.', metadata={'sentence': '0'}, embeddings=[])]),\n", - " Row(sentences=[Row(annotatorType='document', begin=0, end=37, result='My life in Russia is very interesting.', metadata={'sentence': '0'}, embeddings=[])]),\n", - " Row(sentences=[Row(annotatorType='document', begin=0, end=27, result='John and Peter are brothers.', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='document', begin=29, end=76, result=\"However they don't support each other that much.\", metadata={'sentence': '1'}, embeddings=[])])]" - ] - }, - "execution_count": 24, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - 
"sent_df.select('sentences').take(3)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 5, - "status": "ok", - "timestamp": 1664906981870, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "9lfy4uxkntFx", - "outputId": "1f628bb9-c0a3-4e83-ab86-f5a98d9b7731" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "[Row(result=['Peter is a very good person.']),\n", - " Row(result=['My life in Russia is very interesting.']),\n", - " Row(result=['John and Peter are brothers.', \"However they don't support each other that much.\"])]" - ] - }, - "execution_count": 25, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "sent_df.select('sentences.result').take(3)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 344, - "status": "ok", - "timestamp": 1664906983898, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "2rXfNjSKpcxs", - "outputId": "21fbda5a-19c6-4b07-fb29-e884fbf89606" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "[Row(metadata=[{'sentence': '0'}]),\n", - " Row(metadata=[{'sentence': '0'}]),\n", - " Row(metadata=[{'sentence': '0'}, {'sentence': '1'}])]" - ] - }, - "execution_count": 26, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "sent_df.select('sentences.metadata').take(3)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 53 - }, - "executionInfo": { - "elapsed": 327, - "status": "ok", - "timestamp": 1664906985855, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 
- }, - "id": "Ly4eTKK0Th60", - "outputId": "69203295-ffb4-44c0-97c4-2956c1adf3c2" - }, - "outputs": [ - { - "data": { - "application/vnd.google.colaboratory.intrinsic+json": { - "type": "string" - }, - "text/plain": [ - "'The patient was prescribed 1 capsule of Advil for 5 days. He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals, and metformin 1000 mg two times a day. It was determined that all SGLT2 inhibitors should be discontinued indefinitely fro 3 months.'" - ] - }, - "execution_count": 27, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "text ='The patient was prescribed 1 capsule of Advil for 5 days. He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals, and metformin 1000 mg two times a day. It was determined that all SGLT2 inhibitors should be discontinued indefinitely fro 3 months.'\n", - "text\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 352, - "status": "ok", - "timestamp": 1664906987812, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "q3wTJSVnULpX", - "outputId": "acb5d5d6-3d2d-4ff1-ef3d-cbdd98b6c914" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n", - "|text |\n", - 
"+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n", - "|The patient was prescribed 1 capsule of Advil for 5 days. He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals, and metformin 1000 mg two times a day. It was determined that all SGLT2 inhibitors should be discontinued indefinitely fro 3 months.|\n", - "+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n", - "\n" - ] - } - ], - "source": [ - "spark_df = spark.createDataFrame([[text]]).toDF(\"text\")\n", - "\n", - "spark_df.show(truncate=False)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 327, - "status": "ok", - "timestamp": 1664906990629, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "bkF2C2_GVK2S", - "outputId": "6bcb8ec7-3434-455c-c6f1-685a434cc068" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+--------------------------------------------------+\n", - "| text|\n", - "+--------------------------------------------------+\n", - "|The patient was prescribed 1 capsule of Advil f...|\n", - "+--------------------------------------------------+\n", - "\n" - ] - } - ], - "source": [ - "spark_df.show(truncate=50)" - ] - }, 
- { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 443, - "status": "ok", - "timestamp": 1664906992370, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "UASqrLkWV2sJ", - "outputId": "f6122be9-799a-45f7-9a87-bc3cf63ba1ff" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+\n", - "| text| document| sentences|\n", - "+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+\n", - "|The patient was prescribed 1 capsule of Advil f...|[{document, 0, 334, The patient was prescribed ...|[{document, 0, 56, The patient was prescribed 1...|\n", - "+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+\n", - "\n" - ] - } - ], - "source": [ - "doc_df = documentAssembler.transform(spark_df)\n", - "\n", - "sent_df = sentenceDetector.transform(doc_df)\n", - "\n", - "sent_df.show(truncate=50)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 5, - "status": "ok", - "timestamp": 1664906993791, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "rYSQKC4lWf_m", - "outputId": "5c8506c1-0e86-4c98-f146-d93f1f4d909b" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "[Row(result=['The patient was prescribed 1 capsule of Advil for 5 days.', 'He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at 
night, 12 units of insulin lispro with meals, and metformin 1000 mg two times a day.', 'It was determined that all SGLT2 inhibitors should be discontinued indefinitely fro 3 months.'])]" - ] - }, - "execution_count": 31, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "sent_df.select('sentences.result').take(1)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 310, - "status": "ok", - "timestamp": 1664906996354, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "ZBjcZbj8WwxC", - "outputId": "789880fc-2f8c-474c-820a-adb752c62814" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "SentenceDetector_176a5418c081" - ] - }, - "execution_count": 32, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# setExplodeSentences: Whether to split sentences into different Dataset rows. Useful for higher parallelism in fat rows. 
Defaults to false.\n", - "\n", - "sentenceDetector.setExplodeSentences(True)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 915, - "status": "ok", - "timestamp": 1664906999664, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "hJ6b7ZEXW8UY", - "outputId": "fb0fd3d6-96fd-40c4-cd8e-1f5e23e02a46" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+\n", - "| text| document| sentences|\n", - "+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+\n", - "|The patient was prescribed 1 capsule of Advil f...|[{document, 0, 334, The patient was prescribed ...|[{document, 0, 56, The patient was prescribed 1...|\n", - "|The patient was prescribed 1 capsule of Advil f...|[{document, 0, 334, The patient was prescribed ...|[{document, 58, 240, He was seen by the endocri...|\n", - "|The patient was prescribed 1 capsule of Advil f...|[{document, 0, 334, The patient was prescribed ...|[{document, 242, 334, It was determined that al...|\n", - "+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+\n", - "\n" - ] - } - ], - "source": [ - "sent_df = sentenceDetector.transform(doc_df)\n", - "\n", - "sent_df.show(truncate=50)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 785, - "status": "ok", - "timestamp": 1664907003094, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": 
"01451729557099986551" - }, - "user_tz": -120 - }, - "id": "9UEVo1RTs3ok", - "outputId": "24f69afe-7b1a-4855-8011-3abf06fcdc86" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n", - "|result |\n", - "+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n", - "|[The patient was prescribed 1 capsule of Advil for 5 days.] |\n", - "|[He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals, and metformin 1000 mg two times a day.]|\n", - "|[It was determined that all SGLT2 inhibitors should be discontinued indefinitely fro 3 months.] |\n", - "+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n", - "\n" - ] - } - ], - "source": [ - "sent_df.select('sentences.result').show(truncate=False)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 770, - "status": "ok", - "timestamp": 1664907005439, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "AYvdI1mkysgJ", - "outputId": "7aa84046-c6de-4f38-b13b-92a28feb3936" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n", - "|col |\n", - 
"+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n", - "|The patient was prescribed 1 capsule of Advil for 5 days. |\n", - "|He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals, and metformin 1000 mg two times a day.|\n", - "|It was determined that all SGLT2 inhibitors should be discontinued indefinitely fro 3 months. |\n", - "+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n", - "\n" - ] - } - ], - "source": [ - "from pyspark.sql import functions as F\n", - "\n", - "sent_df.select(F.explode('sentences.result')).show(truncate=False)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "vpSNJ1Z4tzbE" - }, - "source": [ - "**`.setCustomBounds([r\"\\\\.\", \";\"])`**\n", - "\n", - "**`.setCustomBoundsStrategy(\"append\")`**\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "A2zB7ICWrZUd" - }, - "outputs": [], - "source": [ - "text = [\n", - " [\"Peter is a very good person.\"],\n", - " [\"My life in Russia is very interesting.\"], \n", - " [\"John and Peter are brothers. However; they don't support each other that much.\"],\n", - " [\"Lucas Nogal Dunbercker is no longer happy. He has a good car though.\"],\n", - " [\"Europe is very culture rich. There are huge churches! 
and big houses!\"]\n", - " ]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 674, - "status": "ok", - "timestamp": 1664907010630, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "kDeWXEcvr1rz", - "outputId": "046f7314-b4fc-4904-f800-b4c1d8367152" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+------------------------------------------------------------------------------+\n", - "|text |\n", - "+------------------------------------------------------------------------------+\n", - "|Peter is a very good person. |\n", - "|My life in Russia is very interesting. |\n", - "|John and Peter are brothers. However; they don't support each other that much.|\n", - "|Lucas Nogal Dunbercker is no longer happy. He has a good car though. |\n", - "|Europe is very culture rich. There are huge churches! and big houses! 
|\n", - "+------------------------------------------------------------------------------+\n", - "\n" - ] - } - ], - "source": [ - "spark_df = spark.createDataFrame(text).toDF(\"text\")\n", - "spark_df.show(truncate=False)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "EaeWEgBUsc9I" - }, - "outputs": [], - "source": [ - "doc_df = documentAssembler.transform(spark_df)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 451, - "status": "ok", - "timestamp": 1664907013585, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "awALJxI2rNXa", - "outputId": "d4a016bb-c77b-47fd-c8f6-f4ddcc60c38e" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n", - "|text |document |sentences |\n", - "+------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n", - "|Peter is a very good person. |[{document, 0, 27, Peter is a very good person., {sentence -> 0}, []}] |[{document, 0, 27, Peter is a very good person., {sentence -> 0}, []}] |\n", - "|My life in Russia is very interesting. 
|[{document, 0, 37, My life in Russia is very interesting., {sentence -> 0}, []}] |[{document, 0, 37, My life in Russia is very interesting., {sentence -> 0}, []}] |\n", - "|John and Peter are brothers. However; they don't support each other that much.|[{document, 0, 77, John and Peter are brothers. However; they don't support each other that much., {sentence -> 0}, []}]|[{document, 0, 27, John and Peter are brothers., {sentence -> 0}, []}, {document, 29, 36, However;, {sentence -> 1}, []}, {document, 38, 77, they don't support each other that much., {sentence -> 2}, []}]|\n", - "|Lucas Nogal Dunbercker is no longer happy. He has a good car though. |[{document, 0, 67, Lucas Nogal Dunbercker is no longer happy. He has a good car though., {sentence -> 0}, []}] |[{document, 0, 41, Lucas Nogal Dunbercker is no longer happy., {sentence -> 0}, []}, {document, 43, 67, He has a good car though., {sentence -> 1}, []}] |\n", - "|Europe is very culture rich. There are huge churches! and big houses! |[{document, 0, 68, Europe is very culture rich. There are huge churches! 
and big houses!, {sentence -> 0}, []}] |[{document, 0, 27, Europe is very culture rich., {sentence -> 0}, []}, {document, 29, 52, There are huge churches!, {sentence -> 1}, []}, {document, 54, 68, and big houses!, {sentence -> 2}, []}] |\n", - "+------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n", - "\n" - ] - } - ], - "source": [ - "sentenceDetector = SentenceDetector()\\\n", - " .setInputCols(['document'])\\\n", - " .setOutputCol('sentences')\\\n", - " .setCustomBounds([r\"\\.\", \";\", \"!\"])\\\n", - " .setCustomBoundsStrategy(\"append\")\n", - " \n", - "sent_df = sentenceDetector.transform(doc_df)\n", - "sent_df.show(truncate=False)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 737, - "status": "ok", - "timestamp": 1664907017689, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "-RHMB4glq5pS", - "outputId": "c67832b4-09d2-4c0f-8e8b-10bf9340f0d2" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "[Row(result=['Peter is a very good person.']),\n", - " Row(result=['My life in Russia is very interesting.']),\n", - " Row(result=['John and Peter are brothers.', 'However;', \"they don't support each other that much.\"]),\n", - " Row(result=['Lucas Nogal Dunbercker is no longer happy.', 'He has a good car though.']),\n", - " Row(result=['Europe is very culture rich.', 'There are huge churches!', 'and big houses!'])]" - ] - }, - "execution_count": 40, - "metadata": {}, - "output_type": "execute_result" - } - ], - 
"source": [ - "sent_df.select('sentences.result').take(5)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "bg9a63CTv51L" - }, - "source": [ - "**`.setCustomBoundsStrategy(\"prepend\")`**" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 472, - "status": "ok", - "timestamp": 1664907019385, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "YW8bMUNxszaF", - "outputId": "db78ae9d-9594-49d0-aa4b-45bbedad9e20" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "textn", - "|text |document |sentences |\nn", - "|Peter is a very good person. |[{document, 0, 27, Peter is a very good person., {sentence -> 0}, []}] |[{document, 0, 26, Peter is a very good person, {sentence -> 0}, []}, {document, 27, 27, ., {sentence -> 1}, []}] |\n", - "|My life in Russia is very interesting. |[{document, 0, 37, My life in Russia is very interesting., {sentence -> 0}, []}] |[{document, 0, 36, My life in Russia is very interesting, {sentence -> 0}, []}, {document, 37, 37, ., {sentence -> 1}, []}] |\n", - "|John and Peter are brothers. However; they don't support each other that much.|[{document, 0, 77, John and Peter are brothers. However; they don't support each other that much., {sentence -> 0}, []}]|[{document, 0, 26, John and Peter are brothers, {sentence -> 0}, []}, {document, 27, 27, ., {sentence -> 1}, []}, {document, 29, 35, However, {sentence -> 2}, []}, {document, 36, 36, ;, {sentence -> 3}, []}, {document, 38, 76, they don't support each other that much, {sentence -> 4}, []}, {document, 77, 77, ., {sentence -> 5}, []}]|\n", - "|Lucas Nogal Dunbercker is no longer happy. He has a good car though. |[{document, 0, 67, Lucas Nogal Dunbercker is no longer happy. 
He has a good car though., {sentence -> 0}, []}] |[{document, 0, 40, Lucas Nogal Dunbercker is no longer happy, {sentence -> 0}, []}, {document, 41, 41, ., {sentence -> 1}, []}, {document, 43, 66, He has a good car though, {sentence -> 2}, []}, {document, 67, 67, ., {sentence -> 3}, []}] |\n", - "|Europe is very culture rich. There are huge churches! and big houses! |[{document, 0, 68, Europe is very culture rich. There are huge churches! and big houses!, {sentence -> 0}, []}] |[{document, 0, 26, Europe is very culture rich, {sentence -> 0}, []}, {document, 27, 27, ., {sentence -> 1}, []}, {document, 29, 51, There are huge churches, {sentence -> 2}, []}, {document, 52, 52, !, {sentence -> 3}, []}, {document, 54, 67, and big houses, {sentence -> 4}, []}, {document, 68, 68, !, {sentence -> 5}, []}] |\n", - "+------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n", - "\n" - ] - } - ], - "source": [ - "sentenceDetector = SentenceDetector()\\\n", - " .setInputCols(['document'])\\\n", - " .setOutputCol('sentences')\\\n", - " .setCustomBounds([r\"\\.\", \";\", \"!\"])\\\n", - " .setCustomBoundsStrategy(\"prepend\")\n", - "\n", - "sent_df = sentenceDetector.transform(doc_df)\n", - "sent_df.show(truncate=False)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 315, - "status": "ok", - "timestamp": 1664907021159, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": 
"01451729557099986551" - }, - "user_tz": -120 - }, - "id": "IT4E13iooMXA", - "outputId": "85fdccd2-94b4-4b6a-f922-8365f1e09fdd" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "[Row(result=['Peter is a very good person', '.']),\n", - " Row(result=['My life in Russia is very interesting', '.']),\n", - " Row(result=['John and Peter are brothers', '.', 'However', ';', \"they don't support each other that much\", '.']),\n", - " Row(result=['Lucas Nogal Dunbercker is no longer happy', '.', 'He has a good car though', '.']),\n", - " Row(result=['Europe is very culture rich', '.', 'There are huge churches', '!', 'and big houses', '!'])]" - ] - }, - "execution_count": 42, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "sent_df.select('sentences.result').take(5)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "sk-r3ZiTxrAM" - }, - "source": [ - "The separation of the sentences is determined according to the characters we set with custom bound. When we use `append`, sentences are differentiated according to the characters, if `prepend` is used, it also determines the characters as separate sentences." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "pRiC65UdZ7LC" - }, - "source": [ - "### **Sentence Detector DL**" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "YIlgqqgLzZLl" - }, - "outputs": [], - "source": [ - "text ='The patient was prescribed 1 capsule of Advil for 5 days. He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals, and metformin 1000 mg two times a day. 
It was determined that all SGLT2 inhibitors should be discontinued indefinitely fro 3 months.'\n", - "\n", - "spark_df = spark.createDataFrame([[text]]).toDF(\"text\")\n", - "\n", - "doc_df = documentAssembler.transform(spark_df)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 16870, - "status": "ok", - "timestamp": 1664907042811, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "Vsesd8t4aA0W", - "outputId": "61b225ef-6043-4d78-c2e7-f5340b607d57" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "sentence_detector_dl download started this may take some time.\n", - "Approximate size to download 354.6 KB\n", - "[OK!]\n", - "+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n", - "|col |\n", - "+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n", - "|The patient was prescribed 1 capsule of Advil for 5 days. |\n", - "|He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals, and metformin 1000 mg two times a day.|\n", - "|It was determined that all SGLT2 inhibitors should be discontinued indefinitely fro 3 months. 
|\n", - "+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n", - "\n" - ] - } - ], - "source": [ - "sentencerDL = SentenceDetectorDLModel\\\n", - " .pretrained(\"sentence_detector_dl\", \"en\") \\\n", - " .setInputCols([\"document\"]) \\\n", - " .setOutputCol(\"sentences\")\\\n", - "\n", - "sent_dl_df = sentencerDL.transform(doc_df)\n", - "\n", - "sent_dl_df.select(F.explode('sentences.result')).show(truncate=False)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 4186, - "status": "ok", - "timestamp": 1664907046986, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "kKZ6YKJebE_0", - "outputId": "5b8f6f81-7a05-4c5b-db87-41837fe174a4" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "sentence_detector_dl download started this may take some time.\n", - "Approximate size to download 354.6 KB\n", - "[OK!]\n" - ] - } - ], - "source": [ - "documenter = DocumentAssembler()\\\n", - " .setInputCol(\"text\")\\\n", - " .setOutputCol(\"document\")\n", - "\n", - "sentenceDetector = SentenceDetector()\\\n", - " .setInputCols(['document'])\\\n", - " .setOutputCol('sentences')\\\n", - " \n", - "sentencerDL = SentenceDetectorDLModel\\\n", - " .pretrained(\"sentence_detector_dl\", \"en\") \\\n", - " .setInputCols([\"document\"]) \\\n", - " .setOutputCol(\"sentences\")\n", - "\n", - "\n", - "sd_pipeline = PipelineModel(stages=[documenter, sentenceDetector])\n", - "sd_model = LightPipeline(sd_pipeline)\n", - "\n", - "\n", - "# DL version\n", - "sd_dl_pipeline = PipelineModel(stages=[documenter, sentencerDL])\n", - "sd_dl_model = LightPipeline(sd_dl_pipeline)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - 
"metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 28, - "status": "ok", - "timestamp": 1664907046986, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "ylVTYULea6B-", - "outputId": "4c408635-70df-4488-8f9d-65f3f27e1dd1" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "0\t0\t51\tJohn loves Mary.Mary loves Peter\n", - "Peter loves Helen .\n", - "1\t52\t68\tHelen loves John;\n", - "2\t71\t98\tTotal: four people involved.\n" - ] - } - ], - "source": [ - "text = \"\"\"John loves Mary.Mary loves Peter\n", - "Peter loves Helen .Helen loves John; \n", - "Total: four people involved.\"\"\"\n", - "\n", - "# sd_model\n", - "for anno in sd_model.fullAnnotate(text)[0][\"sentences\"]:\n", - " print(\"{}\\t{}\\t{}\\t{}\".format(\n", - " anno.metadata[\"sentence\"], anno.begin, anno.end, anno.result))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 437, - "status": "ok", - "timestamp": 1664907047402, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "SlSnXKZ3bmjR", - "outputId": "831e2ce0-923b-4e8d-f38d-9979b8aa6e73" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "0\t0\t15\tJohn loves Mary.\n", - "1\t16\t31\tMary loves Peter\n", - "2\t33\t51\tPeter loves Helen .\n", - "3\t52\t68\tHelen loves John;\n", - "4\t71\t98\tTotal: four people involved.\n" - ] - } - ], - "source": [ - "# sd_dl_model\n", - "for anno in sd_dl_model.fullAnnotate(text)[0][\"sentences\"]:\n", - " print(\"{}\\t{}\\t{}\\t{}\".format(\n", - " anno.metadata[\"sentence\"], anno.begin, anno.end, anno.result))" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ELZyacNqbgeq" - }, - "source": [ - "## Tokenizer" - ] - 
}, - { - "cell_type": "markdown", - "metadata": { - "id": "tnkBJxAqbu2b" - }, - "source": [ - "Identifies tokens using open tokenization standards. It is an **Annotator Approach, so it requires .fit()**.\n", - "\n", - "A few rules help customize it when the defaults do not fit your needs.\n", - "\n", - "setExceptions(StringArray): List of tokens to not alter at all. Allows composite tokens, such as two-word tokens, that the user may not want to split.\n", - "\n", - "| Parameter | Value | Description |\n", - "| - | - | - |\n", - "|**addException()** |String |Add a single exception.|\n", - "|**setExceptionsPath()** |String|Path to a txt file with a list of token exceptions.|\n", - "|**caseSensitiveExceptions** |Bool|Whether exceptions are matched case-sensitively in the text.|\n", - "|**contextChars()** |StringArray|List of single-character strings to strip from tokens, such as parentheses or question marks. Ignored if using prefix, infix or suffix patterns.|\n", - "|**splitChars()** |StringArray|List of single-character strings at which to split tokens internally, such as hyphens. Ignored if using infix, prefix or suffix patterns.|\n", - "|**splitPattern()** |String|Pattern to separate from the inside of tokens. Takes priority over splitChars.|\n", - "|**setTargetPattern()** |String|Basic regex rule to identify a candidate for tokenization. Defaults to \S+, which means anything that is not a space.|\n", - "|**setSuffixPattern()** ||Regex to identify subtokens at the end of the token. Regex has to end with \z and must contain groups (). Each group will become a separate token within the suffix. Defaults to non-letter characters, e.g. quotes or parentheses.|\n", - "|**setPrefixPattern()** ||Regex to identify subtokens at the beginning of the token. Regex has to start with \A and must contain groups (). Each group will become a separate token within the prefix. Defaults to non-letter characters, e.g. quotes or parentheses.|\n", 
- "|**addInfixPattern()** ||Add an extension pattern regex with groups to the top of the rules (will target first, from more specific to more general).|\n", - "|**minLength()** ||Set the minimum allowed length for each token.|\n", - "|**maxLength()** ||Set the maximum allowed length for each token.|\n", - "\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "AyfeGRPKNrDF" - }, - "outputs": [], - "source": [ - "tokenizer = Tokenizer() \\\n", - " .setInputCols([\"document\"]) \\\n", - " .setOutputCol(\"token\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 10, - "status": "ok", - "timestamp": 1664907115890, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "XKIVvy07J-8F", - "outputId": "4d849602-6021-4bd7-d3e5-ac5367bfaa83" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "{Param(parent='Tokenizer_6d8479c7cd62', name='lazyAnnotator', doc='Whether this AnnotatorModel acts as lazy in RecursivePipelines'): False,\n", - " Param(parent='Tokenizer_6d8479c7cd62', name='targetPattern', doc='pattern to grab from text as token candidates. 
Defaults \\\\S+'): '\\\\S+',\n", - " Param(parent='Tokenizer_6d8479c7cd62', name='contextChars', doc='character list used to separate from token boundaries'): ['.',\n", - " ',',\n", - " ';',\n", - " ':',\n", - " '!',\n", - " '?',\n", - " '*',\n", - " '-',\n", - " '(',\n", - " ')',\n", - " '\"',\n", - " \"'\"],\n", - " Param(parent='Tokenizer_6d8479c7cd62', name='caseSensitiveExceptions', doc='Whether to care for case sensitiveness in exceptions'): True,\n", - " Param(parent='Tokenizer_6d8479c7cd62', name='minLength', doc='Set the minimum allowed length for each token'): 0,\n", - " Param(parent='Tokenizer_6d8479c7cd62', name='maxLength', doc='Set the maximum allowed length for each token'): 99999,\n", - " Param(parent='Tokenizer_6d8479c7cd62', name='inputCols', doc='previous annotations columns, if renamed'): ['document'],\n", - " Param(parent='Tokenizer_6d8479c7cd62', name='outputCol', doc='output annotation column. can be left default.'): 'token'}" - ] - }, - "execution_count": 49, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "tokenizer.extractParamMap()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "v-GQJmCWcYsL" - }, - "outputs": [], - "source": [ - "text = 'Peter Parker (Spiderman) is a nice guy and lives in New York but has no e-mail!'\n", - "\n", - "spark_df = spark.createDataFrame([[text]]).toDF(\"text\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 769, - "status": "ok", - "timestamp": 1664907129360, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "iAgf0v4ucraf", - "outputId": "ebfd89b0-6b4b-4931-ab85-4ce6e2457944" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - 
"+-------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+\n", - "| text| document| token|\n", - "+-------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+\n", - "|Peter Parker (Spiderman) is a nice guy and lives in New York but has no e-mail!|[{document, 0, 78, Peter Parker (Spiderman) is a nice guy and lives in New York but has no e-mail...|[{token, 0, 4, Peter, {sentence -> 0}, []}, {token, 6, 11, Parker, {sentence -> 0}, []}, {token, ...|\n", - "+-------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+\n", - "\n" - ] - } - ], - "source": [ - "doc_df = documentAssembler.transform(spark_df)\n", - "\n", - "token_df = tokenizer.fit(doc_df).transform(doc_df)\n", - "\n", - "token_df.show(truncate=100)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 408, - "status": "ok", - "timestamp": 1664907134196, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "sQ8eoN1mdEn-", - "outputId": "9be8ca18-b29c-46b5-92bd-45b5d38bb670" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "[Row(result=['Peter', 'Parker', '(', 'Spiderman', ')', 'is', 'a', 'nice', 'guy', 'and', 'lives', 'in', 'New', 'York', 'but', 'has', 'no', 
'e-mail', '!'])]" - ] - }, - "execution_count": 52, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "token_df.select('token.result').take(1)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "JTlinLhmfKTw" - }, - "outputs": [], - "source": [ - "tokenizer = Tokenizer() \\\n", - " .setInputCols([\"document\"]) \\\n", - " .setOutputCol(\"token\") \\\n", - " .setSplitChars(['-']) \\\n", - " .setContextChars(['?', '!']) \\\n", - " .addException(\"New York\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 431, - "status": "ok", - "timestamp": 1664907145292, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "KY5V7FcXfSrs", - "outputId": "bc00da7c-b9ae-45e4-8bfa-e18358139fa1" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "[Row(result=['Peter', 'Parker', '(Spiderman)', 'is', 'a', 'nice', 'guy', 'and', 'lives', 'in', 'New York', 'but', 'has', 'no', 'e', 'mail', '!'])]" - ] - }, - "execution_count": 54, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "token_df = tokenizer.fit(doc_df).transform(doc_df)\n", - "\n", - "token_df.select('token.result').take(1)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "AOn8d1tcBkK3" - }, - "source": [ - "## Regex Tokenizer" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 800, - "status": "ok", - "timestamp": 1664907175605, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "Y4ruucl4BnLJ", - "outputId": "2fc56369-51c5-4aa7-c1db-2126fa4585bb" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "|text 
|document |sentence |regexToken |\n", - "|1. T1-T2 DATE**[12/24/13] $1.99 () (10/12), ph+ 90%|[{document, 0, 50, 1. T1-T2 DATE**[12/24/13] $1.99 () (10/12), ph+ 90%, {sentence -> 0}, []}]|[{document, 0, 50, 1. T1-T2 DATE**[12/24/13] $1.99 () (10/12), ph+ 90%, {sentence -> 0}, []}]|[{token, 0, 0, 1, {sentence -> 0}, []}, {token, 2, 2, ., {sentence -> 0}, []}, {token, 4, 5, T1, {sentence -> 0}, []}, {token, 7, 7, -, {sentence -> 0}, []}, {token, 9, 10, T2, {sentence -> 0}, []}, {token, 12, 15, DATE, {sentence -> 0}, []}, {token, 17, 17, *, {sentence -> 0}, []}, {token, 19, 19, *, {sentence -> 0}, []}, {token, 21, 21, [, {sentence -> 0}, []}, {token, 23, 30, 12/24/13, {sentence -> 0}, []}, {token, 32, 32, ], {sentence -> 0}, []}, {token, 35, 35, $, {sentence -> 0}, []}, {token, 37, 37, 1, {sentence -> 0}, []}, {token, 39, 39, ., {sentence -> 0}, []}, {token, 41, 42, 99, {sentence -> 0}, []}, {token, 44, 45, (), {sentence -> 0}, []}, {token, 47, 53, (10/12), {sentence -> 0}, []}, {token, 55, 55, ,, {sentence -> 0}, []}, {token, 57, 58, ph, {sentence -> 0}, []}, {token, 60, 60, +, {sentence -> 0}, []}, {token, 62, 63, 90, {sentence -> 0}, []}, {token, 65, 65, %, {sentence -> 0}, []}]|\n", - "\n" - ] - } - ], - "source": [ - "from pyspark.sql.types import StringType\n", - "\n", - "content = \"1. 
T1-T2 DATE**[12/24/13] $1.99 () (10/12), ph+ 90%\"\n", - "pattern = \"\\\\s+|(?=[-.:;*+,$&%\\\\[\\\\]])|(?<=[-.:;*+,$&%\\\\[\\\\]])\"\n", - "\n", - "df = spark.createDataFrame([content], StringType()).withColumnRenamed(\"value\", \"text\")\n", - "\n", - "documenter = DocumentAssembler()\\\n", - " .setInputCol(\"text\")\\\n", - " .setOutputCol(\"document\")\n", - "\n", - "sentenceDetector = SentenceDetector()\\\n", - " .setInputCols(['document'])\\\n", - " .setOutputCol('sentence')\n", - "\n", - "regexTokenizer = RegexTokenizer() \\\n", - " .setInputCols([\"sentence\"]) \\\n", - " .setOutputCol(\"regexToken\") \\\n", - " .setPattern(pattern) \\\n", - " .setPositionalMask(False)\n", - "\n", - "docPatternRemoverPipeline = Pipeline().setStages([documenter,\n", - " sentenceDetector,\n", - " regexTokenizer])\n", - "\n", - "result = docPatternRemoverPipeline.fit(df).transform(df)\n", - "result.show(truncate=False)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 739 - }, - "executionInfo": { - "elapsed": 800, - "status": "ok", - "timestamp": 1664907181497, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "APA8zhR-F1WB", - "outputId": "dd8ef1a3-f06b-4614-d96c-3307893a7b2e" - }, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - "
\n", - " " - ], - "text/plain": [ - " regexToken\n", - "0 1\n", - "1 .\n", - "2 T1\n", - "3 -\n", - "4 T2\n", - "5 DATE\n", - "6 *\n", - "7 *\n", - "8 [\n", - "9 12/24/13\n", - "10 ]\n", - "11 $\n", - "12 1\n", - "13 .\n", - "14 99\n", - "15 ()\n", - "16 (10/12)\n", - "17 ,\n", - "18 ph\n", - "19 +\n", - "20 90\n", - "21 %" - ] - }, - "execution_count": 57, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "import pyspark.sql.functions as F\n", - "\n", - "result_df = result.select(F.explode(result.regexToken.result).alias('regexToken')).toPandas()\n", - "result_df" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "l_LM44ZzgYhs" - }, - "source": [ - "## Stacking Spark NLP Annotators in Spark ML Pipeline" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "bm0mUMQMhFPU" - }, - "source": [ - "Spark NLP provides an easy API to integrate with Spark ML Pipelines, and all Spark NLP annotators and transformers can be used within them. The Pipeline concept is therefore best explained through the official Spark ML documentation.\n", - "\n", - "What is a Pipeline anyway? In machine learning, it is common to run a sequence of algorithms to process and learn from data. \n", - "\n", - "Apache Spark ML represents such a workflow as a Pipeline, which consists of a sequence of PipelineStages (Transformers and Estimators) to be run in a specific order.\n", - "\n", - "In simple terms, a Pipeline chains multiple Transformers and Estimators together to specify a machine learning workflow.\n", - "\n", - "The figure below shows how a Pipeline is used at training time."
- ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "jK5AAYQqhRlG" - }, - "source": [ - "![image.png]()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "dwLlY7i4hhq1" - }, - "source": [ - "A Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator. These stages are run in order, and the input DataFrame is transformed as it passes through each stage. That is, the data are passed through the fitted pipeline in order. Each stage’s transform() method updates the dataset and passes it to the next stage. With the help of Pipelines, we can ensure that training and test data go through identical feature processing steps.\n", - "\n", - "Now let’s see how this can be done in Spark NLP using Annotators and Transformers. Assume that we have the following steps that need to be applied one by one on a data frame.\n", - "\n", - "- Split text into sentences\n", - "- Tokenize\n", - "\n", - "And here is how we code this pipeline up in Spark NLP." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "_2mZXDVehhDU" - }, - "outputs": [], - "source": [ - "from pyspark.ml import Pipeline\n", - "\n", - "documentAssembler = DocumentAssembler()\\\n", - " .setInputCol(\"text\")\\\n", - " .setOutputCol(\"document\")\n", - "\n", - "sentenceDetector = SentenceDetector()\\\n", - " .setInputCols(['document'])\\\n", - " .setOutputCol('sentences')\n", - "\n", - "tokenizer = Tokenizer() \\\n", - " .setInputCols([\"sentences\"]) \\\n", - " .setOutputCol(\"token\")\n", - "\n", - "nlpPipeline = Pipeline(stages=[documentAssembler, \n", - " sentenceDetector,\n", - " tokenizer])\n", - "\n", - "spark_df = spark.read.text('./sample-sentences-en.txt').toDF('text')\n", - "\n", - "pipelineModel = nlpPipeline.fit(spark_df)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 374, - "status": "ok", - "timestamp": 1664907213434, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "9Rq_CRWN6Zge", - "outputId": "22f861c9-ea47-4f60-9817-acb0e9c8cd53" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+-----------------------------------------------------------------------------+\n", - "|text |\n", - "+-----------------------------------------------------------------------------+\n", - "|Peter is a very good person. |\n", - "|My life in Russia is very interesting. |\n", - "|John and Peter are brothers. However they don't support each other that much.|\n", - "|Lucas Nogal Dunbercker is no longer happy. He has a good car though. |\n", - "|Europe is very culture rich. There are huge churches! and big houses! 
|\n", - "+-----------------------------------------------------------------------------+\n", - "\n" - ] - } - ], - "source": [ - "spark_df = spark.read.text('./sample-sentences-en.txt').toDF('text')\n", - "\n", - "spark_df.show(truncate=False)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "JuhTX4-Vk-cd" - }, - "outputs": [], - "source": [ - "result = pipelineModel.transform(spark_df)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 448, - "status": "ok", - "timestamp": 1664907217901, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "iaWf94QPlT51", - "outputId": "7a44f67b-af29-48c5-f830-203b51459e6e" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+\n", - "| text| document| sentences| token|\n", - "+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+\n", - "| Peter is a very good person.|[{document, 0, 27, Peter is a very go...|[{document, 0, 27, Peter is a very go...|[{token, 0, 4, Peter, {sentence -> 0}...|\n", - "| My life in Russia is very interesting.|[{document, 0, 37, My life in Russia ...|[{document, 0, 37, My life in Russia ...|[{token, 0, 1, My, {sentence -> 0}, [...|\n", - "|John and Peter are brothers. 
However ...|[{document, 0, 76, John and Peter are...|[{document, 0, 27, John and Peter are...|[{token, 0, 3, John, {sentence -> 0},...|\n", - "|Lucas Nogal Dunbercker is no longer h...|[{document, 0, 67, Lucas Nogal Dunber...|[{document, 0, 41, Lucas Nogal Dunber...|[{token, 0, 4, Lucas, {sentence -> 0}...|\n", - "|Europe is very culture rich. There ar...|[{document, 0, 68, Europe is very cul...|[{document, 0, 27, Europe is very cul...|[{token, 0, 5, Europe, {sentence -> 0...|\n", - "+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+\n", - "\n" - ] - } - ], - "source": [ - "result.show(truncate=40)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 4, - "status": "ok", - "timestamp": 1664907219318, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "zfz0_-eFlXzk", - "outputId": "13fe22c1-11a7-451b-dfc3-2bc50a653bd8" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "root\n", - " |-- text: string (nullable = true)\n", - " |-- document: array (nullable = true)\n", - " | |-- element: struct (containsNull = true)\n", - " | | |-- annotatorType: string (nullable = true)\n", - " | | |-- begin: integer (nullable = false)\n", - " | | |-- end: integer (nullable = false)\n", - " | | |-- result: string (nullable = true)\n", - " | | |-- metadata: map (nullable = true)\n", - " | | | |-- key: string\n", - " | | | |-- value: string (valueContainsNull = true)\n", - " | | |-- embeddings: array (nullable = true)\n", - " | | | |-- element: float (containsNull = false)\n", - " |-- sentences: array (nullable = true)\n", - " | |-- element: struct (containsNull = true)\n", - " | | |-- annotatorType: string (nullable = true)\n", - " | | |-- begin: integer 
(nullable = false)\n", - " | | |-- end: integer (nullable = false)\n", - " | | |-- result: string (nullable = true)\n", - " | | |-- metadata: map (nullable = true)\n", - " | | | |-- key: string\n", - " | | | |-- value: string (valueContainsNull = true)\n", - " | | |-- embeddings: array (nullable = true)\n", - " | | | |-- element: float (containsNull = false)\n", - " |-- token: array (nullable = true)\n", - " | |-- element: struct (containsNull = true)\n", - " | | |-- annotatorType: string (nullable = true)\n", - " | | |-- begin: integer (nullable = false)\n", - " | | |-- end: integer (nullable = false)\n", - " | | |-- result: string (nullable = true)\n", - " | | |-- metadata: map (nullable = true)\n", - " | | | |-- key: string\n", - " | | | |-- value: string (valueContainsNull = true)\n", - " | | |-- embeddings: array (nullable = true)\n", - " | | | |-- element: float (containsNull = false)\n", - "\n" - ] - } - ], - "source": [ - "result.printSchema()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 323, - "status": "ok", - "timestamp": 1664907221583, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "599Y4hQsl_mF", - "outputId": "a64afdac-956c-43f9-a386-f27fc17f0dc9" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "[Row(result=['Peter is a very good person.']),\n", - " Row(result=['My life in Russia is very interesting.']),\n", - " Row(result=['John and Peter are brothers.', \"However they don't support each other that much.\"])]" - ] - }, - "execution_count": 63, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "result.select('sentences.result').take(3)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 8, - "status": 
"ok", - "timestamp": 1664907223220, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "ehzhHXu6luaF", - "outputId": "b79b06aa-3f1a-47c4-d688-f1c78052f2c5" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "Row(token=[Row(annotatorType='token', begin=0, end=3, result='John', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=5, end=7, result='and', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=9, end=13, result='Peter', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=15, end=17, result='are', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=19, end=26, result='brothers', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=27, end=27, result='.', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=29, end=35, result='However', metadata={'sentence': '1'}, embeddings=[]), Row(annotatorType='token', begin=37, end=40, result='they', metadata={'sentence': '1'}, embeddings=[]), Row(annotatorType='token', begin=42, end=46, result=\"don't\", metadata={'sentence': '1'}, embeddings=[]), Row(annotatorType='token', begin=48, end=54, result='support', metadata={'sentence': '1'}, embeddings=[]), Row(annotatorType='token', begin=56, end=59, result='each', metadata={'sentence': '1'}, embeddings=[]), Row(annotatorType='token', begin=61, end=65, result='other', metadata={'sentence': '1'}, embeddings=[]), Row(annotatorType='token', begin=67, end=70, result='that', metadata={'sentence': '1'}, embeddings=[]), Row(annotatorType='token', begin=72, end=75, result='much', metadata={'sentence': '1'}, embeddings=[]), Row(annotatorType='token', begin=76, end=76, result='.', metadata={'sentence': '1'}, embeddings=[])])" - ] - }, - "execution_count": 64, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - 
"result.select('token').take(3)[2]" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "42dSp9dGmtmr" - }, - "source": [ - "## Normalizer" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "spOjcducnAsR" - }, - "source": [ - "Removes all dirty characters from text following a regex pattern and transforms words based on a provided dictionary\n", - "\n", - "`setCleanupPatterns(patterns)`: Regular expressions list for normalization, defaults [^A-Za-z]\n", - "\n", - "`setLowercase(value)`: lowercase tokens, default false\n", - "\n", - "`setSlangDictionary(path)`: txt file with delimited words to be transformed into something else\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 35 - }, - "executionInfo": { - "elapsed": 523, - "status": "ok", - "timestamp": 1664907226445, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "h6XKX2l7_Jqk", - "outputId": "49463712-e54d-4651-ca8e-5686901f2c7c" - }, - "outputs": [ - { - "data": { - "application/vnd.google.colaboratory.intrinsic+json": { - "type": "string" - }, - "text/plain": [ - "'!\"#$%&\\'()*+,-./:;<=>?@[\\\\]^_`{|}~'" - ] - }, - "execution_count": 65, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "import string\n", - "string.punctuation" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "6hq2ZBWl_WMu" - }, - "outputs": [], - "source": [ - "from sparknlp.base import *\n", - "from sparknlp.annotator import *\n", - "\n", - "documentAssembler = DocumentAssembler()\\\n", - " .setInputCol(\"text\")\\\n", - " .setOutputCol(\"document\")\n", - "\n", - "tokenizer = Tokenizer() \\\n", - " .setInputCols([\"document\"]) \\\n", - " .setOutputCol(\"token\")\n", - " \n", - "normalizer = Normalizer() \\\n", - " .setInputCols([\"token\"]) \\\n", - " 
.setOutputCol(\"normalized\")\\\n", - " .setLowercase(True)\\\n", - " .setCleanupPatterns([\"[^\\w\\d\\s]\"]) # remove punctuations (keep alphanumeric chars)\n", - " # if we don't set CleanupPatterns, it will only keep alphabet letters ([^A-Za-z])\n", - "\n", - "nlpPipeline = Pipeline(stages=[documentAssembler, \n", - " tokenizer,\n", - " normalizer])\n", - "\n", - "result = nlpPipeline.fit(spark_df).transform(spark_df)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 358, - "status": "ok", - "timestamp": 1664907247428, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "25YyJJYXppji", - "outputId": "6de6f1d5-668e-44d2-ea63-0245d72f5902" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "[DocumentAssembler_d4926309c3ee,\n", - " REGEX_TOKENIZER_a4789e51e51c,\n", - " NORMALIZER_81bbdb7b0bdb]" - ] - }, - "execution_count": 68, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "nlpPipeline.fit(spark_df).stages" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 795, - "status": "ok", - "timestamp": 1664907249847, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "oUp4au5eoYrw", - "outputId": "fc0588b5-3e3f-4895-f9ab-2800eaf253ce" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+\n", - "| text| document| token| normalized|\n", - 
"+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+\n", - "| Peter is a very good person.|[{document, 0, 27, Peter is a very go...|[{token, 0, 4, Peter, {sentence -> 0}...|[{token, 0, 4, peter, {sentence -> 0}...|\n", - "| My life in Russia is very interesting.|[{document, 0, 37, My life in Russia ...|[{token, 0, 1, My, {sentence -> 0}, [...|[{token, 0, 1, my, {sentence -> 0}, [...|\n", - "|John and Peter are brothers. However ...|[{document, 0, 76, John and Peter are...|[{token, 0, 3, John, {sentence -> 0},...|[{token, 0, 3, john, {sentence -> 0},...|\n", - "|Lucas Nogal Dunbercker is no longer h...|[{document, 0, 67, Lucas Nogal Dunber...|[{token, 0, 4, Lucas, {sentence -> 0}...|[{token, 0, 4, lucas, {sentence -> 0}...|\n", - "|Europe is very culture rich. There ar...|[{document, 0, 68, Europe is very cul...|[{token, 0, 5, Europe, {sentence -> 0...|[{token, 0, 5, europe, {sentence -> 0...|\n", - "+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+\n", - "\n" - ] - } - ], - "source": [ - "result.show(truncate=40)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 7, - "status": "ok", - "timestamp": 1664907250848, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "zxS0MEoM02wl", - "outputId": "c24602d7-feda-425a-c138-fee038f592cf" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "[Row(token=[Row(annotatorType='token', begin=0, end=4, result='Peter', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=6, end=7, result='is', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=9, end=9, result='a', 
metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=11, end=14, result='very', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=16, end=19, result='good', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=21, end=26, result='person', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=27, end=27, result='.', metadata={'sentence': '0'}, embeddings=[])]),\n", - " Row(token=[Row(annotatorType='token', begin=0, end=1, result='My', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=3, end=6, result='life', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=8, end=9, result='in', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=11, end=16, result='Russia', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=18, end=19, result='is', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=21, end=24, result='very', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=26, end=36, result='interesting', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=37, end=37, result='.', metadata={'sentence': '0'}, embeddings=[])])]" - ] - }, - "execution_count": 70, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "result.select('token').take(2)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 542, - "status": "ok", - "timestamp": 1664907252725, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "xYQcnFVloa8R", - "outputId": "ff7780a5-bfc7-4523-8820-b6577918bb5f" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "[Row(result=['peter', 'is', 'a', 'very', 'good', 'person']),\n", - " 
Row(result=['my', 'life', 'in', 'russia', 'is', 'very', 'interesting'])]" - ] - }, - "execution_count": 71, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "result.select('normalized.result').take(2)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 9, - "status": "ok", - "timestamp": 1664907253166, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "dy6TLD9c1LTg", - "outputId": "ad940727-807d-42c5-e9ca-a415d419e080" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "[Row(normalized=[Row(annotatorType='token', begin=0, end=4, result='peter', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=6, end=7, result='is', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=9, end=9, result='a', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=11, end=14, result='very', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=16, end=19, result='good', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=21, end=26, result='person', metadata={'sentence': '0'}, embeddings=[])]),\n", - " Row(normalized=[Row(annotatorType='token', begin=0, end=1, result='my', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=3, end=6, result='life', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=8, end=9, result='in', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=11, end=16, result='russia', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=18, end=19, result='is', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=21, end=24, result='very', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', 
begin=26, end=36, result='interesting', metadata={'sentence': '0'}, embeddings=[])])]" - ] - }, - "execution_count": 72, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "result.select('normalized').take(2)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "__iJ4EMeVb3n" - }, - "source": [ - "## Document Normalizer" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "wfLIupJFZi3c" - }, - "source": [ - "The DocumentNormalizer is an annotator that can be used after the DocumentAssembler to normalize documents once they have been processed and indexed.\n", - "It takes as input annotated documents of type Array(AnnotatorType.DOCUMENT) and outputs annotated documents of type AnnotatorType.DOCUMENT.\n", - "\n", - "Parameters are: \n", - "\n", - "| Parameter | Description |\n", - "| - | - |\n", - "|**inputCol** |input column name string which targets a column of type Array(AnnotatorType.DOCUMENT).|\n", - "|**outputCol** |output column name string which targets a column of type AnnotatorType.DOCUMENT.|\n", - "|**action** |action string to perform when applying regex patterns, either \"clean\" or \"extract\". Default is \"clean\".|\n", - "|**patterns** |normalization regex patterns; text matching them is removed from the document. Default is \"<[^>]*>\" (e.g., it removes all HTML tags).|\n", - "|**replacement** |replacement string to apply when regexes match. Default is \" \".|\n", - "|**lowercase** |whether to convert strings to lowercase. Default is False.|\n", - "|**policy** |policy used to remove matched patterns from the text. Valid policy values are: \"all\", \"pretty_all\", \"first\", \"pretty_first\". Default is \"pretty_all\". |\n", - "|**encoding** |file encoding to apply on normalized documents. Supported encodings are: UTF_8, UTF_16, US_ASCII, ISO-8859-1, UTF-16BE, UTF-16LE. 
Default is \"UTF-8\".|\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "8Tj1c6UYhSzK" - }, - "outputs": [], - "source": [ - "text = '''\n", - "
\n", - " THE WORLD'S LARGEST WEB DEVELOPER SITE\n", - "

THE WORLD'S LARGEST WEB DEVELOPER SITE

\n", - "

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum..

\n", - "
\n", - "\n", - "'''" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 407, - "status": "ok", - "timestamp": 1664907258361, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "vB6CgjLlhi63", - "outputId": "2de6c8cf-9575-43f1-ef9f-b7108b254796" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "textn", - "|text |\nn", - "|\\n
\\n THE WORLD'S LARGEST WEB DEVELOPER SITE\\n

THE WORLD'S LARGEST WEB DEVELOPER SITE

\\n

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum..

\\n
\\n\\n|\nn", - "\n" - ] - } - ], - "source": [ - "spark_df = spark.createDataFrame([[text]]).toDF(\"text\")\n", - "\n", - "spark_df.show(truncate=False)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 10, - "status": "ok", - "timestamp": 1664907261003, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "y0nOIC0GKEqX", - "outputId": "56dad3ba-7104-49c6-ac13-f1b807497d11" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "{Param(parent='DocumentNormalizer_99e443833d62', name='lazyAnnotator', doc='Whether this AnnotatorModel acts as lazy in RecursivePipelines'): False,\n", - " Param(parent='DocumentNormalizer_99e443833d62', name='action', doc='action to perform applying regex patterns on text'): 'clean',\n", - " Param(parent='DocumentNormalizer_99e443833d62', name='patterns', doc='normalization regex patterns which match will be removed from document. Defaults is <[^>]*>'): ['<[^>]*>'],\n", - " Param(parent='DocumentNormalizer_99e443833d62', name='replacement', doc='replacement string to apply when regexes match'): ' ',\n", - " Param(parent='DocumentNormalizer_99e443833d62', name='lowercase', doc='whether to convert strings to lowercase'): False,\n", - " Param(parent='DocumentNormalizer_99e443833d62', name='policy', doc='policy to remove pattern from text'): 'pretty_all',\n", - " Param(parent='DocumentNormalizer_99e443833d62', name='encoding', doc='file encoding to apply on normalized documents'): 'UTF-8',\n", - " Param(parent='DocumentNormalizer_99e443833d62', name='inputCols', doc='previous annotations columns, if renamed'): ['document'],\n", - " Param(parent='DocumentNormalizer_99e443833d62', name='outputCol', doc='output annotation column. 
can be left default.'): 'normalizedDocument'}" - ] - }, - "execution_count": 75, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "documentNormalizer = DocumentNormalizer() \\\n", - " .setInputCols(\"document\") \\\n", - " .setOutputCol(\"normalizedDocument\")\n", - "\n", - "documentNormalizer.extractParamMap()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "UA3tF3AOVZyY" - }, - "outputs": [], - "source": [ - "documentAssembler = DocumentAssembler() \\\n", - " .setInputCol('text') \\\n", - " .setOutputCol('document')\n", - "\n", - "#default\n", - "cleanUpPatterns = [\"<[^>]*>\"]\n", - "\n", - "documentNormalizer = DocumentNormalizer() \\\n", - " .setInputCols(\"document\") \\\n", - " .setOutputCol(\"normalizedDocument\") \\\n", - " .setAction(\"clean\") \\\n", - " .setPatterns(cleanUpPatterns) \\\n", - " .setReplacement(\" \") \\\n", - " .setPolicy(\"pretty_all\") \\\n", - " .setLowercase(True)\n", - "\n", - "docPatternRemoverPipeline = Pipeline() \\\n", - " .setStages([documentAssembler,\n", - " documentNormalizer])\n", - " \n", - "pipelineModel = docPatternRemoverPipeline.fit(spark_df).transform(spark_df)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 13, - "status": "ok", - "timestamp": 1664907283933, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "BMxU26nTgiEU", - "outputId": "b843aad4-596a-4cc5-a34f-bc548bc047d4" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "textn", - "|result |\nn", - "|[ the world's largest web developer site the world's largest web developer site lorem ipsum is simply dummy text of the printing and typesetting industry. 
lorem ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. it has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. it was popularised in the 1960s with the release of letraset sheets containing lorem ipsum passages, and more recently with desktop publishing software like aldus pagemaker including versions of lorem ipsum..]|\nn", - "\n" - ] - } - ], - "source": [ - "pipelineModel.select('normalizedDocument.result').show(truncate=False)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "_Kxd4cCegROX" - }, - "source": [ - "For more examples, see: https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/english/document-normalizer/document_normalizer_notebook.ipynb" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "g4jmmGZRwdwo" - }, - "source": [ - "## Stopwords Cleaner" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "7fgcop89yIT-" - }, - "source": [ - "This annotator takes a sequence of strings (e.g. the output of a Tokenizer, Normalizer, Lemmatizer, or Stemmer) and drops all the stop words from the input sequences.\n", - "\n", - "**Functions**:\n", - "\n", - "| Function | Description |\n", - "| - | - |\n", - "|**setStopWords** |The words to be filtered out. Array[String]|\n", - "|**setCaseSensitive** |Whether to do a case-sensitive comparison over the stop words.|\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "E7pS3jUcoedG" - }, - "outputs": [], - "source": [ - "stopwords_cleaner = StopWordsCleaner()\\\n", - " .setInputCols(\"token\")\\\n", - " .setOutputCol(\"cleanTokens\")\\\n", - " .setCaseSensitive(False)\\\n", - " #.setStopWords([\"no\", \"without\"]) (e.g. 
read a list of words from a txt)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 20, - "status": "ok", - "timestamp": 1664907288522, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "2mGt9RoD1ezP", - "outputId": "26b8d459-a014-45bf-f895-bdc1f9d60249" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "['i',\n", - " 'me',\n", - " 'my',\n", - " 'myself',\n", - " 'we',\n", - " 'our',\n", - " 'ours',\n", - " 'ourselves',\n", - " 'you',\n", - " 'your',\n", - " 'yours',\n", - " 'yourself',\n", - " 'yourselves',\n", - " 'he',\n", - " 'him',\n", - " 'his',\n", - " 'himself',\n", - " 'she',\n", - " 'her',\n", - " 'hers',\n", - " 'herself',\n", - " 'it',\n", - " 'its',\n", - " 'itself',\n", - " 'they',\n", - " 'them',\n", - " 'their',\n", - " 'theirs',\n", - " 'themselves',\n", - " 'what',\n", - " 'which',\n", - " 'who',\n", - " 'whom',\n", - " 'this',\n", - " 'that',\n", - " 'these',\n", - " 'those',\n", - " 'am',\n", - " 'is',\n", - " 'are',\n", - " 'was',\n", - " 'were',\n", - " 'be',\n", - " 'been',\n", - " 'being',\n", - " 'have',\n", - " 'has',\n", - " 'had',\n", - " 'having',\n", - " 'do',\n", - " 'does',\n", - " 'did',\n", - " 'doing',\n", - " 'a',\n", - " 'an',\n", - " 'the',\n", - " 'and',\n", - " 'but',\n", - " 'if',\n", - " 'or',\n", - " 'because',\n", - " 'as',\n", - " 'until',\n", - " 'while',\n", - " 'of',\n", - " 'at',\n", - " 'by',\n", - " 'for',\n", - " 'with',\n", - " 'about',\n", - " 'against',\n", - " 'between',\n", - " 'into',\n", - " 'through',\n", - " 'during',\n", - " 'before',\n", - " 'after',\n", - " 'above',\n", - " 'below',\n", - " 'to',\n", - " 'from',\n", - " 'up',\n", - " 'down',\n", - " 'in',\n", - " 'out',\n", - " 'on',\n", - " 'off',\n", - " 'over',\n", - " 'under',\n", - " 'again',\n", - " 'further',\n", - " 'then',\n", - " 'once',\n", - " 
'here',\n", - " 'there',\n", - " 'when',\n", - " 'where',\n", - " 'why',\n", - " 'how',\n", - " 'all',\n", - " 'any',\n", - " 'both',\n", - " 'each',\n", - " 'few',\n", - " 'more',\n", - " 'most',\n", - " 'other',\n", - " 'some',\n", - " 'such',\n", - " 'no',\n", - " 'nor',\n", - " 'not',\n", - " 'only',\n", - " 'own',\n", - " 'same',\n", - " 'so',\n", - " 'than',\n", - " 'too',\n", - " 'very',\n", - " 's',\n", - " 't',\n", - " 'can',\n", - " 'will',\n", - " 'just',\n", - " 'don',\n", - " 'should',\n", - " 'now',\n", - " \"i'll\",\n", - " \"you'll\",\n", - " \"he'll\",\n", - " \"she'll\",\n", - " \"we'll\",\n", - " \"they'll\",\n", - " \"i'd\",\n", - " \"you'd\",\n", - " \"he'd\",\n", - " \"she'd\",\n", - " \"we'd\",\n", - " \"they'd\",\n", - " \"i'm\",\n", - " \"you're\",\n", - " \"he's\",\n", - " \"she's\",\n", - " \"it's\",\n", - " \"we're\",\n", - " \"they're\",\n", - " \"i've\",\n", - " \"we've\",\n", - " \"you've\",\n", - " \"they've\",\n", - " \"isn't\",\n", - " \"aren't\",\n", - " \"wasn't\",\n", - " \"weren't\",\n", - " \"haven't\",\n", - " \"hasn't\",\n", - " \"hadn't\",\n", - " \"don't\",\n", - " \"doesn't\",\n", - " \"didn't\",\n", - " \"won't\",\n", - " \"wouldn't\",\n", - " \"shan't\",\n", - " \"shouldn't\",\n", - " \"mustn't\",\n", - " \"can't\",\n", - " \"couldn't\",\n", - " 'cannot',\n", - " 'could',\n", - " \"here's\",\n", - " \"how's\",\n", - " \"let's\",\n", - " 'ought',\n", - " \"that's\",\n", - " \"there's\",\n", - " \"what's\",\n", - " \"when's\",\n", - " \"where's\",\n", - " \"who's\",\n", - " \"why's\",\n", - " 'would']" - ] - }, - "execution_count": 79, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "stopwords_cleaner.getStopWords()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 691, - "status": "ok", - "timestamp": 1664907306395, - "user": { - "displayName": "Merve Ertas Uslu", - 
"userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "i7-YZcLRv8Y7", - "outputId": "8e3d241a-f173-4303-c78c-2bbbf93816f1" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+\n", - "| text| document| token| cleanTokens|\n", - "+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+\n", - "| Peter is a very good person.|[{document, 0, 27, Peter is a very go...|[{token, 0, 4, Peter, {sentence -> 0}...|[{token, 0, 4, Peter, {sentence -> 0}...|\n", - "| My life in Russia is very interesting.|[{document, 0, 37, My life in Russia ...|[{token, 0, 1, My, {sentence -> 0}, [...|[{token, 3, 6, life, {sentence -> 0},...|\n", - "|John and Peter are brothers. However ...|[{document, 0, 76, John and Peter are...|[{token, 0, 3, John, {sentence -> 0},...|[{token, 0, 3, John, {sentence -> 0},...|\n", - "|Lucas Nogal Dunbercker is no longer h...|[{document, 0, 67, Lucas Nogal Dunber...|[{token, 0, 4, Lucas, {sentence -> 0}...|[{token, 0, 4, Lucas, {sentence -> 0}...|\n", - "|Europe is very culture rich. 
There ar...|[{document, 0, 68, Europe is very cul...|[{token, 0, 5, Europe, {sentence -> 0...|[{token, 0, 5, Europe, {sentence -> 0...|\n", - "+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+\n", - "\n" - ] - } - ], - "source": [ - "documentAssembler = DocumentAssembler()\\\n", - " .setInputCol(\"text\")\\\n", - " .setOutputCol(\"document\")\n", - "\n", - "tokenizer = Tokenizer() \\\n", - " .setInputCols([\"document\"]) \\\n", - " .setOutputCol(\"token\")\n", - "\n", - "nlpPipeline = Pipeline(stages=[documentAssembler, \n", - " tokenizer,\n", - " stopwords_cleaner])\n", - " \n", - "spark_df = spark.read.text('./sample-sentences-en.txt').toDF('text')\n", - "\n", - "result = nlpPipeline.fit(spark_df).transform(spark_df)\n", - "\n", - "result.show(truncate=40)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 353, - "status": "ok", - "timestamp": 1664907309735, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "lp_oGHvt1waL", - "outputId": "3fc17681-9c6f-40ef-ed7c-4dad19438263" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "[Row(result=['Peter', 'good', 'person', '.'])]" - ] - }, - "execution_count": 81, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "result.select('cleanTokens.result').take(1)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "zN9HpW_Lix7_" - }, - "source": [ - "## Token Assembler" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 1332, - "status": "ok", - "timestamp": 1664907330120, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - 
"user_tz": -120 - }, - "id": "V1wMkz9cix7_", - "outputId": "73c0099d-7d2d-4b83-f91c-470d39120c34" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+\n", - "| text| document| sentences| token| normalized| cleanTokens| clean_text|\n", - "+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+\n", - "|Peter is a very g...|[{document, 0, 27...|[{document, 0, 27...|[{token, 0, 4, Pe...|[{token, 0, 4, Pe...|[{token, 0, 4, Pe...|[{document, 0, 16...|\n", - "|My life in Russia...|[{document, 0, 37...|[{document, 0, 37...|[{token, 0, 1, My...|[{token, 0, 1, My...|[{token, 3, 6, li...|[{document, 0, 22...|\n", - "|John and Peter ar...|[{document, 0, 76...|[{document, 0, 27...|[{token, 0, 3, Jo...|[{token, 0, 3, Jo...|[{token, 0, 3, Jo...|[{document, 0, 18...|\n", - "|Lucas Nogal Dunbe...|[{document, 0, 67...|[{document, 0, 41...|[{token, 0, 4, Lu...|[{token, 0, 4, Lu...|[{token, 0, 4, Lu...|[{document, 0, 34...|\n", - "|Europe is very cu...|[{document, 0, 68...|[{document, 0, 27...|[{token, 0, 5, Eu...|[{token, 0, 5, Eu...|[{token, 0, 5, Eu...|[{document, 0, 18...|\n", - "+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+\n", - "\n" - ] - } - ], - "source": [ - "documentAssembler = DocumentAssembler()\\\n", - " .setInputCol(\"text\")\\\n", - " .setOutputCol(\"document\")\n", - "\n", - "sentenceDetector = SentenceDetector()\\\n", - " .setInputCols(['document'])\\\n", - " .setOutputCol('sentences')\n", - "\n", - "tokenizer = Tokenizer() \\\n", - " .setInputCols([\"sentences\"]) \\\n", - " .setOutputCol(\"token\")\n", - "\n", - "normalizer = Normalizer() \\\n", - " .setInputCols([\"token\"]) \\\n", - " 
.setOutputCol(\"normalized\")\\\n", - " .setLowercase(False)\\\n", - "\n", - "stopwords_cleaner = StopWordsCleaner()\\\n", - " .setInputCols(\"normalized\")\\\n", - " .setOutputCol(\"cleanTokens\")\\\n", - " .setCaseSensitive(False)\\\n", - "\n", - "tokenassembler = TokenAssembler()\\\n", - " .setInputCols([\"sentences\", \"cleanTokens\"]) \\\n", - " .setOutputCol(\"clean_text\")\n", - "\n", - "\n", - "nlpPipeline = Pipeline(stages=[documentAssembler,\n", - " sentenceDetector,\n", - " tokenizer,\n", - " normalizer,\n", - " stopwords_cleaner,\n", - " tokenassembler])\n", - "\n", - "result = nlpPipeline.fit(spark_df).transform(spark_df)\n", - "result.show()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 343, - "status": "ok", - "timestamp": 1664907333538, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "eOMngM9y78tp", - "outputId": "0731fa64-683c-48ad-ab5f-a952c9e825cb" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "[Row(clean_text=[Row(annotatorType='document', begin=0, end=16, result='Peter good person', metadata={'sentence': '0'}, embeddings=[])])]" - ] - }, - "execution_count": 83, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# if we use TokenAssembler().setPreservePosition(True), the original borders will be preserved (dropped & unwanted chars will be replaced by spaces)\n", - "\n", - "result.select('clean_text').take(1)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 445, - "status": "ok", - "timestamp": 1664907335430, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "qEN-vDyQ_AUc", - "outputId": "8b182441-6e9e-4511-ed2c-f23423b4193d" 
- }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+-----------------------------------------------------------------------------+-----------------------------------+\n", - "|text |clean_text |\n", - "+-----------------------------------------------------------------------------+-----------------------------------+\n", - "|Peter is a very good person. |Peter good person |\n", - "|My life in Russia is very interesting. |life Russia interesting |\n", - "|John and Peter are brothers. However they don't support each other that much.|John Peter brothers |\n", - "|John and Peter are brothers. However they don't support each other that much.|However dont support much |\n", - "|Lucas Nogal Dunbercker is no longer happy. He has a good car though. |Lucas Nogal Dunbercker longer happy|\n", - "|Lucas Nogal Dunbercker is no longer happy. He has a good car though. |good car though |\n", - "|Europe is very culture rich. There are huge churches! and big houses! |Europe culture rich |\n", - "|Europe is very culture rich. There are huge churches! and big houses! |huge churches |\n", - "|Europe is very culture rich. There are huge churches! and big houses! |big houses |\n", - "+-----------------------------------------------------------------------------+-----------------------------------+\n", - "\n" - ] - } - ], - "source": [ - "result.select('text', F.explode(result.clean_text.result).alias('clean_text')).show(truncate=False)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 332 - }, - "executionInfo": { - "elapsed": 563, - "status": "ok", - "timestamp": 1664907337458, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "PMS_uxkXqOFu", - "outputId": "6088583c-c781-4fc9-fb67-8a07da21db7e" - }, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - "
\n", - " " - ], - "text/plain": [ - " text \\\n", - "0 Peter is a very good person. \n", - "1 My life in Russia is very interesting. \n", - "2 John and Peter are brothers. However they don'... \n", - "3 John and Peter are brothers. However they don'... \n", - "4 Lucas Nogal Dunbercker is no longer happy. He ... \n", - "5 Lucas Nogal Dunbercker is no longer happy. He ... \n", - "6 Europe is very culture rich. There are huge ch... \n", - "7 Europe is very culture rich. There are huge ch... \n", - "8 Europe is very culture rich. There are huge ch... \n", - "\n", - " clean_text \n", - "0 Peter good person \n", - "1 life Russia interesting \n", - "2 John Peter brothers \n", - "3 However dont support much \n", - "4 Lucas Nogal Dunbercker longer happy \n", - "5 good car though \n", - "6 Europe culture rich \n", - "7 huge churches \n", - "8 big houses " - ] - }, - "execution_count": 85, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "result.select('text', F.explode(result.clean_text.result).alias('clean_text')).toPandas()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 809, - "status": "ok", - "timestamp": 1664907340550, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "4NIAZNWcix8D", - "outputId": "37738f86-8c6b-4378-d301-8733fd135f8b" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+-----+---+-----------------------------------+--------+\n", - "|begin|end|result |sentence|\n", - "+-----+---+-----------------------------------+--------+\n", - "|0 |16 |Peter good person |0 |\n", - "|0 |22 |life Russia interesting |0 |\n", - "|0 |18 |John Peter brothers |0 |\n", - "|29 |53 |However dont support much |1 |\n", - "|0 |34 |Lucas Nogal Dunbercker longer happy|0 |\n", - "|43 |57 |good car though |1 |\n", - "|0 |18 |Europe 
culture rich |0 |\n", - "|29 |41 |huge churches |1 |\n", - "|54 |63 |big houses |2 |\n", - "+-----+---+-----------------------------------+--------+\n", - "\n" - ] - } - ], - "source": [ - "import pyspark.sql.functions as F\n", - "\n", - "result.withColumn(\n", - " \"tmp\", \n", - " F.explode(\"clean_text\")) \\\n", - " .select(\"tmp.*\").select(\"begin\",\"end\",\"result\",\"metadata.sentence\").show(truncate = False)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 847, - "status": "ok", - "timestamp": 1664907366551, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "zJLjfE_Qix8G", - "outputId": "f102695e-561c-48d7-9963-bde2a99c4e32" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+-----------------------------------------------------------------------------+-----------------------------------------------------+\n", - "|text |result |\n", - "+-----------------------------------------------------------------------------+-----------------------------------------------------+\n", - "|Peter is a very good person. |[Peter good person] |\n", - "|My life in Russia is very interesting. |[life Russia interesting] |\n", - "|John and Peter are brothers. However they don't support each other that much.|[John Peter brothers However dont support much] |\n", - "|Lucas Nogal Dunbercker is no longer happy. He has a good car though. |[Lucas Nogal Dunbercker longer happy good car though]|\n", - "|Europe is very culture rich. There are huge churches! and big houses! 
|[Europe culture rich huge churches big houses] |\n", - "+-----------------------------------------------------------------------------+-----------------------------------------------------+\n", - "\n" - ] - } - ], - "source": [ - "# if we hadn't used Sentence Detector, this would be what we got. (tokenizer gets document instead of sentences column)\n", - "\n", - "tokenizer = Tokenizer() \\\n", - " .setInputCols([\"document\"]) \\\n", - " .setOutputCol(\"token\")\n", - "\n", - "tokenassembler = TokenAssembler()\\\n", - " .setInputCols([\"document\", \"cleanTokens\"]) \\\n", - " .setOutputCol(\"clean_text\")\n", - "\n", - "nlpPipeline = Pipeline(stages=[documentAssembler,\n", - " tokenizer,\n", - " normalizer,\n", - " stopwords_cleaner,\n", - " tokenassembler])\n", - "\n", - "result = nlpPipeline.fit(spark_df).transform(spark_df)\n", - "result.select('text', 'clean_text.result').show(truncate=False)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 853, - "status": "ok", - "timestamp": 1664907371195, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "Pr419bEwix8H", - "outputId": "f638e140-d213-42bc-98a6-5a24c784c1ab" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+-----+---+---------------------------------------------------+--------+\n", - "|begin|end|result |sentence|\n", - "+-----+---+---------------------------------------------------+--------+\n", - "|0 |16 |Peter good person |0 |\n", - "|0 |22 |life Russia interesting |0 |\n", - "|0 |44 |John Peter brothers However dont support much |0 |\n", - "|0 |50 |Lucas Nogal Dunbercker longer happy good car though|0 |\n", - "|0 |43 |Europe culture rich huge churches big houses |0 |\n", - "+-----+---+---------------------------------------------------+--------+\n", - "\n" - ] - } - ], - 
"source": [ - "result.withColumn(\n", - " \"tmp\", \n", - " F.explode(\"clean_text\")) \\\n", - " .select(\"tmp.*\").select(\"begin\",\"end\",\"result\",\"metadata.sentence\").show(truncate = False)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "iPrXwVA9GCCS" - }, - "source": [ - "**IMPORTANT NOTE:**\n", - "\n", - "If you have some other steps & annotators in your pipeline that will need to use the tokens from cleaned text (assembled tokens), you will need to tokenize the processed text again as the original text is probably changed completely." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "XO_ZWY2z1Ka6" - }, - "source": [ - "## Stemmer" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "aE2PEX0x1NgQ" - }, - "source": [ - "Returns hard-stems out of words with the objective of retrieving the meaningful part of the word\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "0sN0gS3ayfHT" - }, - "outputs": [], - "source": [ - "stemmer = Stemmer() \\\n", - " .setInputCols([\"token\"]) \\\n", - " .setOutputCol(\"stem\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 794, - "status": "ok", - "timestamp": 1664907389078, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "VdJ-8aUy1RrC", - "outputId": "15a656ac-2a07-45f4-bd0d-7dcac1ca2fa2" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+\n", - "| text| document| token| stem|\n", - "+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+\n", 
- "| Peter is a very good person.|[{document, 0, 27, Peter is a very go...|[{token, 0, 4, Peter, {sentence -> 0}...|[{token, 0, 4, peter, {sentence -> 0}...|\n", - "| My life in Russia is very interesting.|[{document, 0, 37, My life in Russia ...|[{token, 0, 1, My, {sentence -> 0}, [...|[{token, 0, 1, my, {sentence -> 0}, [...|\n", - "|John and Peter are brothers. However ...|[{document, 0, 76, John and Peter are...|[{token, 0, 3, John, {sentence -> 0},...|[{token, 0, 3, john, {sentence -> 0},...|\n", - "|Lucas Nogal Dunbercker is no longer h...|[{document, 0, 67, Lucas Nogal Dunber...|[{token, 0, 4, Lucas, {sentence -> 0}...|[{token, 0, 4, luca, {sentence -> 0},...|\n", - "|Europe is very culture rich. There ar...|[{document, 0, 68, Europe is very cul...|[{token, 0, 5, Europe, {sentence -> 0...|[{token, 0, 5, europ, {sentence -> 0}...|\n", - "+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+\n", - "\n" - ] - } - ], - "source": [ - "documentAssembler = DocumentAssembler()\\\n", - " .setInputCol(\"text\")\\\n", - " .setOutputCol(\"document\")\n", - "\n", - "tokenizer = Tokenizer() \\\n", - " .setInputCols([\"document\"]) \\\n", - " .setOutputCol(\"token\")\n", - "\n", - "nlpPipeline = Pipeline(stages=[documentAssembler, \n", - " tokenizer,\n", - " stemmer])\n", - "\n", - "result = nlpPipeline.fit(spark_df).transform(spark_df)\n", - "\n", - "result.show(truncate=40)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 330, - "status": "ok", - "timestamp": 1664907391472, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "SvapcdcM1fWR", - "outputId": "f4596b57-a019-4344-9f09-561e37d6c020" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ 
- "+-------------------------------------------------------------------------------------------+\n", - "|result |\n", - "+-------------------------------------------------------------------------------------------+\n", - "|[peter, i, a, veri, good, person, .] |\n", - "|[my, life, in, russia, i, veri, interest, .] |\n", - "|[john, and, peter, ar, brother, ., howev, thei, don't, support, each, other, that, much, .]|\n", - "|[luca, nogal, dunberck, i, no, longer, happi, ., he, ha, a, good, car, though, .] |\n", - "|[europ, i, veri, cultur, rich, ., there, ar, huge, church, !, and, big, hous, !] |\n", - "+-------------------------------------------------------------------------------------------+\n", - "\n" - ] - } - ], - "source": [ - "result.select('stem.result').show(truncate=False)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "CKvKqZx9wtpj" - }, - "source": [ - "If you are using PySpark 3.1.2 or below, you should use this format:\n", - "``` \n", - "import pyspark.sql.functions as F\n", - "\n", - "result_df = result.select(F.explode(F.arrays_zip(\"token.result\", \"stem.result\")).alias(\"cols\")) \\\n", - " .select(F.expr(\"cols['0']\").alias(\"token\"),\n", - " F.expr(\"cols['1']\").alias(\"stem\")).toPandas()\n", - "\n", - "result_df.head(10)\n", - "```" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 363 - }, - "executionInfo": { - "elapsed": 901, - "status": "ok", - "timestamp": 1664907414472, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "wCEPeXr-1iXR", - "outputId": "75616bdc-6018-4e2b-850a-20cfce3c6066" - }, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - "
\n", - " " - ], - "text/plain": [ - " token stem\n", - "0 Peter peter\n", - "1 is i\n", - "2 a a\n", - "3 very veri\n", - "4 good good\n", - "5 person person\n", - "6 . .\n", - "7 My my\n", - "8 life life\n", - "9 in in" - ] - }, - "execution_count": 92, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "import pyspark.sql.functions as F\n", - "\n", - "result_df = result.select(F.explode(F.arrays_zip(result.token.result, \n", - " result.stem.result)).alias(\"cols\")) \\\n", - " .select(F.expr(\"cols['0']\").alias(\"token\"),\n", - " F.expr(\"cols['1']\").alias(\"stem\")).toPandas()\n", - "\n", - "result_df.head(10)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "On3UrCoM2RFQ" - }, - "source": [ - "## Lemmatizer" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "onCYFXGO2VSk" - }, - "source": [ - "Retrieves lemmas out of words with the objective of returning a base dictionary word " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "gZYXURzi3N2T" - }, - "outputs": [], - "source": [ - "!wget -q https://raw.githubusercontent.com/mahavivo/vocabulary/master/lemmas/AntBNC_lemmas_ver_001.txt" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "2y_woCty2QXj" - }, - "outputs": [], - "source": [ - "lemmatizer = Lemmatizer() \\\n", - " .setInputCols([\"token\"]) \\\n", - " .setOutputCol(\"lemma\") \\\n", - " .setDictionary(\"./AntBNC_lemmas_ver_001.txt\", value_delimiter =\"\\t\", key_delimiter = \"->\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 10, - "status": "ok", - "timestamp": 1664907421866, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "eR19i9uCKytw", - "outputId": "89c7af05-0c2f-47ce-ca34-db51ed1d888a" - }, - "outputs": [ - { - "data": { 
- "text/plain": [ - "{Param(parent='Lemmatizer_d115c22dc86c', name='lazyAnnotator', doc='Whether this AnnotatorModel acts as lazy in RecursivePipelines'): False,\n", - " Param(parent='Lemmatizer_d115c22dc86c', name='formCol', doc='Column that correspends to CoNLLU(formCol=) output'): 'form',\n", - " Param(parent='Lemmatizer_d115c22dc86c', name='lemmaCol', doc='Column that correspends to CoNLLU(lemmaCol=) output'): 'lemma',\n", - " Param(parent='Lemmatizer_d115c22dc86c', name='inputCols', doc='previous annotations columns, if renamed'): ['token'],\n", - " Param(parent='Lemmatizer_d115c22dc86c', name='outputCol', doc='output annotation column. can be left default.'): 'lemma',\n", - " Param(parent='Lemmatizer_d115c22dc86c', name='dictionary', doc=\"lemmatizer external dictionary. needs 'keyDelimiter' and 'valueDelimiter' in options for parsing target text\"): JavaObject id=o3141}" - ] - }, - "execution_count": 95, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "lemmatizer.extractParamMap()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 3015, - "status": "ok", - "timestamp": 1664907440131, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "dxSeQ0yz16cv", - "outputId": "edd595b7-7f74-4bd6-d451-6472dcecffd8" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+--------------------+--------------------+--------------------+--------------------+--------------------+\n", - "| text| document| token| stem| lemma|\n", - "+--------------------+--------------------+--------------------+--------------------+--------------------+\n", - "|Peter is a very g...|[{document, 0, 27...|[{token, 0, 4, Pe...|[{token, 0, 4, pe...|[{token, 0, 4, Pe...|\n", - "|My life in Russia...|[{document, 0, 37...|[{token, 0, 1, My...|[{token, 0, 1, 
my...|[{token, 0, 1, My...|\n", - "|John and Peter ar...|[{document, 0, 76...|[{token, 0, 3, Jo...|[{token, 0, 3, jo...|[{token, 0, 3, Jo...|\n", - "|Lucas Nogal Dunbe...|[{document, 0, 67...|[{token, 0, 4, Lu...|[{token, 0, 4, lu...|[{token, 0, 4, Lu...|\n", - "|Europe is very cu...|[{document, 0, 68...|[{token, 0, 5, Eu...|[{token, 0, 5, eu...|[{token, 0, 5, Eu...|\n", - "+--------------------+--------------------+--------------------+--------------------+--------------------+\n", - "\n" - ] - } - ], - "source": [ - "documentAssembler = DocumentAssembler()\\\n", - " .setInputCol(\"text\")\\\n", - " .setOutputCol(\"document\")\n", - "\n", - "tokenizer = Tokenizer() \\\n", - " .setInputCols([\"document\"]) \\\n", - " .setOutputCol(\"token\")\n", - "\n", - "stemmer = Stemmer() \\\n", - " .setInputCols([\"token\"]) \\\n", - " .setOutputCol(\"stem\")\n", - "\n", - "nlpPipeline = Pipeline(stages=[documentAssembler, \n", - " tokenizer,\n", - " stemmer,\n", - " lemmatizer])\n", - "\n", - "result = nlpPipeline.fit(spark_df).transform(spark_df)\n", - "result.show()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 342, - "status": "ok", - "timestamp": 1664907442014, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "3LVBY-fL3kmv", - "outputId": "1177fe09-c218-47e1-df47-c02eb0f6a238" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+---------------------------------------------------------------------------------------------+\n", - "|result |\n", - "+---------------------------------------------------------------------------------------------+\n", - "|[Peter, be, a, very, good, person, .] |\n", - "|[My, life, in, Russia, be, very, interest, .] 
|\n", - "|[John, and, Peter, be, brother, ., However, they, don't, support, each, other, that, much, .]|\n", - "|[Lucas, Nogal, Dunbercker, be, no, long, happy, ., He, have, a, good, car, though, .] |\n", - "|[Europe, be, very, culture, rich, ., There, be, huge, church, !, and, big, house, !] |\n", - "+---------------------------------------------------------------------------------------------+\n", - "\n" - ] - } - ], - "source": [ - "result.select('lemma.result').show(truncate=False)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 363 - }, - "executionInfo": { - "elapsed": 445, - "status": "ok", - "timestamp": 1664907445211, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "QAJCpPbW3oZq", - "outputId": "16de8a7f-1e88-44f8-a115-dea1e1cca4da" - }, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - "
\n", - " " - ], - "text/plain": [ - " token stem lemma\n", - "0 Peter peter Peter\n", - "1 is i be\n", - "2 a a a\n", - "3 very veri very\n", - "4 good good good\n", - "5 person person person\n", - "6 . . .\n", - "7 My my My\n", - "8 life life life\n", - "9 in in in" - ] - }, - "execution_count": 98, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "result_df = result.select(F.explode(F.arrays_zip(result.token.result, \n", - " result.stem.result, \n", - " result.lemma.result)).alias(\"cols\")) \\\n", - " .select(F.expr(\"cols['0']\").alias(\"token\"),\n", - " F.expr(\"cols['1']\").alias(\"stem\"),\n", - " F.expr(\"cols['2']\").alias(\"lemma\")).toPandas()\n", - "\n", - "result_df.head(10)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "EgdWV7yFix8e" - }, - "source": [ - "## NGram Generator" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "TCuzINkOix8f" - }, - "source": [ - "NGramGenerator annotator takes as input a sequence of strings (e.g. the output of a `Tokenizer`, `Normalizer`, `Stemmer`, `Lemmatizer`, and `StopWordsCleaner`). \n", - "\n", - "The parameter n is used to determine the number of terms in each n-gram. 
The output will consist of a sequence of n-grams where each n-gram is represented by a space-delimited string of n consecutive words with annotatorType `CHUNK` same as the Chunker annotator.\n", - "\n", - "Functions:\n", - "\n", - "`setN:` number elements per n-gram (>=1)\n", - "\n", - "`setEnableCumulative:` whether to calculate just the actual n-grams or all n-grams from 1 through n\n", - "\n", - "`setDelimiter:` Glue character used to join the tokens" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 805, - "status": "ok", - "timestamp": 1664907463529, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "DxhS3L0_ix8f", - "outputId": "1177d632-83cd-468a-c4bd-f1362790595e" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n", - "| result|\n", - "+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n", - "| [Peter, is, a, very, good, person, ., Peter_is, is_a, a_very, very_good, good_person, person_., Peter_is_a, is_a_very, a_very_good, very_good_person, good_person_.]|\n", - "|[My, life, in, Russia, is, very, interesting, ., My_life, life_in, in_Russia, Russia_is, is_very, very_interesting, interesting_., My_life_in, life_in_Russia, in_Russia_is, Russia_is_very, is_very_...|\n", - "|[John, and, Peter, are, brothers, ., However, they, don't, support, each, other, that, much, ., John_and, and_Peter, Peter_are, are_brothers, brothers_., ._However, However_they, they_don't, don't_...|\n", - "|[Lucas, 
Nogal, Dunbercker, is, no, longer, happy, ., He, has, a, good, car, though, ., Lucas_Nogal, Nogal_Dunbercker, Dunbercker_is, is_no, no_longer, longer_happy, happy_., ._He, He_has, has_a, a_...|\n", - "|[Europe, is, very, culture, rich, ., There, are, huge, churches, !, and, big, houses, !, Europe_is, is_very, very_culture, culture_rich, rich_., ._There, There_are, are_huge, huge_churches, churche...|\n", - "+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n", - "\n" - ] - } - ], - "source": [ - "ngrams_cum = NGramGenerator() \\\n", - " .setInputCols([\"token\"]) \\\n", - " .setOutputCol(\"ngrams\") \\\n", - " .setN(3) \\\n", - " .setEnableCumulative(True)\\\n", - " .setDelimiter(\"_\") # Default is space\n", - " \n", - "# With setEnableCumulative(True), setN(3) emits unigrams, bigrams, and trigrams.\n", - "\n", - "nlpPipeline = Pipeline(stages=[documentAssembler, \n", - " tokenizer,\n", - " ngrams_cum])\n", - "\n", - "result = nlpPipeline.fit(spark_df).transform(spark_df)\n", - "result.select('ngrams.result').show(truncate=200)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 319, - "status": "ok", - "timestamp": 1664907467569, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "oCVMPxBGGvXu", - "outputId": "4ccee400-375c-4c24-b38c-81f97a955adc" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n", - "| result|\n", - 
"+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n", - "| [Peter_is_a, is_a_very, a_very_good, very_good_person, good_person_.]|\n", - "| [My_life_in, life_in_Russia, in_Russia_is, Russia_is_very, is_very_interesting, very_interesting_.]|\n", - "|[John_and_Peter, and_Peter_are, Peter_are_brothers, are_brothers_., brothers_._However, ._However_they, However_they_don't, they_don't_support, don't_support_each, support_each_other, each_other_th...|\n", - "| [Lucas_Nogal_Dunbercker, Nogal_Dunbercker_is, Dunbercker_is_no, is_no_longer, no_longer_happy, longer_happy_., happy_._He, ._He_has, He_has_a, has_a_good, a_good_car, good_car_though, car_though_.]|\n", - "|[Europe_is_very, is_very_culture, very_culture_rich, culture_rich_., rich_._There, ._There_are, There_are_huge, are_huge_churches, huge_churches_!, churches_!_and, !_and_big, and_big_houses, big_ho...|\n", - "+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n", - "\n" - ] - } - ], - "source": [ - "ngrams_nonCum = NGramGenerator() \\\n", - " .setInputCols([\"token\"]) \\\n", - " .setOutputCol(\"ngrams_v2\") \\\n", - " .setN(3) \\\n", - " .setEnableCumulative(False)\\\n", - " .setDelimiter(\"_\") # Default is space\n", - " \n", - "ngrams_nonCum.transform(result).select('ngrams_v2.result').show(truncate=200)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "C55M47nKCL3E" - }, - "source": [ - "## TextMatcher" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "8hLITrkJICKO" - }, - "source": [ - "Annotator to match entire phrases (by token) provided in a file against a Document\n", - "\n", - "Functions:\n", - "\n", - "`setEntities(path, format, options)`: Provides a file with phrases to 
match. Default: Looks up path in configuration.\n", - "\n", - "`path`: a path to a file that contains the entities in the specified format.\n", - "\n", - "`readAs`: the format of the file, one of {ReadAs.LINE_BY_LINE, ReadAs.SPARK_DATASET}. Defaults to LINE_BY_LINE.\n", - "\n", - "`options`: a map of additional parameters. Defaults to {“format”: “text”}.\n", - "\n", - "`entityValue`: value for the entity metadata field, indicating which TextMatcher produced each chunk when several TextMatchers are used.\n", - "\n", - "`mergeOverlapping`: whether to merge overlapping matched chunks. Defaults to false.\n", - "\n", - "`caseSensitive`: whether matching is case-sensitive. Defaults to true.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 327, - "status": "ok", - "timestamp": 1664907471907, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "l_DrFKDqK7_-", - "outputId": "b627733a-cf36-44c9-a6a0-0e4cc309d600" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "{Param(parent='TextMatcher_06501f2ac688', name='lazyAnnotator', doc='Whether this AnnotatorModel acts as lazy in RecursivePipelines'): False,\n", - " Param(parent='TextMatcher_06501f2ac688', name='caseSensitive', doc='whether to match regardless of case. Defaults true'): True,\n", - " Param(parent='TextMatcher_06501f2ac688', name='mergeOverlapping', doc='whether to merge overlapping matched chunks. Defaults false'): False,\n", - " Param(parent='TextMatcher_06501f2ac688', name='inputCols', doc='previous annotations columns, if renamed'): ['document',\n", - " 'token'],\n", - " Param(parent='TextMatcher_06501f2ac688', name='outputCol', doc='output annotation column. 
can be left default.'): 'matched_entities'}" - ] - }, - "execution_count": 101, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "entity_extractor = TextMatcher() \\\n", - " .setInputCols([\"document\",'token'])\\\n", - " .setOutputCol(\"matched_entities\")\\\n", - "\n", - "entity_extractor.extractParamMap()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "ZyY6vL-gCOnV" - }, - "outputs": [], - "source": [ - "! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Public/data/news_category_train.csv\n", - "\n", - "news_df = spark.read \\\n", - " .option(\"header\", True) \\\n", - " .csv(\"news_category_train.csv\")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 8, - "status": "ok", - "timestamp": 1664907474995, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "mY7ZBEpwESzO", - "outputId": "87352d8f-31e8-4035-e9c4-7da39e0c7db8" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+--------+--------------------------------------------------+\n", - "|category| description|\n", - "+--------+--------------------------------------------------+\n", - "|Business| Short sellers, Wall Street's dwindling band of...|\n", - "|Business| Private investment firm Carlyle Group, which h...|\n", - "|Business| Soaring crude prices plus worries about the ec...|\n", - "|Business| Authorities have halted oil export flows from ...|\n", - "|Business| Tearaway world oil prices, toppling records an...|\n", - "+--------+--------------------------------------------------+\n", - "only showing top 5 rows\n", - "\n" - ] - } - ], - "source": [ - "news_df.show(5, truncate=50)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - 
"metadata": { - "id": "it2UjbV_ZRHp" - }, - "outputs": [], - "source": [ - "# write the target entities to txt file \n", - "\n", - "entities = ['Wall Street', 'USD', 'stock', 'NYSE']\n", - "with open ('financial_entities.txt', 'w') as f:\n", - " for i in entities:\n", - " f.write(i+'\\n')\n", - "\n", - "\n", - "entities = ['soccer', 'world cup', 'Messi', 'FC Barcelona']\n", - "with open ('sport_entities.txt', 'w') as f:\n", - " for i in entities:\n", - " f.write(i+'\\n')\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "ExRo9nnCCgG3" - }, - "outputs": [], - "source": [ - "documentAssembler = DocumentAssembler()\\\n", - " .setInputCol(\"description\")\\\n", - " .setOutputCol(\"document\")\n", - "\n", - "tokenizer = Tokenizer() \\\n", - " .setInputCols([\"document\"]) \\\n", - " .setOutputCol(\"token\")\n", - "\n", - "financial_entity_extractor = TextMatcher() \\\n", - " .setInputCols([\"document\",'token'])\\\n", - " .setOutputCol(\"financial_entities\")\\\n", - " .setEntities(\"financial_entities.txt\")\\\n", - " .setCaseSensitive(False)\\\n", - " .setEntityValue('financial_entity')\n", - "\n", - "sport_entity_extractor = TextMatcher() \\\n", - " .setInputCols([\"document\",'token'])\\\n", - " .setOutputCol(\"sport_entities\")\\\n", - " .setEntities(\"sport_entities.txt\")\\\n", - " .setCaseSensitive(False)\\\n", - " .setEntityValue('sport_entity')\n", - "\n", - "nlpPipeline = Pipeline(\n", - " stages=[\n", - " documentAssembler, \n", - " tokenizer,\n", - " financial_entity_extractor,\n", - " sport_entity_extractor\n", - " ])\n", - "\n", - "result = nlpPipeline.fit(news_df).transform(news_df)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 313, - "status": "ok", - "timestamp": 1664907514547, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - 
"id": "NctznXrSFb7Y", - "outputId": "01bd2324-1723-460a-ac9e-2915541e152e" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "[Row(result=[], result=[]),\n", - " Row(result=[], result=[]),\n", - " Row(result=['stock'], result=[])]" - ] - }, - "execution_count": 106, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "result.select('financial_entities.result','sport_entities.result').take(3)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 5846, - "status": "ok", - "timestamp": 1664907528802, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "8nm65RMNNBha", - "outputId": "c2d130ff-0837-4076-9f31-049342f9fc03" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+----------------------------------------------------------------------+----------------------------------+-------------------+\n", - "| text| financial_matches| sport_matches|\n", - "+----------------------------------------------------------------------+----------------------------------+-------------------+\n", - "|\"Company launched the biggest electronic auction of stock in Wall S...| [stock, Wall Street]| []|\n", - "|Google, Inc. significantly cut the expected share price for its ini...| [stock, stock]| []|\n", - "|Google, Inc. significantly cut the expected share price this mornin...| [stock, stock]| []|\n", - "| Shares of Air Canada (AC.TO) fell by more than half on Wednesday,...| [Stock, stock]| []|\n", - "|Stock prices are lower in moderate trading. 
The Dow Jones Industria...| [Stock, Stock]| []|\n", - "|The bad news just keeps pouring in for mutual fund manager Janus Ca...| [NYSE, NYSE]| []|\n", - "| Shaun Wright Phillips scored in his international debut as Englan...| []|[soccer, World Cup]|\n", - "|NEWCASTLE, ENGLAND - England deservedly beat Ukraine 3-0 today in t...| []|[soccer, World Cup]|\n", - "|MONTREAL (Reuters) - Shares of Air Canada (AC.TO: Quote, Profile, R...| [Stock, stock]| []|\n", - "|\"SAN JOSE, California - On the cusp of its voyage into public tradi...|[stock, Wall Street, stock, Stock]| []|\n", - "|\"Shortly before noon today, Google Inc. stock began trading under t...| [stock, stock]| []|\n", - "|roundup Plus: EA to take World Cup soccer to Xbox...IBM chalks up t...| []|[World Cup, soccer]|\n", - "|The U.S. Securities and Exchange Commission yesterday approved Goog...| [stock, stock]| []|\n", - "|After a bumpy ride toward becoming a publicly traded company, Googl...| [stock, stock]| []|\n", - "|In the most highly anticipated Wall Street debut since the heady da...| [Wall Street, stock]| []|\n", - "|NEW YORK Despite voluble skepticism among investors, Google #39;s s...| [stock, stock]| []|\n", - "|If only the rest of my investments worked out this way. One week ag...| [stock, stock]| []|\n", - "| U.S. stocks to watch: GOOGLE INC. (GOOG.O) Google shares jumped 18...| [stock, stock]| []|\n", - "|\" U.S. stocks to watch: GOOGLE INC. 
<A HREF=\"\"http://www.invest...| [stock, stock]| []|\n", - "|roundup Plus: KDE updates Linux desktop...EA to take World Cup socc...| []|[World Cup, soccer]|\n", - "+----------------------------------------------------------------------+----------------------------------+-------------------+\n", - "only showing top 20 rows\n", - "\n" - ] - } - ], - "source": [ - "result.select('description','financial_entities.result','sport_entities.result')\\\n", - " .toDF('text','financial_matches','sport_matches').filter((F.size('financial_matches')>1) | (F.size('sport_matches')>1))\\\n", - " .show(truncate=70)\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 363 - }, - "executionInfo": { - "elapsed": 35765, - "status": "ok", - "timestamp": 1664907567503, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "-bVEDIblFoL6", - "outputId": "4a67c589-7268-4f09-caa2-e4ec3b558e28" - }, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - "
\n", - "
\n", - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
 clinical_entities begin end
0 stock 112 116
1 stock 114 118
2 stock 45 49
3 stock 126 130
4 stock 188 192
5 stock 52 56
6 Wall Street 61 71
7 stock 70 74
8 stock 143 147
9 stock 294 298

\n", - " \n", - " \n", - " \n", - "\n", - " \n", - "
\n", - "
\n", - " " - ], - "text/plain": [ - " clinical_entities begin end\n", - "0 stock 112 116\n", - "1 stock 114 118\n", - "2 stock 45 49\n", - "3 stock 126 130\n", - "4 stock 188 192\n", - "5 stock 52 56\n", - "6 Wall Street 61 71\n", - "7 stock 70 74\n", - "8 stock 143 147\n", - "9 stock 294 298" - ] - }, - "execution_count": 108, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "result_df = result.select(F.explode(F.arrays_zip(result.financial_entities.result, \n", - " result.financial_entities.begin, \n", - " result.financial_entities.end)).alias(\"cols\")) \\\n", - " .select(F.expr(\"cols['0']\").alias(\"clinical_entities\"),\n", - " F.expr(\"cols['1']\").alias(\"begin\"),\n", - " F.expr(\"cols['2']\").alias(\"end\")).toPandas()\n", - "\n", - "result_df.head(10)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "maNIjBznix8x" - }, - "source": [ - "## RegexMatcher" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 1367, - "status": "ok", - "timestamp": 1664907568862, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "1FCcjxSsaeEk", - "outputId": "2fa8bd5d-a2d1-47b1-e912-08caa6d1aa76" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+--------------------------------------------------+\n", - "| text|\n", - "+--------------------------------------------------+\n", - "|The human KCNJ9 (Kir 3.3, GIRK3) is a member of...|\n", - "|BACKGROUND: At present, it is one of the most i...|\n", - "|OBJECTIVE: To investigate the relationship betw...|\n", - "|Combined EEG/fMRI recording has been used to lo...|\n", - "|Kohlschutter syndrome is a rare neurodegenerati...|\n", - "|Statistical analysis of neuroimages is commonly...|\n", - "|The synthetic DOX-LNA conjugate was characteriz...|\n", - "|Our objective was to 
compare three different me...|\n", - "|We conducted a phase II study to assess the eff...|\n", - "|\"Monomeric sarcosine oxidase (MSOX) is a flavoe...|\n", - "|We presented the tachinid fly Exorista japonica...|\n", - "|The literature dealing with the water conductin...|\n", - "|A novel approach to synthesize chitosan-O-isopr...|\n", - "|An HPLC-ESI-MS-MS method has been developed for...|\n", - "|The localizing and lateralizing values of eye a...|\n", - "|OBJECTIVE: To evaluate the effectiveness and ac...|\n", - "|For the construction of new combinatorial libra...|\n", - "|We report the results of a screen for genetic a...|\n", - "|Intraparenchymal pericatheter cyst is rarely re...|\n", - "|It is known that patients with Klinefelter's sy...|\n", - "+--------------------------------------------------+\n", - "only showing top 20 rows\n", - "\n" - ] - } - ], - "source": [ - "! wget -q\thttps://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/pubmed/pubmed-sample.csv\n", - "\n", - "pubMedDF = spark.read\\\n", - " .option(\"header\", \"true\")\\\n", - " .csv(\"./pubmed-sample.csv\")\\\n", - " .filter(\"AB IS NOT null\")\\\n", - " .withColumnRenamed(\"AB\", \"text\")\\\n", - " .drop(\"TI\")\n", - "\n", - "pubMedDF.show(truncate=50)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "jT2DkSEMix8y" - }, - "outputs": [], - "source": [ - "rules = '''\n", - "renal\\s\\w+, started with 'renal'\n", - "cardiac\\s\\w+, started with 'cardiac'\n", - "\\w*ly\\b, ending with 'ly'\n", - "\\S*\\d+\\S*, match any word that contains numbers\n", - "(\\d+).?(\\d*)\\s*(mg|ml|g), match medication metrics\n", - "'''\n", - "\n", - "with open('regex_rules.txt', 'w') as f:\n", - " \n", - " f.write(rules)\n", - " " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 9, - "status": "ok", - "timestamp": 1664907568863, - "user": { - 
"displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "9xff1kP8LFHr", - "outputId": "9a418bef-a622-4e20-b2ae-953ab4ee75f9" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "{Param(parent='RegexMatcher_5f6b3c0b4b06', name='lazyAnnotator', doc='Whether this AnnotatorModel acts as lazy in RecursivePipelines'): False,\n", - " Param(parent='RegexMatcher_5f6b3c0b4b06', name='strategy', doc='MATCH_FIRST|MATCH_ALL|MATCH_COMPLETE'): 'MATCH_ALL'}" - ] - }, - "execution_count": 111, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "RegexMatcher().extractParamMap()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 333, - "status": "ok", - "timestamp": 1664907569192, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "ejUrjIGjix8z", - "outputId": "bf33296d-abac-4f3b-d35e-089fdced0b33" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "[Row(result=['inwardly', 'family', 'spansapproximately', 'byapproximately', 'approximately', 'respectively', 'poly', 'KCNJ9', '3.3,', 'GIRK3)', 'KCNJ9', '1q21-23', '7.6', '2.2', '2.6', 'identified14', 'aVal366Ala', '8', 'KCNJ9', 'KCNJ9', '9 g']),\n", - " Row(result=['previously', 'previously', 'intravenously', 'previously', '25', 'mg/m(2)', '1', '8', 'a3', '50', '20.0%', '(10', '50;', '95%', 'interval,10.0-33.7%).', '58.0%', '[10', '18', '50].', '(50%', '115.0', '17.3%', '52).', '25 mg']),\n", - " Row(result=['renal failure', 'cardiac surgery', 'cardiac surgery', 'cardiac surgical', 'early', 'statistically', 'analy', '1995', '2005', '=9796).', '2.9', '11years).', '11.3%', '1105),', '7.2%', '30%', '0.0001),', '1.55,95%', '1.42-1.70,', '0.0001).'])]" - ] - }, - "execution_count": 112, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - 
"documentAssembler = DocumentAssembler()\\\n", - " .setInputCol(\"text\")\\\n", - " .setOutputCol(\"document\")\n", - "\n", - "regex_matcher = RegexMatcher()\\\n", - " .setInputCols('document')\\\n", - " .setStrategy(\"MATCH_ALL\")\\\n", - " .setOutputCol(\"regex_matches\")\\\n", - " .setExternalRules(path='./regex_rules.txt', delimiter=',')\n", - " \n", - "\n", - "nlpPipeline = Pipeline(\n", - " stages=[\n", - " documentAssembler, \n", - " regex_matcher\n", - " ])\n", - "\n", - "match_df = nlpPipeline.fit(pubMedDF).transform(pubMedDF)\n", - "match_df.select('regex_matches.result').take(3)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 394, - "status": "ok", - "timestamp": 1664907569582, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "3xLGRM6Vix81", - "outputId": "4aa8d560-6a60-4c9d-ebe2-de2ba5f8de32" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+----------------------------------------------------------------------+----------------------------------------------------------------------+\n", - "| text| matches|\n", - "+----------------------------------------------------------------------+----------------------------------------------------------------------+\n", - "|The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activ...|[inwardly, family, spansapproximately, byapproximately, approximate...|\n", - "|BACKGROUND: At present, it is one of the most important issues for ...|[previously, previously, intravenously, previously, 25, mg/m(2), 1,...|\n", - "|OBJECTIVE: To investigate the relationship between preoperative atr...|[renal failure, cardiac surgery, cardiac surgery, cardiac surgical,...|\n", - "|Combined EEG/fMRI recording has been used to localize the generator...|[normally, significantly, effectively, analy, only, 
considerably, 2...|\n", - "|Statistical analysis of neuroimages is commonly approached with int...|[analy, commonly, overly, normally, thatsuccessfully, recently, ana...|\n", - "|The synthetic DOX-LNA conjugate was characterized by proton nuclear...| [wasanaly, substantially]|\n", - "|Our objective was to compare three different methods of blood press...|[daily, only, Conversely, Hourly, hourly, Hourly, hourly, hourly, h...|\n", - "|We conducted a phase II study to assess the efficacy and tolerabili...|[analy, respectively, generally, 5-fluorouracil, (5-FU)-, 5-FU-base...|\n", - "|\"Monomeric sarcosine oxidase (MSOX) is a flavoenzyme that catalyzes...|[cataly, methylgly, gly, ethylgly, dimethylgly, spectrally, practic...|\n", - "|We presented the tachinid fly Exorista japonica with moving host mo...| [fly, fly, fly, fly, fly]|\n", - "|The literature dealing with the water conducting properties of sapw...| [generally, mathematically, especially]|\n", - "|A novel approach to synthesize chitosan-O-isopropyl-5'-O-d4T monoph...|[efficiently, poly, chitosan-O-isopropyl-5'-O-d4T, Chitosan-d4T, 1....|\n", - "|An HPLC-ESI-MS-MS method has been developed for the quantitative de...|[chromatographically, respectively, successfully, C18, (n=5), 95.0%...|\n", - "|The localizing and lateralizing values of eye and head ictal deviat...| [early, early]|\n", - "|OBJECTIVE: To evaluate the effectiveness and acceptability of expec...|[weekly, respectively, theanaly, 2006, 2007,, 2, 66, 1), 30patients...|\n", - "|We report the results of a screen for genetic association with urin...|[poly, threepoly, significantly, analy, actually, anextremely, only...|\n", - "|Intraparenchymal pericatheter cyst is rarely reported. 
Obstruction ...| [rarely, possibly, unusually, Early]|\n", - "|PURPOSE: To compare the effectiveness, potential advantages and com...|[analy, comparatively, wassignificantly, respectively, a7-year, 155...|\n", - "|We have demonstrated a new type of all-optical 2 x 2 switch by usin...|[approximately, fully, approximately, approximately, approximately,...|\n", - "|Physalis peruviana (PP) is a widely used medicinal herb for treatin...|[widely, (20,, 40,, 60,, 80, 95%, 100, 95%, (82.3%), onFeCl2-ascorb...|\n", - "+----------------------------------------------------------------------+----------------------------------------------------------------------+\n", - "only showing top 20 rows\n", - "\n" - ] - } - ], - "source": [ - "match_df.select('text','regex_matches.result')\\\n", - " .toDF('text','matches').filter(F.size('matches')>1)\\\n", - " .show(truncate=70)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "g5tfkArf0XQT" - }, - "source": [ - "## MultiDateMatcher" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "mY7q_7zl9HR5" - }, - "source": [ - "Extracts exact dates and normalizes relative date-time phrases into a fixed output format. By default, the anchor date is the date on which the code is run."
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 13, - "status": "ok", - "timestamp": 1664907569583, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "EBIg8nm0LNaF", - "outputId": "79f85c7e-88ea-417c-fffa-c9de90ffb8f1" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "{Param(parent='MultiDateMatcher_45d507efaadb', name='lazyAnnotator', doc='Whether this AnnotatorModel acts as lazy in RecursivePipelines'): False,\n", - " Param(parent='MultiDateMatcher_45d507efaadb', name='inputFormats', doc='input formats list of patterns to match'): [''],\n", - " Param(parent='MultiDateMatcher_45d507efaadb', name='outputFormat', doc='desired output format for dates extracted'): 'yyyy/MM/dd',\n", - " Param(parent='MultiDateMatcher_45d507efaadb', name='readMonthFirst', doc='Whether to parse july 07/05/2015 or as 05/07/2015'): True,\n", - " Param(parent='MultiDateMatcher_45d507efaadb', name='defaultDayWhenMissing', doc='which day to set when it is missing from parsed input'): 1}" - ] - }, - "execution_count": 114, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "MultiDateMatcher().extractParamMap()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 757, - "status": "ok", - "timestamp": 1664907584207, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "QxNprtam9S6Q", - "outputId": "b8ac64ce-88ed-4c29-e86f-c4074db2a81b" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+------------------------+\n", - "|result |\n", - "+------------------------+\n", - "|[2022/10/11, 2022/10/03]|\n", - "+------------------------+\n", - "\n" - ] - } - ], - 
"source": [ - "documentAssembler = DocumentAssembler()\\\n", - " .setInputCol(\"text\")\\\n", - " .setOutputCol(\"document\")\n", - "\n", - "date_matcher = MultiDateMatcher() \\\n", - " .setInputCols('document') \\\n", - " .setOutputCol(\"date\")\\\n", - " .setOutputFormat(\"yyyy/MM/dd\")\\\n", - " .setSourceLanguage(\"en\")\n", - "\n", - "\n", - "date_pipeline = PipelineModel(\n", - " stages=[\n", - " documentAssembler, \n", - " date_matcher\n", - " ])\n", - "\n", - "sample_df = spark.createDataFrame([['I saw him yesterday and he told me that he will visit us next week']]).toDF(\"text\")\n", - "\n", - "result = date_pipeline.transform(sample_df)\n", - "result.select('date.result').show(truncate=False)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "R0Wd3ZLDqM4K" - }, - "source": [ - "Let's set the Input Format and Output Format to specific format" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 361, - "status": "ok", - "timestamp": 1664907596811, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "VFZSzKMrteHj", - "outputId": "c69914bf-582e-4ac7-e976-9f8f7a2b0b44" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+------------+\n", - "|result |\n", - "+------------+\n", - "|[2022/05/21]|\n", - "+------------+\n", - "\n" - ] - } - ], - "source": [ - "documentAssembler = DocumentAssembler()\\\n", - " .setInputCol(\"text\")\\\n", - " .setOutputCol(\"document\")\n", - "\n", - "date_matcher = MultiDateMatcher() \\\n", - " .setInputCols('document') \\\n", - " .setOutputCol(\"date\")\\\n", - " .setInputFormats([\"dd/MM/yyyy\"])\\\n", - " .setOutputFormat(\"yyyy/MM/dd\")\\\n", - " .setSourceLanguage(\"en\")\n", - "\n", - "date_pipeline = PipelineModel(\n", - " stages=[\n", - " documentAssembler, \n", - " date_matcher\n", - " 
])\n", - "\n", - "sample_df = spark.createDataFrame([[\"the last payment date of this invoice is 21/05/2022\"]]).toDF(\"text\")\n", - "\n", - "result = date_pipeline.transform(sample_df)\n", - "result.select('date.result').show(truncate=False)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ZOwrFGd8ix83" - }, - "source": [ - "## Text Cleaning with UDF" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 1578, - "status": "ok", - "timestamp": 1664907603877, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "FMGb8oMoix83", - "outputId": "1cf4bbb6-8fb4-4e93-fcf4-5e80c548824f" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+----------------------------------------------------------------------------------------------+-----------------------+\n", - "|text |cleaned |\n", - "+----------------------------------------------------------------------------------------------+-----------------------+\n", - "|

Have a great birth day!

|Have a great birth day!|\n", - "+----------------------------------------------------------------------------------------------+-----------------------+\n", - "\n" - ] - } - ], - "source": [ - "text = '

Have a great birth day!

'\n", - "\n", - "text_df = spark.createDataFrame([[text]]).toDF(\"text\")\n", - "\n", - "import re\n", - "from pyspark.sql.functions import udf\n", - "from pyspark.sql.types import StringType, IntegerType\n", - "\n", - "clean_text = lambda s: re.sub(r'<[^>]*>', '', s)\n", - "\n", - "text_df.withColumn('cleaned', udf(clean_text, StringType())('text')).select('text','cleaned').show(truncate= False)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 409, - "status": "ok", - "timestamp": 1664907605532, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "2jtfb5TGix86", - "outputId": "312a7d3d-8f42-4a59-94b9-58b8cd17e1f2" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "2" - ] - }, - "execution_count": 118, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "find_not_alnum_count = lambda s: len([i for i in s if not i.isalnum() and i!=' '])\n", - "\n", - "find_not_alnum_count(\"it's your birth day!\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 350, - "status": "ok", - "timestamp": 1664907607173, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "5d7DkFK7ix87", - "outputId": "77913e9e-917c-4728-92e8-fb9347f15764" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "23" - ] - }, - "execution_count": 119, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "text = '

Have a great birth day!

'\n", - "\n", - "find_not_alnum_count(text)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 510, - "status": "ok", - "timestamp": 1664907608242, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "OACwrrToix89", - "outputId": "7158910f-9fe7-492d-fe01-4a0861ee6c7b" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+----------------------------------------------------------------------------------------------+-------+\n", - "|text |cleaned|\n", - "+----------------------------------------------------------------------------------------------+-------+\n", - "|

Have a great birth day!

|23 |\n", - "+----------------------------------------------------------------------------------------------+-------+\n", - "\n" - ] - } - ], - "source": [ - "text_df.withColumn('cleaned', udf(find_not_alnum_count, IntegerType())('text')).select('text','cleaned').show(truncate= False)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Tj-RyTZaix8-" - }, - "source": [ - "## Finisher" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "czhvgXuHix8-" - }, - "source": [ - "***Finisher:*** Once our NLP pipeline is ready, we may want to use the annotation results elsewhere, in a form that is easy to work with. The Finisher outputs annotation values as strings.\n", - "\n", - "If we only want the desired output column in the final dataframe, we can use Finisher to drop the columns of previous stages from the final output and keep the `result` of the process.\n", - "\n", - "This is very handy when you want to use the output of a Spark NLP annotator as input to another Spark ML transformer.\n", - "\n", - "**Settable parameters are:**\n", - "\n", - "| Parameter | Description |\n", - "| - | - |\n", - "|**setInputCols** |input column name string which targets a column of type Array|\n", - "|**setOutputCols** |output column name string which targets a column of type AnnotatorType|\n", - "|**setCleanAnnotations(True)**|Whether to remove intermediate annotations|\n", - "|**setValueSplitSymbol(“#”)**|Character used to split values within an annotation|\n", - "|**setAnnotationSplitSymbol(“@”)**|Character used to split values between annotations|\n", - "|**setIncludeMetadata(False)**|Whether to include metadata keys. Useful for some annotations.|\n", - "|**setOutputAsArray(False)**| Whether to output as Array. 
Useful as input for other Spark transformers.|" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 798, - "status": "ok", - "timestamp": 1664907626571, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "z0aDUU2Fix8-", - "outputId": "1a02acdf-b602-417f-f4f6-7d4df90af320" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+--------------------------------------------------+--------------------------------------------------+\n", - "| text| finished_regex_matches|\n", - "+--------------------------------------------------+--------------------------------------------------+\n", - "|The human KCNJ9 (Kir 3.3, GIRK3) is a member of...|[inwardly, family, spansapproximately, byapprox...|\n", - "|BACKGROUND: At present, it is one of the most i...|[previously, previously, intravenously, previou...|\n", - "|OBJECTIVE: To investigate the relationship betw...|[renal failure, cardiac surgery, cardiac surger...|\n", - "|Combined EEG/fMRI recording has been used to lo...|[normally, significantly, effectively, analy, o...|\n", - "|Kohlschutter syndrome is a rare neurodegenerati...| [family]|\n", - "|Statistical analysis of neuroimages is commonly...|[analy, commonly, overly, normally, thatsuccess...|\n", - "|The synthetic DOX-LNA conjugate was characteriz...| [wasanaly, substantially]|\n", - "|Our objective was to compare three different me...|[daily, only, Conversely, Hourly, hourly, Hourl...|\n", - "|We conducted a phase II study to assess the eff...|[analy, respectively, generally, 5-fluorouracil...|\n", - "|\"Monomeric sarcosine oxidase (MSOX) is a flavoe...|[cataly, methylgly, gly, ethylgly, dimethylgly,...|\n", - "|We presented the tachinid fly Exorista japonica...| [fly, fly, fly, fly, fly]|\n", - "|The literature dealing with the water conductin...| 
[generally, mathematically, especially]|\n", - "|A novel approach to synthesize chitosan-O-isopr...|[efficiently, poly, chitosan-O-isopropyl-5'-O-d...|\n", - "|An HPLC-ESI-MS-MS method has been developed for...|[chromatographically, respectively, successfull...|\n", - "|The localizing and lateralizing values of eye a...| [early, early]|\n", - "|OBJECTIVE: To evaluate the effectiveness and ac...|[weekly, respectively, theanaly, 2006, 2007,, 2...|\n", - "|For the construction of new combinatorial libra...| [newly]|\n", - "|We report the results of a screen for genetic a...|[poly, threepoly, significantly, analy, actuall...|\n", - "|Intraparenchymal pericatheter cyst is rarely re...| [rarely, possibly, unusually, Early]|\n", - "|It is known that patients with Klinefelter's sy...| []|\n", - "+--------------------------------------------------+--------------------------------------------------+\n", - "only showing top 20 rows\n", - "\n" - ] - } - ], - "source": [ - "finisher = Finisher() \\\n", - " .setInputCols([\"regex_matches\"]) \\\n", - " .setIncludeMetadata(False) # set to False to remove metadata\n", - "\n", - "nlpPipeline = Pipeline(stages=[documentAssembler, \n", - " regex_matcher,\n", - " finisher])\n", - "\n", - "match_df = nlpPipeline.fit(pubMedDF).transform(pubMedDF)\n", - "match_df.show(truncate = 50)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 414, - "status": "ok", - "timestamp": 1664907628746, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "f9Yy-yUgix9A", - "outputId": "c7970d20-5cb3-4311-c3b8-27150962d539" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "root\n", - " |-- text: string (nullable = true)\n", - " |-- finished_regex_matches: array (nullable = true)\n", - " | |-- element: string (containsNull = true)\n", - 
"\n" - ] - } - ], - "source": [ - "match_df.printSchema()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 377, - "status": "ok", - "timestamp": 1664907629451, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "gZCd2KSfix9C", - "outputId": "dfaef820-bf99-45d2-b634-ad18dad8e9ab" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+--------------------------------------------------+--------------------------------------------------+\n", - "| text| finished_regex_matches|\n", - "+--------------------------------------------------+--------------------------------------------------+\n", - "|The human KCNJ9 (Kir 3.3, GIRK3) is a member of...|[inwardly, family, spansapproximately, byapprox...|\n", - "|BACKGROUND: At present, it is one of the most i...|[previously, previously, intravenously, previou...|\n", - "|OBJECTIVE: To investigate the relationship betw...|[renal failure, cardiac surgery, cardiac surger...|\n", - "|Combined EEG/fMRI recording has been used to lo...|[normally, significantly, effectively, analy, o...|\n", - "|Statistical analysis of neuroimages is commonly...|[analy, commonly, overly, normally, thatsuccess...|\n", - "|Our objective was to compare three different me...|[daily, only, Conversely, Hourly, hourly, Hourl...|\n", - "|We conducted a phase II study to assess the eff...|[analy, respectively, generally, 5-fluorouracil...|\n", - "|\"Monomeric sarcosine oxidase (MSOX) is a flavoe...|[cataly, methylgly, gly, ethylgly, dimethylgly,...|\n", - "|We presented the tachinid fly Exorista japonica...| [fly, fly, fly, fly, fly]|\n", - "|The literature dealing with the water conductin...| [generally, mathematically, especially]|\n", - "|A novel approach to synthesize chitosan-O-isopr...|[efficiently, poly, 
chitosan-O-isopropyl-5'-O-d...|\n", - "|An HPLC-ESI-MS-MS method has been developed for...|[chromatographically, respectively, successfull...|\n", - "|OBJECTIVE: To evaluate the effectiveness and ac...|[weekly, respectively, theanaly, 2006, 2007,, 2...|\n", - "|We report the results of a screen for genetic a...|[poly, threepoly, significantly, analy, actuall...|\n", - "|Intraparenchymal pericatheter cyst is rarely re...| [rarely, possibly, unusually, Early]|\n", - "|PURPOSE: To compare the effectiveness, potentia...|[analy, comparatively, wassignificantly, respec...|\n", - "|We have demonstrated a new type of all-optical ...|[approximately, fully, approximately, approxima...|\n", - "|Physalis peruviana (PP) is a widely used medici...|[widely, (20,, 40,, 60,, 80, 95%, 100, 95%, (82...|\n", - "|We report the discovery of a series of substitu...|[highly, potentially, highly, respectively, tub...|\n", - "|The purpose of this study was to identify and c...|[family, Nearly, only, 43, 10, 44%, 32%, 64%, 4...|\n", - "+--------------------------------------------------+--------------------------------------------------+\n", - "only showing top 20 rows\n", - "\n" - ] - } - ], - "source": [ - "match_df.filter(F.size('finished_regex_matches')>2).show(truncate = 50)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "RBeh9Yv44elz" - }, - "source": [ - "## LightPipeline\n", - "\n", - "https://medium.com/spark-nlp/spark-nlp-101-lightpipeline-a544e93f20f1" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "eaANY-Bg4paN" - }, - "source": [ - "LightPipelines are Spark NLP specific Pipelines, equivalent to Spark ML Pipeline, but meant to deal with smaller amounts of data. 
They’re useful when working with small datasets, debugging results, or when running either training or prediction from an API that serves one-off requests.\n", - "\n", - "Spark NLP LightPipelines are Spark ML pipelines converted to run on a single machine, but multi-threaded, which makes them more than 10x faster for smaller amounts of data (small is relative, but 50k sentences is roughly a good maximum). To use them, we simply plug in a trained (fitted) pipeline and then annotate plain text. We don't even need to convert the input text to a DataFrame in order to feed it into a pipeline that accepts a DataFrame as input in the first place. This feature is quite useful for getting a prediction for a few lines of text from a trained ML model.\n", - "\n", - " **It is nearly 10x faster than using a Spark ML Pipeline**\n", - "\n", - "`LightPipeline(someTrainedPipeline).annotate(someStringOrArray)`" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 4053, - "status": "ok", - "timestamp": 1664907656865, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "D5DwGaT4oTyu", - "outputId": "ffa33dfb-23ab-47fc-d413-051d76318cb9" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+--------------------+--------------------+--------------------+--------------------+--------------------+\n", - "| text| document| token| stem| lemma|\n", - "+--------------------+--------------------+--------------------+--------------------+--------------------+\n", - "|Peter is a very g...|[{document, 0, 27...|[{token, 0, 4, Pe...|[{token, 0, 4, pe...|[{token, 0, 4, Pe...|\n", - "|My life in Russia...|[{document, 0, 37...|[{token, 0, 1, My...|[{token, 0, 1, my...|[{token, 0, 1, My...|\n", - "|John and Peter ar...|[{document, 0, 76...|[{token, 0, 3, Jo...|[{token, 
0, 3, jo...|[{token, 0, 3, Jo...|\n", - "|Lucas Nogal Dunbe...|[{document, 0, 67...|[{token, 0, 4, Lu...|[{token, 0, 4, lu...|[{token, 0, 4, Lu...|\n", - "|Europe is very cu...|[{document, 0, 68...|[{token, 0, 5, Eu...|[{token, 0, 5, eu...|[{token, 0, 5, Eu...|\n", - "+--------------------+--------------------+--------------------+--------------------+--------------------+\n", - "\n" - ] - } - ], - "source": [ - "documentAssembler = DocumentAssembler()\\\n", - " .setInputCol(\"text\")\\\n", - " .setOutputCol(\"document\")\n", - "\n", - "tokenizer = Tokenizer() \\\n", - " .setInputCols([\"document\"]) \\\n", - " .setOutputCol(\"token\")\n", - "\n", - "stemmer = Stemmer() \\\n", - " .setInputCols([\"token\"]) \\\n", - " .setOutputCol(\"stem\")\n", - "\n", - "nlpPipeline = Pipeline(\n", - " stages=[\n", - " documentAssembler, \n", - " tokenizer,\n", - " stemmer,\n", - " lemmatizer\n", - " ])\n", - "\n", - "empty_df = spark.createDataFrame([['']]).toDF(\"text\")\n", - "\n", - "pipelineModel = nlpPipeline.fit(empty_df)\n", - "nlpPipeline.fit(empty_df).transform(spark_df).show()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "kilQL1ps4kXh" - }, - "outputs": [], - "source": [ - "from sparknlp.base import LightPipeline\n", - "\n", - "light_model = LightPipeline(pipelineModel)\n", - "light_result = light_model.annotate(\"John and Peter are brothers. 
However they don't support each other that much.\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 15, - "status": "ok", - "timestamp": 1664907657454, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "Sw9Z5q_H49M_", - "outputId": "ca407fb9-bb67-47f9-9005-4e0094a5dd14" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "dict_keys(['document', 'token', 'stem', 'lemma'])" - ] - }, - "execution_count": 126, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "light_result.keys()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 14, - "status": "ok", - "timestamp": 1664907657455, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "3x4u11yQ5CxR", - "outputId": "069b4820-308b-459a-e8c2-d8162413d0b2" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "[('John', 'john', 'John'),\n", - " ('and', 'and', 'and'),\n", - " ('Peter', 'peter', 'Peter'),\n", - " ('are', 'ar', 'be'),\n", - " ('brothers', 'brother', 'brother'),\n", - " ('.', '.', '.'),\n", - " ('However', 'howev', 'However'),\n", - " ('they', 'thei', 'they'),\n", - " (\"don't\", \"don't\", \"don't\"),\n", - " ('support', 'support', 'support'),\n", - " ('each', 'each', 'each'),\n", - " ('other', 'other', 'other'),\n", - " ('that', 'that', 'that'),\n", - " ('much', 'much', 'much'),\n", - " ('.', '.', '.')]" - ] - }, - "execution_count": 127, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "list(zip(light_result['token'], light_result['stem'], light_result['lemma']))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "cOe6aYzn5NXG" - }, - 
"outputs": [], - "source": [ - "light_result = light_model.fullAnnotate(\"John and Peter are brothers. However they don't support each other that much.\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 9, - "status": "ok", - "timestamp": 1664907658672, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "nObeWNkt55pS", - "outputId": "c82c79a2-398e-4523-e6b1-36cfdc9a1093" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "[{'document': [Annotation(document, 0, 76, John and Peter are brothers. However they don't support each other that much., {})],\n", - " 'token': [Annotation(token, 0, 3, John, {'sentence': '0'}),\n", - " Annotation(token, 5, 7, and, {'sentence': '0'}),\n", - " Annotation(token, 9, 13, Peter, {'sentence': '0'}),\n", - " Annotation(token, 15, 17, are, {'sentence': '0'}),\n", - " Annotation(token, 19, 26, brothers, {'sentence': '0'}),\n", - " Annotation(token, 27, 27, ., {'sentence': '0'}),\n", - " Annotation(token, 29, 35, However, {'sentence': '0'}),\n", - " Annotation(token, 37, 40, they, {'sentence': '0'}),\n", - " Annotation(token, 42, 46, don't, {'sentence': '0'}),\n", - " Annotation(token, 48, 54, support, {'sentence': '0'}),\n", - " Annotation(token, 56, 59, each, {'sentence': '0'}),\n", - " Annotation(token, 61, 65, other, {'sentence': '0'}),\n", - " Annotation(token, 67, 70, that, {'sentence': '0'}),\n", - " Annotation(token, 72, 75, much, {'sentence': '0'}),\n", - " Annotation(token, 76, 76, ., {'sentence': '0'})],\n", - " 'stem': [Annotation(token, 0, 3, john, {'sentence': '0'}),\n", - " Annotation(token, 5, 7, and, {'sentence': '0'}),\n", - " Annotation(token, 9, 13, peter, {'sentence': '0'}),\n", - " Annotation(token, 15, 17, ar, {'sentence': '0'}),\n", - " Annotation(token, 19, 26, brother, {'sentence': '0'}),\n", - " Annotation(token, 27, 
27, ., {'sentence': '0'}),\n", - " Annotation(token, 29, 35, howev, {'sentence': '0'}),\n", - " Annotation(token, 37, 40, thei, {'sentence': '0'}),\n", - " Annotation(token, 42, 46, don't, {'sentence': '0'}),\n", - " Annotation(token, 48, 54, support, {'sentence': '0'}),\n", - " Annotation(token, 56, 59, each, {'sentence': '0'}),\n", - " Annotation(token, 61, 65, other, {'sentence': '0'}),\n", - " Annotation(token, 67, 70, that, {'sentence': '0'}),\n", - " Annotation(token, 72, 75, much, {'sentence': '0'}),\n", - " Annotation(token, 76, 76, ., {'sentence': '0'})],\n", - " 'lemma': [Annotation(token, 0, 3, John, {'sentence': '0'}),\n", - " Annotation(token, 5, 7, and, {'sentence': '0'}),\n", - " Annotation(token, 9, 13, Peter, {'sentence': '0'}),\n", - " Annotation(token, 15, 17, be, {'sentence': '0'}),\n", - " Annotation(token, 19, 26, brother, {'sentence': '0'}),\n", - " Annotation(token, 27, 27, ., {'sentence': '0'}),\n", - " Annotation(token, 29, 35, However, {'sentence': '0'}),\n", - " Annotation(token, 37, 40, they, {'sentence': '0'}),\n", - " Annotation(token, 42, 46, don't, {'sentence': '0'}),\n", - " Annotation(token, 48, 54, support, {'sentence': '0'}),\n", - " Annotation(token, 56, 59, each, {'sentence': '0'}),\n", - " Annotation(token, 61, 65, other, {'sentence': '0'}),\n", - " Annotation(token, 67, 70, that, {'sentence': '0'}),\n", - " Annotation(token, 72, 75, much, {'sentence': '0'}),\n", - " Annotation(token, 76, 76, ., {'sentence': '0'})]}]" - ] - }, - "execution_count": 129, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "light_result" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 437, - "status": "ok", - "timestamp": 1664907660277, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "uEgemHyB57LJ", - "outputId": 
"5b8cb4b6-84f1-4e4b-e273-ce143c17a126" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "[{'document': ['How did serfdom develop in and then leave Russia ?'],\n", - " 'token': ['How',\n", - " 'did',\n", - " 'serfdom',\n", - " 'develop',\n", - " 'in',\n", - " 'and',\n", - " 'then',\n", - " 'leave',\n", - " 'Russia',\n", - " '?'],\n", - " 'stem': ['how',\n", - " 'did',\n", - " 'serfdom',\n", - " 'develop',\n", - " 'in',\n", - " 'and',\n", - " 'then',\n", - " 'leav',\n", - " 'russia',\n", - " '?'],\n", - " 'lemma': ['How',\n", - " 'do',\n", - " 'serfdom',\n", - " 'develop',\n", - " 'in',\n", - " 'and',\n", - " 'then',\n", - " 'leave',\n", - " 'Russia',\n", - " '?']},\n", - " {'document': ['There will be some exciting breakthroughs in NLP this year.'],\n", - " 'token': ['There',\n", - " 'will',\n", - " 'be',\n", - " 'some',\n", - " 'exciting',\n", - " 'breakthroughs',\n", - " 'in',\n", - " 'NLP',\n", - " 'this',\n", - " 'year',\n", - " '.'],\n", - " 'stem': ['there',\n", - " 'will',\n", - " 'be',\n", - " 'some',\n", - " 'excit',\n", - " 'breakthrough',\n", - " 'in',\n", - " 'nlp',\n", - " 'thi',\n", - " 'year',\n", - " '.'],\n", - " 'lemma': ['There',\n", - " 'will',\n", - " 'be',\n", - " 'some',\n", - " 'exciting',\n", - " 'breakthrough',\n", - " 'in',\n", - " 'NLP',\n", - " 'this',\n", - " 'year',\n", - " '.']}]" - ] - }, - "execution_count": 130, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "text_list = [\"How did serfdom develop in and then leave Russia ?\",\n", - "\"There will be some exciting breakthroughs in NLP this year.\"]\n", - "\n", - "light_model.annotate(text_list)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "-44fmATEqrL_" - }, - "source": [ - "**Important note:** When you use Finisher in your pipeline, LightPipeline returns only the finished columns, regardless of whether `cleanAnnotations` is set to False or True."
- ] - } - ], - "metadata": { - "colab": { - "collapsed_sections": [], - "machine_shape": "hm", - "provenance": [] - }, - "gpuClass": "standard", - "kernelspec": { - "display_name": "base", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.9.12 (main, Apr 5 2022, 06:56:58) \n[GCC 7.5.0]" - }, - "vscode": { - "interpreter": { - "hash": "3d597f4c481aa0f25dceb95d2a0067e73c0966dcbd003d741d821a7208527ecf" - } - } - }, - "nbformat": 4, - "nbformat_minor": 0 -} diff --git a/examples/bak3.SparkNLP_Pretrained_Models.ipynb b/examples/bak3.SparkNLP_Pretrained_Models.ipynb deleted file mode 100644 index 899519bf610023..00000000000000 --- a/examples/bak3.SparkNLP_Pretrained_Models.ipynb +++ /dev/null @@ -1,9011 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "id": "sXatvRX899i0" - }, - "source": [ - "![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "JhpTOc_Ox8P8" - }, - "source": [ - "\n", - "\n", - "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/3.SparkNLP_Pretrained_Models.ipynb)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "uzDYtrcEe0Ig" - }, - "source": [ - "# 3. Spark NLP Pretrained Models" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "RcOCaLKoIwow" - }, - "source": [ - "Spark NLP offers the following pre-trained models in 200+ languages and all you need to do is to load the pre-trained model into your disk by specifying the model name and then configuring the model parameters as per your use case and dataset. 
You then don't need to train a new model from scratch and can apply the pre-trained SOTA algorithms directly to your own data with transform().\n", - "\n", - "The official documentation describes how these models were trained, including the algorithms and datasets used.\n", - "\n", - "https://github.com/JohnSnowLabs/spark-nlp-models\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "uKCBqmQKUzPc" - }, - "source": [ - "![JohnSnowLabs](https://www.johnsnowlabs.com/wp-content/uploads/2021/06/spark_npl_06_2021.png)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "lYYRQmIKIZ-1" - }, - "source": [ - "## Colab Setup" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "bG6E9KhzIatv" - }, - "outputs": [], - "source": [ - "!pip install -q pyspark==3.3.0 spark-nlp==4.3.0" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 254 - }, - "executionInfo": { - "elapsed": 19647, - "status": "ok", - "timestamp": 1664911118555, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "E1PhbeBKKYyT", - "outputId": "5f4215bc-5c77-40b9-dc42-18e4bb8a9c3f" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Spark NLP version 4.3.0\n", - "Apache Spark version: 3.3.0\n" - ] - }, - { - "data": { - "text/html": [ - "\n", - "
\n", - "

SparkSession - in-memory. SparkContext (Spark UI): Version v3.3.0, Master local[*], AppName Spark NLP
\n", - "
\n", - "
\n", - " \n", - "
\n", - " " - ], - "text/plain": [ - "" - ] - }, - "execution_count": 2, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "import sparknlp\n", - "\n", - "spark = sparknlp.start()\n", - "\n", - "from sparknlp.base import *\n", - "from sparknlp.annotator import *\n", - "\n", - "print(\"Spark NLP version\", sparknlp.version())\n", - "print(\"Apache Spark version:\", spark.version)\n", - "\n", - "spark" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "09BYgkOkIsV4" - }, - "source": [ - "## LemmatizerModel" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "HbYUnHwPoFq2" - }, - "outputs": [], - "source": [ - "from pyspark.ml import Pipeline" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "1zuJoZq-Saer" - }, - "outputs": [], - "source": [ - "!wget -q -O news_category_test.csv https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/classifier-dl/news_Category/news_category_test.csv" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 1165, - "status": "ok", - "timestamp": 1664911615584, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "ELxwZNaTSd1K", - "outputId": "8ffc8354-0452-4405-d3c9-9a957ef619c5" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+--------+--------------------------------------------------+\n", - "|category| text|\n", - "+--------+--------------------------------------------------+\n", - "|Business|Unions representing workers at Turner Newall ...|\n", - "|Sci/Tech| TORONTO, Canada A second team of rocketeers...|\n", - "|Sci/Tech| A company founded by a chemistry researcher at...|\n", - "|Sci/Tech| It's barely dawn when Mike Fitzpatrick starts ...|\n", - "|Sci/Tech| Southern California's smog 
fighting agency wen...|\n", - "|Sci/Tech|\"The British Department for Education and Skill...|\n", - "|Sci/Tech|\"confessed author of the Netsky and Sasser viru...|\n", - "|Sci/Tech|\\\\FOAF/LOAF and bloom filters have a lot of in...|\n", - "|Sci/Tech|\"Wiltshire Police warns about \"\"phishing\"\" afte...|\n", - "|Sci/Tech|In its first two years, the UK's dedicated card...|\n", - "|Sci/Tech| A group of technology companies including Tex...|\n", - "|Sci/Tech| Apple Computer Inc.<AAPL.O> on Tuesday ...|\n", - "|Sci/Tech| Free Record Shop, a Dutch music retail chain,...|\n", - "|Sci/Tech|A giant 100km colony of ants which has been di...|\n", - "|Sci/Tech| \"Dolphin groups, or \"\"pods\"\"|\n", - "|Sci/Tech|Tyrannosaurus rex achieved its massive size due...|\n", - "|Sci/Tech| Scientists have discovered irregular lumps be...|\n", - "|Sci/Tech| ESAs Mars Express has relayed pictures from o...|\n", - "|Sci/Tech|When did life begin? One evidential clue stems ...|\n", - "|Sci/Tech|update Earnings per share rise compared with a ...|\n", - "+--------+--------------------------------------------------+\n", - "only showing top 20 rows\n", - "\n" - ] - } - ], - "source": [ - "import pyspark.sql.functions as F\n", - "\n", - "news_df = spark.read\\\n", - " .option(\"header\", \"true\")\\\n", - " .csv(\"news_category_test.csv\")\\\n", - " .withColumnRenamed(\"description\", \"text\")\n", - "\n", - "news_df.show(truncate=50)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "qd-CX_39Iktn" - }, - "outputs": [], - "source": [ - "lemmatizer = LemmatizerModel.pretrained('lemma_antbnc', 'en') \\\n", - " .setInputCols([\"token\"]) \\\n", - " .setOutputCol(\"lemma\") \\\n", - "\n", - "'''\n", - "lemmatizer = Lemmatizer() \\\n", - " .setInputCols([\"token\"]) \\\n", - " .setOutputCol(\"lemma\") \\\n", - " .setDictionary(\"./AntBNC_lemmas_ver_001.txt\", value_delimiter =\"\\t\", key_delimiter = \"->\")\n", - "'''" - ] - }, - { - "cell_type": "code", - 
"execution_count": null, - "metadata": { - "id": "avTepygiVUFC" - }, - "outputs": [], - "source": [ - "!cd ~/cache_pretrained && ls -l" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "mAYe90QU8lHh" - }, - "outputs": [], - "source": [ - "documentAssembler = DocumentAssembler()\\\n", - " .setInputCol(\"text\")\\\n", - " .setOutputCol(\"document\")\n", - "\n", - "tokenizer = Tokenizer() \\\n", - " .setInputCols([\"document\"]) \\\n", - " .setOutputCol(\"token\")\n", - "\n", - "stemmer = Stemmer() \\\n", - " .setInputCols([\"token\"]) \\\n", - " .setOutputCol(\"stem\")\n", - "\n", - "\n", - "nlpPipeline = Pipeline(stages=[documentAssembler, \n", - " tokenizer,\n", - " stemmer,\n", - " lemmatizer])\n", - "\n", - "empty_df = spark.createDataFrame([['']]).toDF(\"text\")\n", - "\n", - "pipelineModel = nlpPipeline.fit(empty_df)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 1483, - "status": "ok", - "timestamp": 1664910030044, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "YTvZeE0vIk2Y", - "outputId": "e5fbb045-0cae-40f3-9594-b15d98d05169" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+--------+--------------------+--------------------+--------------------+--------------------+--------------------+\n", - "|category| text| document| token| stem| lemma|\n", - "+--------+--------------------+--------------------+--------------------+--------------------+--------------------+\n", - "|Business|Unions representi...|[{document, 0, 12...|[{token, 0, 5, Un...|[{token, 0, 5, un...|[{token, 0, 5, Un...|\n", - "|Sci/Tech| TORONTO, Canada ...|[{document, 0, 22...|[{token, 1, 7, TO...|[{token, 1, 7, to...|[{token, 1, 7, TO...|\n", - "|Sci/Tech| A company founde...|[{document, 0, 20...|[{token, 1, 1, A,...|[{token, 
1, 1, a,...|[{token, 1, 1, A,...|\n", - "|Sci/Tech| It's barely dawn...|[{document, 0, 26...|[{token, 1, 4, It...|[{token, 1, 4, it...|[{token, 1, 4, It...|\n", - "|Sci/Tech| Southern Califor...|[{document, 0, 17...|[{token, 1, 8, So...|[{token, 1, 8, so...|[{token, 1, 8, So...|\n", - "+--------+--------------------+--------------------+--------------------+--------------------+--------------------+\n", - "only showing top 5 rows\n", - "\n" - ] - } - ], - "source": [ - "result = pipelineModel.transform(news_df)\n", - "\n", - "result.show(5)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 17, - "status": "ok", - "timestamp": 1664908494075, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "-efJWYk6NiE3", - "outputId": "be9c6900-ef54-4c47-b930-3a34f22c8a94" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+\n", - "| result| result|\n", - "+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+\n", - "|[Unions, representing, workers, at, Turner, Newall, say, they, are, ', disappointed, ', after, ta...|[Unions, represent, worker, at, Turner, Newall, say, they, be, ', disappointed, ', after, talk, w...|\n", - "|[TORONTO, ,, Canada, A, second, team, of, rocketeers, competing, for, the, #36;10, million, Ansar...|[TORONTO, ,, Canada, A, second, team, of, rocketeer, compete, for, the, #36;10, million, Ansari, ...|\n", - "|[A, company, founded, by, a, chemistry, researcher, at, the, University, of, 
Louisville, won, a, ...|[A, company, founded, by, a, chemistry, researcher, at, the, University, of, Louisville, win, a, ...|\n", - "|[It's, barely, dawn, when, Mike, Fitzpatrick, starts, his, shift, with, a, blur, of, colorful, ma...|[It's, barely, dawn, when, Mike, Fitzpatrick, start, he, shift, with, a, blur, of, colorful, map,...|\n", - "|[Southern, California's, smog, fighting, agency, went, after, emissions, of, the, bovine, variety...|[Southern, California's, smog, fight, agency, go, after, emission, of, the, bovine, variety, Frid...|\n", - "+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+\n", - "only showing top 5 rows\n", - "\n" - ] - } - ], - "source": [ - "result.select('token.result','lemma.result').show(5, truncate=100)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "x8xqKxLeUvDc" - }, - "source": [ - "## PerceptronModel (POS - Part of speech tags)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 1000 - }, - "executionInfo": { - "elapsed": 645, - "status": "ok", - "timestamp": 1664908709974, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "Dyv1H1i42DaZ", - "outputId": "eba2c3d0-fbf9-44d4-c711-3f3983b63546" - }, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - "
\n", - "
\n", - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " 
\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
    Number  Tag   Description
0   1.0     CC    Coordinating conjunction
1   2.0     CD    Cardinal number
2   3.0     DT    Determiner
3   4.0     EX    Existential there
4   5.0     FW    Foreign word
5   6.0     IN    Preposition or subordinating conjunction
6   7.0     JJ    Adjective
7   8.0     JJR   Adjective, comparative
8   9.0     JJS   Adjective, superlative
9   10.0    LS    List item marker
10  11.0    MD    Modal
11  12.0    NN    Noun, singular or mass
12  13.0    NNS   Noun, plural
13  14.0    NNP   Proper noun, singular
14  15.0    NNPS  Proper noun, plural
15  16.0    PDT   Predeterminer
16  17.0    POS   Possessive ending
17  18.0    PRP   Personal pronoun
18  19.0    PRP$  Possessive pronoun
19  20.0    RB    Adverb
20  21.0    RBR   Adverb, comparative
21  22.0    RBS   Adverb, superlative
22  23.0    RP    Particle
23  24.0    SYM   Symbol
24  25.0    TO    to
25  26.0    UH    Interjection
26  27.0    VB    Verb, base form
27  28.0    VBD   Verb, past tense
28  29.0    VBG   Verb, gerund or present participle
29  30.0    VBN   Verb, past participle
30  31.0    VBP   Verb, non-3rd person singular present
31  32.0    VBZ   Verb, 3rd person singular present
32  33.0    WDT   Wh-determiner
33  34.0    WP    Wh-pronoun
34  35.0    WP$   Possessive wh-pronoun
35  36.0    WRB   Wh-adverb
\n", - "
\n", - " \n", - " \n", - " \n", - "\n", - " \n", - "
\n", - "
\n", - " " - ], - "text/plain": [ - " Number Tag Description\n", - "0 1.0 CC Coordinating conjunction\n", - "1 2.0 CD Cardinal number\n", - "2 3.0 DT Determiner\n", - "3 4.0 EX Existential there\n", - "4 5.0 FW Foreign word\n", - "5 6.0 IN Preposition or subordinating conjunction\n", - "6 7.0 JJ Adjective\n", - "7 8.0 JJR Adjective, comparative\n", - "8 9.0 JJS Adjective, superlative\n", - "9 10.0 LS List item marker\n", - "10 11.0 MD Modal\n", - "11 12.0 NN Noun, singular or mass\n", - "12 13.0 NNS Noun, plural\n", - "13 14.0 NNP Proper noun, singular\n", - "14 15.0 NNPS Proper noun, plural\n", - "15 16.0 PDT Predeterminer\n", - "16 17.0 POS Possessive ending\n", - "17 18.0 PRP Personal pronoun\n", - "18 19.0 PRP$ Possessive pronoun\n", - "19 20.0 RB Adverb\n", - "20 21.0 RBR Adverb, comparative\n", - "21 22.0 RBS Adverb, superlative\n", - "22 23.0 RP Particle\n", - "23 24.0 SYM Symbol\n", - "24 25.0 TO to\n", - "25 26.0 UH Interjection\n", - "26 27.0 VB Verb, base form\n", - "27 28.0 VBD Verb, past tense\n", - "28 29.0 VBG Verb, gerund or present participle\n", - "29 30.0 VBN Verb, past participle\n", - "30 31.0 VBP Verb, non-3rd person singular present\n", - "31 32.0 VBZ Verb, 3rd person singular present\n", - "32 33.0 WDT Wh-determiner\n", - "33 34.0 WP Wh-pronoun\n", - "34 35.0 WP$ Possessive wh-pronoun\n", - "35 36.0 WRB Wh-adverb" - ] - }, - "execution_count": 23, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "import pandas as pd\n", - "\n", - "pos_df= pd.read_html('https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html', header=0)\n", - "\n", - "pos_df[0]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 3725, - "status": "ok", - "timestamp": 1664908717470, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": 
"MwOLREB9Ik55", - "outputId": "12a8e4ea-4699-4379-f035-5dd013d18d3b" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "pos_anc download started this may take some time.\n", - "Approximate size to download 3.9 MB\n", - "[OK!]\n" - ] - } - ], - "source": [ - "pos = PerceptronModel.pretrained(\"pos_anc\", 'en')\\\n", - " .setInputCols(\"document\", \"token\")\\\n", - " .setOutputCol(\"pos\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 23, - "status": "ok", - "timestamp": 1664908717471, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "OZkdncsmIk74", - "outputId": "2ea420c0-c076-46e5-d7b1-67049e90f74a" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "total 8\n", - "drwxr-xr-x 4 root root 4096 Oct 4 18:26 lemma_antbnc_en_2.0.2_2.4_1556480454569\n", - "drwxr-xr-x 4 root root 4096 Oct 4 18:34 pos_anc_en_3.0.0_3.0_1614962126490\n" - ] - } - ], - "source": [ - "!cd ~/cache_pretrained && ls -l" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "gwiBekZ2IlHp" - }, - "outputs": [], - "source": [ - "nlpPipeline = Pipeline(stages=[documentAssembler, \n", - " tokenizer,\n", - " stemmer,\n", - " lemmatizer,\n", - " pos])\n", - "\n", - "empty_df = spark.createDataFrame([['']]).toDF(\"text\")\n", - "\n", - "pipelineModel = nlpPipeline.fit(empty_df)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 351, - "status": "ok", - "timestamp": 1664908717814, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "79R4-HHPNv6h", - "outputId": "02c0bdec-771f-49b2-99fd-573ec79fd171" - }, - "outputs": [ - { - 
"name": "stdout", - "output_type": "stream", - "text": [ - "+--------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+\n", - "|category| text| document| token| stem| lemma| pos|\n", - "+--------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+\n", - "|Business|Unions representi...|[{document, 0, 12...|[{token, 0, 5, Un...|[{token, 0, 5, un...|[{token, 0, 5, Un...|[{pos, 0, 5, NNP,...|\n", - "|Sci/Tech| TORONTO, Canada ...|[{document, 0, 22...|[{token, 1, 7, TO...|[{token, 1, 7, to...|[{token, 1, 7, TO...|[{pos, 1, 7, NNP,...|\n", - "|Sci/Tech| A company founde...|[{document, 0, 20...|[{token, 1, 1, A,...|[{token, 1, 1, a,...|[{token, 1, 1, A,...|[{pos, 1, 1, DT, ...|\n", - "|Sci/Tech| It's barely dawn...|[{document, 0, 26...|[{token, 1, 4, It...|[{token, 1, 4, it...|[{token, 1, 4, It...|[{pos, 1, 4, NNP,...|\n", - "|Sci/Tech| Southern Califor...|[{document, 0, 17...|[{token, 1, 8, So...|[{token, 1, 8, so...|[{token, 1, 8, So...|[{pos, 1, 8, NNP,...|\n", - "+--------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+\n", - "only showing top 5 rows\n", - "\n" - ] - } - ], - "source": [ - "result = pipelineModel.transform(news_df)\n", - "\n", - "result.show(5)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 272, - "status": "ok", - "timestamp": 1664908718082, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "r0y9NOb_Vhfk", - "outputId": "e30494eb-ac8f-4c95-f1ed-92c95196bb9c" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - 
"+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+\n", - "| result| result|\n", - "+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+\n", - "|[Unions, representing, workers, at, Turner, Newall, say, they, are, ', disappointed, ', after, ta...| [NNP, VBG, NNS, IN, NNP, NNP, VBP, PRP, VBP, POS, JJ, POS, IN, NNS, IN, NN, NN, NN, NNP, NNP, .]|\n", - "|[TORONTO, ,, Canada, A, second, team, of, rocketeers, competing, for, the, #36;10, million, Ansar...|[NNP, ,, NNP, DT, JJ, NN, IN, NNS, VBG, IN, DT, NN, CD, NNP, NNP, NNP, ,, DT, NN, IN, RB, JJ, JJ,...|\n", - "|[A, company, founded, by, a, chemistry, researcher, at, the, University, of, Louisville, won, a, ...|[DT, NN, VBN, IN, DT, NN, NN, IN, DT, NNP, IN, NNP, VBD, DT, NN, TO, VB, DT, NN, IN, VBG, JJR, NN...|\n", - "|[It's, barely, dawn, when, Mike, Fitzpatrick, starts, his, shift, with, a, blur, of, colorful, ma...|[NNP, RB, NN, WRB, NNP, NNP, VBZ, PRP$, NN, IN, DT, NN, IN, JJ, NNS, ,, NNS, CC, JJ, NNS, ,, CC, ...|\n", - "|[Southern, California's, smog, fighting, agency, went, after, emissions, of, the, bovine, variety...|[NNP, NNP, NN, VBG, NN, VBD, IN, NNS, IN, DT, NN, NN, NNP, ,, VBG, DT, NN, JJ, NNS, TO, VB, NN, N...|\n", - "+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+\n", - "only showing top 5 rows\n", - "\n" - ] - } - ], - "source": [ - "result.select('token.result','pos.result').show(5, truncate=100)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 363 - }, - 
"executionInfo": { - "elapsed": 1209, - "status": "ok", - "timestamp": 1664908719285, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "ffYOgQIOVhnX", - "outputId": "725a804b-0e5a-429e-d327-feb80fb48d68" - }, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - "
\n", - "
\n", - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
\n", - "
\n", - " \n", - " \n", - " \n", - "\n", - " \n", - "
\n", - "
\n", - " " - ], - "text/plain": [ - " token begin end stem lemma pos\n", - "0 Unions 0 5 union Unions NNP\n", - "1 representing 7 18 repres represent VBG\n", - "2 workers 20 26 worker worker NNS\n", - "3 at 28 29 at at IN\n", - "4 Turner 31 36 turner Turner NNP\n", - "5 Newall 40 45 newal Newall NNP\n", - "6 say 47 49 sai say VBP\n", - "7 they 51 54 thei they PRP\n", - "8 are 56 58 ar be VBP\n", - "9 ' 60 60 ' ' POS" - ] - }, - "execution_count": 29, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# applying this pipeline to top 100 rows and then converting to Pandas\n", - "\n", - "result = pipelineModel.transform(news_df.limit(100))\n", - "\n", - "result_df = result.select(F.explode(F.arrays_zip(result.token.result, \n", - " result.token.begin, \n", - " result.token.end, \n", - " result.stem.result, \n", - " result.lemma.result, \n", - " result.pos.result)).alias(\"cols\")) \\\n", - " .select(F.expr(\"cols['0']\").alias(\"token\"),\n", - " F.expr(\"cols['1']\").alias(\"begin\"),\n", - " F.expr(\"cols['2']\").alias(\"end\"),\n", - " F.expr(\"cols['3']\").alias(\"stem\"),\n", - " F.expr(\"cols['4']\").alias(\"lemma\"),\n", - " F.expr(\"cols['5']\").alias(\"pos\")).toPandas()\n", - "\n", - "result_df.head(10) " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 14, - "status": "ok", - "timestamp": 1664908719285, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "g7t4I5s7bDjI", - "outputId": "0598a731-322d-40f4-88ed-12966d506d4b" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "[('Unions', 'union', 'Unions', 'NNP'),\n", - " ('representing', 'repres', 'represent', 'VBG'),\n", - " ('workers', 'worker', 'worker', 'NNS'),\n", - " ('at', 'at', 'at', 'IN'),\n", - " ('Turner', 'turner', 'Turner', 'NNP'),\n", - " ('Newall', 'newal', 'Newall', 
'NNP'),\n", - " ('say', 'sai', 'say', 'VBP'),\n", - " ('they', 'thei', 'they', 'PRP'),\n", - " ('are', 'ar', 'be', 'VBP'),\n", - " ('disappointed', 'disappoint', 'disappointed', 'VBN'),\n", - " ('after', 'after', 'after', 'IN'),\n", - " ('talks', 'talk', 'talk', 'NNS'),\n", - " ('with', 'with', 'with', 'IN'),\n", - " ('stricken', 'stricken', 'stricken', 'NN'),\n", - " ('parent', 'parent', 'parent', 'NN'),\n", - " ('firm', 'firm', 'firm', 'NN'),\n", - " ('Federal', 'feder', 'Federal', 'NNP'),\n", - " ('Mogul', 'mogul', 'Mogul', 'NNP'),\n", - " ('.', '.', '.', '.')]" - ] - }, - "execution_count": 30, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# same in LightPipeline\n", - "\n", - "light_model = LightPipeline(pipelineModel)\n", - "\n", - "light_result = light_model.annotate('Unions representing workers at Turner Newall say they are disappointed after talks with stricken parent firm Federal Mogul.')\n", - "\n", - "list(zip(light_result['token'], light_result['stem'], light_result['lemma'], light_result['pos']))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 645 - }, - "executionInfo": { - "elapsed": 12, - "status": "ok", - "timestamp": 1664908720075, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "QjRWpWOov7aF", - "outputId": "d57cf051-8750-44fa-ae80-cbeb39b80f28" - }, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - "
\n", - "
\n", - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
\n", - "
\n", - " \n", - " \n", - " \n", - "\n", - " \n", - "
\n", - "
\n", - " " - ], - "text/plain": [ - " token stem lemma pos\n", - "0 Unions union Unions NNP\n", - "1 representing repres represent VBG\n", - "2 workers worker worker NNS\n", - "3 at at at IN\n", - "4 Turner turner Turner NNP\n", - "5 Newall newal Newall NNP\n", - "6 say sai say VBP\n", - "7 they thei they PRP\n", - "8 are ar be VBP\n", - "9 disappointed disappoint disappointed VBN\n", - "10 after after after IN\n", - "11 talks talk talk NNS\n", - "12 with with with IN\n", - "13 stricken stricken stricken NN\n", - "14 parent parent parent NN\n", - "15 firm firm firm NN\n", - "16 Federal feder Federal NNP\n", - "17 Mogul mogul Mogul NNP\n", - "18 . . . ." - ] - }, - "execution_count": 31, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "pd.DataFrame(list(zip(light_result['token'], light_result['stem'], light_result['lemma'], light_result['pos'])), columns = ['token', 'stem', 'lemma', 'pos'])" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "JFc_5BlI2znE" - }, - "source": [ - "## **Chunker**\n", - "\n", - "Meaningful phrase matching. 
This annotator matches a pattern of part-of-speech tags in order to return meaningful phrases from a document.\n", - "\n", - "> **Output type**: Chunk\n", - "\n", - "> **Input types**: Document, POS\n", - "\n", - "Functions:\n", - "\n", - "🔍`setRegexParsers(patterns)`: A list of regex patterns to match chunks, for example: Array(“‹DT›?‹JJ›*‹NN›”)\n", - "\n", - "🔍`addRegexParser(patterns)`: adds a pattern to the current list of chunk patterns, for example: “‹DT›?‹JJ›*‹NN›”" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 283, - "status": "ok", - "timestamp": 1664908725750, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "uvqfIP6XLm3p", - "outputId": "8ceae7a1-c450-4bfe-b8ea-13b25f13c8a7" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "{Param(parent='Chunker_8572c053a681', name='lazyAnnotator', doc='Whether this AnnotatorModel acts as lazy in RecursivePipelines'): False,\n", - " Param(parent='Chunker_8572c053a681', name='inputCols', doc='previous annotations columns, if renamed'): ['document',\n", - " 'pos'],\n", - " Param(parent='Chunker_8572c053a681', name='outputCol', doc='output annotation column. can be left default.'): 'chunk',\n", - " Param(parent='Chunker_8572c053a681', name='regexParsers', doc='an array of grammar based chunk parsers'): ['<NNP>+',\n", - " '
<DT>?<JJ>*<NN>']}" - ] - }, - "execution_count": 32, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# applying POS chunker to find a custom pattern\n", - "\n", - "chunker = Chunker()\\\n", - " .setInputCols([\"document\", \"pos\"])\\\n", - " .setOutputCol(\"chunk\")\\\n", - " .setRegexParsers([\"<NNP>+\", \"
<DT>?<JJ>*<NN>\"])\n", - "\n", - "# NNP: Proper Noun\n", - "# NN: Common Noun\n", - "# DT: Determiner (e.g. the)\n", - "# JJ: Adjective\n", - "\n", - "chunker.extractParamMap()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 602, - "status": "ok", - "timestamp": 1664908747650, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "KsbWh5shIlB8", - "outputId": "69338733-53e8-4215-9fdd-123ac9ee39ac" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+--------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+\n", - "|category| text| document| token| stem| lemma| pos| chunk|\n", - "+--------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+\n", - "|Business|Unions representi...|[{document, 0, 12...|[{token, 0, 5, Un...|[{token, 0, 5, un...|[{token, 0, 5, Un...|[{pos, 0, 5, NNP,...|[{chunk, 0, 5, Un...|\n", - "|Sci/Tech| TORONTO, Canada ...|[{document, 0, 22...|[{token, 1, 7, TO...|[{token, 1, 7, to...|[{token, 1, 7, TO...|[{pos, 1, 7, NNP,...|[{chunk, 1, 7, TO...|\n", - "|Sci/Tech| A company founde...|[{document, 0, 20...|[{token, 1, 1, A,...|[{token, 1, 1, a,...|[{token, 1, 1, A,...|[{pos, 1, 1, DT, ...|[{chunk, 52, 61, ...|\n", - "|Sci/Tech| It's barely dawn...|[{document, 0, 26...|[{token, 1, 4, It...|[{token, 1, 4, it...|[{token, 1, 4, It...|[{pos, 1, 4, NNP,...|[{chunk, 1, 4, It...|\n", - "|Sci/Tech| Southern Califor...|[{document, 0, 17...|[{token, 1, 8, So...|[{token, 1, 8, so...|[{token, 1, 8, So...|[{pos, 1, 8, NNP,...|[{chunk, 1, 21, S...|\n", - 
"+--------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+\n", - "only showing top 5 rows\n", - "\n" - ] - } - ], - "source": [ - "nlpPipeline = Pipeline(stages=[documentAssembler, \n", - " tokenizer,\n", - " stemmer,\n", - " lemmatizer,\n", - " pos,\n", - " chunker])\n", - "\n", - "result = nlpPipeline.fit(news_df).transform(news_df.limit(100))\n", - "result.show(5)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 363 - }, - "executionInfo": { - "elapsed": 1078, - "status": "ok", - "timestamp": 1664908751905, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "f9wEshybZPN2", - "outputId": "c88fcd59-336c-49de-eb62-32d60201c964" - }, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - "
\n", - "
\n", - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
\n", - "
\n", - " \n", - " \n", - " \n", - "\n", - " \n", - "
\n", - "
\n", - " " - ], - "text/plain": [ - " chunk begin end\n", - "0 Unions 0 5\n", - "1 Turner Newall 31 45\n", - "2 Federal Mogul 113 125\n", - "3 stricken 92 99\n", - "4 parent 101 106\n", - "5 firm 108 111\n", - "6 TORONTO 1 7\n", - "7 Canada 10 15\n", - "8 Ansari X Prize 82 95\n", - "9 A second team 20 32" - ] - }, - "execution_count": 36, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "result_df = result.select(F.explode(F.arrays_zip(result.chunk.result, \n", - " result.chunk.begin, \n", - " result.chunk.end)).alias(\"cols\")) \\\n", - " .select(F.expr(\"cols['0']\").alias(\"chunk\"),\n", - " F.expr(\"cols['1']\").alias(\"begin\"),\n", - " F.expr(\"cols['2']\").alias(\"end\")).toPandas()\n", - "\n", - "result_df.head(10)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "gm1aI--pgHId" - }, - "source": [ - "## **Dependency Parser**\n", - "\n", - "Dependency parsing (DP) is the practice of analyzing the relationships between the words of a sentence to ascertain its grammatical structure. Based on this analysis, a sentence is broken into several components. The method is predicated on the idea that each linguistic unit in a sentence has a direct link to another unit. 
Dependencies are the names given to these relationships.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 28206, - "status": "ok", - "timestamp": 1664908781504, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "idjZppOHgOSj", - "outputId": "a56eaac7-cf48-4ee1-f00b-cd48ac7aef92" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "dependency_conllu download started this may take some time.\n", - "Approximate size to download 16.7 MB\n", - "[OK!]\n" - ] - } - ], - "source": [ - "dep_parser = DependencyParserModel.pretrained('dependency_conllu')\\\n", - " .setInputCols([\"document\", \"pos\", \"token\"])\\\n", - " .setOutputCol(\"dependency\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 5621, - "status": "ok", - "timestamp": 1664908787110, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "ibIvWMsjhqh_", - "outputId": "f4f9bca4-2e69-4b4d-8e16-681a718a9633" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "dependency_typed_conllu download started this may take some time.\n", - "Approximate size to download 2.4 MB\n", - "[OK!]\n" - ] - } - ], - "source": [ - "typed_dep_parser = TypedDependencyParserModel.pretrained('dependency_typed_conllu')\\\n", - " .setInputCols([\"token\", \"pos\", \"dependency\"])\\\n", - " .setOutputCol(\"dependency_type\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "szCwgxYLhG6c" - }, - "outputs": [], - "source": [ - "nlpPipeline = Pipeline(\n", - " stages=[\n", - " documentAssembler, \n", - " tokenizer,\n", - " stemmer,\n", - " 
lemmatizer,\n", - " pos,\n", - " dep_parser,\n", - " typed_dep_parser\n", - " ])\n", - "\n", - "result = nlpPipeline.fit(news_df).transform(news_df.limit(100))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 363 - }, - "executionInfo": { - "elapsed": 4286, - "status": "ok", - "timestamp": 1664908824032, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "nmva7uDshOEN", - "outputId": "c40776cb-d3e8-4d2f-e256-1a02f4165ed9" - }, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - "
\n", - "
\n", - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
\n", - "
\n", - " \n", - " \n", - " \n", - "\n", - " \n", - "
\n", - "
\n", - " " - ], - "text/plain": [ - " chunk begin end dependency dependency_type\n", - "0 Unions 0 5 ROOT root\n", - "1 representing 7 18 workers amod\n", - "2 workers 20 26 Unions flat\n", - "3 at 28 29 Turner case\n", - "4 Turner 31 36 workers flat\n", - "5 Newall 40 45 say nsubj\n", - "6 say 47 49 Unions parataxis\n", - "7 they 51 54 disappointed nsubj\n", - "8 are 56 58 disappointed nsubj\n", - "9 ' 60 60 disappointed case" - ] - }, - "execution_count": 41, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "result_df = result.select(F.explode(F.arrays_zip(result.token.result, \n", - " result.token.begin, \n", - " result.token.end, \n", - " result.dependency.result, \n", - " result.dependency_type.result)).alias(\"cols\")) \\\n", - " .select(F.expr(\"cols['0']\").alias(\"chunk\"),\n", - " F.expr(\"cols['1']\").alias(\"begin\"),\n", - " F.expr(\"cols['2']\").alias(\"end\"),\n", - " F.expr(\"cols['3']\").alias(\"dependency\"),\n", - " F.expr(\"cols['4']\").alias(\"dependency_type\")).toPandas()\n", - "\n", - "result_df.head(10)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "8sxmntVciusy" - }, - "source": [ - "![image.png]()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "kbqjOV0iGCqy" - }, - "source": [ - "## StopWordsCleaner\n", - "\n", - "`stopwords_fr`, `stopwords_de`, `stopwords_en`, `stopwords_it`, `stopwords_af` .... 
over 40 languages " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 3080, - "status": "ok", - "timestamp": 1664908832107, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "jEcrH4f4GD8g", - "outputId": "42a58115-0d25-49c3-a9ed-99425fbdf26d" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "stopwords_en download started this may take some time.\n", - "Approximate size to download 2.9 KB\n", - "[OK!]\n" - ] - } - ], - "source": [ - "stopwords_cleaner = StopWordsCleaner.pretrained('stopwords_en','en')\\\n", - " .setInputCols(\"token\")\\\n", - " .setOutputCol(\"cleanTokens\")\\\n", - " .setCaseSensitive(False)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 2957, - "status": "ok", - "timestamp": 1664908835345, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "fDE2i6J9JUCv", - "outputId": "a9ff1126-f1f8-41d6-8fca-83b4e1ad528a" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "stopwords_es download started this may take some time.\n", - "Approximate size to download 2.2 KB\n", - "[OK!]\n" - ] - }, - { - "data": { - "text/plain": [ - "['a',\n", - " 'acuerdo',\n", - " 'adelante',\n", - " 'ademas',\n", - " 'además',\n", - " 'adrede',\n", - " 'ahi',\n", - " 'ahí',\n", - " 'ahora',\n", - " 'al']" - ] - }, - "execution_count": 43, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# we can also get the list of stopwords \n", - "\n", - "stopwords_cleaner_es = StopWordsCleaner.pretrained('stopwords_es','es')\\\n", - " .setInputCols(\"token\")\\\n", - " .setOutputCol(\"cleanTokens\")\\\n", - " 
.setCaseSensitive(False)\n", - "\n", - "stopwords_cleaner_es.getStopWords()[:10]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 740, - "status": "ok", - "timestamp": 1664908852130, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "7WarSNQYG6Q0", - "outputId": "25fbaa01-1e9e-4558-a933-d9cebdb54a10" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "['Peter Parker nice person friend mine.']" - ] - }, - "execution_count": 44, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "token_assembler = TokenAssembler() \\\n", - " .setInputCols([\"document\", \"cleanTokens\"]) \\\n", - " .setOutputCol(\"clean_text\")\n", - "\n", - "\n", - "nlpPipeline = Pipeline(\n", - " stages=[\n", - " documentAssembler, \n", - " tokenizer,\n", - " stopwords_cleaner,\n", - " token_assembler\n", - " ])\n", - "\n", - "empty_df = spark.createDataFrame([['']]).toDF(\"text\")\n", - "\n", - "pipelineModel = nlpPipeline.fit(empty_df)\n", - "\n", - "# same in LightPipeline\n", - "\n", - "light_model = LightPipeline(pipelineModel)\n", - "\n", - "light_result = light_model.annotate('Peter Parker is a nice person and a friend of mine.')\n", - "\n", - "light_result['clean_text']" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "9ouF8KqNb7tt" - }, - "source": [ - "## **SpellChecker**\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Trnkx6UYwEve" - }, - "source": [ - "### Norvig Spell Checker\n", - "\n", - "This annotator takes tokens as input and automatically corrects any that are not found in an English dictionary." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 8177, - "status": "ok", - "timestamp": 1664908863208, - "user": { - 
"displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "A8RrJeKCarp9", - "outputId": "90d1cb97-085b-48b2-9691-d4b1ab0fd474" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "spellcheck_norvig download started this may take some time.\n", - "Approximate size to download 4.2 MB\n", - "[OK!]\n" - ] - } - ], - "source": [ - "spell_checker_norvig = NorvigSweetingModel.pretrained('spellcheck_norvig')\\\n", - " .setInputCols(\"token\")\\\n", - " .setOutputCol(\"corrected\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 763, - "status": "ok", - "timestamp": 1664908863965, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "hnCTwbIgMVzK", - "outputId": "dd40f4ad-d025-4370-d051-b3479daab7bb" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+--------------------------------------------------------+\n", - "|text |\n", - "+--------------------------------------------------------+\n", - "|Peter Parker is a nice persn and lives in New York. 
|\n", - "|Bruce Wayne is also a nice guy and lives in Gotham City.|\n", - "+--------------------------------------------------------+\n", - "\n" - ] - } - ], - "source": [ - "from pyspark.sql.types import StringType\n", - "\n", - "text_list = ['Peter Parker is a nice persn and lives in New York.', \n", - " 'Bruce Wayne is also a nice guy and lives in Gotham City.']\n", - "\n", - "spark_df = spark.createDataFrame(text_list, StringType()).toDF(\"text\")\n", - "\n", - "spark_df.show(truncate=False)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "HxIK6oxHccuw" - }, - "outputs": [], - "source": [ - "nlpPipeline = Pipeline(\n", - " stages=[\n", - " documentAssembler, \n", - " tokenizer,\n", - " stemmer,\n", - " lemmatizer,\n", - " pos,\n", - " spell_checker_norvig\n", - " ])\n", - "\n", - "empty_df = spark.createDataFrame([['']]).toDF(\"text\")\n", - "\n", - "pipelineModel = nlpPipeline.fit(empty_df)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "tSQkHKrNdJEx" - }, - "outputs": [], - "source": [ - "result = pipelineModel.transform(spark_df)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 363 - }, - "executionInfo": { - "elapsed": 1085, - "status": "ok", - "timestamp": 1664908885485, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "ZRL3ukFneUtM", - "outputId": "5ab2609f-3022-4305-e69b-49271df81995" - }, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - "
 \n", - " " - ], - "text/plain": [ - " token corrected stem lemma pos\n", - "0 Peter Peter peter Peter NNP\n", - "1 Parker Parker parker Parker NNP\n", - "2 is is i be VBZ\n", - "3 a a a a DT\n", - "4 nice nice nice nice JJ\n", - "5 persn person persn persn NN\n", - "6 and and and and CC\n", - "7 lives lives live life NNS\n", - "8 in in in in IN\n", - "9 New New new New NNP" - ] - }, - "execution_count": 49, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from pyspark.sql import functions as F\n", - "\n", - "result_df = result.select(F.explode(F.arrays_zip(result.token.result, \n", - " result.corrected.result, \n", - " result.stem.result, \n", - " result.lemma.result, \n", - " result.pos.result)).alias(\"cols\")) \\\n", - " .select(F.expr(\"cols['0']\").alias(\"token\"),\n", - " F.expr(\"cols['1']\").alias(\"corrected\"),\n", - " F.expr(\"cols['2']\").alias(\"stem\"),\n", - " F.expr(\"cols['3']\").alias(\"lemma\"),\n", - " F.expr(\"cols['4']\").alias(\"pos\")).toPandas()\n", - "\n", - "result_df.head(10)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 300, - "status": "ok", - "timestamp": 1664908886968, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "3NMCMcDEe2UV", - "outputId": "24919aee-0989-4f2a-df09-31499a43e2ff" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "[('The', 'The'),\n", - " ('patint', 'patient'),\n", - " ('has', 'has'),\n", - " ('pain', 'pain'),\n", - " ('and', 'and'),\n", - " ('headace', 'headache')]" - ] - }, - "execution_count": 50, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# same in LightPipeline\n", - "\n", - "light_model = LightPipeline(pipelineModel)\n", - "\n", - "light_result = light_model.annotate('The patint has pain and headace')\n", - 
"list(zip(light_result['token'], light_result['corrected']))" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Gur1-TQWwMi8" - }, - "source": [ - "### Context SpellChecker" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "paSJd3uqSky6" - }, - "source": [ - "The idea for this annotator is to have a flexible, configurable and \"re-usable by parts\" model.\n", - "\n", - "Flexibility is the ability to accommodate different use cases for spell checking like OCR text, keyboard-input text, ASR text, and general spelling problems due to orthographic errors.\n", - "\n", - "We say this is a configurable annotator, as you can adapt it yourself to different use cases avoiding re-training as much as possible.\n", - "\n", - "Spell checking works at three levels. The final ranking of a correction sequence is affected by three things:\n", - "\n", - "Different correction candidates for each word - **word level**.\n", - "\n", - "The surrounding text of each word, i.e. its context - **sentence level**.\n", - "\n", - "The relative cost of different correction candidates according to the character-level edit operations they require - **subword level**." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 35902, - "status": "ok", - "timestamp": 1664908935636, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "hK9VWFp3SaW9", - "outputId": "b96ad15e-4d35-4b19-e832-6dc62b1bb346" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "spellcheck_dl download started this may take some time.\n", - "Approximate size to download 95.1 MB\n", - "[OK!]\n" - ] - } - ], - "source": [ - "spellModel = ContextSpellCheckerModel.pretrained('spellcheck_dl')\\\n", - " .setInputCols(\"token\")\\\n", - " .setOutputCol(\"checked\")\n", - "\n", - "finisher = Finisher()\\\n", - " .setInputCols(\"checked\")\n", - "\n", - "pipeline = Pipeline(\n", - " stages = [\n", - " documentAssembler,\n", - " tokenizer,\n", - " spellModel,\n", - " finisher\n", - " ])\n", - "\n", - "empty_ds = spark.createDataFrame([[\"\"]]).toDF(\"text\")\n", - "\n", - "sc_model = pipeline.fit(empty_ds)\n", - "lp = LightPipeline(sc_model)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 1137, - "status": "ok", - "timestamp": 1664908936768, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "OixU8ShyUo0i", - "outputId": "88aa746d-a86a-41c3-aa2d-491c33afd913" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "{'checked': ['Please',\n", - " 'allow',\n", - " 'me',\n", - " 'to',\n", - " 'introduce',\n", - " 'myself',\n", - " ',',\n", - " 'I',\n", - " 'am',\n", - " 'a',\n", - " 'man',\n", - " 'of',\n", - " 'wealth',\n", - " 'and',\n", - " 'taste']}" - ] - }, - "execution_count": 52, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - 
"lp.annotate(\"Plaese alliow me tao introdduce myhelf, I am a man of waelth and tiaste\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 1150, - "status": "ok", - "timestamp": 1664908937914, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "LAT3Q8FbWpeU", - "outputId": "256d355f-f2d6-4bd6-f468-03a96272f09c" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+----------------------------------------------------+-------------------------------------------------------------------+\n", - "|text |finished_checked |\n", - "+----------------------------------------------------+-------------------------------------------------------------------+\n", - "|We will go to swimming if the ueather is sunny. |[We, will, go, to, swimming, if, the, weather, is, sunny, .] |\n", - "|I have a black ueather jacket, so nice. |[I, have, a, black, leather, jacket, ,, so, nice, .] 
 |\n", - "|I introduce you to my sister, she is called ueather.|[I, introduce, you, to, my, sister, ,, she, is, called, Heather, .]|\n", - "+----------------------------------------------------+-------------------------------------------------------------------+\n", - "\n" - ] - } - ], - "source": [ - "from pyspark.sql.types import StringType\n", - "\n", - "examples = ['We will go to swimming if the ueather is sunny.',\\\n", - " \"I have a black ueather jacket, so nice.\",\\\n", - " \"I introduce you to my sister, she is called ueather.\"]\n", - "\n", - "spark_df = spark.createDataFrame(examples, StringType()).toDF(\"text\")\n", - "\n", - "results = sc_model.transform(spark_df)\n", - "results.show(truncate=False)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "gyNDdgVqvVtV" - }, - "source": [ - "## **Language Detector**" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "lJSy_xiufD42" - }, - "source": [ - "Language detection and identification is the task of automatically detecting the language(s) present in a document based on the content of the document. `LanguageDetectorDL` is an annotator that detects the language of documents or sentences depending on the inputCols. In addition, `LanguageDetectorDL` can accurately detect language from documents with mixed languages by coalescing sentences and selecting the best candidate." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 16, - "status": "ok", - "timestamp": 1664908937915, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "43hvF4-OL-Lj", - "outputId": "74f25fb3-e810-460a-ec24-bc7e7b421fab" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "{Param(parent='LanguageDetectorDL_bf1972e695d7', name='lazyAnnotator', doc='Whether this AnnotatorModel acts as lazy in RecursivePipelines'): False,\n", - " Param(parent='LanguageDetectorDL_bf1972e695d7', name='threshold', doc='The minimum threshold for the final result otheriwse it will be either neutral or the value set in thresholdLabel.'): 0.5,\n", - " Param(parent='LanguageDetectorDL_bf1972e695d7', name='thresholdLabel', doc='In case the score is less than threshold, what should be the label. Default is neutral.'): 'Unknown',\n", - " Param(parent='LanguageDetectorDL_bf1972e695d7', name='coalesceSentences', doc='If sets to true the output of all sentences will be averaged to one output instead of one output per sentence. 
Default to false.'): True}" - ] - }, - "execution_count": 54, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "LanguageDetectorDL().extractParamMap()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 3103, - "status": "ok", - "timestamp": 1664908941006, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "LnacEgnaY7nR", - "outputId": "4bd87fbc-3a54-4274-b03f-86fea8122cb5" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "ld_wiki_tatoeba_cnn_375 download started this may take some time.\n", - "Approximate size to download 8.8 MB\n", - "[OK!]\n" - ] - } - ], - "source": [ - "documentAssembler = DocumentAssembler()\\\n", - " .setInputCol(\"text\")\\\n", - " .setOutputCol(\"document\")\n", - "\n", - "languageDetector = LanguageDetectorDL.pretrained(\"ld_wiki_tatoeba_cnn_375\", \"xx\")\\\n", - " .setInputCols(\"document\")\\\n", - " .setOutputCol(\"language\")\\\n", - " .setThreshold(0.5)\\\n", - " .setCoalesceSentences(True)\n", - "\n", - "nlpPipeline = Pipeline(\n", - " stages=[\n", - " documentAssembler, \n", - " languageDetector\n", - " ])\n", - "\n", - "empty_df = spark.createDataFrame([['']]).toDF(\"text\")\n", - "\n", - "pipelineModel = nlpPipeline.fit(empty_df)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "yWPp7EeyaeEI" - }, - "outputs": [], - "source": [ - "light_model = LightPipeline(pipelineModel)\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "UjmVCWMdZNB_" - }, - "outputs": [], - "source": [ - "text_en = \"William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist. He is best known as the co-founder of Microsoft Corporation. 
During his career at Microsoft, Gates held the positions of chairman, chief executive officer (CEO), president and chief software architect, while also being the largest individual shareholder until May 2014.\"\n", - "\n", - "text_de = 'Als Sebastian Thrun 2007 bei Google anfing, an selbstfahrenden Autos zu arbeiten, nahmen ihn nur wenige Leute außerhalb des Unternehmens ernst.'\n", - "\n", - "text_es = \"La historia del procesamiento del lenguaje natural generalmente comenzó en la década de 1950, aunque se puede encontrar trabajo de períodos anteriores. En 1950, Alan Turing publicó un artículo titulado 'Maquinaria de computación e inteligencia' que proponía lo que ahora se llama la prueba de Turing como criterio de inteligencia\"\n", - "\n", - "text_it = \"Geoffrey Everest Hinton è uno psicologo cognitivo e uno scienziato informatico canadese inglese, noto soprattutto per il suo lavoro sulle reti neurali artificiali. Dal 2013 divide il suo tempo lavorando per Google e l'Università di Toronto. Nel 2017 è stato cofondatore ed è diventato Chief Scientific Advisor del Vector Institute di Toronto.\"\n", - "\n", - "text_tr = 'Doğal Dil İşleme (NLP), bilgisayar biliminin, insanlar doğal olarak konuşup yazdıkça insan dilini anlamasını sağlayan bilgisayar biliminin alt alanıdır. 
'" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 390, - "status": "ok", - "timestamp": 1664908941386, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "X0hRXP8maUwF", - "outputId": "932445d2-5fa1-4253-9ee5-2f0d9bfafe3f" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "['tr']" - ] - }, - "execution_count": 58, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "light_model.annotate(text_tr)['language']" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 18, - "status": "ok", - "timestamp": 1664908941388, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "gf0KUzqLBTN9", - "outputId": "40202d16-a75a-465c-de3f-0b3bc1bca673" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "['es']" - ] - }, - "execution_count": 59, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "light_model.annotate(text_es)['language']" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 376, - "status": "ok", - "timestamp": 1664908941756, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "LC6k4k3yajCe", - "outputId": "fc11ba6c-51be-4244-cb87-c5461b3e5f31" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "[Annotation(language, 0, 328, es, {'avk': '3.2227776E-25', 'toki': '0.0', 'dng': '0.0', 'hy': '0.0', 'bua': '0.0', 'pcd': '1.9255921E-31', 'se': '3.0613954E-33', 'nlv': '0.0', 'ku': '1.6084451E-30', 'gcf': '0.0', 'xmf': '0.0', 'rue': '0.0', 'lou': '0.0', 'crh': 
'0.0', 'lkt': '0.0', 'oar': '0.0', 'aoz': '0.0', 'ss': '0.0', 'st': '0.0', 'ota': '0.0', 'bs': '7.278714E-36', 'cho': '0.0', 'stq': '0.0', 'kaa': '0.0', 'ba': '0.0', 'ngu': '0.0', 'pfl': '0.0', 'lb': '6.1421093E-24', 'hr': '3.8482914E-28', 'ta': '0.0', 'ka': '0.0', 'ar': '0.0', 'lzz': '0.0', 'swh': '0.0', 'hbo': '0.0', 'pi': '0.0', 'nov': '9.6963065E-25', 'yue': '0.0', 'ty': '0.0', 'fr': '2.2967522E-22', 'lfn': '5.509995E-9', 'is': '1.7047716E-29', 'urh': '0.0', 'mgm': '0.0', 'nah': '0.0', 'ug': '0.0', 'otk': '0.0', 'lv': '0.0', 'tmw': '0.0', 'eu': '1.0338304E-22', 'mdf': '0.0', 'got': '0.0', 'kl': '0.0', 'rn': '0.0', 'emx': '0.0', 'vep': '0.0', 'am': '0.0', 'hif': '0.0', 'mt': '4.717723E-21', 'krc': '0.0', 'bn': '0.0', 'rw': '0.0', 'gsw': '0.0', 'apc': '0.0', 'uz': '0.0', 'csb': '0.0', 'ckt': '0.0', 'aii': '0.0', 'bho': '0.0', 'uk': '0.0', 'chg': '0.0', 'co': '0.0', 'fj': '0.0', 'zlm': '0.0', 'toi': '0.0', 'si': '0.0', 'dsb': '0.0', 'lld': '0.0', 'ky': '0.0', 'enm': '0.0', 'ksh': '0.0', 'bvy': '0.0', 'pa': '0.0', 'ga': '7.1540605E-31', 'gan': '0.0', 'ceb': '0.0', 'br': '8.016619E-19', 'lmo': '0.0', 'rap': '0.0', 'bal': '0.0', 'ady': '0.0', 'nys': '0.0', 'tt': '0.0', 'war': '5.471905E-35', 'so': '0.0', 'tts': '0.0', 'mwl': '0.0', 'pt': '6.489546E-14', 'tpi': '0.0', 'cs': '1.0178266E-20', 'phn': '0.0', 'zu': '0.0', 'lo': '0.0', 'gl': '1.5889118E-8', 'gn': '0.0', 'sux': '0.0', 'ban': '0.0', 'ny': '0.0', 'nds': '2.8447118E-31', 'cjy': '0.0', 'fuc': '0.0', 'sr': '1.6595062E-26', 'nog': '0.0', 'ts': '0.0', 'chn': '0.0', 'el': '0.0', 'tpw': '0.0', 'it': '2.1318946E-13', 'sc': '0.0', 'su': '0.0', 'ber': '1.966357E-26', 'ca': '1.6977338E-10', 'rel': '0.0', 'os': '0.0', 'hnj': '0.0', 'vi': '0.0', 'mnc': '0.0', 'aln': '0.0', 'la': '2.1372422E-15', 'nch': '0.0', 'ltg': '0.0', 'ab': '0.0', 'tg': '0.0', 'mg': '0.0', 'as': '0.0', 'kam': '0.0', 'yo': '0.0', 'tzl': '2.683327E-26', 'min': '0.0', 'shs': '0.0', 'dv': '0.0', 'pdc': '0.0', 'cay': '0.0', 'tl': '1.04098085E-23', 'nl': 
'1.8647921E-28', 'ike': '0.0', 'jpa': '0.0', 'bg': '0.0', 'gv': '0.0', 'bi': '0.0', 'swg': '0.0', 'hil': '0.0', 'ko': '0.0', 'rm': '0.0', 'or': '0.0', 'eo': '1.4023826E-13', 'tk': '3.6414057E-35', 'cyo': '0.0', 'rom': '4.664236E-36', 'mk': '0.0', 'dtp': '0.0', 'gil': '0.0', 'fkv': '0.0', 'oc': '1.7036886E-11', 'haw': '0.0', 'egl': '0.0', 'umb': '0.0', 'et': '3.80267E-30', 'pau': '0.0', 'af': '5.8317244E-36', 'gag': '0.0', 'laa': '0.0', 'de': '1.2161509E-26', 'bm': '0.0', 'xh': '0.0', 'dws': '0.0', 'ps': '0.0', 'scn': '0.0', 'ch': '2.7466975E-34', 'yi': '0.0', 'qya': '0.0', 'ha': '0.0', 'cy': '3.6395086E-32', 'myv': '0.0', 'nb': '2.7764615E-31', 'sn': '0.0', 'to': '0.0', 'rif': '0.0', 'bjn': '0.0', 'fro': '0.0', 'pap': '0.0', 'ig': '0.0', 'frr': '0.0', 'kxi': '0.0', 'bzt': '0.0', 'zza': '0.0', 'arq': '0.0', 'cv': '0.0', 'ur': '0.0', 'mfe': '0.0', 'oj': '0.0', 'pam': '0.0', 'quc': '0.0', 'fy': '0.0', 'ln': '0.0', 'jv': '2.8774632E-32', 'jbo': '1.3549087E-27', 'afh': '0.0', 'jdt': '0.0', 'ru': '1.6341115E-37', 'ht': '7.9929254E-26', 'vro': '0.0', 'kw': '3.1891775E-28', 'ml': '0.0', 'th': '0.0', 'tly': '0.0', 'drt': '0.0', 'id': '1.0684691E-35', 'ce': '0.0', 'pnb': '0.0', 'sq': '1.9903843E-35', 'kha': '0.0', 'ia': '3.723646E-8', 'arz': '0.0', 'lzh': '0.0', 'max': '0.0', 'pag': '0.0', 'sv': '1.5181038E-26', 'ppl': '0.0', 'udm': '0.0', 'tr': '1.3081725E-27', 'ain': '0.0', 'da': '1.2120395E-24', 'my': '0.0', 'gbm': '0.0', 'liv': '0.0', 'kzj': '0.0', 'zsm': '0.0', 'sg': '0.0', 'chr': '0.0', 'kek': '0.0', 'wo': '0.0', 'awa': '0.0', 'lg': '0.0', 'mh': '0.0', 'xal': '0.0', 'sm': '0.0', 'en': '1.6624443E-28', 'gu': '0.0', 'tn': '0.0', 'he': '0.0', 'sah': '0.0', 'tet': '0.0', 'new': '0.0', 'ilo': '3.3793208E-34', 'kn': '0.0', 'gd': '0.0', 'syc': '0.0', 'sk': '5.776495E-17', 'orv': '0.0', 'fur': '0.0', 'ary': '0.0', 'mrj': '0.0', 'krl': '0.0', 'bar': '0.0', 'lvs': '1.3227742E-36', 'na': '0.0', 'tig': '0.0', 'kpv': '0.0', 'lad': '4.582317E-6', 'sma': '0.0', 'az': '0.0', 'mic': 
'0.0', 'iba': '0.0', 'wa': '8.01678E-32', 'hoc': '0.0', 'lij': '2.766855E-28', 'es': '0.9999949', 'mvv': '0.0', 'fo': '1.9281377E-36', 'hsn': '0.0', 'prg': '0.0', 'mai': '0.0', 'hi': '0.0', 'vo': '2.4248794E-32', 'gom': '0.0', 'bcl': '0.0', 'te': '0.0', 'mr': '0.0', 'tlh': '1.7962606E-38', 'ie': '2.6994945E-13', 'ext': '5.785406E-24', 'tkl': '0.0', 'an': '1.859455E-8', 'sco': '0.0', 'nn': '3.8007597E-28', 'kjh': '0.0', 'io': '2.8876567E-17', 'sw': '9.545265E-37', 'mww': '0.0', 'be': '0.0', 'qu': '1.381481E-9', 'sgs': '0.0', 'kum': '0.0', 'mnw': '0.0', 'cbk': '6.1433343E-7', 'sd': '0.0', 'osp': '5.093505E-38', 'ang': '0.0', 'izh': '0.0', 'mi': '0.0', 'kab': '1.1073482E-25', 'hsb': '0.0', 'ja': '0.0', 'cpi': '0.0', 'sa': '0.0', 'moh': '0.0', 'acm': '0.0', 'ast': '1.2008109E-8', 'jam': '0.0', 'sjn': '0.0', 'grc': '0.0', 'fi': '4.1400016E-35', 'bo': '0.0', 'cycl': '0.0', 'tyv': '0.0', 'ro': '3.3371775E-17', 'evn': '0.0', 'tvl': '0.0', 'ngt': '0.0', 'cmo': '0.0', 'brx': '0.0', 'frm': '0.0', 'ryu': '0.0', 'afb': '0.0', 'ne': '0.0', 'ee': '0.0', 'gaa': '0.0', 'fuv': '0.0', 'mhr': '0.0', 'lt': '7.4077635E-21', 'no': '0.0', 'wuu': '0.0', 'npi': '0.0', 'nst': '0.0', 'tmr': '0.0', 'vec': '0.0', 'koi': '0.0', 'km': '0.0', 'gos': '0.0', 'kk': '0.0', 'sl': '1.5304573E-22', 'pms': '2.1088961E-23', 'ay': '0.0', 'thv': '0.0', 'ti': '0.0', 'ii': '0.0', 'hak': '0.0', 'non': '0.0', 'fa': '0.0', 'mn': '0.0', 'zh': '0.0', 'osx': '0.0', 'shy': '0.0', 'ms': '0.0', 'ldn': '0.0', 'sentence': '0', 'hu': '9.299107E-22', 'nv': '0.0', 'akl': '0.0', 'pl': '1.00265134E-19', 'mad': '0.0', 'ks': '0.0', 'hrx': '0.0', 'niu': '0.0', 'crs': '0.0'})]" - ] - }, - "execution_count": 60, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "light_model.fullAnnotate(text_es)[0]['language']" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 6329, - "status": "ok", - 
"timestamp": 1664908948076, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "vNaXj6bYgPJe", - "outputId": "a7ef6444-e190-46cb-ae1d-a413364635d1" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "detect_language_220 download started this may take some time.\n", - "Approx size to download 9.1 MB\n", - "[OK!]\n" - ] - }, - { - "data": { - "text/plain": [ - "{'document': ['French author who helped pioneer the science-fiction genre.'],\n", - " 'sentence': ['French author who helped pioneer the science-fiction genre.'],\n", - " 'language': ['en']}" - ] - }, - "execution_count": 61, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from sparknlp.pretrained import PretrainedPipeline\n", - "\n", - "pipeline = PretrainedPipeline(\"detect_language_220\", lang = \"xx\")\n", - "\n", - "pipeline.annotate(\"French author who helped pioneer the science-fiction genre.\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "IlpLWUJhLPvM" - }, - "source": [ - "Translation with MarianTransformer\n", - "\n", - "[MarianTransformer](https://nlp.johnsnowlabs.com/docs/en/transformers#mariantransformer) is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. MarianTransformer uses the models trained by MarianNMT.\n", - "\n", - "It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects." 
 - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 24696, - "status": "ok", - "timestamp": 1664909112937, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "3ZqfVlpbLXfV", - "outputId": "dc18f780-9804-43c6-ed8e-7cd3e475e065" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "sentence_detector_dl download started this may take some time.\n", - "Approximate size to download 514.9 KB\n", - "[OK!]\n", - "opus_mt_de_en download started this may take some time.\n", - "Approximate size to download 372.5 MB\n", - "[OK!]\n" - ] - } - ], - "source": [ - "documentAssembler = DocumentAssembler() \\\n", - " .setInputCol(\"text\") \\\n", - " .setOutputCol(\"document\")\n", - "\n", - "sentence = SentenceDetectorDLModel.pretrained(\"sentence_detector_dl\", \"xx\") \\\n", - " .setInputCols(\"document\") \\\n", - " .setOutputCol(\"sentence\")\n", - "\n", - "marian = MarianTransformer.pretrained('opus_mt_de_en') \\\n", - " .setInputCols(\"sentence\") \\\n", - " .setOutputCol(\"translation\") \\\n", - " .setMaxInputLength(30)\n", - "\n", - "pipeline = Pipeline().setStages([documentAssembler,\n", - " sentence,\n", - " marian])\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 12719, - "status": "ok", - "timestamp": 1664909129723, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "xepsNREULyeW", - "outputId": "05923286-cf1e-4c97-aa91-b6362d1e51e0" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+---------------------------+\n", - "|result |\n", - "+---------------------------+\n", - "|The Germans love bratwurst!|\n", - 
"+---------------------------+\n", - "\n" - ] - } - ], - "source": [ - "data = spark.createDataFrame([[\"Die Deutschen lieben Bratwurst!\"]]).toDF(\"text\")\n", - "result = pipeline.fit(data).transform(data)\n", - "\n", - "result.selectExpr(\"explode(translation.result) as result\").show(truncate=False)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "LnPg4AgLwgXc" - }, - "source": [ - "## Embeddings\n", - "\n", - "**Here is the Multi-Lingual Embeddings Model List available in Spark NLP:**\n", - "\n", - "| title | name | language |\n", - "|:-----------------------------------------------------------------------------------------------------------------------|:---------------------------------------|:-----------|\n", - "| Glove Embeddings 6B 100 | glove_100d | en |\n", - "| GloVe Embeddings 6B 300 (Multilingual) | glove_6B_300 | xx |\n", - "| GloVe Embeddings 840B 300 (Multilingual) | glove_840B_300 | xx |\n", - "| Embeddings Clinical | embeddings_clinical | en |\n", - "| ELMo Embeddings | elmo | en |\n", - "| Embeddings Healthcare | embeddings_healthcare | en |\n", - "| Universal Sentence Encoder | tfhub_use | en |\n", - "| Universal Sentence Encoder Large | tfhub_use_lg | en |\n", - "| ALBERT Embeddings (Base Uncase) | albert_base_uncased | en |\n", - "| ALBERT Embeddings (Large Uncase) | albert_large_uncased | en |\n", - "| ALBERT Embeddings (XLarge Uncase) | albert_xlarge_uncased | en |\n", - "| ALBERT Embeddings (XXLarge Uncase) | albert_xxlarge_uncased | en |\n", - "| XLNet Embeddings (Base) | xlnet_base_cased | en |\n", - "| XLNet Embeddings (Large) | xlnet_large_cased | en |\n", - "| Embeddings Scielo 150 dims | embeddings_scielo_150d | es |\n", - "| Embeddings Scielo 300 dims | embeddings_scielo_300d | es |\n", - "| Embeddings Scielo 50 dims | embeddings_scielo_50d | es |\n", - "| Embeddings Scielowiki 150 dims | embeddings_scielowiki_150d | es |\n", - "| Embeddings Scielowiki 300 dims | embeddings_scielowiki_300d | es |\n", - "| Embeddings 
Scielowiki 50 dims | embeddings_scielowiki_50d | es |\n", - "| Embeddings Sciwiki 150 dims | embeddings_sciwiki_150d | es |\n", - "| Embeddings Sciwiki 300 dims | embeddings_sciwiki_300d | es |\n", - "| Embeddings Sciwiki 50 dims | embeddings_sciwiki_50d | es |\n", - "| Embeddings Healthcare 100 dims | embeddings_healthcare_100d | en |\n", - "| Embeddings BioVec | embeddings_biovec | en |\n", - "| BERT Embeddings (Base Cased) | bert_base_cased | en |\n", - "| BERT Embeddings (Base Uncased) | bert_base_uncased | en |\n", - "| BERT Embeddings (Large Cased) | bert_large_cased | en |\n", - "| BERT Embeddings (Large Uncased) | bert_large_uncased | en |\n", - "| Multilingual BERT Embeddings (Base Cased) | bert_multi_cased | xx |\n", - "| BioBERT Embeddings (Clinical) | biobert_clinical_base_cased | en |\n", - "| BioBERT Embeddings (Discharge) | biobert_discharge_base_cased | en |\n", - "| BioBERT Embeddings (PMC) | biobert_pmc_base_cased | en |\n", - "| BioBERT Embeddings (Pubmed) | biobert_pubmed_base_cased | en |\n", - "| BioBERT Embeddings (Pubmed Large) | biobert_pubmed_large_cased | en |\n", - "| BioBERT Embeddings (Pubmed PMC) | biobert_pubmed_pmc_base_cased | en |\n", - "| BioBERT Sentence Embeddings (Clinical) | sent_biobert_clinical_base_cased | en |\n", - "| BERT Sentence Embeddings (Base Cased) | sent_bert_base_cased | en |\n", - "| BERT Sentence Embeddings (Base Uncased) | sent_bert_base_uncased | en |\n", - "| BERT Sentence Embeddings (Large Cased) | sent_bert_large_cased | en |\n", - "| BERT Sentence Embeddings (Large Uncased) | sent_bert_large_uncased | en |\n", - "| Multilingual BERT Sentence Embeddings (Base Cased) | sent_bert_multi_cased | xx |\n", - "| BioBERT Sentence Embeddings (Discharge) | sent_biobert_discharge_base_cased | en |\n", - "| BioBERT Sentence Embeddings (PMC) | sent_biobert_pmc_base_cased | en |\n", - "| BioBERT Sentence Embeddings (Pubmed) | sent_biobert_pubmed_base_cased | en |\n", - "| BioBERT Sentence Embeddings (Pubmed Large) | 
sent_biobert_pubmed_large_cased | en |\n", - "| BioBERT Sentence Embeddings (Pubmed PMC) | sent_biobert_pubmed_pmc_base_cased | en |\n", - "| Smaller BERT Sentence Embeddings (L-10_H-128_A-2) | sent_small_bert_L10_128 | en |\n", - "| Smaller BERT Sentence Embeddings (L-10_H-256_A-4) | sent_small_bert_L10_256 | en |\n", - "| Smaller BERT Sentence Embeddings (L-10_H-512_A-8) | sent_small_bert_L10_512 | en |\n", - "| Smaller BERT Sentence Embeddings (L-10_H-768_A-12) | sent_small_bert_L10_768 | en |\n", - "| Smaller BERT Sentence Embeddings (L-12_H-128_A-2) | sent_small_bert_L12_128 | en |\n", - "| Smaller BERT Sentence Embeddings (L-12_H-256_A-4) | sent_small_bert_L12_256 | en |\n", - "| Smaller BERT Sentence Embeddings (L-12_H-512_A-8) | sent_small_bert_L12_512 | en |\n", - "| Smaller BERT Sentence Embeddings (L-12_H-768_A-12) | sent_small_bert_L12_768 | en |\n", - "| Smaller BERT Sentence Embeddings (L-2_H-128_A-2) | sent_small_bert_L2_128 | en |\n", - "| Smaller BERT Sentence Embeddings (L-2_H-256_A-4) | sent_small_bert_L2_256 | en |\n", - "| Smaller BERT Sentence Embeddings (L-2_H-512_A-8) | sent_small_bert_L2_512 | en |\n", - "| Smaller BERT Sentence Embeddings (L-2_H-768_A-12) | sent_small_bert_L2_768 | en |\n", - "| Smaller BERT Sentence Embeddings (L-4_H-128_A-2) | sent_small_bert_L4_128 | en |\n", - "| Smaller BERT Sentence Embeddings (L-4_H-256_A-4) | sent_small_bert_L4_256 | en |\n", - "| Smaller BERT Sentence Embeddings (L-4_H-512_A-8) | sent_small_bert_L4_512 | en |\n", - "| Smaller BERT Sentence Embeddings (L-4_H-768_A-12) | sent_small_bert_L4_768 | en |\n", - "| Smaller BERT Sentence Embeddings (L-6_H-128_A-2) | sent_small_bert_L6_128 | en |\n", - "| Smaller BERT Sentence Embeddings (L-6_H-256_A-4) | sent_small_bert_L6_256 | en |\n", - "| Smaller BERT Sentence Embeddings (L-6_H-512_A-8) | sent_small_bert_L6_512 | en |\n", - "| Smaller BERT Sentence Embeddings (L-6_H-768_A-12) | sent_small_bert_L6_768 | en |\n", - "| Smaller BERT Sentence Embeddings 
(L-8_H-128_A-2) | sent_small_bert_L8_128 | en |\n", - "| Smaller BERT Sentence Embeddings (L-8_H-256_A-4) | sent_small_bert_L8_256 | en |\n", - "| Smaller BERT Sentence Embeddings (L-8_H-512_A-8) | sent_small_bert_L8_512 | en |\n", - "| Smaller BERT Sentence Embeddings (L-8_H-768_A-12) | sent_small_bert_L8_768 | en |\n", - "| Smaller BERT Embeddings (L-10_H-128_A-2) | small_bert_L10_128 | en |\n", - "| Smaller BERT Embeddings (L-10_H-256_A-4) | small_bert_L10_256 | en |\n", - "| Smaller BERT Embeddings (L-10_H-512_A-8) | small_bert_L10_512 | en |\n", - "| Smaller BERT Embeddings (L-10_H-768_A-12) | small_bert_L10_768 | en |\n", - "| Smaller BERT Embeddings (L-12_H-128_A-2) | small_bert_L12_128 | en |\n", - "| Smaller BERT Embeddings (L-12_H-256_A-4) | small_bert_L12_256 | en |\n", - "| Smaller BERT Embeddings (L-12_H-512_A-8) | small_bert_L12_512 | en |\n", - "| Smaller BERT Embeddings (L-12_H-768_A-12) | small_bert_L12_768 | en |\n", - "| Smaller BERT Embeddings (L-2_H-128_A-2) | small_bert_L2_128 | en |\n", - "| Smaller BERT Embeddings (L-2_H-256_A-4) | small_bert_L2_256 | en |\n", - "| Smaller BERT Embeddings (L-2_H-512_A-8) | small_bert_L2_512 | en |\n", - "| Smaller BERT Embeddings (L-2_H-768_A-12) | small_bert_L2_768 | en |\n", - "| Smaller BERT Embeddings (L-4_H-128_A-2) | small_bert_L4_128 | en |\n", - "| Smaller BERT Embeddings (L-4_H-256_A-4) | small_bert_L4_256 | en |\n", - "| Smaller BERT Embeddings (L-4_H-512_A-8) | small_bert_L4_512 | en |\n", - "| Smaller BERT Embeddings (L-4_H-768_A-12) | small_bert_L4_768 | en |\n", - "| Smaller BERT Embeddings (L-6_H-128_A-2) | small_bert_L6_128 | en |\n", - "| Smaller BERT Embeddings (L-6_H-256_A-4) | small_bert_L6_256 | en |\n", - "| Smaller BERT Embeddings (L-6_H-512_A-8) | small_bert_L6_512 | en |\n", - "| Smaller BERT Embeddings (L-6_H-768_A-12) | small_bert_L6_768 | en |\n", - "| Smaller BERT Embeddings (L-8_H-128_A-2) | small_bert_L8_128 | en |\n", - "| Smaller BERT Embeddings (L-8_H-256_A-4) | 
small_bert_L8_256 | en |\n", - "| Smaller BERT Embeddings (L-8_H-512_A-8) | small_bert_L8_512 | en |\n", - "| Smaller BERT Embeddings (L-8_H-768_A-12) | small_bert_L8_768 | en |\n", - "| COVID BERT Embeddings (Large Uncased) | covidbert_large_uncased | en |\n", - "| ELECTRA Embeddings(ELECTRA Base) | electra_base_uncased | en |\n", - "| ELECTRA Embeddings(ELECTRA Small) | electra_large_uncased | en |\n", - "| ELECTRA Embeddings(ELECTRA Small) | electra_small_uncased | en |\n", - "| COVID BERT Sentence Embeddings (Large Uncased) | sent_covidbert_large_uncased | en |\n", - "| ELECTRA Sentence Embeddings(ELECTRA Base) | sent_electra_base_uncased | en |\n", - "| ELECTRA Sentence Embeddings(ELECTRA Large) | sent_electra_large_uncased | en |\n", - "| ELECTRA Sentence Embeddings(ELECTRA Small) | sent_electra_small_uncased | en |\n", - "| Finnish BERT Embeddings (Base Cased) | bert_finnish_cased | fi |\n", - "| Finnish BERT Embeddings (Base Uncased) | bert_finnish_uncased | fi |\n", - "| Finnish BERT Sentence Embeddings (Base Cased) | sent_bert_finnish_cased | fi |\n", - "| Finnish BERT Sentence Embeddings (Base Uncased) | sent_bert_finnish_uncased | fi |\n", - "| Fastext Word Embeddings in German | w2v_cc_300d | de |\n", - "| BioBERT Embeddings (Clinical) | biobert_clinical_base_cased | en |\n", - "| BioBERT Embeddings (Discharge) | biobert_discharge_base_cased | en |\n", - "| BioBERT Embeddings (PMC) | biobert_pmc_base_cased | en |\n", - "| BioBERT Embeddings (Pubmed) | biobert_pubmed_base_cased | en |\n", - "| BioBERT Embeddings (Pubmed Large) | biobert_pubmed_large_cased | en |\n", - "| BioBERT Embeddings (Pubmed PMC) | biobert_pubmed_pmc_base_cased | en |\n", - "| BioBERT Sentence Embeddings (Clinical) | sent_biobert_clinical_base_cased | en |\n", - "| BioBERT Sentence Embeddings (Discharge) | sent_biobert_discharge_base_cased | en |\n", - "| BioBERT Sentence Embeddings (PMC) | sent_biobert_pmc_base_cased | en |\n", - "| BioBERT Sentence Embeddings (Pubmed) | 
sent_biobert_pubmed_base_cased | en |\n", - "| BioBERT Sentence Embeddings (Pubmed Large) | sent_biobert_pubmed_large_cased | en |\n", - "| BioBERT Sentence Embeddings (Pubmed PMC) | sent_biobert_pubmed_pmc_base_cased | en |\n", - "| BERT LaBSE Sentence Embeddings | labse | en |\n", - "| Portuguese BERT Embeddings (Base Cased) | bert_portuguese_base_cased | pt |\n", - "| Portuguese BERT Embeddings (Large Cased) | bert_portuguese_large_cased | pt |\n", - "| Sentence Embeddings - Biobert cased (MedNLI) | sbiobert_base_cased_mli | en |\n", - "| Sentence Embeddings - Bluebert uncased (MedNLI) | sbluebert_base_uncased_mli | en |\n", - "| Word Embeddings for Urdu (urduvec_140M_300d) | urduvec_140M_300d | ur |\n", - "| Word Embeddings for Arabic (arabic_w2v_cc_300d) | arabic_w2v_cc_300d | ar |\n", - "| Word Embeddings for Persian (persian_w2v_cc_300d) | persian_w2v_cc_300d | fa |\n", - "| Universal Sentence Encoder Multilingual Large | tfhub_use_multi_lg | xx |\n", - "| Universal Sentence Encoder Multilingual | tfhub_use_multi | xx |\n", - "| Universal Sentence Encoder XLING English and German | tfhub_use_xling_en_de | xx |\n", - "| Universal Sentence Encoder XLING English and Spanish | tfhub_use_xling_en_es | xx |\n", - "| Universal Sentence Encoder XLING English and French | tfhub_use_xling_en_fr | xx |\n", - "| Universal Sentence Encoder XLING Many | tfhub_use_xling_many | xx |\n", - "| Word Embeddings for Hebrew (hebrew_cc_300d) | hebrew_cc_300d | he |\n", - "| Word Embeddings for Hindi (hindi_cc_300d) | hindi_cc_300d | hi |\n", - "| Word Embeddings for Bengali (bengali_cc_300d) | bengali_cc_300d | bn |\n", - "| Universal Sentence Encoder Multilingual Large (tfhub_use_multi_lg) | tfhub_use_multi_lg | xx |\n", - "| Universal Sentence Encoder Multilingual (tfhub_use_multi) | tfhub_use_multi | xx |\n", - "| Sentence Embeddings - sbert medium (tuned) | sbert_jsl_medium_umls_uncased | en |\n", - "| Sentence Embeddings - sbert medium (tuned) | sbert_jsl_medium_uncased | en 
|\n", - "| Sentence Embeddings - sbert mini (tuned) | sbert_jsl_mini_umls_uncased | en |\n", - "| Sentence Embeddings - sbert mini (tuned) | sbert_jsl_mini_uncased | en |\n", - "| Sentence Embeddings - sbert tiny (tuned) | sbert_jsl_tiny_umls_uncased | en |\n", - "| Sentence Embeddings - sbert tiny (tuned) | sbert_jsl_tiny_uncased | en |\n", - "| Sentence Embeddings - sbiobert (tuned) | sbiobert_jsl_cased | en |\n", - "| Sentence Embeddings - sbiobert (tuned) | sbiobert_jsl_umls_cased | en |\n", - "| Chinese BERT Base | bert_base_chinese | zh |\n", - "| BERTje A Dutch BERT model | bert_base_dutch_cased | nl |\n", - "| German BERT Base Cased Model | bert_base_german_cased | de |\n", - "| German BERT Base Uncased Model | bert_base_german_uncased | de |\n", - "| Italian BERT Base Cased | bert_base_italian_cased | it |\n", - "| Italian BERT Base Uncased | bert_base_italian_uncased | it |\n", - "| BERT multilingual base model (cased) | bert_base_multilingual_cased | xx |\n", - "| BERT multilingual base model (uncased) | bert_base_multilingual_uncased | xx |\n", - "| Turkish BERT Base Cased (BERTurk) | bert_base_turkish_cased | tr |\n", - "| Turkish BERT Base Uncased (BERTurk) | bert_base_turkish_uncased | tr |\n", - "| Chinese BERT with Whole Word Masking | chinese_bert_wwm | zh |\n", - "| DistilBERT base model (cased) | distilbert_base_cased | en |\n", - "| DistilBERT base multilingual model (cased) | distilbert_base_multilingual_cased | xx |\n", - "| DistilBERT base model (uncased) | distilbert_base_uncased | en |\n", - "| DistilRoBERTa base model | distilroberta_base | en |\n", - "| RoBERTa base model | roberta_base | en |\n", - "| RoBERTa large model | roberta_large | en |\n", - "| Twitter XLM-RoBERTa Base (twitter_xlm_roberta_base) | twitter_xlm_roberta_base | xx |\n", - "| XLM-RoBERTa Base (xlm_roberta_base) | xlm_roberta_base | xx |\n", - "| ALBERT Embeddings (Base Uncase) | albert_base_uncased | en |\n", - "| ALBERT Embeddings (Large Uncase) | 
albert_large_uncased | en |\n", - "| ALBERT Embeddings (XLarge Uncase) | albert_xlarge_uncased | en |\n", - "| ALBERT Embeddings (XXLarge Uncase) | albert_xxlarge_uncased | en |\n", - "| Sentence Embeddings - sbert medium (tuned) | sbert_jsl_medium_umls_uncased | en |\n", - "| Sentence Embeddings - sbert medium (tuned) | sbert_jsl_medium_uncased | en |\n", - "| Sentence Embeddings - sbert mini (tuned) | sbert_jsl_mini_umls_uncased | en |\n", - "| Sentence Embeddings - sbert mini (tuned) | sbert_jsl_mini_uncased | en |\n", - "| Sentence Embeddings - sbert tiny (tuned) | sbert_jsl_tiny_umls_uncased | en |\n", - "| Sentence Embeddings - sbert tiny (tuned) | sbert_jsl_tiny_uncased | en |\n", - "| Sentence Embeddings - sbiobert (tuned) | sbiobert_jsl_cased | en |\n", - "| Sentence Embeddings - sbiobert (tuned) | sbiobert_jsl_umls_cased | en |\n", - "| Chinese Pre-Trained XLNet (Base) | chinese_xlnet_base | zh |\n", - "| XLNet Embeddings (Base Cased) | xlnet_base_cased | en |\n", - "| XLNet Embeddings (Large Cased) | xlnet_large_cased | en |\n", - "| XLM-RoBERTa XTREME Base (xlm_roberta_xtreme_base) | xlm_roberta_xtreme_base | xx |\n", - "| Universal sentence encoder for English trained with CMLM (sent_bert_use_cmlm_en_base) | sent_bert_use_cmlm_en_base | en |\n", - "| Universal sentence encoder for English trained with CMLM (sent_bert_use_cmlm_en_large) | sent_bert_use_cmlm_en_large | en |\n", - "| Universal sentence encoder for 100+ languages trained with CMLM (sent_bert_use_cmlm_multi_base_br) | sent_bert_use_cmlm_multi_base_br | xx |\n", - "| Universal sentence encoder for 100+ languages trained with CMLM (sent_bert_use_cmlm_multi_base) | sent_bert_use_cmlm_multi_base | xx |\n", - "| MS-BERT base model (uncased) | ms_bluebert_base_uncased | en |\n", - "| Longformer Base (longformer_base_4096) | longformer_base_4096 | en |\n", - "| Longformer Large (longformer_large_4096) | longformer_large_4096 | en |\n", - "| Multilingual Representations for Indian Languages (MuRIL) 
| bert_muril | xx |\n", - "| BERT Embeddings trained on MEDLINE/PubMed | bert_pubmed | en |\n", - "| BERT Embeddings trained on MEDLINE/PubMed and fine-tuned on SQuAD 2.0 | bert_pubmed_squad2 | en |\n", - "| BERT Embeddings trained on Wikipedia and BooksCorpus | bert_wiki_books | en |\n", - "| BERT Embeddings trained on Wikipedia and BooksCorpus and fine-tuned on MNLI | bert_wiki_books_mnli | en |\n", - "| BERT Embeddings trained on Wikipedia and BooksCorpus and fine-tuned on QNLI | bert_wiki_books_qnli | en |\n", - "| BERT Embeddings trained on Wikipedia and BooksCorpus and fine-tuned on QQP | bert_wiki_books_qqp | en |\n", - "| BERT Embeddings trained on Wikipedia and BooksCorpus and fine-tuned on SQuAD 2.0 | bert_wiki_books_squad2 | en |\n", - "| BERT Embeddings trained on Wikipedia and BooksCorpus and fine-tuned on SST-2 | bert_wiki_books_sst2 | en |\n", - "| Sentence Detection in Telugu Text | sentence_detector_dl | te |\n", - "| BERT Sentence Embeddings trained on MEDLINE/PubMed | sent_bert_pubmed | en |\n", - "| BERT Sentence Embeddings trained on MEDLINE/PubMed and fine-tuned on SQuAD 2.0 | sent_bert_pubmed_squad2 | en |\n", - "| BERT Sentence Embeddings trained on Wikipedia and BooksCorpus | sent_bert_wiki_books | en |\n", - "| BERT Sentence Embeddings trained on Wikipedia and BooksCorpus and fine-tuned on MNLI | sent_bert_wiki_books_mnli | en |\n", - "| BERT Sentence Embeddings trained on Wikipedia and BooksCorpus and fine-tuned on QNLI | sent_bert_wiki_books_qnli | en |\n", - "| BERT Sentence Embeddings trained on Wikipedia and BooksCorpus and fine-tuned on QQP | sent_bert_wiki_books_qqp | en |\n", - "| BERT Sentence Embeddings trained on Wikipedia and BooksCorpus and fine-tuned on SQuAD 2.0 | sent_bert_wiki_books_squad2 | en |\n", - "| BERT Sentence Embeddings trained on Wikipedia and BooksCorpus and fine-tuned on SST-2 | sent_bert_wiki_books_sst2 | en |\n", - "| Multilingual Representations for Indian Languages (MuRIL) - BERT Sentence Embedding 
pre-trained on 17 Indian languages | sent_bert_muril | xx |\n", - "| DistilRoBERTa Base Sentence Embeddings(sent_distilroberta_base) | sent_distilroberta_base | en |\n", - "| RoBERTa Base Sentence Embeddings(sent_roberta_base) | sent_roberta_base | en |\n", - "| RoBERTa Large Sentence Embeddings(sent_roberta_large) | sent_roberta_large | en |\n", - "| XLM-RoBERTa Base Sentence Embeddings (sent_xlm_roberta_base) | sent_xlm_roberta_base | xx |\n", - "| Spanish BERT Sentence Base Cased Embedding | sent_bert_base_cased | es |\n", - "| Dutch BERT Sentence Base Cased Embedding | sent_bert_base_cased | nl |\n", - "| Swedish BERT Sentence Base Cased Embedding | sent_bert_base_cased | sv |\n", - "| Greek BERT Sentence Base Uncased Embedding | sent_bert_base_uncased | el |\n", - "| Spanish BERT Sentence Base Uncased Embedding | sent_bert_base_uncased | es |\n", - "| Legal BERT Sentence Base Uncased Embedding | sent_bert_base_uncased_legal | en |\n", - "| Spanish BERT Base Cased Embedding | bert_base_cased | es |\n", - "| Dutch BERT Base Cased Embedding | bert_base_cased | nl |\n", - "| Swedish BERT Base Cased Embedding | bert_base_cased | sv |\n", - "| Greek BERT Base Uncased Embedding | bert_base_uncased | el |\n", - "| Spanish BERT Base Uncased Embedding | bert_base_uncased | es |\n", - "| Legal BERT Base Uncased Embedding | bert_base_uncased_legal | en |\n", - "| Word Embeddings for Japanese (japanese_cc_300d) | japanese_cc_300d | ja |\n", - "| Bert Embeddings Romanian (Base Cased) | bert_base_cased | ro |\n", - "| BERT Sentence Embeddings German (Base Cased) | sent_bert_base_cased | de |\n", - "| Japanese BERT Base | bert_base_japanese | ja |\n", - "| XLM-RoBERTa Base for Amharic (xlm_roberta_base_finetuned_amharic) | xlm_roberta_base_finetuned_amharic | am |\n", - "| XLM-RoBERTa Base for Hausa (xlm_roberta_base_finetuned_hausa) | xlm_roberta_base_finetuned_hausa | ha |\n", - "| XLM-RoBERTa Base for Igbo (xlm_roberta_base_finetuned_igbo) | xlm_roberta_base_finetuned_igbo 
| ig |\n", - "| XLM-RoBERTa Base for Kinyarwanda (xlm_roberta_base_finetuned_kinyarwanda) | xlm_roberta_base_finetuned_kinyarwanda | rw |\n", - "| XLM-RoBERTa Base for Luganda (xlm_roberta_base_finetuned_luganda) | xlm_roberta_base_finetuned_luganda | lg |\n", - "| XLM-RoBERTa Large (xlm_roberta_large) | xlm_roberta_large | xx |\n", - "| Word Embeddings for Dutch (dutch_cc_300d) | dutch_cc_300d | nl |" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "9SJw0qleR98a" - }, - "source": [ - "**You can find all these models and more [HERE](https://nlp.johnsnowlabs.com/models?task=Embeddings&edition=Spark+NLP)**" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "s6BhkdS2jn9T" - }, - "source": [ - "### Word Embeddings (Glove)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 5993, - "status": "ok", - "timestamp": 1664909751172, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "Og-knlUnjnNR", - "outputId": "5b6314c0-b289-44b0-a88b-75445073e4d1" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "glove_100d download started this may take some time.\n", - "Approximate size to download 145.3 MB\n", - "[OK!]\n" - ] - } - ], - "source": [ - "documentAssembler = DocumentAssembler()\\\n", - " .setInputCol(\"text\")\\\n", - " .setOutputCol(\"document\")\n", - "\n", - "tokenizer = Tokenizer() \\\n", - " .setInputCols([\"document\"]) \\\n", - " .setOutputCol(\"token\")\n", - "\n", - "glove_embeddings = WordEmbeddingsModel.pretrained('glove_100d')\\\n", - " .setInputCols([\"document\", \"token\"])\\\n", - " .setOutputCol(\"embeddings\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "WY0kpJQMj3nu" - }, - "outputs": [], - "source": [ - "nlpPipeline = Pipeline(stages=[documentAssembler, \n", - 
" tokenizer,\n", - " glove_embeddings])\n", - "\n", - "result = nlpPipeline.fit(news_df).transform(news_df.limit(10))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 1370, - "status": "ok", - "timestamp": 1664909147376, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "_9pyocDNkPE2", - "outputId": "be0a0aed-a79b-482c-bb30-a517efc2842e" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+------------+----------------------------------------------------------------------------------------------------+\n", - "| token| embeddings|\n", - "+------------+----------------------------------------------------------------------------------------------------+\n", - "| Unions|[0.71865, 0.80754, -1.1787, 0.27145, -0.48833, -0.18938, -1.1789, 0.17836, -0.21995, -0.7216, -0....|\n", - "|representing|[0.25671, 0.30035, -0.18006, 0.46666, 0.98501, 0.2321, -0.34959, 0.26997, -0.99667, -0.43404, -0....|\n", - "| workers|[0.50592, 0.71717, -0.67236, -0.32112, -0.58285, -0.47977, -0.50243, 0.60594, 0.25709, 0.03974, -...|\n", - "| at|[0.1766, 0.093851, 0.24351, 0.44313, -0.39037, 0.12524, -0.19918, 0.59855, -0.82035, 0.28006, 0.5...|\n", - "| Turner|[0.51634, -0.37186, -0.21776, -1.0115, 0.4014, -0.4841, 0.36274, -0.2952, -0.42258, -0.62844, 0.6...|\n", - "| Newall|[-0.38857, -1.1449, -0.41737, -0.31969, -0.16546, -0.7044, 0.12875, -0.26047, 0.072844, -0.13314,...|\n", - "| say|[-0.091682, 0.58105, 0.40477, -0.41979, -0.85111, -0.28719, -0.41949, -0.10424, 0.45317, -0.09907...|\n", - "| they|[-0.07954, 0.30171, 0.079516, -0.74662, -0.67879, 0.35029, -0.19754, 0.4929, 0.14162, -0.23789, 0...|\n", - "| are|[-0.51533, 0.83186, 0.22457, -0.73865, 0.18718, 0.26021, -0.42564, 0.67121, -0.31084, -0.61275, 0...|\n", - "| '|[-0.34562, -0.24993, 0.58678, -0.89119, 
-1.0954, -0.45078, -0.074549, -0.44779, -0.38492, -0.4923...|\n", - "+------------+----------------------------------------------------------------------------------------------------+\n", - "only showing top 10 rows\n", - "\n" - ] - } - ], - "source": [ - "result_df = result.select(F.explode(F.arrays_zip(result.token.result, \n", - " result.embeddings.embeddings)).alias(\"cols\")) \\\n", - " .select(F.expr(\"cols['0']\").alias(\"token\"),\n", - " F.expr(\"cols['1']\").alias(\"embeddings\"))\n", - "\n", - "result_df.show(10, truncate=100)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "x16b0IqLwyLj" - }, - "source": [ - "#### Using your own Word embeddings in Spark NLP" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "pSb6TNYTbf55" - }, - "outputs": [], - "source": [ - "! wget -q https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.nl.300.vec.gz" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "Srhl4Dwrbklm" - }, - "outputs": [], - "source": [ - "!gunzip cc.nl.300.vec.gz" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "bGJogyZ-w7cB" - }, - "outputs": [], - "source": [ - "custom_embeddings = WordEmbeddings()\\\n", - " .setInputCols([\"document\", \"token\"])\\\n", - " .setOutputCol(\"my_embeddings\")\\\n", - " .setStoragePath('cc.nl.300.vec', \"TEXT\")\\\n", - " .setDimension(300)\n", - "\n", - "custom_embeddings_model = custom_embeddings.fit(result.limit(10))# any df would be fine as long as it had document and token columns thru Spark NLP" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 1184, - "status": "ok", - "timestamp": 1664909418626, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "ZgJ6xShAdbFo", - "outputId": 
"6cb0490b-72c1-43ef-a770-735d0098e3f5" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+----------------------------------------------------------------------------------------------------+\n", - "| embeddings|\n", - "+----------------------------------------------------------------------------------------------------+\n", - "|[[-0.0724, -0.0156, -0.031, -0.0285, 0.0037, 0.0091, -0.0524, -0.0528, -0.0572, -0.0975, 0.0156, ...|\n", - "|[[0.0123, -0.0048, -0.0016, 0.015, 0.0078, 0.0286, -0.0042, -0.0298, 0.0143, -0.1116, 0.0151, -0....|\n", - "|[[0.0807, 0.032, -0.0031, 0.0422, 0.14, 0.0321, -0.1067, -0.1257, -0.0627, -0.2708, -0.0688, 0.11...|\n", - "|[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...|\n", - "|[[-0.0471, 0.025, -0.007, 0.055, -0.019, 0.0615, -0.0303, -0.0282, -0.0087, -0.037, 0.0122, -0.02...|\n", - "|[[-0.2369, 0.1427, 0.0382, -0.0638, 0.0302, 0.0055, -0.0086, -0.0269, 0.0066, 0.1798, -0.0224, -0...|\n", - "|[[-0.2369, 0.1427, 0.0382, -0.0638, 0.0302, 0.0055, -0.0086, -0.0269, 0.0066, 0.1798, -0.0224, -0...|\n", - "|[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...|\n", - "|[[-0.2369, 0.1427, 0.0382, -0.0638, 0.0302, 0.0055, -0.0086, -0.0269, 0.0066, 0.1798, -0.0224, -0...|\n", - "|[[0.0312, 0.0831, -0.0697, -0.0373, 0.118, -0.0333, -0.0756, -0.0233, -0.0578, 0.4225, -0.0321, -...|\n", - "+----------------------------------------------------------------------------------------------------+\n", - "\n" - ] - } - ], - "source": [ - "custom_embeddings_model.transform(result.limit(10)).select('my_embeddings.embeddings').show(truncate=100)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "zCJzgV_Up7Yg" - }, - "source": [ - "### Elmo Embeddings" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "_WcSwAzJew8l" - }, - "source": [ - "Computes contextualized word representations using 
character-based word representations and bidirectional LSTMs.\n", - "\n", - "It can work with 4 different pooling layer options: `word_emb`, \n", - "`lstm_outputs1`, `lstm_outputs2`, or `elmo`" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 17958, - "status": "ok", - "timestamp": 1664909436576, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "ehvin7KVpzYk", - "outputId": "724b7158-36f9-4e1f-c039-ce5051f4acb9" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "elmo download started this may take some time.\n", - "Approximate size to download 334.1 MB\n", - "[OK!]\n" - ] - } - ], - "source": [ - "elmo_embeddings = ElmoEmbeddings.pretrained('elmo')\\\n", - " .setInputCols([\"document\", \"token\"])\\\n", - " .setOutputCol(\"embeddings\")\\\n", - " .setPoolingLayer('elmo')# default --> elmo " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 1806, - "status": "ok", - "timestamp": 1664909438371, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "Liew-mgrr5eD", - "outputId": "60ee5647-af42-4418-c17b-f24af5b307d1" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+------------+----------------------------------------------------------------------------------------------------+\n", - "| token| elmo_embeddings|\n", - "+------------+----------------------------------------------------------------------------------------------------+\n", - "| Unions|[-0.90887415, -0.30120426, 0.43796417, -0.57044435, 0.47783935, -0.119429074, 0.0076734107, 0.383...|\n", - "|representing|[0.13965544, 0.22222666, -0.039318062, 0.016958922, 
0.7448401, 0.3416085, -0.3798288, -0.24866202...|\n", - "| workers|[-0.46730578, -0.30479333, -0.026425809, -0.32429358, 0.75354207, -0.18502708, -0.12659112, -0.46...|\n", - "| at|[-0.25518382, -0.22426175, -0.3707175, -0.60111916, 0.06384361, -0.48253918, 0.43195224, 0.506473...|\n", - "| Turner|[-1.28543, -0.08228641, -0.191375, -0.42979646, -0.22902456, 0.11305181, 0.28118223, -0.6329747, ...|\n", - "| Newall|[-0.15922381, -0.4069708, -0.26923794, -0.4310938, 0.02952183, -0.34705558, -0.17874795, 0.088628...|\n", - "| say|[0.040825248, 0.79921997, 0.020270275, -0.5167818, 0.089849845, -0.54928946, -0.09102909, -0.0168...|\n", - "| they|[-1.1243961, -0.010669309, -0.097467996, -0.0038198307, 0.8827248, -0.8975107, 0.25767303, 0.5404...|\n", - "| are|[-0.7048798, -0.3428075, -0.07719923, -0.49473467, 0.014530063, -0.12821776, 0.2985173, 0.5282617...|\n", - "| '|[-0.54589736, -0.29538193, 0.023100121, -0.019431472, 0.25288752, 0.009414777, 0.17324895, -0.034...|\n", - "|disappointed|[-0.67589045, -0.20069829, 0.83521056, -0.50575566, 0.65615255, -0.6367348, -0.16388428, -0.90439...|\n", - "| '|[-0.25755265, 0.3030559, 0.5622954, 0.30021656, 0.54014075, -0.5633521, 0.24766476, -0.57990265, ...|\n", - "| after|[-0.202389, 0.24767357, -0.23489536, -0.32719886, 0.44261807, -0.9367193, 0.21963781, -0.12545675...|\n", - "| talks|[-0.1785077, 0.25418776, 0.08238828, -0.24434832, 0.17956342, -1.2120204, 0.07875113, 0.3737102, ...|\n", - "| with|[0.52579856, -0.15434241, -0.118289724, -0.13308676, -0.05741524, 0.014611676, 0.27711594, 0.1267...|\n", - "| stricken|[0.4297342, -0.14186615, 0.069291145, -0.23070803, 0.045775726, -0.10202141, 0.6809373, 0.7641753...|\n", - "| parent|[0.44972146, -0.1574139, 0.16139527, -0.067404754, 0.6104196, 0.18980037, -0.19194514, 1.1387352,...|\n", - "| firm|[0.23465863, -0.27139914, 0.22050485, -0.04467322, 0.44497567, -0.49274597, -0.042227328, 0.19309...|\n", - "| Federal|[-0.9843947, 1.3339465, 0.3907869, -0.50403845, 0.71886575, 
-0.28959268, 0.75705194, 0.17455763, ...|\n", - "| Mogul|[-0.4634573, -0.12920536, 0.111302294, -0.26346883, 0.9858167, -1.0185302, -0.060575206, 0.365284...|\n", - "+------------+----------------------------------------------------------------------------------------------------+\n", - "only showing top 20 rows\n", - "\n" - ] - } - ], - "source": [ - "nlpPipeline = Pipeline(stages=[documentAssembler, \n", - " tokenizer,\n", - " elmo_embeddings])\n", - "\n", - "result = nlpPipeline.fit(news_df).transform(news_df.limit(10))\n", - "\n", - "result_df = result.select(F.explode(F.arrays_zip(result.token.result, \n", - " result.embeddings.embeddings)).alias(\"cols\")) \\\n", - " .select(F.expr(\"cols['0']\").alias(\"token\"),\n", - " F.expr(\"cols['1']\").alias(\"elmo_embeddings\"))\n", - "\n", - "result_df.show(truncate=100)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "fyUNNvrgsW5u" - }, - "source": [ - "### Bert Embeddings\n", - "\n", - "BERT (Bidirectional Encoder Representations from Transformers) provides dense vector representations for natural language by using a deep, pre-trained neural network with the Transformer architecture\n", - "\n", - "It can work with 3 different pooling layer options: `0`, \n", - "`-1`, or `-2`" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 20530, - "status": "ok", - "timestamp": 1664909458894, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "s5TGgrQPsl4P", - "outputId": "cffca8e0-c615-4b96-927c-2ade78b848a0" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "bert_base_uncased download started this may take some time.\n", - "Approximate size to download 392.5 MB\n", - "[OK!]\n" - ] - } - ], - "source": [ - "bert_embeddings = BertEmbeddings.pretrained('bert_base_uncased')\\\n", - " 
.setInputCols([\"document\", \"token\"])\\\n", - " .setOutputCol(\"embeddings\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 4586, - "status": "ok", - "timestamp": 1664909463445, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "VRHsfwBrsvF8", - "outputId": "771d576e-aa7e-4a0e-92b3-b6c2889485bc" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+------------+----------------------------------------------------------------------------------------------------+\n", - "| token| bert_embeddings|\n", - "+------------+----------------------------------------------------------------------------------------------------+\n", - "| Unions|[0.6670659, 0.075691774, -0.43506515, 0.13065232, 0.44819474, -0.075209334, -0.2241662, 0.1837851...|\n", - "|representing|[0.42402333, -0.15955889, -0.17578948, -0.5392357, -0.40183172, 0.38137984, 0.12605838, -0.409287...|\n", - "| workers|[0.7290717, 0.06131853, -0.54681975, -0.47506055, 0.3344038, 0.03293537, 0.38784602, 0.8520438, 0...|\n", - "| at|[0.10098228, 0.40903705, -0.44306707, -0.17509109, 0.1682218, 0.11896059, 0.41752312, 0.28065842,...|\n", - "| Turner|[0.6406127, 0.21072698, 0.010704566, 0.061129358, 0.052616015, 0.48817542, 0.04534916, 0.36956063...|\n", - "| Newall|[0.23471491, -0.47658986, -0.08398946, 0.054417428, 0.59816754, -0.328746, -0.34127328, 0.3301417...|\n", - "| say|[0.30300033, 0.123910174, 0.26303437, -0.29814374, 0.6184639, 0.15924262, -0.41755497, 0.51777476...|\n", - "| they|[0.06975791, -0.09760859, 0.19559911, -0.09057991, 0.10705751, 0.29734504, 0.09018942, 0.3121962,...|\n", - "| are|[-0.4600867, -0.03295532, 0.23510689, -0.23376492, 0.7956509, -0.009301226, -0.17701793, 0.711720...|\n", - "| '|[-0.32480395, -0.25789693, 0.5846827, -0.1503588, 0.45304078, -0.100868076, 
0.43844336, 1.1270385...|\n", - "|disappointed|[-0.2920178, -0.5172758, 0.3739883, -0.46082836, 0.3249434, -0.14818825, -0.29012105, -0.01015694...|\n", - "| '|[-0.15516862, -0.13123243, 1.1257515, -0.6699537, 0.69658446, -0.08509679, 0.72284716, 0.7080308,...|\n", - "| after|[-0.2543314, -0.8899713, 0.5071763, 0.21592468, 0.2176922, 0.013359261, -0.47493804, 0.787083, 0....|\n", - "| talks|[-0.37898618, -0.2394179, 0.5133715, -0.025473528, 0.5798027, -0.29655194, 0.1830677, 0.07918449,...|\n", - "| with|[-0.40935937, 0.001023029, 0.20301563, 0.22106495, 0.062305883, -0.71892744, -0.17498948, 0.17216...|\n", - "| stricken|[-0.2687675, -0.51437914, 0.09687303, 0.032191057, 0.012207681, -0.044916622, -0.56810105, 0.2606...|\n", - "| parent|[0.6813847, -0.05907987, -0.45890468, 0.29420638, -0.38349813, 0.043616094, -0.50075155, 1.391248...|\n", - "| firm|[0.6276437, 0.32360873, 0.07091117, 0.48709965, 0.5830411, 0.57710993, -0.35511157, 0.3764335, -0...|\n", - "| Federal|[0.39469343, 0.43995667, -0.21780302, 0.34551498, 0.5347833, 0.65870786, -0.604482, 0.23736075, -...|\n", - "| Mogul|[0.49591896, 0.13451222, 0.1510944, 0.45716962, 0.8584716, 0.150838, -0.62374413, 0.3586182, -0.2...|\n", - "+------------+----------------------------------------------------------------------------------------------------+\n", - "only showing top 20 rows\n", - "\n" - ] - } - ], - "source": [ - "nlpPipeline = Pipeline(stages=[documentAssembler, \n", - " tokenizer,\n", - " bert_embeddings])\n", - "\n", - "result = nlpPipeline.fit(news_df).transform(news_df.limit(10))\n", - "\n", - "result_df = result.select(F.explode(F.arrays_zip(result.token.result, \n", - " result.embeddings.embeddings)).alias(\"cols\")) \\\n", - " .select(F.expr(\"cols['0']\").alias(\"token\"),\n", - " F.expr(\"cols['1']\").alias(\"bert_embeddings\"))\n", - "\n", - "result_df.show(truncate=100)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "LC8E1nrUv6bD" - }, - "source": [ - "### Chunk 
Embeddings\n", - "\n", - "This annotator utilizes `WordEmbeddings` or `BertEmbeddings` to generate chunk embeddings from either `TextMatcher`, `RegexMatcher`, `Chunker`, `NGramGenerator`, or `NerConverter` outputs.\n", - "\n", - "`setPoolingStrategy`: Choose how you would like to aggregate Word Embeddings to Sentence Embeddings: `AVERAGE` or `SUM`" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 31, - "status": "ok", - "timestamp": 1664909463446, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "FkZYc6jXhK7l", - "outputId": "2c46aa91-26d1-4070-dd74-2204f8a60b45" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "[Row(category='Business', text=\"Unions representing workers at Turner Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul.\"),\n", - " Row(category='Sci/Tech', text=' TORONTO, Canada A second team of rocketeers competing for the #36;10 million Ansari X Prize, a contest for privately funded suborbital space flight, has officially announced the first launch date for its manned rocket.'),\n", - " Row(category='Sci/Tech', text=' A company founded by a chemistry researcher at the University of Louisville won a grant to develop a method of producing better peptides, which are short chains of amino acids, the building blocks of proteins.')]" - ] - }, - "execution_count": 21, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "news_df.take(3)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "XMTBQh24hDx5" - }, - "outputs": [], - "source": [ - "entities = ['parent firm', 'economy', 'amino acids']\n", - "\n", - "with open ('entities.txt', 'w') as f:\n", - " for i in entities:\n", - " f.write(i+'\\n')\n", - "\n", - "entity_extractor = TextMatcher() \\\n", - " 
.setInputCols([\"document\",'token'])\\\n", - " .setOutputCol(\"entities\")\\\n", - " .setEntities(\"entities.txt\")\\\n", - " .setCaseSensitive(False)\\\n", - " .setEntityValue('entities')\n", - "\n", - "nlpPipeline = Pipeline(stages=[documentAssembler, \n", - " tokenizer,\n", - " entity_extractor])\n", - "\n", - "result = nlpPipeline.fit(news_df).transform(news_df.limit(10))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 383, - "status": "ok", - "timestamp": 1664909464487, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "C9h6W5ixhkxp", - "outputId": "cf116e73-8e2e-420e-883d-ffa20afc9001" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "[Row(result=['parent firm']), Row(result=[]), Row(result=['amino acids'])]" - ] - }, - "execution_count": 23, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "result.select('entities.result').take(3)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "Y_kbAnkMiLx_" - }, - "outputs": [], - "source": [ - "chunk_embeddings = ChunkEmbeddings() \\\n", - " .setInputCols([\"entities\", \"embeddings\"]) \\\n", - " .setOutputCol(\"chunk_embeddings\") \\\n", - " .setPoolingStrategy(\"AVERAGE\")\n", - "\n", - "nlpPipeline = Pipeline(stages=[documentAssembler, \n", - " tokenizer,\n", - " entity_extractor,\n", - " glove_embeddings,\n", - " chunk_embeddings])\n", - "\n", - "result = nlpPipeline.fit(news_df).transform(news_df.limit(10))\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 5153, - "status": "ok", - "timestamp": 1664909470368, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": 
"o4peezBVjEWQ", - "outputId": "7d0a8b78-43c9-4ec7-f244-4e78529a6167" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+-----------+----------------------------------------------------------------------------------------------------+\n", - "| entities| chunk_embeddings|\n", - "+-----------+----------------------------------------------------------------------------------------------------+\n", - "|parent firm|[0.45683652, -0.105479494, -0.34525, -0.143924, -0.192452, -0.33616, -0.22334, -0.208185, -0.3673...|\n", - "|amino acids|[-0.3861, 0.054408997, -0.287795, -0.33318, 0.375065, -0.185539, -0.330525, -0.214415, -0.73892, ...|\n", - "+-----------+----------------------------------------------------------------------------------------------------+\n", - "\n" - ] - } - ], - "source": [ - "result_df = result.select(F.explode(F.arrays_zip(result.entities.result,\n", - " result.chunk_embeddings.embeddings)).alias(\"cols\")) \\\n", - " .select(F.expr(\"cols['0']\").alias(\"entities\"),\n", - " F.expr(\"cols['1']\").alias(\"chunk_embeddings\"))\n", - "\n", - "result_df.show(truncate=100)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Ak26Rrf1tH_s" - }, - "source": [ - "### UniversalSentenceEncoder" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "FfHu1tUIj4uK" - }, - "source": [ - "The Universal Sentence Encoder encodes text into high dimensional vectors that can be used for text classification, semantic similarity, clustering and other natural language tasks." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 26072, - "status": "ok", - "timestamp": 1664909777210, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "88WEGqcSsWIY", - "outputId": "9e582ab4-93ca-4588-843f-ef5ff661a2ee" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "tfhub_use download started this may take some time.\n", - "Approximate size to download 923.7 MB\n", - "[OK!]\n" - ] - } - ], - "source": [ - "# no need for token columns \n", - "use_embeddings = UniversalSentenceEncoder.pretrained(\"tfhub_use\", \"en\") \\\n", - " .setInputCols(\"document\") \\\n", - " .setOutputCol(\"sentence_embeddings\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 2298, - "status": "ok", - "timestamp": 1664909779486, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "RJAYyXfztu-8", - "outputId": "297713c9-0c4e-4501-98e9-6e324cf5096f" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+\n", - "| document| USE_embeddings|\n", - "+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+\n", - "|Unions representing workers at Turner Newall say they are 'disappointed' after talks with stric...|[0.012997561, 0.01984477, -0.024626475, 0.039759077, -0.044246476, 0.013197604, 0.07867438, 
-0.05...|\n", - "| TORONTO, Canada A second team of rocketeers competing for the #36;10 million Ansari X Prize,...|[0.001999881, 0.051844038, -0.044029105, -5.932957E-4, -0.038505986, -0.027279468, 0.06940469, -0...|\n", - "| A company founded by a chemistry researcher at the University of Louisville won a grant to devel...|[0.03864186, 0.023220852, -0.004016253, 0.07199469, 0.027279727, -0.058951836, -0.0019538593, 0.0...|\n", - "| It's barely dawn when Mike Fitzpatrick starts his shift with a blur of colorful maps, figures an...|[0.072394915, 0.06559241, -0.015342068, -0.023814267, 0.01322931, -0.046775296, 0.009602186, 0.02...|\n", - "| Southern California's smog fighting agency went after emissions of the bovine variety Friday, ad...|[-0.034784008, -0.04371792, -8.897062E-4, 0.061689097, 0.03343994, 0.021937966, 0.05359926, -0.05...|\n", - "|\"The British Department for Education and Skills (DfES) recently launched a \"\"Music Manifesto\"\" c...|[-0.02244975, -0.0051162946, -0.011258106, 0.037045635, -0.0015617626, -0.046843972, 0.08016194, ...|\n", - "|\"confessed author of the Netsky and Sasser viruses, is responsible for 70 percent of virus infect...|[4.5218898E-4, -0.044989046, -0.027128322, 0.05711092, 0.04432688, -0.03930944, -0.0049682218, -0...|\n", - "|\\\\FOAF/LOAF and bloom filters have a lot of interesting properties for social\\network and whitel...|[0.045447674, -0.0673045, 0.047636565, -0.02094886, -0.025080299, 0.03516157, 0.0047589773, 0.030...|\n", - "| \"Wiltshire Police warns about \"\"phishing\"\" after its fraud squad chief was targeted.\"|[-0.00513464, -0.07799961, -0.03284713, -0.01024671, 0.037006836, -0.05350421, 0.009521332, -0.03...|\n", - "|In its first two years, the UK's dedicated card fraud unit, has recovered 36,000 stolen cards and...|[0.007892201, -0.0701188, -0.028799085, -0.018356498, 0.024937212, -0.07232398, 0.085445836, 0.02...|\n", - 
"+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+\n", - "\n" - ] - } - ], - "source": [ - "nlpPipeline = Pipeline(stages=[documentAssembler, \n", - " use_embeddings])\n", - "\n", - "result = nlpPipeline.fit(news_df).transform(news_df.limit(10))\n", - "\n", - "result_df = result.select(F.explode(F.arrays_zip(result.document.result, \n", - " result.sentence_embeddings.embeddings)).alias(\"cols\")) \\\n", - " .select(F.expr(\"cols['0']\").alias(\"document\"),\n", - " F.expr(\"cols['1']\").alias(\"USE_embeddings\"))\n", - "\n", - "result_df.show(truncate=100)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "W74tlfkl75P5" - }, - "source": [ - "### Longformer Embeddings" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "L6wv9S3gF0Sj" - }, - "outputs": [], - "source": [ - "documentAssembler = DocumentAssembler()\\\n", - " .setInputCol(\"text\")\\\n", - " .setOutputCol(\"document\")\n", - "\n", - "tokenizer = Tokenizer() \\\n", - " .setInputCols([\"document\"]) \\\n", - " .setOutputCol(\"token\")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 40865, - "status": "ok", - "timestamp": 1664909830454, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "8jzsd5paFps6", - "outputId": "4f2fd0cf-2f36-4d63-ad9e-29999004c02d" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "longformer_base_4096 download started this may take some time.\n", - "Approximate size to download 343.3 MB\n", - "[OK!]\n" - ] - } - ], - "source": [ - "# LongformerEmbeddings requires both document and token columns\n", - "longformer_embeddings = LongformerEmbeddings.pretrained() \\\n", - " 
.setInputCols('document',\"token\") \\\n", - " .setOutputCol(\"longformer_embeddings\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 29988, - "status": "ok", - "timestamp": 1664909860431, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "Jc4sj9dC8AMT", - "outputId": "15e60cb5-7cd3-4320-eeea-8c2249fafbad" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+------------+----------------------------------------------------------------------------------------------------+\n", - "| token| longformer_embeddings|\n", - "+------------+----------------------------------------------------------------------------------------------------+\n", - "| Unions|[0.07719437, -0.17138699, -6.540511E-4, -0.13498086, -0.023513347, -0.2028867, 0.0938687, -0.0760...|\n", - "|representing|[-0.037457973, -0.122049384, 0.11676876, 0.20218223, 0.05563064, 0.16728787, 0.110208884, -0.1453...|\n", - "| workers|[-0.13273284, -0.015104517, 0.052178774, -0.34765008, -0.26494068, 0.21955602, 0.0025631627, -0.0...|\n", - "| at|[-0.1871058, -0.15763302, -0.036046065, -0.14117411, 0.32089376, 0.0988634, -0.099187165, 0.09724...|\n", - "| Turner|[0.041943394, -0.007565988, 0.032059915, -0.019206144, -0.34937793, -0.187767, -0.030716063, -0.1...|\n", - "| Newall|[-0.016664965, 0.20831646, -0.07345706, -0.08770898, 0.09218825, -0.1095924, 0.39393264, 0.363961...|\n", - "| say|[0.10822279, 0.09800173, -0.050198782, -0.26110616, 0.055013414, -0.24756625, 0.023142464, 0.0982...|\n", - "| they|[-0.06725839, 0.18717986, -0.25182387, -0.52932215, -0.8810263, 0.20006596, 0.12919274, 0.0215994...|\n", - "| are|[0.20806815, 0.25371054, -0.03866426, -0.14666738, -0.3055287, -0.06543347, 0.019405391, -0.01027...|\n", - "| '|[0.27279565, -0.008778511, 0.029686064, 0.09686689, 
-0.047101133, 0.17664933, 0.02347222, -0.2584...|\n", - "|disappointed|[-0.061784416, 0.10046677, -0.01712867, 0.07331942, 0.2395818, -0.3137749, -0.07112072, -0.325874...|\n", - "| '|[0.30586535, -0.08653332, 0.06588482, 0.23384081, 0.018363494, -0.031568833, 0.03716231, -0.15546...|\n", - "| after|[0.12608868, 0.021265892, -8.5087307E-4, 0.20219721, -0.11005052, -0.16537799, -0.093448654, 0.04...|\n", - "| talks|[-0.065675214, 0.2547877, -0.023002231, 0.07334101, -1.0161656, 0.10010045, -0.076567814, -0.1042...|\n", - "| with|[-0.058689136, 0.08328565, -0.062080774, 0.24219596, 0.48440862, 0.31241417, 0.04525351, 0.111382...|\n", - "| stricken|[0.022671841, -0.0010705628, 0.046994254, 0.019946698, -0.3519812, 0.101592556, -0.14684096, -0.1...|\n", - "| parent|[0.099573314, -0.030517425, 0.027951747, 0.10648133, 0.3734014, -0.33672994, 0.206242, -0.0200887...|\n", - "| firm|[0.25588194, -0.10536073, -0.00896785, 0.24162258, -0.6211535, -0.3675931, -0.10997876, 0.0204709...|\n", - "| Federal|[-0.019053962, 0.009452908, -0.012464292, -0.069504485, 0.14036526, 0.022730783, -0.011906238, 0....|\n", - "| Mogul|[-0.076652616, 0.04900066, -0.051001675, -0.14854756, 0.08934068, -0.16336307, -0.03759157, 0.139...|\n", - "+------------+----------------------------------------------------------------------------------------------------+\n", - "only showing top 20 rows\n", - "\n" - ] - } - ], - "source": [ - "nlpPipeline = Pipeline(stages=[documentAssembler, \n", - " tokenizer,\n", - " longformer_embeddings])\n", - "\n", - "result = nlpPipeline.fit(news_df).transform(news_df.limit(10))\n", - "\n", - "result_df = result.select(F.explode(F.arrays_zip(result.token.result, \n", - " result.longformer_embeddings.embeddings)).alias(\"cols\")) \\\n", - " .select(F.expr(\"cols['0']\").alias(\"token\"),\n", - " F.expr(\"cols['1']\").alias(\"longformer_embeddings\"))\n", - "\n", - "result_df.show(truncate=100)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": 
"wtoIJ_rc8iCl" - }, - "source": [ - "### Multi-Lingual Embeddings \n", - "\n", - "These embeddings map text from multiple languages to nearby points in a shared vector space, which standard monolingual embeddings fail to do. \n", - "Classifier models trained with these embeddings can generalize across all supported languages, even if the classifier was trained on English data only. \n", - "\n", - "**The multilingual embedding models available in Spark NLP are listed below:**" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "At39DrH2NHA7" - }, - "source": [ - "\n", - "\n", - "| Name | Spark NLP Model Name | Language |\n", - "|:-----------------------------------------------------------------------------------------------------------------------|:-----------------------------------|:-----------|\n", - "| GloVe Embeddings 6B 300 (Multilingual) | glove_6B_300 | xx |\n", - "| GloVe Embeddings 840B 300 (Multilingual) | glove_840B_300 | xx |\n", - "| Multilingual BERT Embeddings (Base Cased) | bert_multi_cased | xx |\n", - "| Multilingual BERT Sentence Embeddings (Base Cased) | sent_bert_multi_cased | xx |\n", - "| Universal Sentence Encoder Multilingual Large | tfhub_use_multi_lg | xx |\n", - "| Universal Sentence Encoder Multilingual | tfhub_use_multi | xx |\n", - "| Universal Sentence Encoder XLING English and German | tfhub_use_xling_en_de | xx |\n", - "| Universal Sentence Encoder XLING English and Spanish | tfhub_use_xling_en_es | xx |\n", - "| Universal Sentence Encoder XLING English and French | tfhub_use_xling_en_fr | xx |\n", - "| Universal Sentence Encoder XLING Many | tfhub_use_xling_many | xx |\n", - "| Universal Sentence Encoder Multilingual Large (tfhub_use_multi_lg) | tfhub_use_multi_lg | xx |\n", - "| Universal Sentence Encoder Multilingual (tfhub_use_multi) | tfhub_use_multi | xx |\n", - "| BERT multilingual base model (cased) | bert_base_multilingual_cased | xx |\n", - "| BERT multilingual base model (uncased) | 
bert_base_multilingual_uncased | xx |\n", - "| DistilBERT base multilingual model (cased) | distilbert_base_multilingual_cased | xx |\n", - "| Twitter XLM-RoBERTa Base (twitter_xlm_roberta_base) | twitter_xlm_roberta_base | xx |\n", - "| XLM-RoBERTa Base (xlm_roberta_base) | xlm_roberta_base | xx |\n", - "| XLM-RoBERTa XTREME Base (xlm_roberta_xtreme_base) | xlm_roberta_xtreme_base | xx |\n", - "| Universal sentence encoder for 100+ languages trained with CMLM (sent_bert_use_cmlm_multi_base_br) | sent_bert_use_cmlm_multi_base_br | xx |\n", - "| Universal sentence encoder for 100+ languages trained with CMLM (sent_bert_use_cmlm_multi_base) | sent_bert_use_cmlm_multi_base | xx |\n", - "| Multilingual Representations for Indian Languages (MuRIL) | bert_muril | xx |\n", - "| Multilingual Representations for Indian Languages (MuRIL) - BERT Sentence Embedding pre-trained on 17 Indian languages | sent_bert_muril | xx |\n", - "| XLM-RoBERTa Base Sentence Embeddings (sent_xlm_roberta_base) | sent_xlm_roberta_base | xx |\n", - "| XLM-RoBERTa Large (xlm_roberta_large) | xlm_roberta_large | xx |\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Jp6ByDKwQzHv" - }, - "source": [ - "**You can find all these models and more [HERE](https://nlp.johnsnowlabs.com/models?language=xx&task=Embeddings&edition=Spark+NLP)**" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "-M7vAkB2F-7x" - }, - "source": [ - "#### Multi-Lingual BERT Embeddings" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 18538, - "status": "ok", - "timestamp": 1664910048572, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "9_ef5r1-F-7y", - "outputId": "d4346895-78b6-4324-c0c1-0e7a2a691e18" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - 
"bert_multi_cased download started this may take some time.\n", - "Approximate size to download 638.6 MB\n", - "[OK!]\n" - ] - } - ], - "source": [ - "# BertEmbeddings requires both document and token columns\n", - "embeddings = BertEmbeddings.pretrained('bert_multi_cased','xx') \\\n", - " .setInputCols('document',\"token\") \\\n", - " .setOutputCol(\"embeddings\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "-QOhEV0aF-7y" - }, - "outputs": [], - "source": [ - "nlpPipeline = Pipeline(stages=[documentAssembler, \n", - " tokenizer,\n", - " embeddings])\n", - "\n", - "result = nlpPipeline.fit(news_df).transform(news_df.limit(10))\n", - "\n", - "result_df = result.select(F.explode(F.arrays_zip(result.token.result, \n", - " result.embeddings.embeddings)).alias(\"cols\")) \\\n", - " .select(F.expr(\"cols['0']\").alias(\"token\"),\n", - " F.expr(\"cols['1']\").alias(\"BERT embeddings\"))\n", - "\n", - "result_df.show(truncate=100)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Nn6vOaJ7y3ZZ" - }, - "source": [ - "## Loading Models from Local Disk" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 275, - "status": "ok", - "timestamp": 1664910173570, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "z_HgZG10zER5", - "outputId": "7498f1a0-8a15-4dc2-cbab-118cdf0c4472" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "/root/cache_pretrained\n" - ] - } - ], - "source": [ - "!cd ~/cache_pretrained && pwd" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 256, - "status": "ok", - "timestamp": 1664910176253, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": 
-120 - }, - "id": "YhdSFd6ybatv", - "outputId": "b46d0fde-1dc6-49ff-f5c2-d064b0862aed" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "total 72\n", - "drwxr-xr-x 4 root root 4096 Oct 4 18:50 bert_base_uncased_en_2.6.0_2.4_1598340514223\n", - "drwxr-xr-x 4 root root 4096 Oct 4 18:58 bert_multi_cased_xx_2.6.0_2.4_1598341875191\n", - "drwxr-xr-x 4 root root 4096 Oct 4 18:39 dependency_conllu_en_3.4.4_3.0_1656845289670\n", - "drwxr-xr-x 4 root root 4096 Oct 4 18:39 dependency_typed_conllu_en_3.4.4_3.0_1656850770275\n", - "drwxr-xr-x 4 root root 4096 Oct 4 18:42 detect_language_220_xx_2.7.0_2.4_1607185721383\n", - "drwxr-xr-x 3 root root 4096 Oct 4 18:50 elmo_en_2.4.0_2.4_1580488815299\n", - "drwxr-xr-x 4 root root 4096 Oct 4 18:45 glove_100d_en_2.4.0_2.4_1579690104032\n", - "drwxr-xr-x 4 root root 4096 Oct 4 18:42 ld_wiki_tatoeba_cnn_375_xx_2.7.0_2.4_1607184873730\n", - "drwxr-xr-x 4 root root 4096 Oct 4 18:26 lemma_antbnc_en_2.0.2_2.4_1556480454569\n", - "drwxr-xr-x 4 root root 4096 Oct 4 18:52 longformer_base_4096_en_3.2.0_2.4_1628093002279\n", - "drwxr-xr-x 4 root root 4096 Oct 4 18:42 opus_mt_de_en_xx_3.1.0_2.4_1622555271468\n", - "drwxr-xr-x 4 root root 4096 Oct 4 18:34 pos_anc_en_3.0.0_3.0_1614962126490\n", - "drwxr-xr-x 3 root root 4096 Oct 4 18:42 sentence_detector_dl_xx_2.7.0_2.4_1609610616998\n", - "drwxr-xr-x 4 root root 4096 Oct 4 18:41 spellcheck_dl_en_3.4.1_3.0_1648457196011\n", - "drwxr-xr-x 4 root root 4096 Oct 4 18:40 spellcheck_norvig_en_3.1.3_3.0_1631046343759\n", - "drwxr-xr-x 3 root root 4096 Oct 4 18:40 stopwords_en_en_2.5.4_2.4_1594742439135\n", - "drwxr-xr-x 3 root root 4096 Oct 4 18:40 stopwords_es_es_2.5.4_2.4_1594742441303\n", - "drwxr-xr-x 3 root root 4096 Oct 4 18:51 tfhub_use_en_2.4.0_2.4_1587136330099\n" - ] - } - ], - "source": [ - "!cd ~/cache_pretrained && ls -l" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "B64mWzh9y7Jt" - }, - "outputs": [], - "source": [ 
- "glove_embeddings = WordEmbeddingsModel.load('/root/cache_pretrained/glove_100d_en_2.4.0_2.4_1579690104032').\\\n", - " setInputCols([\"document\", 'token']).\\\n", - " setOutputCol(\"glove_embeddings\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "yo839CrNuzq7" - }, - "source": [ - "## Getting Sentence Embeddings from word embeddings" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 5476, - "status": "ok", - "timestamp": 1664910192748, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "c8bhA7Rhuygc", - "outputId": "3775be95-61f0-4ad7-b592-8bb1f550fb84" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "glove_100d download started this may take some time.\n", - "Approximate size to download 145.3 MB\n", - "[OK!]\n", - "+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+\n", - "| document| sentence_embeddings|\n", - "+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+\n", - "|Unions representing workers at Turner Newall say they are 'disappointed' after talks with stric...|[0.06131713, 0.08552574, 0.1340617, -0.22403374, -0.23798925, -0.159222, -0.21079227, 0.07760903,...|\n", - "| TORONTO, Canada A second team of rocketeers competing for the #36;10 million Ansari X Prize,...|[-0.05313636, 0.100133374, 0.19096054, -0.01998518, 0.32574704, -0.06335781, 0.1789568, 0.2931641...|\n", - "| A company founded by a chemistry researcher at the University of Louisville won a grant to devel...|[0.021042928, 0.074208066, 0.0723595, 
-0.09070449, 0.2993206, 0.04752573, -0.060713578, 0.155145,...|\n", - "| It's barely dawn when Mike Fitzpatrick starts his shift with a blur of colorful maps, figures an...|[-0.22619276, 0.18458064, 0.3589564, -0.3349514, 0.08098127, 0.09799263, -0.049265776, 0.25009856...|\n", - "| Southern California's smog fighting agency went after emissions of the bovine variety Friday, ad...|[-0.06140215, -0.040404316, 0.44608322, -0.18327837, 0.019998113, -0.09348459, -0.17261569, 0.177...|\n", - "|\"The British Department for Education and Skills (DfES) recently launched a \"\"Music Manifesto\"\" c...|[-0.01894932, 0.067353465, 0.011856217, -0.23963752, 0.30282092, 0.01841911, 0.029716793, 0.16109...|\n", - "|\"confessed author of the Netsky and Sasser viruses, is responsible for 70 percent of virus infect...|[0.013585807, 0.0013871301, 0.19375019, -0.25756782, 0.1557032, 0.101828136, 0.12016087, 0.031324...|\n", - "|\\\\FOAF/LOAF and bloom filters have a lot of interesting properties for social\\network and whitel...|[-0.13172166, 0.16039874, 0.26770115, -0.21353802, 0.094076656, 0.123355016, -0.1721528, 0.098089...|\n", - "| \"Wiltshire Police warns about \"\"phishing\"\" after its fraud squad chief was targeted.\"|[-0.014898933, -0.1829025, 0.17554195, -0.355318, -0.013315629, 0.18482018, -0.0075285546, 0.0961...|\n", - "|In its first two years, the UK's dedicated card fraud unit, has recovered 36,000 stolen cards and...|[0.0010903521, 0.17995015, 0.18142067, -0.24167466, 0.19849405, 0.19571333, 0.11464099, 0.1724940...|\n", - "+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+\n", - "\n" - ] - } - ], - "source": [ - "glove_embeddings = WordEmbeddingsModel.pretrained('glove_100d')\\\n", - " .setInputCols([\"document\", \"token\"])\\\n", - " .setOutputCol(\"embeddings\")\n", - "\n", - "embeddingsSentence = 
SentenceEmbeddings() \\\n", - " .setInputCols([\"document\", \"embeddings\"]) \\\n", - " .setOutputCol(\"sentence_embeddings\") \\\n", - " .setPoolingStrategy(\"AVERAGE\") # or SUM\n", - "\n", - "nlpPipeline = Pipeline(stages=[documentAssembler, \n", - " tokenizer,\n", - " glove_embeddings,\n", - " embeddingsSentence])\n", - "\n", - "result = nlpPipeline.fit(news_df).transform(news_df.limit(10))\n", - "\n", - "result_df = result.select(F.explode(F.arrays_zip(result.document.result, \n", - " result.sentence_embeddings.embeddings)).alias(\"cols\")) \\\n", - " .select(F.expr(\"cols['0']\").alias(\"document\"),\n", - " F.expr(\"cols['1']\").alias(\"sentence_embeddings\"))\n", - "\n", - "result_df.show(truncate=100)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "C6-wTRfAx3rL" - }, - "source": [ - "### Cosine similarity between two embeddings (sentence similarity)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 1489, - "status": "ok", - "timestamp": 1664910203619, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "pliYSeTgwpq9", - "outputId": "2927a1a6-b856-4b7f-ef3d-b39e905e270c" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "0.876465618347585" - ] - }, - "execution_count": 14, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from scipy.spatial import distance\n", - "\n", - "import numpy as np\n", - "\n", - "v1 = result_df.select('sentence_embeddings').take(2)[0][0]\n", - "\n", - "v2 = result_df.select('sentence_embeddings').take(2)[1][0]\n", - "\n", - "1 - distance.cosine(np.array(v1), np.array(v2))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 776, - "status": "ok", - "timestamp": 1664910232785, - 
"user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "LZNFCaboxhUO", - "outputId": "899e07a1-2824-41ee-ec29-9bb7de96cb27" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "1" - ] - }, - "execution_count": 15, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "v2 = result_df.select('sentence_embeddings').take(2)[0][0]\n", - "\n", - "1 - distance.cosine(np.array(v1), np.array(v2))" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "qwWi2bMTegFV" - }, - "source": [ - "## QuestionAnswering Models" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "rwjTKJ8oghjG" - }, - "source": [ - "| Language Name(s) | NLU Reference | Spark NLP Reference |\n", - "|:-------------------------------------------------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n", - "|[Bengali](https://iso639-3.sil.org/code/ben) |[bn.answer_question.mbert_bengali_tydiqa_qa.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_mbert_bengali_tydiqa_qa_bn_3_0.html) |[bert_qa_mbert_bengali_tydiqa_qa](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_mbert_bengali_tydiqa_qa_bn_3_0.html) |\n", - "|[Castilian, Spanish](https://iso639-3.sil.org/code/spa) |[es.answer_question.distill_bert_base_spanish_wwm_cased_finetuned_spa_squad2_es_mrm8488.bert](https://nlp.johnsnowlabs.com/2022/06/03/bert_qa_distill_bert_base_spanish_wwm_cased_finetuned_spa_squad2_es_mrm8488_es_3_0.html) 
|[bert_qa_distill_bert_base_spanish_wwm_cased_finetuned_spa_squad2_es_mrm8488](https://nlp.johnsnowlabs.com/2022/06/03/bert_qa_distill_bert_base_spanish_wwm_cased_finetuned_spa_squad2_es_mrm8488_es_3_0.html) |\n", - "|[Castilian, Spanish](https://iso639-3.sil.org/code/spa) |[es.answer_question.beto_base_spanish_sqac.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_beto_base_spanish_sqac_es_3_0.html) |[bert_qa_beto_base_spanish_sqac](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_beto_base_spanish_sqac_es_3_0.html) |\n", - "|[Castilian, Spanish](https://iso639-3.sil.org/code/spa) |[es.answer_question.bert_base_spanish_wwm_cased_finetuned_squad2_es_finetuned_sqac.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_spanish_wwm_cased_finetuned_squad2_es_finetuned_sqac_es_3_0.html) |[bert_qa_bert_base_spanish_wwm_cased_finetuned_squad2_es_finetuned_sqac](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_spanish_wwm_cased_finetuned_squad2_es_finetuned_sqac_es_3_0.html) |\n", - "|[Castilian, Spanish](https://iso639-3.sil.org/code/spa) |[es.answer_question.bert_base_spanish_wwm_cased_finetuned_squad2_es_MMG.bert](https://nlp.johnsnowlabs.com/2022/06/03/bert_qa_bert_base_spanish_wwm_cased_finetuned_squad2_es_MMG_es_3_0.html) |[bert_qa_bert_base_spanish_wwm_cased_finetuned_squad2_es_MMG](https://nlp.johnsnowlabs.com/2022/06/03/bert_qa_bert_base_spanish_wwm_cased_finetuned_squad2_es_MMG_es_3_0.html) |\n", - "|[Castilian, Spanish](https://iso639-3.sil.org/code/spa) |[es.answer_question.bert_base_spanish_wwm_cased_finetuned_sqac_finetuned_squad2_es_MMG.bert](https://nlp.johnsnowlabs.com/2022/06/03/bert_qa_bert_base_spanish_wwm_cased_finetuned_sqac_finetuned_squad2_es_MMG_es_3_0.html) |[bert_qa_bert_base_spanish_wwm_cased_finetuned_sqac_finetuned_squad2_es_MMG](https://nlp.johnsnowlabs.com/2022/06/03/bert_qa_bert_base_spanish_wwm_cased_finetuned_sqac_finetuned_squad2_es_MMG_es_3_0.html) |\n", - "|[Castilian, Spanish](https://iso639-3.sil.org/code/spa) 
|[es.answer_question.bert_base_spanish_wwm_cased_finetuned_sqac_finetuned_squad.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_spanish_wwm_cased_finetuned_sqac_finetuned_squad_es_3_0.html) |[bert_qa_bert_base_spanish_wwm_cased_finetuned_sqac_finetuned_squad](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_spanish_wwm_cased_finetuned_sqac_finetuned_squad_es_3_0.html) |\n", - "|[Castilian, Spanish](https://iso639-3.sil.org/code/spa) |[es.answer_question.bert_base_spanish_wwm_cased_finetuned_sqac.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_spanish_wwm_cased_finetuned_sqac_es_3_0.html) |[bert_qa_bert_base_spanish_wwm_cased_finetuned_sqac](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_spanish_wwm_cased_finetuned_sqac_es_3_0.html) |\n", - "|[Castilian, Spanish](https://iso639-3.sil.org/code/spa) |[es.answer_question.bert_base_spanish_wwm_cased_finetuned_spa_squad2_es_mrm8488.bert](https://nlp.johnsnowlabs.com/2022/06/03/bert_qa_bert_base_spanish_wwm_cased_finetuned_spa_squad2_es_mrm8488_es_3_0.html) |[bert_qa_bert_base_spanish_wwm_cased_finetuned_spa_squad2_es_mrm8488](https://nlp.johnsnowlabs.com/2022/06/03/bert_qa_bert_base_spanish_wwm_cased_finetuned_spa_squad2_es_mrm8488_es_3_0.html) |\n", - "|[Castilian, Spanish](https://iso639-3.sil.org/code/spa) |[es.answer_question.bert_base_spanish_wwm_cased_finetuned_spa_squad2_es_finetuned_sqac.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_spanish_wwm_cased_finetuned_spa_squad2_es_finetuned_sqac_es_3_0.html) |[bert_qa_bert_base_spanish_wwm_cased_finetuned_spa_squad2_es_finetuned_sqac](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_spanish_wwm_cased_finetuned_spa_squad2_es_finetuned_sqac_es_3_0.html) |\n", - "|[Danish](https://iso639-3.sil.org/code/dan) |[da.answer_question.danish_bert_botxo_qa_squad.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_danish_bert_botxo_qa_squad_da_3_0.html) 
|[bert_qa_danish_bert_botxo_qa_squad](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_danish_bert_botxo_qa_squad_da_3_0.html) |\n", - "|[Dutch, Flemish](https://iso639-3.sil.org/code/nld) |[nl.answer_question.bert_base_multilingual_cased_finetuned_dutch_squad2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_multilingual_cased_finetuned_dutch_squad2_nl_3_0.html) |[bert_qa_bert_base_multilingual_cased_finetuned_dutch_squad2](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_multilingual_cased_finetuned_dutch_squad2_nl_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.zero_shot.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_zero_shot_en_3_0.html) |[bert_qa_zero_shot](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_zero_shot_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.youngjae_bert_finetuned_squad_accelerate.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_youngjae_bert_finetuned_squad_accelerate_en_3_0.html) |[bert_qa_youngjae_bert_finetuned_squad_accelerate](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_youngjae_bert_finetuned_squad_accelerate_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.youngjae_bert_finetuned_squad.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_youngjae_bert_finetuned_squad_en_3_0.html) |[bert_qa_youngjae_bert_finetuned_squad](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_youngjae_bert_finetuned_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.ydshieh_bert_base_cased_squad2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_ydshieh_bert_base_cased_squad2_en_3_0.html) |[bert_qa_ydshieh_bert_base_cased_squad2](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_ydshieh_bert_base_cased_squad2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) 
|[en.answer_question.xxlargev1_squad2_512.albert](https://nlp.johnsnowlabs.com/2022/06/24/albert_qa_xxlargev1_squad2_512_en_3_0.html) |[albert_qa_xxlargev1_squad2_512](https://nlp.johnsnowlabs.com/2022/06/24/albert_qa_xxlargev1_squad2_512_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.xxlarge_v2_squad2_covid_deepset.albert](https://nlp.johnsnowlabs.com/2022/06/24/albert_qa_xxlarge_v2_squad2_covid_deepset_en_3_0.html) |[albert_qa_xxlarge_v2_squad2_covid_deepset](https://nlp.johnsnowlabs.com/2022/06/24/albert_qa_xxlarge_v2_squad2_covid_deepset_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.xxlarge_v2_squad2.albert](https://nlp.johnsnowlabs.com/2022/06/24/albert_qa_xxlarge_v2_squad2_en_3_0.html) |[albert_qa_xxlarge_v2_squad2](https://nlp.johnsnowlabs.com/2022/06/24/albert_qa_xxlarge_v2_squad2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.xxlarge_v1_finetuned_squad2.albert](https://nlp.johnsnowlabs.com/2022/06/24/albert_qa_xxlarge_v1_finetuned_squad2_en_3_0.html) |[albert_qa_xxlarge_v1_finetuned_squad2](https://nlp.johnsnowlabs.com/2022/06/24/albert_qa_xxlarge_v1_finetuned_squad2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.xxlarge_tweetqa.albert](https://nlp.johnsnowlabs.com/2022/06/24/albert_qa_xxlarge_tweetqa_en_3_0.html) |[albert_qa_xxlarge_tweetqa](https://nlp.johnsnowlabs.com/2022/06/24/albert_qa_xxlarge_tweetqa_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.xxlarge_finetuned_squad.albert](https://nlp.johnsnowlabs.com/2022/06/24/albert_qa_xxlarge_finetuned_squad_en_3_0.html) |[albert_qa_xxlarge_finetuned_squad](https://nlp.johnsnowlabs.com/2022/06/24/albert_qa_xxlarge_finetuned_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) 
|[en.answer_question.xtremedistil_l6_h256_uncased_finetuned_lr_2e_05_epochs_6.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_xtremedistil_l6_h256_uncased_finetuned_lr_2e_05_epochs_6_en_3_0.html) |[bert_qa_xtremedistil_l6_h256_uncased_finetuned_lr_2e_05_epochs_6](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_xtremedistil_l6_h256_uncased_finetuned_lr_2e_05_epochs_6_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.xtremedistil_l6_h256_uncased_finetuned_lr_2e_05_epochs_3.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_xtremedistil_l6_h256_uncased_finetuned_lr_2e_05_epochs_3_en_3_0.html) |[bert_qa_xtremedistil_l6_h256_uncased_finetuned_lr_2e_05_epochs_3](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_xtremedistil_l6_h256_uncased_finetuned_lr_2e_05_epochs_3_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.xtremedistil_l6_h256_uncased_TQUAD_finetuned_lr_2e_05_epochs_9.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_xtremedistil_l6_h256_uncased_TQUAD_finetuned_lr_2e_05_epochs_9_en_3_0.html) |[bert_qa_xtremedistil_l6_h256_uncased_TQUAD_finetuned_lr_2e_05_epochs_9](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_xtremedistil_l6_h256_uncased_TQUAD_finetuned_lr_2e_05_epochs_9_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.xlarge_v2_squad_v2.albert](https://nlp.johnsnowlabs.com/2022/06/24/albert_qa_xlarge_v2_squad_v2_en_3_0.html) |[albert_qa_xlarge_v2_squad_v2](https://nlp.johnsnowlabs.com/2022/06/24/albert_qa_xlarge_v2_squad_v2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.xlarge_finetuned_squad.albert](https://nlp.johnsnowlabs.com/2022/06/24/albert_qa_xlarge_finetuned_squad_en_3_0.html) |[albert_qa_xlarge_finetuned_squad](https://nlp.johnsnowlabs.com/2022/06/24/albert_qa_xlarge_finetuned_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) 
|[en.answer_question.xlarge_finetuned.albert](https://nlp.johnsnowlabs.com/2022/06/24/albert_qa_xlarge_finetuned_en_3_0.html) |[albert_qa_xlarge_finetuned](https://nlp.johnsnowlabs.com/2022/06/24/albert_qa_xlarge_finetuned_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.xdistil_l12_h384_squad2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_xdistil_l12_h384_squad2_en_3_0.html) |[bert_qa_xdistil_l12_h384_squad2](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_xdistil_l12_h384_squad2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.vumichien_base_v2_squad2.albert](https://nlp.johnsnowlabs.com/2022/06/24/albert_qa_vumichien_base_v2_squad2_en_3_0.html) |[albert_qa_vumichien_base_v2_squad2](https://nlp.johnsnowlabs.com/2022/06/24/albert_qa_vumichien_base_v2_squad2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.vuiseng9_bert_base_uncased_squad.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_vuiseng9_bert_base_uncased_squad_en_3_0.html) |[bert_qa_vuiseng9_bert_base_uncased_squad](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_vuiseng9_bert_base_uncased_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.victoraavila_bert_base_uncased_finetuned_squad.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_victoraavila_bert_base_uncased_finetuned_squad_en_3_0.html) |[bert_qa_victoraavila_bert_base_uncased_finetuned_squad](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_victoraavila_bert_base_uncased_finetuned_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.unqover_bert_base_uncased_squad.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_unqover_bert_base_uncased_squad_en_3_0.html) |[bert_qa_unqover_bert_base_uncased_squad](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_unqover_bert_base_uncased_squad_en_3_0.html) |\n", - 
"|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.unqover_bert_base_uncased_newsqa.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_unqover_bert_base_uncased_newsqa_en_3_0.html) |[bert_qa_unqover_bert_base_uncased_newsqa](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_unqover_bert_base_uncased_newsqa_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.twmkn9_bert_base_uncased_squad2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_twmkn9_bert_base_uncased_squad2_en_3_0.html) |[bert_qa_twmkn9_bert_base_uncased_squad2](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_twmkn9_bert_base_uncased_squad2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.triviaqa_bert_el_Danastos.bert](https://nlp.johnsnowlabs.com/2022/06/03/bert_qa_triviaqa_bert_el_Danastos_en_3_0.html) |[bert_qa_triviaqa_bert_el_Danastos](https://nlp.johnsnowlabs.com/2022/06/03/bert_qa_triviaqa_bert_el_Danastos_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.tinybert_6l_768d_squad2_large_teacher_finetuned_step1.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_tinybert_6l_768d_squad2_large_teacher_finetuned_step1_en_3_0.html) |[bert_qa_tinybert_6l_768d_squad2_large_teacher_finetuned_step1](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_tinybert_6l_768d_squad2_large_teacher_finetuned_step1_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.tinybert_6l_768d_squad2_large_teacher_finetuned.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_tinybert_6l_768d_squad2_large_teacher_finetuned_en_3_0.html) |[bert_qa_tinybert_6l_768d_squad2_large_teacher_finetuned](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_tinybert_6l_768d_squad2_large_teacher_finetuned_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) 
|[en.answer_question.tinybert_6l_768d_squad2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_tinybert_6l_768d_squad2_en_3_0.html) |[bert_qa_tinybert_6l_768d_squad2](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_tinybert_6l_768d_squad2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.tf_bert_finetuned_squad.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_tf_bert_finetuned_squad_en_3_0.html) |[bert_qa_tf_bert_finetuned_squad](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_tf_bert_finetuned_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.tf_bert_base_cased_squad2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_tf_bert_base_cased_squad2_en_3_0.html) |[bert_qa_tf_bert_base_cased_squad2](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_tf_bert_base_cased_squad2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.tests_finetuned_squad_test_bert_2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_tests_finetuned_squad_test_bert_2_en_3_0.html) |[bert_qa_tests_finetuned_squad_test_bert_2](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_tests_finetuned_squad_test_bert_2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.tests_finetuned_squad_test_bert.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_tests_finetuned_squad_test_bert_en_3_0.html) |[bert_qa_tests_finetuned_squad_test_bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_tests_finetuned_squad_test_bert_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.srmukundb_bert_base_uncased_finetuned_squad.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_srmukundb_bert_base_uncased_finetuned_squad_en_3_0.html) |[bert_qa_srmukundb_bert_base_uncased_finetuned_squad](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_srmukundb_bert_base_uncased_finetuned_squad_en_3_0.html) |\n", - 
"|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.squad_slp.albert](https://nlp.johnsnowlabs.com/2022/06/24/albert_qa_squad_slp_en_3_0.html) |[albert_qa_squad_slp](https://nlp.johnsnowlabs.com/2022/06/24/albert_qa_squad_slp_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.squad_ms_bert_base.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_squad_ms_bert_base_en_3_0.html) |[bert_qa_squad_ms_bert_base](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_squad_ms_bert_base_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.squad_mbert_model_2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_squad_mbert_model_2_en_3_0.html) |[bert_qa_squad_mbert_model_2](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_squad_mbert_model_2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.squad_mbert_model.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_squad_mbert_model_en_3_0.html) |[bert_qa_squad_mbert_model](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_squad_mbert_model_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.squad_mbert_en_de_es_vi_zh_model.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_squad_mbert_en_de_es_vi_zh_model_en_3_0.html) |[bert_qa_squad_mbert_en_de_es_vi_zh_model](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_squad_mbert_en_de_es_vi_zh_model_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.squad_mbert_en_de_es_model.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_squad_mbert_en_de_es_model_en_3_0.html) |[bert_qa_squad_mbert_en_de_es_model](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_squad_mbert_en_de_es_model_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.squad_en_bert_base.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_squad_en_bert_base_en_3_0.html) 
|[bert_qa_squad_en_bert_base](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_squad_en_bert_base_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.squad_bert_el_Danastos.bert](https://nlp.johnsnowlabs.com/2022/06/03/bert_qa_squad_bert_el_Danastos_en_3_0.html) |[bert_qa_squad_bert_el_Danastos](https://nlp.johnsnowlabs.com/2022/06/03/bert_qa_squad_bert_el_Danastos_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.squad_baseline.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_squad_baseline_en_3_0.html) |[bert_qa_squad_baseline](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_squad_baseline_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.squad_2.albert](https://nlp.johnsnowlabs.com/2022/06/24/albert_qa_squad_2.0_en_3_0.html) |[albert_qa_squad_2.0](https://nlp.johnsnowlabs.com/2022/06/24/albert_qa_squad_2.0_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.squad2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_squad2.0_en_3_0.html) |[bert_qa_squad2.0](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_squad2.0_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.squad1.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_squad1.1_en_3_0.html) |[bert_qa_squad1.1](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_squad1.1_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.spanbert_recruit_qa.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_spanbert_recruit_qa_en_3_0.html) |[bert_qa_spanbert_recruit_qa](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_spanbert_recruit_qa_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.spanbert_large_recruit_qa.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_spanbert_large_recruit_qa_en_3_0.html) 
|[bert_qa_spanbert_large_recruit_qa](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_spanbert_large_recruit_qa_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.spanbert_finetuned_squadv2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_spanbert_finetuned_squadv2_en_3_0.html) |[bert_qa_spanbert_finetuned_squadv2](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_spanbert_finetuned_squadv2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.spanbert_finetuned_squadv1.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_spanbert_finetuned_squadv1_en_3_0.html) |[bert_qa_spanbert_finetuned_squadv1](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_spanbert_finetuned_squadv1_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_6.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_6_en_3_0.html) |[bert_qa_spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_6](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_6_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_4.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_4_en_3_0.html) |[bert_qa_spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_4](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_4_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_2_en_3_0.html) 
|[bert_qa_spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_2](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_10.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_10_en_3_0.html) |[bert_qa_spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_10](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_10_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_0.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_0_en_3_0.html) |[bert_qa_spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_0](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_spanbert_base_cased_few_shot_k_64_finetuned_squad_seed_0_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_8.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_8_en_3_0.html) |[bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_8](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_8_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_6.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_6_en_3_0.html) |[bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_6](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_6_en_3_0.html) |\n", - 
"|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_10.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_10_en_3_0.html) |[bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_10](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_10_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_0.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_0_en_3_0.html) |[bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_0](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_spanbert_base_cased_few_shot_k_512_finetuned_squad_seed_0_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_6.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_6_en_3_0.html) |[bert_qa_spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_6](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_6_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_2_en_3_0.html) |[bert_qa_spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_2](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) 
|[en.answer_question.spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_10.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_10_en_3_0.html) |[bert_qa_spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_10](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_10_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_0.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_0_en_3_0.html) |[bert_qa_spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_0](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_spanbert_base_cased_few_shot_k_32_finetuned_squad_seed_0_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.spanbert_base_cased_few_shot_k_256_finetuned_squad_seed_10.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_spanbert_base_cased_few_shot_k_256_finetuned_squad_seed_10_en_3_0.html) |[bert_qa_spanbert_base_cased_few_shot_k_256_finetuned_squad_seed_10](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_spanbert_base_cased_few_shot_k_256_finetuned_squad_seed_10_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.spanbert_base_cased_few_shot_k_16_finetuned_squad_seed_42.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_spanbert_base_cased_few_shot_k_16_finetuned_squad_seed_42_en_3_0.html) |[bert_qa_spanbert_base_cased_few_shot_k_16_finetuned_squad_seed_42](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_spanbert_base_cased_few_shot_k_16_finetuned_squad_seed_42_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_8.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_8_en_3_0.html) 
|[bert_qa_spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_8](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_8_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_6.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_6_en_3_0.html) |[bert_qa_spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_6](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_6_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_4.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_4_en_3_0.html) |[bert_qa_spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_4](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_4_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_10.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_10_en_3_0.html) |[bert_qa_spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_10](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_spanbert_base_cased_few_shot_k_128_finetuned_squad_seed_10_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_8.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_8_en_3_0.html) |[bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_8](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_8_en_3_0.html) |\n", - 
"|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_6.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_6_en_3_0.html) |[bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_6](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_6_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_42.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_42_en_3_0.html) |[bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_42](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_42_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_4.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_4_en_3_0.html) |[bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_4](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_4_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_2_en_3_0.html) |[bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_2](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) 
|[en.answer_question.spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_10.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_10_en_3_0.html) |[bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_10](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_10_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_0.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_0_en_3_0.html) |[bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_0](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_spanbert_base_cased_few_shot_k_1024_finetuned_squad_seed_0_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.slp.albert](https://nlp.johnsnowlabs.com/2022/06/24/albert_qa_slp_en_3_0.html) |[albert_qa_slp](https://nlp.johnsnowlabs.com/2022/06/24/albert_qa_slp_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.scibert_scivocab_uncased_squad_v2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_scibert_scivocab_uncased_squad_v2_en_3_0.html) |[bert_qa_scibert_scivocab_uncased_squad_v2](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_scibert_scivocab_uncased_squad_v2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.scibert_scivocab_uncased_squad.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_scibert_scivocab_uncased_squad_en_3_0.html) |[bert_qa_scibert_scivocab_uncased_squad](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_scibert_scivocab_uncased_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.scibert_nli_squad.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_scibert_nli_squad_en_3_0.html) 
|[bert_qa_scibert_nli_squad](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_scibert_nli_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.sbert_large_nlu_ru_finetuned_squad.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_sbert_large_nlu_ru_finetuned_squad_en_3_0.html) |[bert_qa_sbert_large_nlu_ru_finetuned_squad](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_sbert_large_nlu_ru_finetuned_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.sapbert_from_pubmedbert_squad2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_sapbert_from_pubmedbert_squad2_en_3_0.html) |[bert_qa_sapbert_from_pubmedbert_squad2](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_sapbert_from_pubmedbert_squad2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.salti_bert_base_multilingual_cased_finetuned_squad.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_salti_bert_base_multilingual_cased_finetuned_squad_en_3_0.html) |[bert_qa_salti_bert_base_multilingual_cased_finetuned_squad](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_salti_bert_base_multilingual_cased_finetuned_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.sagemaker_BioclinicalBERT_ADR.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_sagemaker_BioclinicalBERT_ADR_en_3_0.html) |[bert_qa_sagemaker_BioclinicalBERT_ADR](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_sagemaker_BioclinicalBERT_ADR_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.results.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_results_en_3_0.html) |[bert_qa_results](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_results_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) 
|[en.answer_question.question_answering_zh_voidful.bert](https://nlp.johnsnowlabs.com/2022/06/03/bert_qa_question_answering_zh_voidful_en_3_0.html) |[bert_qa_question_answering_zh_voidful](https://nlp.johnsnowlabs.com/2022/06/03/bert_qa_question_answering_zh_voidful_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.qaconv_bert_large_uncased_whole_word_masking_squad2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_qaconv_bert_large_uncased_whole_word_masking_squad2_en_3_0.html) |[bert_qa_qaconv_bert_large_uncased_whole_word_masking_squad2](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_qaconv_bert_large_uncased_whole_word_masking_squad2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.pubmed_bert_squadv2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_pubmed_bert_squadv2_en_3_0.html) |[bert_qa_pubmed_bert_squadv2](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_pubmed_bert_squadv2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.prunebert_base_uncased_6_finepruned_w_distil_squad.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_prunebert_base_uncased_6_finepruned_w_distil_squad_en_3_0.html) |[bert_qa_prunebert_base_uncased_6_finepruned_w_distil_squad](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_prunebert_base_uncased_6_finepruned_w_distil_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.peterhsu_bert_finetuned_squad_accelerate.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_peterhsu_bert_finetuned_squad_accelerate_en_3_0.html) |[bert_qa_peterhsu_bert_finetuned_squad_accelerate](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_peterhsu_bert_finetuned_squad_accelerate_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.peterhsu_bert_finetuned_squad.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_peterhsu_bert_finetuned_squad_en_3_0.html) 
|[bert_qa_peterhsu_bert_finetuned_squad](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_peterhsu_bert_finetuned_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.output_files.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_output_files_en_3_0.html) |[bert_qa_output_files](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_output_files_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.ofirzaf_bert_large_uncased_squad.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_ofirzaf_bert_large_uncased_squad_en_3_0.html) |[bert_qa_ofirzaf_bert_large_uncased_squad](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_ofirzaf_bert_large_uncased_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.nq_bert_el_Danastos.bert](https://nlp.johnsnowlabs.com/2022/06/03/bert_qa_nq_bert_el_Danastos_en_3_0.html) |[bert_qa_nq_bert_el_Danastos](https://nlp.johnsnowlabs.com/2022/06/03/bert_qa_nq_bert_el_Danastos_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.nolog_SciBert_v2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_nolog_SciBert_v2_en_3_0.html) |[bert_qa_nolog_SciBert_v2](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_nolog_SciBert_v2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.no_need_to_name_this.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_no_need_to_name_this_en_3_0.html) |[bert_qa_no_need_to_name_this](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_no_need_to_name_this_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.nlpunibo.albert](https://nlp.johnsnowlabs.com/2022/06/24/albert_qa_nlpunibo_en_3_0.html) |[albert_qa_nlpunibo](https://nlp.johnsnowlabs.com/2022/06/24/albert_qa_nlpunibo_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) 
|[en.answer_question.nickmuchi_bert_finetuned_squad.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_nickmuchi_bert_finetuned_squad_en_3_0.html) |[bert_qa_nickmuchi_bert_finetuned_squad](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_nickmuchi_bert_finetuned_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.newsqa_bert_el_Danastos.bert](https://nlp.johnsnowlabs.com/2022/06/03/bert_qa_newsqa_bert_el_Danastos_en_3_0.html) |[bert_qa_newsqa_bert_el_Danastos](https://nlp.johnsnowlabs.com/2022/06/03/bert_qa_newsqa_bert_el_Danastos_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.news_pretrain_bert_FT_newsqa.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_news_pretrain_bert_FT_newsqa_en_3_0.html) |[bert_qa_news_pretrain_bert_FT_newsqa](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_news_pretrain_bert_FT_newsqa_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.news_pretrain_bert_FT_new_newsqa.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_news_pretrain_bert_FT_new_newsqa_en_3_0.html) |[bert_qa_news_pretrain_bert_FT_new_newsqa](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_news_pretrain_bert_FT_new_newsqa_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.neuralmagic_bert_squad_12layer_0sparse.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_neuralmagic_bert_squad_12layer_0sparse_en_3_0.html) |[bert_qa_neuralmagic_bert_squad_12layer_0sparse](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_neuralmagic_bert_squad_12layer_0sparse_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.muril_large_squad2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_muril_large_squad2_en_3_0.html) |[bert_qa_muril_large_squad2](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_muril_large_squad2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) 
|[en.answer_question.muril_large_cased_hita_qa.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_muril_large_cased_hita_qa_en_3_0.html) |[bert_qa_muril_large_cased_hita_qa](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_muril_large_cased_hita_qa_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.muril_finetuned_squadv1.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_muril_finetuned_squadv1_en_3_0.html) |[bert_qa_muril_finetuned_squadv1](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_muril_finetuned_squadv1_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.muril_finetuned_squad.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_muril_finetuned_squad_en_3_0.html) |[bert_qa_muril_finetuned_squad](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_muril_finetuned_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.multilingual_bert_base_cased_vietnamese.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_multilingual_bert_base_cased_vietnamese_en_3_0.html) |[bert_qa_multilingual_bert_base_cased_vietnamese](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_multilingual_bert_base_cased_vietnamese_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.multilingual_bert_base_cased_spanish.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_multilingual_bert_base_cased_spanish_en_3_0.html) |[bert_qa_multilingual_bert_base_cased_spanish](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_multilingual_bert_base_cased_spanish_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.multilingual_bert_base_cased_hindi.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_multilingual_bert_base_cased_hindi_en_3_0.html) |[bert_qa_multilingual_bert_base_cased_hindi](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_multilingual_bert_base_cased_hindi_en_3_0.html) |\n", - 
"|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.multilingual_bert_base_cased_german.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_multilingual_bert_base_cased_german_en_3_0.html) |[bert_qa_multilingual_bert_base_cased_german](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_multilingual_bert_base_cased_german_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.multilingual_bert_base_cased_english.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_multilingual_bert_base_cased_english_en_3_0.html) |[bert_qa_multilingual_bert_base_cased_english](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_multilingual_bert_base_cased_english_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.multilingual_bert_base_cased_arabic.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_multilingual_bert_base_cased_arabic_en_3_0.html) |[bert_qa_multilingual_bert_base_cased_arabic](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_multilingual_bert_base_cased_arabic_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.mrp_bert_finetuned_squad.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_mrp_bert_finetuned_squad_en_3_0.html) |[bert_qa_mrp_bert_finetuned_squad](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_mrp_bert_finetuned_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.mrbalazs5_bert_finetuned_squad.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_mrbalazs5_bert_finetuned_squad_en_3_0.html) |[bert_qa_mrbalazs5_bert_finetuned_squad](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_mrbalazs5_bert_finetuned_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.mqa_unsupsim.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_mqa_unsupsim_en_3_0.html) |[bert_qa_mqa_unsupsim](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_mqa_unsupsim_en_3_0.html) |\n", 
- "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.mqa_sim.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_mqa_sim_en_3_0.html) |[bert_qa_mqa_sim](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_mqa_sim_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.mqa_cls.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_mqa_cls_en_3_0.html) |[bert_qa_mqa_cls](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_mqa_cls_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.mqa_baseline.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_mqa_baseline_en_3_0.html) |[bert_qa_mqa_baseline](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_mqa_baseline_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.model_output.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_model_output_en_3_0.html) |[bert_qa_model_output](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_model_output_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.minilm_uncased_squad2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_minilm_uncased_squad2_en_3_0.html) |[bert_qa_minilm_uncased_squad2](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_minilm_uncased_squad2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.mBERT_all_ty_SQen_SQ20_1.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_mBERT_all_ty_SQen_SQ20_1_en_3_0.html) |[bert_qa_mBERT_all_ty_SQen_SQ20_1](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_mBERT_all_ty_SQen_SQ20_1_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.linkbert_large_finetuned_squad.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_linkbert_large_finetuned_squad_en_3_0.html) |[bert_qa_linkbert_large_finetuned_squad](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_linkbert_large_finetuned_squad_en_3_0.html) |\n", - 
"|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.kaporter_bert_base_uncased_finetuned_squad.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_kaporter_bert_base_uncased_finetuned_squad_en_3_0.html) |[bert_qa_kaporter_bert_base_uncased_finetuned_squad](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_kaporter_bert_base_uncased_finetuned_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.kamilali_distilbert_base_uncased_finetuned_squad.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_kamilali_distilbert_base_uncased_finetuned_squad_en_3_0.html) |[bert_qa_kamilali_distilbert_base_uncased_finetuned_squad](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_kamilali_distilbert_base_uncased_finetuned_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.juliusco_distilbert_base_uncased_finetuned_squad.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_juliusco_distilbert_base_uncased_finetuned_squad_en_3_0.html) |[bert_qa_juliusco_distilbert_base_uncased_finetuned_squad](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_juliusco_distilbert_base_uncased_finetuned_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.jimypbr_bert_base_uncased_squad.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_jimypbr_bert_base_uncased_squad_en_3_0.html) |[bert_qa_jimypbr_bert_base_uncased_squad](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_jimypbr_bert_base_uncased_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.jatinshah_bert_finetuned_squad.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_jatinshah_bert_finetuned_squad_en_3_0.html) |[bert_qa_jatinshah_bert_finetuned_squad](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_jatinshah_bert_finetuned_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) 
|[en.answer_question.ixambert_finetuned_squad_eu_en_MarcBrun.bert](https://nlp.johnsnowlabs.com/2022/06/03/bert_qa_ixambert_finetuned_squad_eu_en_MarcBrun_en_3_0.html) |[bert_qa_ixambert_finetuned_squad_eu_en_MarcBrun](https://nlp.johnsnowlabs.com/2022/06/03/bert_qa_ixambert_finetuned_squad_eu_en_MarcBrun_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.ixambert_finetuned_squad_eu_MarcBrun.bert](https://nlp.johnsnowlabs.com/2022/06/03/bert_qa_ixambert_finetuned_squad_eu_MarcBrun_en_3_0.html) |[bert_qa_ixambert_finetuned_squad_eu_MarcBrun](https://nlp.johnsnowlabs.com/2022/06/03/bert_qa_ixambert_finetuned_squad_eu_MarcBrun_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.ixambert_finetuned_squad.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_ixambert_finetuned_squad_en_3_0.html) |[bert_qa_ixambert_finetuned_squad](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_ixambert_finetuned_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.internetoftim_bert_large_uncased_squad.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_internetoftim_bert_large_uncased_squad_en_3_0.html) |[bert_qa_internetoftim_bert_large_uncased_squad](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_internetoftim_bert_large_uncased_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.huggingface_course_bert_finetuned_squad_accelerate.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_huggingface_course_bert_finetuned_squad_accelerate_en_3_0.html) |[bert_qa_huggingface_course_bert_finetuned_squad_accelerate](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_huggingface_course_bert_finetuned_squad_accelerate_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) 
|[en.answer_question.huggingface_course_bert_finetuned_squad.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_huggingface_course_bert_finetuned_squad_en_3_0.html) |[bert_qa_huggingface_course_bert_finetuned_squad](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_huggingface_course_bert_finetuned_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.howey_bert_large_uncased_squad.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_howey_bert_large_uncased_squad_en_3_0.html) |[bert_qa_howey_bert_large_uncased_squad](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_howey_bert_large_uncased_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.generic.albert](https://nlp.johnsnowlabs.com/2022/06/24/albert_qa_generic_en_3_0.html) |[albert_qa_generic](https://nlp.johnsnowlabs.com/2022/06/24/albert_qa_generic_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.fpdm_triplet_bert_FT_newsqa.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_fpdm_triplet_bert_FT_newsqa_en_3_0.html) |[bert_qa_fpdm_triplet_bert_FT_newsqa](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_fpdm_triplet_bert_FT_newsqa_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.fpdm_triplet_bert_FT_new_newsqa.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_fpdm_triplet_bert_FT_new_newsqa_en_3_0.html) |[bert_qa_fpdm_triplet_bert_FT_new_newsqa](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_fpdm_triplet_bert_FT_new_newsqa_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.fpdm_hier_bert_FT_newsqa.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_fpdm_hier_bert_FT_newsqa_en_3_0.html) |[bert_qa_fpdm_hier_bert_FT_newsqa](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_fpdm_hier_bert_FT_newsqa_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) 
|[en.answer_question.fpdm_hier_bert_FT_new_newsqa.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_fpdm_hier_bert_FT_new_newsqa_en_3_0.html) |[bert_qa_fpdm_hier_bert_FT_new_newsqa](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_fpdm_hier_bert_FT_new_newsqa_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.fpdm_bert_FT_newsqa.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_fpdm_bert_FT_newsqa_en_3_0.html) |[bert_qa_fpdm_bert_FT_newsqa](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_fpdm_bert_FT_newsqa_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.fpdm_bert_FT_new_newsqa.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_fpdm_bert_FT_new_newsqa_en_3_0.html) |[bert_qa_fpdm_bert_FT_new_newsqa](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_fpdm_bert_FT_new_newsqa_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.finetune_bert_base_v3.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_finetune_bert_base_v3_en_3_0.html) |[bert_qa_finetune_bert_base_v3](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_finetune_bert_base_v3_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.finetune_bert_base_v2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_finetune_bert_base_v2_en_3_0.html) |[bert_qa_finetune_bert_base_v2](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_finetune_bert_base_v2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.finetune_bert_base_v1.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_finetune_bert_base_v1_en_3_0.html) |[bert_qa_finetune_bert_base_v1](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_finetune_bert_base_v1_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.fine_tuned_tweetqa_aip.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_fine_tuned_tweetqa_aip_en_3_0.html) 
|[bert_qa_fine_tuned_tweetqa_aip](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_fine_tuned_tweetqa_aip_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.fine_tuned_squad_aip.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_fine_tuned_squad_aip_en_3_0.html) |[bert_qa_fine_tuned_squad_aip](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_fine_tuned_squad_aip_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.fewrel_zero_shot.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_fewrel_zero_shot_en_3_0.html) |[bert_qa_fewrel_zero_shot](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_fewrel_zero_shot_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.eauction_section_parsing_from_pretrained.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_eauction_section_parsing_from_pretrained_en_3_0.html) |[bert_qa_eauction_section_parsing_from_pretrained](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_eauction_section_parsing_from_pretrained_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.e_cased_qa_squad2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_base_cased_qa_squad2_en_3_0.html) |[bert_base_cased_qa_squad2](https://nlp.johnsnowlabs.com/2022/06/02/bert_base_cased_qa_squad2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.distilbert_base_uncased_finetuned_custom.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_distilbert_base_uncased_finetuned_custom_en_3_0.html) |[bert_qa_distilbert_base_uncased_finetuned_custom](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_distilbert_base_uncased_finetuned_custom_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.demo.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_demo_en_3_0.html) |[bert_qa_demo](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_demo_en_3_0.html) |\n", - 
"|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.deepset_bert_base_uncased_squad2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_deepset_bert_base_uncased_squad2_en_3_0.html) |[bert_qa_deepset_bert_base_uncased_squad2](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_deepset_bert_base_uncased_squad2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.debug_squad.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_debug_squad_en_3_0.html) |[bert_qa_debug_squad](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_debug_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.datauma_bert_finetuned_squad.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_datauma_bert_finetuned_squad_en_3_0.html) |[bert_qa_datauma_bert_finetuned_squad](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_datauma_bert_finetuned_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.csarron_bert_base_uncased_squad_v1.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_csarron_bert_base_uncased_squad_v1_en_3_0.html) |[bert_qa_csarron_bert_base_uncased_squad_v1](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_csarron_bert_base_uncased_squad_v1_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.cs224n_squad2.albert](https://nlp.johnsnowlabs.com/2022/06/24/albert_qa_cs224n_squad2.0_xxlarge_v1_en_3_0.html) |[albert_qa_cs224n_squad2.0_xxlarge_v1](https://nlp.johnsnowlabs.com/2022/06/24/albert_qa_cs224n_squad2.0_xxlarge_v1_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.covidbert_squad.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_covidbert_squad_en_3_0.html) |[bert_qa_covidbert_squad](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_covidbert_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) 
|[en.answer_question.covid_squad.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_covid_squad_en_3_0.html) |[bert_qa_covid_squad](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_covid_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.chemical_bert_uncased_squad2.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_chemical_bert_uncased_squad2_en_3_0.html) |[bert_qa_chemical_bert_uncased_squad2](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_chemical_bert_uncased_squad2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.causal_qa.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_causal_qa_en_3_0.html) |[bert_qa_causal_qa](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_causal_qa_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.braquad_bert_qna.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_braquad_bert_qna_en_3_0.html) |[bert_qa_braquad_bert_qna](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_braquad_bert_qna_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.biomedical_slot_filling_reader_large.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_biomedical_slot_filling_reader_large_en_3_0.html) |[bert_qa_biomedical_slot_filling_reader_large](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_biomedical_slot_filling_reader_large_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.biomedical_slot_filling_reader_base.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_biomedical_slot_filling_reader_base_en_3_0.html) |[bert_qa_biomedical_slot_filling_reader_base](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_biomedical_slot_filling_reader_base_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bioformer_cased_v1.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bioformer_cased_v1.0_squad1_en_3_0.html) 
|[bert_qa_bioformer_cased_v1.0_squad1](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bioformer_cased_v1.0_squad1_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.biobert_v1.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_biobert_v1.1_pubmed_squad_v2_en_3_0.html) |[bert_qa_biobert_v1.1_pubmed_squad_v2](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_biobert_v1.1_pubmed_squad_v2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.biobert_squad2_cased_finetuned_squad.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_biobert_squad2_cased_finetuned_squad_en_3_0.html) |[bert_qa_biobert_squad2_cased_finetuned_squad](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_biobert_squad2_cased_finetuned_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.biobert_squad2_cased.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_biobert_squad2_cased_en_3_0.html) |[bert_qa_biobert_squad2_cased](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_biobert_squad2_cased_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.biobert_bioasq.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_biobert_bioasq_en_3_0.html) |[bert_qa_biobert_bioasq](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_biobert_bioasq_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.biobert_base_cased_v1.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_biobert_base_cased_v1.1_squad_finetuned_covdrobert_en_3_0.html) |[bert_qa_biobert_base_cased_v1.1_squad_finetuned_covdrobert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_biobert_base_cased_v1.1_squad_finetuned_covdrobert_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bertserini_bert_large_squad.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bertserini_bert_large_squad_en_3_0.html) 
|[bert_qa_bertserini_bert_large_squad](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bertserini_bert_large_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bertserini_bert_base_squad.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bertserini_bert_base_squad_en_3_0.html) |[bert_qa_bertserini_bert_base_squad](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bertserini_bert_base_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bertimbau_squad1.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bertimbau_squad1.1_en_3_0.html) |[bert_qa_bertimbau_squad1.1](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bertimbau_squad1.1_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_uncased_L_6_H_128_A_2_squad2_covid_qna.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_uncased_L_6_H_128_A_2_squad2_covid_qna_en_3_0.html) |[bert_qa_bert_uncased_L_6_H_128_A_2_squad2_covid_qna](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_uncased_L_6_H_128_A_2_squad2_covid_qna_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_uncased_L_6_H_128_A_2_squad2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_uncased_L_6_H_128_A_2_squad2_en_3_0.html) |[bert_qa_bert_uncased_L_6_H_128_A_2_squad2](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_uncased_L_6_H_128_A_2_squad2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_uncased_L_4_H_768_A_12_squad2_covid_qna.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_uncased_L_4_H_768_A_12_squad2_covid_qna_en_3_0.html) |[bert_qa_bert_uncased_L_4_H_768_A_12_squad2_covid_qna](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_uncased_L_4_H_768_A_12_squad2_covid_qna_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) 
|[en.answer_question.bert_uncased_L_4_H_768_A_12_squad2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_uncased_L_4_H_768_A_12_squad2_en_3_0.html) |[bert_qa_bert_uncased_L_4_H_768_A_12_squad2](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_uncased_L_4_H_768_A_12_squad2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_uncased_L_4_H_768_A_12_cord19_200616_squad2_covid_qna.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_uncased_L_4_H_768_A_12_cord19_200616_squad2_covid_qna_en_3_0.html) |[bert_qa_bert_uncased_L_4_H_768_A_12_cord19_200616_squad2_covid_qna](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_uncased_L_4_H_768_A_12_cord19_200616_squad2_covid_qna_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_uncased_L_4_H_768_A_12_cord19_200616_squad2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_uncased_L_4_H_768_A_12_cord19_200616_squad2_en_3_0.html) |[bert_qa_bert_uncased_L_4_H_768_A_12_cord19_200616_squad2](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_uncased_L_4_H_768_A_12_cord19_200616_squad2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_uncased_L_4_H_512_A_8_squad2_covid_qna.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_uncased_L_4_H_512_A_8_squad2_covid_qna_en_3_0.html) |[bert_qa_bert_uncased_L_4_H_512_A_8_squad2_covid_qna](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_uncased_L_4_H_512_A_8_squad2_covid_qna_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_uncased_L_4_H_512_A_8_squad2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_uncased_L_4_H_512_A_8_squad2_en_3_0.html) |[bert_qa_bert_uncased_L_4_H_512_A_8_squad2](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_uncased_L_4_H_512_A_8_squad2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) 
|[en.answer_question.bert_uncased_L_4_H_512_A_8_cord19_200616_squad2_covid_qna.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_uncased_L_4_H_512_A_8_cord19_200616_squad2_covid_qna_en_3_0.html) |[bert_qa_bert_uncased_L_4_H_512_A_8_cord19_200616_squad2_covid_qna](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_uncased_L_4_H_512_A_8_cord19_200616_squad2_covid_qna_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_uncased_L_4_H_512_A_8_cord19_200616_squad2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_uncased_L_4_H_512_A_8_cord19_200616_squad2_en_3_0.html) |[bert_qa_bert_uncased_L_4_H_512_A_8_cord19_200616_squad2](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_uncased_L_4_H_512_A_8_cord19_200616_squad2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_uncased_L_4_H_256_A_4_squad2_covid_qna.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_uncased_L_4_H_256_A_4_squad2_covid_qna_en_3_0.html) |[bert_qa_bert_uncased_L_4_H_256_A_4_squad2_covid_qna](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_uncased_L_4_H_256_A_4_squad2_covid_qna_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_uncased_L_4_H_256_A_4_squad2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_uncased_L_4_H_256_A_4_squad2_en_3_0.html) |[bert_qa_bert_uncased_L_4_H_256_A_4_squad2](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_uncased_L_4_H_256_A_4_squad2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_uncased_L_4_H_256_A_4_cord19_200616_squad2_covid_qna.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_uncased_L_4_H_256_A_4_cord19_200616_squad2_covid_qna_en_3_0.html) |[bert_qa_bert_uncased_L_4_H_256_A_4_cord19_200616_squad2_covid_qna](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_uncased_L_4_H_256_A_4_cord19_200616_squad2_covid_qna_en_3_0.html) |\n", - 
"|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_uncased_L_4_H_256_A_4_cord19_200616_squad2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_uncased_L_4_H_256_A_4_cord19_200616_squad2_en_3_0.html) |[bert_qa_bert_uncased_L_4_H_256_A_4_cord19_200616_squad2](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_uncased_L_4_H_256_A_4_cord19_200616_squad2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_uncased_L_2_H_512_A_8_squad2_covid_qna.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_uncased_L_2_H_512_A_8_squad2_covid_qna_en_3_0.html) |[bert_qa_bert_uncased_L_2_H_512_A_8_squad2_covid_qna](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_uncased_L_2_H_512_A_8_squad2_covid_qna_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_uncased_L_2_H_512_A_8_squad2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_uncased_L_2_H_512_A_8_squad2_en_3_0.html) |[bert_qa_bert_uncased_L_2_H_512_A_8_squad2](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_uncased_L_2_H_512_A_8_squad2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_uncased_L_2_H_512_A_8_cord19_200616_squad2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_uncased_L_2_H_512_A_8_cord19_200616_squad2_en_3_0.html) |[bert_qa_bert_uncased_L_2_H_512_A_8_cord19_200616_squad2](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_uncased_L_2_H_512_A_8_cord19_200616_squad2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_uncased_L_10_H_512_A_8_squad2_covid_qna.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_uncased_L_10_H_512_A_8_squad2_covid_qna_en_3_0.html) |[bert_qa_bert_uncased_L_10_H_512_A_8_squad2_covid_qna](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_uncased_L_10_H_512_A_8_squad2_covid_qna_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) 
|[en.answer_question.bert_uncased_L_10_H_512_A_8_squad2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_uncased_L_10_H_512_A_8_squad2_en_3_0.html) |[bert_qa_bert_uncased_L_10_H_512_A_8_squad2](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_uncased_L_10_H_512_A_8_squad2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_uncased_L_10_H_512_A_8_cord19_200616_squad2_covid_qna.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_uncased_L_10_H_512_A_8_cord19_200616_squad2_covid_qna_en_3_0.html) |[bert_qa_bert_uncased_L_10_H_512_A_8_cord19_200616_squad2_covid_qna](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_uncased_L_10_H_512_A_8_cord19_200616_squad2_covid_qna_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_uncased_L_10_H_512_A_8_cord19_200616_squad2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_uncased_L_10_H_512_A_8_cord19_200616_squad2_en_3_0.html) |[bert_qa_bert_uncased_L_10_H_512_A_8_cord19_200616_squad2](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_uncased_L_10_H_512_A_8_cord19_200616_squad2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_tiny_finetuned_squadv2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_tiny_finetuned_squadv2_en_3_0.html) |[bert_qa_bert_tiny_finetuned_squadv2](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_tiny_finetuned_squadv2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_tiny_finetuned_squad.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_tiny_finetuned_squad_en_3_0.html) |[bert_qa_bert_tiny_finetuned_squad](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_tiny_finetuned_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) 
|[en.answer_question.bert_tiny_5_finetuned_squadv2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_tiny_5_finetuned_squadv2_en_3_0.html) |[bert_qa_bert_tiny_5_finetuned_squadv2](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_tiny_5_finetuned_squadv2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_tiny_4_finetuned_squadv2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_tiny_4_finetuned_squadv2_en_3_0.html) |[bert_qa_bert_tiny_4_finetuned_squadv2](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_tiny_4_finetuned_squadv2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_tiny_3_finetuned_squadv2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_tiny_3_finetuned_squadv2_en_3_0.html) |[bert_qa_bert_tiny_3_finetuned_squadv2](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_tiny_3_finetuned_squadv2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_tiny_2_finetuned_squadv2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_tiny_2_finetuned_squadv2_en_3_0.html) |[bert_qa_bert_tiny_2_finetuned_squadv2](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_tiny_2_finetuned_squadv2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_small_wrslb_finetuned_squadv1.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_small_wrslb_finetuned_squadv1_en_3_0.html) |[bert_qa_bert_small_wrslb_finetuned_squadv1](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_small_wrslb_finetuned_squadv1_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_small_pretrained_finetuned_squad.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_small_pretrained_finetuned_squad_en_3_0.html) 
|[bert_qa_bert_small_pretrained_finetuned_squad](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_small_pretrained_finetuned_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_small_finetuned_squadv2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_small_finetuned_squadv2_en_3_0.html) |[bert_qa_bert_small_finetuned_squadv2](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_small_finetuned_squadv2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_small_finetuned_squad.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_small_finetuned_squad_en_3_0.html) |[bert_qa_bert_small_finetuned_squad](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_small_finetuned_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_small_cord19qa.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_small_cord19qa_en_3_0.html) |[bert_qa_bert_small_cord19qa](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_small_cord19qa_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_small_cord19_squad2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_small_cord19_squad2_en_3_0.html) |[bert_qa_bert_small_cord19_squad2](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_small_cord19_squad2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_small_2_finetuned_squadv2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_small_2_finetuned_squadv2_en_3_0.html) |[bert_qa_bert_small_2_finetuned_squadv2](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_small_2_finetuned_squadv2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_set_date_1_lr_2e_5_bs_32_ep_4.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_set_date_1_lr_2e_5_bs_32_ep_4_en_3_0.html) 
|[bert_qa_bert_set_date_1_lr_2e_5_bs_32_ep_4](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_set_date_1_lr_2e_5_bs_32_ep_4_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_reader_squad2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_reader_squad2_en_3_0.html) |[bert_qa_bert_reader_squad2](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_reader_squad2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_qasper.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_qasper_en_3_0.html) |[bert_qa_bert_qasper](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_qasper_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_qa_vi_nvkha.bert](https://nlp.johnsnowlabs.com/2022/06/03/bert_qa_bert_qa_vi_nvkha_en_3_0.html) |[bert_qa_bert_qa_vi_nvkha](https://nlp.johnsnowlabs.com/2022/06/03/bert_qa_bert_qa_vi_nvkha_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_multi_uncased_finetuned_chaii.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_multi_uncased_finetuned_chaii_en_3_0.html) |[bert_qa_bert_multi_uncased_finetuned_chaii](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_multi_uncased_finetuned_chaii_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_multi_cased_squad_sv_marbogusz.bert](https://nlp.johnsnowlabs.com/2022/06/03/bert_qa_bert_multi_cased_squad_sv_marbogusz_en_3_0.html) |[bert_qa_bert_multi_cased_squad_sv_marbogusz](https://nlp.johnsnowlabs.com/2022/06/03/bert_qa_bert_multi_cased_squad_sv_marbogusz_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_multi_cased_finetuned_xquadv1_finetuned_squad_colab.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_multi_cased_finetuned_xquadv1_finetuned_squad_colab_en_3_0.html) 
|[bert_qa_bert_multi_cased_finetuned_xquadv1_finetuned_squad_colab](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_multi_cased_finetuned_xquadv1_finetuned_squad_colab_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_multi_cased_finetuned_chaii.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_multi_cased_finetuned_chaii_en_3_0.html) |[bert_qa_bert_multi_cased_finetuned_chaii](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_multi_cased_finetuned_chaii_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_multi_cased_finedtuned_xquad_chaii.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_multi_cased_finedtuned_xquad_chaii_en_3_0.html) |[bert_qa_bert_multi_cased_finedtuned_xquad_chaii](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_multi_cased_finedtuned_xquad_chaii_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_mini_wrslb_finetuned_squadv1.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_mini_wrslb_finetuned_squadv1_en_3_0.html) |[bert_qa_bert_mini_wrslb_finetuned_squadv1](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_mini_wrslb_finetuned_squadv1_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_mini_finetuned_squadv2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_mini_finetuned_squadv2_en_3_0.html) |[bert_qa_bert_mini_finetuned_squadv2](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_mini_finetuned_squadv2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_mini_5_finetuned_squadv2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_mini_5_finetuned_squadv2_en_3_0.html) |[bert_qa_bert_mini_5_finetuned_squadv2](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_mini_5_finetuned_squadv2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) 
|[en.answer_question.bert_medium_wrslb_finetuned_squadv1.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_medium_wrslb_finetuned_squadv1_en_3_0.html) |[bert_qa_bert_medium_wrslb_finetuned_squadv1](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_medium_wrslb_finetuned_squadv1_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_medium_squad2_distilled.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_medium_squad2_distilled_en_3_0.html) |[bert_qa_bert_medium_squad2_distilled](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_medium_squad2_distilled_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_medium_pretrained_finetuned_squad.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_medium_pretrained_finetuned_squad_en_3_0.html) |[bert_qa_bert_medium_pretrained_finetuned_squad](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_medium_pretrained_finetuned_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_medium_finetuned_squadv2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_medium_finetuned_squadv2_en_3_0.html) |[bert_qa_bert_medium_finetuned_squadv2](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_medium_finetuned_squadv2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_medium_finetuned_squad.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_medium_finetuned_squad_en_3_0.html) |[bert_qa_bert_medium_finetuned_squad](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_medium_finetuned_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_large_uncased_wwm_squadv2_x2.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_large_uncased_wwm_squadv2_x2.15_f83.2_d25_hybrid_v1_en_3_0.html) 
|[bert_qa_bert_large_uncased_wwm_squadv2_x2.15_f83.2_d25_hybrid_v1](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_large_uncased_wwm_squadv2_x2.15_f83.2_d25_hybrid_v1_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_large_uncased_whole_word_masking_squad2_with_ner_mit_restaurant_with_neg_with_repeat.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_large_uncased_whole_word_masking_squad2_with_ner_mit_restaurant_with_neg_with_repeat_en_3_0.html) |[bert_qa_bert_large_uncased_whole_word_masking_squad2_with_ner_mit_restaurant_with_neg_with_repeat](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_large_uncased_whole_word_masking_squad2_with_ner_mit_restaurant_with_neg_with_repeat_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_large_uncased_whole_word_masking_squad2_with_ner_mit_movie_with_neg_with_repeat.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_large_uncased_whole_word_masking_squad2_with_ner_mit_movie_with_neg_with_repeat_en_3_0.html) |[bert_qa_bert_large_uncased_whole_word_masking_squad2_with_ner_mit_movie_with_neg_with_repeat](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_large_uncased_whole_word_masking_squad2_with_ner_mit_movie_with_neg_with_repeat_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_large_uncased_whole_word_masking_squad2_with_ner_conll2003_with_neg_with_repeat.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_large_uncased_whole_word_masking_squad2_with_ner_conll2003_with_neg_with_repeat_en_3_0.html) |[bert_qa_bert_large_uncased_whole_word_masking_squad2_with_ner_conll2003_with_neg_with_repeat](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_large_uncased_whole_word_masking_squad2_with_ner_conll2003_with_neg_with_repeat_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) 
|[en.answer_question.bert_large_uncased_whole_word_masking_squad2_with_ner_Pwhatisthe_conll2003_with_neg_with_repeat.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_large_uncased_whole_word_masking_squad2_with_ner_Pwhatisthe_conll2003_with_neg_with_repeat_en_3_0.html) |[bert_qa_bert_large_uncased_whole_word_masking_squad2_with_ner_Pwhatisthe_conll2003_with_neg_with_repeat](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_large_uncased_whole_word_masking_squad2_with_ner_Pwhatisthe_conll2003_with_neg_with_repeat_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_large_uncased_whole_word_masking_squad2_with_ner_Pistherea_conll2003_with_neg_with_repeat.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_large_uncased_whole_word_masking_squad2_with_ner_Pistherea_conll2003_with_neg_with_repeat_en_3_0.html) |[bert_qa_bert_large_uncased_whole_word_masking_squad2_with_ner_Pistherea_conll2003_with_neg_with_repeat](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_large_uncased_whole_word_masking_squad2_with_ner_Pistherea_conll2003_with_neg_with_repeat_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_large_uncased_whole_word_masking_squad2.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_large_uncased_whole_word_masking_squad2_en_3_0.html) |[bert_qa_bert_large_uncased_whole_word_masking_squad2](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_large_uncased_whole_word_masking_squad2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_large_uncased_whole_word_masking_finetuned_squadv2.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_large_uncased_whole_word_masking_finetuned_squadv2_en_3_0.html) |[bert_qa_bert_large_uncased_whole_word_masking_finetuned_squadv2](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_large_uncased_whole_word_masking_finetuned_squadv2_en_3_0.html) |\n", - 
"|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_large_uncased_whole_word_masking_finetuned_squad_finetuned_islamic_squad.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_large_uncased_whole_word_masking_finetuned_squad_finetuned_islamic_squad_en_3_0.html) |[bert_qa_bert_large_uncased_whole_word_masking_finetuned_squad_finetuned_islamic_squad](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_large_uncased_whole_word_masking_finetuned_squad_finetuned_islamic_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_large_uncased_whole_word_masking_finetuned_squad.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_large_uncased_whole_word_masking_finetuned_squad_en_3_0.html) |[bert_qa_bert_large_uncased_whole_word_masking_finetuned_squad](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_large_uncased_whole_word_masking_finetuned_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_large_uncased_whole_word_masking_finetuned_chaii.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_large_uncased_whole_word_masking_finetuned_chaii_en_3_0.html) |[bert_qa_bert_large_uncased_whole_word_masking_finetuned_chaii](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_large_uncased_whole_word_masking_finetuned_chaii_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_large_uncased_whole_word_masking_chaii.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_large_uncased_whole_word_masking_chaii_en_3_0.html) |[bert_qa_bert_large_uncased_whole_word_masking_chaii](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_large_uncased_whole_word_masking_chaii_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_large_uncased_squadv2.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_large_uncased_squadv2_en_3_0.html) 
|[bert_qa_bert_large_uncased_squadv2](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_large_uncased_squadv2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_large_uncased_squadv1.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_large_uncased_squadv1.1_sparse_90_unstructured_en_3_0.html) |[bert_qa_bert_large_uncased_squadv1.1_sparse_90_unstructured](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_large_uncased_squadv1.1_sparse_90_unstructured_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_large_uncased_squad2_covid_qa_deepset.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_large_uncased_squad2_covid_qa_deepset_en_3_0.html) |[bert_qa_bert_large_uncased_squad2_covid_qa_deepset](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_large_uncased_squad2_covid_qa_deepset_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_large_uncased_finetuned_docvqa.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_large_uncased_finetuned_docvqa_en_3_0.html) |[bert_qa_bert_large_uncased_finetuned_docvqa](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_large_uncased_finetuned_docvqa_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_large_question_answering_finetuned_legal.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_large_question_answering_finetuned_legal_en_3_0.html) |[bert_qa_bert_large_question_answering_finetuned_legal](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_large_question_answering_finetuned_legal_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_large_finetuned_squad2.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_large_finetuned_squad2_en_3_0.html) |[bert_qa_bert_large_finetuned_squad2](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_large_finetuned_squad2_en_3_0.html) 
|\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_large_finetuned.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_large_finetuned_en_3_0.html) |[bert_qa_bert_large_finetuned](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_large_finetuned_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_large_faquad.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_large_faquad_en_3_0.html) |[bert_qa_bert_large_faquad](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_large_faquad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_large_cased_whole_word_masking_finetuned_squad.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_large_cased_whole_word_masking_finetuned_squad_en_3_0.html) |[bert_qa_bert_large_cased_whole_word_masking_finetuned_squad](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_large_cased_whole_word_masking_finetuned_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_l_squadv1.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_l_squadv1.1_sl384_en_3_0.html) |[bert_qa_bert_l_squadv1.1_sl384](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_l_squadv1.1_sl384_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_finetuned_squad_pytorch.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_finetuned_squad_pytorch_en_3_0.html) |[bert_qa_bert_finetuned_squad_pytorch](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_finetuned_squad_pytorch_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_finetuned_squad_accelerate_10epoch_transformerfrozen.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_finetuned_squad_accelerate_10epoch_transformerfrozen_en_3_0.html) 
|[bert_qa_bert_finetuned_squad_accelerate_10epoch_transformerfrozen](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_finetuned_squad_accelerate_10epoch_transformerfrozen_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_finetuned_squad2.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_finetuned_squad2_en_3_0.html) |[bert_qa_bert_finetuned_squad2](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_finetuned_squad2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_finetuned_squad1.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_finetuned_squad1_en_3_0.html) |[bert_qa_bert_finetuned_squad1](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_finetuned_squad1_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_finetuned_qa.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_finetuned_qa_en_3_0.html) |[bert_qa_bert_finetuned_qa](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_finetuned_qa_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_finetuned_lr2_e5_b16_ep2.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_finetuned_lr2_e5_b16_ep2_en_3_0.html) |[bert_qa_bert_finetuned_lr2_e5_b16_ep2](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_finetuned_lr2_e5_b16_ep2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_finetuned_jackh1995.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_finetuned_jackh1995_en_3_0.html) |[bert_qa_bert_finetuned_jackh1995](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_finetuned_jackh1995_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_fa_QA_v1.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_fa_QA_v1_en_3_0.html) 
|[bert_qa_bert_fa_QA_v1](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_fa_QA_v1_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_base_uncased_squadv1_x2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_uncased_squadv1_x2.44_f87.7_d26_hybrid_filled_v1_en_3_0.html) |[bert_qa_bert_base_uncased_squadv1_x2.44_f87.7_d26_hybrid_filled_v1](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_uncased_squadv1_x2.44_f87.7_d26_hybrid_filled_v1_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_base_uncased_squadv1_x1.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_uncased_squadv1_x1.96_f88.3_d27_hybrid_filled_opt_v1_en_3_0.html) |[bert_qa_bert_base_uncased_squadv1_x1.96_f88.3_d27_hybrid_filled_opt_v1](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_uncased_squadv1_x1.96_f88.3_d27_hybrid_filled_opt_v1_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_base_uncased_squadv1.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_uncased_squadv1.1_sparse_80_1x4_block_pruneofa_en_3_0.html) |[bert_qa_bert_base_uncased_squadv1.1_sparse_80_1x4_block_pruneofa](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_uncased_squadv1.1_sparse_80_1x4_block_pruneofa_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_base_uncased_squad_v1_sparse0.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_uncased_squad_v1_sparse0.25_en_3_0.html) |[bert_qa_bert_base_uncased_squad_v1_sparse0.25](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_uncased_squad_v1_sparse0.25_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_base_uncased_squad_L6.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_uncased_squad_L6_en_3_0.html) 
|[bert_qa_bert_base_uncased_squad_L6](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_uncased_squad_L6_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_base_uncased_squad_L3.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_uncased_squad_L3_en_3_0.html) |[bert_qa_bert_base_uncased_squad_L3](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_uncased_squad_L3_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_base_uncased_squad2_covid_qa_deepset.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_uncased_squad2_covid_qa_deepset_en_3_0.html) |[bert_qa_bert_base_uncased_squad2_covid_qa_deepset](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_uncased_squad2_covid_qa_deepset_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_base_uncased_squad1.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_uncased_squad1.1_pruned_x3.2_v2_en_3_0.html) |[bert_qa_bert_base_uncased_squad1.1_pruned_x3.2_v2](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_uncased_squad1.1_pruned_x3.2_v2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_base_uncased_qa_squad2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_uncased_qa_squad2_en_3_0.html) |[bert_qa_bert_base_uncased_qa_squad2](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_uncased_qa_squad2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_base_uncased_fiqa_flm_sq_flit.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_uncased_fiqa_flm_sq_flit_en_3_0.html) |[bert_qa_bert_base_uncased_fiqa_flm_sq_flit](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_uncased_fiqa_flm_sq_flit_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) 
|[en.answer_question.bert_base_uncased_finetuned_vi_infovqa.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_uncased_finetuned_vi_infovqa_en_3_0.html) |[bert_qa_bert_base_uncased_finetuned_vi_infovqa](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_uncased_finetuned_vi_infovqa_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_base_uncased_finetuned_squad_v2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_uncased_finetuned_squad_v2_en_3_0.html) |[bert_qa_bert_base_uncased_finetuned_squad_v2](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_uncased_finetuned_squad_v2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_base_uncased_finetuned_squad_v1.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_uncased_finetuned_squad_v1_en_3_0.html) |[bert_qa_bert_base_uncased_finetuned_squad_v1](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_uncased_finetuned_squad_v1_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_base_uncased_finetuned_squad_frozen_v2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_uncased_finetuned_squad_frozen_v2_en_3_0.html) |[bert_qa_bert_base_uncased_finetuned_squad_frozen_v2](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_uncased_finetuned_squad_frozen_v2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_base_uncased_finetuned_newsqa.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_uncased_finetuned_newsqa_en_3_0.html) |[bert_qa_bert_base_uncased_finetuned_newsqa](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_uncased_finetuned_newsqa_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) 
|[en.answer_question.bert_base_uncased_finetuned_infovqa.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_uncased_finetuned_infovqa_en_3_0.html) |[bert_qa_bert_base_uncased_finetuned_infovqa](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_uncased_finetuned_infovqa_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_base_uncased_finetuned_duorc_bert.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_uncased_finetuned_duorc_bert_en_3_0.html) |[bert_qa_bert_base_uncased_finetuned_duorc_bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_uncased_finetuned_duorc_bert_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_base_uncased_finetuned_docvqa.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_uncased_finetuned_docvqa_en_3_0.html) |[bert_qa_bert_base_uncased_finetuned_docvqa](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_uncased_finetuned_docvqa_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_base_uncased_few_shot_k_64_finetuned_squad_seed_0.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_uncased_few_shot_k_64_finetuned_squad_seed_0_en_3_0.html) |[bert_qa_bert_base_uncased_few_shot_k_64_finetuned_squad_seed_0](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_uncased_few_shot_k_64_finetuned_squad_seed_0_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_base_uncased_few_shot_k_512_finetuned_squad_seed_0.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_uncased_few_shot_k_512_finetuned_squad_seed_0_en_3_0.html) |[bert_qa_bert_base_uncased_few_shot_k_512_finetuned_squad_seed_0](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_uncased_few_shot_k_512_finetuned_squad_seed_0_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) 
|[en.answer_question.bert_base_uncased_few_shot_k_32_finetuned_squad_seed_0.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_uncased_few_shot_k_32_finetuned_squad_seed_0_en_3_0.html) |[bert_qa_bert_base_uncased_few_shot_k_32_finetuned_squad_seed_0](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_uncased_few_shot_k_32_finetuned_squad_seed_0_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_base_uncased_few_shot_k_256_finetuned_squad_seed_0.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_uncased_few_shot_k_256_finetuned_squad_seed_0_en_3_0.html) |[bert_qa_bert_base_uncased_few_shot_k_256_finetuned_squad_seed_0](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_uncased_few_shot_k_256_finetuned_squad_seed_0_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_base_uncased_few_shot_k_16_finetuned_squad_seed_42.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_uncased_few_shot_k_16_finetuned_squad_seed_42_en_3_0.html) |[bert_qa_bert_base_uncased_few_shot_k_16_finetuned_squad_seed_42](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_uncased_few_shot_k_16_finetuned_squad_seed_42_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_base_uncased_few_shot_k_128_finetuned_squad_seed_0.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_uncased_few_shot_k_128_finetuned_squad_seed_0_en_3_0.html) |[bert_qa_bert_base_uncased_few_shot_k_128_finetuned_squad_seed_0](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_uncased_few_shot_k_128_finetuned_squad_seed_0_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_base_uncased_few_shot_k_1024_finetuned_squad_seed_42.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_uncased_few_shot_k_1024_finetuned_squad_seed_42_en_3_0.html) 
|[bert_qa_bert_base_uncased_few_shot_k_1024_finetuned_squad_seed_42](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_uncased_few_shot_k_1024_finetuned_squad_seed_42_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_base_uncased_coqa.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_uncased_coqa_en_3_0.html) |[bert_qa_bert_base_uncased_coqa](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_uncased_coqa_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_base_turkish_cased_finetuned_lr_2e_05_epochs_3.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_turkish_cased_finetuned_lr_2e_05_epochs_3_en_3_0.html) |[bert_qa_bert_base_turkish_cased_finetuned_lr_2e_05_epochs_3](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_turkish_cased_finetuned_lr_2e_05_epochs_3_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_base_swedish_cased_squad_experimental.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_swedish_cased_squad_experimental_en_3_0.html) |[bert_qa_bert_base_swedish_cased_squad_experimental](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_swedish_cased_squad_experimental_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_base_squadv1.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_squadv1_en_3_0.html) |[bert_qa_bert_base_squadv1](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_squadv1_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_base_spanish_wwm_uncased_finetuned_qa_tar.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_spanish_wwm_uncased_finetuned_qa_tar_en_3_0.html) |[bert_qa_bert_base_spanish_wwm_uncased_finetuned_qa_tar](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_spanish_wwm_uncased_finetuned_qa_tar_en_3_0.html) 
|\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_base_spanish_wwm_uncased_finetuned_qa_sqac.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_spanish_wwm_uncased_finetuned_qa_sqac_en_3_0.html) |[bert_qa_bert_base_spanish_wwm_uncased_finetuned_qa_sqac](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_spanish_wwm_uncased_finetuned_qa_sqac_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_base_spanish_wwm_uncased_finetuned_qa_mlqa.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_spanish_wwm_uncased_finetuned_qa_mlqa_en_3_0.html) |[bert_qa_bert_base_spanish_wwm_uncased_finetuned_qa_mlqa](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_spanish_wwm_uncased_finetuned_qa_mlqa_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_base_spanish_wwm_cased_finetuned_qa_tar.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_spanish_wwm_cased_finetuned_qa_tar_en_3_0.html) |[bert_qa_bert_base_spanish_wwm_cased_finetuned_qa_tar](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_spanish_wwm_cased_finetuned_qa_tar_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_base_spanish_wwm_cased_finetuned_qa_sqac.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_spanish_wwm_cased_finetuned_qa_sqac_en_3_0.html) |[bert_qa_bert_base_spanish_wwm_cased_finetuned_qa_sqac](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_spanish_wwm_cased_finetuned_qa_sqac_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_base_spanish_wwm_cased_finetuned_qa_mlqa.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_spanish_wwm_cased_finetuned_qa_mlqa_en_3_0.html) 
|[bert_qa_bert_base_spanish_wwm_cased_finetuned_qa_mlqa](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_spanish_wwm_cased_finetuned_qa_mlqa_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_base_multilingual_xquad.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_multilingual_xquad_en_3_0.html) |[bert_qa_bert_base_multilingual_xquad](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_multilingual_xquad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_base_multilingual_uncased_finetuned_squad.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_multilingual_uncased_finetuned_squad_en_3_0.html) |[bert_qa_bert_base_multilingual_uncased_finetuned_squad](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_multilingual_uncased_finetuned_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_base_multilingual_cased_korquad_v1.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_multilingual_cased_korquad_v1_en_3_0.html) |[bert_qa_bert_base_multilingual_cased_korquad_v1](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_multilingual_cased_korquad_v1_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_base_multilingual_cased_korquad.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_multilingual_cased_korquad_en_3_0.html) |[bert_qa_bert_base_multilingual_cased_korquad](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_multilingual_cased_korquad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_base_multilingual_cased_finetuned_squadv1.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_multilingual_cased_finetuned_squadv1_en_3_0.html) 
|[bert_qa_bert_base_multilingual_cased_finetuned_squadv1](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_multilingual_cased_finetuned_squadv1_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_base_multilingual_cased_finetuned_klue.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_multilingual_cased_finetuned_klue_en_3_0.html) |[bert_qa_bert_base_multilingual_cased_finetuned_klue](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_multilingual_cased_finetuned_klue_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_base_multilingual_cased_finetuned_chaii.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_multilingual_cased_finetuned_chaii_en_3_0.html) |[bert_qa_bert_base_multilingual_cased_finetuned_chaii](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_multilingual_cased_finetuned_chaii_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_base_finetuned_squad2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_finetuned_squad2_en_3_0.html) |[bert_qa_bert_base_finetuned_squad2](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_finetuned_squad2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_base_faquad.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_faquad_en_3_0.html) |[bert_qa_bert_base_faquad](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_faquad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_base_cased_squad_v1.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_cased_squad_v1_en_3_0.html) |[bert_qa_bert_base_cased_squad_v1](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_cased_squad_v1_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) 
|[en.answer_question.bert_base_cased_finetuned_squad_test.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_cased_finetuned_squad_test_en_3_0.html) |[bert_qa_bert_base_cased_finetuned_squad_test](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_cased_finetuned_squad_test_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_base_cased_chaii.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_cased_chaii_en_3_0.html) |[bert_qa_bert_base_cased_chaii](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_cased_chaii_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_base_cased_IUChatbot_ontologyDts.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_cased_IUChatbot_ontologyDts_en_3_0.html) |[bert_qa_bert_base_cased_IUChatbot_ontologyDts](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_cased_IUChatbot_ontologyDts_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_base_512_full_trivia.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_512_full_trivia_en_3_0.html) |[bert_qa_bert_base_512_full_trivia](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_512_full_trivia_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_base_4096_full_trivia_copied_embeddings.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_4096_full_trivia_copied_embeddings_en_3_0.html) |[bert_qa_bert_base_4096_full_trivia_copied_embeddings](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_4096_full_trivia_copied_embeddings_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_base_2048_full_trivia_copied_embeddings.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_2048_full_trivia_copied_embeddings_en_3_0.html) 
|[bert_qa_bert_base_2048_full_trivia_copied_embeddings](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_2048_full_trivia_copied_embeddings_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_base_1024_full_trivia_copied_embeddings.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_1024_full_trivia_copied_embeddings_en_3_0.html) |[bert_qa_bert_base_1024_full_trivia_copied_embeddings](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_1024_full_trivia_copied_embeddings_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_all_translated.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_all_translated_en_3_0.html) |[bert_qa_bert_all_translated](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_all_translated_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_all_squad_que_translated.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_all_squad_que_translated_en_3_0.html) |[bert_qa_bert_all_squad_que_translated](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_all_squad_que_translated_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_all_squad_ben_tel_context.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_all_squad_ben_tel_context_en_3_0.html) |[bert_qa_bert_all_squad_ben_tel_context](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_all_squad_ben_tel_context_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_all_squad_all_translated.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_all_squad_all_translated_en_3_0.html) |[bert_qa_bert_all_squad_all_translated](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_all_squad_all_translated_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) 
|[en.answer_question.bert_all.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_all_en_3_0.html) |[bert_qa_bert_all](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_all_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_FT_newsqa.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_FT_newsqa_en_3_0.html) |[bert_qa_bert_FT_newsqa](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_FT_newsqa_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert_FT_new_newsqa.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_FT_new_newsqa_en_3_0.html) |[bert_qa_bert_FT_new_newsqa](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_FT_new_newsqa_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bert.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_en_3_0.html) |[bert_qa_bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.bdickson_bert_base_uncased_finetuned_squad.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bdickson_bert_base_uncased_finetuned_squad_en_3_0.html) |[bert_qa_bdickson_bert_base_uncased_finetuned_squad](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bdickson_bert_base_uncased_finetuned_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.batteryscibert_uncased_squad_v1.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_batteryscibert_uncased_squad_v1_en_3_0.html) |[bert_qa_batteryscibert_uncased_squad_v1](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_batteryscibert_uncased_squad_v1_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.batteryscibert_cased_squad_v1.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_batteryscibert_cased_squad_v1_en_3_0.html) 
|[bert_qa_batteryscibert_cased_squad_v1](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_batteryscibert_cased_squad_v1_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.batteryonlybert_uncased_squad_v1.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_batteryonlybert_uncased_squad_v1_en_3_0.html) |[bert_qa_batteryonlybert_uncased_squad_v1](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_batteryonlybert_uncased_squad_v1_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.batteryonlybert_cased_squad_v1.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_batteryonlybert_cased_squad_v1_en_3_0.html) |[bert_qa_batteryonlybert_cased_squad_v1](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_batteryonlybert_cased_squad_v1_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.batterydata_bert_base_uncased_squad_v1.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_batterydata_bert_base_uncased_squad_v1_en_3_0.html) |[bert_qa_batterydata_bert_base_uncased_squad_v1](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_batterydata_bert_base_uncased_squad_v1_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.batterybert_uncased_squad_v1.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_batterybert_uncased_squad_v1_en_3_0.html) |[bert_qa_batterybert_uncased_squad_v1](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_batterybert_uncased_squad_v1_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.batterybert_cased_squad_v1.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_batterybert_cased_squad_v1_en_3_0.html) |[bert_qa_batterybert_cased_squad_v1](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_batterybert_cased_squad_v1_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) 
|[en.answer_question.base_v2_squad.albert](https://nlp.johnsnowlabs.com/2022/06/24/albert_qa_base_v2_squad_en_3_0.html) |[albert_qa_base_v2_squad](https://nlp.johnsnowlabs.com/2022/06/24/albert_qa_base_v2_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.augmented_Squad_Translated.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_augmented_Squad_Translated_en_3_0.html) |[bert_qa_augmented_Squad_Translated](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_augmented_Squad_Translated_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.augmented.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_augmented_en_3_0.html) |[bert_qa_augmented](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_augmented_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.araSpeedest.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_araSpeedest_en_3_0.html) |[bert_qa_araSpeedest](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_araSpeedest_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.ankitkupadhyay_bert_finetuned_squad.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_ankitkupadhyay_bert_finetuned_squad_en_3_0.html) |[bert_qa_ankitkupadhyay_bert_finetuned_squad](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_ankitkupadhyay_bert_finetuned_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.andresestevez_bert_finetuned_squad_accelerate.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_andresestevez_bert_finetuned_squad_accelerate_en_3_0.html) |[bert_qa_andresestevez_bert_finetuned_squad_accelerate](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_andresestevez_bert_finetuned_squad_accelerate_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) 
|[en.answer_question.andresestevez_bert_base_cased_finetuned_squad.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_andresestevez_bert_base_cased_finetuned_squad_en_3_0.html) |[bert_qa_andresestevez_bert_base_cased_finetuned_squad](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_andresestevez_bert_base_cased_finetuned_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.ai_club_inductions_21_nlp.albert](https://nlp.johnsnowlabs.com/2022/06/24/albert_qa_ai_club_inductions_21_nlp_en_3_0.html) |[albert_qa_ai_club_inductions_21_nlp](https://nlp.johnsnowlabs.com/2022/06/24/albert_qa_ai_club_inductions_21_nlp_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.Trial_3_Results.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_Trial_3_Results_en_3_0.html) |[bert_qa_Trial_3_Results](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_Trial_3_Results_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.Tianle_bert_base_uncased_finetuned_squad.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_Tianle_bert_base_uncased_finetuned_squad_en_3_0.html) |[bert_qa_Tianle_bert_base_uncased_finetuned_squad](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_Tianle_bert_base_uncased_finetuned_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.SupriyaArun_bert_base_uncased_finetuned_squad.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_SupriyaArun_bert_base_uncased_finetuned_squad_en_3_0.html) |[bert_qa_SupriyaArun_bert_base_uncased_finetuned_squad](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_SupriyaArun_bert_base_uncased_finetuned_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.SreyanG_NVIDIA_bert_base_uncased_finetuned_squad.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_SreyanG_NVIDIA_bert_base_uncased_finetuned_squad_en_3_0.html) 
|[bert_qa_SreyanG_NVIDIA_bert_base_uncased_finetuned_squad](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_SreyanG_NVIDIA_bert_base_uncased_finetuned_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.SreyanG_NVIDIA_bert_base_cased_finetuned_squad.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_SreyanG_NVIDIA_bert_base_cased_finetuned_squad_en_3_0.html) |[bert_qa_SreyanG_NVIDIA_bert_base_cased_finetuned_squad](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_SreyanG_NVIDIA_bert_base_cased_finetuned_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.Spanbert_emotion_extraction.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_Spanbert_emotion_extraction_en_3_0.html) |[bert_qa_Spanbert_emotion_extraction](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_Spanbert_emotion_extraction_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.Sotireas_BiomedNLP_PubMedBERT_base_uncased_abstract_fulltext_ContaminationQAmodel_PubmedBERT.bert](https://nlp.johnsnowlabs.com) |[bert_qa_Sotireas_BiomedNLP_PubMedBERT_base_uncased_abstract_fulltext_ContaminationQAmodel_PubmedBERT](https://nlp.johnsnowlabs.com) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.Shushant_BiomedNLP_PubMedBERT_base_uncased_abstract_fulltext_ContaminationQAmodel_PubmedBERT.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_Shushant_BiomedNLP_PubMedBERT_base_uncased_abstract_fulltext_ContaminationQAmodel_PubmedBERT_en_3_0.html) |[bert_qa_Shushant_BiomedNLP_PubMedBERT_base_uncased_abstract_fulltext_ContaminationQAmodel_PubmedBERT](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_Shushant_BiomedNLP_PubMedBERT_base_uncased_abstract_fulltext_ContaminationQAmodel_PubmedBERT_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) 
|[en.answer_question.Seongkyu_bert_base_cased_finetuned_squad.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_Seongkyu_bert_base_cased_finetuned_squad_en_3_0.html) |[bert_qa_Seongkyu_bert_base_cased_finetuned_squad](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_Seongkyu_bert_base_cased_finetuned_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.SciBERT_SQuAD_QuAC.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_SciBERT_SQuAD_QuAC_en_3_0.html) |[bert_qa_SciBERT_SQuAD_QuAC](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_SciBERT_SQuAD_QuAC_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.QA_1e.albert](https://nlp.johnsnowlabs.com/2022/06/24/albert_qa_QA_1e_en_3_0.html) |[albert_qa_QA_1e](https://nlp.johnsnowlabs.com/2022/06/24/albert_qa_QA_1e_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.PruebaBert.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_PruebaBert_en_3_0.html) |[bert_qa_PruebaBert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_PruebaBert_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.Paul_Vinh_bert_base_multilingual_cased_finetuned_squad.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_Paul_Vinh_bert_base_multilingual_cased_finetuned_squad_en_3_0.html) |[bert_qa_Paul_Vinh_bert_base_multilingual_cased_finetuned_squad](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_Paul_Vinh_bert_base_multilingual_cased_finetuned_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.Part_2_mBERT_Model_E2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_Part_2_mBERT_Model_E2_en_3_0.html) |[bert_qa_Part_2_mBERT_Model_E2](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_Part_2_mBERT_Model_E2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) 
|[en.answer_question.Part_2_BERT_Multilingual_Dutch_Model_E1.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_Part_2_BERT_Multilingual_Dutch_Model_E1_en_3_0.html) |[bert_qa_Part_2_BERT_Multilingual_Dutch_Model_E1](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_Part_2_BERT_Multilingual_Dutch_Model_E1_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.Part_1_mBERT_Model_E2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_Part_1_mBERT_Model_E2_en_3_0.html) |[bert_qa_Part_1_mBERT_Model_E2](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_Part_1_mBERT_Model_E2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.Neulvo_bert_finetuned_squad.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_Neulvo_bert_finetuned_squad_en_3_0.html) |[bert_qa_Neulvo_bert_finetuned_squad](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_Neulvo_bert_finetuned_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.Multi_ling_BERT.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_Multi_ling_BERT_en_3_0.html) |[bert_qa_Multi_ling_BERT](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_Multi_ling_BERT_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.MiniLM_L12_H384_uncased_finetuned_squad.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_MiniLM_L12_H384_uncased_finetuned_squad_en_3_0.html) |[bert_qa_MiniLM_L12_H384_uncased_finetuned_squad](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_MiniLM_L12_H384_uncased_finetuned_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.ManuERT_for_xqua.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_ManuERT_for_xqua_en_3_0.html) |[bert_qa_ManuERT_for_xqua](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_ManuERT_for_xqua_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) 
|[en.answer_question.MTL_bert_base_uncased_ww_squad.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_MTL_bert_base_uncased_ww_squad_en_3_0.html) |[bert_qa_MTL_bert_base_uncased_ww_squad](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_MTL_bert_base_uncased_ww_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.Laikokwei_bert_finetuned_squad.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_Laikokwei_bert_finetuned_squad_en_3_0.html) |[bert_qa_Laikokwei_bert_finetuned_squad](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_Laikokwei_bert_finetuned_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.Klue_CommonSense_model.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_Klue_CommonSense_model_en_3_0.html) |[bert_qa_Klue_CommonSense_model](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_Klue_CommonSense_model_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.KevinChoi_bert_finetuned_squad_accelerate.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_KevinChoi_bert_finetuned_squad_accelerate_en_3_0.html) |[bert_qa_KevinChoi_bert_finetuned_squad_accelerate](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_KevinChoi_bert_finetuned_squad_accelerate_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.KevinChoi_bert_finetuned_squad.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_KevinChoi_bert_finetuned_squad_en_3_0.html) |[bert_qa_KevinChoi_bert_finetuned_squad](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_KevinChoi_bert_finetuned_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.HomayounSadri_bert_base_uncased_finetuned_squad.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_HomayounSadri_bert_base_uncased_finetuned_squad_en_3_0.html) 
|[bert_qa_HomayounSadri_bert_base_uncased_finetuned_squad](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_HomayounSadri_bert_base_uncased_finetuned_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.Harsit_bert_finetuned_squad.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_Harsit_bert_finetuned_squad_en_3_0.html) |[bert_qa_Harsit_bert_finetuned_squad](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_Harsit_bert_finetuned_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.Graphcore_bert_large_uncased_squad.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_Graphcore_bert_large_uncased_squad_en_3_0.html) |[bert_qa_Graphcore_bert_large_uncased_squad](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_Graphcore_bert_large_uncased_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.FardinSaboori_bert_finetuned_squad.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_FardinSaboori_bert_finetuned_squad_en_3_0.html) |[bert_qa_FardinSaboori_bert_finetuned_squad](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_FardinSaboori_bert_finetuned_squad_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.COVID_BERTc.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_COVID_BERTc_en_3_0.html) |[bert_qa_COVID_BERTc](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_COVID_BERTc_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.COVID_BERTb.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_COVID_BERTb_en_3_0.html) |[bert_qa_COVID_BERTb](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_COVID_BERTb_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.COVID_BERTa.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_COVID_BERTa_en_3_0.html) |[bert_qa_COVID_BERTa](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_COVID_BERTa_en_3_0.html) |\n", 
- "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.BioM_xxlarge_SQuAD2.albert](https://nlp.johnsnowlabs.com/2022/06/24/albert_qa_BioM_xxlarge_SQuAD2_en_3_0.html) |[albert_qa_BioM_xxlarge_SQuAD2](https://nlp.johnsnowlabs.com/2022/06/24/albert_qa_BioM_xxlarge_SQuAD2_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.Bertv1_fine.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_Bertv1_fine_en_3_0.html) |[bert_qa_Bertv1_fine](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_Bertv1_fine_en_3_0.html) |\n", - "|[English](https://iso639-3.sil.org/code/eng) |[en.answer_question.Alexander_Learn_bert_finetuned_squad.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_Alexander_Learn_bert_finetuned_squad_en_3_0.html) |[bert_qa_Alexander_Learn_bert_finetuned_squad](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_Alexander_Learn_bert_finetuned_squad_en_3_0.html) |\n", - "|[German](https://iso639-3.sil.org/code/deu) |[de.answer_question.bert_multi_english_german_squad2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_multi_english_german_squad2_de_3_0.html) |[bert_qa_bert_multi_english_german_squad2](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_multi_english_german_squad2_de_3_0.html) |\n", - "|[German](https://iso639-3.sil.org/code/deu) |[de.answer_question.GBERTQnA.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_GBERTQnA_de_3_0.html) |[bert_qa_GBERTQnA](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_GBERTQnA_de_3_0.html) |\n", - "|[Hebrew](https://iso639-3.sil.org/code/heb) |[he.answer_question.hebert_finetuned_hebrew_squad.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_hebert_finetuned_hebrew_squad_he_3_0.html) |[bert_qa_hebert_finetuned_hebrew_squad](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_hebert_finetuned_hebrew_squad_he_3_0.html) |\n", - "|[Hungarian](https://iso639-3.sil.org/code/hun) 
|[hu.answer_question.huBert_fine_tuned_hungarian_squadv1.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_huBert_fine_tuned_hungarian_squadv1_hu_3_0.html) |[bert_qa_huBert_fine_tuned_hungarian_squadv1](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_huBert_fine_tuned_hungarian_squadv1_hu_3_0.html) |\n", - "|[Indonesian](https://iso639-3.sil.org/code/ind) |[id.answer_question.Indobert_QA.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_Indobert_QA_id_3_0.html) |[bert_qa_Indobert_QA](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_Indobert_QA_id_3_0.html) |\n", - "|[Italian](https://iso639-3.sil.org/code/ita) |[it.answer_question.squad_xxl_cased_hub1.bert](https://nlp.johnsnowlabs.com/2022/06/28/bert_qa_squad_xxl_cased_hub1_it_3_0.html) |[bert_qa_squad_xxl_cased_hub1](https://nlp.johnsnowlabs.com/2022/06/28/bert_qa_squad_xxl_cased_hub1_it_3_0.html) |\n", - "|[Italian](https://iso639-3.sil.org/code/ita) |[it.answer_question.bert_italian_finedtuned_squadv1_it_alfa.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_italian_finedtuned_squadv1_it_alfa_it_3_0.html) |[bert_qa_bert_italian_finedtuned_squadv1_it_alfa](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_italian_finedtuned_squadv1_it_alfa_it_3_0.html) |\n", - "|[Italian](https://iso639-3.sil.org/code/ita) |[it.answer_question.bert_base_italian_uncased_squad_it_antoniocappiello.bert](https://nlp.johnsnowlabs.com/2022/06/03/bert_qa_bert_base_italian_uncased_squad_it_antoniocappiello_it_3_0.html) |[bert_qa_bert_base_italian_uncased_squad_it_antoniocappiello](https://nlp.johnsnowlabs.com/2022/06/03/bert_qa_bert_base_italian_uncased_squad_it_antoniocappiello_it_3_0.html) |\n", - "|[Japanese](https://iso639-3.sil.org/code/jpn) |[ja.answer_question.large_japanese_wikipedia_ud_head.bert](https://nlp.johnsnowlabs.com/2022/06/28/bert_qa_large_japanese_wikipedia_ud_head_ja_3_0.html) 
|[bert_qa_large_japanese_wikipedia_ud_head](https://nlp.johnsnowlabs.com/2022/06/28/bert_qa_large_japanese_wikipedia_ud_head_ja_3_0.html) |\n", - "|[Japanese](https://iso639-3.sil.org/code/jpn) |[ja.answer_question.base_japanese_wikipedia_ud_head.bert](https://nlp.johnsnowlabs.com/2022/06/28/bert_qa_base_japanese_wikipedia_ud_head_ja_3_0.html) |[bert_qa_base_japanese_wikipedia_ud_head](https://nlp.johnsnowlabs.com/2022/06/28/bert_qa_base_japanese_wikipedia_ud_head_ja_3_0.html) |\n", - "|[Korean](https://iso639-3.sil.org/code/kor) |[ko.answer_question.klue_bert_base_aihub_mrc.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_klue_bert_base_aihub_mrc_ko_3_0.html) |[bert_qa_klue_bert_base_aihub_mrc](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_klue_bert_base_aihub_mrc_ko_3_0.html) |\n", - "|[Korean](https://iso639-3.sil.org/code/kor) |[ko.answer_question.bespin_global_klue_bert_base_mrc.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bespin_global_klue_bert_base_mrc_ko_3_0.html) |[bert_qa_bespin_global_klue_bert_base_mrc](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bespin_global_klue_bert_base_mrc_ko_3_0.html) |\n", - "|[Korean](https://iso639-3.sil.org/code/kor) |[ko.answer_question.ainize_klue_bert_base_mrc.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_ainize_klue_bert_base_mrc_ko_3_0.html) |[bert_qa_ainize_klue_bert_base_mrc](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_ainize_klue_bert_base_mrc_ko_3_0.html) |\n", - "|[Modern Greek (1453-)](https://iso639-3.sil.org/code/ell) |[el.answer_question.qacombination_bert_el_Danastos.bert](https://nlp.johnsnowlabs.com/2022/06/03/bert_qa_qacombination_bert_el_Danastos_el_3_0.html) |[bert_qa_qacombination_bert_el_Danastos](https://nlp.johnsnowlabs.com/2022/06/03/bert_qa_qacombination_bert_el_Danastos_el_3_0.html) |\n", - "|[Polish](https://iso639-3.sil.org/code/pol) 
|[pl.answer_question.bert_base_multilingual_cased_finetuned_polish_squad2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_multilingual_cased_finetuned_polish_squad2_pl_3_0.html) |[bert_qa_bert_base_multilingual_cased_finetuned_polish_squad2](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_multilingual_cased_finetuned_polish_squad2_pl_3_0.html) |\n", - "|[Polish](https://iso639-3.sil.org/code/pol) |[pl.answer_question.bert_base_multilingual_cased_finetuned_polish_squad1.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_multilingual_cased_finetuned_polish_squad1_pl_3_0.html) |[bert_qa_bert_base_multilingual_cased_finetuned_polish_squad1](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_multilingual_cased_finetuned_polish_squad1_pl_3_0.html) |\n", - "|[Portuguese](https://iso639-3.sil.org/code/por) |[pt.answer_question.bioBERTpt_squad_v1.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bioBERTpt_squad_v1.1_portuguese_pt_3_0.html) |[bert_qa_bioBERTpt_squad_v1.1_portuguese](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bioBERTpt_squad_v1.1_portuguese_pt_3_0.html) |\n", - "|[Portuguese](https://iso639-3.sil.org/code/por) |[pt.answer_question.bert_large_cased_squad_v1.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_large_cased_squad_v1.1_portuguese_pt_3_0.html) |[bert_qa_bert_large_cased_squad_v1.1_portuguese](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_bert_large_cased_squad_v1.1_portuguese_pt_3_0.html) |\n", - "|[Portuguese](https://iso639-3.sil.org/code/por) |[pt.answer_question.bert_base_portuguese_cased_finetuned_squad_v1_pt_mrm8488.bert](https://nlp.johnsnowlabs.com/2022/06/03/bert_qa_bert_base_portuguese_cased_finetuned_squad_v1_pt_mrm8488_pt_3_0.html) |[bert_qa_bert_base_portuguese_cased_finetuned_squad_v1_pt_mrm8488](https://nlp.johnsnowlabs.com/2022/06/03/bert_qa_bert_base_portuguese_cased_finetuned_squad_v1_pt_mrm8488_pt_3_0.html) |\n", - "|[Portuguese](https://iso639-3.sil.org/code/por) 
|[pt.answer_question.bert_base_cased_squad_v1.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_cased_squad_v1.1_portuguese_pt_3_0.html) |[bert_qa_bert_base_cased_squad_v1.1_portuguese](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_cased_squad_v1.1_portuguese_pt_3_0.html) |\n", - "|[Sinhala, Sinhalese](https://iso639-3.sil.org/code/sin) |[si.answer_question.bert_base_sinhala_qa.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_sinhala_qa_si_3_0.html) |[bert_qa_bert_base_sinhala_qa](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_sinhala_qa_si_3_0.html) |\n", - "|[Swedish](https://iso639-3.sil.org/code/swe) |[sv.answer_question.bert_base_swedish_squad2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_swedish_squad2_sv_3_0.html) |[bert_qa_bert_base_swedish_squad2](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_swedish_squad2_sv_3_0.html) |\n", - "|[Thai](https://iso639-3.sil.org/code/tha) |[th.answer_question.xquad_th_mbert_base.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_xquad_th_mbert_base_th_3_0.html) |[bert_qa_xquad_th_mbert_base](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_xquad_th_mbert_base_th_3_0.html) |\n", - "|[Thai](https://iso639-3.sil.org/code/tha) |[th.answer_question.thai_bert_multi_cased_finetuned_xquadv1_finetuned_squad.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_thai_bert_multi_cased_finetuned_xquadv1_finetuned_squad_th_3_0.html) |[bert_qa_thai_bert_multi_cased_finetuned_xquadv1_finetuned_squad](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_thai_bert_multi_cased_finetuned_xquadv1_finetuned_squad_th_3_0.html) |\n", - "|[Thai](https://iso639-3.sil.org/code/tha) |[th.answer_question.bert_base_multilingual_cased_finetune_qa.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_multilingual_cased_finetune_qa_th_3_0.html) 
|[bert_qa_bert_base_multilingual_cased_finetune_qa](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_multilingual_cased_finetune_qa_th_3_0.html) |\n", - "|[Turkish](https://iso639-3.sil.org/code/tur) |[tr.answer_question.loodos_bert_base_uncased_QA_fine_tuned.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_loodos_bert_base_uncased_QA_fine_tuned_tr_3_0.html) |[bert_qa_loodos_bert_base_uncased_QA_fine_tuned](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_loodos_bert_base_uncased_QA_fine_tuned_tr_3_0.html) |\n", - "|[Turkish](https://iso639-3.sil.org/code/tur) |[tr.answer_question.logo_qna_model.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_logo_qna_model_tr_3_0.html) |[bert_qa_logo_qna_model](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_logo_qna_model_tr_3_0.html) |\n", - "|[Turkish](https://iso639-3.sil.org/code/tur) |[tr.answer_question.distilbert_tr_q_a.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_distilbert_tr_q_a_tr_3_0.html) |[bert_qa_distilbert_tr_q_a](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_distilbert_tr_q_a_tr_3_0.html) |\n", - "|[Turkish](https://iso639-3.sil.org/code/tur) |[tr.answer_question.bert_turkish_question_answering.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_turkish_question_answering_tr_3_0.html) |[bert_qa_bert_turkish_question_answering](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_turkish_question_answering_tr_3_0.html) |\n", - "|[Turkish](https://iso639-3.sil.org/code/tur) |[tr.answer_question.bert_base_turkish_squad.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_turkish_squad_tr_3_0.html) |[bert_qa_bert_base_turkish_squad](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_turkish_squad_tr_3_0.html) |\n", - "|[Arabic](https://iso639-3.sil.org/code/ara) |[ar.answer_question.arap_qa_bert_v2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_arap_qa_bert_v2_ar_3_0.html) 
|[bert_qa_arap_qa_bert_v2](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_arap_qa_bert_v2_ar_3_0.html) |\n", - "|[Arabic](https://iso639-3.sil.org/code/ara) |[ar.answer_question.arap_qa_bert_large_v2.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_arap_qa_bert_large_v2_ar_3_0.html) |[bert_qa_arap_qa_bert_large_v2](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_arap_qa_bert_large_v2_ar_3_0.html) |\n", - "|[Arabic](https://iso639-3.sil.org/code/ara) |[ar.answer_question.arap_qa_bert.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_arap_qa_bert_ar_3_0.html) |[bert_qa_arap_qa_bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_arap_qa_bert_ar_3_0.html) |\n", - "|[Chinese](https://iso639-3.sil.org/code/zho) |[zh.answer_question.roberta_base_chinese_extractive_qa_scratch.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_roberta_base_chinese_extractive_qa_scratch_zh_3_0.html) |[bert_qa_roberta_base_chinese_extractive_qa_scratch](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_roberta_base_chinese_extractive_qa_scratch_zh_3_0.html) |\n", - "|[Chinese](https://iso639-3.sil.org/code/zho) |[zh.answer_question.roberta_base_chinese_extractive_qa.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_roberta_base_chinese_extractive_qa_zh_3_0.html) |[bert_qa_roberta_base_chinese_extractive_qa](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_roberta_base_chinese_extractive_qa_zh_3_0.html) |\n", - "|[Chinese](https://iso639-3.sil.org/code/zho) |[zh.answer_question.question_answering_chinese.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_question_answering_chinese_zh_3_0.html) |[bert_qa_question_answering_chinese](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_question_answering_chinese_zh_3_0.html) |\n", - "|[Chinese](https://iso639-3.sil.org/code/zho) |[zh.answer_question.qa_roberta_base_chinese_extractive.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_qa_roberta_base_chinese_extractive_zh_3_0.html) 
|[bert_qa_qa_roberta_base_chinese_extractive](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_qa_roberta_base_chinese_extractive_zh_3_0.html) |\n", - "|[Chinese](https://iso639-3.sil.org/code/zho) |[zh.answer_question.multilingual_bert_base_cased_chinese.bert](https://nlp.johnsnowlabs.com/2022/06/03/bert_qa_multilingual_bert_base_cased_chinese_zh_3_0.html) |[bert_qa_multilingual_bert_base_cased_chinese](https://nlp.johnsnowlabs.com/2022/06/03/bert_qa_multilingual_bert_base_cased_chinese_zh_3_0.html) |\n", - "|[Chinese](https://iso639-3.sil.org/code/zho) |[zh.answer_question.chinese_pretrain_mrc_roberta_wwm_ext_large.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_chinese_pretrain_mrc_roberta_wwm_ext_large_zh_3_0.html) |[bert_qa_chinese_pretrain_mrc_roberta_wwm_ext_large](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_chinese_pretrain_mrc_roberta_wwm_ext_large_zh_3_0.html) |\n", - "|[Chinese](https://iso639-3.sil.org/code/zho) |[zh.answer_question.chinese_pretrain_mrc_macbert_large.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_chinese_pretrain_mrc_macbert_large_zh_3_0.html) |[bert_qa_chinese_pretrain_mrc_macbert_large](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_chinese_pretrain_mrc_macbert_large_zh_3_0.html) |\n", - "|[Chinese](https://iso639-3.sil.org/code/zho) |[zh.answer_question.chinese_pert_large_open_domain_mrc.bert](https://nlp.johnsnowlabs.com/2022/06/28/bert_qa_chinese_pert_large_open_domain_mrc_zh_3_0.html) |[bert_qa_chinese_pert_large_open_domain_mrc](https://nlp.johnsnowlabs.com/2022/06/28/bert_qa_chinese_pert_large_open_domain_mrc_zh_3_0.html) |\n", - "|[Chinese](https://iso639-3.sil.org/code/zho) |[zh.answer_question.chinese_pert_large_mrc.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_chinese_pert_large_mrc_zh_3_0.html) |[bert_qa_chinese_pert_large_mrc](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_chinese_pert_large_mrc_zh_3_0.html) |\n", - "|[Chinese](https://iso639-3.sil.org/code/zho) 
|[zh.answer_question.chinese_pert_base_mrc.bert](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_chinese_pert_base_mrc_zh_3_0.html) |[bert_qa_chinese_pert_base_mrc](https://nlp.johnsnowlabs.com/2022/06/06/bert_qa_chinese_pert_base_mrc_zh_3_0.html) |\n", - "|[Chinese](https://iso639-3.sil.org/code/zho) |[zh.answer_question.bert_chinese_finetuned.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_chinese_finetuned_zh_3_0.html) |[bert_qa_bert_chinese_finetuned](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_chinese_finetuned_zh_3_0.html) |\n", - "|[Chinese](https://iso639-3.sil.org/code/zho) |[zh.answer_question.bert_base_chinese_finetuned_squad_colab.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_chinese_finetuned_squad_colab_zh_3_0.html) |[bert_qa_bert_base_chinese_finetuned_squad_colab](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_chinese_finetuned_squad_colab_zh_3_0.html) |\n", - "|[Persian](https://iso639-3.sil.org/code/fas) |[fa.answer_question.bert_base_fa_qa.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_fa_qa_fa_3_0.html) |[bert_qa_bert_base_fa_qa](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_base_fa_qa_fa_3_0.html) |\n", - "| Multilingual |[xx.answer_question.telugu_bertu_tydiqa.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_telugu_bertu_tydiqa_xx_3_0.html) |[bert_qa_telugu_bertu_tydiqa](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_telugu_bertu_tydiqa_xx_3_0.html) |\n", - "| Multilingual |[xx.answer_question.bert_multi_uncased_finetuned_xquadv1.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_multi_uncased_finetuned_xquadv1_xx_3_0.html) |[bert_qa_bert_multi_uncased_finetuned_xquadv1](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_multi_uncased_finetuned_xquadv1_xx_3_0.html) |\n", - "| Multilingual |[xx.answer_question.bert_multi_cased_finetuned_xquadv1.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_multi_cased_finetuned_xquadv1_xx_3_0.html) 
|[bert_qa_bert_multi_cased_finetuned_xquadv1](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_bert_multi_cased_finetuned_xquadv1_xx_3_0.html) |\n", - "| Multilingual |[xx.answer_question.bert_multi_cased_finedtuned_xquad_tydiqa_goldp.bert](https://nlp.johnsnowlabs.com/2022/06/03/bert_qa_bert_multi_cased_finedtuned_xquad_tydiqa_goldp_xx_3_0.html) |[bert_qa_bert_multi_cased_finedtuned_xquad_tydiqa_goldp](https://nlp.johnsnowlabs.com/2022/06/03/bert_qa_bert_multi_cased_finedtuned_xquad_tydiqa_goldp_xx_3_0.html) |\n", - "| Multilingual |[xx.answer_question.Part_1_mBERT_Model_E1.bert](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_Part_1_mBERT_Model_E1_xx_3_0.html) |[bert_qa_Part_1_mBERT_Model_E1](https://nlp.johnsnowlabs.com/2022/06/02/bert_qa_Part_1_mBERT_Model_E1_xx_3_0.html) |" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "4E8ChFP8nMmN" - }, - "source": [ - "### RoBertaForQuestionAnswering" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 38551, - "status": "ok", - "timestamp": 1664910292214, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "MtJPvinzl8Lz", - "outputId": "b0aef501-6d1e-47e4-f012-75350e4ab5e1" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "roberta_qa_roberta_base_squad2_covid download started this may take some time.\n", - "Approximate size to download 442.8 MB\n", - "[OK!]\n" - ] - } - ], - "source": [ - "document_assembler = MultiDocumentAssembler() \\\n", - " .setInputCols([\"question\", \"context\"]) \\\n", - " .setOutputCols([\"document_question\", \"document_context\"])\n", - "\n", - "spanClassifier = RoBertaForQuestionAnswering.pretrained(\"roberta_qa_roberta_base_squad2_covid\",\"en\") \\\n", - " .setInputCols([\"document_question\", \"document_context\"]) \\\n", - " .setOutputCol(\"answer\") 
\\\n", - " .setCaseSensitive(True)\n", - "\n", - "pipeline = Pipeline().setStages([document_assembler,\n", - " spanClassifier])\n", - "\n", - "data = spark.createDataFrame([[\"Do I have Covid?\", \"I have a fever and a cough and for the past few days, I have lost my sense of smell and taste. Later I was diagnosed with Covid.\"]]).toDF(\"question\", \"context\")\n", - "\n", - "result = pipeline.fit(data).transform(data)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 3691, - "status": "ok", - "timestamp": 1664910329759, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "xeDpxyghjJVI", - "outputId": "20ee4a92-06bb-481d-825a-6a16b21dcb3c" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+----------------------------------+\n", - "|result |\n", - "+----------------------------------+\n", - "|[Later I was diagnosed with Covid]|\n", - "+----------------------------------+\n", - "\n" - ] - } - ], - "source": [ - "result.select('answer.result').show(truncate=False)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "DpDBHMaQn0iW" - }, - "source": [ - "### AlbertForQuestionAnswering" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 50871, - "status": "ok", - "timestamp": 1664910383594, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "w4wnfbn8n3_5", - "outputId": "0e0ab7fd-421f-438d-a997-e0a20ee42791" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "albert_qa_xxlargev1_squad2_512 download started this may take some time.\n", - "Approximate size to download 736.4 MB\n", - "[OK!]\n" - ] - } - ], - "source": [ 
- "documentAssembler = MultiDocumentAssembler() \\\n", - " .setInputCols([\"question\", \"context\"]) \\\n", - " .setOutputCols([\"document_question\", \"document_context\"])\n", - "\n", - "spanClassifier = AlbertForQuestionAnswering.pretrained(\"albert_qa_xxlargev1_squad2_512\",\"en\") \\\n", - " .setInputCols([\"document_question\", \"document_context\"]) \\\n", - " .setOutputCol(\"answer\")\\\n", - " .setCaseSensitive(True)\n", - " \n", - "pipeline = Pipeline(stages=[documentAssembler, \n", - " spanClassifier])\n", - "\n", - "data = spark.createDataFrame([[\"Which name is also used to describe the Amazon rainforest in English?\",\"\"\"The Amazon rainforest (Portuguese: Floresta Amazônica or Amazônia; Spanish: Selva Amazónica, Amazonía or usually Amazonia; French: Forêt amazonienne; Dutch: Amazoneregenwoud), also known in English as Amazonia or the Amazon Jungle, is a moist broadleaf forest that covers most of the Amazon basin of South America. This basin encompasses 7,000,000 square kilometres (2,700,000 sq mi), of which 5,500,000 square kilometres (2,100,000 sq mi) are covered by the rainforest. This region includes territory belonging to nine nations. The majority of the forest is contained within Brazil, with 60% of the rainforest, followed by Peru with 13%, Colombia with 10%, and with minor amounts in Venezuela, Ecuador, Bolivia, Guyana, Suriname and French Guiana. States or departments in four nations contain \"Amazonas\" in their names. 
The Amazon represents over half of the planet's remaining rainforests, and comprises the largest and most biodiverse tract of tropical rainforest in the world, with an estimated 390 billion individual trees divided into 16,000 species.\"\"\"]]).toDF(\"question\", \"context\")\n", - "\n", - "result = pipeline.fit(data).transform(data)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 17483, - "status": "ok", - "timestamp": 1664910401062, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "10ClpPGaoERB", - "outputId": "aba3f9ae-6290-47c6-ba05-25b66d8199bd" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+-----------------+\n", - "|result |\n", - "+-----------------+\n", - "|[theAmazonJungle]|\n", - "+-----------------+\n", - "\n" - ] - } - ], - "source": [ - "result.select('answer.result').show(truncate=False)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "rnWnzgUUXJsZ" - }, - "source": [ - "### BertForQuestionAnswering" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 30954, - "status": "ok", - "timestamp": 1664910431991, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "9fQvPY4JXJsa", - "outputId": "d6e1fcca-6050-4922-e704-f6ecba95ef71" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "bert_qa_bert_base_spanish_wwm_cased_finetuned_spa_squad2_es_mrm8488 download started this may take some time.\n", - "Approximate size to download 391 MB\n", - "[OK!]\n" - ] - } - ], - "source": [ - "document_assembler = MultiDocumentAssembler() \\\n", - " .setInputCols([\"question\", \"context\"]) \\\n", - " 
.setOutputCols([\"document_question\", \"document_context\"])\n", - "\n", - "spanClassifier = BertForQuestionAnswering.pretrained(\"bert_qa_bert_base_spanish_wwm_cased_finetuned_spa_squad2_es_mrm8488\",\"es\") \\\n", - " .setInputCols([\"document_question\", \"document_context\"]) \\\n", - " .setOutputCol(\"answer\") \\\n", - " .setCaseSensitive(True)\n", - "\n", - "\n", - "pipeline = Pipeline().setStages([document_assembler,\n", - " spanClassifier])\n", - "\n", - "# Question in Spanish: How many people speak Spanish?\n", - "# Context in Spanish: Spanish is the second most spoken language in the world with more than 442 million speakers\n", - "\n", - "example = spark.createDataFrame([[\"¿Cuántas personas hablan español?\", \"El español es el segundo idioma más hablado del mundo con más de 442 millones de hablantes\"]]).toDF(\"question\", \"context\")\n", - "\n", - "result = pipeline.fit(example).transform(example)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 2208, - "status": "ok", - "timestamp": 1664910434167, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "Kgxml8bBXJsa", - "outputId": "ae9ebf5d-35f6-4031-da8f-230e2d930b91" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+--------------+\n", - "|result |\n", - "+--------------+\n", - "|[442 millones]|\n", - "+--------------+\n", - "\n" - ] - } - ], - "source": [ - "result.select('answer.result').show(truncate=False)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "_kVOZDI3aK6V" - }, - "source": [ - "### DebertaForQuestionAnswering" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 19986, - "status": "ok", - "timestamp": 1664910553398, - "user": 
{ - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "p3_R4F3iaK6W", - "outputId": "0ae9cbb7-13bc-4c0c-f88a-ff9215e227da" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "deberta_v3_xsmall_qa_squad2 download started this may take some time.\n", - "Approximate size to download 240.6 MB\n", - "[OK!]\n" - ] - } - ], - "source": [ - "documentAssembler = MultiDocumentAssembler() \\\n", - " .setInputCols([\"question\", \"context\"]) \\\n", - " .setOutputCols([\"document_question\", \"document_context\"])\n", - "\n", - "spanClassifier = DeBertaForQuestionAnswering .pretrained(\"deberta_v3_xsmall_qa_squad2\",\"en\") \\\n", - " .setInputCols([\"document_question\", \"document_context\"]) \\\n", - " .setOutputCol(\"answer\")\\\n", - " .setCaseSensitive(True)\n", - " \n", - "pipeline = Pipeline(stages=[documentAssembler, \n", - " spanClassifier])\n", - "\n", - "data = spark.createDataFrame([[\"What is my name?\", \"My name is Clara and I live in Berkeley.\"]]).toDF(\"question\", \"context\")\n", - "\n", - "result = pipeline.fit(data).transform(data)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 7865, - "status": "ok", - "timestamp": 1664910561256, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "u28_2t-haK6W", - "outputId": "bc945e1e-d670-4c98-9387-851218a49f33" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+-------+\n", - "|result |\n", - "+-------+\n", - "|[Clara]|\n", - "+-------+\n", - "\n" - ] - } - ], - "source": [ - "result.select('answer.result').show(truncate=False)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ZA1DFJXEavFV" - }, - "source": [ - "### DistilBertForQuestionAnswering" - ] - }, - { - "cell_type": "code", - 
"execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 20860, - "status": "ok", - "timestamp": 1664910582104, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "omf7L7L4avFV", - "outputId": "51e0742c-4704-4be8-adf2-1dab46771695" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "distilbert_base_cased_qa_squad2 download started this may take some time.\n", - "Approximate size to download 232.8 MB\n", - "[OK!]\n" - ] - } - ], - "source": [ - "documentAssembler = MultiDocumentAssembler() \\\n", - " .setInputCols([\"question\", \"context\"]) \\\n", - " .setOutputCols([\"document_question\", \"document_context\"])\n", - "\n", - "spanClassifier = DistilBertForQuestionAnswering.pretrained(\"distilbert_base_cased_qa_squad2\",\"en\") \\\n", - " .setInputCols([\"document_question\", \"document_context\"]) \\\n", - " .setOutputCol(\"answer\")\\\n", - " .setCaseSensitive(True)\n", - " \n", - "pipeline = Pipeline(stages=[documentAssembler, \n", - " spanClassifier])\n", - "\n", - "data = spark.createDataFrame([[\"Where do I live?\", \"My name is Wolfgang and I live in Berlin\"]]).toDF(\"question\", \"context\")\n", - "\n", - "result = pipeline.fit(data).transform(data)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 1867, - "status": "ok", - "timestamp": 1664910583965, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "2YHTKHKYavFW", - "outputId": "1eb08447-1183-41ef-bce6-9eecf532ba55" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+--------+\n", - "|result |\n", - "+--------+\n", - "|[Berlin]|\n", - "+--------+\n", - "\n" - ] - } - ], - "source": [ - 
"result.select('answer.result').show(truncate=False)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "74686exgbfs9" - }, - "source": [ - "### LongformerForQuestionAnswering " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 130747, - "status": "ok", - "timestamp": 1664911424053, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "-XkL2N8Gbfs9", - "outputId": "e2da50e7-5a48-4ada-e182-0a2604bfd81b" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "longformer_qa_large_4096_finetuned_triviaqa download started this may take some time.\n", - "Approximate size to download 1.5 GB\n", - "[OK!]\n" - ] - } - ], - "source": [ - "documentAssembler = MultiDocumentAssembler() \\\n", - " .setInputCols([\"question\", \"context\"]) \\\n", - " .setOutputCols([\"document_question\", \"document_context\"])\n", - "\n", - "spanClassifier = LongformerForQuestionAnswering.pretrained(\"longformer_qa_large_4096_finetuned_triviaqa\",\"en\") \\\n", - " .setInputCols([\"document_question\", \"document_context\"]) \\\n", - " .setOutputCol(\"answer\")\\\n", - " .setCaseSensitive(True)\n", - "\n", - "pipeline = Pipeline(stages=[documentAssembler, \n", - " spanClassifier])\n", - "\n", - "data = spark.createDataFrame([[\"Where did Super Bowl 50 take place?\", \"\"\"Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. \n", - "The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. 
\n", - "The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California.\n", - "As this was the 50th Super Bowl, the league emphasized the \"golden anniversary\" with various gold-themed initiatives,\n", - "as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as \"Super Bowl L\"),\n", - "so that the logo could prominently feature the Arabic numerals 50.\"\"\"]]).toDF(\"question\", \"context\")\n", - "\n", - "result = pipeline.fit(data).transform(data)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 53939, - "status": "ok", - "timestamp": 1664911477978, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "oImAnppEbfs9", - "outputId": "e1d3d2b0-c76e-4ef8-c383-3563d2b21df1" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+-----------------+\n", - "|result |\n", - "+-----------------+\n", - "|[Levi 's Stadium]|\n", - "+-----------------+\n", - "\n" - ] - } - ], - "source": [ - "result.select('answer.result').show(truncate=False)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Bl65Z_EcdCbE" - }, - "source": [ - "### XlmRoBertaForQuestionAnswering \n", - " " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 63421, - "status": "ok", - "timestamp": 1664911541369, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "a7Xscel0dCbF", - "outputId": "88940ef5-6fa1-40be-ad7d-7aaf238dc2bb" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "xlm_roberta_base_qa_squad2 download 
started this may take some time.\n", - "Approximate size to download 834.5 MB\n", - "[OK!]\n" - ] - } - ], - "source": [ - "documentAssembler = MultiDocumentAssembler() \\\n", - " .setInputCols([\"question\", \"context\"]) \\\n", - " .setOutputCols([\"document_question\", \"document_context\"])\n", - "\n", - "spanClassifier = XlmRoBertaForQuestionAnswering.pretrained(\"xlm_roberta_base_qa_squad2\",\"en\") \\\n", - " .setInputCols([\"document_question\", \"document_context\"]) \\\n", - " .setOutputCol(\"answer\")\\\n", - " .setCaseSensitive(True)\n", - " \n", - "pipeline = Pipeline(stages=[documentAssembler, \n", - " spanClassifier])\n", - "\n", - "data = spark.createDataFrame([[\"What year was the Carolina Panthers franchise founded?\", \"\"\"The Panthers finished the regular season with a 15–1 record, and quarterback Cam Newton was named the NFL Most Valuable Player (MVP).\n", - "They defeated the Arizona Cardinals 49–15 in the NFC Championship Game and advanced to their second Super Bowl appearance since the franchise was founded in 1995.\n", - "The Broncos finished the regular season with a 12–4 record, and denied the New England Patriots a chance to defend their title from Super Bowl XLIX by defeating them 20–18 in the AFC Championship Game.\n", - "They joined the Patriots, Dallas Cowboys, and Pittsburgh Steelers as one of four teams that have made eight appearances in the Super Bowl.\"\"\"]]).toDF(\"question\", \"context\")\n", - "\n", - "result = pipeline.fit(data).transform(data)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 3809, - "status": "ok", - "timestamp": 1664911545151, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "0C5wCJt9dCbF", - "outputId": "b6e54ca4-d993-4131-c72d-b49ae2d671f8" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - 
"text": [ - "+-------+\n", - "|result |\n", - "+-------+\n", - "|[1995.]|\n", - "+-------+\n", - "\n" - ] - } - ], - "source": [ - "result.select('answer.result').show(truncate=False)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ZpYgGOEgjayA" - }, - "source": [ - "## NERDL Model" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "CHq-SZJgkttM" - }, - "source": [ - "![image.png]()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "71rZVJit4HAb" - }, - "source": [ - "### Public NER (CoNLL 2003)\n", - "\n", - "

Named-Entity recognition is a well-known technique in information extraction; it is also known as entity identification, entity chunking and entity extraction. Knowing the relevant tags for each article helps in automatically categorizing the articles in defined hierarchies and enables smooth content discovery. " - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "j7Qr_dVL4LDZ" - }, - "source": [ - "Entities\n", - "\n", - "``` PERSON, LOCATION, ORGANIZATION, MISC ```" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 5964, - "status": "ok", - "timestamp": 1664911551077, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "fR6faruZfKFb", - "outputId": "64c70264-04ce-4c71-b46b-e924c5445253" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "ner_dl download started this may take some time.\n", - "Approximate size to download 13.6 MB\n", - "[OK!]\n" - ] - } - ], - "source": [ - "public_ner = NerDLModel.pretrained(\"ner_dl\", 'en') \\\n", - " .setInputCols([\"document\", \"token\", \"embeddings\"]) \\\n", - " .setOutputCol(\"ner\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 43, - "status": "ok", - "timestamp": 1664911551078, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "Yaf2zNA9JfZF", - "outputId": "5384db93-343f-44d7-f43f-97ec602117a7" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "{Param(parent='NerDLModel_d4424c9af5f4', name='includeAllConfidenceScores', doc='whether to include all confidence scores in annotation metadata or just the score of the predicted tag'): False,\n", - " Param(parent='NerDLModel_d4424c9af5f4', 
name='lazyAnnotator', doc='Whether this AnnotatorModel acts as lazy in RecursivePipelines'): False,\n", - " Param(parent='NerDLModel_d4424c9af5f4', name='includeConfidence', doc='whether to include confidence scores in annotation metadata'): False,\n", - " Param(parent='NerDLModel_d4424c9af5f4', name='batchSize', doc='Size of every batch'): 32,\n", - " Param(parent='NerDLModel_d4424c9af5f4', name='classes', doc='get the tags used to trained this NerDLModel'): ['O',\n", - " 'B-ORG',\n", - " 'B-LOC',\n", - " 'B-PER',\n", - " 'I-PER',\n", - " 'I-ORG',\n", - " 'B-MISC',\n", - " 'I-LOC',\n", - " 'I-MISC'],\n", - " Param(parent='NerDLModel_d4424c9af5f4', name='inputCols', doc='previous annotations columns, if renamed'): ['document',\n", - " 'token',\n", - " 'embeddings'],\n", - " Param(parent='NerDLModel_d4424c9af5f4', name='outputCol', doc='output annotation column. can be left default.'): 'ner',\n", - " Param(parent='NerDLModel_d4424c9af5f4', name='storageRef', doc='unique reference name for identification'): 'glove_100d'}" - ] - }, - "execution_count": 8, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "public_ner.extractParamMap()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 33, - "status": "ok", - "timestamp": 1664911551078, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "HzAZjzMGJ0x5", - "outputId": "91a459d7-a1de-46f4-f199-d8a8ffe9f0a1" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "['O', 'B-ORG', 'B-LOC', 'B-PER', 'I-PER', 'I-ORG', 'B-MISC', 'I-LOC', 'I-MISC']" - ] - }, - "execution_count": 9, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "public_ner.getClasses()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, 
- "executionInfo": { - "elapsed": 3888, - "status": "ok", - "timestamp": 1664911653088, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "w-Li0LMw1rFS", - "outputId": "2b1073de-7cee-4ff1-e2a5-fa9eb9f78338" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "glove_100d download started this may take some time.\n", - "Approximate size to download 145.3 MB\n", - "[OK!]\n" - ] - } - ], - "source": [ - "documentAssembler = DocumentAssembler()\\\n", - " .setInputCol(\"text\")\\\n", - " .setOutputCol(\"document\")\n", - "\n", - "tokenizer = Tokenizer() \\\n", - " .setInputCols([\"document\"]) \\\n", - " .setOutputCol(\"token\")\n", - "\n", - "# ner_dl model is trained with glove_100d. So we use the same embeddings in the pipeline\n", - "glove_embeddings = WordEmbeddingsModel.pretrained('glove_100d').\\\n", - " setInputCols([\"document\", 'token']).\\\n", - " setOutputCol(\"embeddings\")\n", - "\n", - "nlpPipeline = Pipeline(stages=[documentAssembler, \n", - " tokenizer,\n", - " glove_embeddings,\n", - " public_ner])\n", - "\n", - "result = nlpPipeline.fit(news_df).transform(news_df.limit(10))\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 1977, - "status": "ok", - "timestamp": 1664911655052, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "fiBI4YSl2Jud", - "outputId": "a887b65a-a1f8-469f-d828-36e1a31c8cf2" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+------------+---------+\n", - "| token|ner_label|\n", - "+------------+---------+\n", - "| Unions| O|\n", - "|representing| O|\n", - "| workers| O|\n", - "| at| O|\n", - "| Turner| B-ORG|\n", - "| Newall| I-ORG|\n", - "| say| O|\n", - "| they| O|\n", - "| are| O|\n", - "| '| O|\n", - 
"|disappointed| O|\n", - "| '| O|\n", - "| after| O|\n", - "| talks| O|\n", - "| with| O|\n", - "| stricken| O|\n", - "| parent| O|\n", - "| firm| O|\n", - "| Federal| B-ORG|\n", - "| Mogul| I-ORG|\n", - "| .| O|\n", - "| TORONTO| B-LOC|\n", - "| ,| O|\n", - "| Canada| B-LOC|\n", - "| A| O|\n", - "| second| O|\n", - "| team| O|\n", - "| of| O|\n", - "| rocketeers| O|\n", - "| competing| O|\n", - "| for| O|\n", - "| the| O|\n", - "| #36;10| O|\n", - "| million| O|\n", - "| Ansari| B-MISC|\n", - "| X| I-MISC|\n", - "| Prize| I-MISC|\n", - "| ,| O|\n", - "| a| O|\n", - "| contest| O|\n", - "| for| O|\n", - "| privately| O|\n", - "| funded| O|\n", - "| suborbital| O|\n", - "| space| O|\n", - "| flight| O|\n", - "| ,| O|\n", - "| has| O|\n", - "| officially| O|\n", - "| announced| O|\n", - "+------------+---------+\n", - "only showing top 50 rows\n", - "\n" - ] - } - ], - "source": [ - "result_df = result.select(F.explode(F.arrays_zip(result.token.result, \n", - " result.ner.result)).alias(\"cols\")) \\\n", - " .select(F.expr(\"cols['0']\").alias(\"token\"),\n", - " F.expr(\"cols['1']\").alias(\"ner_label\"))\n", - "\n", - "result_df.show(50, truncate=100)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "1LJGcXig26wl" - }, - "source": [ - "### NerDL OntoNotes 100D" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "2S-QiZwTnvuW" - }, - "source": [ - "This pipeline is based on NerDLApproach annotator with Char CNN - BiLSTM and GloVe Embeddings on the OntoNotes corpus and supports the identification of 18 entities.

The following NER types are supported in this pipeline:

| Type | Description |
|:------------|:-----------------------------------------------------|
| PERSON | People, including fictional. |
| NORP | Nationalities or religious or political groups. |
| FAC | Buildings, airports, highways, bridges, etc. |
| ORG | Companies, agencies, institutions, etc. |
| GPE | Countries, cities, states. |
| LOC | Non-GPE locations, mountain ranges, bodies of water. |
| PRODUCT | Objects, vehicles, foods, etc. (Not services.) |
| EVENT | Named hurricanes, battles, wars, sports events, etc. |
| WORK_OF_ART | Titles of books, songs, etc. |
| LAW | Named documents made into laws. |
| LANGUAGE | Any named language. |
| DATE | Absolute or relative dates or periods. |
| TIME | Times smaller than a day. |
| PERCENT | Percentage, including “%”. |
| MONEY | Monetary values, including unit. |
| QUANTITY | Measurements, as of weight or distance. |
| ORDINAL | “first”, “second”, etc. |
| CARDINAL | Numerals that do not fall under another type. |
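
The model emits these types per token with a `B-` (begin) or `I-` (inside) prefix, and `O` for tokens outside any entity. As a rough illustration of how such tags relate to whole entities, here is a minimal pure-Python sketch that merges consecutive tags into chunks. The helper name `group_bio` is hypothetical and this is not Spark NLP's own implementation; it only assumes the tag layout seen in the notebook outputs.

```python
from typing import List, Optional, Tuple


def group_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge token-level B-/I-/O tags into (chunk_text, entity_type) pairs.

    Hypothetical helper for illustration only.
    """
    chunks: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    def flush() -> None:
        # Emit the chunk collected so far (if any) and reset the state.
        nonlocal current_tokens, current_type
        if current_tokens:
            chunks.append((" ".join(current_tokens), current_type))
        current_tokens, current_type = [], None

    for token, tag in zip(tokens, tags):
        prefix, _, entity_type = tag.partition("-")
        if prefix == "I" and current_type == entity_type:
            # Continuation of the chunk that is currently open.
            current_tokens.append(token)
        elif prefix in ("B", "I"):
            # "B-" always opens a new chunk; a stray "I-" is treated the same.
            flush()
            current_tokens, current_type = [token], entity_type
        else:
            # "O" (or anything unexpected) closes the open chunk.
            flush()
    flush()
    return chunks


tokens = ["TORONTO", ",", "Canada", "A", "second", "team", "Ansari", "X", "Prize"]
tags = ["B-GPE", "O", "B-GPE", "O", "B-ORDINAL", "O",
        "B-WORK_OF_ART", "I-WORK_OF_ART", "I-WORK_OF_ART"]

for chunk, entity_type in group_bio(tokens, tags):
    print(f"{chunk:<15} {entity_type}")
```

Treating a stray `I-` tag as the start of a chunk is a design choice of this sketch, so that outputs without any `B-` tags still group sensibly.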
" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "8fIhFYG03lAB" - }, - "source": [ - "Entities\n", - "\n", - "``` 'CARDINAL', 'DATE', 'EVENT', 'FAC', 'GPE', 'LANGUAGE', 'LAW', 'LOC', 'MONEY', 'NORP', 'ORDINAL', 'ORG', 'PERCENT', 'PERSON', 'PRODUCT', 'QUANTITY', 'TIME', 'WORK_OF_ART' ```" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 3567, - "status": "ok", - "timestamp": 1664911704123, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "KTH7biUa2ezJ", - "outputId": "751c3d89-bed4-485f-850d-5d88b042128a" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "onto_100 download started this may take some time.\n", - "Approximate size to download 13.5 MB\n", - "[OK!]\n" - ] - } - ], - "source": [ - "onto_ner = NerDLModel.pretrained(\"onto_100\", 'en') \\\n", - " .setInputCols([\"document\", \"token\", \"embeddings\"]) \\\n", - " .setOutputCol(\"ner\")\n", - "\n", - "nlpPipeline = Pipeline(stages=[documentAssembler, \n", - " tokenizer,\n", - " glove_embeddings,\n", - " onto_ner])\n", - "\n", - "result = nlpPipeline.fit(news_df).transform(news_df.limit(10))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 1179, - "status": "ok", - "timestamp": 1664911706300, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "HQ6BJeBK3Gp-", - "outputId": "42cff7f7-49ad-434a-85cf-c50c38b64767" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+------------+-------------+\n", - "| token| ner_label|\n", - "+------------+-------------+\n", - "| Unions| O|\n", - "|representing| O|\n", - "| workers| O|\n", - "| at| O|\n", - "| 
Turner| B-ORG|\n", - "| Newall| I-ORG|\n", - "| say| O|\n", - "| they| O|\n", - "| are| O|\n", - "| '| O|\n", - "|disappointed| O|\n", - "| '| O|\n", - "| after| O|\n", - "| talks| O|\n", - "| with| O|\n", - "| stricken| O|\n", - "| parent| O|\n", - "| firm| O|\n", - "| Federal| B-ORG|\n", - "| Mogul| I-ORG|\n", - "| .| O|\n", - "| TORONTO| B-GPE|\n", - "| ,| O|\n", - "| Canada| B-GPE|\n", - "| A| O|\n", - "| second| B-ORDINAL|\n", - "| team| O|\n", - "| of| O|\n", - "| rocketeers| O|\n", - "| competing| O|\n", - "| for| O|\n", - "| the| O|\n", - "| #36;10| B-CARDINAL|\n", - "| million| I-CARDINAL|\n", - "| Ansari|B-WORK_OF_ART|\n", - "| X|I-WORK_OF_ART|\n", - "| Prize|I-WORK_OF_ART|\n", - "| ,| O|\n", - "| a| O|\n", - "| contest| O|\n", - "| for| O|\n", - "| privately| O|\n", - "| funded| O|\n", - "| suborbital| O|\n", - "| space| O|\n", - "| flight| O|\n", - "| ,| O|\n", - "| has| O|\n", - "| officially| O|\n", - "| announced| O|\n", - "+------------+-------------+\n", - "only showing top 50 rows\n", - "\n" - ] - } - ], - "source": [ - "result_df = result.select(F.explode(F.arrays_zip(result.token.result, \n", - " result.ner.result)).alias(\"cols\")) \\\n", - " .select(F.expr(\"cols['0']\").alias(\"token\"),\n", - " F.expr(\"cols['1']\").alias(\"ner_label\"))\n", - "\n", - "result_df.show(50, truncate=100)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "nziP0JBB34rq" - }, - "source": [ - "### NER with Bert (CoNLL 2003)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 24545, - "status": "ok", - "timestamp": 1664911751412, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "bF6czpoU3yFb", - "outputId": "ad537a37-c629-40bd-a286-6e6e5a268b45" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "bert_base_cased download started 
this may take some time.\n", - "Approximate size to download 389.1 MB\n", - "[OK!]\n", - "ner_dl_bert download started this may take some time.\n", - "Approximate size to download 15.4 MB\n", - "[OK!]\n" - ] - } - ], - "source": [ - "documentAssembler = DocumentAssembler()\\\n", - " .setInputCol(\"text\")\\\n", - " .setOutputCol(\"document\")\n", - "\n", - "tokenizer = Tokenizer() \\\n", - " .setInputCols([\"document\"]) \\\n", - " .setOutputCol(\"token\")\n", - "\n", - "bert_embeddings = BertEmbeddings.pretrained('bert_base_cased')\\\n", - " .setInputCols([\"document\", \"token\"])\\\n", - " .setOutputCol(\"embeddings\")\n", - "\n", - "onto_ner_bert = NerDLModel.pretrained(\"ner_dl_bert\", 'en') \\\n", - " .setInputCols([\"document\", \"token\", \"embeddings\"]) \\\n", - " .setOutputCol(\"ner\")\n", - "\n", - "nlpPipeline = Pipeline(\n", - " stages=[\n", - " documentAssembler, \n", - " tokenizer,\n", - " bert_embeddings,\n", - " onto_ner_bert\n", - " ])\n", - "\n", - "empty_df = spark.createDataFrame([['']]).toDF(\"text\")\n", - "\n", - "pipelineModel = nlpPipeline.fit(empty_df)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 1220, - "status": "ok", - "timestamp": 1664911752605, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "5bIZ_1yI4Ixv", - "outputId": "6921a375-d50e-4426-871c-d74d610e105b" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "[('Peter', 'I-PER'),\n", - " ('Parker', 'I-PER'),\n", - " ('is', 'O'),\n", - " ('a', 'O'),\n", - " ('nice', 'O'),\n", - " ('persn', 'O'),\n", - " ('and', 'O'),\n", - " ('lives', 'O'),\n", - " ('in', 'O'),\n", - " ('New', 'I-LOC'),\n", - " ('York', 'I-LOC'),\n", - " ('.', 'O'),\n", - " ('Bruce', 'I-PER'),\n", - " ('Wayne', 'I-PER'),\n", - " ('is', 'O'),\n", - " ('also', 'O'),\n", - " ('a', 'O'),\n", - " ('nice', 'O'),\n", - " 
('guy', 'O'),\n", - " ('and', 'O'),\n", - " ('lives', 'O'),\n", - " ('in', 'O'),\n", - " ('Gotham', 'I-LOC'),\n", - " ('City', 'I-LOC'),\n", - " ('.', 'O')]" - ] - }, - "execution_count": 20, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# annotate in LightPipeline (returns plain strings per output column)\n", - "\n", - "light_model = LightPipeline(pipelineModel)\n", - "\n", - "light_result = light_model.annotate('Peter Parker is a nice persn and lives in New York. Bruce Wayne is also a nice guy and lives in Gotham City.')\n", - "\n", - "list(zip(light_result['token'], light_result['ner']))" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "5qinFfvSsy19" - }, - "source": [ - "### Getting the NER chunks with NER Converter" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "__OLhl663SDJ" - }, - "outputs": [], - "source": [ - "ner_converter = NerConverter() \\\n", - " .setInputCols([\"document\", \"token\", \"ner\"]) \\\n", - " .setOutputCol(\"ner_chunk\")\n", - "\n", - "nlpPipeline = Pipeline(\n", - " stages=[\n", - " documentAssembler, \n", - " tokenizer,\n", - " bert_embeddings,\n", - " onto_ner_bert,\n", - " ner_converter\n", - " ])\n", - "\n", - "empty_df = spark.createDataFrame([['']]).toDF(\"text\")\n", - "\n", - "pipelineModel = nlpPipeline.fit(empty_df)\n", - "\n", - "result = pipelineModel.transform(news_df.limit(10))\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 1705, - "status": "ok", - "timestamp": 1664911754640, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "nSIxpj4Osy2A", - "outputId": "cb24d626-f67e-4da6-9ca0-8cf735f5a01c" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+-------------------------------------------+---------+\n", - "|chunk |ner_label|\n", - 
"+-------------------------------------------+---------+\n", - "|Turner Newall |ORG |\n", - "|Federal Mogul |ORG |\n", - "|TORONTO |LOC |\n", - "|Canada |LOC |\n", - "|Ansari X Prize |MISC |\n", - "|University of Louisville |ORG |\n", - "|Mike Fitzpatrick |PER |\n", - "|Southern California's |LOC |\n", - "|British Department for Education and Skills|ORG |\n", - "|DfES |ORG |\n", - "|Netsky |MISC |\n", - "|Sasser |MISC |\n", - "|Sophos |ORG |\n", - "|Jaschan |PER |\n", - "|Germany |LOC |\n", - "|Netsky |ORG |\n", - "|Sasser |MISC |\n", - "|GPG/OpenPGP |MISC |\n", - "|FOAF |ORG |\n", - "|PGP |ORG |\n", - "+-------------------------------------------+---------+\n", - "only showing top 20 rows\n", - "\n" - ] - } - ], - "source": [ - "result.select(F.explode(F.arrays_zip(result.ner_chunk.result, \n", - " result.ner_chunk.metadata)).alias(\"cols\")) \\\n", - " .select(F.expr(\"cols['0']\").alias(\"chunk\"),\n", - " F.expr(\"cols['1']['entity']\").alias(\"ner_label\")).show(truncate=False)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 175 - }, - "executionInfo": { - "elapsed": 1180, - "status": "ok", - "timestamp": 1664911756032, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "uauJN2Umsy2C", - "outputId": "47e57aa1-8503-4755-9460-9b97a444608b" - }, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - "
\n", - "
\n", - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
|   | chunks | entities |
|--:|:-------------|:---------|
| 0 | Peter Parker | PER |
| 1 | New York | LOC |
| 2 | Bruce Wayne | PER |
| 3 | Gotham City | LOC |
\n", - "
\n", - " \n", - " \n", - " \n", - "\n", - " \n", - "
\n", - "
\n", - " " - ], - "text/plain": [ - " chunks entities\n", - "0 Peter Parker PER\n", - "1 New York LOC\n", - "2 Bruce Wayne PER\n", - "3 Gotham City LOC" - ] - }, - "execution_count": 23, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# fullAnnotate in LightPipeline\n", - "\n", - "light_model = LightPipeline(pipelineModel)\n", - "\n", - "light_result = light_model.fullAnnotate('Peter Parker is a nice persn and lives in New York. Bruce Wayne is also a nice guy and lives in Gotham City.')\n", - "\n", - "\n", - "chunks = []\n", - "entities = []\n", - "\n", - "for n in light_result[0]['ner_chunk']:\n", - " \n", - " chunks.append(n.result)\n", - " entities.append(n.metadata['entity']) \n", - " \n", - " \n", - "import pandas as pd\n", - "\n", - "df = pd.DataFrame({'chunks':chunks, 'entities':entities})\n", - "\n", - "df" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "TkLVVip_7_FP" - }, - "source": [ - "### NER with BertForTokenClassification\n", - "\n", - "[BertForTokenClassification](https://nlp.johnsnowlabs.com/docs/en/transformers#bertfortokenclassification) can load Bert Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks.\n", - "\n", - "For more examples of BertForTokenClassification models, please check [Transformers for Token Classification in Spark NLP notebook](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/14.Transformers_for_Token_Classification_in_Spark_NLP.ipynb). \n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "vDQU8ngXTGwr" - }, - "source": [ - "Pretrained models can be loaded with `pretrained()` of the companion object. The default model is `\"bert_base_token_classifier_conll03\"`, if no name is provided.

\n", - "\n", - "**Here are Bert Based Token Classification models available in Spark NLP:**\n", - "\n", - "
\n", - "\n", - "| Title | Name | Language |\n", - "|:-----------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------|:-----------|\n", - "| BERT Token Classification - NER CoNLL (bert_base_token_classifier_conll03) | bert_base_token_classifier_conll03 | en |\n", - "| BERT Token Classification - NER OntoNotes (bert_base_token_classifier_ontonote) | bert_base_token_classifier_ontonote | en |\n", - "| BERT Token Classification Large - NER CoNLL (bert_large_token_classifier_conll03) | bert_large_token_classifier_conll03 | en |\n", - "| BERT Token Classification Large - NER OntoNotes (bert_large_token_classifier_ontonote) | bert_large_token_classifier_ontonote | en |\n", - "| BERT Token Classification - ParsBERT for Persian Language Understanding (bert_token_classifier_parsbert_armanner) | bert_token_classifier_parsbert_armanner | fa |\n", - "| BERT Token Classification - ParsBERT for Persian Language Understanding (bert_token_classifier_parsbert_ner) | bert_token_classifier_parsbert_ner | fa |\n", - "| BERT Token Classification - ParsBERT for Persian Language Understanding (bert_token_classifier_parsbert_peymaner) | bert_token_classifier_parsbert_peymaner | fa |\n", - "| BERT Token Classification - BETO Spanish Language Understanding (bert_token_classifier_spanish_ner) | bert_token_classifier_spanish_ner | es |\n", - "| BERT Token Classification - Swedish Language Understanding (bert_token_classifier_swedish_ner) | bert_token_classifier_swedish_ner | sv |\n", - "| BERT Token Classification - Turkish Language Understanding (bert_token_classifier_turkish_ner) | bert_token_classifier_turkish_ner | tr |\n", - "| DistilBERT Token Classification - NER CoNLL (distilbert_base_token_classifier_conll03) | distilbert_base_token_classifier_conll03 | en |\n", - "| DistilBERT Token Classification - NER OntoNotes (distilbert_base_token_classifier_ontonotes) | 
distilbert_base_token_classifier_ontonotes | en |\n", - "| DistilBERT Token Classification - DistilbertNER for Persian Language Understanding (distilbert_token_classifier_persian_ner) | distilbert_token_classifier_persian_ner | fa |\n", - "| BERT Token Classification - Few-NERD (bert_base_token_classifier_few_nerd) | bert_base_token_classifier_few_nerd | en |\n", - "| DistilBERT Token Classification - Few-NERD (distilbert_base_token_classifier_few_nerd) | distilbert_base_token_classifier_few_nerd | en |\n", - "| Named Entity Recognition for Japanese (BertForTokenClassification) | bert_token_classifier_ner_ud_gsd | ja |\n", - "| Detect PHI for Deidentification (BertForTokenClassifier) | bert_token_classifier_ner_deid | en |\n", - "| Detect Clinical Entities (BertForTokenClassifier) | bert_token_classifier_ner_jsl | en |\n", - "| Detect Drug Chemicals (BertForTokenClassifier) | bert_token_classifier_ner_drugs | en |\n", - "| Detect Clinical Entities (Slim version, BertForTokenClassifier) | bert_token_classifier_ner_jsl_slim | en |\n", - "| ALBERT Token Classification Base - NER CoNLL (albert_base_token_classifier_conll03) | albert_base_token_classifier_conll03 | en |\n", - "| ALBERT Token Classification Large - NER CoNLL (albert_large_token_classifier_conll03) | albert_large_token_classifier_conll03 | en |\n", - "| ALBERT Token Classification XLarge - NER CoNLL (albert_xlarge_token_classifier_conll03) | albert_xlarge_token_classifier_conll03 | en |\n", - "| DistilRoBERTa Token Classification - NER OntoNotes (distilroberta_base_token_classifier_ontonotes) | distilroberta_base_token_classifier_ontonotes | en |\n", - "| RoBERTa Token Classification Base - NER CoNLL (roberta_base_token_classifier_conll03) | roberta_base_token_classifier_conll03 | en |\n", - "| RoBERTa Token Classification Base - NER OntoNotes (roberta_base_token_classifier_ontonotes) | roberta_base_token_classifier_ontonotes | en |\n", - "| RoBERTa Token Classification Large - NER CoNLL 
(roberta_large_token_classifier_conll03) | roberta_large_token_classifier_conll03 | en |\n", - "| RoBERTa Token Classification Large - NER OntoNotes (roberta_large_token_classifier_ontonotes) | roberta_large_token_classifier_ontonotes | en |\n", - "| RoBERTa Token Classification For Persian (roberta_token_classifier_zwnj_base_ner) | roberta_token_classifier_zwnj_base_ner | fa |\n", - "| XLM-RoBERTa Token Classification Base - NER XTREME (xlm_roberta_token_classifier_ner_40_lang) | xlm_roberta_token_classifier_ner_40_lang | xx |\n", - "| XLNet Token Classification Base - NER CoNLL (xlnet_base_token_classifier_conll03) | xlnet_base_token_classifier_conll03 | en |\n", - "| XLNet Token Classification Large - NER CoNLL (xlnet_large_token_classifier_conll03) | xlnet_large_token_classifier_conll03 | en |\n", - "| Detect Adverse Drug Events (BertForTokenClassification) | bert_token_classifier_ner_ade | en |\n", - "| Detect Anatomical Regions (BertForTokenClassification) | bert_token_classifier_ner_anatomy | en |\n", - "| Detect Bacterial Species (BertForTokenClassification) | bert_token_classifier_ner_bacteria | en |\n", - "| XLM-RoBERTa Token Classification Base - NER CoNLL (xlm_roberta_base_token_classifier_conll03) | xlm_roberta_base_token_classifier_conll03 | en |\n", - "| XLM-RoBERTa Token Classification Base - NER OntoNotes (xlm_roberta_base_token_classifier_ontonotes) | xlm_roberta_base_token_classifier_ontonotes | en |\n", - "| Longformer Token Classification Base - NER CoNLL (longformer_base_token_classifier_conll03) | longformer_base_token_classifier_conll03 | en |\n", - "| Longformer Token Classification Base - NER CoNLL (longformer_large_token_classifier_conll03) | longformer_large_token_classifier_conll03 | en |\n", - "| Detect Chemicals in Medical text (BertForTokenClassification) | bert_token_classifier_ner_chemicals | en |\n", - "| Detect Chemical Compounds and Genes (BertForTokenClassifier) | bert_token_classifier_ner_chemprot | en |\n", - "| Detect Cancer 
Genetics (BertForTokenClassification) | bert_token_classifier_ner_bionlp | en |\n", - "| Detect Cellular/Molecular Biology Entities (BertForTokenClassification) | bert_token_classifier_ner_cellular | en |\n", - "| Detect concepts in drug development trials (BertForTokenClassification) | bert_token_classifier_drug_development_trials | en |\n", - "| Detect Cancer Genetics (BertForTokenClassification) | bert_token_classifier_ner_bionlp | en |\n", - "| Detect Adverse Drug Events (BertForTokenClassification) | bert_token_classifier_ner_ade | en |\n", - "| Detect Anatomical Regions (MedicalBertForTokenClassifier) | bert_token_classifier_ner_anatomy | en |\n", - "| Detect Cellular/Molecular Biology Entities (BertForTokenClassification) | bert_token_classifier_ner_cellular | en |\n", - "| Detect Chemicals in Medical text (BertForTokenClassification) | bert_token_classifier_ner_chemicals | en |\n", - "| Detect Chemical Compounds and Genes (BertForTokenClassifier) | bert_token_classifier_ner_chemprot | en |\n", - "| Detect PHI for Deidentification (BertForTokenClassifier) | bert_token_classifier_ner_deid | en |\n", - "| Detect Drug Chemicals (BertForTokenClassifier) | bert_token_classifier_ner_drugs | en |\n", - "| Detect Clinical Entities (BertForTokenClassifier) | bert_token_classifier_ner_jsl | en |\n", - "| Detect Clinical Entities (Slim version, BertForTokenClassifier) | bert_token_classifier_ner_jsl_slim | en |\n", - "| Detect Bacterial Species (BertForTokenClassification) | bert_token_classifier_ner_bacteria | en |" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "oBCnq6tnThMw" - }, - "source": [ - "**You can find all these models and more [HERE](https://nlp.johnsnowlabs.com/models?edition=Spark+NLP)**" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 33616, - "status": "ok", - "timestamp": 1664911868468, - "user": { - 
"displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "EYaPujeVLUsu", - "outputId": "6e65404b-4f0c-4cc6-c50f-76960e38ea15" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "bert_base_token_classifier_conll03 download started this may take some time.\n", - "Approximate size to download 385.4 MB\n", - "[OK!]\n" - ] - } - ], - "source": [ - "# no need for token columns \n", - "tokenClassifier = BertForTokenClassification.pretrained('bert_base_token_classifier_conll03', 'en') \\\n", - " .setInputCols('document',\"token\") \\\n", - " .setOutputCol(\"ner\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 3202, - "status": "ok", - "timestamp": 1664911871653, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "Ga2fLVNG8Az7", - "outputId": "207c4249-a13f-45a7-d77a-82cea0c2c139" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+-------------------------------------------+---------+\n", - "|chunk |ner_label|\n", - "+-------------------------------------------+---------+\n", - "|Turner Newall |ORG |\n", - "|Federal Mogul |ORG |\n", - "|TORONTO |LOC |\n", - "|Canada |LOC |\n", - "|Ansari X Prize |MISC |\n", - "|University of Louisville |ORG |\n", - "|Mike Fitzpatrick |PER |\n", - "|Southern California's |LOC |\n", - "|British Department for Education and Skills|ORG |\n", - "|DfES |ORG |\n", - "|Music Manifesto |MISC |\n", - "|Netsky |MISC |\n", - "|Sasser |MISC |\n", - "|Sophos |ORG |\n", - "|Jaschan |PER |\n", - "|Germany |LOC |\n", - "|Netsky |MISC |\n", - "|Sasser |MISC |\n", - "|GPG/OpenPGP |MISC |\n", - "|PGP |MISC |\n", - "+-------------------------------------------+---------+\n", - "only showing top 20 rows\n", - "\n" - ] - } - ], - "source": [ - 
"nlpPipeline = Pipeline(\n", - " stages=[\n", - " documentAssembler, \n", - " tokenizer,\n", - " tokenClassifier,\n", - " ner_converter\n", - " ])\n", - "\n", - "result = nlpPipeline.fit(news_df).transform(news_df.limit(10))\n", - "\n", - "result.select(F.explode(F.arrays_zip(result.ner_chunk.result, \n", - " result.ner_chunk.metadata)).alias(\"cols\")) \\\n", - " .select(F.expr(\"cols['0']\").alias(\"chunk\"),\n", - " F.expr(\"cols['1']['entity']\").alias(\"ner_label\")).show(truncate=False)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "URB45j2dAYql" - }, - "source": [ - "### Multi-Lingual NER \n", - "These NER Models are able to extract entities from a variety of languages\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "4XgIxV9O8nLb" - }, - "source": [ - "#### Multi-Lingual NER (XLM-RoBERTa)\n", - "[XlmRoBertaForTokenClassification](https://nlp.johnsnowlabs.com/docs/en/transformers#xlmrobertafortokenclassification) can load XLM-RoBERTa Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. 
for Named-Entity-Recognition (NER) tasks.\n", - "\n", - "\n", - "\n", - "\n", - "| Spark NLP Model Name | language | predicted_entities | Class | Number of Languages supported |\n", - "|:-----------------------------------------|:-----------|:-------------------------------------------------------|:--------------------------------|:-----------------------|\n", - "| ner_wikiner_glove_840B_300 | xx | ['B-LOC', 'I-LOC', 'B-ORG', 'I-ORG', 'B-PER', 'I-PER'] | NerDLModel |8 |\n", - "| ner_wikiner_xlm_roberta_base | xx | ['B-LOC', 'I-LOC', 'B-ORG', 'I-ORG', 'B-PER', 'I-PER'] | NerDLModel |8 |\n", - "| ner_xtreme_glove_840B_300 | xx | ['B-LOC', 'I-LOC', 'B-ORG', 'I-ORG', 'B-PER', 'I-PER'] | NerDLModel |40 |\n", - "| ner_xtreme_xlm_roberta_xtreme_base | xx | ['B-LOC', 'I-LOC', 'B-ORG', 'I-ORG', 'B-PER', 'I-PER'] | NerDLModel |40 | \n", - "| xlm_roberta_token_classifier_ner_40_lang | xx | ['LOC', 'ORG', 'PER', 'O'] | XlmRoBertaForTokenClassification |40 | \n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 68422, - "status": "ok", - "timestamp": 1664911940070, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "V5YtYY3w3OfJ", - "outputId": "8745aae9-9371-4a2e-edea-c2f16c8f03e6" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "xlm_roberta_token_classifier_ner_40_lang download started this may take some time.\n", - "Approximate size to download 921.6 MB\n", - "[OK!]\n" - ] - } - ], - "source": [ - "tokenClassifier = XlmRoBertaForTokenClassification() \\\n", - " .pretrained('xlm_roberta_token_classifier_ner_40_lang', 'xx') \\\n", - " .setInputCols(['token', 'document']) \\\n", - " .setOutputCol('ner')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - 
"executionInfo": { - "elapsed": 4283, - "status": "ok", - "timestamp": 1664911944344, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "25M5P9ur2mf0", - "outputId": "bfd99210-c26f-49cb-b98f-ee97f171060f" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+--------------+---------+\n", - "|token |ner_label|\n", - "+--------------+---------+\n", - "|Peter |PER |\n", - "|Parker |PER |\n", - "|is |O |\n", - "|a |O |\n", - "|nice |O |\n", - "|lad |O |\n", - "|and |O |\n", - "|lives |O |\n", - "|in |O |\n", - "|New |LOC |\n", - "|York |LOC |\n", - "|Das |O |\n", - "|Schloss |ORG |\n", - "|Charlottenburg|ORG |\n", - "|in |O |\n", - "|Berlin |LOC |\n", - "|ist |O |\n", - "|eines |O |\n", - "|der |O |\n", - "|schoensten |O |\n", - "|Staedte |O |\n", - "|in |O |\n", - "|Deutschland |LOC |\n", - "|sagen |O |\n", - "|viele |O |\n", - "|Menschen |O |\n", - "|Peter |PER |\n", - "|Parker |PER |\n", - "|est |O |\n", - "|un |O |\n", - "|gentil |O |\n", - "|garçon |O |\n", - "|et |O |\n", - "|vit |O |\n", - "|à |O |\n", - "|New |LOC |\n", - "|York |LOC |\n", - "|پیٹر |PER |\n", - "|پارکر |PER |\n", - "|ایک |O |\n", - "|اچھا |O |\n", - "|لڑکا |O |\n", - "|ہے |O |\n", - "|اور |O |\n", - "|وہ |O |\n", - "|نیو |LOC |\n", - "|یارک |LOC |\n", - "|میں |O |\n", - "|رہتا |O |\n", - "|ھے |O |\n", - "+--------------+---------+\n", - "\n" - ] - } - ], - "source": [ - "from pyspark.sql.types import StringType\n", - "from pyspark.sql import functions as F\n", - "\n", - "# No need for NER Converter\n", - "nlpPipeline = Pipeline(stages=[documentAssembler, \n", - " tokenizer,\n", - " tokenClassifier,])\n", - "\n", - "text = [\n", - "'Peter Parker is a nice lad and lives in New York', \n", - "'Das Schloss Charlottenburg in Berlin ist eines der schoensten Staedte in Deutschland sagen viele Menschen',\n", - "'Peter Parker est un gentil garçon et vit à New York',\n", - "'پیٹر پارکر ایک اچھا لڑکا ہے 
اور وہ نیو یارک میں رہتا ھے',\n", - "]\n", - "data_set = spark.createDataFrame(text, StringType()).toDF(\"text\")\n", - "result = nlpPipeline.fit(data_set).transform(data_set)\n", - "\n", - "\n", - "result.select(F.explode(F.arrays_zip(result.token.result, \n", - " result.ner.result)).alias(\"cols\")) \\\n", - " .select(F.expr(\"cols['0']\").alias('token'),\n", - " F.expr(\"cols['1']\").alias(\"ner_label\")).show(100,truncate=False)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "8FqaDA6asy2E" - }, - "source": [ - "## Highlight the entities" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "m0gqRk_uRcl9" - }, - "outputs": [], - "source": [ - "# Install spark-nlp-display\n", - "! pip install -q spark-nlp-display" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 15949, - "status": "ok", - "timestamp": 1664911974770, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "325EZWIh5X_n", - "outputId": "98d1b0fd-067e-4833-b32c-aea978af42d7" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "recognize_entities_dl download started this may take some time.\n", - "Approx size to download 160.1 MB\n", - "[OK!]\n" - ] - } - ], - "source": [ - "from sparknlp.pretrained import PretrainedPipeline\n", - "\n", - "pipeline = PretrainedPipeline('recognize_entities_dl', lang='en')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 2054, - "status": "ok", - "timestamp": 1664911976788, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "A-rigiYR55D_", - "outputId": "6eb9216e-c6cf-46cf-9653-bce0b4a95e6e" - }, - "outputs": [ - { 
- "data": { - "text/plain": [ - "dict_keys(['entities', 'document', 'token', 'ner', 'embeddings', 'sentence'])" - ] - }, - "execution_count": 30, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "ann_text = pipeline.fullAnnotate('Peter Parker is a nice persn and lives in New York. Bruce Wayne is also a nice guy and lives in Gotham City.')[0]\n", - "ann_text.keys()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 299 - }, - "executionInfo": { - "elapsed": 25, - "status": "ok", - "timestamp": 1664911978443, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "JLXjjUFk57eO", - "outputId": "4f402e23-6231-4117-be0f-d1178c78a58d" - }, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - "\n", - " Peter Parker PER is a nice persn and lives in New York LOC. Bruce Wayne PER is also a nice guy and lives in Gotham City LOC." - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "\n", - "\n", - " Peter Parker PER is a nice persn and lives in New York LOC. Bruce Wayne PER is also a nice guy and lives in Gotham City LOC." - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "\n", - "\n", - " Peter Parker PER is a nice persn and lives in New York. Bruce Wayne PER is also a nice guy and lives in Gotham City." 
- ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "Color code for label: \n", - "\"LOC\": #008080\n", - "\"PER\": #800080\n" - ] - } - ], - "source": [ - "from sparknlp_display import NerVisualizer\n", - "\n", - "visualiser = NerVisualizer()\n", - "visualiser.display(ann_text, label_col='entities', document_col='document')\n", - "\n", - "# Change color of an entity label\n", - "visualiser.set_label_colors({'LOC':'#008080', 'PER':'#800080'})\n", - "visualiser.display(ann_text, label_col='entities')\n", - "\n", - "# Set label filter\n", - "visualiser.display(ann_text, label_col='entities', document_col='document',\n", - " labels=['PER'])\n", - "\n", - "print ('\\nColor code for label: \\n\"LOC\": {}\\n\"PER\": {}' .format(visualiser.get_label_color('LOC'),visualiser.get_label_color('PER')) )" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "v5ZbEW96mZ03" - }, - "source": [ - "## Using Pretrained ClassifierDL and SentimentDL models" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "FKjGyOcwmiQm" - }, - "source": [ - "| Name | Spark NLP Model Reference | Language |\n", - "|:------------------------------------------------------------------------------------|:-------------------------------------------|:-----------|\n", - "| TREC(50) Question Classifier | classifierdl_use_trec50 | en |\n", - "| TREC(6) Question Classifier | classifierdl_use_trec6 | en |\n", - "| Cyberbullying Classifier | classifierdl_use_cyberbullying | en |\n", - "| Emotion Detection Classifier | Emotion Classifier | en |\n", - "| Fake News Classifier | classifierdl_use_fakenews | en |\n", - "| Sarcasm Classifier | classifierdl_use_sarcasm | en |\n", - "| Spam Classifier | classifierdl_use_spam | en |\n", - "| Classifier for Adverse Drug Events | classifierdl_ade_biobert | en |\n", - "| PICO Classifier | classifierdl_pico_biobert | en |\n", - "| Classifier 
for Genders - BIOBERT | classifierdl_gender_biobert | en |\n", - "| Classifier for Genders - SBERT | classifierdl_gender_sbert | en |\n", - "| TREC(50) Question Classifier | classifierdl_use_trec50 | en |\n", - "| TREC(6) Question Classifier | classifierdl_use_trec6 | en |\n", - "| Cyberbullying Classifier | classifierdl_use_cyberbullying | en |\n", - "| Emotion Detection Classifier | classifierdl_use_emotion | en |\n", - "| Fake News Classifier | classifierdl_use_fakenews | en |\n", - "| Sarcasm Classifier | classifierdl_use_sarcasm | en |\n", - "| Spam Classifier | classifierdl_use_spam | en |\n", - "| Classifier for Adverse Drug Events | classifierdl_ade_biobert | en |\n", - "| Classifier for Adverse Drug Events using Clinical Bert | classifierdl_ade_clinicalbert | en |\n", - "| Classifier for Adverse Drug Events in Small Conversations | classifierdl_ade_conversational_biobert | en |\n", - "| Classifier for Genders - BIOBERT | classifierdl_gender_biobert | en |\n", - "| Classifier for Genders - SBERT | classifierdl_gender_sbert | en |\n", - "| PICO Classifier | classifierdl_pico_biobert | en |\n", - "| End-to-End (E2E) and data-driven NLG Challenge | multiclassifierdl_use_e2e | en |\n", - "| Toxic Comment Classification | multiclassifierdl_use_toxic | en |\n", - "| Toxic Comment Classification - Small | multiclassifierdl_use_toxic_sm | en |\n", - "| Intent Classification for Airline Traffic Information System queries (ATIS dataset) | classifierdl_use_atis | en |\n", - "| Identify intent in general text - SNIPS dataset | classifierdl_use_snips | en |\n", - "| News Classifier of Turkish text | classifierdl_bert_news | tr |\n", - "| News Classifier of German text | classifierdl_bert_news | de |\n", - "| Cyberbullying Classifier in Turkish texts. 
| classifierdl_berturk_cyberbullying | tr |\n", - "| Question Pair Classifier | classifierdl_electra_questionpair | en |\n", - "| Question Pair Classifier Pipeline | classifierdl_electra_questionpair_pipeline | en |\n", - "| News Classifier Pipeline for Turkish text | classifierdl_bert_news_pipeline | tr |" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 8393, - "status": "ok", - "timestamp": 1664911986815, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "1q7Ju4HHmagE", - "outputId": "5a82ba79-e637-411b-eb66-2d94f1ce618c" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "classifierdl_use_fakenews download started this may take some time.\n", - "Approximate size to download 21.4 MB\n", - "[OK!]\n" - ] - } - ], - "source": [ - "fake_classifier = ClassifierDLModel.pretrained('classifierdl_use_fakenews', 'en') \\\n", - " .setInputCols([\"document\", \"sentence_embeddings\"]) \\\n", - " .setOutputCol(\"class\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "nLYQd97Qmt1a" - }, - "source": [ - "fake_news classifier is trained on `https://raw.githubusercontent.com/joolsa/fake_real_news_dataset/master/fake_or_real_news.csv.zip`" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 16, - "status": "ok", - "timestamp": 1664911986816, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "4lJPUe-KmqBN", - "outputId": "cb97cb62-41b0-45ab-bc47-691fdf2106c4" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "['FAKE', 'REAL']" - ] - }, - "execution_count": 33, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - 
"fake_classifier.getClasses()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 59205, - "status": "ok", - "timestamp": 1664912050172, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "xY03ThozmwIE", - "outputId": "871908e1-5384-47a5-dc69-5b21a5bf4fa4" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "tfhub_use download started this may take some time.\n", - "Approximate size to download 923.7 MB\n", - "[OK!]\n" - ] - } - ], - "source": [ - "documentAssembler = DocumentAssembler()\\\n", - " .setInputCol(\"text\")\\\n", - " .setOutputCol(\"document\")\n", - "\n", - "use = UniversalSentenceEncoder.pretrained(lang=\"en\") \\\n", - " .setInputCols([\"document\"])\\\n", - " .setOutputCol(\"sentence_embeddings\")\n", - "\n", - "nlpPipeline = Pipeline(stages=[documentAssembler, \n", - " use,\n", - " fake_classifier])\n", - "\n", - "empty_data = spark.createDataFrame([[\"\"]]).toDF(\"text\")\n", - "\n", - "fake_clf_model = nlpPipeline.fit(empty_data)\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "dLXj1wBWm0gQ" - }, - "outputs": [], - "source": [ - "!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Public/data/spam_ham_dataset.csv" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 359, - "status": "ok", - "timestamp": 1664912051072, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "HlFOegD3ofvP", - "outputId": "ab303357-4482-4bd0-84a5-3d01e4d060fe" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "{'document': ['BREAKING: Leaked Picture Of Obama 
Being Dragged Before A Judge In Handcuffs For Wiretapping Trump'],\n", - " 'sentence_embeddings': ['BREAKING: Leaked Picture Of Obama Being Dragged Before A Judge In Handcuffs For Wiretapping Trump'],\n", - " 'class': ['FAKE']}" - ] - }, - "execution_count": 36, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "fake_lp_pipeline = LightPipeline(fake_clf_model)\n", - "\n", - "text = 'BREAKING: Leaked Picture Of Obama Being Dragged Before A Judge In Handcuffs For Wiretapping Trump'\n", - "\n", - "fake_lp_pipeline.annotate(text)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 232, - "status": "ok", - "timestamp": 1664912051299, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "MrOQLnBcofh5", - "outputId": "88ed92e7-3f07-44fd-b865-567de8654a41" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+-------------------------------------------------------------------------------------------------+\n", - "|text |\n", - "+-------------------------------------------------------------------------------------------------+\n", - "|BREAKING: Leaked Picture Of Obama Being Dragged Before A Judge In Handcuffs For Wiretapping Trump|\n", - "+-------------------------------------------------------------------------------------------------+\n", - "\n" - ] - } - ], - "source": [ - "sample_data = spark.createDataFrame([[text]]).toDF(\"text\")\n", - "\n", - "sample_data.show(truncate=False)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 526, - "status": "ok", - "timestamp": 1664912051821, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": 
"ZutmHe8tofOv", - "outputId": "b65f3a39-c765-4baf-8bb1-b3256a1695c5" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+--------------------+--------------------+--------------------+--------------------+\n", - "| text| document| sentence_embeddings| class|\n", - "+--------------------+--------------------+--------------------+--------------------+\n", - "|BREAKING: Leaked ...|[{document, 0, 96...|[{sentence_embedd...|[{category, 0, 96...|\n", - "+--------------------+--------------------+--------------------+--------------------+\n", - "\n" - ] - } - ], - "source": [ - "pred = fake_clf_model.transform(sample_data)\n", - "\n", - "pred.show()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 748, - "status": "ok", - "timestamp": 1664912052565, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "S1VaVxd7pCAI", - "outputId": "a367a2cc-fe82-4f78-c6bd-f37a3b76af31" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+-------------------------------------------------------------------------------------------------+------+\n", - "|text |result|\n", - "+-------------------------------------------------------------------------------------------------+------+\n", - "|BREAKING: Leaked Picture Of Obama Being Dragged Before A Judge In Handcuffs For Wiretapping Trump|[FAKE]|\n", - "+-------------------------------------------------------------------------------------------------+------+\n", - "\n" - ] - } - ], - "source": [ - "pred.select('text','class.result').show(truncate=False)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "5wDEdQ99pIw0" - }, - "source": [ - "you can find more samples here >> `https://github.com/KaiDMML/FakeNewsNet/tree/master/dataset`\n" - ] - }, - { - "cell_type": "code", - 
"execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 27, - "status": "ok", - "timestamp": 1664912052566, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "J1RzlrnzS9Ry", - "outputId": "54b72f89-43fd-43ce-8a6c-cf2117fef07f" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "{'document': ['Joseph Robinette Biden Jr. is an American politician who is the 46th and current president of the United States.'],\n", - " 'sentence_embeddings': ['Joseph Robinette Biden Jr. is an American politician who is the 46th and current president of the United States.'],\n", - " 'class': ['REAL']}" - ] - }, - "execution_count": 40, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "fake_lp_pipeline = LightPipeline(fake_clf_model)\n", - "\n", - "text = \"Joseph Robinette Biden Jr. is an American politician who is the 46th and current president of the United States.\"\n", - "\n", - "fake_lp_pipeline.annotate(text)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "8X5ftW_kpVS-" - }, - "source": [ - "## Generic classifier function" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "2AsjZNMFpVy2" - }, - "outputs": [], - "source": [ - "def get_clf_lp(model_name, sentiment_dl=False, pretrained=True):\n", - "\n", - " documentAssembler = DocumentAssembler()\\\n", - " .setInputCol(\"text\")\\\n", - " .setOutputCol(\"document\")\n", - "\n", - " use = UniversalSentenceEncoder.pretrained(lang=\"en\") \\\n", - " .setInputCols([\"document\"])\\\n", - " .setOutputCol(\"sentence_embeddings\")\n", - "\n", - "\n", - " if pretrained:\n", - "\n", - " if sentiment_dl:\n", - "\n", - " document_classifier = SentimentDLModel.pretrained(model_name, 'en') \\\n", - " .setInputCols([\"document\", \"sentence_embeddings\"]) \\\n", - " .setOutputCol(\"class\")\n", - " else:\n", - " 
document_classifier = ClassifierDLModel.pretrained(model_name, 'en') \\\n", - " .setInputCols([\"document\", \"sentence_embeddings\"]) \\\n", - " .setOutputCol(\"class\")\n", - "\n", - " else:\n", - "\n", - " if sentiment_dl:\n", - "\n", - " document_classifier = SentimentDLModel.load(model_name) \\\n", - " .setInputCols([\"document\", \"sentence_embeddings\"]) \\\n", - " .setOutputCol(\"class\")\n", - " else:\n", - " document_classifier = ClassifierDLModel.load(model_name) \\\n", - " .setInputCols([\"document\", \"sentence_embeddings\"]) \\\n", - " .setOutputCol(\"class\")\n", - "\n", - " print ('classes:',document_classifier.getClasses())\n", - "\n", - " nlpPipeline = Pipeline(stages=[\n", - " documentAssembler, \n", - " use,\n", - " document_classifier\n", - " ])\n", - "\n", - " empty_data = spark.createDataFrame([[\"\"]]).toDF(\"text\")\n", - "\n", - " clf_pipelineFit = nlpPipeline.fit(empty_data)\n", - "\n", - " clf_lp_pipeline = LightPipeline(clf_pipelineFit)\n", - "\n", - " return clf_lp_pipeline" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 12017, - "status": "ok", - "timestamp": 1664912064562, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "Sv0HYuokpYWv", - "outputId": "5b809660-2674-45aa-96b7-9d4220984ef2" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "tfhub_use download started this may take some time.\n", - "Approximate size to download 923.7 MB\n", - "[OK!]\n", - "classifierdl_use_trec50 download started this may take some time.\n", - "Approximate size to download 21.2 MB\n", - "[OK!]\n", - "classes: [' ENTY_color', ' ENTY_techmeth', ' DESC_manner', ' NUM_volsize', ' ENTY_letter', ' NUM_temp', ' ENTY_body', ' NUM_count', ' ENTY_instru', ' NUM_period', ' NUM_speed', ' DESC_reason', ' ENTY_symbol', ' ENTY_event', ' 
HUM_desc', ' NUM_perc', ' ENTY_dismed', ' NUM_ord', ' HUM_gr', ' LOC_mount', ' ABBR_abb', ' DESC_desc', ' NUM_dist', ' HUM_title', ' ENTY_lang', ' ENTY_sport', ' ENTY_plant', ' NUM_code', ' NUM_other', ' ENTY_word', ' ENTY_animal', ' ENTY_substance', ' ENTY_veh', ' ENTY_product', ' LOC_state', ' ENTY_religion', ' ENTY_currency', ' NUM_date', ' LOC_country', ' ENTY_cremat', ' NUM_money', ' LOC_other', ' DESC_def', ' LOC_city', ' HUM_ind', ' ENTY_other', ' ENTY_termeq', ' ENTY_food', ' ABBR_exp', ' NUM_weight']\n" - ] - } - ], - "source": [ - "clf_lp_pipeline = get_clf_lp('classifierdl_use_trec50')" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "KPQpn6hGpeXR" - }, - "source": [ - "trained on the TREC datasets:\n", - "\n", - "Classify open-domain, fact-based questions into one of the following broad semantic categories: \n", - "\n", - "```Abbreviation, Description, Entities, Human Beings, Locations or Numeric Values.```" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 32, - "status": "ok", - "timestamp": 1664912064563, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "qhszgL_epe7W", - "outputId": "a0adb084-08f4-417a-9bb0-c2f892cc4f7c" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "[' NUM_count']" - ] - }, - "execution_count": 43, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "text = 'What was the number of member nations of the U.N. 
in 2000?'\n", - "\n", - "clf_lp_pipeline.annotate(text)['class']" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 35 - }, - "executionInfo": { - "elapsed": 378, - "status": "ok", - "timestamp": 1664912064920, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "lU8mhnX-pn5-", - "outputId": "d0c47dde-37dc-4c45-c116-c3b0e7d22514" - }, - "outputs": [ - { - "data": { - "application/vnd.google.colaboratory.intrinsic+json": { - "type": "string" - }, - "text/plain": [ - "' NUM_count'" - ] - }, - "execution_count": 44, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "clf_lp_pipeline.fullAnnotate(text)[0]['class'][0].result" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 32, - "status": "ok", - "timestamp": 1664912064921, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "boKPGhgoppto", - "outputId": "f2e6d882-42e4-4179-9cf3-e48ae075a3c3" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "{' ENTY_dismed': '3.768739E-22',\n", - " ' ENTY_product': '2.4015744E-24',\n", - " ' ENTY_techmeth': '1.5787039E-22',\n", - " ' NUM_speed': '7.948464E-23',\n", - " ' NUM_volsize': '2.5315113E-25',\n", - " ' LOC_state': '6.3784123E-25',\n", - " ' NUM_code': '1.4549451E-25',\n", - " ' NUM_count': '0.9992601',\n", - " ' ENTY_food': '1.3031208E-24',\n", - " ' ENTY_animal': '1.6743833E-24',\n", - " ' NUM_period': '6.8075115E-21',\n", - " ' ENTY_religion': '5.9194734E-23',\n", - " ' LOC_country': '5.3062683E-21',\n", - " ' LOC_mount': '3.2177816E-25',\n", - " ' ENTY_termeq': '9.790085E-26',\n", - " ' ENTY_color': '1.1446835E-22',\n", - " ' ENTY_lang': '6.333391E-24',\n", - " ' ENTY_sport': 
'8.0773835E-25',\n", - " ' DESC_def': '2.4284432E-27',\n", - " ' HUM_gr': '4.4863106E-21',\n", - " ' ENTY_symbol': '4.1271923E-25',\n", - " ' ENTY_currency': '8.156541E-29',\n", - " ' ENTY_veh': '5.414701E-22',\n", - " ' LOC_other': '5.5141072E-11',\n", - " ' ENTY_word': '5.3265024E-23',\n", - " ' NUM_temp': '2.0907158E-23',\n", - " ' NUM_dist': '1.2542656E-24',\n", - " ' DESC_desc': '1.0926973E-12',\n", - " ' DESC_manner': '9.258374E-23',\n", - " ' NUM_ord': '2.2395288E-25',\n", - " ' NUM_other': '3.9771262E-27',\n", - " ' DESC_reason': '1.1718967E-6',\n", - " ' NUM_weight': '1.5373857E-24',\n", - " ' ENTY_instru': '5.9354656E-21',\n", - " ' ENTY_letter': '1.1453239E-25',\n", - " ' ENTY_event': '3.706315E-25',\n", - " ' ENTY_substance': '6.890844E-25',\n", - " ' ABBR_exp': '5.6048268E-24',\n", - " ' ENTY_body': '6.423101E-23',\n", - " ' ENTY_other': '7.378E-4',\n", - " ' NUM_money': '1.6745677E-25',\n", - " ' LOC_city': '4.7003377E-22',\n", - " ' NUM_date': '5.2122506E-16',\n", - " ' NUM_perc': '6.3761288E-24',\n", - " ' ABBR_abb': '7.101014E-26',\n", - " ' ENTY_plant': '5.543376E-24',\n", - " ' HUM_title': '1.0681953E-24',\n", - " ' ENTY_cremat': '1.1165376E-24',\n", - " ' HUM_ind': '8.063818E-7',\n", - " ' HUM_desc': '4.3701275E-23',\n", - " 'sentence': '0'}" - ] - }, - "execution_count": 45, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "clf_lp_pipeline.fullAnnotate(text)[0]['class'][0].metadata" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 20, - "status": "ok", - "timestamp": 1664912064922, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "P87b4fzRpr6-", - "outputId": "2a07fdc8-4a2e-4bfe-9a80-152ca249baa7" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "[' HUM_ind']" - ] - }, - "execution_count": 46, - "metadata": {}, - 
"output_type": "execute_result" - } - ], - "source": [ - "text = 'What animal was the first mammal successfully cloned from adult cells?'\n", - "\n", - "clf_lp_pipeline.annotate(text)['class']" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 11360, - "status": "ok", - "timestamp": 1664912076270, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "bNgGwNDwpuPV", - "outputId": "39db1753-994b-46d5-84e0-00743eb71f02" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "tfhub_use download started this may take some time.\n", - "Approximate size to download 923.7 MB\n", - "[OK!]\n", - "classifierdl_use_cyberbullying download started this may take some time.\n", - "Approximate size to download 21.3 MB\n", - "[OK!]\n", - "classes: ['sexism', 'neutral', 'racism']\n" - ] - } - ], - "source": [ - "clf_lp_pipeline = get_clf_lp('classifierdl_use_cyberbullying')\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 362, - "status": "ok", - "timestamp": 1664912076600, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "MPGnM_gipwPf", - "outputId": "ab83a3dd-65d8-4390-8457-50c7ef196a9e" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "['sexism']" - ] - }, - "execution_count": 48, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "text ='RT @EBeisner @ahall012 I agree with you!! 
I would rather brush my teeth with sandpaper then watch football with a girl!!'\n", - "\n", - "clf_lp_pipeline.annotate(text)['class']" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 6402, - "status": "ok", - "timestamp": 1664912082988, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "3u9ktm9xp2aF", - "outputId": "4947d747-787d-4932-ca58-37fd709ed84d" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "tfhub_use download started this may take some time.\n", - "Approximate size to download 923.7 MB\n", - "[OK!]\n", - "classifierdl_use_fakenews download started this may take some time.\n", - "Approximate size to download 21.4 MB\n", - "[OK!]\n", - "classes: ['FAKE', 'REAL']\n" - ] - } - ], - "source": [ - "clf_lp_pipeline = get_clf_lp('classifierdl_use_fakenews')\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 54, - "status": "ok", - "timestamp": 1664912082989, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "CNkFNiS4p5GM", - "outputId": "8f2624e0-9e47-4e6b-c090-07813eda3171" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "['FAKE']" - ] - }, - "execution_count": 50, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "text ='Donald Trump a KGB Spy? 11/02/2016 In today’s video, Christopher Greene of AMTV reports Hillary Clinton campaign accusation that Donald Trump is a KGB spy is about as weak and baseless a claim as a Salem witch hunt or McCarthy era trial. It’s only because Hillary Clinton is losing that she is lobbing conspiracy theory. Citizen Quasar The way I see it, one of two things will happen: 1. 
Trump will win by a landslide but the election will be stolen via electronic voting, just like I have been predicting for over a decade, and the American People will accept the skewed election results just like they accept the TSA into their crotches. 2. Somebody will bust a cap in Hillary’s @$$ killing her and the election will be postponed. Follow AMTV!'\n", - "\n", - "clf_lp_pipeline.annotate(text)['class']\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 39, - "status": "ok", - "timestamp": 1664912082990, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "L77nGOzsp859", - "outputId": "f8bbbec1-41f7-4d3f-b4f5-af5ed8ad8f87" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "['REAL']" - ] - }, - "execution_count": 51, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "text ='Sen. Marco Rubio (R-Fla.) is adding a veteran New Hampshire political operative to his team as he continues mulling a possible 2016 presidential bid, the latest sign that he is seriously preparing to launch a campaign later this year.Jim Merrill, who worked for former GOP presidential nominee Mitt Romney and ran his 2008 and 2012 New Hampshire primary campaigns, joined Rubio’s fledgling campaign on Monday, aides to the senator said.Merrill will be joining Rubio’s Reclaim America PAC to focus on Rubio’s New Hampshire and broader Northeast political operations.\"Marco has always been well received in New Hampshire, and should he run for president, he would be very competitive there,\" Terry Sullivan, who runs Reclaim America, said in a statement. 
\"Jim certainly knows how to win in New Hampshire and in the Northeast, and will be a great addition to our team at Reclaim America.”News of Merrill’s hire was first reported by The New York Times.'\n", - "\n", - "clf_lp_pipeline.annotate(text)['class']" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 9046, - "status": "ok", - "timestamp": 1664912092005, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "awBvudTRqApS", - "outputId": "b4ca676b-400e-445b-b3b6-652187c638eb" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "tfhub_use download started this may take some time.\n", - "Approximate size to download 923.7 MB\n", - "[OK!]\n", - "sentimentdl_use_twitter download started this may take some time.\n", - "Approximate size to download 11.4 MB\n", - "[OK!]\n", - "classes: ['positive', 'negative']\n" - ] - } - ], - "source": [ - "sentiment_lp_pipeline = get_clf_lp('sentimentdl_use_twitter', sentiment_dl=True)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 42, - "status": "ok", - "timestamp": 1664912092006, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "5RQ4aVPNqDHn", - "outputId": "c02f64f4-94c4-4af0-8a26-7ebda1c310b5" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "['positive']" - ] - }, - "execution_count": 53, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "text ='I am SO happy the news came out in time for my birthday this weekend! 
My inner 7-year-old cannot WAIT!'\n", - "\n", - "sentiment_lp_pipeline.annotate(text)['class']" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 11183, - "status": "ok", - "timestamp": 1664912103161, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "YON-nknFqG5m", - "outputId": "49552554-f9b4-46c4-c4bf-4c1cc5821b6b" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "tfhub_use download started this may take some time.\n", - "Approximate size to download 923.7 MB\n", - "[OK!]\n", - "classifierdl_use_emotion download started this may take some time.\n", - "Approximate size to download 21.3 MB\n", - "[OK!]\n", - "classes: ['joy', 'fear', 'surprise', 'sadness']\n" - ] - } - ], - "source": [ - "sentiment_lp_pipeline = get_clf_lp('classifierdl_use_emotion', sentiment_dl=False)\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 357, - "status": "ok", - "timestamp": 1664912103493, - "user": { - "displayName": "Merve Ertas Uslu", - "userId": "01451729557099986551" - }, - "user_tz": -120 - }, - "id": "yW4Yim4HqJXn", - "outputId": "542aebb2-1f41-4fe6-ee19-91f7312d1fb5" - }, - "outputs": [ - { - "data": { - "text/plain": [ - "['surprise']" - ] - }, - "execution_count": 55, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "sentiment_lp_pipeline.annotate(text)['class']" - ] - } - ], - "metadata": { - "accelerator": "TPU", - "colab": { - "collapsed_sections": [], - "machine_shape": "hm", - "provenance": [], - "toc_visible": true - }, - "gpuClass": "standard", - "kernelspec": { - "display_name": "base", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": 
"ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.9.12 (main, Apr 5 2022, 06:56:58) \n[GCC 7.5.0]" - }, - "vscode": { - "interpreter": { - "hash": "3d597f4c481aa0f25dceb95d2a0067e73c0966dcbd003d741d821a7208527ecf" - } - } - }, - "nbformat": 4, - "nbformat_minor": 0 -} diff --git a/examples/bakSentenceDetectorDL.ipynb b/examples/bakSentenceDetectorDL.ipynb deleted file mode 100644 index ea02f287ab815a..00000000000000 --- a/examples/bakSentenceDetectorDL.ipynb +++ /dev/null @@ -1,787 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "id": "E0pAkKvH6v7K" - }, - "source": [ - "![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "OzsBWso169YV" - }, - "source": [ - "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/9.SentenceDetectorDL.ipynb)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "id": "jirVPUT0F9bB" - }, - "source": [ - "# SentenceDetectorDL" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "2mg5E3wl8yHp" - }, - "source": [ - "`SentenceDetectorDL` (SDDL) is based on a general-purpose neural network model for sentence boundary detection. The task of sentence boundary detection is to identify sentences within a text. Many natural language processing tasks take a sentence as an input unit, such as part-of-speech tagging, dependency parsing, named entity recognition or machine translation.\n", - "\n", - "In this model, we treated the sentence boundary detection task as a classification problem using a DL CNN architecture. 
We also modified the original implemenation a little bit to cover broken sentences and some impossible end of line chars.\n", - "\n", - "We are releasing two pretrained SDDL models: `english` and `multilanguage` that are trained on `SETimes corpus (Tyers and Alperen, 2010)` and ` Europarl. Wong et al. (2014)` datasets.\n", - "\n", - "Here are the test metrics on various languages for `multilang` model" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "KvNuyGXpD7Nt" - }, - "source": [ - "![image.png]()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "adrTGL-6ECtF" - }, - "source": [ - "**Supported Languages**\n", - "\n", - "`bg Bulgarian`\n", - "\n", - "`bs Bosnian`\n", - "\n", - "`da Danish`\n", - "\n", - "`de German`\n", - "\n", - "`el Greek`\n", - "\n", - "`en English`\n", - "\n", - "`es Spanish`\n", - "\n", - "`fi Finnish`\n", - "\n", - "`fr French`\n", - "\n", - "`hr Croatian`\n", - "\n", - "`it Italian`\n", - "\n", - "`mk Macedonian`\n", - "\n", - "`nl Dutch`\n", - "\n", - "`pt Portuguese`\n", - "\n", - "`ro Romanian`\n", - "\n", - "`sq Albanian`\n", - "\n", - "`sr Serbian`\n", - "\n", - "`sv Swedish`\n", - "\n", - "`tr Turkish`\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "s8h1ee-GaEsn" - }, - "outputs": [], - "source": [ - "! 
pip install -q pyspark==3.3.0 spark-nlp==4.3.0" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 254 - }, - "executionInfo": { - "elapsed": 23293, - "status": "ok", - "timestamp": 1664975024014, - "user": { - "displayName": "Halil SAGLAMLAR", - "userId": "07259164328506563794" - }, - "user_tz": -180 - }, - "id": "6_cR3Syj8wTd", - "outputId": "67700eee-d59b-48b8-c0cc-54dd7dc2f23d" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Spark NLP version 4.3.0\n", - "Apache Spark version: 3.3.0\n" - ] - }, - { - "data": { - "text/html": [ - "\n", - "
\n", - "SparkSession - in-memory\n", - "SparkContext\n", - "Spark UI\n", - "Version: v3.3.0\n", - "Master: local[*]\n", - "AppName: Spark NLP
\n", - " " - ], - "text/plain": [ - "" - ] - }, - "execution_count": 2, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "import sparknlp\n", - "\n", - "from pyspark.ml import PipelineModel\n", - "from sparknlp.annotator import *\n", - "from sparknlp.base import *\n", - "\n", - "spark = sparknlp.start()\n", - "\n", - "print(\"Spark NLP version\", sparknlp.version())\n", - "print(\"Apache Spark version:\", spark.version)\n", - "\n", - "spark" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 16765, - "status": "ok", - "timestamp": 1664975040774, - "user": { - "displayName": "Halil SAGLAMLAR", - "userId": "07259164328506563794" - }, - "user_tz": -180 - }, - "id": "zS46q8E0Aidy", - "outputId": "f067a993-5fcf-4e84-e41d-2c17202ade55" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "sentence_detector_dl download started this may take some time.\n", - "Approximate size to download 354.6 KB\n", - "[OK!]\n" - ] - } - ], - "source": [ - "documenter = DocumentAssembler()\\\n", - " .setInputCol(\"text\")\\\n", - " .setOutputCol(\"document\")\n", - " \n", - "sentencerDL = SentenceDetectorDLModel\\\n", - " .pretrained(\"sentence_detector_dl\", \"en\") \\\n", - " .setInputCols([\"document\"]) \\\n", - " .setOutputCol(\"sentences\")\n", - "\n", - "sd_pipeline = PipelineModel(stages=[documenter, sentencerDL])" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": { - "executionInfo": { - "elapsed": 16, - "status": "ok", - "timestamp": 1664975040775, - "user": { - "displayName": "Halil SAGLAMLAR", - "userId": "07259164328506563794" - }, - "user_tz": -180 - }, - "id": "kHWj4IxqBnwK" - }, - "outputs": [], - "source": [ - "sd_model = LightPipeline(sd_pipeline)" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": { - "colab": { - "base_uri": 
"https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 388, - "status": "ok", - "timestamp": 1664975041148, - "user": { - "displayName": "Halil SAGLAMLAR", - "userId": "07259164328506563794" - }, - "user_tz": -180 - }, - "id": "sBJqnxE6-1uz", - "outputId": "9ce4d659-70e9-459a-cd36-fc2f382bb716" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "0\t0\t15\tJohn loves Mary.\n", - "1\t16\t31\tmary loves Peter\n", - "2\t43\t61\tPeter loves Helen .\n", - "3\t62\t78\tHelen loves John;\n", - "4\t91\t119\tTotal: four. people involved.\n" - ] - } - ], - "source": [ - "text = \"\"\"John loves Mary.mary loves Peter\n", - " Peter loves Helen .Helen loves John; \n", - " Total: four. people involved.\"\"\"\n", - "\n", - "for anno in sd_model.fullAnnotate(text)[0][\"sentences\"]:\n", - " print(\"{}\\t{}\\t{}\\t{}\".format(\n", - " anno.metadata[\"sentence\"], anno.begin, anno.end, anno.result))\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "-ihAtVvIDSJh" - }, - "source": [ - "### Testing with a broken text (random `\\n` chars added)" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 796, - "status": "ok", - "timestamp": 1664975041942, - "user": { - "displayName": "Halil SAGLAMLAR", - "userId": "07259164328506563794" - }, - "user_tz": -180 - }, - "id": "NzRiZYqiCX4t", - "outputId": "e381bffc-04f3-4522-fa12-e4d99d8dd308" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "0\t1\t104\tThere are many NLP tasks like text summarization, question-answering, sentence prediction to name a few.\n", - "1\t106\t170\tOne method to get these tasks done is using a pre-trained model.\n", - "2\t172\t362\tInstead of training a model from scratch for NLP tasks using millions of annotated texts each time, a general language representation is created by training a model on a huge amount 
of data.\n", - "3\t364\t398\tThis is called a pre-trained model.\n", - "4\t400\t479\tThis pre-trained model is then fine-tuned for each NLP tasks according to need.\n", - "5\t481\t520\tLet’s just peek into the pre-BERT world…\n", - "6\t522\t634\tFor creating models, we need words to be represented in a form understood by the training network, ie, numbers.\n", - "7\t636\t731\tThus many algorithms were used to convert words into vectors or more precisely, word embeddings.\n", - "8\t734\t798\tOne of the earliest algorithms used for this purpose is word2vec.\n", - "9\t800\t872\tHowever, the drawback of word2vec models was that they were context-free.\n", - "10\t874\t941\tOne problem caused by this is that they cannot accommodate polysemy.\n", - "11\t943\t1022\tFor example, the word ‘letter’ has a different meaning according to the context.\n", - "12\t1024\t1106\tIt can mean ‘single element of alphabet’ or ‘document addressed to another person’.\n", - "13\t1108\t1163\tBut in word2vec both the letter returns same embeddings.\n" - ] - } - ], - "source": [ - "text = '''\n", - "There are many NLP tasks like text summarization, question-answering, sentence prediction to name a few. One method to get\\n these tasks done is using a pre-trained model. Instead of training \n", - "a model from scratch for NLP tasks using millions of annotated texts each time, a general language representation is created by training a model on a huge amount of data. This is called a pre-trained model. This pre-trained model is \n", - "then fine-tuned for each NLP tasks according to need.\n", - "Let’s just peek into the pre-BERT world…\n", - "For creating models, we need words to be represented in a form \\n understood by the training network, ie, numbers. Thus many algorithms were used to convert words into vectors or more precisely, word embeddings. \n", - "One of the earliest algorithms used for this purpose is word2vec. However, the drawback of word2vec models was that they were context-free. 
One problem caused by this is that they cannot accommodate polysemy. For example, the word ‘letter’ has a different meaning according to the context. It can mean ‘single element of alphabet’ or ‘document addressed to another person’. But in word2vec both the letter returns same embeddings.\n", - "'''\n", - "\n", - "for anno in sd_model.fullAnnotate(text)[0][\"sentences\"]:\n", - " \n", - " print(\"{}\\t{}\\t{}\\t{}\".format(\n", - " anno.metadata[\"sentence\"], anno.begin, anno.end, anno.result.replace('\\n',''))) # removing \\n to beutify printing\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "P1uBCqrnElmi" - }, - "source": [ - "## Compare with Spacy Sentence Splitter" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": { - "executionInfo": { - "elapsed": 6, - "status": "ok", - "timestamp": 1664975041943, - "user": { - "displayName": "Halil SAGLAMLAR", - "userId": "07259164328506563794" - }, - "user_tz": -180 - }, - "id": "8PtDPgliWEdu" - }, - "outputs": [], - "source": [ - "# !pip install spacy" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": { - "executionInfo": { - "elapsed": 15720, - "status": "ok", - "timestamp": 1664975057658, - "user": { - "displayName": "Halil SAGLAMLAR", - "userId": "07259164328506563794" - }, - "user_tz": -180 - }, - "id": "5tHHalKGEoTu" - }, - "outputs": [], - "source": [ - "import spacy\n", - "\n", - "nlp = spacy.load(\"en_core_web_sm\")" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 20, - "status": "ok", - "timestamp": 1664975057659, - "user": { - "displayName": "Halil SAGLAMLAR", - "userId": "07259164328506563794" - }, - "user_tz": -180 - }, - "id": "euCxCFU-Erpv", - "outputId": "4cc0b189-2211-43a6-93db-186ea1bb0f5b" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "John loves Mary.mary loves Peter\n", - "Peter 
loves Helen .Helen\n", - "loves John; \n", - "Total: four.\n", - "people involved.\n" - ] - } - ], - "source": [ - "text = \"\"\"John loves Mary.mary loves Peter\n", - "Peter loves Helen .Helen loves John; \n", - "Total: four. people involved.\"\"\"\n", - "\n", - "for sent in nlp(text).sents:\n", - " print(sent)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Rq1LuycdbpAf" - }, - "source": [ - "## Test with another random broken sentence " - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 404, - "status": "ok", - "timestamp": 1664975058045, - "user": { - "displayName": "Halil SAGLAMLAR", - "userId": "07259164328506563794" - }, - "user_tz": -180 - }, - "id": "S6EUTjAlFFVd", - "outputId": "6f5b24f6-8cf0-493c-dc62-9a25097ad007" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "with Spark NLP SentenceDetectorDL\n", - "===================================\n", - "0\tA California woman who vanished in Utah’s Zion National Park earlier,this month was found and reunited with her family officials said Sunday.\n", - "1\tHolly Suzanne Courtier, 38, was located within the park after a visitor saw her and alerted rangers, the National. Park Service said in a statement.\n", - "2\tAdditional details about how she survived or where she was found were not immediately available.\n", - "3\tIn the statement, Courtier’s relatives said they were “overjoyed” that she’d been found.\n", - "4\tCourtier, of Los Angeles, disappeared after a private shuttle dropped her off on Oct. 6. at the Grotto park area inside the 232-square-mile national park.\n", - "5\tShe was scheduled to be picked up later that afternoon but didn't show up, park officials said.\n", - "6\tThe search included K.9. 
units and federal, state and local rescue teams;\n", - "7\tVolunteers also joined the effort.\n", - "\n", - "with Spacy Sentence Detection\n", - "===================================\n", - "0 \t A California woman who vanished in Utah’s Zion National Park earlier,this month was found and reunited with her family officials said Sunday.\n", - "1 \t Holly Suzanne Courtier, 38, was located within the park after a visitor saw her and alerted rangers, the National.\n", - "2 \t Park Service said in a statement.\n", - "3 \t Additional details about how she survived or where she was found were not immediately available.\n", - "4 \t In the statement, Courtier’s relatives said they were “overjoyed” that she’d been found.\n", - "5 \t Courtier, of Los Angeles, disappeared after a private shuttle dropped her off on Oct. 6.\n", - "6 \t at the Grotto park area inside the 232-square-mile national park.\n", - "7 \t She was scheduled to be picked up later that afternoon but didn't show up, park officials said.\n", - "8 \t The search included K.9.\n", - "9 \t units and federal, state and local rescue teams; Volunteers also joined the effort.\n" - ] - } - ], - "source": [ - "random_broken_text = '''\n", - "A California woman who vanished in Utah’s Zion National Park earlier,\n", - "this month was found and reunited with her family \n", - "officials said Sunday. Holly Suzanne Courtier, \n", - "38, was located within the park after a visitor saw \n", - "her and alerted rangers, the National. Park Service said in a statement.\n", - "Additional details about how she \n", - "survived or where she was found were not immediately available. In the statement, \n", - "Courtier’s relatives said they were “overjoyed” that she’d been found.\n", - "Courtier, of Los Angeles, disappeared after a private shuttle dropped her off on Oct. 6. at the Grotto park area \n", - "inside the 232-square-mile national park. 
She was scheduled to be picked up later that \n", - "afternoon but didn't show up, park officials said. The search included K.9. units and federal, \n", - "state and local rescue teams; Volunteers also joined the effort.\n", - "'''\n", - "\n", - "print ('with Spark NLP SentenceDetectorDL')\n", - "print ('===================================')\n", - "\n", - "for anno in sd_model.fullAnnotate(random_broken_text)[0][\"sentences\"]:\n", - " \n", - " print(\"{}\\t{}\".format(\n", - " anno.metadata[\"sentence\"], anno.result.replace('\\n',''))) # removing \\n to beutify printing\n", - "\n", - "print()\n", - "print ('with Spacy Sentence Detection')\n", - "print ('===================================')\n", - "for i,sent in enumerate(nlp(random_broken_text).sents):\n", - " print(i, '\\t',str(sent).replace('\\n',''))# removing \\n to beutify printing" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "WiU0yHsvmxSv" - }, - "source": [ - "## Multilanguage Sentence Detector DL" - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 6293, - "status": "ok", - "timestamp": 1664975064334, - "user": { - "displayName": "Halil SAGLAMLAR", - "userId": "07259164328506563794" - }, - "user_tz": -180 - }, - "id": "ULFgE7KmkbMa", - "outputId": "42e9545e-4542-470c-fb5c-a3d3b9d537d0" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "sentence_detector_dl download started this may take some time.\n", - "Approximate size to download 514.9 KB\n", - "[OK!]\n" - ] - } - ], - "source": [ - "sentencerDL_multilang = SentenceDetectorDLModel\\\n", - " .pretrained(\"sentence_detector_dl\", \"xx\") \\\n", - " .setInputCols([\"document\"]) \\\n", - " .setOutputCol(\"sentences\")\n", - "\n", - "sd_pipeline_multi = PipelineModel(stages=[documenter, sentencerDL_multilang])\n", - "\n", - "sd_model_multi = LightPipeline(sd_pipeline_multi)" - ] - }, - { 
- "cell_type": "code", - "execution_count": 12, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 758, - "status": "ok", - "timestamp": 1664975065076, - "user": { - "displayName": "Halil SAGLAMLAR", - "userId": "07259164328506563794" - }, - "user_tz": -180 - }, - "id": "YPR4MqEZbPo7", - "outputId": "bd3410e4-5c56-4c82-9c86-af3a6e676c33" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "with Spark NLP SentenceDetectorDL\n", - "===================================\n", - "0\tΌπως ίσως θα γνωρίζει, όταν εγκαθιστάς μια νέα εφαρμογή, θα έχεις διαπιστώσει λίγο μετά, ότι το PC αρχίζει να επιβραδύνεται.\n", - "1\tΣτη συνέχεια, όταν επισκέπτεσαι την οθόνη ή από την διαχείριση εργασιών, θα διαπιστώσεις ότι η εν λόγω εφαρμογή έχει προστεθεί στη λίστα των προγραμμάτων που εκκινούν αυτόματα, όταν ξεκινάς το PC.\n", - "2\tΠροφανώς, κάτι τέτοιο δεν αποτελεί μια ιδανική κατάσταση, ιδίως για τους λιγότερο γνώστες, οι οποίοι ίσως δεν θα συνειδητοποιήσουν ότι κάτι τέτοιο συνέβη.\n", - "3\tΌσο περισσότερες εφαρμογές στη λίστα αυτή, τόσο πιο αργή γίνεται η εκκίνηση, ιδίως αν πρόκειται για απαιτητικές εφαρμογές.\n", - "4\tΤα ευχάριστα νέα είναι ότι η τελευταία και πιο πρόσφατη preview build της έκδοσης των Windows 10 που θα καταφθάσει στο πρώτο μισό του 2021, οι εφαρμογές θα ενημερώνουν το χρήστη ότι έχουν προστεθεί στη λίστα των εφαρμογών που εκκινούν μόλις ανοίγεις το PC.\n", - "\n", - "with Spacy Sentence Detection\n", - "===================================\n", - "0 \t Όπως ίσως θα γνωρίζει, όταν εγκαθιστάς μια νέα εφαρμογή, θα έχεις διαπιστώσει λίγο μετά, ότι το PC αρχίζει να επιβραδύνεται.\n", - "1 \t Στη συνέχεια, όταν επισκέπτεσαι την οθόνη ή από την διαχείριση εργασιών, θα διαπιστώσεις ότι η εν λόγω εφαρμογή έχει προστεθεί στη λίστα των προγραμμάτων που εκκινούν αυτόματα, όταν ξεκινάς το PC.Προφανώς, κάτι τέτοιο δεν αποτελεί μια ιδανική κατάσταση, ιδίως για τους λιγότερο γνώστες, οι 
οποίοι ίσως δεν θα συνειδητοποιήσουν ότι κάτι τέτοιο συνέβη.\n", - "2 \t Όσο περισσότερες εφαρμογές στη λίστα αυτή, τόσο πιο αργή γίνεται η εκκίνηση, ιδίως αν πρόκειται για απαιτητικές εφαρμογές.\n", - "3 \t Τα ευχάριστα νέα είναι ότι η τελευταία και πιο πρόσφατη preview build της έκδοσης των Windows 10 που θα καταφθάσει\n", - "4 \t στο πρώτο μισό του 2021, οι εφαρμογές θα ενημερώνουν το χρήστη ότι έχουν προστεθεί στη λίστα των εφαρμογών που εκκινούν μόλις ανοίγεις το PC.\n" - ] - } - ], - "source": [ - "gr_text= '''\n", - "Όπως ίσως θα γνωρίζει, όταν εγκαθιστάς μια νέα εφαρμογή, θα έχεις διαπιστώσει \n", - "λίγο μετά, ότι το PC αρχίζει να επιβραδύνεται. Στη συνέχεια, όταν επισκέπτεσαι την οθόνη ή από την διαχείριση εργασιών, θα διαπιστώσεις ότι η εν λόγω εφαρμογή έχει προστεθεί στη \n", - "λίστα των προγραμμάτων που εκκινούν αυτόματα, όταν ξεκινάς το PC.\n", - "Προφανώς, κάτι τέτοιο δεν αποτελεί μια ιδανική κατάσταση, ιδίως για τους λιγότερο γνώστες, οι \n", - "οποίοι ίσως δεν θα συνειδητοποιήσουν ότι κάτι τέτοιο συνέβη. Όσο περισσότερες εφαρμογές στη λίστα αυτή, τόσο πιο αργή γίνεται η \n", - "εκκίνηση, ιδίως αν πρόκειται για απαιτητικές εφαρμογές. 
Τα ευχάριστα νέα είναι ότι η τελευταία και πιο πρόσφατη preview build της έκδοσης των Windows 10 που θα καταφθάσει στο πρώτο μισό του 2021, οι εφαρμογές θα \n", - "ενημερώνουν το χρήστη ότι έχουν προστεθεί στη λίστα των εφαρμογών που εκκινούν μόλις ανοίγεις το PC.\n", - "'''\n", - "\n", - "print ('with Spark NLP SentenceDetectorDL')\n", - "print ('===================================')\n", - "\n", - "for anno in sd_model_multi.fullAnnotate(gr_text)[0][\"sentences\"]:\n", - " \n", - " print(\"{}\\t{}\".format(\n", - " anno.metadata[\"sentence\"], anno.result.replace('\\n',''))) # removing \\n to beutify printing\n", - "\n", - "print()\n", - "print ('with Spacy Sentence Detection')\n", - "print ('===================================')\n", - "for i,sent in enumerate(nlp(gr_text).sents):\n", - " print(i, '\\t',str(sent).replace('\\n',''))# removing \\n to beutify printing" - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "executionInfo": { - "elapsed": 9, - "status": "ok", - "timestamp": 1664975065077, - "user": { - "displayName": "Halil SAGLAMLAR", - "userId": "07259164328506563794" - }, - "user_tz": -180 - }, - "id": "sUc7n1wsktJs", - "outputId": "aeb2e4b3-1696-4025-829c-6f8ea9dac4b8" - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "with Spark NLP SentenceDetectorDL\n", - "===================================\n", - "0\tB чeтвъpтъĸ Gооglе oбяви няĸoлĸo aĸтyaлизaции нa cвoятa тъpcaчĸa, зaявявaйĸи чe e въвeлa изĸycтвeн интeлeĸт (Аl) и мaшиннo oбyчeниe зa пoдoбpявaнe нa пoтpeбитeлcĸoтo изживявaнe.\n", - "1\tΠoтpeбитeлитe вeчe мoгaт дa cи тaнaниĸaт, cвиpят или пeят мeлoдия нa пeceн нa Gооglе чpeз мoбилнoтo пpилoжeниe, ĸaтo дoĸocнaт иĸoнaтa нa миĸpoфoнa и зaдaдaт въпpoca: Koя e тaзи пeceн?\n", - "2\tTaнaниĸaнeтo в пpoдължeниe нa 10-15 ceĸyнди щe дaдe шaнc нa aлгopитъмa c мaшиннo oбyчeниe нa Gооglе дa нaмepи и извeдe peзyлтaт ĸoя e пpипявaнaтa пeceн.\n", 
- "3\tΠoнacтoящeм фyнĸциятa e дocтъпнa нa aнглийcĸи eзиĸ зa Іоѕ и нa oĸoлo 20 eзиĸa зa Аndrоіd, ĸaтo в бъдeщe и зa двeтe oпepaциoнни cиcтeми щe бъдe пpeдлoжeн eднaĸъв нaбop oт пoддъpжaни eзици, ĸaзвaт oт Gооglе.\n", - "4\tAl aĸтyaлизaциитe нa тъpceщия гигaнт cъщo oбxвaщaт пpaвoпиca и oбщитe зaявĸи зa тъpceнe.\n", - "5\tCpeд пoдoбpeниятa e вĸлючeн нoв пpaвoпиceн aлгopитъм, ĸoйтo изпoлзвa нeвpoннa мpeжa c дълбoĸo oбyчeниe, зa ĸoятo Gооglе твъpди, чe идвa cъc знaчитeлнo пoдoбpeнa cпocoбнocт зa дeшифpиpaнe нa пpaвoпиcни гpeшĸи.\n", - "\n", - "with Spacy Sentence Detection\n", - "===================================\n", - "0 \t B чeтвъpтъĸ Gооglе oбяви няĸoлĸo aĸтyaлизaции нa cвoятa тъpcaчĸa, зaявявaйĸи чe e въвeлa изĸycтвeн интeлeĸт\n", - "1 \t (Аl) и мaшиннo oбyчeниe зa пoдoбpявaнe нa пoтpeбитeлcĸoтo изживявaнe.\n", - "2 \t Πoтpeбитeлитe вeчe мoгaт дa cи тaнaниĸaт, cвиpят или пeят мeлoдия нa пeceн нa Gооglе чpeз мoбилнoтo пpилoжeниe, ĸaтo дoĸocнaт иĸoнaтa нa миĸpoфoнa и зaдaдaт въпpoca:\n", - "3 \t Koя e тaзи пeceн?Taнaниĸaнeтo в пpoдължeниe нa 10-15 ceĸyнди щe дaдe шaнc нa aлгopитъмa c мaшиннo oбyчeниe\n", - "4 \t нa\n", - "5 \t Gооglе дa нaмepи и извeдe peзyлтaт ĸoя e пpипявaнaтa пeceн.\n", - "6 \t Πoнacтoящeм фyнĸциятa e дocтъпнa нa aнглийcĸи eзиĸ\n", - "7 \t зa\n", - "8 \t Іоѕ и нa oĸoлo 20 eзиĸa зa Аndrоіd, ĸaтo в бъдeщe и зa двeтe oпepaциoнни cиcтeми щe бъдe пpeдлoжeн eднaĸъв нaбop oт пoддъpжaни eзици, ĸaзвaт oт Gооglе.\n", - "9 \t Al aĸтyaлизaциитe нa тъpceщия гигaнт cъщo oбxвaщaт пpaвoпиca и oбщитe зaявĸи зa тъpceнe.\n", - "10 \t Cpeд пoдoбpeниятa e вĸлючeн нoв пpaвoпиceн aлгopитъм, ĸoйтo изпoлзвa нeвpoннa мpeжa c дълбoĸo oбyчeниe, зa ĸoятo Gооglе твъpди, чe идвa cъc знaчитeлнo пoдoбpeнa cпocoбнocт\n", - "11 \t зa дeшифpиpaнe\n", - "12 \t нa пpaвoпиcни гpeшĸи.\n" - ] - } - ], - "source": [ - "cyrillic_text = '''\n", - "B чeтвъpтъĸ Gооglе oбяви няĸoлĸo aĸтyaлизaции нa cвoятa тъpcaчĸa, зaявявaйĸи чe e \n", - "въвeлa изĸycтвeн интeлeĸт (Аl) и мaшиннo oбyчeниe зa 
пoдoбpявaнe нa пoтpeбитeлcĸoтo изживявaнe.\n", - "Πoтpeбитeлитe вeчe мoгaт дa cи тaнaниĸaт, cвиpят или пeят мeлoдия нa пeceн нa Gооglе чpeз мoбилнoтo пpилoжeниe, \n", - "ĸaтo дoĸocнaт иĸoнaтa нa миĸpoфoнa и зaдaдaт въпpoca: Koя e тaзи пeceн?\n", - "Taнaниĸaнeтo в пpoдължeниe нa 10-15 ceĸyнди щe дaдe шaнc нa aлгopитъмa c мaшиннo oбyчeниe нa Gооglе дa нaмepи и извeдe peзyлтaт ĸoя e пpипявaнaтa пeceн.\n", - "Πoнacтoящeм фyнĸциятa e дocтъпнa нa aнглийcĸи eзиĸ зa Іоѕ и нa oĸoлo 20 eзиĸa зa Аndrоіd, \n", - "ĸaтo в бъдeщe и зa двeтe oпepaциoнни cиcтeми щe бъдe пpeдлoжeн eднaĸъв нaбop oт пoддъpжaни eзици, ĸaзвaт oт Gооglе.\n", - "Al aĸтyaлизaциитe нa тъpceщия гигaнт cъщo oбxвaщaт пpaвoпиca и oбщитe зaявĸи зa тъpceнe.\n", - "Cpeд пoдoбpeниятa e вĸлючeн нoв пpaвoпиceн aлгopитъм, ĸoйтo изпoлзвa нeвpoннa мpeжa \n", - "c дълбoĸo oбyчeниe, зa ĸoятo Gооglе твъpди, чe идвa cъc знaчитeлнo пoдoбpeнa cпocoбнocт зa \n", - "дeшифpиpaнe нa пpaвoпиcни гpeшĸи.\n", - "'''\n", - "\n", - "print ('with Spark NLP SentenceDetectorDL')\n", - "print ('===================================')\n", - "\n", - "for anno in sd_model_multi.fullAnnotate(cyrillic_text)[0][\"sentences\"]:\n", - " \n", - " print(\"{}\\t{}\".format(\n", - " anno.metadata[\"sentence\"], anno.result.replace('\\n',''))) # removing \\n to beutify printing\n", - "\n", - "print()\n", - "print ('with Spacy Sentence Detection')\n", - "print ('===================================')\n", - "for i,sent in enumerate(nlp(cyrillic_text).sents):\n", - " print(i, '\\t',str(sent).replace('\\n',''))# removing \\n to beutify printing" - ] - } - ], - "metadata": { - "accelerator": "GPU", - "colab": { - "collapsed_sections": [], - "provenance": [] - }, - "kernelspec": { - "display_name": "base", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - 
"pygments_lexer": "ipython3", - "version": "3.9.12 (main, Apr 5 2022, 06:56:58) \n[GCC 7.5.0]" - }, - "vscode": { - "interpreter": { - "hash": "3d597f4c481aa0f25dceb95d2a0067e73c0966dcbd003d741d821a7208527ecf" - } - } - }, - "nbformat": 4, - "nbformat_minor": 0 -} diff --git a/examples/python/annotation/text/english/named-entity-recognition/ZeroShot_NER.ipynb b/examples/python/annotation/text/english/named-entity-recognition/ZeroShot_NER.ipynb new file mode 100644 index 00000000000000..fe22efffec7348 --- /dev/null +++ b/examples/python/annotation/text/english/named-entity-recognition/ZeroShot_NER.ipynb @@ -0,0 +1,275 @@ +{ + "cells": [ + { + "attachments": {}, + "cell_type": "markdown", + "id": "db5f4f9a-7776-42b3-8758-85624d4c15ea", + "metadata": {}, + "source": [ + "![JohnSnowLabs](https://johnsnowlabs.com/assets/images/logo.png)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "21e9eafb", + "metadata": {}, + "source": [ + "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/annotation/text/english/named-entity-recognition/ZeroShot_NER.ipynb)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "212325cc-182f-4565-abed-9b46864d6d69", + "metadata": {}, + "source": [ + "# Named Entity Recognition with ZeroShotNer" + ] + }, + { + "cell_type": "markdown", + "id": "216EshxBJ9ra", + "metadata": {}, + "source": [ + "## Colab Setup" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f6e6c12b", + "metadata": {}, + "outputs": [], + "source": [ + "!pip install -q pyspark==3.3.0 spark-nlp==4.3.0" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fc39c840", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Spark NLP version: 4.2.8\n", + "Apache Spark version: 3.3.0\n" + ] + }, + { + "data": { + "text/html": [ + "\n", + "
\n", + "SparkSession - in-memory\n", + "SparkContext\n", + "Spark UI\n", + "Version: v3.3.0\n", + "Master: local[*]\n", + "AppName: Spark NLP
\n", + " " + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "import sparknlp\n", + "\n", + "spark = sparknlp.start()\n", + "\n", + "print(\"Spark NLP version: \", sparknlp.version())\n", + "print(\"Apache Spark version: \", spark.version)\n", + "\n", + "spark" + ] + }, + { + "cell_type": "markdown", + "id": "5a32aee6", + "metadata": {}, + "source": [ + "# Zero-shot Named Entity Recognition" + ] + }, + { + "cell_type": "markdown", + "id": "43420eee-1c29-4148-b1c8-fa7884eff9b3", + "metadata": {}, + "source": [ + "`Zero-shot` is an inference paradigm which allows us to use a model for prediction without any previous training step.\n", + "\n", + "To do that, several examples (_hypotheses_) are provided and sent to the language model, which will use `NLI (Natural Language Inference)` to check whether any information found in the text matches the examples (confirms the hypotheses).\n", + "\n", + "NLI usually works by trying to _confirm or reject a hypothesis_. The _hypotheses_ are the `prompts` or examples we are going to provide.
If any piece of information confirm the constructed hypotheses (answer the examples we are given), then the hypotheses is confirmed and the Zero-shot is triggered.\n", + "\n", + "Let's see it in action.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2948d346-d522-43b9-9cd7-99430882621f", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "finner_roberta_zeroshot download started this may take some time.\n", + "[OK!]\n" + ] + } + ], + "source": [ + "from sparknlp.base import *\n", + "from sparknlp.annotator import *\n", + "from pyspark.ml import Pipeline\n", + "\n", + "documentAssembler = DocumentAssembler()\\\n", + " .setInputCol(\"text\")\\\n", + " .setOutputCol(\"document\")\n", + "\n", + "sen = SentenceDetector()\\\n", + " .setInputCols([\"document\"])\\\n", + " .setOutputCol(\"sentence\")\n", + "\n", + "sparktokenizer = Tokenizer()\\\n", + " .setInputCols(\"sentence\")\\\n", + " .setOutputCol(\"token\")\n", + "\n", + "zero_shot_ner = ZeroShotNerModel.pretrained(\"finner_roberta_zeroshot\", \"en\")\\\n", + " .setInputCols([\"sentence\", \"token\"])\\\n", + " .setOutputCol(\"zero_shot_ner\")\\\n", + " .setEntityDefinitions(\n", + " {\n", + " \"DATE\": ['When was the company acquisition?', 'When was the company purchase agreement?'],\n", + " \"ORG\": [\"Which company was acquired?\"],\n", + " \"PRODUCT\": [\"Which product?\"],\n", + " \"PROFIT_INCREASE\": [\"How much has the gross profit increased?\"],\n", + " \"REVENUES_DECLINED\": [\"How much has the revenues declined?\"],\n", + " \"OPERATING_LOSS_2020\": [\"Which was the operating loss in 2020\"],\n", + " \"OPERATING_LOSS_2019\": [\"Which was the operating loss in 2019\"]\n", + " })\n", + "\n", + "nerconverter = NerConverter()\\\n", + " .setInputCols([\"sentence\", \"token\", \"zero_shot_ner\"])\\\n", + " .setOutputCol(\"ner_chunk\")\n", + "\n", + "pipeline = Pipeline(stages=[\n", + " documentAssembler,\n", + " sen,\n", + " 
sparktokenizer,\n", + " zero_shot_ner,\n", + " nerconverter\n", + " ]\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b005b29f-f0c0-44dd-baac-590166d6bf8c", + "metadata": {}, + "outputs": [], + "source": [ + "from pyspark.sql.types import StructType,StructField, StringType\n", + "sample_text = [\"In March 2012, as part of a longer-term strategy, the Company acquired Vertro, Inc., which owned and operated the ALOT product portfolio.\",\n", + " \"In February 2017, the Company entered into an asset purchase agreement with NetSeer, Inc.\",\n", + " \"While our gross profit margin increased to 81.4% in 2020 from 63.1% in 2019, our revenues declined approximately 27% in 2020 as compared to 2019.\",\n", + " \"We reported an operating loss of approximately $8,048,581 million in 2020 as compared to an operating loss of $7,738,193 in 2019.\"]\n", + "\n", + "p_model = pipeline.fit(spark.createDataFrame([[\"\"]]).toDF(\"text\"))\n", + "\n", + "res = p_model.transform(spark.createDataFrame(sample_text, StringType()).toDF(\"text\"))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "183fb2db-1cee-4f78-a486-dd6c9f6abd57", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "+------------------+-------------------+\n", + "|chunk |ner_label |\n", + "+------------------+-------------------+\n", + "|March 2012 |DATE |\n", + "|Vertro |ORG |\n", + "|ALOT |PRODUCT |\n", + "|February 2017 |DATE |\n", + "|NetSeer |ORG |\n", + "|81.4% |PROFIT_INCREASE |\n", + "|27% |REVENUES_DECLINED |\n", + "|$8,048,581 million|OPERATING_LOSS_2020|\n", + "|$7,738,193 |OPERATING_LOSS_2019|\n", + "|2019 |DATE |\n", + "+------------------+-------------------+\n", + "\n" + ] + } + ], + "source": [ + "from pyspark.sql import functions as F\n", + "\n", + "res.select(F.explode(F.arrays_zip(res.ner_chunk.result, res.ner_chunk.begin, res.ner_chunk.end, res.ner_chunk.metadata)).alias(\"cols\")) \\\n", + " 
.select(F.expr(\"cols['0']\").alias(\"chunk\"),\n", + " F.expr(\"cols['3']['entity']\").alias(\"ner_label\"))\\\n", + " .filter(\"ner_label!='O'\")\\\n", + " .show(truncate=False)" + ] + } + ], + "metadata": { + "accelerator": "GPU", + "colab": { + "provenance": [] + }, + "gpuClass": "standard", + "kernelspec": { + "display_name": "nlpdev", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3" + }, + "vscode": { + "interpreter": { + "hash": "cf73c0c97d90b2660ff29b0c9bed4b851524d3484a00df4555e25832aa5cf188" + } + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/python/sparknlp/annotator/ner/zero_shot_ner_model.py b/python/sparknlp/annotator/ner/zero_shot_ner_model.py index 508f33820c32ef..91412258299775 100644 --- a/python/sparknlp/annotator/ner/zero_shot_ner_model.py +++ b/python/sparknlp/annotator/ner/zero_shot_ner_model.py @@ -25,6 +25,9 @@ class ZeroShotNerModel(RoBertaForQuestionAnswering, HasEngine): specifying a set of questions for each entity. The model is based on RoBertaForQuestionAnswering. + For more extended examples see the + `Examples `__. + Pretrained models can be loaded with ``pretrained`` of the companion object: .. code-block:: python diff --git a/src/main/scala/com/johnsnowlabs/nlp/annotators/ner/dl/ZeroShotNerModel.scala b/src/main/scala/com/johnsnowlabs/nlp/annotators/ner/dl/ZeroShotNerModel.scala index b99a7809e81f43..2dd777c72510fb 100644 --- a/src/main/scala/com/johnsnowlabs/nlp/annotators/ner/dl/ZeroShotNerModel.scala +++ b/src/main/scala/com/johnsnowlabs/nlp/annotators/ner/dl/ZeroShotNerModel.scala @@ -40,6 +40,9 @@ import scala.collection.JavaConverters._ * specifying a set of questions for each entity. The model is based on * RoBertaForQuestionAnswering. 
* + * For more extended examples see the + * [[https://github.com/JohnSnowLabs/spark-nlp/blob/master/examples/python/annotation/text/english/named-entity-recognition/ZeroShot_NER.ipynb Examples]] + * * Pretrained models can be loaded with `pretrained` of the companion object: * {{{ * val zeroShotNer = ZeroShotNerModel.pretrained()