diff --git a/docs/en/transformer_entries/ZeroShotNer.md b/docs/en/transformer_entries/ZeroShotNer.md
index f9623d9330980c..1d7a96e1528921 100644
--- a/docs/en/transformer_entries/ZeroShotNer.md
+++ b/docs/en/transformer_entries/ZeroShotNer.md
@@ -11,6 +11,9 @@ used to recognize entities. The definitions of entities is given by a dictionary
specifying a set of questions for each entity. The model is based on
RoBertaForQuestionAnswering.
+For more extended examples, see the
+[ZeroShot NER example notebook](https://github.com/JohnSnowLabs/spark-nlp/blob/master/examples/python/annotation/text/english/named-entity-recognition/ZeroShot_NER.ipynb).
+
Pretrained models can be loaded with `pretrained` of the companion object:
```scala
diff --git a/examples/bak2.Text_Preprocessing_with_SparkNLP_Annotators_Transformers.ipynb b/examples/bak2.Text_Preprocessing_with_SparkNLP_Annotators_Transformers.ipynb
deleted file mode 100644
index cd9972a1a0bb43..00000000000000
--- a/examples/bak2.Text_Preprocessing_with_SparkNLP_Annotators_Transformers.ipynb
+++ /dev/null
@@ -1,7131 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "sXatvRX899i0"
- },
- "source": [
- "![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "9XsAEBYVxeB-"
- },
- "source": [
- "\n",
- "\n",
- "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/2.Text_Preprocessing_with_SparkNLP_Annotators_Transformers.ipynb)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "Xj5fx5ir-wMt"
- },
- "source": [
- "# **Text Preprocessing with Spark NLP**"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "H_SG0VCrix5p"
- },
- "source": [
- "**Note** Read this article if you want to understand the basic concepts in Spark NLP.\n",
- "\n",
- "https://towardsdatascience.com/introduction-to-spark-nlp-foundations-and-basic-components-part-i-c83b7629ed59"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "MfkkKkbVF309"
- },
- "source": [
- "## **0. Colab Setup**"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "iMkMQtZNF2n-"
- },
- "outputs": [],
- "source": [
- "!pip install -q pyspark==3.3.0 spark-nlp==4.3.0"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "SS07N80gEtSt"
- },
- "source": [
- "### **1. Annotators and Transformer Concepts**"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "g3_ic8K7E0sy"
- },
- "source": [
- "In Spark NLP, all Annotators are either Estimators or Transformers as we see in Spark ML. An Estimator in Spark ML is an algorithm which can be fit on a DataFrame to produce a Transformer. E.g., a learning algorithm is an Estimator which trains on a DataFrame and produces a model. A Transformer is an algorithm which can transform one DataFrame into another DataFrame. E.g., an ML model is a Transformer that transforms a DataFrame with features into a DataFrame with predictions.\n",
- "In Spark NLP, there are two types of annotators: AnnotatorApproach and AnnotatorModel.\n",
- "AnnotatorApproach extends Estimators from Spark ML, which are meant to be trained through fit(), and AnnotatorModel extends Transformers which are meant to transform data frames through transform().\n",
- "Some Spark NLP annotators have a Model suffix and some do not. The Model suffix is explicitly stated when the annotator is the result of a training process. Some annotators, such as Tokenizer, are transformers but do not carry the Model suffix since they are not trained annotators. Model annotators have a pretrained() method on their static object to retrieve the public pre-trained version of a model.\n",
- "Long story short, if it trains on a DataFrame and produces a model, it’s an AnnotatorApproach; if it transforms one DataFrame into another DataFrame through some model, it’s an AnnotatorModel (e.g. WordEmbeddingsModel); and it doesn’t take the Model suffix if it doesn’t rely on a pre-trained annotator while transforming a DataFrame (e.g. Tokenizer)."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "x6SaPXwtFBM-"
- },
- "source": [
- "By convention, there are three possible names:\n",
- "\n",
- "**Approach** — Trainable annotator\n",
- "\n",
- "**Model** — Trained annotator\n",
- "\n",
- "**nothing** — Either a non-trainable annotator with pre-processing\n",
- "step or shorthand for a model\n",
- "\n",
- "So for example, Stemmer doesn’t say Approach or Model, yet it is a Model. Tokenizer, on the other hand, doesn’t say Approach or Model either, but it has a TokenizerModel(): it is not “training” anything, but it does some preprocessing before being converted into a Model.\n",
- "When in doubt, please refer to the official documentation and API reference.\n",
- "Even though we will do many hands-on practices in the following articles, let us give you a glimpse of the difference between AnnotatorApproach and AnnotatorModel.\n",
- "As stated above, Tokenizer is converted into a TokenizerModel, so we need to call fit() and then transform()."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "ALiQ2TsOFNyc"
- },
- "source": [
- "Now let’s see how this can be done in Spark NLP using Annotators and Transformers. Assume that we have the following steps that need to be applied one by one on a data frame.\n",
- "\n",
- "- Split text into sentences\n",
- "- Tokenize\n",
- "- Normalize\n",
- "- Get word embeddings"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "K0Yy8L-pFb27"
- },
- "source": [
- "![image.png]()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "suLa96N7Fijt"
- },
- "source": [
- "**What’s actually happening under the hood?**\n",
- "\n",
- "When we call fit() on the pipeline with a Spark data frame (df), its text column is first fed into the DocumentAssembler() transformer, and a new column “document” of Document type (AnnotatorType) is created. As we mentioned before, this transformer is the initial entry point to Spark NLP for any Spark data frame. Then the document column is fed into SentenceDetector() (AnnotatorModel), the text is split into an array of sentences, and a new column “sentences” of Document type is created. Then the “sentences” column is fed into Tokenizer() (AnnotatorApproach), each sentence is tokenized, and a new column “token” of Token type is created. And so on."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 254
- },
- "executionInfo": {
- "elapsed": 25398,
- "status": "ok",
- "timestamp": 1664906807242,
- "user": {
- "displayName": "Merve Ertas Uslu",
- "userId": "01451729557099986551"
- },
- "user_tz": -120
- },
- "id": "SDasO3DbKu2Z",
- "outputId": "41f67d0d-9012-4c34-f111-b57a8109c482"
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Spark NLP version: 4.3.0\n",
- "Apache Spark version: 3.3.0\n"
- ]
- },
- {
- "data": {
- "text/html": [
- "\n",
- "\n",
- " "
- ],
- "text/plain": [
- " regexToken\n",
- "0 1\n",
- "1 .\n",
- "2 T1\n",
- "3 -\n",
- "4 T2\n",
- "5 DATE\n",
- "6 *\n",
- "7 *\n",
- "8 [\n",
- "9 12/24/13\n",
- "10 ]\n",
- "11 $\n",
- "12 1\n",
- "13 .\n",
- "14 99\n",
- "15 ()\n",
- "16 (10/12)\n",
- "17 ,\n",
- "18 ph\n",
- "19 +\n",
- "20 90\n",
- "21 %"
- ]
- },
- "execution_count": 57,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "import pyspark.sql.functions as F\n",
- "\n",
- "result_df = result.select(F.explode(result.regexToken.result).alias('regexToken')).toPandas()\n",
- "result_df"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "l_LM44ZzgYhs"
- },
- "source": [
- "## Stacking Spark NLP Annotators in Spark ML Pipeline"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "bm0mUMQMhFPU"
- },
- "source": [
- "Spark NLP provides an easy API to integrate with Spark ML Pipelines, and all Spark NLP annotators and transformers can be used within them. So it’s best to explain the Pipeline concept through the official Spark ML documentation.\n",
- "\n",
- "What is a Pipeline anyway? In machine learning, it is common to run a sequence of algorithms to process and learn from data. \n",
- "\n",
- "Apache Spark ML represents such a workflow as a Pipeline, which consists of a sequence of PipelineStages (Transformers and Estimators) to be run in a specific order.\n",
- "\n",
- "In simple terms, a Pipeline chains multiple Transformers and Estimators together to specify a machine learning workflow.\n",
- "\n",
- "The figure below is for the training time usage of a Pipeline."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "jK5AAYQqhRlG"
- },
- "source": [
- "![image.png]()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "dwLlY7i4hhq1"
- },
- "source": [
- "A Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator. These stages are run in order, and the input DataFrame is transformed as it passes through each stage. That is, the data are passed through the fitted pipeline in order. Each stage’s transform() method updates the dataset and passes it to the next stage. With the help of Pipelines, we can ensure that training and test data go through identical feature processing steps.\n",
- "\n",
- "Now let’s see how this can be done in Spark NLP using Annotators and Transformers. Assume that we have the following steps that need to be applied one by one on a data frame.\n",
- "\n",
- "- Split text into sentences\n",
- "- Tokenize\n",
- "\n",
- "And here is how we code this pipeline up in Spark NLP."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "_2mZXDVehhDU"
- },
- "outputs": [],
- "source": [
- "from pyspark.ml import Pipeline\n",
- "\n",
- "documentAssembler = DocumentAssembler()\\\n",
- " .setInputCol(\"text\")\\\n",
- " .setOutputCol(\"document\")\n",
- "\n",
- "sentenceDetector = SentenceDetector()\\\n",
- " .setInputCols(['document'])\\\n",
- " .setOutputCol('sentences')\n",
- "\n",
- "tokenizer = Tokenizer() \\\n",
- " .setInputCols([\"sentences\"]) \\\n",
- " .setOutputCol(\"token\")\n",
- "\n",
- "nlpPipeline = Pipeline(stages=[documentAssembler, \n",
- " sentenceDetector,\n",
- " tokenizer])\n",
- "\n",
- "spark_df = spark.read.text('./sample-sentences-en.txt').toDF('text')\n",
- "\n",
- "pipelineModel = nlpPipeline.fit(spark_df)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "executionInfo": {
- "elapsed": 374,
- "status": "ok",
- "timestamp": 1664907213434,
- "user": {
- "displayName": "Merve Ertas Uslu",
- "userId": "01451729557099986551"
- },
- "user_tz": -120
- },
- "id": "9Rq_CRWN6Zge",
- "outputId": "22f861c9-ea47-4f60-9817-acb0e9c8cd53"
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "+-----------------------------------------------------------------------------+\n",
- "|text |\n",
- "+-----------------------------------------------------------------------------+\n",
- "|Peter is a very good person. |\n",
- "|My life in Russia is very interesting. |\n",
- "|John and Peter are brothers. However they don't support each other that much.|\n",
- "|Lucas Nogal Dunbercker is no longer happy. He has a good car though. |\n",
- "|Europe is very culture rich. There are huge churches! and big houses! |\n",
- "+-----------------------------------------------------------------------------+\n",
- "\n"
- ]
- }
- ],
- "source": [
- "spark_df = spark.read.text('./sample-sentences-en.txt').toDF('text')\n",
- "\n",
- "spark_df.show(truncate=False)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "JuhTX4-Vk-cd"
- },
- "outputs": [],
- "source": [
- "result = pipelineModel.transform(spark_df)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "executionInfo": {
- "elapsed": 448,
- "status": "ok",
- "timestamp": 1664907217901,
- "user": {
- "displayName": "Merve Ertas Uslu",
- "userId": "01451729557099986551"
- },
- "user_tz": -120
- },
- "id": "iaWf94QPlT51",
- "outputId": "7a44f67b-af29-48c5-f830-203b51459e6e"
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+\n",
- "| text| document| sentences| token|\n",
- "+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+\n",
- "| Peter is a very good person.|[{document, 0, 27, Peter is a very go...|[{document, 0, 27, Peter is a very go...|[{token, 0, 4, Peter, {sentence -> 0}...|\n",
- "| My life in Russia is very interesting.|[{document, 0, 37, My life in Russia ...|[{document, 0, 37, My life in Russia ...|[{token, 0, 1, My, {sentence -> 0}, [...|\n",
- "|John and Peter are brothers. However ...|[{document, 0, 76, John and Peter are...|[{document, 0, 27, John and Peter are...|[{token, 0, 3, John, {sentence -> 0},...|\n",
- "|Lucas Nogal Dunbercker is no longer h...|[{document, 0, 67, Lucas Nogal Dunber...|[{document, 0, 41, Lucas Nogal Dunber...|[{token, 0, 4, Lucas, {sentence -> 0}...|\n",
- "|Europe is very culture rich. There ar...|[{document, 0, 68, Europe is very cul...|[{document, 0, 27, Europe is very cul...|[{token, 0, 5, Europe, {sentence -> 0...|\n",
- "+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+\n",
- "\n"
- ]
- }
- ],
- "source": [
- "result.show(truncate=40)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "executionInfo": {
- "elapsed": 4,
- "status": "ok",
- "timestamp": 1664907219318,
- "user": {
- "displayName": "Merve Ertas Uslu",
- "userId": "01451729557099986551"
- },
- "user_tz": -120
- },
- "id": "zfz0_-eFlXzk",
- "outputId": "13fe22c1-11a7-451b-dfc3-2bc50a653bd8"
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "root\n",
- " |-- text: string (nullable = true)\n",
- " |-- document: array (nullable = true)\n",
- " | |-- element: struct (containsNull = true)\n",
- " | | |-- annotatorType: string (nullable = true)\n",
- " | | |-- begin: integer (nullable = false)\n",
- " | | |-- end: integer (nullable = false)\n",
- " | | |-- result: string (nullable = true)\n",
- " | | |-- metadata: map (nullable = true)\n",
- " | | | |-- key: string\n",
- " | | | |-- value: string (valueContainsNull = true)\n",
- " | | |-- embeddings: array (nullable = true)\n",
- " | | | |-- element: float (containsNull = false)\n",
- " |-- sentences: array (nullable = true)\n",
- " | |-- element: struct (containsNull = true)\n",
- " | | |-- annotatorType: string (nullable = true)\n",
- " | | |-- begin: integer (nullable = false)\n",
- " | | |-- end: integer (nullable = false)\n",
- " | | |-- result: string (nullable = true)\n",
- " | | |-- metadata: map (nullable = true)\n",
- " | | | |-- key: string\n",
- " | | | |-- value: string (valueContainsNull = true)\n",
- " | | |-- embeddings: array (nullable = true)\n",
- " | | | |-- element: float (containsNull = false)\n",
- " |-- token: array (nullable = true)\n",
- " | |-- element: struct (containsNull = true)\n",
- " | | |-- annotatorType: string (nullable = true)\n",
- " | | |-- begin: integer (nullable = false)\n",
- " | | |-- end: integer (nullable = false)\n",
- " | | |-- result: string (nullable = true)\n",
- " | | |-- metadata: map (nullable = true)\n",
- " | | | |-- key: string\n",
- " | | | |-- value: string (valueContainsNull = true)\n",
- " | | |-- embeddings: array (nullable = true)\n",
- " | | | |-- element: float (containsNull = false)\n",
- "\n"
- ]
- }
- ],
- "source": [
- "result.printSchema()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "executionInfo": {
- "elapsed": 323,
- "status": "ok",
- "timestamp": 1664907221583,
- "user": {
- "displayName": "Merve Ertas Uslu",
- "userId": "01451729557099986551"
- },
- "user_tz": -120
- },
- "id": "599Y4hQsl_mF",
- "outputId": "a64afdac-956c-43f9-a386-f27fc17f0dc9"
- },
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[Row(result=['Peter is a very good person.']),\n",
- " Row(result=['My life in Russia is very interesting.']),\n",
- " Row(result=['John and Peter are brothers.', \"However they don't support each other that much.\"])]"
- ]
- },
- "execution_count": 63,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "result.select('sentences.result').take(3)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "executionInfo": {
- "elapsed": 8,
- "status": "ok",
- "timestamp": 1664907223220,
- "user": {
- "displayName": "Merve Ertas Uslu",
- "userId": "01451729557099986551"
- },
- "user_tz": -120
- },
- "id": "ehzhHXu6luaF",
- "outputId": "b79b06aa-3f1a-47c4-d688-f1c78052f2c5"
- },
- "outputs": [
- {
- "data": {
- "text/plain": [
- "Row(token=[Row(annotatorType='token', begin=0, end=3, result='John', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=5, end=7, result='and', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=9, end=13, result='Peter', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=15, end=17, result='are', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=19, end=26, result='brothers', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=27, end=27, result='.', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=29, end=35, result='However', metadata={'sentence': '1'}, embeddings=[]), Row(annotatorType='token', begin=37, end=40, result='they', metadata={'sentence': '1'}, embeddings=[]), Row(annotatorType='token', begin=42, end=46, result=\"don't\", metadata={'sentence': '1'}, embeddings=[]), Row(annotatorType='token', begin=48, end=54, result='support', metadata={'sentence': '1'}, embeddings=[]), Row(annotatorType='token', begin=56, end=59, result='each', metadata={'sentence': '1'}, embeddings=[]), Row(annotatorType='token', begin=61, end=65, result='other', metadata={'sentence': '1'}, embeddings=[]), Row(annotatorType='token', begin=67, end=70, result='that', metadata={'sentence': '1'}, embeddings=[]), Row(annotatorType='token', begin=72, end=75, result='much', metadata={'sentence': '1'}, embeddings=[]), Row(annotatorType='token', begin=76, end=76, result='.', metadata={'sentence': '1'}, embeddings=[])])"
- ]
- },
- "execution_count": 64,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "result.select('token').take(3)[2]"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "42dSp9dGmtmr"
- },
- "source": [
- "## Normalizer"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "spOjcducnAsR"
- },
- "source": [
- "Removes all dirty characters from text following a regex pattern and transforms words based on a provided dictionary.\n",
- "\n",
- "`setCleanupPatterns(patterns)`: list of regular expressions for normalization, defaults to `[^A-Za-z]`\n",
- "\n",
- "`setLowercase(value)`: whether to lowercase tokens, default false\n",
- "\n",
- "`setSlangDictionary(path)`: txt file with delimited words to be transformed into something else\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 35
- },
- "executionInfo": {
- "elapsed": 523,
- "status": "ok",
- "timestamp": 1664907226445,
- "user": {
- "displayName": "Merve Ertas Uslu",
- "userId": "01451729557099986551"
- },
- "user_tz": -120
- },
- "id": "h6XKX2l7_Jqk",
- "outputId": "49463712-e54d-4651-ca8e-5686901f2c7c"
- },
- "outputs": [
- {
- "data": {
- "application/vnd.google.colaboratory.intrinsic+json": {
- "type": "string"
- },
- "text/plain": [
- "'!\"#$%&\\'()*+,-./:;<=>?@[\\\\]^_`{|}~'"
- ]
- },
- "execution_count": 65,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "import string\n",
- "string.punctuation"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "6hq2ZBWl_WMu"
- },
- "outputs": [],
- "source": [
- "from sparknlp.base import *\n",
- "from sparknlp.annotator import *\n",
- "\n",
- "documentAssembler = DocumentAssembler()\\\n",
- " .setInputCol(\"text\")\\\n",
- " .setOutputCol(\"document\")\n",
- "\n",
- "tokenizer = Tokenizer() \\\n",
- " .setInputCols([\"document\"]) \\\n",
- " .setOutputCol(\"token\")\n",
- " \n",
- "normalizer = Normalizer() \\\n",
- " .setInputCols([\"token\"]) \\\n",
- " .setOutputCol(\"normalized\")\\\n",
- " .setLowercase(True)\\\n",
- " .setCleanupPatterns([\"[^\\w\\d\\s]\"]) # remove punctuations (keep alphanumeric chars)\n",
- " # if we don't set CleanupPatterns, it will only keep alphabet letters ([^A-Za-z])\n",
- "\n",
- "nlpPipeline = Pipeline(stages=[documentAssembler, \n",
- " tokenizer,\n",
- " normalizer])\n",
- "\n",
- "result = nlpPipeline.fit(spark_df).transform(spark_df)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "executionInfo": {
- "elapsed": 358,
- "status": "ok",
- "timestamp": 1664907247428,
- "user": {
- "displayName": "Merve Ertas Uslu",
- "userId": "01451729557099986551"
- },
- "user_tz": -120
- },
- "id": "25YyJJYXppji",
- "outputId": "6de6f1d5-668e-44d2-ea63-0245d72f5902"
- },
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[DocumentAssembler_d4926309c3ee,\n",
- " REGEX_TOKENIZER_a4789e51e51c,\n",
- " NORMALIZER_81bbdb7b0bdb]"
- ]
- },
- "execution_count": 68,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "nlpPipeline.fit(spark_df).stages"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "executionInfo": {
- "elapsed": 795,
- "status": "ok",
- "timestamp": 1664907249847,
- "user": {
- "displayName": "Merve Ertas Uslu",
- "userId": "01451729557099986551"
- },
- "user_tz": -120
- },
- "id": "oUp4au5eoYrw",
- "outputId": "fc0588b5-3e3f-4895-f9ab-2800eaf253ce"
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+\n",
- "| text| document| token| normalized|\n",
- "+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+\n",
- "| Peter is a very good person.|[{document, 0, 27, Peter is a very go...|[{token, 0, 4, Peter, {sentence -> 0}...|[{token, 0, 4, peter, {sentence -> 0}...|\n",
- "| My life in Russia is very interesting.|[{document, 0, 37, My life in Russia ...|[{token, 0, 1, My, {sentence -> 0}, [...|[{token, 0, 1, my, {sentence -> 0}, [...|\n",
- "|John and Peter are brothers. However ...|[{document, 0, 76, John and Peter are...|[{token, 0, 3, John, {sentence -> 0},...|[{token, 0, 3, john, {sentence -> 0},...|\n",
- "|Lucas Nogal Dunbercker is no longer h...|[{document, 0, 67, Lucas Nogal Dunber...|[{token, 0, 4, Lucas, {sentence -> 0}...|[{token, 0, 4, lucas, {sentence -> 0}...|\n",
- "|Europe is very culture rich. There ar...|[{document, 0, 68, Europe is very cul...|[{token, 0, 5, Europe, {sentence -> 0...|[{token, 0, 5, europe, {sentence -> 0...|\n",
- "+----------------------------------------+----------------------------------------+----------------------------------------+----------------------------------------+\n",
- "\n"
- ]
- }
- ],
- "source": [
- "result.show(truncate=40)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "executionInfo": {
- "elapsed": 7,
- "status": "ok",
- "timestamp": 1664907250848,
- "user": {
- "displayName": "Merve Ertas Uslu",
- "userId": "01451729557099986551"
- },
- "user_tz": -120
- },
- "id": "zxS0MEoM02wl",
- "outputId": "c24602d7-feda-425a-c138-fee038f592cf"
- },
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[Row(token=[Row(annotatorType='token', begin=0, end=4, result='Peter', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=6, end=7, result='is', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=9, end=9, result='a', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=11, end=14, result='very', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=16, end=19, result='good', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=21, end=26, result='person', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=27, end=27, result='.', metadata={'sentence': '0'}, embeddings=[])]),\n",
- " Row(token=[Row(annotatorType='token', begin=0, end=1, result='My', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=3, end=6, result='life', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=8, end=9, result='in', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=11, end=16, result='Russia', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=18, end=19, result='is', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=21, end=24, result='very', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=26, end=36, result='interesting', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=37, end=37, result='.', metadata={'sentence': '0'}, embeddings=[])])]"
- ]
- },
- "execution_count": 70,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "result.select('token').take(2)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "executionInfo": {
- "elapsed": 542,
- "status": "ok",
- "timestamp": 1664907252725,
- "user": {
- "displayName": "Merve Ertas Uslu",
- "userId": "01451729557099986551"
- },
- "user_tz": -120
- },
- "id": "xYQcnFVloa8R",
- "outputId": "ff7780a5-bfc7-4523-8820-b6577918bb5f"
- },
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[Row(result=['peter', 'is', 'a', 'very', 'good', 'person']),\n",
- " Row(result=['my', 'life', 'in', 'russia', 'is', 'very', 'interesting'])]"
- ]
- },
- "execution_count": 71,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "result.select('normalized.result').take(2)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "executionInfo": {
- "elapsed": 9,
- "status": "ok",
- "timestamp": 1664907253166,
- "user": {
- "displayName": "Merve Ertas Uslu",
- "userId": "01451729557099986551"
- },
- "user_tz": -120
- },
- "id": "dy6TLD9c1LTg",
- "outputId": "ad940727-807d-42c5-e9ca-a415d419e080"
- },
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[Row(normalized=[Row(annotatorType='token', begin=0, end=4, result='peter', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=6, end=7, result='is', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=9, end=9, result='a', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=11, end=14, result='very', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=16, end=19, result='good', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=21, end=26, result='person', metadata={'sentence': '0'}, embeddings=[])]),\n",
- " Row(normalized=[Row(annotatorType='token', begin=0, end=1, result='my', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=3, end=6, result='life', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=8, end=9, result='in', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=11, end=16, result='russia', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=18, end=19, result='is', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=21, end=24, result='very', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='token', begin=26, end=36, result='interesting', metadata={'sentence': '0'}, embeddings=[])])]"
- ]
- },
- "execution_count": 72,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "result.select('normalized').take(2)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "__iJ4EMeVb3n"
- },
- "source": [
- "## Document Normalizer"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "wfLIupJFZi3c"
- },
- "source": [
- "The DocumentNormalizer is an annotator that can be used after the DocumentAssembler to normalize documents once they have been processed and indexed.\n",
- "It takes as input annotated documents of type Array(AnnotatorType.DOCUMENT) and outputs annotated documents of type AnnotatorType.DOCUMENT.\n",
- "\n",
- "Parameters are: \n",
- "\n",
- "| Parameter | Description |\n",
- "| - | - |\n",
- "|**inputCol** |input column name string which targets a column of type Array(AnnotatorType.DOCUMENT).|\n",
- "|**outputCol** |output column name string which targets a column of type AnnotatorType.DOCUMENT.|\n",
- "|**action** |action string to perform when applying regex patterns, i.e. \"clean\" or \"extract\". Default is \"clean\".|\n",
- "|**cleanupPatterns** |normalization regex patterns whose matches will be removed from the document. Default is \"<[^>]*>\" (e.g., it removes all HTML tags).|\n",
- "|**replacement** |replacement string to apply when regexes match. Default is \" \".|\n",
- "|**lowercase** |whether to convert strings to lowercase. Default is False.|\n",
- "|**removalPolicy** |policy used to remove patterns from text. Valid policy values are: \"all\", \"pretty_all\", \"first\", \"pretty_first\". Default is \"pretty_all\". |\n",
- "|**encoding** |file encoding to apply on normalized documents. Supported encodings are: UTF_8, UTF_16, US_ASCII, ISO-8859-1, UTF-16BE, UTF-16LE. Default is \"UTF-8\".|\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "8Tj1c6UYhSzK"
- },
- "outputs": [],
- "source": [
- "text = '''\n",
- "\n",
- " THE WORLD'S LARGEST WEB DEVELOPER SITE\n",
- "THE WORLD'S LARGEST WEB DEVELOPER SITE\n",
- "Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum..\n",
- " "
- ],
- "text/plain": [
- " chunks entities\n",
- "0 Peter Parker PER\n",
- "1 New York LOC\n",
- "2 Bruce Wayne PER\n",
- "3 Gotham City LOC"
- ]
- },
- "execution_count": 23,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# fullAnnotate in LightPipeline\n",
- "\n",
- "light_model = LightPipeline(pipelineModel)\n",
- "\n",
- "light_result = light_model.fullAnnotate('Peter Parker is a nice persn and lives in New York. Bruce Wayne is also a nice guy and lives in Gotham City.')\n",
- "\n",
- "\n",
- "chunks = []\n",
- "entities = []\n",
- "\n",
- "for n in light_result[0]['ner_chunk']:\n",
- " \n",
- " chunks.append(n.result)\n",
- " entities.append(n.metadata['entity']) \n",
- " \n",
- " \n",
- "import pandas as pd\n",
- "\n",
- "df = pd.DataFrame({'chunks':chunks, 'entities':entities})\n",
- "\n",
- "df"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "TkLVVip_7_FP"
- },
- "source": [
- "### NER with BertForTokenClassification\n",
- "\n",
- "[BertForTokenClassification](https://nlp.johnsnowlabs.com/docs/en/transformers#bertfortokenclassification) can load Bert Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks.\n",
- "\n",
- "For more examples of BertForTokenClassification models, please check [Transformers for Token Classification in Spark NLP notebook](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/14.Transformers_for_Token_Classification_in_Spark_NLP.ipynb). \n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "vDQU8ngXTGwr"
- },
- "source": [
- "Pretrained models can be loaded with `pretrained()` of the companion object. The default model is `\"bert_base_token_classifier_conll03\"`, if no name is provided.\n",
- "\n",
- "**Here are the BERT-based token classification models available in Spark NLP:**\n",
- "\n",
- " \n",
- "\n",
- "| Title | Name | Language |\n",
- "|:-----------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------|:-----------|\n",
- "| BERT Token Classification - NER CoNLL (bert_base_token_classifier_conll03) | bert_base_token_classifier_conll03 | en |\n",
- "| BERT Token Classification - NER OntoNotes (bert_base_token_classifier_ontonote) | bert_base_token_classifier_ontonote | en |\n",
- "| BERT Token Classification Large - NER CoNLL (bert_large_token_classifier_conll03) | bert_large_token_classifier_conll03 | en |\n",
- "| BERT Token Classification Large - NER OntoNotes (bert_large_token_classifier_ontonote) | bert_large_token_classifier_ontonote | en |\n",
- "| BERT Token Classification - ParsBERT for Persian Language Understanding (bert_token_classifier_parsbert_armanner) | bert_token_classifier_parsbert_armanner | fa |\n",
- "| BERT Token Classification - ParsBERT for Persian Language Understanding (bert_token_classifier_parsbert_ner) | bert_token_classifier_parsbert_ner | fa |\n",
- "| BERT Token Classification - ParsBERT for Persian Language Understanding (bert_token_classifier_parsbert_peymaner) | bert_token_classifier_parsbert_peymaner | fa |\n",
- "| BERT Token Classification - BETO Spanish Language Understanding (bert_token_classifier_spanish_ner) | bert_token_classifier_spanish_ner | es |\n",
- "| BERT Token Classification - Swedish Language Understanding (bert_token_classifier_swedish_ner) | bert_token_classifier_swedish_ner | sv |\n",
- "| BERT Token Classification - Turkish Language Understanding (bert_token_classifier_turkish_ner) | bert_token_classifier_turkish_ner | tr |\n",
- "| DistilBERT Token Classification - NER CoNLL (distilbert_base_token_classifier_conll03) | distilbert_base_token_classifier_conll03 | en |\n",
- "| DistilBERT Token Classification - NER OntoNotes (distilbert_base_token_classifier_ontonotes) | distilbert_base_token_classifier_ontonotes | en |\n",
- "| DistilBERT Token Classification - DistilbertNER for Persian Language Understanding (distilbert_token_classifier_persian_ner) | distilbert_token_classifier_persian_ner | fa |\n",
- "| BERT Token Classification - Few-NERD (bert_base_token_classifier_few_nerd) | bert_base_token_classifier_few_nerd | en |\n",
- "| DistilBERT Token Classification - Few-NERD (distilbert_base_token_classifier_few_nerd) | distilbert_base_token_classifier_few_nerd | en |\n",
- "| Named Entity Recognition for Japanese (BertForTokenClassification) | bert_token_classifier_ner_ud_gsd | ja |\n",
- "| Detect PHI for Deidentification (BertForTokenClassifier) | bert_token_classifier_ner_deid | en |\n",
- "| Detect Clinical Entities (BertForTokenClassifier) | bert_token_classifier_ner_jsl | en |\n",
- "| Detect Drug Chemicals (BertForTokenClassifier) | bert_token_classifier_ner_drugs | en |\n",
- "| Detect Clinical Entities (Slim version, BertForTokenClassifier) | bert_token_classifier_ner_jsl_slim | en |\n",
- "| ALBERT Token Classification Base - NER CoNLL (albert_base_token_classifier_conll03) | albert_base_token_classifier_conll03 | en |\n",
- "| ALBERT Token Classification Large - NER CoNLL (albert_large_token_classifier_conll03) | albert_large_token_classifier_conll03 | en |\n",
- "| ALBERT Token Classification XLarge - NER CoNLL (albert_xlarge_token_classifier_conll03) | albert_xlarge_token_classifier_conll03 | en |\n",
- "| DistilRoBERTa Token Classification - NER OntoNotes (distilroberta_base_token_classifier_ontonotes) | distilroberta_base_token_classifier_ontonotes | en |\n",
- "| RoBERTa Token Classification Base - NER CoNLL (roberta_base_token_classifier_conll03) | roberta_base_token_classifier_conll03 | en |\n",
- "| RoBERTa Token Classification Base - NER OntoNotes (roberta_base_token_classifier_ontonotes) | roberta_base_token_classifier_ontonotes | en |\n",
- "| RoBERTa Token Classification Large - NER CoNLL (roberta_large_token_classifier_conll03) | roberta_large_token_classifier_conll03 | en |\n",
- "| RoBERTa Token Classification Large - NER OntoNotes (roberta_large_token_classifier_ontonotes) | roberta_large_token_classifier_ontonotes | en |\n",
- "| RoBERTa Token Classification For Persian (roberta_token_classifier_zwnj_base_ner) | roberta_token_classifier_zwnj_base_ner | fa |\n",
- "| XLM-RoBERTa Token Classification Base - NER XTREME (xlm_roberta_token_classifier_ner_40_lang) | xlm_roberta_token_classifier_ner_40_lang | xx |\n",
- "| XLNet Token Classification Base - NER CoNLL (xlnet_base_token_classifier_conll03) | xlnet_base_token_classifier_conll03 | en |\n",
- "| XLNet Token Classification Large - NER CoNLL (xlnet_large_token_classifier_conll03) | xlnet_large_token_classifier_conll03 | en |\n",
- "| Detect Adverse Drug Events (BertForTokenClassification) | bert_token_classifier_ner_ade | en |\n",
- "| Detect Anatomical Regions (BertForTokenClassification) | bert_token_classifier_ner_anatomy | en |\n",
- "| Detect Bacterial Species (BertForTokenClassification) | bert_token_classifier_ner_bacteria | en |\n",
- "| XLM-RoBERTa Token Classification Base - NER CoNLL (xlm_roberta_base_token_classifier_conll03) | xlm_roberta_base_token_classifier_conll03 | en |\n",
- "| XLM-RoBERTa Token Classification Base - NER OntoNotes (xlm_roberta_base_token_classifier_ontonotes) | xlm_roberta_base_token_classifier_ontonotes | en |\n",
- "| Longformer Token Classification Base - NER CoNLL (longformer_base_token_classifier_conll03) | longformer_base_token_classifier_conll03 | en |\n",
- "| Longformer Token Classification Large - NER CoNLL (longformer_large_token_classifier_conll03) | longformer_large_token_classifier_conll03 | en |\n",
- "| Detect Chemicals in Medical text (BertForTokenClassification) | bert_token_classifier_ner_chemicals | en |\n",
- "| Detect Chemical Compounds and Genes (BertForTokenClassifier) | bert_token_classifier_ner_chemprot | en |\n",
- "| Detect Cancer Genetics (BertForTokenClassification) | bert_token_classifier_ner_bionlp | en |\n",
- "| Detect Cellular/Molecular Biology Entities (BertForTokenClassification) | bert_token_classifier_ner_cellular | en |\n",
- "| Detect concepts in drug development trials (BertForTokenClassification) | bert_token_classifier_drug_development_trials | en |"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "oBCnq6tnThMw"
- },
- "source": [
- "**You can find all these models and more [HERE](https://nlp.johnsnowlabs.com/models?edition=Spark+NLP)**"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "executionInfo": {
- "elapsed": 33616,
- "status": "ok",
- "timestamp": 1664911868468,
- "user": {
- "displayName": "Merve Ertas Uslu",
- "userId": "01451729557099986551"
- },
- "user_tz": -120
- },
- "id": "EYaPujeVLUsu",
- "outputId": "6e65404b-4f0c-4cc6-c50f-76960e38ea15"
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "bert_base_token_classifier_conll03 download started this may take some time.\n",
- "Approximate size to download 385.4 MB\n",
- "[OK!]\n"
- ]
- }
- ],
- "source": [
- "# BertForTokenClassification takes both the document and token columns as input\n",
- "tokenClassifier = BertForTokenClassification.pretrained('bert_base_token_classifier_conll03', 'en') \\\n",
- " .setInputCols(['document', 'token']) \\\n",
- " .setOutputCol(\"ner\")"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "executionInfo": {
- "elapsed": 3202,
- "status": "ok",
- "timestamp": 1664911871653,
- "user": {
- "displayName": "Merve Ertas Uslu",
- "userId": "01451729557099986551"
- },
- "user_tz": -120
- },
- "id": "Ga2fLVNG8Az7",
- "outputId": "207c4249-a13f-45a7-d77a-82cea0c2c139"
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "+-------------------------------------------+---------+\n",
- "|chunk |ner_label|\n",
- "+-------------------------------------------+---------+\n",
- "|Turner Newall |ORG |\n",
- "|Federal Mogul |ORG |\n",
- "|TORONTO |LOC |\n",
- "|Canada |LOC |\n",
- "|Ansari X Prize |MISC |\n",
- "|University of Louisville |ORG |\n",
- "|Mike Fitzpatrick |PER |\n",
- "|Southern California's |LOC |\n",
- "|British Department for Education and Skills|ORG |\n",
- "|DfES |ORG |\n",
- "|Music Manifesto |MISC |\n",
- "|Netsky |MISC |\n",
- "|Sasser |MISC |\n",
- "|Sophos |ORG |\n",
- "|Jaschan |PER |\n",
- "|Germany |LOC |\n",
- "|Netsky |MISC |\n",
- "|Sasser |MISC |\n",
- "|GPG/OpenPGP |MISC |\n",
- "|PGP |MISC |\n",
- "+-------------------------------------------+---------+\n",
- "only showing top 20 rows\n",
- "\n"
- ]
- }
- ],
- "source": [
- "nlpPipeline = Pipeline(\n",
- " stages=[\n",
- " documentAssembler, \n",
- " tokenizer,\n",
- " tokenClassifier,\n",
- " ner_converter\n",
- " ])\n",
- "\n",
- "result = nlpPipeline.fit(news_df).transform(news_df.limit(10))\n",
- "\n",
- "result.select(F.explode(F.arrays_zip(result.ner_chunk.result, \n",
- " result.ner_chunk.metadata)).alias(\"cols\")) \\\n",
- " .select(F.expr(\"cols['0']\").alias(\"chunk\"),\n",
- " F.expr(\"cols['1']['entity']\").alias(\"ner_label\")).show(truncate=False)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "URB45j2dAYql"
- },
- "source": [
- "### Multi-Lingual NER\n",
- "These NER models can extract entities from text in a variety of languages.\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "4XgIxV9O8nLb"
- },
- "source": [
- "#### Multi-Lingual NER (XLM-RoBERTa)\n",
- "[XlmRoBertaForTokenClassification](https://nlp.johnsnowlabs.com/docs/en/transformers#xlmrobertafortokenclassification) can load XLM-RoBERTa Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks.\n",
- "\n",
- "\n",
- "\n",
- "\n",
- "| Spark NLP Model Name | language | predicted_entities | Class | Number of Languages supported |\n",
- "|:-----------------------------------------|:-----------|:-------------------------------------------------------|:--------------------------------|:-----------------------|\n",
- "| ner_wikiner_glove_840B_300 | xx | ['B-LOC', 'I-LOC', 'B-ORG', 'I-ORG', 'B-PER', 'I-PER'] | NerDLModel |8 |\n",
- "| ner_wikiner_xlm_roberta_base | xx | ['B-LOC', 'I-LOC', 'B-ORG', 'I-ORG', 'B-PER', 'I-PER'] | NerDLModel |8 |\n",
- "| ner_xtreme_glove_840B_300 | xx | ['B-LOC', 'I-LOC', 'B-ORG', 'I-ORG', 'B-PER', 'I-PER'] | NerDLModel |40 |\n",
- "| ner_xtreme_xlm_roberta_xtreme_base | xx | ['B-LOC', 'I-LOC', 'B-ORG', 'I-ORG', 'B-PER', 'I-PER'] | NerDLModel |40 | \n",
- "| xlm_roberta_token_classifier_ner_40_lang | xx | ['LOC', 'ORG', 'PER', 'O'] | XlmRoBertaForTokenClassification |40 | \n",
- "\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "executionInfo": {
- "elapsed": 68422,
- "status": "ok",
- "timestamp": 1664911940070,
- "user": {
- "displayName": "Merve Ertas Uslu",
- "userId": "01451729557099986551"
- },
- "user_tz": -120
- },
- "id": "V5YtYY3w3OfJ",
- "outputId": "8745aae9-9371-4a2e-edea-c2f16c8f03e6"
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "xlm_roberta_token_classifier_ner_40_lang download started this may take some time.\n",
- "Approximate size to download 921.6 MB\n",
- "[OK!]\n"
- ]
- }
- ],
- "source": [
- "tokenClassifier = XlmRoBertaForTokenClassification.pretrained('xlm_roberta_token_classifier_ner_40_lang', 'xx') \\\n",
- " .setInputCols(['token', 'document']) \\\n",
- " .setOutputCol('ner')"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "executionInfo": {
- "elapsed": 4283,
- "status": "ok",
- "timestamp": 1664911944344,
- "user": {
- "displayName": "Merve Ertas Uslu",
- "userId": "01451729557099986551"
- },
- "user_tz": -120
- },
- "id": "25M5P9ur2mf0",
- "outputId": "bfd99210-c26f-49cb-b98f-ee97f171060f"
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "+--------------+---------+\n",
- "|token |ner_label|\n",
- "+--------------+---------+\n",
- "|Peter |PER |\n",
- "|Parker |PER |\n",
- "|is |O |\n",
- "|a |O |\n",
- "|nice |O |\n",
- "|lad |O |\n",
- "|and |O |\n",
- "|lives |O |\n",
- "|in |O |\n",
- "|New |LOC |\n",
- "|York |LOC |\n",
- "|Das |O |\n",
- "|Schloss |ORG |\n",
- "|Charlottenburg|ORG |\n",
- "|in |O |\n",
- "|Berlin |LOC |\n",
- "|ist |O |\n",
- "|eines |O |\n",
- "|der |O |\n",
- "|schoensten |O |\n",
- "|Staedte |O |\n",
- "|in |O |\n",
- "|Deutschland |LOC |\n",
- "|sagen |O |\n",
- "|viele |O |\n",
- "|Menschen |O |\n",
- "|Peter |PER |\n",
- "|Parker |PER |\n",
- "|est |O |\n",
- "|un |O |\n",
- "|gentil |O |\n",
- "|garçon |O |\n",
- "|et |O |\n",
- "|vit |O |\n",
- "|à |O |\n",
- "|New |LOC |\n",
- "|York |LOC |\n",
- "|پیٹر |PER |\n",
- "|پارکر |PER |\n",
- "|ایک |O |\n",
- "|اچھا |O |\n",
- "|لڑکا |O |\n",
- "|ہے |O |\n",
- "|اور |O |\n",
- "|وہ |O |\n",
- "|نیو |LOC |\n",
- "|یارک |LOC |\n",
- "|میں |O |\n",
- "|رہتا |O |\n",
- "|ھے |O |\n",
- "+--------------+---------+\n",
- "\n"
- ]
- }
- ],
- "source": [
- "from pyspark.sql.types import StringType\n",
- "from pyspark.sql import functions as F\n",
- "\n",
- "# No need for NER Converter\n",
- "nlpPipeline = Pipeline(stages=[documentAssembler, \n",
- " tokenizer,\n",
- " tokenClassifier,])\n",
- "\n",
- "text = [\n",
- "'Peter Parker is a nice lad and lives in New York', \n",
- "'Das Schloss Charlottenburg in Berlin ist eines der schoensten Staedte in Deutschland sagen viele Menschen',\n",
- "'Peter Parker est un gentil garçon et vit à New York',\n",
- "'پیٹر پارکر ایک اچھا لڑکا ہے اور وہ نیو یارک میں رہتا ھے',\n",
- "]\n",
- "data_set = spark.createDataFrame(text, StringType()).toDF(\"text\")\n",
- "result = nlpPipeline.fit(data_set).transform(data_set)\n",
- "\n",
- "\n",
- "result.select(F.explode(F.arrays_zip(result.token.result, \n",
- " result.ner.result)).alias(\"cols\")) \\\n",
- " .select(F.expr(\"cols['0']\").alias('token'),\n",
- " F.expr(\"cols['1']\").alias(\"ner_label\")).show(100,truncate=False)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "8FqaDA6asy2E"
- },
- "source": [
- "## Highlight the entities"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "m0gqRk_uRcl9"
- },
- "outputs": [],
- "source": [
- "# Install spark-nlp-display\n",
- "! pip install -q spark-nlp-display"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "executionInfo": {
- "elapsed": 15949,
- "status": "ok",
- "timestamp": 1664911974770,
- "user": {
- "displayName": "Merve Ertas Uslu",
- "userId": "01451729557099986551"
- },
- "user_tz": -120
- },
- "id": "325EZWIh5X_n",
- "outputId": "98d1b0fd-067e-4833-b32c-aea978af42d7"
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "recognize_entities_dl download started this may take some time.\n",
- "Approx size to download 160.1 MB\n",
- "[OK!]\n"
- ]
- }
- ],
- "source": [
- "from sparknlp.pretrained import PretrainedPipeline\n",
- "\n",
- "pipeline = PretrainedPipeline('recognize_entities_dl', lang='en')"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "executionInfo": {
- "elapsed": 2054,
- "status": "ok",
- "timestamp": 1664911976788,
- "user": {
- "displayName": "Merve Ertas Uslu",
- "userId": "01451729557099986551"
- },
- "user_tz": -120
- },
- "id": "A-rigiYR55D_",
- "outputId": "6eb9216e-c6cf-46cf-9653-bce0b4a95e6e"
- },
- "outputs": [
- {
- "data": {
- "text/plain": [
- "dict_keys(['entities', 'document', 'token', 'ner', 'embeddings', 'sentence'])"
- ]
- },
- "execution_count": 30,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "ann_text = pipeline.fullAnnotate('Peter Parker is a nice persn and lives in New York. Bruce Wayne is also a nice guy and lives in Gotham City.')[0]\n",
- "ann_text.keys()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 299
- },
- "executionInfo": {
- "elapsed": 25,
- "status": "ok",
- "timestamp": 1664911978443,
- "user": {
- "displayName": "Merve Ertas Uslu",
- "userId": "01451729557099986551"
- },
- "user_tz": -120
- },
- "id": "JLXjjUFk57eO",
- "outputId": "4f402e23-6231-4117-be0f-d1178c78a58d"
- },
- "outputs": [
- {
- "data": {
- "text/html": [
- "\n",
- "\n",
- " Peter Parker PER is a nice persn and lives in New York LOC. Bruce Wayne PER is also a nice guy and lives in Gotham City LOC."
- ],
- "text/plain": [
- ""
- ]
- },
- "metadata": {},
- "output_type": "display_data"
- },
- {
- "data": {
- "text/html": [
- "\n",
- "\n",
- " Peter Parker PER is a nice persn and lives in New York LOC. Bruce Wayne PER is also a nice guy and lives in Gotham City LOC."
- ],
- "text/plain": [
- ""
- ]
- },
- "metadata": {},
- "output_type": "display_data"
- },
- {
- "data": {
- "text/html": [
- "\n",
- "\n",
- " Peter Parker PER is a nice persn and lives in New York. Bruce Wayne PER is also a nice guy and lives in Gotham City."
- ],
- "text/plain": [
- ""
- ]
- },
- "metadata": {},
- "output_type": "display_data"
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "\n",
- "Color code for label: \n",
- "\"LOC\": #008080\n",
- "\"PER\": #800080\n"
- ]
- }
- ],
- "source": [
- "from sparknlp_display import NerVisualizer\n",
- "\n",
- "visualiser = NerVisualizer()\n",
- "visualiser.display(ann_text, label_col='entities', document_col='document')\n",
- "\n",
- "# Change color of an entity label\n",
- "visualiser.set_label_colors({'LOC':'#008080', 'PER':'#800080'})\n",
- "visualiser.display(ann_text, label_col='entities')\n",
- "\n",
- "# Set label filter\n",
- "visualiser.display(ann_text, label_col='entities', document_col='document',\n",
- " labels=['PER'])\n",
- "\n",
- "print('\\nColor code for label: \\n\"LOC\": {}\\n\"PER\": {}'.format(visualiser.get_label_color('LOC'), visualiser.get_label_color('PER')))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "v5ZbEW96mZ03"
- },
- "source": [
- "## Using Pretrained ClassifierDL and SentimentDL models"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "FKjGyOcwmiQm"
- },
- "source": [
- "| Name | Spark NLP Model Reference | Language |\n",
- "|:------------------------------------------------------------------------------------|:-------------------------------------------|:-----------|\n",
- "| TREC(50) Question Classifier | classifierdl_use_trec50 | en |\n",
- "| TREC(6) Question Classifier | classifierdl_use_trec6 | en |\n",
- "| Cyberbullying Classifier | classifierdl_use_cyberbullying | en |\n",
- "| Emotion Detection Classifier | classifierdl_use_emotion | en |\n",
- "| Fake News Classifier | classifierdl_use_fakenews | en |\n",
- "| Sarcasm Classifier | classifierdl_use_sarcasm | en |\n",
- "| Spam Classifier | classifierdl_use_spam | en |\n",
- "| Classifier for Adverse Drug Events | classifierdl_ade_biobert | en |\n",
- "| Classifier for Adverse Drug Events using Clinical Bert | classifierdl_ade_clinicalbert | en |\n",
- "| Classifier for Adverse Drug Events in Small Conversations | classifierdl_ade_conversational_biobert | en |\n",
- "| Classifier for Genders - BIOBERT | classifierdl_gender_biobert | en |\n",
- "| Classifier for Genders - SBERT | classifierdl_gender_sbert | en |\n",
- "| PICO Classifier | classifierdl_pico_biobert | en |\n",
- "| End-to-End (E2E) and data-driven NLG Challenge | multiclassifierdl_use_e2e | en |\n",
- "| Toxic Comment Classification | multiclassifierdl_use_toxic | en |\n",
- "| Toxic Comment Classification - Small | multiclassifierdl_use_toxic_sm | en |\n",
- "| Intent Classification for Airline Traffic Information System queries (ATIS dataset) | classifierdl_use_atis | en |\n",
- "| Identify intent in general text - SNIPS dataset | classifierdl_use_snips | en |\n",
- "| News Classifier of Turkish text | classifierdl_bert_news | tr |\n",
- "| News Classifier of German text | classifierdl_bert_news | de |\n",
- "| Cyberbullying Classifier in Turkish texts. | classifierdl_berturk_cyberbullying | tr |\n",
- "| Question Pair Classifier | classifierdl_electra_questionpair | en |\n",
- "| Question Pair Classifier Pipeline | classifierdl_electra_questionpair_pipeline | en |\n",
- "| News Classifier Pipeline for Turkish text | classifierdl_bert_news_pipeline | tr |"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "executionInfo": {
- "elapsed": 8393,
- "status": "ok",
- "timestamp": 1664911986815,
- "user": {
- "displayName": "Merve Ertas Uslu",
- "userId": "01451729557099986551"
- },
- "user_tz": -120
- },
- "id": "1q7Ju4HHmagE",
- "outputId": "5a82ba79-e637-411b-eb66-2d94f1ce618c"
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "classifierdl_use_fakenews download started this may take some time.\n",
- "Approximate size to download 21.4 MB\n",
- "[OK!]\n"
- ]
- }
- ],
- "source": [
- "fake_classifier = ClassifierDLModel.pretrained('classifierdl_use_fakenews', 'en') \\\n",
- " .setInputCols([\"document\", \"sentence_embeddings\"]) \\\n",
- " .setOutputCol(\"class\")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "nLYQd97Qmt1a"
- },
- "source": [
- "The fake news classifier was trained on `https://raw.githubusercontent.com/joolsa/fake_real_news_dataset/master/fake_or_real_news.csv.zip`."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "executionInfo": {
- "elapsed": 16,
- "status": "ok",
- "timestamp": 1664911986816,
- "user": {
- "displayName": "Merve Ertas Uslu",
- "userId": "01451729557099986551"
- },
- "user_tz": -120
- },
- "id": "4lJPUe-KmqBN",
- "outputId": "cb97cb62-41b0-45ab-bc47-691fdf2106c4"
- },
- "outputs": [
- {
- "data": {
- "text/plain": [
- "['FAKE', 'REAL']"
- ]
- },
- "execution_count": 33,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "fake_classifier.getClasses()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "executionInfo": {
- "elapsed": 59205,
- "status": "ok",
- "timestamp": 1664912050172,
- "user": {
- "displayName": "Merve Ertas Uslu",
- "userId": "01451729557099986551"
- },
- "user_tz": -120
- },
- "id": "xY03ThozmwIE",
- "outputId": "871908e1-5384-47a5-dc69-5b21a5bf4fa4"
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "tfhub_use download started this may take some time.\n",
- "Approximate size to download 923.7 MB\n",
- "[OK!]\n"
- ]
- }
- ],
- "source": [
- "documentAssembler = DocumentAssembler()\\\n",
- " .setInputCol(\"text\")\\\n",
- " .setOutputCol(\"document\")\n",
- "\n",
- "use = UniversalSentenceEncoder.pretrained(lang=\"en\") \\\n",
- " .setInputCols([\"document\"])\\\n",
- " .setOutputCol(\"sentence_embeddings\")\n",
- "\n",
- "nlpPipeline = Pipeline(stages=[documentAssembler, \n",
- " use,\n",
- " fake_classifier])\n",
- "\n",
- "empty_data = spark.createDataFrame([[\"\"]]).toDF(\"text\")\n",
- "\n",
- "fake_clf_model = nlpPipeline.fit(empty_data)\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "dLXj1wBWm0gQ"
- },
- "outputs": [],
- "source": [
- "!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Public/data/spam_ham_dataset.csv"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "executionInfo": {
- "elapsed": 359,
- "status": "ok",
- "timestamp": 1664912051072,
- "user": {
- "displayName": "Merve Ertas Uslu",
- "userId": "01451729557099986551"
- },
- "user_tz": -120
- },
- "id": "HlFOegD3ofvP",
- "outputId": "ab303357-4482-4bd0-84a5-3d01e4d060fe"
- },
- "outputs": [
- {
- "data": {
- "text/plain": [
- "{'document': ['BREAKING: Leaked Picture Of Obama Being Dragged Before A Judge In Handcuffs For Wiretapping Trump'],\n",
- " 'sentence_embeddings': ['BREAKING: Leaked Picture Of Obama Being Dragged Before A Judge In Handcuffs For Wiretapping Trump'],\n",
- " 'class': ['FAKE']}"
- ]
- },
- "execution_count": 36,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "fake_lp_pipeline = LightPipeline(fake_clf_model)\n",
- "\n",
- "text = 'BREAKING: Leaked Picture Of Obama Being Dragged Before A Judge In Handcuffs For Wiretapping Trump'\n",
- "\n",
- "fake_lp_pipeline.annotate(text)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "executionInfo": {
- "elapsed": 232,
- "status": "ok",
- "timestamp": 1664912051299,
- "user": {
- "displayName": "Merve Ertas Uslu",
- "userId": "01451729557099986551"
- },
- "user_tz": -120
- },
- "id": "MrOQLnBcofh5",
- "outputId": "88ed92e7-3f07-44fd-b865-567de8654a41"
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "+-------------------------------------------------------------------------------------------------+\n",
- "|text |\n",
- "+-------------------------------------------------------------------------------------------------+\n",
- "|BREAKING: Leaked Picture Of Obama Being Dragged Before A Judge In Handcuffs For Wiretapping Trump|\n",
- "+-------------------------------------------------------------------------------------------------+\n",
- "\n"
- ]
- }
- ],
- "source": [
- "sample_data = spark.createDataFrame([[text]]).toDF(\"text\")\n",
- "\n",
- "sample_data.show(truncate=False)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "executionInfo": {
- "elapsed": 526,
- "status": "ok",
- "timestamp": 1664912051821,
- "user": {
- "displayName": "Merve Ertas Uslu",
- "userId": "01451729557099986551"
- },
- "user_tz": -120
- },
- "id": "ZutmHe8tofOv",
- "outputId": "b65f3a39-c765-4baf-8bb1-b3256a1695c5"
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "+--------------------+--------------------+--------------------+--------------------+\n",
- "| text| document| sentence_embeddings| class|\n",
- "+--------------------+--------------------+--------------------+--------------------+\n",
- "|BREAKING: Leaked ...|[{document, 0, 96...|[{sentence_embedd...|[{category, 0, 96...|\n",
- "+--------------------+--------------------+--------------------+--------------------+\n",
- "\n"
- ]
- }
- ],
- "source": [
- "pred = fake_clf_model.transform(sample_data)\n",
- "\n",
- "pred.show()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "executionInfo": {
- "elapsed": 748,
- "status": "ok",
- "timestamp": 1664912052565,
- "user": {
- "displayName": "Merve Ertas Uslu",
- "userId": "01451729557099986551"
- },
- "user_tz": -120
- },
- "id": "S1VaVxd7pCAI",
- "outputId": "a367a2cc-fe82-4f78-c6bd-f37a3b76af31"
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "+-------------------------------------------------------------------------------------------------+------+\n",
- "|text |result|\n",
- "+-------------------------------------------------------------------------------------------------+------+\n",
- "|BREAKING: Leaked Picture Of Obama Being Dragged Before A Judge In Handcuffs For Wiretapping Trump|[FAKE]|\n",
- "+-------------------------------------------------------------------------------------------------+------+\n",
- "\n"
- ]
- }
- ],
- "source": [
- "pred.select('text','class.result').show(truncate=False)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "5wDEdQ99pIw0"
- },
- "source": [
- "You can find more samples here: `https://github.com/KaiDMML/FakeNewsNet/tree/master/dataset`\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "executionInfo": {
- "elapsed": 27,
- "status": "ok",
- "timestamp": 1664912052566,
- "user": {
- "displayName": "Merve Ertas Uslu",
- "userId": "01451729557099986551"
- },
- "user_tz": -120
- },
- "id": "J1RzlrnzS9Ry",
- "outputId": "54b72f89-43fd-43ce-8a6c-cf2117fef07f"
- },
- "outputs": [
- {
- "data": {
- "text/plain": [
- "{'document': ['Joseph Robinette Biden Jr. is an American politician who is the 46th and current president of the United States.'],\n",
- " 'sentence_embeddings': ['Joseph Robinette Biden Jr. is an American politician who is the 46th and current president of the United States.'],\n",
- " 'class': ['REAL']}"
- ]
- },
- "execution_count": 40,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "fake_lp_pipeline = LightPipeline(fake_clf_model)\n",
- "\n",
- "text = \"Joseph Robinette Biden Jr. is an American politician who is the 46th and current president of the United States.\"\n",
- "\n",
- "fake_lp_pipeline.annotate(text)\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "8X5ftW_kpVS-"
- },
- "source": [
- "## Generic classifier function"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "2AsjZNMFpVy2"
- },
- "outputs": [],
- "source": [
- "def get_clf_lp(model_name, sentiment_dl=False, pretrained=True):\n",
- "\n",
- " documentAssembler = DocumentAssembler()\\\n",
- " .setInputCol(\"text\")\\\n",
- " .setOutputCol(\"document\")\n",
- "\n",
- " use = UniversalSentenceEncoder.pretrained(lang=\"en\") \\\n",
- " .setInputCols([\"document\"])\\\n",
- " .setOutputCol(\"sentence_embeddings\")\n",
- "\n",
- "\n",
- " if pretrained:\n",
- "\n",
- " if sentiment_dl:\n",
- "\n",
- " document_classifier = SentimentDLModel.pretrained(model_name, 'en') \\\n",
- " .setInputCols([\"document\", \"sentence_embeddings\"]) \\\n",
- " .setOutputCol(\"class\")\n",
- " else:\n",
- " document_classifier = ClassifierDLModel.pretrained(model_name, 'en') \\\n",
- " .setInputCols([\"document\", \"sentence_embeddings\"]) \\\n",
- " .setOutputCol(\"class\")\n",
- "\n",
- " else:\n",
- "\n",
- " if sentiment_dl:\n",
- "\n",
- " document_classifier = SentimentDLModel.load(model_name) \\\n",
- " .setInputCols([\"document\", \"sentence_embeddings\"]) \\\n",
- " .setOutputCol(\"class\")\n",
- " else:\n",
- " document_classifier = ClassifierDLModel.load(model_name) \\\n",
- " .setInputCols([\"document\", \"sentence_embeddings\"]) \\\n",
- " .setOutputCol(\"class\")\n",
- "\n",
- " print ('classes:',document_classifier.getClasses())\n",
- "\n",
- " nlpPipeline = Pipeline(stages=[\n",
- " documentAssembler, \n",
- " use,\n",
- " document_classifier\n",
- " ])\n",
- "\n",
- " empty_data = spark.createDataFrame([[\"\"]]).toDF(\"text\")\n",
- "\n",
- " clf_pipelineFit = nlpPipeline.fit(empty_data)\n",
- "\n",
- " clf_lp_pipeline = LightPipeline(clf_pipelineFit)\n",
- "\n",
- " return clf_lp_pipeline"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "executionInfo": {
- "elapsed": 12017,
- "status": "ok",
- "timestamp": 1664912064562,
- "user": {
- "displayName": "Merve Ertas Uslu",
- "userId": "01451729557099986551"
- },
- "user_tz": -120
- },
- "id": "Sv0HYuokpYWv",
- "outputId": "5b809660-2674-45aa-96b7-9d4220984ef2"
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "tfhub_use download started this may take some time.\n",
- "Approximate size to download 923.7 MB\n",
- "[OK!]\n",
- "classifierdl_use_trec50 download started this may take some time.\n",
- "Approximate size to download 21.2 MB\n",
- "[OK!]\n",
- "classes: [' ENTY_color', ' ENTY_techmeth', ' DESC_manner', ' NUM_volsize', ' ENTY_letter', ' NUM_temp', ' ENTY_body', ' NUM_count', ' ENTY_instru', ' NUM_period', ' NUM_speed', ' DESC_reason', ' ENTY_symbol', ' ENTY_event', ' HUM_desc', ' NUM_perc', ' ENTY_dismed', ' NUM_ord', ' HUM_gr', ' LOC_mount', ' ABBR_abb', ' DESC_desc', ' NUM_dist', ' HUM_title', ' ENTY_lang', ' ENTY_sport', ' ENTY_plant', ' NUM_code', ' NUM_other', ' ENTY_word', ' ENTY_animal', ' ENTY_substance', ' ENTY_veh', ' ENTY_product', ' LOC_state', ' ENTY_religion', ' ENTY_currency', ' NUM_date', ' LOC_country', ' ENTY_cremat', ' NUM_money', ' LOC_other', ' DESC_def', ' LOC_city', ' HUM_ind', ' ENTY_other', ' ENTY_termeq', ' ENTY_food', ' ABBR_exp', ' NUM_weight']\n"
- ]
- }
- ],
- "source": [
- "clf_lp_pipeline = get_clf_lp('classifierdl_use_trec50')"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "KPQpn6hGpeXR"
- },
- "source": [
- "The `classifierdl_use_trec50` model is trained on the TREC datasets.\n",
- "\n",
- "It classifies open-domain, fact-based questions into one of the following broad semantic categories:\n",
- "\n",
- "```Abbreviation, Description, Entities, Human Beings, Locations or Numeric Values.```"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "executionInfo": {
- "elapsed": 32,
- "status": "ok",
- "timestamp": 1664912064563,
- "user": {
- "displayName": "Merve Ertas Uslu",
- "userId": "01451729557099986551"
- },
- "user_tz": -120
- },
- "id": "qhszgL_epe7W",
- "outputId": "a0adb084-08f4-417a-9bb0-c2f892cc4f7c"
- },
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[' NUM_count']"
- ]
- },
- "execution_count": 43,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "text = 'What was the number of member nations of the U.N. in 2000?'\n",
- "\n",
- "clf_lp_pipeline.annotate(text)['class']"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 35
- },
- "executionInfo": {
- "elapsed": 378,
- "status": "ok",
- "timestamp": 1664912064920,
- "user": {
- "displayName": "Merve Ertas Uslu",
- "userId": "01451729557099986551"
- },
- "user_tz": -120
- },
- "id": "lU8mhnX-pn5-",
- "outputId": "d0c47dde-37dc-4c45-c116-c3b0e7d22514"
- },
- "outputs": [
- {
- "data": {
- "application/vnd.google.colaboratory.intrinsic+json": {
- "type": "string"
- },
- "text/plain": [
- "' NUM_count'"
- ]
- },
- "execution_count": 44,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "clf_lp_pipeline.fullAnnotate(text)[0]['class'][0].result"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "executionInfo": {
- "elapsed": 32,
- "status": "ok",
- "timestamp": 1664912064921,
- "user": {
- "displayName": "Merve Ertas Uslu",
- "userId": "01451729557099986551"
- },
- "user_tz": -120
- },
- "id": "boKPGhgoppto",
- "outputId": "f2e6d882-42e4-4179-9cf3-e48ae075a3c3"
- },
- "outputs": [
- {
- "data": {
- "text/plain": [
- "{' ENTY_dismed': '3.768739E-22',\n",
- " ' ENTY_product': '2.4015744E-24',\n",
- " ' ENTY_techmeth': '1.5787039E-22',\n",
- " ' NUM_speed': '7.948464E-23',\n",
- " ' NUM_volsize': '2.5315113E-25',\n",
- " ' LOC_state': '6.3784123E-25',\n",
- " ' NUM_code': '1.4549451E-25',\n",
- " ' NUM_count': '0.9992601',\n",
- " ' ENTY_food': '1.3031208E-24',\n",
- " ' ENTY_animal': '1.6743833E-24',\n",
- " ' NUM_period': '6.8075115E-21',\n",
- " ' ENTY_religion': '5.9194734E-23',\n",
- " ' LOC_country': '5.3062683E-21',\n",
- " ' LOC_mount': '3.2177816E-25',\n",
- " ' ENTY_termeq': '9.790085E-26',\n",
- " ' ENTY_color': '1.1446835E-22',\n",
- " ' ENTY_lang': '6.333391E-24',\n",
- " ' ENTY_sport': '8.0773835E-25',\n",
- " ' DESC_def': '2.4284432E-27',\n",
- " ' HUM_gr': '4.4863106E-21',\n",
- " ' ENTY_symbol': '4.1271923E-25',\n",
- " ' ENTY_currency': '8.156541E-29',\n",
- " ' ENTY_veh': '5.414701E-22',\n",
- " ' LOC_other': '5.5141072E-11',\n",
- " ' ENTY_word': '5.3265024E-23',\n",
- " ' NUM_temp': '2.0907158E-23',\n",
- " ' NUM_dist': '1.2542656E-24',\n",
- " ' DESC_desc': '1.0926973E-12',\n",
- " ' DESC_manner': '9.258374E-23',\n",
- " ' NUM_ord': '2.2395288E-25',\n",
- " ' NUM_other': '3.9771262E-27',\n",
- " ' DESC_reason': '1.1718967E-6',\n",
- " ' NUM_weight': '1.5373857E-24',\n",
- " ' ENTY_instru': '5.9354656E-21',\n",
- " ' ENTY_letter': '1.1453239E-25',\n",
- " ' ENTY_event': '3.706315E-25',\n",
- " ' ENTY_substance': '6.890844E-25',\n",
- " ' ABBR_exp': '5.6048268E-24',\n",
- " ' ENTY_body': '6.423101E-23',\n",
- " ' ENTY_other': '7.378E-4',\n",
- " ' NUM_money': '1.6745677E-25',\n",
- " ' LOC_city': '4.7003377E-22',\n",
- " ' NUM_date': '5.2122506E-16',\n",
- " ' NUM_perc': '6.3761288E-24',\n",
- " ' ABBR_abb': '7.101014E-26',\n",
- " ' ENTY_plant': '5.543376E-24',\n",
- " ' HUM_title': '1.0681953E-24',\n",
- " ' ENTY_cremat': '1.1165376E-24',\n",
- " ' HUM_ind': '8.063818E-7',\n",
- " ' HUM_desc': '4.3701275E-23',\n",
- " 'sentence': '0'}"
- ]
- },
- "execution_count": 45,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "clf_lp_pipeline.fullAnnotate(text)[0]['class'][0].metadata"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "executionInfo": {
- "elapsed": 20,
- "status": "ok",
- "timestamp": 1664912064922,
- "user": {
- "displayName": "Merve Ertas Uslu",
- "userId": "01451729557099986551"
- },
- "user_tz": -120
- },
- "id": "P87b4fzRpr6-",
- "outputId": "2a07fdc8-4a2e-4bfe-9a80-152ca249baa7"
- },
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[' HUM_ind']"
- ]
- },
- "execution_count": 46,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "text = 'What animal was the first mammal successfully cloned from adult cells?'\n",
- "\n",
- "clf_lp_pipeline.annotate(text)['class']"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "executionInfo": {
- "elapsed": 11360,
- "status": "ok",
- "timestamp": 1664912076270,
- "user": {
- "displayName": "Merve Ertas Uslu",
- "userId": "01451729557099986551"
- },
- "user_tz": -120
- },
- "id": "bNgGwNDwpuPV",
- "outputId": "39db1753-994b-46d5-84e0-00743eb71f02"
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "tfhub_use download started this may take some time.\n",
- "Approximate size to download 923.7 MB\n",
- "[OK!]\n",
- "classifierdl_use_cyberbullying download started this may take some time.\n",
- "Approximate size to download 21.3 MB\n",
- "[OK!]\n",
- "classes: ['sexism', 'neutral', 'racism']\n"
- ]
- }
- ],
- "source": [
- "clf_lp_pipeline = get_clf_lp('classifierdl_use_cyberbullying')\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "executionInfo": {
- "elapsed": 362,
- "status": "ok",
- "timestamp": 1664912076600,
- "user": {
- "displayName": "Merve Ertas Uslu",
- "userId": "01451729557099986551"
- },
- "user_tz": -120
- },
- "id": "MPGnM_gipwPf",
- "outputId": "ab83a3dd-65d8-4390-8457-50c7ef196a9e"
- },
- "outputs": [
- {
- "data": {
- "text/plain": [
- "['sexism']"
- ]
- },
- "execution_count": 48,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "text ='RT @EBeisner @ahall012 I agree with you!! I would rather brush my teeth with sandpaper then watch football with a girl!!'\n",
- "\n",
- "clf_lp_pipeline.annotate(text)['class']"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "executionInfo": {
- "elapsed": 6402,
- "status": "ok",
- "timestamp": 1664912082988,
- "user": {
- "displayName": "Merve Ertas Uslu",
- "userId": "01451729557099986551"
- },
- "user_tz": -120
- },
- "id": "3u9ktm9xp2aF",
- "outputId": "4947d747-787d-4932-ca58-37fd709ed84d"
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "tfhub_use download started this may take some time.\n",
- "Approximate size to download 923.7 MB\n",
- "[OK!]\n",
- "classifierdl_use_fakenews download started this may take some time.\n",
- "Approximate size to download 21.4 MB\n",
- "[OK!]\n",
- "classes: ['FAKE', 'REAL']\n"
- ]
- }
- ],
- "source": [
- "clf_lp_pipeline = get_clf_lp('classifierdl_use_fakenews')\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "executionInfo": {
- "elapsed": 54,
- "status": "ok",
- "timestamp": 1664912082989,
- "user": {
- "displayName": "Merve Ertas Uslu",
- "userId": "01451729557099986551"
- },
- "user_tz": -120
- },
- "id": "CNkFNiS4p5GM",
- "outputId": "8f2624e0-9e47-4e6b-c090-07813eda3171"
- },
- "outputs": [
- {
- "data": {
- "text/plain": [
- "['FAKE']"
- ]
- },
- "execution_count": 50,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "text ='Donald Trump a KGB Spy? 11/02/2016 In today’s video, Christopher Greene of AMTV reports Hillary Clinton campaign accusation that Donald Trump is a KGB spy is about as weak and baseless a claim as a Salem witch hunt or McCarthy era trial. It’s only because Hillary Clinton is losing that she is lobbing conspiracy theory. Citizen Quasar The way I see it, one of two things will happen: 1. Trump will win by a landslide but the election will be stolen via electronic voting, just like I have been predicting for over a decade, and the American People will accept the skewed election results just like they accept the TSA into their crotches. 2. Somebody will bust a cap in Hillary’s @$$ killing her and the election will be postponed. Follow AMTV!'\n",
- "\n",
- "clf_lp_pipeline.annotate(text)['class']\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "executionInfo": {
- "elapsed": 39,
- "status": "ok",
- "timestamp": 1664912082990,
- "user": {
- "displayName": "Merve Ertas Uslu",
- "userId": "01451729557099986551"
- },
- "user_tz": -120
- },
- "id": "L77nGOzsp859",
- "outputId": "f8bbbec1-41f7-4d3f-b4f5-af5ed8ad8f87"
- },
- "outputs": [
- {
- "data": {
- "text/plain": [
- "['REAL']"
- ]
- },
- "execution_count": 51,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "text ='Sen. Marco Rubio (R-Fla.) is adding a veteran New Hampshire political operative to his team as he continues mulling a possible 2016 presidential bid, the latest sign that he is seriously preparing to launch a campaign later this year.Jim Merrill, who worked for former GOP presidential nominee Mitt Romney and ran his 2008 and 2012 New Hampshire primary campaigns, joined Rubio’s fledgling campaign on Monday, aides to the senator said.Merrill will be joining Rubio’s Reclaim America PAC to focus on Rubio’s New Hampshire and broader Northeast political operations.\"Marco has always been well received in New Hampshire, and should he run for president, he would be very competitive there,\" Terry Sullivan, who runs Reclaim America, said in a statement. \"Jim certainly knows how to win in New Hampshire and in the Northeast, and will be a great addition to our team at Reclaim America.”News of Merrill’s hire was first reported by The New York Times.'\n",
- "\n",
- "clf_lp_pipeline.annotate(text)['class']"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "executionInfo": {
- "elapsed": 9046,
- "status": "ok",
- "timestamp": 1664912092005,
- "user": {
- "displayName": "Merve Ertas Uslu",
- "userId": "01451729557099986551"
- },
- "user_tz": -120
- },
- "id": "awBvudTRqApS",
- "outputId": "b4ca676b-400e-445b-b3b6-652187c638eb"
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "tfhub_use download started this may take some time.\n",
- "Approximate size to download 923.7 MB\n",
- "[OK!]\n",
- "sentimentdl_use_twitter download started this may take some time.\n",
- "Approximate size to download 11.4 MB\n",
- "[OK!]\n",
- "classes: ['positive', 'negative']\n"
- ]
- }
- ],
- "source": [
- "sentiment_lp_pipeline = get_clf_lp('sentimentdl_use_twitter', sentiment_dl=True)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "executionInfo": {
- "elapsed": 42,
- "status": "ok",
- "timestamp": 1664912092006,
- "user": {
- "displayName": "Merve Ertas Uslu",
- "userId": "01451729557099986551"
- },
- "user_tz": -120
- },
- "id": "5RQ4aVPNqDHn",
- "outputId": "c02f64f4-94c4-4af0-8a26-7ebda1c310b5"
- },
- "outputs": [
- {
- "data": {
- "text/plain": [
- "['positive']"
- ]
- },
- "execution_count": 53,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "text ='I am SO happy the news came out in time for my birthday this weekend! My inner 7-year-old cannot WAIT!'\n",
- "\n",
- "sentiment_lp_pipeline.annotate(text)['class']"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "executionInfo": {
- "elapsed": 11183,
- "status": "ok",
- "timestamp": 1664912103161,
- "user": {
- "displayName": "Merve Ertas Uslu",
- "userId": "01451729557099986551"
- },
- "user_tz": -120
- },
- "id": "YON-nknFqG5m",
- "outputId": "49552554-f9b4-46c4-c4bf-4c1cc5821b6b"
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "tfhub_use download started this may take some time.\n",
- "Approximate size to download 923.7 MB\n",
- "[OK!]\n",
- "classifierdl_use_emotion download started this may take some time.\n",
- "Approximate size to download 21.3 MB\n",
- "[OK!]\n",
- "classes: ['joy', 'fear', 'surprise', 'sadness']\n"
- ]
- }
- ],
- "source": [
- "sentiment_lp_pipeline = get_clf_lp('classifierdl_use_emotion', sentiment_dl=False)\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "executionInfo": {
- "elapsed": 357,
- "status": "ok",
- "timestamp": 1664912103493,
- "user": {
- "displayName": "Merve Ertas Uslu",
- "userId": "01451729557099986551"
- },
- "user_tz": -120
- },
- "id": "yW4Yim4HqJXn",
- "outputId": "542aebb2-1f41-4fe6-ee19-91f7312d1fb5"
- },
- "outputs": [
- {
- "data": {
- "text/plain": [
- "['surprise']"
- ]
- },
- "execution_count": 55,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "sentiment_lp_pipeline.annotate(text)['class']"
- ]
- }
- ],
- "metadata": {
- "accelerator": "TPU",
- "colab": {
- "collapsed_sections": [],
- "machine_shape": "hm",
- "provenance": [],
- "toc_visible": true
- },
- "gpuClass": "standard",
- "kernelspec": {
- "display_name": "base",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.9.12 (main, Apr 5 2022, 06:56:58) \n[GCC 7.5.0]"
- },
- "vscode": {
- "interpreter": {
- "hash": "3d597f4c481aa0f25dceb95d2a0067e73c0966dcbd003d741d821a7208527ecf"
- }
- }
- },
- "nbformat": 4,
- "nbformat_minor": 0
-}
diff --git a/examples/bakSentenceDetectorDL.ipynb b/examples/bakSentenceDetectorDL.ipynb
deleted file mode 100644
index ea02f287ab815a..00000000000000
--- a/examples/bakSentenceDetectorDL.ipynb
+++ /dev/null
@@ -1,787 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "E0pAkKvH6v7K"
- },
- "source": [
- "![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "OzsBWso169YV"
- },
- "source": [
- "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/9.SentenceDetectorDL.ipynb)"
- ]
- },
- {
- "attachments": {},
- "cell_type": "markdown",
- "metadata": {
- "id": "jirVPUT0F9bB"
- },
- "source": [
- "# SentenceDetectorDL"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "2mg5E3wl8yHp"
- },
- "source": [
- "`SentenceDetectorDL` (SDDL) is based on a general-purpose neural network model for sentence boundary detection. The task of sentence boundary detection is to identify sentences within a text. Many natural language processing tasks take a sentence as an input unit, such as part-of-speech tagging, dependency parsing, named entity recognition or machine translation.\n",
- "\n",
- "In this model, we treat sentence boundary detection as a classification problem and solve it with a deep-learning CNN architecture. We also modified the original implementation slightly to cover broken sentences and some impossible end-of-line characters.\n",
- "\n",
- "We are releasing two pretrained SDDL models, `english` and `multilanguage`, trained on the `SETimes corpus (Tyers and Alperen, 2010)` and `Europarl. Wong et al. (2014)` datasets.\n",
- "\n",
- "Here are the test metrics on various languages for the `multilang` model:"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "KvNuyGXpD7Nt"
- },
- "source": [
- "![image.png]()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "adrTGL-6ECtF"
- },
- "source": [
- "**Supported Languages**\n",
- "\n",
- "`bg Bulgarian`\n",
- "\n",
- "`bs Bosnian`\n",
- "\n",
- "`da Danish`\n",
- "\n",
- "`de German`\n",
- "\n",
- "`el Greek`\n",
- "\n",
- "`en English`\n",
- "\n",
- "`es Spanish`\n",
- "\n",
- "`fi Finnish`\n",
- "\n",
- "`fr French`\n",
- "\n",
- "`hr Croatian`\n",
- "\n",
- "`it Italian`\n",
- "\n",
- "`mk Macedonian`\n",
- "\n",
- "`nl Dutch`\n",
- "\n",
- "`pt Portuguese`\n",
- "\n",
- "`ro Romanian`\n",
- "\n",
- "`sq Albanian`\n",
- "\n",
- "`sr Serbian`\n",
- "\n",
- "`sv Swedish`\n",
- "\n",
- "`tr Turkish`\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "s8h1ee-GaEsn"
- },
- "outputs": [],
- "source": [
- "! pip install -q pyspark==3.3.0 spark-nlp==4.3.0"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 2,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 254
- },
- "executionInfo": {
- "elapsed": 23293,
- "status": "ok",
- "timestamp": 1664975024014,
- "user": {
- "displayName": "Halil SAGLAMLAR",
- "userId": "07259164328506563794"
- },
- "user_tz": -180
- },
- "id": "6_cR3Syj8wTd",
- "outputId": "67700eee-d59b-48b8-c0cc-54dd7dc2f23d"
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Spark NLP version 4.3.0\n",
- "Apache Spark version: 3.3.0\n"
- ]
- },
- {
- "data": {
- "text/html": [
- "\n",
- " "
- ],
- "text/plain": [
- ""
- ]
- },
- "execution_count": 2,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "import sparknlp\n",
- "\n",
- "from pyspark.ml import PipelineModel\n",
- "from sparknlp.annotator import *\n",
- "from sparknlp.base import *\n",
- "\n",
- "spark = sparknlp.start()\n",
- "\n",
- "print(\"Spark NLP version\", sparknlp.version())\n",
- "print(\"Apache Spark version:\", spark.version)\n",
- "\n",
- "spark"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 3,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "executionInfo": {
- "elapsed": 16765,
- "status": "ok",
- "timestamp": 1664975040774,
- "user": {
- "displayName": "Halil SAGLAMLAR",
- "userId": "07259164328506563794"
- },
- "user_tz": -180
- },
- "id": "zS46q8E0Aidy",
- "outputId": "f067a993-5fcf-4e84-e41d-2c17202ade55"
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "sentence_detector_dl download started this may take some time.\n",
- "Approximate size to download 354.6 KB\n",
- "[OK!]\n"
- ]
- }
- ],
- "source": [
- "documenter = DocumentAssembler()\\\n",
- " .setInputCol(\"text\")\\\n",
- " .setOutputCol(\"document\")\n",
- " \n",
- "sentencerDL = SentenceDetectorDLModel\\\n",
- " .pretrained(\"sentence_detector_dl\", \"en\") \\\n",
- " .setInputCols([\"document\"]) \\\n",
- " .setOutputCol(\"sentences\")\n",
- "\n",
- "sd_pipeline = PipelineModel(stages=[documenter, sentencerDL])"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 4,
- "metadata": {
- "executionInfo": {
- "elapsed": 16,
- "status": "ok",
- "timestamp": 1664975040775,
- "user": {
- "displayName": "Halil SAGLAMLAR",
- "userId": "07259164328506563794"
- },
- "user_tz": -180
- },
- "id": "kHWj4IxqBnwK"
- },
- "outputs": [],
- "source": [
- "sd_model = LightPipeline(sd_pipeline)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 5,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "executionInfo": {
- "elapsed": 388,
- "status": "ok",
- "timestamp": 1664975041148,
- "user": {
- "displayName": "Halil SAGLAMLAR",
- "userId": "07259164328506563794"
- },
- "user_tz": -180
- },
- "id": "sBJqnxE6-1uz",
- "outputId": "9ce4d659-70e9-459a-cd36-fc2f382bb716"
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "0\t0\t15\tJohn loves Mary.\n",
- "1\t16\t31\tmary loves Peter\n",
- "2\t43\t61\tPeter loves Helen .\n",
- "3\t62\t78\tHelen loves John;\n",
- "4\t91\t119\tTotal: four. people involved.\n"
- ]
- }
- ],
- "source": [
- "text = \"\"\"John loves Mary.mary loves Peter\n",
- " Peter loves Helen .Helen loves John; \n",
- " Total: four. people involved.\"\"\"\n",
- "\n",
- "for anno in sd_model.fullAnnotate(text)[0][\"sentences\"]:\n",
- " print(\"{}\\t{}\\t{}\\t{}\".format(\n",
- " anno.metadata[\"sentence\"], anno.begin, anno.end, anno.result))\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "-ihAtVvIDSJh"
- },
- "source": [
- "### Testing with a broken text (random `\\n` chars added)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 6,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "executionInfo": {
- "elapsed": 796,
- "status": "ok",
- "timestamp": 1664975041942,
- "user": {
- "displayName": "Halil SAGLAMLAR",
- "userId": "07259164328506563794"
- },
- "user_tz": -180
- },
- "id": "NzRiZYqiCX4t",
- "outputId": "e381bffc-04f3-4522-fa12-e4d99d8dd308"
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "0\t1\t104\tThere are many NLP tasks like text summarization, question-answering, sentence prediction to name a few.\n",
- "1\t106\t170\tOne method to get these tasks done is using a pre-trained model.\n",
- "2\t172\t362\tInstead of training a model from scratch for NLP tasks using millions of annotated texts each time, a general language representation is created by training a model on a huge amount of data.\n",
- "3\t364\t398\tThis is called a pre-trained model.\n",
- "4\t400\t479\tThis pre-trained model is then fine-tuned for each NLP tasks according to need.\n",
- "5\t481\t520\tLet’s just peek into the pre-BERT world…\n",
- "6\t522\t634\tFor creating models, we need words to be represented in a form understood by the training network, ie, numbers.\n",
- "7\t636\t731\tThus many algorithms were used to convert words into vectors or more precisely, word embeddings.\n",
- "8\t734\t798\tOne of the earliest algorithms used for this purpose is word2vec.\n",
- "9\t800\t872\tHowever, the drawback of word2vec models was that they were context-free.\n",
- "10\t874\t941\tOne problem caused by this is that they cannot accommodate polysemy.\n",
- "11\t943\t1022\tFor example, the word ‘letter’ has a different meaning according to the context.\n",
- "12\t1024\t1106\tIt can mean ‘single element of alphabet’ or ‘document addressed to another person’.\n",
- "13\t1108\t1163\tBut in word2vec both the letter returns same embeddings.\n"
- ]
- }
- ],
- "source": [
- "text = '''\n",
- "There are many NLP tasks like text summarization, question-answering, sentence prediction to name a few. One method to get\\n these tasks done is using a pre-trained model. Instead of training \n",
- "a model from scratch for NLP tasks using millions of annotated texts each time, a general language representation is created by training a model on a huge amount of data. This is called a pre-trained model. This pre-trained model is \n",
- "then fine-tuned for each NLP tasks according to need.\n",
- "Let’s just peek into the pre-BERT world…\n",
- "For creating models, we need words to be represented in a form \\n understood by the training network, ie, numbers. Thus many algorithms were used to convert words into vectors or more precisely, word embeddings. \n",
- "One of the earliest algorithms used for this purpose is word2vec. However, the drawback of word2vec models was that they were context-free. One problem caused by this is that they cannot accommodate polysemy. For example, the word ‘letter’ has a different meaning according to the context. It can mean ‘single element of alphabet’ or ‘document addressed to another person’. But in word2vec both the letter returns same embeddings.\n",
- "'''\n",
- "\n",
- "for anno in sd_model.fullAnnotate(text)[0][\"sentences\"]:\n",
- " \n",
- " print(\"{}\\t{}\\t{}\\t{}\".format(\n",
- " anno.metadata[\"sentence\"], anno.begin, anno.end, anno.result.replace('\\n',''))) # strip \\n to beautify printing\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "P1uBCqrnElmi"
- },
- "source": [
- "## Compare with spaCy Sentence Splitter"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 7,
- "metadata": {
- "executionInfo": {
- "elapsed": 6,
- "status": "ok",
- "timestamp": 1664975041943,
- "user": {
- "displayName": "Halil SAGLAMLAR",
- "userId": "07259164328506563794"
- },
- "user_tz": -180
- },
- "id": "8PtDPgliWEdu"
- },
- "outputs": [],
- "source": [
- "# !pip install spacy"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 8,
- "metadata": {
- "executionInfo": {
- "elapsed": 15720,
- "status": "ok",
- "timestamp": 1664975057658,
- "user": {
- "displayName": "Halil SAGLAMLAR",
- "userId": "07259164328506563794"
- },
- "user_tz": -180
- },
- "id": "5tHHalKGEoTu"
- },
- "outputs": [],
- "source": [
- "import spacy\n",
- "\n",
- "nlp = spacy.load(\"en_core_web_sm\")"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 9,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "executionInfo": {
- "elapsed": 20,
- "status": "ok",
- "timestamp": 1664975057659,
- "user": {
- "displayName": "Halil SAGLAMLAR",
- "userId": "07259164328506563794"
- },
- "user_tz": -180
- },
- "id": "euCxCFU-Erpv",
- "outputId": "4cc0b189-2211-43a6-93db-186ea1bb0f5b"
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "John loves Mary.mary loves Peter\n",
- "Peter loves Helen .Helen\n",
- "loves John; \n",
- "Total: four.\n",
- "people involved.\n"
- ]
- }
- ],
- "source": [
- "text = \"\"\"John loves Mary.mary loves Peter\n",
- "Peter loves Helen .Helen loves John; \n",
- "Total: four. people involved.\"\"\"\n",
- "\n",
- "for sent in nlp(text).sents:\n",
- " print(sent)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "Rq1LuycdbpAf"
- },
- "source": [
- "## Test with another random broken sentence "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 10,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "executionInfo": {
- "elapsed": 404,
- "status": "ok",
- "timestamp": 1664975058045,
- "user": {
- "displayName": "Halil SAGLAMLAR",
- "userId": "07259164328506563794"
- },
- "user_tz": -180
- },
- "id": "S6EUTjAlFFVd",
- "outputId": "6f5b24f6-8cf0-493c-dc62-9a25097ad007"
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "with Spark NLP SentenceDetectorDL\n",
- "===================================\n",
- "0\tA California woman who vanished in Utah’s Zion National Park earlier,this month was found and reunited with her family officials said Sunday.\n",
- "1\tHolly Suzanne Courtier, 38, was located within the park after a visitor saw her and alerted rangers, the National. Park Service said in a statement.\n",
- "2\tAdditional details about how she survived or where she was found were not immediately available.\n",
- "3\tIn the statement, Courtier’s relatives said they were “overjoyed” that she’d been found.\n",
- "4\tCourtier, of Los Angeles, disappeared after a private shuttle dropped her off on Oct. 6. at the Grotto park area inside the 232-square-mile national park.\n",
- "5\tShe was scheduled to be picked up later that afternoon but didn't show up, park officials said.\n",
- "6\tThe search included K.9. units and federal, state and local rescue teams;\n",
- "7\tVolunteers also joined the effort.\n",
- "\n",
- "with Spacy Sentence Detection\n",
- "===================================\n",
- "0 \t A California woman who vanished in Utah’s Zion National Park earlier,this month was found and reunited with her family officials said Sunday.\n",
- "1 \t Holly Suzanne Courtier, 38, was located within the park after a visitor saw her and alerted rangers, the National.\n",
- "2 \t Park Service said in a statement.\n",
- "3 \t Additional details about how she survived or where she was found were not immediately available.\n",
- "4 \t In the statement, Courtier’s relatives said they were “overjoyed” that she’d been found.\n",
- "5 \t Courtier, of Los Angeles, disappeared after a private shuttle dropped her off on Oct. 6.\n",
- "6 \t at the Grotto park area inside the 232-square-mile national park.\n",
- "7 \t She was scheduled to be picked up later that afternoon but didn't show up, park officials said.\n",
- "8 \t The search included K.9.\n",
- "9 \t units and federal, state and local rescue teams; Volunteers also joined the effort.\n"
- ]
- }
- ],
- "source": [
- "random_broken_text = '''\n",
- "A California woman who vanished in Utah’s Zion National Park earlier,\n",
- "this month was found and reunited with her family \n",
- "officials said Sunday. Holly Suzanne Courtier, \n",
- "38, was located within the park after a visitor saw \n",
- "her and alerted rangers, the National. Park Service said in a statement.\n",
- "Additional details about how she \n",
- "survived or where she was found were not immediately available. In the statement, \n",
- "Courtier’s relatives said they were “overjoyed” that she’d been found.\n",
- "Courtier, of Los Angeles, disappeared after a private shuttle dropped her off on Oct. 6. at the Grotto park area \n",
- "inside the 232-square-mile national park. She was scheduled to be picked up later that \n",
- "afternoon but didn't show up, park officials said. The search included K.9. units and federal, \n",
- "state and local rescue teams; Volunteers also joined the effort.\n",
- "'''\n",
- "\n",
- "print ('with Spark NLP SentenceDetectorDL')\n",
- "print ('===================================')\n",
- "\n",
- "for anno in sd_model.fullAnnotate(random_broken_text)[0][\"sentences\"]:\n",
- " \n",
- " print(\"{}\\t{}\".format(\n",
- " anno.metadata[\"sentence\"], anno.result.replace('\\n',''))) # removing \\n to beautify printing\n",
- "\n",
- "print()\n",
- "print ('with Spacy Sentence Detection')\n",
- "print ('===================================')\n",
- "for i,sent in enumerate(nlp(random_broken_text).sents):\n",
- " print(i, '\\t',str(sent).replace('\\n',''))# removing \\n to beautify printing"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "WiU0yHsvmxSv"
- },
- "source": [
- "## Multilanguage Sentence Detector DL"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 11,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "executionInfo": {
- "elapsed": 6293,
- "status": "ok",
- "timestamp": 1664975064334,
- "user": {
- "displayName": "Halil SAGLAMLAR",
- "userId": "07259164328506563794"
- },
- "user_tz": -180
- },
- "id": "ULFgE7KmkbMa",
- "outputId": "42e9545e-4542-470c-fb5c-a3d3b9d537d0"
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "sentence_detector_dl download started this may take some time.\n",
- "Approximate size to download 514.9 KB\n",
- "[OK!]\n"
- ]
- }
- ],
- "source": [
- "sentencerDL_multilang = SentenceDetectorDLModel\\\n",
- " .pretrained(\"sentence_detector_dl\", \"xx\") \\\n",
- " .setInputCols([\"document\"]) \\\n",
- " .setOutputCol(\"sentences\")\n",
- "\n",
- "sd_pipeline_multi = PipelineModel(stages=[documenter, sentencerDL_multilang])\n",
- "\n",
- "sd_model_multi = LightPipeline(sd_pipeline_multi)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 12,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "executionInfo": {
- "elapsed": 758,
- "status": "ok",
- "timestamp": 1664975065076,
- "user": {
- "displayName": "Halil SAGLAMLAR",
- "userId": "07259164328506563794"
- },
- "user_tz": -180
- },
- "id": "YPR4MqEZbPo7",
- "outputId": "bd3410e4-5c56-4c82-9c86-af3a6e676c33"
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "with Spark NLP SentenceDetectorDL\n",
- "===================================\n",
- "0\tΌπως ίσως θα γνωρίζει, όταν εγκαθιστάς μια νέα εφαρμογή, θα έχεις διαπιστώσει λίγο μετά, ότι το PC αρχίζει να επιβραδύνεται.\n",
- "1\tΣτη συνέχεια, όταν επισκέπτεσαι την οθόνη ή από την διαχείριση εργασιών, θα διαπιστώσεις ότι η εν λόγω εφαρμογή έχει προστεθεί στη λίστα των προγραμμάτων που εκκινούν αυτόματα, όταν ξεκινάς το PC.\n",
- "2\tΠροφανώς, κάτι τέτοιο δεν αποτελεί μια ιδανική κατάσταση, ιδίως για τους λιγότερο γνώστες, οι οποίοι ίσως δεν θα συνειδητοποιήσουν ότι κάτι τέτοιο συνέβη.\n",
- "3\tΌσο περισσότερες εφαρμογές στη λίστα αυτή, τόσο πιο αργή γίνεται η εκκίνηση, ιδίως αν πρόκειται για απαιτητικές εφαρμογές.\n",
- "4\tΤα ευχάριστα νέα είναι ότι η τελευταία και πιο πρόσφατη preview build της έκδοσης των Windows 10 που θα καταφθάσει στο πρώτο μισό του 2021, οι εφαρμογές θα ενημερώνουν το χρήστη ότι έχουν προστεθεί στη λίστα των εφαρμογών που εκκινούν μόλις ανοίγεις το PC.\n",
- "\n",
- "with Spacy Sentence Detection\n",
- "===================================\n",
- "0 \t Όπως ίσως θα γνωρίζει, όταν εγκαθιστάς μια νέα εφαρμογή, θα έχεις διαπιστώσει λίγο μετά, ότι το PC αρχίζει να επιβραδύνεται.\n",
- "1 \t Στη συνέχεια, όταν επισκέπτεσαι την οθόνη ή από την διαχείριση εργασιών, θα διαπιστώσεις ότι η εν λόγω εφαρμογή έχει προστεθεί στη λίστα των προγραμμάτων που εκκινούν αυτόματα, όταν ξεκινάς το PC.Προφανώς, κάτι τέτοιο δεν αποτελεί μια ιδανική κατάσταση, ιδίως για τους λιγότερο γνώστες, οι οποίοι ίσως δεν θα συνειδητοποιήσουν ότι κάτι τέτοιο συνέβη.\n",
- "2 \t Όσο περισσότερες εφαρμογές στη λίστα αυτή, τόσο πιο αργή γίνεται η εκκίνηση, ιδίως αν πρόκειται για απαιτητικές εφαρμογές.\n",
- "3 \t Τα ευχάριστα νέα είναι ότι η τελευταία και πιο πρόσφατη preview build της έκδοσης των Windows 10 που θα καταφθάσει\n",
- "4 \t στο πρώτο μισό του 2021, οι εφαρμογές θα ενημερώνουν το χρήστη ότι έχουν προστεθεί στη λίστα των εφαρμογών που εκκινούν μόλις ανοίγεις το PC.\n"
- ]
- }
- ],
- "source": [
- "gr_text= '''\n",
- "Όπως ίσως θα γνωρίζει, όταν εγκαθιστάς μια νέα εφαρμογή, θα έχεις διαπιστώσει \n",
- "λίγο μετά, ότι το PC αρχίζει να επιβραδύνεται. Στη συνέχεια, όταν επισκέπτεσαι την οθόνη ή από την διαχείριση εργασιών, θα διαπιστώσεις ότι η εν λόγω εφαρμογή έχει προστεθεί στη \n",
- "λίστα των προγραμμάτων που εκκινούν αυτόματα, όταν ξεκινάς το PC.\n",
- "Προφανώς, κάτι τέτοιο δεν αποτελεί μια ιδανική κατάσταση, ιδίως για τους λιγότερο γνώστες, οι \n",
- "οποίοι ίσως δεν θα συνειδητοποιήσουν ότι κάτι τέτοιο συνέβη. Όσο περισσότερες εφαρμογές στη λίστα αυτή, τόσο πιο αργή γίνεται η \n",
- "εκκίνηση, ιδίως αν πρόκειται για απαιτητικές εφαρμογές. Τα ευχάριστα νέα είναι ότι η τελευταία και πιο πρόσφατη preview build της έκδοσης των Windows 10 που θα καταφθάσει στο πρώτο μισό του 2021, οι εφαρμογές θα \n",
- "ενημερώνουν το χρήστη ότι έχουν προστεθεί στη λίστα των εφαρμογών που εκκινούν μόλις ανοίγεις το PC.\n",
- "'''\n",
- "\n",
- "print ('with Spark NLP SentenceDetectorDL')\n",
- "print ('===================================')\n",
- "\n",
- "for anno in sd_model_multi.fullAnnotate(gr_text)[0][\"sentences\"]:\n",
- " \n",
- " print(\"{}\\t{}\".format(\n",
- " anno.metadata[\"sentence\"], anno.result.replace('\\n',''))) # removing \\n to beautify printing\n",
- "\n",
- "print()\n",
- "print ('with Spacy Sentence Detection')\n",
- "print ('===================================')\n",
- "for i,sent in enumerate(nlp(gr_text).sents):\n",
- " print(i, '\\t',str(sent).replace('\\n',''))# removing \\n to beautify printing"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 13,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "executionInfo": {
- "elapsed": 9,
- "status": "ok",
- "timestamp": 1664975065077,
- "user": {
- "displayName": "Halil SAGLAMLAR",
- "userId": "07259164328506563794"
- },
- "user_tz": -180
- },
- "id": "sUc7n1wsktJs",
- "outputId": "aeb2e4b3-1696-4025-829c-6f8ea9dac4b8"
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "with Spark NLP SentenceDetectorDL\n",
- "===================================\n",
- "0\tB чeтвъpтъĸ Gооglе oбяви няĸoлĸo aĸтyaлизaции нa cвoятa тъpcaчĸa, зaявявaйĸи чe e въвeлa изĸycтвeн интeлeĸт (Аl) и мaшиннo oбyчeниe зa пoдoбpявaнe нa пoтpeбитeлcĸoтo изживявaнe.\n",
- "1\tΠoтpeбитeлитe вeчe мoгaт дa cи тaнaниĸaт, cвиpят или пeят мeлoдия нa пeceн нa Gооglе чpeз мoбилнoтo пpилoжeниe, ĸaтo дoĸocнaт иĸoнaтa нa миĸpoфoнa и зaдaдaт въпpoca: Koя e тaзи пeceн?\n",
- "2\tTaнaниĸaнeтo в пpoдължeниe нa 10-15 ceĸyнди щe дaдe шaнc нa aлгopитъмa c мaшиннo oбyчeниe нa Gооglе дa нaмepи и извeдe peзyлтaт ĸoя e пpипявaнaтa пeceн.\n",
- "3\tΠoнacтoящeм фyнĸциятa e дocтъпнa нa aнглийcĸи eзиĸ зa Іоѕ и нa oĸoлo 20 eзиĸa зa Аndrоіd, ĸaтo в бъдeщe и зa двeтe oпepaциoнни cиcтeми щe бъдe пpeдлoжeн eднaĸъв нaбop oт пoддъpжaни eзици, ĸaзвaт oт Gооglе.\n",
- "4\tAl aĸтyaлизaциитe нa тъpceщия гигaнт cъщo oбxвaщaт пpaвoпиca и oбщитe зaявĸи зa тъpceнe.\n",
- "5\tCpeд пoдoбpeниятa e вĸлючeн нoв пpaвoпиceн aлгopитъм, ĸoйтo изпoлзвa нeвpoннa мpeжa c дълбoĸo oбyчeниe, зa ĸoятo Gооglе твъpди, чe идвa cъc знaчитeлнo пoдoбpeнa cпocoбнocт зa дeшифpиpaнe нa пpaвoпиcни гpeшĸи.\n",
- "\n",
- "with Spacy Sentence Detection\n",
- "===================================\n",
- "0 \t B чeтвъpтъĸ Gооglе oбяви няĸoлĸo aĸтyaлизaции нa cвoятa тъpcaчĸa, зaявявaйĸи чe e въвeлa изĸycтвeн интeлeĸт\n",
- "1 \t (Аl) и мaшиннo oбyчeниe зa пoдoбpявaнe нa пoтpeбитeлcĸoтo изживявaнe.\n",
- "2 \t Πoтpeбитeлитe вeчe мoгaт дa cи тaнaниĸaт, cвиpят или пeят мeлoдия нa пeceн нa Gооglе чpeз мoбилнoтo пpилoжeниe, ĸaтo дoĸocнaт иĸoнaтa нa миĸpoфoнa и зaдaдaт въпpoca:\n",
- "3 \t Koя e тaзи пeceн?Taнaниĸaнeтo в пpoдължeниe нa 10-15 ceĸyнди щe дaдe шaнc нa aлгopитъмa c мaшиннo oбyчeниe\n",
- "4 \t нa\n",
- "5 \t Gооglе дa нaмepи и извeдe peзyлтaт ĸoя e пpипявaнaтa пeceн.\n",
- "6 \t Πoнacтoящeм фyнĸциятa e дocтъпнa нa aнглийcĸи eзиĸ\n",
- "7 \t зa\n",
- "8 \t Іоѕ и нa oĸoлo 20 eзиĸa зa Аndrоіd, ĸaтo в бъдeщe и зa двeтe oпepaциoнни cиcтeми щe бъдe пpeдлoжeн eднaĸъв нaбop oт пoддъpжaни eзици, ĸaзвaт oт Gооglе.\n",
- "9 \t Al aĸтyaлизaциитe нa тъpceщия гигaнт cъщo oбxвaщaт пpaвoпиca и oбщитe зaявĸи зa тъpceнe.\n",
- "10 \t Cpeд пoдoбpeниятa e вĸлючeн нoв пpaвoпиceн aлгopитъм, ĸoйтo изпoлзвa нeвpoннa мpeжa c дълбoĸo oбyчeниe, зa ĸoятo Gооglе твъpди, чe идвa cъc знaчитeлнo пoдoбpeнa cпocoбнocт\n",
- "11 \t зa дeшифpиpaнe\n",
- "12 \t нa пpaвoпиcни гpeшĸи.\n"
- ]
- }
- ],
- "source": [
- "cyrillic_text = '''\n",
- "B чeтвъpтъĸ Gооglе oбяви няĸoлĸo aĸтyaлизaции нa cвoятa тъpcaчĸa, зaявявaйĸи чe e \n",
- "въвeлa изĸycтвeн интeлeĸт (Аl) и мaшиннo oбyчeниe зa пoдoбpявaнe нa пoтpeбитeлcĸoтo изживявaнe.\n",
- "Πoтpeбитeлитe вeчe мoгaт дa cи тaнaниĸaт, cвиpят или пeят мeлoдия нa пeceн нa Gооglе чpeз мoбилнoтo пpилoжeниe, \n",
- "ĸaтo дoĸocнaт иĸoнaтa нa миĸpoфoнa и зaдaдaт въпpoca: Koя e тaзи пeceн?\n",
- "Taнaниĸaнeтo в пpoдължeниe нa 10-15 ceĸyнди щe дaдe шaнc нa aлгopитъмa c мaшиннo oбyчeниe нa Gооglе дa нaмepи и извeдe peзyлтaт ĸoя e пpипявaнaтa пeceн.\n",
- "Πoнacтoящeм фyнĸциятa e дocтъпнa нa aнглийcĸи eзиĸ зa Іоѕ и нa oĸoлo 20 eзиĸa зa Аndrоіd, \n",
- "ĸaтo в бъдeщe и зa двeтe oпepaциoнни cиcтeми щe бъдe пpeдлoжeн eднaĸъв нaбop oт пoддъpжaни eзици, ĸaзвaт oт Gооglе.\n",
- "Al aĸтyaлизaциитe нa тъpceщия гигaнт cъщo oбxвaщaт пpaвoпиca и oбщитe зaявĸи зa тъpceнe.\n",
- "Cpeд пoдoбpeниятa e вĸлючeн нoв пpaвoпиceн aлгopитъм, ĸoйтo изпoлзвa нeвpoннa мpeжa \n",
- "c дълбoĸo oбyчeниe, зa ĸoятo Gооglе твъpди, чe идвa cъc знaчитeлнo пoдoбpeнa cпocoбнocт зa \n",
- "дeшифpиpaнe нa пpaвoпиcни гpeшĸи.\n",
- "'''\n",
- "\n",
- "print ('with Spark NLP SentenceDetectorDL')\n",
- "print ('===================================')\n",
- "\n",
- "for anno in sd_model_multi.fullAnnotate(cyrillic_text)[0][\"sentences\"]:\n",
- " \n",
- " print(\"{}\\t{}\".format(\n",
- " anno.metadata[\"sentence\"], anno.result.replace('\\n',''))) # removing \\n to beautify printing\n",
- "\n",
- "print()\n",
- "print ('with Spacy Sentence Detection')\n",
- "print ('===================================')\n",
- "for i,sent in enumerate(nlp(cyrillic_text).sents):\n",
- " print(i, '\\t',str(sent).replace('\\n',''))# removing \\n to beautify printing"
- ]
- }
- ],
- "metadata": {
- "accelerator": "GPU",
- "colab": {
- "collapsed_sections": [],
- "provenance": []
- },
- "kernelspec": {
- "display_name": "base",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.9.12 (main, Apr 5 2022, 06:56:58) \n[GCC 7.5.0]"
- },
- "vscode": {
- "interpreter": {
- "hash": "3d597f4c481aa0f25dceb95d2a0067e73c0966dcbd003d741d821a7208527ecf"
- }
- }
- },
- "nbformat": 4,
- "nbformat_minor": 0
-}
diff --git a/examples/python/annotation/text/english/named-entity-recognition/ZeroShot_NER.ipynb b/examples/python/annotation/text/english/named-entity-recognition/ZeroShot_NER.ipynb
new file mode 100644
index 00000000000000..fe22efffec7348
--- /dev/null
+++ b/examples/python/annotation/text/english/named-entity-recognition/ZeroShot_NER.ipynb
@@ -0,0 +1,275 @@
+{
+ "cells": [
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "db5f4f9a-7776-42b3-8758-85624d4c15ea",
+ "metadata": {},
+ "source": [
+ "![JohnSnowLabs](https://johnsnowlabs.com/assets/images/logo.png)"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "21e9eafb",
+ "metadata": {},
+ "source": [
+ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/annotation/text/english/named-entity-recognition/ZeroShot_NER.ipynb)"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "212325cc-182f-4565-abed-9b46864d6d69",
+ "metadata": {},
+ "source": [
+ "# Named Entity Recognition with ZeroShotNer"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "216EshxBJ9ra",
+ "metadata": {},
+ "source": [
+ "## Colab Setup"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f6e6c12b",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "!pip install -q pyspark==3.3.0 spark-nlp==4.3.0"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "fc39c840",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Spark NLP version: 4.2.8\n",
+ "Apache Spark version: 3.3.0\n"
+ ]
+ },
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "