DocumentAssembler on array<string> #13816

HeyBossy · 2023-05-23T07:39:13Z

Is there an existing issue for this?

I have searched the existing issues and did not find a match.

Who can help?

No response

What are you working on?

I don't know why, but my question was deleted, so I am asking it again.
I am working with a dataframe that I need to lemmatize. The input column is an array of strings, so I am trying to use DocumentAssembler on an array of strings. The documentation says: "The DocumentAssembler can read either a String column or an Array[String])". But it doesn't work that way for me. Can you explain what I'm doing wrong, or is the documentation out of date?

Current Behavior

I am getting this error:

AnalysisException: [CANNOT_UP_CAST_DATATYPE] Cannot up cast input from "ARRAY<STRING>" to "STRING".
The type path of the target object is:
- root class: "java.lang.String"
You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object

Expected Behavior

Steps To Reproduce

When I do a simple example:

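import sparknlp
from sparknlp.base import DocumentAssembler

spark = sparknlp.start()

# The "text" column has type array<string>, not string.
data = spark.createDataFrame([[["Spark NLP is an open-source text processing library."]]]).toDF("text")
documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
result = documentAssembler.transform(data)

result.select("document").show(truncate=False)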

Spark NLP version and Apache Spark

sparknlp.version() == '4.4.0'
spark.version == '3.4.0'

Type of Spark Application

No response

Java Version

No response

Java Home Directory

No response

Setup and installation

No response

Operating System and Version

No response

Link to your project (if available)

No response

Additional Information

No response


maziyarpanahi · 2023-05-23T07:42:12Z

@HeyBossy your question was not deleted, it was answered with an example and a Colab link: #13815

Please have a look at our step-by-step tutorials to get started with Spark NLP. (The input for DocumentAssembler must be a STRING.)

https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/open-source-nlp

maziyarpanahi · 2023-05-23T07:43:06Z

Leaving it open to avoid another duplicate question. @HeyBossy, please close it once you have read the answer.

HeyBossy · 2023-05-23T07:45:37Z

I didn't add extra square brackets by mistake. I want to pass in a column of type array of strings:

root
 |-- text: array (nullable = true)
 |    |-- element: string (containsNull = true)

The documentation says: "The DocumentAssembler can read either a String column or an Array[String])"
Can you explain to me what this means? I don't understand.

HeyBossy · 2023-05-23T07:48:05Z

https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/main/scala/com/johnsnowlabs/nlp/DocumentAssembler.scala#LL29C4-L29C4 here

maziyarpanahi · 2023-05-23T07:51:30Z

I see the problem now; I hadn't noticed that the docs say that. The documentation is wrong! DocumentAssembler actually only accepts String. I will ask for the docs to be fixed everywhere this appears.

@HeyBossy As a workaround, you need to explode your array of text and then that can be used as an input to
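The explode workaround mentioned above could look roughly like the sketch below. It is not taken from the Spark NLP docs; the helper names `explode_text_column` and `assemble_documents` and the intermediate column name `sentence` are hypothetical, chosen only for illustration. It assumes each array element should become its own row (and therefore its own document).

```python
def explode_text_column(df, array_col="text", out_col="sentence"):
    """Return df with one row per element of the array<string> column.

    Hypothetical helper: the names and defaults are illustrative only.
    """
    # Imported lazily so the helper can be defined where pyspark is not installed.
    from pyspark.sql import functions as F

    # explode() emits one output row for every element of the array.
    return df.withColumn(out_col, F.explode(F.col(array_col)))


def assemble_documents(df, text_col="sentence", out_col="document"):
    """Run DocumentAssembler on a plain string column (hypothetical helper)."""
    from sparknlp.base import DocumentAssembler

    assembler = DocumentAssembler().setInputCol(text_col).setOutputCol(out_col)
    return assembler.transform(df)


# Usage, assuming an active session created with `spark = sparknlp.start()`:
#
#   data = spark.createDataFrame(
#       [(["First sentence.", "Second sentence."],)], ["text"]
#   )
#   result = assemble_documents(explode_text_column(data))
#   result.select("document").show(truncate=False)
```

Note that `explode` drops rows whose array is null or empty; `explode_outer` keeps them with a null value instead. If one document per original row is wanted rather than one per element, joining the array into a single string first (for example with `concat_ws(" ", col)`) is another option.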

maziyarpanahi · 2023-05-23T07:52:04Z

@DevinTDHa Could you please make sure this is fixed in all the docs (pydoc, scaladoc, website, etc.) - many thanks

HeyBossy · 2023-05-23T07:52:32Z

Okay I got it, thanks for the replies!

maziyarpanahi · 2023-05-23T07:53:50Z

Thanks @HeyBossy

Re-opening this; it will be closed once we have fixed this issue in the docs.

HeyBossy added the question label May 23, 2023

HeyBossy assigned maziyarpanahi May 23, 2023

maziyarpanahi closed this as completed May 23, 2023

maziyarpanahi reopened this May 23, 2023

maziyarpanahi added documentation bug-fix and removed question labels May 23, 2023

HeyBossy closed this as completed May 23, 2023

maziyarpanahi reopened this May 23, 2023

DevinTDHa mentioned this issue May 23, 2023

SPARKNLP-809: Add warning to ForZeroShot annotators #13798

Merged

10 tasks

DevinTDHa closed this as completed Jun 3, 2023
