Use DocumentAssembler on array<string> #13815

HeyBossy · 2023-05-22T14:36:54Z

Is there an existing issue for this?

I have searched the existing issues and did not find a match.

Who can help?

No response

What are you working on?

I am trying to use DocumentAssembler on an array of strings. The documentation says: "The DocumentAssembler can read either a String column or an Array[String]".

Current Behavior

But I am getting an error:

AnalysisException: [CANNOT_UP_CAST_DATATYPE] Cannot up cast input from "ARRAY<STRING>" to "STRING".
The type path of the target object is:
- root class: "java.lang.String"
You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object

Expected Behavior

Steps To Reproduce

For example, I want to pass a text column of type array<string> to DocumentAssembler to produce a document column.

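import sparknlp
from sparknlp.base import DocumentAssembler

spark = sparknlp.start()
# The innermost list makes "text" an array<string> column
data = spark.createDataFrame([[["Spark NLP is an open-source text processing library."]]]).toDF("text")
documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
result = documentAssembler.transform(data)

result.select("document").show(truncate=False)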

Spark NLP version and Apache Spark

Spark NLP version sparknlp.version(): 4.4.0
Apache Spark version spark.version: 3.4.0

Type of Spark Application

No response

Java Version

No response

Java Home Directory

No response

Setup and installation

No response

Operating System and Version

No response

Link to your project (if available)

No response

Additional Information

No response


maziyarpanahi · 2023-05-22T15:36:21Z

As the error indicates, you have an array instead of a string: there is one extra pair of [] in your DataFrame.

# Each row is a list with one string cell: [[STRING HERE]]
data = spark.createDataFrame([["Spark NLP is an open-source text processing library."]]).toDF("text")

data.show()

Example notebook: https://colab.research.google.com/drive/1gh9GSoIJZCpWGocS_Ea6kNrjbZnA_3Mx?usp=sharing
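If the input data already arrives as lists of strings, one way to apply the advice above is to flatten it before building the DataFrame, so that every row holds a single plain string. The following is only a minimal sketch in plain Python; the helper name `to_string_rows` is hypothetical and not part of Spark NLP or PySpark, and it assumes a single-column dataset.

```python
from typing import Any, List


def to_string_rows(rows: List[List[Any]]) -> List[List[str]]:
    """Return single-column rows whose cell is always a plain string.

    A cell that is a list of strings (which Spark would infer as
    array<string>) is expanded into one row per string.
    Hypothetical helper for illustration only.
    """
    flat: List[List[str]] = []
    for row in rows:
        if len(row) != 1:
            raise ValueError("expected exactly one column per row")
        cell = row[0]
        if isinstance(cell, str):
            # Already a string cell: keep the row as it is.
            flat.append([cell])
        elif isinstance(cell, (list, tuple)) and all(isinstance(s, str) for s in cell):
            # A list of strings: emit one single-string row per element.
            flat.extend([s] for s in cell)
        else:
            raise TypeError(f"unsupported cell type: {type(cell).__name__}")
    return flat


nested = [[["Spark NLP is an open-source text processing library."]]]
rows = to_string_rows(nested)
# rows can then be passed on, e.g. spark.createDataFrame(rows).toDF("text")
```

When the data is already in a Spark DataFrame, `pyspark.sql.functions.explode` performs the equivalent expansion of an array column into one row per element.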

HeyBossy added the question label May 22, 2023

HeyBossy assigned maziyarpanahi May 22, 2023

maziyarpanahi closed this as completed May 22, 2023

maziyarpanahi mentioned this issue May 23, 2023

DocumentAssembler on array<string> #13816

Closed

1 task
