[SPARK-50050][PYTHON][CONNECT] Make lit accept str and bool type numpy ndarray #48591

Closed

@zhengruifeng (Contributor) commented Oct 22, 2024

What changes were proposed in this pull request?

Make lit in Spark Connect accept numpy ndarrays of str and bool dtype.

Why are the changes needed?

To be consistent with PySpark Classic, where lit already accepts these arrays:

In [4]: spark.range(1).select(sf.lit(np.array(["a", "b"], np.str_))).show()
+---------------+
|ARRAY('a', 'b')|
+---------------+
|         [a, b]|
+---------------+

Does this PR introduce any user-facing change?

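Yes. In Spark Connect, lit previously raised PySparkTypeError for a str numpy ndarray; it now returns an array column.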

before:

In [3]: spark.range(1).select(sf.lit(np.array(["a", "b"], np.str_))).show()
---------------------------------------------------------------------------
PySparkTypeError                          Traceback (most recent call last)
Cell In[3], line 1
----> 1 spark.range(1).select(sf.lit(np.array(["a", "b"], np.str_))).schema

File ~/Dev/spark/python/pyspark/sql/utils.py:272, in try_remote_functions.<locals>.wrapped(*args, **kwargs)
    269 if is_remote() and "PYSPARK_NO_NAMESPACE_SHARE" not in os.environ:
    270     from pyspark.sql.connect import functions
--> 272     return getattr(functions, f.__name__)(*args, **kwargs)
    273 else:
    274     return f(*args, **kwargs)

File ~/Dev/spark/python/pyspark/sql/connect/functions/builtin.py:274, in lit(col)
    272 dt = _from_numpy_type(col.dtype)
    273 if dt is None:
--> 274     raise PySparkTypeError(
    275         errorClass="UNSUPPORTED_NUMPY_ARRAY_SCALAR",
    276         messageParameters={"dtype": col.dtype.name},
    277     )
    279 # NumpyArrayConverter for Py4J can not support ndarray with int8 values.
    280 # Actually this is not a problem for Connect, but here still convert it
    281 # to int16 for compatibility.
    282 if dt == ByteType():

PySparkTypeError: [UNSUPPORTED_NUMPY_ARRAY_SCALAR] The type of array scalar 'str32' is not supported.

after:

In [4]: spark.range(1).select(sf.lit(np.array(["a", "b"], np.str_))).show()
+-----------+
|array(a, b)|
+-----------+
|     [a, b]|
+-----------+
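
The traceback above shows the dtype lookup returning None for the str array, which triggers the error, followed by a widening of int8 values to int16. A minimal sketch of that flow with str and bool handled is given below. It is not the code from this PR: DTypeInfo, _FIXED_WIDTH, spark_type_name and element_type_for_lit are hypothetical names, the function returns Spark SQL type names as plain strings rather than DataType objects, and the sketch only assumes a dtype object exposing kind and itemsize, as numpy.dtype does.

```python
from typing import NamedTuple, Optional


class DTypeInfo(NamedTuple):
    """Hypothetical stand-in for the two numpy.dtype attributes used below."""

    kind: str  # numpy kind code: "b" bool, "i" signed int, "f" float, "U" unicode str
    itemsize: int  # element size in bytes


# Hypothetical lookup table from (kind, itemsize) to a Spark SQL type name.
_FIXED_WIDTH = {
    ("i", 1): "tinyint",
    ("i", 2): "smallint",
    ("i", 4): "int",
    ("i", 8): "bigint",
    ("f", 4): "float",
    ("f", 8): "double",
}


def spark_type_name(dtype) -> Optional[str]:
    """Return a Spark SQL type name for a numpy-like dtype, or None if unsupported."""
    if dtype.kind == "U":
        # str arrays: itemsize depends on the longest element, so match on kind only.
        return "string"
    if dtype.kind == "b":
        # bool arrays
        return "boolean"
    return _FIXED_WIDTH.get((dtype.kind, dtype.itemsize))


def element_type_for_lit(dtype) -> str:
    """Pick the array element type, rejecting dtypes that have no mapping."""
    name = spark_type_name(dtype)
    if name is None:
        # Plays the role of the UNSUPPORTED_NUMPY_ARRAY_SCALAR error in the traceback.
        raise TypeError(
            f"unsupported numpy dtype: kind={dtype.kind!r}, itemsize={dtype.itemsize}"
        )
    # Mirror the int8 -> int16 widening mentioned in the traceback comment.
    if name == "tinyint":
        return "smallint"
    return name


if __name__ == "__main__":
    print(element_type_for_lit(DTypeInfo("U", 4)))  # 1-character unicode str
    print(element_type_for_lit(DTypeInfo("b", 1)))  # bool
    print(element_type_for_lit(DTypeInfo("i", 1)))  # int8
```

With numpy installed, np.array(["a", "b"], np.str_).dtype could be passed in place of DTypeInfo, since numpy dtypes expose the same kind and itemsize attributes.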

How was this patch tested?

CI.

Was this patch authored or co-authored using generative AI tooling?

No.

@zhengruifeng changed the title from "[SPARK-50050][PYTHON][CONNECT] Make lit accept str and bool type numpy ndarray" to "[WIP][SPARK-50050][PYTHON][CONNECT] Make lit accept str and bool type numpy ndarray" on Oct 22, 2024
@zhengruifeng marked this pull request as draft on October 22, 2024 05:33
nit

nit
@zhengruifeng changed the title from "[WIP][SPARK-50050][PYTHON][CONNECT] Make lit accept str and bool type numpy ndarray" to "[SPARK-50050][PYTHON][CONNECT] Make lit accept str and bool type numpy ndarray" on Oct 22, 2024
@zhengruifeng marked this pull request as ready for review on October 22, 2024 09:49
@HyukjinKwon (Member)

Merged to master.

@zhengruifeng deleted the connect_lit_bool_str branch on October 23, 2024 00:41
@xinrong-meng (Member)

Late LGTM, thank you!
