[Python] Converting from NumPy to large_string or large_binary returns not implemented #35289

Open
phofl opened this issue Apr 23, 2023 · 1 comment · May be fixed by #36701

Comments

@phofl
Contributor

phofl commented Apr 23, 2023

Describe the bug, including details regarding any error messages, version, and platform.


import numpy as np
import pyarrow as pa

arr = np.array(["a", "b"])
pa.array(arr, type=pa.large_string())

This returns

    pa.array(arr, type=pa.large_string())
  File "pyarrow/array.pxi", line 316, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: NumPyConverter doesn't implement <large_string> conversion. 

I think this should work?

Component(s)

Python

@jorisvandenbossche
Member

Yes, there is indeed no specific reason for this not to work; it simply has not been implemented yet.

Looking at the code, this error is the default fallback, because NumPyConverter is only implemented for the non-large StringType. We have this:

Status NumPyConverter::Visit(const StringType& type) {
  util::InitializeUTF8();
  ::arrow::internal::ChunkedStringBuilder builder(kBinaryChunksize, pool_);
  auto data = reinterpret_cast<const uint8_t*>(PyArray_DATA(arr_));

There is no equivalent Visit(const LargeStringType& type), however. The implementation for StringType is based on ChunkedStringBuilder, which is a chunked version of StringBuilder. Since LargeStringBuilder already exists, it should certainly be possible to add a ChunkedLargeStringBuilder as well and then template the NumPyConverter code to work with both builders.

dongjoon-hyun pushed a commit to apache/spark that referenced this issue Jun 14, 2023
…eVarTypes` as an internal configuration

### What changes were proposed in this pull request?

This PR is a followup of #39572 that hides the `spark.sql.execution.arrow.useLargeVarTypes` configuration as an internal configuration.

### Why are the changes needed?

As described in #41569, this feature only works for `mapInArrow`; other cases cannot be fully supported because of a limitation on the Arrow side, see apache/arrow#35289. Therefore, this PR hides this configuration as an internal one for now.

### Does this PR introduce _any_ user-facing change?

No, this configuration has not been released yet.

### How was this patch tested?

Ran the Scala linter.

Closes #41584 from HyukjinKwon/SPARK-39979-followup2.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>