`pl.from_arrow` failing when underlying arrays are immutable #6315

RichardHaythorn · 2023-01-19T11:28:23Z

Polars version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Issue description

pl.from_arrow fails when an underlying array in pa.table is immutable.

I can't give a full reproducible example as I'm unsure how to create immutable arrays in pyarrow, and the pyarrow table I'm working on comes from elsewhere. The pyarrow documentation only says mutable arrays can be created.

Exception that is thrown:

ArrowErrorException                       Traceback (most recent call last)
Cell In[319], line 1
----> 1 pl.from_arrow(prices)

File /usr/local/lib/python3.10/site-packages/polars/convert.py:280, in from_arrow(a, rechunk)
    228 """
    229 Create a DataFrame or Series from an Arrow Table or Array.
    230 
   (...)
    277 
    278 """
    279 if isinstance(a, pa.Table):
--> 280     return DataFrame._from_arrow(a, rechunk=rechunk)
    281 elif isinstance(a, (pa.Array, pa.ChunkedArray)):
    282     return Series._from_arrow("", a, rechunk)

File /usr/local/lib/python3.10/site-packages/polars/internals/dataframe/frame.py:470, in DataFrame._from_arrow(cls, data, columns, rechunk)
    441 @classmethod
    442 def _from_arrow(
    443     cls: type[DF],
   (...)
    446     rechunk: bool = True,
    447 ) -> DF:
    448     """
...
--> 958         pydf = PyDataFrame.from_arrow_record_batches(tbl.to_batches())
    959 else:
    960     pydf = pli.DataFrame([])._df

ArrowErrorException: OutOfSpec("An ArrowArray of type Utf8\n            must have buffer 1 aligned to type i32")

Reproducible example

import polars as pl
import pyarrow as pa

#Mutable 
chkd_arr = pa.chunked_array([["a","b","c"],["d","e","f"]])
chkd_tbl = pa.table([chkd_arr],names=["id"])
print(chkd_arr.chunk(0).buffers()[1].is_mutable) #This prints True

pl.from_arrow(chkd_tbl) #Runs fine

#Immutable table
imm_tbl = pa.table(...)
print(imm_tbl.column(0).chunk(0).buffers()[1].is_mutable) #This prints False

pl.from_arrow(imm_tbl) #Throws exception

#Workaround for immutable table, combine chunks
imm_tbl = pa.table(...).combine_chunks()
print(imm_tbl.column(0).chunk(0).buffers()[1].is_mutable) #This prints True

pl.from_arrow(imm_tbl) #Runs fine

Expected behavior

Expected: Returns polars dataframe

Installed versions

---Version info---
Polars: 0.15.15
Index type: UInt32
Platform: Linux-5.10.0-11-amd64-x86_64-with-glibc2.31
Python: 3.10.6 (main, Aug 23 2022, 08:36:38) [GCC 10.2.1 20210110]
---Optional dependencies---
pyarrow: 9.0.0
pandas: 1.5.2
numpy: 1.24.1
fsspec: <not installed>
connectorx: <not installed>
xlsx2csv: <not installed>
matplotlib: <not installed>


ritchie46 · 2023-01-19T13:32:23Z

Mutability should not matter. Can you try to make an MWE?

RichardHaythorn · 2023-01-19T14:59:38Z

Sorry, I can't seem to get an MWE yet. I did some further investigation and found that performing a deepcopy of the table also works, and that pl.from_arrow seems to fail when the underlying buffer's parent is set to a value rather than None.

tbl = pa.table(...)  # table obtained from elsewhere
parent = tbl.column(0).chunk(0).buffers()[1].parent
# parent is None: pl.from_arrow(tbl) works
# parent is <pyarrow.lib.Buffer object at ...>: pl.from_arrow(tbl) fails

ritchie46 · 2023-01-20T11:12:26Z

If you get one, please let us know. With this information alone, it's hard for us to help you.

lpaulot · 2023-01-27T09:57:58Z

We have run into the same error, and I think it is due to data alignment. polars.from_arrow does not work when a buffer address is not a multiple of 8. From what I see, alignment is enforced for tables created locally but not always when going through Arrow Flight.

Here is a reproduction:

import pyarrow as pa
import pyarrow.flight as fl
import polars as pl


if __name__ == "__main__":
    print("pyarrow:", pa.__version__, "polars:", pl.__version__)

    table_in = pa.table({"A": [0]})
    print("Input alignment:", table_in["A"].chunk(0).buffers()[1].address % 8)
    pl.from_arrow(table_in)

    class Server(fl.FlightServerBase):
        def do_get(self, context, ticket):
            return fl.RecordBatchStream(table_in)

    with Server() as s:
        client = fl.FlightClient(("127.0.0.1", s.port))
        for _ in range(1000):
            table_out = client.do_get(fl.Ticket("")).read_all()
            print("Output alignment:", table_out["A"].chunk(0).buffers()[1].address % 8)
            pl.from_arrow(table_out)  # raises ArrowErrorException when the buffer is not aligned

combine_chunks can fix the issue when there are several chunks, as it will allocate new buffers. However, it will not work when there is only one chunk. A workaround in that case is to use copy.deepcopy.
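To illustrate why a fresh allocation helps, here is a minimal standard-library sketch (no pyarrow involved). It assumes the misaligned buffer is a view at some offset inside a larger allocation, which would be consistent with the `parent` attribute being set on the failing buffers. The names `backing`, `view_address` and `aligned_copy` are hypothetical and exist only for this illustration; they are not pyarrow or Polars APIs.

```python
import ctypes

ALIGNMENT = 8  # bytes; the width of one int64 value


def is_aligned(address: int, alignment: int = ALIGNMENT) -> bool:
    """Return True when address is a multiple of alignment."""
    return address % alignment == 0


def aligned_copy(data: bytes, alignment: int = ALIGNMENT):
    """Copy data to an address that is a multiple of alignment.

    Over-allocates by `alignment` bytes and writes at the first aligned
    address inside the new block. Returns (owner, address); keep `owner`
    alive for as long as the address is in use.
    """
    owner = ctypes.create_string_buffer(len(data) + alignment)
    start = ctypes.addressof(owner)
    address = start + (-start) % alignment
    ctypes.memmove(address, data, len(data))
    return owner, address


# A larger allocation standing in for a message body (assumption).
backing = ctypes.create_string_buffer(64)
base = ctypes.addressof(backing)

# A view that starts 4 bytes past an 8-byte boundary inside the backing block.
view_offset = (-base) % ALIGNMENT + 4
view_address = base + view_offset
payload = ctypes.string_at(view_address, 8)  # the 8 bytes the view covers

owner, copy_address = aligned_copy(payload)

print("view address % 8:", view_address % ALIGNMENT)  # 4
print("copy address % 8:", copy_address % ALIGNMENT)  # 0
```

The view is fine for 4-byte values but not for 8-byte ones, while the copy sits on an 8-byte boundary. Whether combine_chunks or copy.deepcopy reallocates in a given pyarrow version is something to check with `buffers()[1].address % 8`, as in the reproduction.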

lpaulot · 2023-01-27T14:38:26Z

For the record, the same kind of issue happened with Acero: https://issues.apache.org/jira/browse/ARROW-18115 and the solution apparently was to introduce a realignment function: https://issues.apache.org/jira/browse/ARROW-18119
The Flight root cause is https://issues.apache.org/jira/browse/ARROW-16958

ritchie46 · 2023-01-27T14:57:24Z

Issue upstream: apache/arrow#32276

I am inclined to think this is a bug upstream. I would assume that memory would need to be aligned with the type. In this case with an i64. I will ask around.
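A rough sketch of what "aligned with the type" means in terms of addresses, and why the same address can be acceptable for one column type and rejected for another. It assumes the required alignment equals the byte width of the type; the table covers the type names that appear in this thread (i32, i64, i128). The function names are hypothetical, and this is not the actual check performed by Polars or arrow2.

```python
# Byte widths of the native types named in this thread (assumed to be
# the required alignment for each type).
TYPE_WIDTH = {"i32": 4, "i64": 8, "i128": 16}


def is_aligned_to(address: int, type_name: str) -> bool:
    """Return True when address is a multiple of the type's byte width."""
    return address % TYPE_WIDTH[type_name] == 0


def satisfied_types(address: int) -> list[str]:
    """List the types whose alignment requirement this address meets."""
    return [name for name in TYPE_WIDTH if is_aligned_to(address, name)]


print(satisfied_types(0x1000))      # ['i32', 'i64', 'i128']
print(satisfied_types(0x1000 + 8))  # ['i32', 'i64']
print(satisfied_types(0x1000 + 4))  # ['i32']
print(satisfied_types(0x1000 + 2))  # []
```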

ritchie46 · 2023-02-01T07:58:56Z

The Arrow columnar format specification states that alignment is a recommendation:

Implementations are recommended to allocate memory on aligned addresses (multiple of 8- or 64-bytes) and pad (overallocate) to a length that is a multiple of 8 or 64 bytes.

but that for IPC they are enforced:

When serializing Arrow data for interprocess communication, these alignment and padding requirements are enforced.

So this is a bug in pyarrow.
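The padding part of the recommendation quoted above amounts to rounding a buffer length up to the next multiple of 8 or 64 bytes. A minimal sketch of that arithmetic, where `padded_length` is a hypothetical helper name and not part of pyarrow or Polars:

```python
def padded_length(nbytes: int, alignment: int = 64) -> int:
    """Round nbytes up to the next multiple of alignment."""
    if nbytes < 0 or alignment <= 0:
        raise ValueError("nbytes must be >= 0 and alignment must be > 0")
    # Ceiling division without floats, then scale back to bytes.
    return -(-nbytes // alignment) * alignment


print(padded_length(8))      # 64
print(padded_length(64))     # 64
print(padded_length(65))     # 128
print(padded_length(12, 8))  # 16
```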

maroshmka · 2023-06-06T12:43:34Z

Hey, can you link the bug report in pyarrow, please? The linked issue is for Polars, and it's also missing the link. Thanks.

stinodego · 2024-01-16T15:13:00Z

This now works on the latest pyarrow/polars versions.

RichardHaythorn added bug Something isn't working python Related to Python Polars labels Jan 19, 2023

ritchie46 self-assigned this Jan 27, 2023

ritchie46 removed their assignment Feb 1, 2023

ritchie46 mentioned this issue Jun 3, 2023

OutOfSpec("An ArrowArray of type Decimal(36, 6)\n must have buffer 1 aligned to type i128") #9184

Closed

2 tasks

RichardHaythorn mentioned this issue Jul 12, 2023

[C++][FlightRPC] Buffer handling change in 12.0.1 causing a lot of warnings being printed apache/arrow#36301

Open

RichardHaythorn mentioned this issue Aug 1, 2023

[C++][FlightRPC] Flight generates misaligned buffers apache/arrow#32276

Open

ritchie46 mentioned this issue Aug 10, 2023

feat: handle unaligned pointers in FFI jorgecarleitao/arrow2#1535

Merged

stinodego added needs triage Awaiting prioritization by a maintainer A-io Area: reading and writing data labels Jan 13, 2024

stinodego closed this as completed Jan 16, 2024

stinodego removed the needs triage Awaiting prioritization by a maintainer label Jan 16, 2024
