pl.from_arrow failing when underlying arrays are immutable #6315

Closed
2 tasks done
RichardHaythorn opened this issue Jan 19, 2023 · 9 comments
Labels
A-io Area: reading and writing data bug Something isn't working python Related to Python Polars

Comments

@RichardHaythorn

Polars version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Issue description

pl.from_arrow fails when an underlying array in pa.table is immutable.

I can't give a full reproducible example as I'm unsure how to create immutable arrays in pyarrow, and the pyarrow table I'm working on comes from elsewhere. The pyarrow documentation only says mutable arrays can be created.

Exception that is thrown:

ArrowErrorException                       Traceback (most recent call last)
Cell In[319], line 1
----> 1 pl.from_arrow(prices)

File /usr/local/lib/python3.10/site-packages/polars/convert.py:280, in from_arrow(a, rechunk)
    228 """
    229 Create a DataFrame or Series from an Arrow Table or Array.
    230 
   (...)
    277 
    278 """
    279 if isinstance(a, pa.Table):
--> 280     return DataFrame._from_arrow(a, rechunk=rechunk)
    281 elif isinstance(a, (pa.Array, pa.ChunkedArray)):
    282     return Series._from_arrow("", a, rechunk)

File /usr/local/lib/python3.10/site-packages/polars/internals/dataframe/frame.py:470, in DataFrame._from_arrow(cls, data, columns, rechunk)
    441 @classmethod
    442 def _from_arrow(
    443     cls: type[DF],
   (...)
    446     rechunk: bool = True,
    447 ) -> DF:
    448     """
...
--> 958         pydf = PyDataFrame.from_arrow_record_batches(tbl.to_batches())
    959 else:
    960     pydf = pli.DataFrame([])._df

ArrowErrorException: OutOfSpec("An ArrowArray of type Utf8\n            must have buffer 1 aligned to type i32")

Reproducible example

import polars as pl
import pyarrow as pa

# Mutable table
chkd_arr = pa.chunked_array([["a", "b", "c"], ["d", "e", "f"]])
chkd_tbl = pa.table([chkd_arr], names=["id"])
print(chkd_arr.chunk(0).buffers()[1].is_mutable)  # This prints True

pl.from_arrow(chkd_tbl)  # Runs fine

# Immutable table
imm_tbl = pa.table(...)
print(imm_tbl.column(0).chunk(0).buffers()[1].is_mutable)  # This prints False

pl.from_arrow(imm_tbl)  # Throws exception

# Workaround for immutable table: combine chunks
imm_tbl = pa.table(...).combine_chunks()
print(imm_tbl.column(0).chunk(0).buffers()[1].is_mutable)  # This prints True

pl.from_arrow(imm_tbl)  # Runs fine

Expected behavior

Expected: Returns polars dataframe

Installed versions

---Version info---
Polars: 0.15.15
Index type: UInt32
Platform: Linux-5.10.0-11-amd64-x86_64-with-glibc2.31
Python: 3.10.6 (main, Aug 23 2022, 08:36:38) [GCC 10.2.1 20210110]
---Optional dependencies---
pyarrow: 9.0.0
pandas: 1.5.2
numpy: 1.24.1
fsspec: <not installed>
connectorx: <not installed>
xlsx2csv: <not installed>
matplotlib: <not installed>
@RichardHaythorn RichardHaythorn added bug Something isn't working python Related to Python Polars labels Jan 19, 2023
@ritchie46
Member

Mutability should not matter. Can you try to make an MWE?

@RichardHaythorn
Author

Sorry, I can't seem to get an MWE yet. I did some further investigation and found that performing a deepcopy of the table also works, and that the conversion seems to fail when the parent of the underlying buffer is set to a value.

tbl = pa.table(...)
parent = tbl.column(0).chunk(0).buffers()[1].parent
# parent is None: pl.from_arrow(tbl) works
# parent is <pyarrow.lib.Buffer object at ...>: pl.from_arrow(tbl) fails

@ritchie46
Member

If you have one, please let us know. With this information alone, it's hard for us to help you.

@lpaulot

lpaulot commented Jan 27, 2023

We have run into the same error and I think it is due to data alignment. polars.from_arrow does not work when a buffer address is not a multiple of 8. From what I see, alignment is enforced for tables created locally but not always when going through Arrow Flight.

Here is a reproduction:

import pyarrow as pa
import pyarrow.flight as fl
import polars as pl


if __name__ == "__main__":
    print("pyarrow:", pa.__version__, "polars:", pl.__version__)

    table_in = pa.table({"A": [0]})
    print("Input alignment:", table_in["A"].chunk(0).buffers()[1].address % 8)
    pl.from_arrow(table_in)

    class Server(fl.FlightServerBase):
        def do_get(self, context, ticket):
            return fl.RecordBatchStream(table_in)

    with Server() as s:
        client = fl.FlightClient(("127.0.0.1", s.port))
        for _ in range(1000):
            table_out = client.do_get(fl.Ticket("")).read_all()
            print("Output alignment:", table_out["A"].chunk(0).buffers()[1].address % 8)
            pl.from_arrow(table_out)  # raises ArrowErrorException when the buffer is not aligned

combine_chunks can fix the issue when there are several chunks as it will allocate new buffers. However it will not work when there is only one chunk. A workaround is to use copy.deepcopy.
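
The alignment check from the reproduction above can be extended to every buffer of a table. The following is only a minimal sketch and not part of Polars or pyarrow: the helper name misaligned_buffers and its alignment parameter are hypothetical, and it assumes the argument behaves like a pyarrow.Table (column_names, columns, chunks, buffers() and address).

```python
def misaligned_buffers(table, alignment=8):
    """Return (column, chunk index, buffer index, address) for every buffer
    whose address is not a multiple of `alignment`.

    Hypothetical helper: `table` is assumed to behave like a pyarrow.Table.
    """
    found = []
    for name, column in zip(table.column_names, table.columns):
        for chunk_idx, chunk in enumerate(column.chunks):
            for buf_idx, buf in enumerate(chunk.buffers()):
                # buffers() can contain None for an absent buffer, e.g. the
                # validity bitmap of a chunk without nulls.
                if buf is None:
                    continue
                if buf.address % alignment != 0:
                    found.append((name, chunk_idx, buf_idx, buf.address))
    return found
```

Calling misaligned_buffers(table_out) before pl.from_arrow(table_out) should list the buffers that are not 8-byte aligned; an empty list means every buffer passed the check.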

@ritchie46 ritchie46 self-assigned this Jan 27, 2023
@lpaulot

lpaulot commented Jan 27, 2023

For the record, the same kind of issue happened with Acero: https://issues.apache.org/jira/browse/ARROW-18115 and the solution apparently was to introduce a realignment function: https://issues.apache.org/jira/browse/ARROW-18119
The Flight root cause is https://issues.apache.org/jira/browse/ARROW-16958

@ritchie46
Member

Issue upstream: apache/arrow#32276

I am inclined to think this is a bug upstream. I would assume that memory would need to be aligned with the type. In this case with an i64. I will ask around.

@ritchie46
Member

The Arrow columnar format specification states that alignment is a recommendation:

Implementations are recommended to allocate memory on aligned addresses (multiple of 8- or 64-bytes) and pad (overallocate) to a length that is a multiple of 8 or 64 bytes. 

but that for IPC they are enforced:

When serializing Arrow data for interprocess communication, these alignment and padding requirements are enforced. 

So this is a bug in pyarrow.
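
To make the two quoted requirements concrete, here is a small sketch of the arithmetic they describe. The helper names is_aligned and padded_length are hypothetical and are not part of any Arrow API; the sketch only illustrates what "aligned address" and "padded to a multiple of 8 or 64 bytes" mean.

```python
def is_aligned(address, alignment=8):
    """True if `address` is a multiple of `alignment` (in bytes)."""
    return address % alignment == 0


def padded_length(nbytes, alignment=8):
    """Round `nbytes` up to the next multiple of `alignment`."""
    return (nbytes + alignment - 1) // alignment * alignment


# A 10-byte buffer is over-allocated to 16 bytes with 8-byte padding,
# or to 64 bytes with 64-byte padding.
print(padded_length(10), padded_length(10, 64))
# An address ending in 4 is not 8-byte aligned.
print(is_aligned(0x1004))
```

Under this reading, the buffers received through Flight in the reproduction above fail the is_aligned check, which is what from_arrow rejects.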

@maroshmka

Hey, can you link the bug report in pyarrow, please? The linked issue is for polars and it's also missing the link. Thanks.

@stinodego
Contributor

This now works on the latest pyarrow/polars versions.

@stinodego stinodego removed the needs triage Awaiting prioritization by a maintainer label Jan 16, 2024