Python: Fix UUID representation #8248
Comments
More context: if we apply a row filter on a UUID column,
unpartitioned_uuid = catalog.load_table("default.test_uuid_and_fixed_unpartitioned")
arrow_table_eq = unpartitioned_uuid.scan(row_filter="uuid_col == '102cb62f-e6f8-4eb0-9973-d9b012ff0967'").to_arrow()
it fails with an error raised here:
iceberg/python/pyiceberg/io/pyarrow.py Lines 451 to 454 in f5f543a
iceberg/python/pyiceberg/io/pyarrow.py Lines 480 to 481 in f5f543a
because UUID is a fixed_size_binary[16] in pyarrow, but the UUID literal stores a UUID object in its value.
|
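To illustrate the mismatch described above, here is a minimal standard-library sketch. It is not PyIceberg's actual code: `to_uuid_bytes` is a hypothetical helper, and the assumption is that the column stores the UUID as its 16 big-endian bytes.

```python
import uuid

# The filter value from the comment above, parsed into a uuid.UUID object.
literal_value = uuid.UUID("102cb62f-e6f8-4eb0-9973-d9b012ff0967")

# What a 16-byte fixed-size binary column would hand back for the same row
# (assumption: the column holds the big-endian bytes of the UUID).
stored_value = literal_value.bytes

# Comparing the two representations directly never matches.
mismatch = literal_value == stored_value


def to_uuid_bytes(value):
    """Hypothetical helper: normalize a str, UUID, or bytes value to 16 raw bytes."""
    if isinstance(value, uuid.UUID):
        return value.bytes
    if isinstance(value, str):
        return uuid.UUID(value).bytes
    if isinstance(value, (bytes, bytearray)) and len(value) == 16:
        return bytes(value)
    raise ValueError(f"Cannot interpret {value!r} as a UUID")


# Once both sides are bytes, the comparison behaves as expected.
match = to_uuid_bytes(literal_value) == stored_value
print(mismatch, match)
```

Normalizing the literal to bytes up front keeps the comparison on a single representation, which is the direction proposed in the issue description.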
@HonahX We need to store the |
@Fokko Yes! I just uploaded a draft PR, #8267, for further discussion. It verifies that changing the literal value solves the issue. I will try to add more tests to it. Since pyiceberg uses the released spark-iceberg-runtime to create the test environment, we may not be able to add tables partitioned by UUID until the next release. |
@Fokko @HonahX Thanks for the feature, but I have a problem using it: I can't insert data into an Iceberg table that has a column of UUID type.
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, UUIDType
import polars as pl
import uuid
id = uuid.uuid4()
SCHEMA = Schema(
    NestedField(1, "id", UUIDType(), required=True),
)
df = pl.DataFrame({}).with_columns([pl.lit(id.bytes).alias("id")])
df = df.to_arrow()
df = df.cast(target_schema=SCHEMA.as_arrow())
Do you have any idea? Thanks |
@raphaelauv This looks like a new issue; can you open a new ticket? Thanks for including some example code there 👍 The underlying problem is that UUID is a so-called logical type that Arrow does not support. :( The UUID is stored as a fixed-width 16-byte binary field, so it looks like we have to add some additional logic to the conversion. |
|
Feature Request / Improvement
Currently, we represent a UUID literal as UUIDLiteral[Literal[UUID]], but we want to change this to UUIDLiteral[Literal[bytes]]. Tables that are partitioned by UUIDs will probably also fail today, because the comparison will be between a UUID and bytes.
What we can do is create a table using PyIceberg:
And then add some data:
It would also be great to have tests around partitioned tables, but it looks like that's not possible today: #8247
cc @JonasJ-ap @HonahX I know that you were looking into this as well.
Query engine
None