
Python: Fix UUID representation #8248

Closed
Fokko opened this issue Aug 7, 2023 · 6 comments · Fixed by #8267

Comments

@Fokko (Contributor) commented Aug 7, 2023

Feature Request / Improvement

Currently, we represent a UUID literal as UUIDLiteral[Literal[UUID]], but we want to change this to UUIDLiteral[Literal[bytes]].

Tables partitioned by UUIDs will probably also fail today, because the comparison will be between a UUID and bytes.
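For illustration, the mismatch, and the reason storing bytes is safe, can be seen directly with the standard library:

```python
import uuid

u = uuid.UUID("102cb62f-e6f8-4eb0-9973-d9b012ff0967")

# A uuid.UUID object never compares equal to its own raw bytes,
# which is why comparing a UUID literal against bytes breaks.
assert u != u.bytes

# The 16-byte representation round-trips losslessly, so storing
# bytes in the literal loses no information.
assert len(u.bytes) == 16
assert uuid.UUID(bytes=u.bytes) == u
```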

What we can do is create a table using PyIceberg:

```python
from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, UUIDType

# load_catalog arguments depend on your catalog configuration
catalog = load_catalog("default")

schema = Schema(
    NestedField(field_id=1, name="c1", field_type=UUIDType(), required=False),
)

catalog.create_table(
    identifier="default.uuid",
    schema=schema,
)
```

And then add some data:

```sql
INSERT INTO uuid VALUES ('102cb62f-e6f8-4eb0-9973-d9b012ff0967');
```

Would also be great to have tests around partitioned tables, but it looks like that's not possible today: #8247

cc @JonasJ-ap @HonahX I know that you were looking into this as well.

Query engine

None

@HonahX (Contributor) commented Aug 9, 2023

More context:

if we apply a row filter on a UUID column,

```python
unpartitioned_uuid = catalog.load_table("default.test_uuid_and_fixed_unpartitioned")
arrow_table_eq = unpartitioned_uuid.scan(row_filter="uuid_col == '102cb62f-e6f8-4eb0-9973-d9b012ff0967'").to_arrow()
```

it fails with

```
pyarrow.lib.ArrowInvalid: Could not convert UUID('102cb62f-e6f8-4eb0-9973-d9b012ff0967') with type UUID: did not recognize Python value type when inferring an Arrow data type
```

raised here:

```python
def _convert_scalar(value: Any, iceberg_type: IcebergType) -> pa.scalar:
    if not isinstance(iceberg_type, PrimitiveType):
        raise ValueError(f"Expected primitive type, got: {iceberg_type}")
    return pa.scalar(value).cast(schema_to_pyarrow(iceberg_type))

def visit_equal(self, term: BoundTerm[Any], literal: Literal[Any]) -> pc.Expression:
    return pc.field(term.ref().field.name) == _convert_scalar(literal.value, term.ref().field.field_type)
```

This happens because a UUID is represented as a fixed-size 16-byte binary in PyArrow, but the UUID literal stores a uuid.UUID object in its value.

@Fokko (Contributor, Author) commented Aug 9, 2023

@HonahX We need to store the bytes instead of the UUID. Are you interested in fixing this?

@HonahX (Contributor) commented Aug 9, 2023

@Fokko Yes! I just uploaded a draft PR #8267 for further discussion. This PR verifies that changing the literal value can solve the issue. I will try to see if I can add more tests to it.

Since PyIceberg uses the released spark-iceberg-runtime to create the test environment, we may not be able to add tables partitioned by UUID until the next release.

@raphaelauv commented:
@Fokko @HonahX thanks for the feature, but I have a problem using it: I can't insert data into an Iceberg table that has a column with UUID type.

```python
import uuid

import polars as pl
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, UUIDType

id = uuid.uuid4()

SCHEMA = Schema(
    NestedField(1, "id", UUIDType(), required=True),
)

df = pl.DataFrame({}).with_columns([pl.lit(id.bytes).alias("id")])

df = df.to_arrow()
df = df.cast(target_schema=SCHEMA.as_arrow())
```

[screenshot of the resulting cast error]

Do you have any idea? Thanks!

@Fokko (Contributor, Author) commented Jun 25, 2024

@raphaelauv This looks like a new issue, can you open up a new ticket? Thanks for including some example code there 👍

The underlying problem is that UUID is a so-called logical type, which is not supported by Arrow. :( The UUID is stored as a fixed-width 16-byte binary field, so it looks like we have to add some additional logic to the conversion.

@raphaelauv commented Jun 25, 2024

Thanks @Fokko! Should I create the issue in the py-iceberg repo?

apache/iceberg-python#855
