
Python: Fix UUID representation #8248

Closed
Fokko opened this issue Aug 7, 2023 · 6 comments · Fixed by #8267

Comments

@Fokko (Contributor) commented Aug 7, 2023

Feature Request / Improvement

Currently, we represent a UUID literal as UUIDLiteral[Literal[UUID]], but we want to change this to UUIDLiteral[Literal[bytes]].

Tables partitioned by UUIDs will probably also fail today, because the comparison will be between a UUID and bytes.
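For illustration, the mismatch, and the reason storing bytes is safe, can be seen directly with the standard library:

```python
import uuid

u = uuid.UUID("102cb62f-e6f8-4eb0-9973-d9b012ff0967")

# A uuid.UUID object never compares equal to its own raw bytes,
# which is why comparing a UUID literal against bytes breaks.
assert u != u.bytes

# The 16-byte representation round-trips losslessly, so storing
# bytes in the literal loses no information.
assert len(u.bytes) == 16
assert uuid.UUID(bytes=u.bytes) == u
```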

What we can do is create a table using PyIceberg:

```python
from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, UUIDType

# load_catalog arguments depend on your catalog configuration
catalog = load_catalog("default")

schema = Schema(
    NestedField(field_id=1, name="c1", field_type=UUIDType(), required=False),
)

catalog.create_table(
    identifier="default.uuid",
    schema=schema,
)
```

And then add some data:

```sql
INSERT INTO uuid VALUES ('102cb62f-e6f8-4eb0-9973-d9b012ff0967');
```

Would also be great to have tests around partitioned tables, but it looks like that's not possible today: #8247

cc @JonasJ-ap @HonahX I know that you were looking into this as well.

Query engine

None

@HonahX (Contributor) commented Aug 9, 2023

More context:

if we apply a row filter on a UUID column,

```python
unpartitioned_uuid = catalog.load_table("default.test_uuid_and_fixed_unpartitioned")
arrow_table_eq = unpartitioned_uuid.scan(row_filter="uuid_col == '102cb62f-e6f8-4eb0-9973-d9b012ff0967'").to_arrow()
```

it fails with

```
pyarrow.lib.ArrowInvalid: Could not convert UUID('102cb62f-e6f8-4eb0-9973-d9b012ff0967') with type UUID: did not recognize Python value type when inferring an Arrow data type
```

raised here:

```python
def _convert_scalar(value: Any, iceberg_type: IcebergType) -> pa.scalar:
    if not isinstance(iceberg_type, PrimitiveType):
        raise ValueError(f"Expected primitive type, got: {iceberg_type}")
    return pa.scalar(value).cast(schema_to_pyarrow(iceberg_type))

def visit_equal(self, term: BoundTerm[Any], literal: Literal[Any]) -> pc.Expression:
    return pc.field(term.ref().field.name) == _convert_scalar(literal.value, term.ref().field.field_type)
```

This happens because a UUID is represented as a fixed-size 16-byte binary in PyArrow, but the UUID literal stores a uuid.UUID object in its value.

@Fokko (Contributor, Author) commented Aug 9, 2023

@HonahX We need to store the bytes instead of the UUID. Are you interested in fixing this?

@HonahX (Contributor) commented Aug 9, 2023

@Fokko Yes! I just uploaded a draft PR #8267 for further discussion. This PR verifies that changing the literal value can solve the issue. I will try to see if I can add more tests to it.

Since PyIceberg uses the released spark-iceberg-runtime to create the test environment, we may not be able to add tables partitioned by UUID until the next release.

@raphaelauv commented:
@Fokko @HonahX thanks for the feature, but I have a problem using it: I can't insert data into an Iceberg table that has a column with UUID type.

```python
import uuid

import polars as pl
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, UUIDType

id = uuid.uuid4()

SCHEMA = Schema(
    NestedField(1, "id", UUIDType(), required=True),
)

df = pl.DataFrame({}).with_columns([pl.lit(id.bytes).alias("id")])

df = df.to_arrow()
df = df.cast(target_schema=SCHEMA.as_arrow())
```

[screenshot of the resulting cast error]

Do you have any idea? Thanks!

@Fokko (Contributor, Author) commented Jun 25, 2024

@raphaelauv This looks like a new issue, can you open up a new ticket? Thanks for including some example code there 👍

The underlying problem is that UUID is a so-called logical type, which is not supported by Arrow. :( The UUID is stored as a fixed-width 16-byte binary field, so it looks like we have to add some additional logic to the conversion.

@raphaelauv commented Jun 25, 2024

Thanks @Fokko! Should I create the issue in the py-iceberg repo?

apache/iceberg-python#855
