Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python: Change UUID representation to bytes and add integration tests #8267

Merged
merged 4 commits into from
Aug 10, 2023

Conversation

HonahX
Copy link
Contributor

@HonahX HonahX commented Aug 9, 2023

Fixes: #8248

This PR changes the value stored in UUID literal from uuid to bytes and add relevant integration tests for unpartitined uuid tables

@Fokko Fokko marked this pull request as ready for review August 9, 2023 09:49
Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some small comments, thanks for fixing this @HonahX

@@ -139,7 +139,7 @@ def literal(value: L) -> Literal[L]:
elif isinstance(value, str):
return StringLiteral(value)
elif isinstance(value, UUID):
return UUIDLiteral(value)
return UUIDLiteral(value) # type: ignore
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return UUIDLiteral(value) # type: ignore
return UUIDLiteral(value.bytes)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your suggestion. I tried this and it turned out that the error here:

Incompatible return value type (got "UUIDLiteral", expected "Literal[UUID]")  [return-value]

is due to the literal function expecting a return value of Literal(<Type of value>).
Hence, we need this ignore since UUIDLiteral is now a Literal[bytes]

I chose to let the initialization of UUIDLiteral still take UUID as input to maintain minimal changes to initialization code in other places. I think letting UUIDLiteral.__init__ takes bytes or Union[bytes, uuid] is also reasonable. Please let me know if you prefer other way of initialization this class.

Copy link
Contributor

@Fokko Fokko Aug 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, the UUIDLiteral is not intended to be used by the user. The user would pass in:

>>> l = literal('27f6397c-65b9-4464-a2f5-f4ee81629134')
StringLiteral('27f6397c-65b9-4464-a2f5-f4ee81629134')

And then internally, when binding to the schema, it will be converted into an UUID, which is failing right now:

>>> from pyiceberg.types import UUIDType
>>> l.to(UUIDType)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/functools.py", line 914, in _method
    return method.__get__(obj, cls)(*args, **kwargs)
  File "/Users/fokkodriesprong/Desktop/iceberg/python/pyiceberg/expressions/literals.py", line 519, in to
    raise TypeError(f"Cannot convert StringLiteral into {type_var}")
TypeError: Cannot convert StringLiteral into <class 'pyiceberg.types.UUIDType'>

Which means that we need to implement this one as well.

The same is for the bytes:

>>> from uuid import UUID
>>> u = UUID('27f6397c-65b9-4464-a2f5-f4ee81629134')
>>> l = literal(u.bytes)
>>> l
BinaryLiteral(b"'\xf69|e\xb9Dd\xa2\xf5\xf4\xee\x81b\x914")
>>> l.to(UUIDType)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/functools.py", line 914, in _method
    return method.__get__(obj, cls)(*args, **kwargs)
  File "/Users/fokkodriesprong/Desktop/iceberg/python/pyiceberg/expressions/literals.py", line 640, in to
    raise TypeError(f"Cannot convert BinaryLiteral into {type_var}")
TypeError: Cannot convert BinaryLiteral into <class 'pyiceberg.types.UUIDType'>

I think we need to extend the literal constructor with many more types, such as datetime, date, time, UUID, etc. For the datetime related, I've created a separate issue: #8282 Maybe we can add UUID to literal(..) in this PR to avoid backward incompatibility.

Copy link
Contributor Author

@HonahX HonahX Aug 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation! I added conversions from BinaryLiteral to UUID and from FixedLiteral to UUID, and related tests.
Here is a summarization of the supported conversion now:

from pyiceberg.expressions.literals import *
from pyiceberg.types import *
import uuid

test_uuid = uuid.UUID('d4ccb06d-e7de-48e6-ad05-649fe8deda50')

lu = literal(test_uuid)
ls = literal('d4ccb06d-e7de-48e6-ad05-649fe8deda50')
lbinary = literal(test_uuid.bytes)
lfixed = literal(test_uuid.bytes).to(FixedType(16))

assert lu.value == test_uuid.bytes
assert ls.to(UUIDType()) == lu
assert lbinary.to(UUIDType()) == lu
assert lfixed.to(UUIDType()) == lu

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems the console conversion from string to uuid fails because to expect a type instance:
l.to(UUIDType())

@Fokko Fokko merged commit 2535c3a into apache:master Aug 10, 2023
@Fokko
Copy link
Contributor

Fokko commented Aug 10, 2023

Thanks @HonahX for fixing this and also smoothing out the code 👍🏻

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Python: Fix UUID representation
2 participants