[Python] Native support for UUID #15058
Comments
This was brought up years ago: https://lists.apache.org/thread/k2zvgoq62dyqmw3mj2t6ozfzhzkjkc4j but seems to have been dropped in favor of the extension. Adding types requires a mailing list (ML) discussion and vote, so it would probably be best to (re)start the discussion there.
Hey @Fokko how about making it a canonical extension? [1][2]
[1] https://arrow.apache.org/docs/format/CanonicalExtensions.html
This would be good to make a canonical extension type, given there are many integration points for it:
This seems like a good candidate to prototype spreading a canonical extension type through the ecosystem, given how simple and ubiquitous it is. I may look into this in about a month, if someone else doesn't beat me to it.
I agree UUID sounds like a good fit for adding as a canonical extension type.
@wjones127 it would be awesome if you have time to implement this, because the only proposed solution on Stack Overflow is to use DuckDB (for converting tables with UUIDs to Parquet).
I found a

    class ARROW_TESTING_EXPORT UuidArray : public ExtensionArray {
     public:
      using ExtensionArray::ExtensionArray;
    };

    class ARROW_TESTING_EXPORT UuidType : public ExtensionType {
     public:
      UuidType() : ExtensionType(fixed_size_binary(16)) {}
      std::string extension_name() const override { return "uuid"; }
      bool ExtensionEquals(const ExtensionType& other) const override;
      std::shared_ptr<Array> MakeArray(std::shared_ptr<ArrayData> data) const override;
      Result<std::shared_ptr<DataType>> Deserialize(
          std::shared_ptr<DataType> storage_type,
          const std::string& serialized) const override;
      std::string Serialize() const override { return "uuid-serialized"; }
    };

Maybe we just need to draft and make it available?
I've opened #37298 using logic from
@Fokko I'm curious, can you explain what "natively" supporting a UUID means? A UUID is just a bunch of 16 opaque bytes with no actionable contents... (in general, people seem to think that UUID gives them something better than pure random strings, for whatever cargo-culted reason)
@pitrou That's true, for PyIceberg it is about maintaining the type information. We have converters to go from and to Iceberg ⇔ PyArrow schemas. Going from an Iceberg UUID to a
Well, you want to avoid utf8 if it's not needed. In Iceberg the UUID is also often used to apply bucket partitioning, which works well on fixed 16 bytes.
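To make the size argument concrete, here is a minimal sketch using only Python's standard `uuid` and `zlib` modules. The `bucket` helper is hypothetical and uses CRC32 purely as a stand-in hash; it is not Iceberg's actual bucket transform.

```python
import uuid
import zlib


def bucket(value: bytes, num_buckets: int) -> int:
    """Hypothetical bucketing helper; CRC32 is a stand-in hash, not Iceberg's."""
    return zlib.crc32(value) % num_buckets


u = uuid.UUID("12345678-1234-5678-1234-567812345678")

as_text = str(u).encode("utf-8")  # canonical hyphenated text form
as_bytes = u.bytes                # raw 16-byte big-endian form

print(len(as_text))   # 36
print(len(as_bytes))  # 16

# Every UUID has the same fixed width, so the input to the hash is uniform.
print(0 <= bucket(as_bytes, 8) < 8)  # True
```

The text form is more than twice the size of the binary form, which is the cost being avoided by not storing UUIDs as utf8.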
Unless you expect Iceberg and Arrow to support the exact same types, you should probably have a mechanism to store and restore Iceberg metadata to/from Arrow data.
For me that's support for Python's built-in uuid.UUID type when reading/writing dataframes from pandas. As already mentioned, you can convert to bytes, but during loading it won't be converted back.
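The round-trip problem described here can be illustrated with the standard library alone. This is only a sketch: the `stored` list stands in for whatever file or column actually holds the bytes, and no pandas or PyArrow API is involved.

```python
import uuid

original = [uuid.uuid4() for _ in range(3)]

# Writing: each UUID is converted to 16 opaque bytes.
stored = [u.bytes for u in original]

# Reading: plain bytes come back; nothing marks them as UUIDs.
loaded = list(stored)
print(all(isinstance(v, bytes) for v in loaded))  # True

# The caller has to know to rebuild uuid.UUID objects explicitly.
restored = [uuid.UUID(bytes=v) for v in loaded]
print(restored == original)  # True
```

A type that carries the "this is a UUID" information alongside the bytes would let a reader perform the last step automatically.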
Looking forward to this. Would be great if there were a way (through casting from string or a compute function
That sounds quite useful @shenker. You mean something like this?

    import pyarrow as pa
    pa.array(["0"*16], pa.list_(pa.string(), 16)).cast(pa.list_(pa.binary(), 16))

or rather

    pa.array(["0"*16], pa.list_(pa.string(), 16)).cast(uuid())

I'm not sure how much work setting up extension types would be with cast kernels. Either way this idea would be better off in a new issue :).
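For reference, a rough sketch of what such a string-to-UUID cast would have to do per element, written with the standard `uuid` module only. The helper names are hypothetical and this is not a PyArrow cast kernel; the 16-byte results are simply what a `fixed_size_binary(16)` storage would hold.

```python
import uuid


def uuid_strings_to_bytes(values):
    """Hypothetical helper: parse UUID strings into 16-byte values, keeping None as null."""
    return [None if v is None else uuid.UUID(v).bytes for v in values]


def uuid_bytes_to_strings(values):
    """Hypothetical inverse: format 16-byte values as canonical UUID strings."""
    return [None if v is None else str(uuid.UUID(bytes=v)) for v in values]


strings = [
    "00000000-0000-0000-0000-000000000000",
    None,
    "12345678-1234-5678-1234-567812345678",
]

packed = uuid_strings_to_bytes(strings)
print([None if b is None else len(b) for b in packed])  # [16, None, 16]
print(uuid_bytes_to_strings(packed) == strings)         # True
```

Invalid input raises `ValueError` from `uuid.UUID`, which is the behavior a real cast would also need to define.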
### Rationale for this change

See apache#15058. The UUID datatype is common throughout the ecosystem, and Arrow supporting it as a native type would reduce friction.

### What changes are included in this PR?

This PR implements the logic for an Arrow canonical extension type in C++ and a Python wrapper.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes, a new extension type is added.

* Closes: apache#15058

Authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Describe the enhancement requested
In Apache Iceberg we have support for the UUID type. I think it would be nice to also support this in (Py)Arrow natively instead of having an extension.
Component(s)
Python