[Python] Native support for UUID #15058

Closed
Fokko opened this issue Dec 21, 2022 · 13 comments · Fixed by #37298

Comments

@Fokko
Contributor

Fokko commented Dec 21, 2022

Describe the enhancement requested

In Apache Iceberg we have support for the UUID type. I think it would be nice to also support this in (Py)Arrow natively instead of having an extension.

Component(s)

Python

@AlenkaF AlenkaF changed the title Native support for UUID [Python] Native support for UUID Dec 21, 2022
@assignUser
Member

This was brought up years ago: https://lists.apache.org/thread/k2zvgoq62dyqmw3mj2t6ozfzhzkjkc4j but seems to have been dropped in favor of the extension.

Adding types requires an ML discussion and vote so it would probably be best to (re)start the discussion there.

@rok
Member

rok commented Dec 21, 2022

Hey @Fokko how about making it a canonical extension? [1][2]

[1] https://arrow.apache.org/docs/format/CanonicalExtensions.html
[2] https://lists.apache.org/thread/sxd5fhc42hb6svs79t3fd79gkqj83pfh

@wjones127
Member

This would be good to make a canonical extension type, given there are many integration points for it:

  • It is a logical type in Parquet
  • It is a logical type in many databases, which may return it over Flight
  • Arrow-native and Arrow-compatible engines (such as DuckDB) have UUID as a datatype and could exchange it over the C data interface.

This seems like a good candidate to prototype spreading a canonical extension type through the ecosystem, given how simple and ubiquitous it is.

I may look into this in about a month, if someone else doesn't beat me to it.

@jorisvandenbossche
Member

jorisvandenbossche commented Jun 13, 2023

I agree UUID sounds like a good fit to add as a canonical extension type.

@arogozhnikov

@wjones127 would be awesome if you have time to implement this, because the only proposed solution at SO is to use duckdb (for conversion of tables with UUID to parquet)

@mapleFU
Member

mapleFU commented Aug 22, 2023

I found a UUID in src/arrow/testing/extension_type.h

class ARROW_TESTING_EXPORT UuidArray : public ExtensionArray {
 public:
  using ExtensionArray::ExtensionArray;
};

class ARROW_TESTING_EXPORT UuidType : public ExtensionType {
 public:
  UuidType() : ExtensionType(fixed_size_binary(16)) {}

  std::string extension_name() const override { return "uuid"; }

  bool ExtensionEquals(const ExtensionType& other) const override;

  std::shared_ptr<Array> MakeArray(std::shared_ptr<ArrayData> data) const override;

  Result<std::shared_ptr<DataType>> Deserialize(
      std::shared_ptr<DataType> storage_type,
      const std::string& serialized) const override;

  std::string Serialize() const override { return "uuid-serialized"; }
};

Maybe we just need to draft and make it available?

@rok
Member

rok commented Aug 22, 2023

I've opened #37298 using logic from src/arrow/testing/extension_type.h and will also add a Python wrapper.
As per the canonical extension process, I should also start an ML discussion and vote.

@pitrou
Member

pitrou commented Aug 22, 2023

@Fokko I'm curious, can you explain what "natively" supporting a UUID means? A UUID is just a bunch of 16 opaque bytes with no actionable contents...

(in general, people seem to think that UUID gives them something better than pure random strings, for whatever cargo-culted reason)

@Fokko
Contributor Author

Fokko commented Aug 22, 2023

@pitrou That's true, for PyIceberg it is about maintaining the type information. We have converters to go from and to Iceberg ⇔ PyArrow schemas. Going from an Iceberg UUID to a fixed[16] in Arrow works, but we cannot go back.

(in general, people seem to think that UUID gives them something better than pure random strings, for whatever cargo-culted reason)

Well, you want to avoid utf8 if not needed. In Iceberg the UUID is also often used to apply bucket partitioning, which works well on fixed 16 bytes.
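
To make the size argument concrete, a small standard-library sketch comparing the textual and binary forms of one UUID (the value is the sample string that appears later in this thread):

```python
import uuid

u = uuid.UUID("871b64ef-9b7b-4fd0-9858-9c44631a1e0e")

as_text = str(u).encode("utf-8")  # hyphenated hex representation
as_binary = u.bytes               # raw 128-bit value

print(len(as_text), len(as_binary))  # 36 16

# The binary form is lossless: the UUID can be rebuilt from the 16 bytes.
roundtrip = uuid.UUID(bytes=as_binary)
```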

@pitrou
Member

pitrou commented Aug 22, 2023

We have converters to go from and to Iceberg ⇔ PyArrow schemas. Going from an Iceberg UUID to a fixed[16] in Arrow works, but we cannot go back.

Unless you expect Iceberg and Arrow to support the exact same types, you should probably have a mechanism to store and restore Iceberg metadata to/from Arrow data.

@arogozhnikov

arogozhnikov commented Aug 22, 2023

can you explain what "natively" supporting a UUID means?

For me, that means supporting Python's built-in uuid.UUID type when reading and writing pandas DataFrames. As already mentioned, you can convert to bytes, but during loading the values won't be converted back.

@shenker

shenker commented Dec 11, 2023

Looking forward to this. Would be great if there were a way (through casting from string or a compute function pyarrow.compute.parse_uuid) to convert string arrays (containing ASCII strings like 871b64ef-9b7b-4fd0-9858-9c44631a1e0e) to UUID arrays and vice-versa (without going through python). Not sure if that's planned for #37298.

@rok
Member

rok commented Dec 11, 2023

Looking forward to this. Would be great if there were a way (through casting from string or a compute function pyarrow.compute.parse_uuid) to convert string arrays (containing ASCII strings like 871b64ef-9b7b-4fd0-9858-9c44631a1e0e) to UUID arrays and vice-versa (without going through python). Not sure if that's planned for #37298.

That sounds quite useful @shenker. You mean something like this?

import pyarrow as pa
pa.array(["0"*16], pa.list_(pa.string(), 16)).cast(pa.list_(pa.binary(), 16))

or rather

pa.array(["0"*16], pa.list_(pa.string(), 16)).cast(uuid())

I'm not sure how much work setting up extension types would be with cast kernels. Either way this idea would be better off in a new issue :).

@raulcd raulcd added this to the 18.0.0 milestone Jun 28, 2024
@pitrou pitrou closed this as completed in 2328b6e Aug 26, 2024
mapleFU pushed a commit to mapleFU/arrow that referenced this issue Sep 3, 2024
### Rationale for this change

See apache#15058.
The UUID datatype is common throughout the ecosystem, and Arrow supporting it as a native type would reduce friction.

### What changes are included in this PR?

This PR implements logic for Arrow canonical extension type in C++ and a Python wrapper.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes, new extension type is added.
* Closes: apache#15058

Authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
khwilson pushed a commit to khwilson/arrow that referenced this issue Sep 14, 2024