[Python] Native support for UUID #15058

Fokko · 2022-12-21T11:28:57Z

Describe the enhancement requested

In Apache Iceberg we have support for the UUID type. I think it would be nice to also support this in (Py)Arrow natively instead of having an extension.

Component(s)

Python

assignUser · 2022-12-21T14:01:03Z

This was brought up years ago: https://lists.apache.org/thread/k2zvgoq62dyqmw3mj2t6ozfzhzkjkc4j but seems to have been dropped in favor of the extension.

Adding types requires an ML discussion and vote so it would probably be best to (re)start the discussion there.

rok · 2022-12-21T14:25:40Z

Hey @Fokko how about making it a canonical extension? [1][2]

[1] https://arrow.apache.org/docs/format/CanonicalExtensions.html
[2] https://lists.apache.org/thread/sxd5fhc42hb6svs79t3fd79gkqj83pfh

wjones127 · 2023-06-09T02:25:05Z

This would be good to make a canonical extension type, given there are many integration points for it:

- It is a logical type in Parquet.
- It is a logical type in many databases, which may return it over Flight.
- Arrow-native and Arrow-compatible engines (such as DuckDB) have UUID as a datatype and could exchange it over the C data interface.

This seems like a good candidate to prototype spreading a canonical extension type through the ecosystem, given how simple and ubiquitous it is.

I may look into this in about a month, if someone else doesn't beat me to it.

jorisvandenbossche · 2023-06-13T12:42:51Z

I agree UUID sounds like a good fit for adding as a canonical extension type.

arogozhnikov · 2023-08-22T06:33:15Z

@wjones127 it would be awesome if you have time to implement this, because the only solution proposed on SO is to use DuckDB (to convert tables with UUID columns to Parquet).

mapleFU · 2023-08-22T06:41:26Z

I found a UUID in src/arrow/testing/extension_type.h

class ARROW_TESTING_EXPORT UuidArray : public ExtensionArray {
 public:
  using ExtensionArray::ExtensionArray;
};

class ARROW_TESTING_EXPORT UuidType : public ExtensionType {
 public:
  UuidType() : ExtensionType(fixed_size_binary(16)) {}

  std::string extension_name() const override { return "uuid"; }

  bool ExtensionEquals(const ExtensionType& other) const override;

  std::shared_ptr<Array> MakeArray(std::shared_ptr<ArrayData> data) const override;

  Result<std::shared_ptr<DataType>> Deserialize(
      std::shared_ptr<DataType> storage_type,
      const std::string& serialized) const override;

  std::string Serialize() const override { return "uuid-serialized"; }
};

Maybe we just need to draft and make it available?

rok · 2023-08-22T09:00:25Z

I've opened #37298 using logic from src/arrow/testing/extension_type.h and will also add a Python wrapper.
As per the canonical extension process, I should also start an ML discussion and vote.

pitrou · 2023-08-22T09:38:08Z

@Fokko I'm curious, can you explain what "natively" supporting a UUID means? A UUID is just a bunch of 16 opaque bytes with no actionable contents...

(in general, people seem to think that UUID gives them something better than pure random strings, for whatever cargo-culted reason)

Fokko · 2023-08-22T09:53:38Z

@pitrou That's true, for PyIceberg it is about maintaining the type information. We have converters to go between Iceberg ⇔ PyArrow schemas. Going from an Iceberg UUID to a fixed[16] in Arrow works, but we cannot go back.

> (in general, people seem to think that UUID gives them something better than pure random strings, for whatever cargo-culted reason)

Well, you want to avoid utf8 if it's not needed. In Iceberg the UUID is also often used to apply bucket partitioning, which works well on fixed 16 bytes.
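To make the size argument concrete, here is a minimal sketch using only Python's standard `uuid` module. It compares the canonical text form of a UUID with its 16-byte binary form. The `toy_bucket` helper is hypothetical and only illustrates that fixed-width bytes are easy to bucket; it is not Iceberg's actual bucket transform.

```python
import uuid

# Example value taken from later in this thread.
u = uuid.UUID("871b64ef-9b7b-4fd0-9858-9c44631a1e0e")

# Canonical text form: 32 hex digits plus 4 hyphens, encoded as UTF-8.
text_len = len(str(u).encode("utf-8"))

# Binary form: the same value as 16 raw bytes.
binary_len = len(u.bytes)


def toy_bucket(value: uuid.UUID, num_buckets: int) -> int:
    """Hypothetical bucketing helper (NOT Iceberg's bucket transform).

    Interprets the 16 bytes as a big-endian integer and takes it modulo
    the number of buckets.
    """
    return int.from_bytes(value.bytes, "big") % num_buckets


print(text_len, binary_len, toy_bucket(u, 16))
```

The text form takes 36 bytes against 16 for the binary form, which is the overhead the comment above wants to avoid.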

pitrou · 2023-08-22T09:55:10Z

> We have converters to go from and to Iceberg ⇔ PyArrow schemas. Going from Iceberg an Iceberg UUID to a fixed[16] in Arrow works, but we can not go back.

Unless you expect Iceberg and Arrow to support the exact same types, you should probably have a mechanism to store and restore Iceberg metadata to/from Arrow data.

arogozhnikov · 2023-08-22T09:56:06Z

> can you explain what "natively" supporting an UUID meaning?

For me, that means supporting Python's built-in uuid.UUID type when reading/writing pandas DataFrames. As already mentioned, you can convert to bytes, but on loading the values won't be converted back.
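The manual round trip described above might look like the following sketch, which uses only the standard library. The point is that the stored values are plain 16-byte strings, so the reader has to know out of band that the column held UUIDs and convert it back explicitly.

```python
import uuid

# Values as a user would hold them in a DataFrame column.
original = [uuid.uuid4() for _ in range(3)]

# Writing: convert each UUID to its 16-byte representation.
stored = [u.bytes for u in original]

# Reading: nothing in the bytes says "this is a UUID", so the caller
# must rebuild the uuid.UUID objects manually.
restored = [uuid.UUID(bytes=b) for b in stored]

print(restored == original)
```

A type that carries the "this is a UUID" information alongside the bytes would let a library perform the last step automatically.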

shenker · 2023-12-11T01:49:18Z

Looking forward to this. It would be great if there were a way (through casting from string, or a compute function such as pyarrow.compute.parse_uuid) to convert string arrays (containing ASCII strings like 871b64ef-9b7b-4fd0-9858-9c44631a1e0e) to UUID arrays and vice versa (without going through Python). Not sure if that's planned for #37298.

rok · 2023-12-11T16:15:13Z

> Looking forward to this. Would be great if there were a way (through casting from string or a compute function pyarrow.compute.parse_uuid) to convert string arrays (containing ASCII strings like 871b64ef-9b7b-4fd0-9858-9c44631a1e0e) to UUID arrays and vice-versa (without going through python). Not sure if that's planned for #37298.

That sounds quite useful @shenker. You mean something like this?

import pyarrow as pa
pa.array(["0"*16], pa.list_(pa.string(), 16)).cast(pa.list_(pa.binary(), 16))

or rather

pa.array(["0"*16], pa.list_(pa.string(), 16)).cast(uuid())

I'm not sure how much work it would be to set up cast kernels for extension types. Either way, this idea would be better off in a new issue :).
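For reference, the conversion being requested can be sketched in pure Python with the standard `uuid` module. This is exactly the "going through Python" path the request wants to avoid, so it only illustrates the intended semantics. The helper names `parse_uuid_strings` and `format_uuid_bytes` are hypothetical and are not part of pyarrow.

```python
import uuid
from typing import List, Optional


def parse_uuid_strings(values: List[Optional[str]]) -> List[Optional[bytes]]:
    """Hypothetical helper: hex UUID strings -> 16-byte values, keeping nulls."""
    return [None if v is None else uuid.UUID(v).bytes for v in values]


def format_uuid_bytes(values: List[Optional[bytes]]) -> List[Optional[str]]:
    """Hypothetical helper: 16-byte values -> canonical hyphenated strings."""
    return [None if b is None else str(uuid.UUID(bytes=b)) for b in values]


strings = ["871b64ef-9b7b-4fd0-9858-9c44631a1e0e", None]
as_bytes = parse_uuid_strings(strings)
round_tripped = format_uuid_bytes(as_bytes)

print(round_tripped == strings)
```

A native cast or compute kernel would do the same per-element work without creating a Python object for every value.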

### Rationale for this change

See apache#15058. The UUID datatype is common throughout the ecosystem, and Arrow supporting it as a native type would reduce friction.

### What changes are included in this PR?

This PR implements logic for an Arrow canonical extension type in C++ and a Python wrapper.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes, a new extension type is added.

* Closes: apache#15058

Authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>

Fokko added the Type: enhancement label Dec 21, 2022

AlenkaF changed the title ~~Native support for UUID~~ [Python] Native support for UUID Dec 21, 2022

AlenkaF added the Component: Python label Dec 21, 2022

rok mentioned this issue Aug 22, 2023

GH-15058: [C++][Python] Native support for UUID #37298

Merged

github-actions bot assigned rok Aug 22, 2023

shenker mentioned this issue Dec 11, 2023

[Python][C++] Hex decoding strings/allow casting strings to UUIDs and vice-versa #39183

Open

rok mentioned this issue Apr 18, 2024

[Format][Docs] Add a canonical extension type specification for UUID #41298

Closed

kevinjqliu mentioned this issue Jun 26, 2024

write UUID fail on _check_schema_compatible apache/iceberg-python#855

Closed

raulcd added this to the 18.0.0 milestone Jun 28, 2024

chmp mentioned this issue Aug 2, 2024

Cannot deserialize Uuid chmp/serde_arrow#203

Closed

ianmcook mentioned this issue Aug 23, 2024

[C++][Python] (De)serialize Arrow UUID to/from Parquet UUID #43807

Open

pitrou closed this as completed in #37298 Aug 26, 2024

pitrou closed this as completed in 2328b6e Aug 26, 2024
