[C++][Python] (De)serialize Arrow UUID to/from Parquet UUID #43807
Comments
Should we wait until [1] is merged? [1] #37298 |
@mapleFU Yes, or else use that as the base branch |
Yes, that PR adds the capability for Arrow extension types to specify their target Parquet logical type, and implements it for UUID and JSON (see the relevant test case). The type mapping in the other direction (Parquet -> Arrow) was not added for UUID; right now it uses the storage type. |
Thanks! It would be great to have it implemented in both directions. |
If you need a Parquet file that contains a UUID column (for testing purposes), DuckDB can write one like this:
import duckdb
con = duckdb.connect()
con.execute("CREATE TABLE t1 AS SELECT gen_random_uuid() a FROM range(0, 16);")
con.execute("copy t1 to 'uuid_test.parquet'")
Thanks @pdet for this example code. |
I don't have time to work on this in the next two weeks, but I'm glad to help review it. |
For reference, the C++ JSON extension type proposal already includes Parquet serialization. |
#37298 is merged now |
Thanks @rok, let's quickly get https://github.com/apache/arrow/pull/13901/files in. I'm focused on supporting List in Join for these two weeks, but I'll do a careful review round on this PR. |
Is this done with Arrow 18 and its UUID support? |
The UUID extension type is supported in Arrow 18. I don't think it will be serialized to the UUID logical type in Parquet the way JSON is (to the JSON logical type), but I'd expect it to round-trip to Parquet in some cases. Which case are you looking to cover, @raphaelauv? |
Describe the enhancement requested
As a follow up to issue #15058 / PR #37298:
Parquet's UUID logical type is directly equivalent to Arrow's UUID canonical extension type.
After we have native support for UUID in Arrow C++ and PyArrow, it would be lovely if:
This would improve interoperability with other components that read/write Parquet files and support Parquet's UUID type.
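To illustrate why the two types line up so directly: Parquet's UUID logical type annotates a 16-byte fixed-length byte array, and Arrow's UUID extension type uses `fixed_size_binary(16)` as storage, so both hold the same 16 big-endian bytes per value. The sketch below uses only the Python standard library; the helper names `uuid_to_storage` and `storage_to_uuid` are hypothetical and not part of any Arrow or Parquet API.

```python
import uuid

UUID_BYTE_WIDTH = 16  # width shared by the Parquet and Arrow representations


def uuid_to_storage(value: uuid.UUID) -> bytes:
    """Return the 16 big-endian bytes that would be stored for this UUID."""
    return value.bytes


def storage_to_uuid(raw: bytes) -> uuid.UUID:
    """Rebuild a UUID from a 16-byte stored value."""
    if len(raw) != UUID_BYTE_WIDTH:
        raise ValueError(f"expected {UUID_BYTE_WIDTH} bytes, got {len(raw)}")
    return uuid.UUID(bytes=raw)


example = uuid.UUID("f81d4fae-7dec-11d0-a765-00a0c91e6bf6")
raw = uuid_to_storage(example)
print(raw.hex())             # f81d4fae7dec11d0a76500a0c91e6bf6
print(storage_to_uuid(raw))  # f81d4fae-7dec-11d0-a765-00a0c91e6bf6
```

Because the stored bytes are identical on both sides, the mapping requested here is a matter of type annotation rather than data conversion.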
Component(s)
C++, Python