[C++][Python] (De)serialize Arrow UUID to/from Parquet UUID #43807

Open
ianmcook opened this issue Aug 23, 2024 · 12 comments

@ianmcook
Member

Describe the enhancement requested

As a follow up to issue #15058 / PR #37298:

Parquet's UUID logical type is directly equivalent to Arrow's UUID canonical extension type.

After we have native support for UUID in Arrow C++ and PyArrow, it would be lovely if:

  • Parquet UUID fields would deserialize to Arrow UUID fields
  • Arrow UUID fields would serialize to Parquet UUID fields

This would improve interoperability with other components that read/write Parquet files and support Parquet's UUID type.
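
To make the equivalence concrete: Parquet's UUID logical type annotates a 16-byte fixed-length binary in big-endian order, and Arrow's `arrow.uuid` extension type uses FixedSizeBinary(16) storage, so the stored bytes are the same. A minimal sketch using only Python's standard `uuid` module; the helper names are hypothetical and not part of Arrow or Parquet:

```python
import uuid

UUID_BYTE_WIDTH = 16  # both formats store a fixed 16-byte value


def uuid_to_storage(value: uuid.UUID) -> bytes:
    """Return the 16 big-endian bytes shared by both representations."""
    return value.bytes


def storage_to_uuid(raw: bytes) -> uuid.UUID:
    """Rebuild a UUID from its 16-byte storage value."""
    if len(raw) != UUID_BYTE_WIDTH:
        raise ValueError(f"expected {UUID_BYTE_WIDTH} bytes, got {len(raw)}")
    return uuid.UUID(bytes=raw)


example = uuid.UUID("00112233-4455-6677-8899-aabbccddeeff")
raw = uuid_to_storage(example)
print(raw.hex())                        # 00112233445566778899aabbccddeeff
print(storage_to_uuid(raw) == example)  # True
```

Because the bytes are identical, the mapping requested here only concerns type annotations, not the data itself.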

Component(s)

C++, Python

@ianmcook
Member Author

@joellubi are you doing the equivalent of this for the Go implementation in #43679?

@mapleFU
Member

mapleFU commented Aug 25, 2024

Should we wait until [1] is merged?

[1] #37298

@ianmcook
Member Author

@mapleFU Yes, or else use that as the base branch

@joellubi
Member

> @joellubi are you doing the equivalent of this for the Go implementation in #43679?

Yes, that PR adds the capability for Arrow extension types to specify their target Parquet logical type, and implements it for UUID and JSON (see relevant test case).

The type mapping going the other way (Parquet -> Arrow) was not added for UUID; right now it uses the storage type FixedSizeBinary. I can get this added too.
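
The two directions discussed here can be pictured as a pair of lookups. This is a hypothetical Python sketch of the idea only, not the Go implementation in #43679 or any Arrow API; the table and function names are invented for illustration, and it shows the target state with both directions in place:

```python
from typing import Optional

# Hypothetical table: Arrow extension name -> Parquet logical type name.
EXTENSION_TO_PARQUET = {
    "arrow.uuid": "UUID",
    "arrow.json": "JSON",
}

# Reverse table for the Parquet -> Arrow direction.
PARQUET_TO_EXTENSION = {v: k for k, v in EXTENSION_TO_PARQUET.items()}


def parquet_logical_type_for(extension_name: str) -> Optional[str]:
    """Arrow -> Parquet: None means the storage type is written unannotated."""
    return EXTENSION_TO_PARQUET.get(extension_name)


def arrow_type_for(logical_type: Optional[str], storage_type: str) -> str:
    """Parquet -> Arrow: fall back to the storage type when nothing is mapped."""
    return PARQUET_TO_EXTENSION.get(logical_type, storage_type)


print(parquet_logical_type_for("arrow.uuid"))           # UUID
print(arrow_type_for("UUID", "fixed_size_binary[16]"))  # arrow.uuid
print(arrow_type_for(None, "fixed_size_binary[16]"))    # fixed_size_binary[16]
```

Without the reverse entry, a Parquet UUID column falls through to the last case and comes back as plain FixedSizeBinary, which is the behavior described above.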

@ianmcook
Member Author

> The type mapping going the other way (Parquet -> Arrow) was not added for UUID; right now it uses the storage type FixedSizeBinary. I can get this added too.

Thanks! It would be great to have it implemented in both directions.

@ianmcook
Member Author

If you need a Parquet file that contains a UUID column (for testing purposes), DuckDB can write one like this:

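import duckdb

con = duckdb.connect()  # in-memory database

# Create a 16-row table with a single column of random UUIDs.
con.execute("CREATE TABLE t1 AS SELECT gen_random_uuid() AS a FROM range(0, 16);")

# Write the table to a Parquet file.
con.execute("COPY t1 TO 'uuid_test.parquet';")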

Thanks @pdet for this example code.

@mapleFU
Member

mapleFU commented Aug 26, 2024

I don't have time to work on this in the next two weeks, but I'm glad to help review it.

@rok
Member

rok commented Aug 26, 2024

For reference, the C++ JSON extension type proposal already includes Parquet serialization.

@ianmcook
Member Author

#37298 is merged now

@mapleFU
Member

mapleFU commented Aug 28, 2024

Thanks @rok, let's get https://github.com/apache/arrow/pull/13901/files merged quickly. I'm focused on supporting List in Join for the next two weeks, but I'll do a careful review round on this PR.

@raphaelauv

Is this done with Arrow 18 and its UUID support?

@rok
Member

rok commented Oct 30, 2024

The UUID extension type is supported in Arrow 18. I don't think it'll get serialized to the UUID logical type in Parquet like JSON does (to the JSON logical type), but I'd expect it to round-trip OK to Parquet in some cases. What case are you looking to cover, @raphaelauv?

@ianmcook ianmcook added this to the 19.0.0 milestone Nov 26, 2024
@amoeba amoeba modified the milestones: 19.0.0, 20.0.0 Jan 2, 2025