[C++] Add variant type #45375
Thanks for opening a pull request! If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format?
or
See also: |
87376cc to 2e94210
Thanks @neilechao for working on this! I saw your reply on the dev@parquet ML. Let me know if you have any questions.
Thanks @wgtmac! My main question is this: is it possible to add variant type support to Parquet without adding a conversion to and from Arrow? The Variant encoding and shredding specs are in parquet-format, but I don't think the community has spent much time thinking about the on-wire format of variant.
There was a discussion on it: #42069. IMHO, we can get started with the variant binary format of the Parquet spec. cc @mapleFU @pitrou @emkornfield @wjones127 @westonpace
My thoughts here, I think, mirror @wgtmac's: let's first be able to read/write the encoded version (including having APIs for decoding from binary). Then we can add low-level Parquet writes for shredding/deshredding, and for now the Arrow side can look like struct<metadata, value>, perhaps with a logical type. Finally, if there is bandwidth, we can discuss standardizing what shredded Arrow would look like. Open to other suggestions.
Got it, thanks @wgtmac and @emkornfield!
 private:
  Variant()
      : LogicalType::Impl(LogicalType::Type::VARIANT, SortOrder::UNKNOWN),
        LogicalType::Impl::SimpleApplicable(parquet::Type::BYTE_ARRAY) {}
@wgtmac @emkornfield - I initially had Variant inherit from SimpleApplicable(BYTE_ARRAY), but Variant should actually be composed of two separate byte arrays - one for metadata (the dict) and one for values. This muddies the applicability of VariantLogicalType to a single parquet type.
parquet column-size variant_basic.parquet
VARIANT_COL.value-> Size In Bytes: 69 Size In Ratio: 0.52671754
VARIANT_COL.metadata-> Size In Bytes: 62 Size In Ratio: 0.47328246
- One possibility is to create separate VariantMetadataLogicalType and VariantValueLogicalType, with VariantLogicalType containing both as class members. The pro is that this reflects the storage in Parquet, where metadata and values are stored in separate columns. The con is that this diverges from parquet.thrift and potentially from the other language implementations.
- Another option would be to have VariantMetadata and VariantValue present, but not as logical types.
What are your thoughts on these approaches?
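To make the second option more concrete, here is a rough, standalone sketch of what checking applicability against a pair of columns (rather than a single primitive type) might look like. Everything in it is hypothetical: `PhysicalType`, `LeafColumn`, `VariantGroup`, and `IsValidUnshreddedVariant` are illustrative names, not existing parquet-cpp classes, and the only facts it relies on are that the metadata and value are each stored as a byte array in their own column.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for parquet::Type; only the members needed here.
enum class PhysicalType { BYTE_ARRAY, INT32 };

// A leaf column inside a group: its name plus its physical type.
struct LeafColumn {
  std::string name;
  PhysicalType physical_type;
};

// Hypothetical: the variant annotation is attached to a group of columns
// instead of a single primitive column.
struct VariantGroup {
  std::vector<LeafColumn> children;
};

// Returns true only when the group holds exactly one "metadata" and one
// "value" child, both stored as BYTE_ARRAY.
bool IsValidUnshreddedVariant(const VariantGroup& group) {
  if (group.children.size() != 2) return false;
  bool has_metadata = false;
  bool has_value = false;
  for (const auto& child : group.children) {
    if (child.physical_type != PhysicalType::BYTE_ARRAY) return false;
    if (child.name == "metadata") {
      has_metadata = true;
    } else if (child.name == "value") {
      has_value = true;
    } else {
      return false;  // unexpected child name
    }
  }
  return has_metadata && has_value;
}

// Builds the two-column layout seen in the column-size output above.
VariantGroup MakeUnshreddedVariant() {
  return VariantGroup{{{"metadata", PhysicalType::BYTE_ARRAY},
                       {"value", PhysicalType::BYTE_ARRAY}}};
}
```

With this shape, VariantMetadata and VariantValue stay ordinary byte-array columns, and the validity check lives on the enclosing group, so nothing new would need to be added to parquet.thrift for the two children themselves.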
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?