[C++] Add variant type #45375

Draft: wants to merge 3 commits into main

neilechao

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

@neilechao changed the title from "Add variant type" to "[C++] Add variant type" on Jan 29, 2025
wgtmac (Member) commented Feb 6, 2025

Thanks @neilechao for working on this! I saw your reply on the dev@parquet ML. Let me know if you have any questions.

neilechao (Author) commented

Thanks @wgtmac! My main question is this: is it possible to add variant type support to Parquet without adding a conversion to and from Arrow? The Variant encoding and shredding specs are in parquet-format, but I don't think the community has spent much time thinking about the on-wire format of variant.

wgtmac (Member) commented Feb 11, 2025

There was a discussion on it: #42069. IMHO, we can get started with the variant binary format of the Parquet spec. cc @mapleFU @pitrou @emkornfield @wjones127 @westonpace

emkornfield (Contributor) commented

> There was a discussion on it: #42069. IMHO, we can get started with the variant binary format of the Parquet spec. cc @mapleFU @pitrou @emkornfield @wjones127 @westonpace

My thoughts here, I think, mirror @wgtmac's: let's first be able to read/write the encoded version (including having APIs for decoding from binary). Then we can add low-level Parquet writes for shredding/deshredding, and for now Arrow can look like struct<metadata, value>, perhaps with a logical type. Finally, if there is bandwidth, we can discuss standardizing what shredded Arrow would look like. Open to other suggestions.
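To make "APIs for decoding from binary" concrete, here is a minimal sketch of decoding the metadata half of a Variant. It assumes the layout as I read it in the parquet-format Variant encoding spec: a one-byte header (version in the low 4 bits, a sorted_strings flag in bit 4, offset_size_minus_one in the top 2 bits), then the dictionary size, then dictionary_size + 1 offsets, then the string bytes, all little-endian. The names `VariantMetadataView`, `ReadLittleEndian`, and `DecodeVariantMetadata` are hypothetical and do not come from this PR or from Arrow/Parquet C++.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical result type for illustration; not an Arrow/Parquet C++ class.
struct VariantMetadataView {
  uint8_t version = 0;
  bool sorted_strings = false;
  std::vector<std::string> dictionary;
};

// Reads an unsigned little-endian integer of `width` bytes (1 to 4) at `pos`.
inline uint32_t ReadLittleEndian(const std::vector<uint8_t>& buf, size_t pos,
                                 size_t width) {
  if (pos > buf.size() || width > buf.size() - pos) {
    throw std::runtime_error("variant metadata truncated");
  }
  uint32_t v = 0;
  for (size_t i = 0; i < width; ++i) {
    v |= static_cast<uint32_t>(buf[pos + i]) << (8 * i);
  }
  return v;
}

// Decodes the metadata buffer under the layout assumed in the lead-in.
inline VariantMetadataView DecodeVariantMetadata(const std::vector<uint8_t>& buf) {
  if (buf.empty()) throw std::runtime_error("variant metadata is empty");
  const uint8_t header = buf[0];

  VariantMetadataView out;
  out.version = header & 0x0F;
  out.sorted_strings = ((header >> 4) & 0x01) != 0;
  const size_t offset_size = ((header >> 6) & 0x03) + 1;

  size_t pos = 1;
  const uint32_t dict_size = ReadLittleEndian(buf, pos, offset_size);
  pos += offset_size;

  // There are dict_size + 1 offsets; the string bytes follow them.
  const size_t n_offsets = static_cast<size_t>(dict_size) + 1;
  if (n_offsets > (buf.size() - pos) / offset_size) {
    throw std::runtime_error("variant metadata offsets truncated");
  }
  const size_t offsets_start = pos;
  const size_t bytes_start = offsets_start + n_offsets * offset_size;
  const size_t bytes_len = buf.size() - bytes_start;

  for (uint32_t i = 0; i < dict_size; ++i) {
    const uint32_t begin =
        ReadLittleEndian(buf, offsets_start + i * offset_size, offset_size);
    const uint32_t end =
        ReadLittleEndian(buf, offsets_start + (i + 1) * offset_size, offset_size);
    if (begin > end || end > bytes_len) {
      throw std::runtime_error("variant metadata offset out of range");
    }
    out.dictionary.emplace_back(
        reinterpret_cast<const char*>(buf.data()) + bytes_start + begin,
        end - begin);
  }
  return out;
}
```

Under these assumptions, the eight bytes `01 02 00 01 03 61 62 63` would decode to version 1 with the dictionary entries "a" and "bc". A real implementation would more likely return views into the buffer than copy strings.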

neilechao (Author) commented

Got it, thanks @wgtmac and @emkornfield!

 private:
  Variant()
      : LogicalType::Impl(LogicalType::Type::VARIANT, SortOrder::UNKNOWN),
        LogicalType::Impl::SimpleApplicable(parquet::Type::BYTE_ARRAY) {}
Author

@wgtmac @emkornfield: I initially had Variant inherit from SimpleApplicable(BYTE_ARRAY), but a Variant should actually be composed of two separate byte arrays, one for metadata (the dict) and one for values. This muddies the applicability of VariantLogicalType to a single Parquet type.

parquet column-size variant_basic.parquet
VARIANT_COL.value-> Size In Bytes: 69 Size In Ratio: 0.52671754
VARIANT_COL.metadata-> Size In Bytes: 62 Size In Ratio: 0.47328246

  1. One possibility is to create separate VariantMetadataLogicalType and VariantValueLogicalType classes, with VariantLogicalType containing both as class members. The pro is that this reflects the storage in Parquet, where metadata and values are stored in separate columns; the con is that it diverges from parquet.thrift and potentially from the other language implementations.
  2. Another option would be to keep VariantMetadata and VariantValue as classes, but not as logical types.

What are your thoughts on these approaches?
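To picture the tension with SimpleApplicable, here is a rough sketch of what an applicability check on a group, rather than on a single physical type, might look like. It is not the Parquet C++ schema API: `PhysicalType`, `SchemaNode`, `MakeLeaf`, `MakeGroup`, and `IsVariantApplicable` are hypothetical stand-ins, and the rule checked (a group with exactly two BYTE_ARRAY children named metadata and value) is my assumption about the unshredded layout only.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for schema nodes; not the parquet::schema classes.
enum class PhysicalType { BOOLEAN, INT32, INT64, BYTE_ARRAY };

struct SchemaNode {
  std::string name;
  bool is_group = false;
  PhysicalType physical_type = PhysicalType::BYTE_ARRAY;  // unused for groups
  std::vector<SchemaNode> children;
};

inline SchemaNode MakeLeaf(const std::string& name, PhysicalType type) {
  SchemaNode node;
  node.name = name;
  node.physical_type = type;
  return node;
}

inline SchemaNode MakeGroup(const std::string& name,
                            std::vector<SchemaNode> children) {
  SchemaNode node;
  node.name = name;
  node.is_group = true;
  node.children = std::move(children);
  return node;
}

// True if `group` has a direct BYTE_ARRAY leaf child called `name`.
inline bool HasBinaryChild(const SchemaNode& group, const std::string& name) {
  for (const SchemaNode& child : group.children) {
    if (child.name == name && !child.is_group &&
        child.physical_type == PhysicalType::BYTE_ARRAY) {
      return true;
    }
  }
  return false;
}

// Assumed rule for the unshredded case: the annotation applies to a group
// holding exactly the two binary fields, never to a lone BYTE_ARRAY column.
inline bool IsVariantApplicable(const SchemaNode& node) {
  return node.is_group && node.children.size() == 2 &&
         HasBinaryChild(node, "metadata") && HasBinaryChild(node, "value");
}
```

With this shape, a group built from `MakeLeaf("metadata", PhysicalType::BYTE_ARRAY)` and `MakeLeaf("value", PhysicalType::BYTE_ARRAY)` passes the check, while a single BYTE_ARRAY leaf does not, which is the case the SimpleApplicable(BYTE_ARRAY) base class cannot express.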
