-
Notifications
You must be signed in to change notification settings - Fork 866
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
arrow-array::builder: support Int8, Int16 and Int64 keys #6845
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @ajwerner -- this looks like an improvement to me.
It would be nice to support the other key types, but we can do that as a follow on PR as well
Box::new(dict_builder) | ||
} | ||
t => panic!("Unsupported dictionary value type {t:?} is not currently supported"), | ||
DataType::Dictionary(key_type, value_type) if matches!(**key_type, DataType::Int8 | DataType::Int32 ) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible for completeness to also add support for DataType::Int16 and DataType::Int64?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heh, I had originally typed it with all of them and backtracked when I read the thread in the initial PR that introduced this code that was concerned with code bloat. Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review!
Box::new(dict_builder) | ||
} | ||
t => panic!("Unsupported dictionary value type {t:?} is not currently supported"), | ||
DataType::Dictionary(key_type, value_type) if matches!(**key_type, DataType::Int8 | DataType::Int32 ) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heh, I had originally typed it with all of them and backtracked when I read the thread in the initial PR that introduced this code that was concerned with code bloat. Done.
85f8e82
to
901e276
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
901e276
to
91f5725
Compare
The spec says that the keys in dictionaries are [0]: > (4) The index type of a Dictionary type can only be an integer type, preferably signed, with width 8 to 64 bits. In my use case I have a very small number of values so wasting bits on a wider key is wasteful. [0]: https://github.com/apache/arrow/blob/fe32a7dfe5e22e7737198476fe1ac0e8a5dccef2/docs/source/format/Columnar.rst?plain=1#L182-L183
91f5725
to
4680186
Compare
Fixed the fmt/clippy issues. |
Thanks again @ajwerner |
The spec says that the keys in dictionaries are 0:
In my use case I have a very small number of values so wasting bits on a wider key is wasteful.
Which issue does this PR close?
Closes #6844
Rationale for this change
I have a situation where I have lots of vectors of strings with mostly one string element and at most one per union member -- that means the dictionary keys fit in a single Int8. The builder infrastructure is rather convenient. I can get a solid single-digit reduction in the total size of my batches by using smaller keys.