Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

arrow-array::builder: support Int8, Int16 and Int64 keys #6845

Merged
merged 1 commit into from
Dec 7, 2024

Conversation

ajwerner
Copy link
Contributor

@ajwerner ajwerner commented Dec 6, 2024

The spec says that the keys in dictionaries are 0:

(4) The index type of a Dictionary type can only be an integer type, preferably signed, with width 8 to 64 bits.

In my use case I have a very small number of values so wasting bits on a wider key is wasteful.

Which issue does this PR close?

Closes #6844

Rationale for this change

I have a situation where I have lots of vectors of strings with mostly one string element and at most one per union member -- that means the dictionary keys fit in a single Int8. The builder infrastructure is rather convenient. I can get a solid single-digit reduction in the total size of my batches by using smaller keys.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Dec 6, 2024
@ajwerner ajwerner changed the title arrow-array::builder: support more dictionary keys arrow-array::builder: support Int8 dictionary keys Dec 6, 2024
@ajwerner ajwerner marked this pull request as ready for review December 6, 2024 03:23
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @ajwerner -- this looks like an improvement to me.

It would be nice to support the other key types, but we can do that as a follow on PR as well

Box::new(dict_builder)
}
t => panic!("Unsupported dictionary value type {t:?} is not currently supported"),
DataType::Dictionary(key_type, value_type) if matches!(**key_type, DataType::Int8 | DataType::Int32 ) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible for completeness to also add support for DataType::Int16 and DataType::Int64?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heh, I had originally typed it with all of them and backtracked when I read the thread in the initial PR that introduced this code that was concerned with code bloat. Done.

@ajwerner ajwerner changed the title arrow-array::builder: support Int8 dictionary keys arrow-array::builder: support more dictionary keys Dec 6, 2024
Copy link
Contributor Author

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review!

Box::new(dict_builder)
}
t => panic!("Unsupported dictionary value type {t:?} is not currently supported"),
DataType::Dictionary(key_type, value_type) if matches!(**key_type, DataType::Int8 | DataType::Int32 ) => {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heh, I had originally typed it with all of them and backtracked when I read the thread in the initial PR that introduced this code that was concerned with code bloat. Done.

@ajwerner ajwerner force-pushed the support-all-the-dictionary-keys branch 2 times, most recently from 85f8e82 to 901e276 Compare December 6, 2024 15:58
@alamb alamb changed the title arrow-array::builder: support more dictionary keys arrow-array::builder: support Int8, Int16 and Int64 keys Dec 6, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again @ajwerner

FYI @tustvold / @etseidl

@ajwerner ajwerner force-pushed the support-all-the-dictionary-keys branch from 901e276 to 91f5725 Compare December 6, 2024 16:32
The spec says that the keys in dictionaries are [0]:

> (4) The index type of a Dictionary type can only be an integer type, preferably signed, with width 8 to 64 bits.

In my use case I have a very small number of values so wasting bits on a
wider key is wasteful.

[0]: https://github.com/apache/arrow/blob/fe32a7dfe5e22e7737198476fe1ac0e8a5dccef2/docs/source/format/Columnar.rst?plain=1#L182-L183
@ajwerner ajwerner force-pushed the support-all-the-dictionary-keys branch from 91f5725 to 4680186 Compare December 6, 2024 16:34
@ajwerner
Copy link
Contributor Author

ajwerner commented Dec 6, 2024

Fixed the fmt/clippy issues.

@alamb alamb merged commit a1a53ca into apache:main Dec 7, 2024
26 checks passed
@alamb
Copy link
Contributor

alamb commented Dec 7, 2024

Thanks again @ajwerner

@ajwerner ajwerner deleted the support-all-the-dictionary-keys branch December 17, 2024 18:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add builder support for Int8 keys
2 participants