Comparisons for values of logical types are not handled correctly throughout the library #10338
Comments
CC: @wypb
CC: @pedroerp
We will probably need a virtual function on the logical type to do the comparison. The hard part is avoiding that virtual call for common logical types so that we do not introduce a performance regression.
Good catch. I suppose we need to provide a pluggable API for users to specify equality and comparison functions for custom logical types, similar to how this is done in C++ (operator==, ...). Is there anything else that should be exposed? I guess at least equality and some form of comparison for sorting? How does Presto Java do it? Or do they just have all types hard-coded throughout the codebase?
Presto defines a set of operators (add, subtract, etc.) and each type is expected to provide an implementation for a subset of these that are supported. See |
I see. That would probably mean each row comparison incurs a virtual function call? It would be nice if we could come up with a batch/vector-oriented API to amortize the cost.
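To make the batch idea concrete, here is a minimal sketch in which one virtual call compares a whole batch of rows instead of one row. Everything in it is hypothetical and is not Velox API: the `LogicalComparator` interface, `compareBatch`, and the use of plain `std::vector` in place of Velox vectors. It also assumes the Presto-style packing of TIMESTAMP WITH TIME ZONE (milliseconds in the upper 52 bits, time zone ID in the lower 12 bits).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

namespace batch_sketch {

// Assumption: the lower 12 bits hold the time zone ID, the rest holds UTC millis.
constexpr int kMillisShift = 12;

// Hypothetical interface: one virtual call per batch rather than per row.
class LogicalComparator {
 public:
  virtual ~LogicalComparator() = default;

  // Compares left[i] with right[i] for every i and writes a value
  // < 0, == 0 or > 0 into result[i].
  virtual void compareBatch(
      const std::vector<int64_t>& left,
      const std::vector<int64_t>& right,
      std::vector<int32_t>& result) const = 0;
};

class TimestampWithTimeZoneComparator final : public LogicalComparator {
 public:
  void compareBatch(
      const std::vector<int64_t>& left,
      const std::vector<int64_t>& right,
      std::vector<int32_t>& result) const override {
    assert(left.size() == right.size());
    result.resize(left.size());
    // The loop body is non-virtual, so the dispatch cost is paid once per batch.
    for (std::size_t i = 0; i < left.size(); ++i) {
      const int64_t l = left[i] >> kMillisShift; // drop the time zone bits
      const int64_t r = right[i] >> kMillisShift;
      result[i] = l < r ? -1 : (l > r ? 1 : 0);
    }
  }
};

} // namespace batch_sketch
```

A caller would hold a `const LogicalComparator&` obtained from the type and invoke `compareBatch` once per input batch; whether this beats a per-row call enough to matter would need to be measured.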
I see annotations in Java code so probably some codegen magic is happening. The equivalent in Velox would be template magic.
Specifying Comparison of Extended Types

Extended types, like timestamp with timezone, must have special comparison and hashing for hash tables and special comparison in expressions. This can be implemented by adding virtual functions to Type. These are not defined if type->isExtendedType() is false and are defined otherwise. The signatures are:

int32_t compare(const BaseVector& left, vector_size_t leftIndex, const BaseVector& right, vector_size_t rightIndex) const;
int32_t compare(const DecodedVector& left, vector_size_t index, void* right) const;

The first compares single elements of vectors. The second compares a DecodedVector to a slot in a RowContainer. The return value is < 0 for lt, 0 for equals and > 0 for gt.

uint64_t hash(const BaseVector& vector, vector_size_t index) const;

The call sites are:

BaseVector: BaseVector::equalValueAt and compare need to call the Type virtual function when the vector is of an extended type. The type's extendedness should be cached in BaseVector, similarly to the kind, so that the type does not have to be accessed.

HashTable: HashTable in kHash mode switches on the TypeKind. While there is no TypeKind for extended types, this switch can switch on an extended TypeKind enum that has a value for extended types that goes to Type::compare. This enum (int) is internal to HashTable. The same logic occurs in spilling, which compares vectors with BaseVector::compare.

OrderBy: This will probably work just by BaseVector supporting the types.

Functions: The vector functions for comparison need a case for extended types. Type could have a vectorized comparison, e.g. compareMultiple(const DecodedVector& left, const DecodedVector& right, const SelectivityVector& rows, int32_t* result). This is only needed if performance is an issue.
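The proposal above can be sketched roughly as follows. This is a simplified, self-contained illustration and not the Velox classes: the `sketch::Type` base, `BigintVector` (standing in for BaseVector), `hashValueAt`, and the 12-bit time zone packing are all assumptions made for the example. It shows the virtual `compare` and `hash` on the type, plus the cached "extended" flag that lets ordinary types skip the virtual call.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <memory>
#include <stdexcept>
#include <utility>
#include <vector>

namespace sketch {

using vector_size_t = int32_t;
struct BigintVector; // hypothetical stand-in for BaseVector

class Type {
 public:
  virtual ~Type() = default;
  virtual bool isExtendedType() const { return false; }

  // Only defined for extended types; returns < 0, 0 or > 0.
  virtual int32_t compare(const BigintVector&, vector_size_t,
                          const BigintVector&, vector_size_t) const {
    throw std::logic_error("compare is only defined for extended types");
  }
  virtual uint64_t hash(const BigintVector&, vector_size_t) const {
    throw std::logic_error("hash is only defined for extended types");
  }
};

struct BigintVector {
  BigintVector(std::shared_ptr<const Type> t, std::vector<int64_t> v)
      : type(std::move(t)), values(std::move(v)),
        extended(type->isExtendedType()) {}

  int64_t at(vector_size_t i) const {
    assert(i >= 0 && static_cast<std::size_t>(i) < values.size());
    return values[static_cast<std::size_t>(i)];
  }

  int32_t compare(vector_size_t i, const BigintVector& other,
                  vector_size_t j) const {
    if (extended) { // slow path: one virtual call into the type
      return type->compare(*this, i, other, j);
    }
    const int64_t l = at(i), r = other.at(j); // fast path: raw physical values
    return l < r ? -1 : (l > r ? 1 : 0);
  }

  uint64_t hashValueAt(vector_size_t i) const {
    return extended ? type->hash(*this, i) : std::hash<int64_t>{}(at(i));
  }

  std::shared_ptr<const Type> type;
  std::vector<int64_t> values;
  bool extended; // cached so the type is not consulted on every row
};

class BigintType : public Type {};

// Assumption: UTC millis in the upper 52 bits, time zone ID in the lower 12.
class TimestampWithTimeZoneType : public Type {
 public:
  bool isExtendedType() const override { return true; }

  int32_t compare(const BigintVector& left, vector_size_t i,
                  const BigintVector& right, vector_size_t j) const override {
    const int64_t l = left.at(i) >> 12, r = right.at(j) >> 12;
    return l < r ? -1 : (l > r ? 1 : 0);
  }
  uint64_t hash(const BigintVector& v, vector_size_t i) const override {
    return std::hash<int64_t>{}(v.at(i) >> 12); // hash the instant only
  }
};

} // namespace sketch
```

The point of the cached flag is that plain BIGINT vectors never reach the virtual functions, which addresses the performance concern raised earlier in the thread; the DecodedVector and RowContainer overloads are left out of this sketch.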
Bug description
The TIMESTAMP WITH TIME ZONE logical type is backed by the BIGINT physical type. The timestamp values are stored in memory as 64-bit integers using an encoding that doesn't allow for direct comparisons of these integers.
https://facebookincubator.github.io/velox/develop/timestamp.html
However, many places in the core engine are applying equality and comparisons to the physical value without considering its logical semantics.
One example is aggregation with grouping keys of type TIMESTAMP WITH TIME ZONE, which returns incorrect results. '2024-04-10 10:00 America/New_York' and '2024-04-10 07:00 America/Los_Angeles' represent the same timestamp, but appear as different groups in aggregation results:
In Presto,
All operators that perform comparisons are affected by this issue, e.g. Aggregation, Join, OrderBy.
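The following sketch illustrates why comparing the physical values misgroups equal timestamps. It assumes the Presto-style packing (UTC milliseconds in the upper 52 bits, time zone ID in the lower 12 bits); the helper names and the time zone IDs used in the usage note are made up for the example and are not Velox functions or real zone IDs.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout: [ 52 bits: millis since epoch, UTC | 12 bits: time zone ID ].
constexpr int kMillisShift = 12;
constexpr int64_t kTimeZoneMask = 0xFFF;

// Hypothetical helper: packs an instant and a time zone ID into one BIGINT.
inline int64_t pack(int64_t millisUtc, int16_t timeZoneId) {
  assert(timeZoneId >= 0 && timeZoneId <= kTimeZoneMask);
  // Shift through uint64_t to avoid shifting a negative signed value.
  const uint64_t shifted = static_cast<uint64_t>(millisUtc) << kMillisShift;
  return static_cast<int64_t>(shifted | static_cast<uint64_t>(timeZoneId));
}

// Recovers the UTC millis; relies on arithmetic right shift for negatives.
inline int64_t unpackMillisUtc(int64_t packed) {
  return packed >> kMillisShift;
}

// What the engine effectively does today: compares the raw 64-bit integers.
inline bool rawEquals(int64_t a, int64_t b) {
  return a == b;
}

// What the logical type requires: compare the instants, ignore the zone.
inline bool logicalEquals(int64_t a, int64_t b) {
  return unpackMillisUtc(a) == unpackMillisUtc(b);
}

inline int32_t logicalCompare(int64_t a, int64_t b) {
  const int64_t l = unpackMillisUtc(a);
  const int64_t r = unpackMillisUtc(b);
  return l < r ? -1 : (l > r ? 1 : 0);
}
```

Both example values from the description correspond to 2024-04-10 14:00:00 UTC, i.e. 1712757600000 ms since the epoch. Packing that instant with two different placeholder zone IDs, say `pack(1712757600000, 1)` and `pack(1712757600000, 2)`, gives two integers for which `rawEquals` is false while `logicalEquals` is true, which is the split-group behavior reported here.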
CC: @kgpai @Yuhta @bikramSingh91 @kagamiori @amitkdutta
System information
n/a
Relevant logs
No response