Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Comparisons for values of logical types are not handled correctly throughout the library #10338

Open
mbasmanova opened this issue Jun 27, 2024 · 8 comments · Fixed by prestodb/presto#23200
Labels
bug Something isn't working triage Newly created issue that needs attention.

Comments

@mbasmanova
Copy link
Contributor

mbasmanova commented Jun 27, 2024

Bug description

TIMESTAMP WITH TIME ZONE logical type is backed by BIGINT physical type. The timestamp values are stored in memory as 64-bit integers using an encoding that doesn't allow for direct comparisons of these integers.

https://facebookincubator.github.io/velox/develop/timestamp.html

TimestampWithTimezone physically packs two integers in a single 64 word, using 12 bits for timezone ID, and 52 bits for a millisecond-precision timestamp.

However, many places in the core engine are applying equality and comparisons to the physical value without considering its logical semantics.

One example, is aggregation with grouping keys of type TIMESTAMP WITH TIME ZONE returns incorrect result. '2024-04-10 10:00 America/New_York' and '2024-04-10 07:00 America/Los_Angeles' represent the same timestamp, but appear as different groups in aggregation results:

  auto data = makeRowVector({
      makeFlatVector<std::string>({
          "2024-04-10 10:00 America/New_York",
          "2024-04-10 07:00 America/Los_Angeles",
      }),
  });

  auto plan = PlanBuilder()
                  .values({data})
                  .project({"cast(c0 as timestamp with time zone)"})
                  .singleAggregation({"p0"}, {"count(1)"})
                  .project({"cast(p0 as varchar)", "a0"})
                  .planNode();

  auto results = AssertQueryBuilder(plan).copyResults(pool());
  LOG(ERROR) << results->toString();
  LOG(ERROR) << results->toString(0, 10);


[ROW ROW<p0:VARCHAR,a0:BIGINT>: 2 elements, no nulls]

0: {2024-04-10 10:00:00.000 America/New_York, 1}
1: {2024-04-10 07:00:00.000 America/Los_Angeles, 1}

In Presto,

presto:di> select cast(x as timestamp with time zone), count(1) from unnest(array['2024-04-10 07:00 America/Los_Angeles', '2024-04-10 10:00 America/New_York']) as t(x) group by 1;
                    _col0                    | _col1
---------------------------------------------+-------
 2024-04-10 07:00:00.000 America/Los_Angeles |     2
(1 row)

presto:di> select x, count(1) from unnest(array['2024-04-10 07:00 America/Los_Angeles', '2024-04-10 10:00 America/New_York']) as t(x) group by 1;
                  x                   | _col1
--------------------------------------+-------
 2024-04-10 07:00 America/Los_Angeles |     1
 2024-04-10 10:00 America/New_York    |     1
(2 rows)

All operators that perform comparisons are affected by this issue, e.g. Aggregation, Join, OrderBy.

CC: @kgpai @Yuhta @bikramSingh91 @kagamiori @amitkdutta

System information

n/a

Relevant logs

No response

@mbasmanova mbasmanova added bug Something isn't working triage Newly created issue that needs attention. labels Jun 27, 2024
@mbasmanova
Copy link
Contributor Author

CC: @wypb

@mbasmanova
Copy link
Contributor Author

CC: @pedroerp

@Yuhta
Copy link
Contributor

Yuhta commented Jun 27, 2024

We will probably need a virtual function on logical type to do the comparison. The hard part is how do we avoid calling that virtual function for common logical types to avoid performance regression.

@pedroerp
Copy link
Contributor

Good catch. I suppose we need to provide a plugable API for user to specify equality and comparison functions for custom logical types, sort of like how this is done in C++ (operator==, ...).

Is there anything else that should be expose? I guess at least equality and some form of comparison for sorting?

How does Presto Java does it? Or they just have all types hard coded throughout the codebase?

@mbasmanova
Copy link
Contributor Author

How does Presto Java does it?

Presto defines a set of operators (add, subtract, etc.) and each type is expected to provide an implementation for a subset of these that are supported.

See

@pedroerp
Copy link
Contributor

I see. That would probably mean each row comparison would incur in a virtual function call? Would be nice if we could come up with a batch/vector oriented API to amortize the cost.

@Yuhta
Copy link
Contributor

Yuhta commented Jun 28, 2024

I see annotations in Java code so probably some codegen magic is happening. The equivalent in Velox would be template magic.

@oerling
Copy link
Contributor

oerling commented Jul 11, 2024

Specifying Comparison of Extended Types

Extended types, like timestampp with timezone must have special comparison and hashing for hash tables and special comparison in expressions.

This can be implemented by adding virtual functions to Type. These are not defined if type->isExtendedType() is false and are defined otherwise.

The signatures are:

int32_t compare(const BaseVector& left, vector_size_t leftIndex, const BaseVector& right, vector_size_t rightIndex) const;

int32_t compare(const DecodedVector& left, vector_size_t index, void* right) const;

The first compares single elements of vectors. The second compares a DecodedVector to a slot in a RowContainer.

The return value is < 0 for lt, 0 for equals and > 0 for gt.

uint64_t hash(const BaseVector& vector, vector_size_t index) const;

The call sites are

  • VectorHasher: An extended type forces use of kHash. So only the hash, not the value id methods know about extended types.

  • Vectors

BBaseVector::equalValueAt and compare need to call the Type virtual function in the case of the vector being of an extended tyope. The type's extendedness should be cached in BaseVector to similarly to the kind, so that the type does not have to be accessed.

  • HashTable and RowContainer:

HashTable in kHash mode switches on the TypeKind. While there is no TypeKind for extended type, this switch can switch on an extended TypeKind enum that has a value for extended type that goes to Type::compare. This enum (int) is internal to HashTable.

The same logic occurs in spilling, which compares vectors with BaseVector::compare.

OrderBy

This will probably work just by BaseVector supporting the types.

Functions

The vector functions for comparison need a case for extended types. Type could have a vectorized comparison, e.g. compareMultiple(const DecodedVector& left, const DecodedVector& right, const SelectivityVector& rrows, int32_t* result). This is only needed if performance is an issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working triage Newly created issue that needs attention.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants