Support complex types in sparksql hash and xxhash64 function #9414
Conversation
Force-pushed da3b7d1 to e91764e (Compare)
@mbasmanova Could you please help to review? Thanks!
Thanks! Would you add a reference for Spark's implementation in the PR description?
velox/functions/sparksql/Hash.cpp
Outdated
typename T,
typename SeedType = typename HashTraits<HashClass>::SeedType,
typename ReturnType = typename HashTraits<HashClass>::ReturnType>
ReturnType hashOne(T input, SeedType seed) {
Must T be int32_t for hashInt32?
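For context, the HashTraits pattern under discussion can be sketched roughly like this; the names mirror the snippet above, and the seed/return widths in the specializations are assumptions based on Spark's 32-bit murmur3 and 64-bit xxhash64:

```cpp
#include <cstdint>
#include <type_traits>

struct Murmur3Hash;
struct XxHash64;

// Maps each hash class to its seed and return types.
template <typename HashClass>
struct HashTraits {};

template <>
struct HashTraits<Murmur3Hash> {
  using SeedType = int32_t;   // assumed: Spark's murmur3 uses 32-bit seed/result
  using ReturnType = int32_t;
};

template <>
struct HashTraits<XxHash64> {
  using SeedType = int64_t;   // assumed: xxhash64 widens both to 64 bits
  using ReturnType = int64_t;
};
```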
velox/functions/sparksql/Hash.cpp
Outdated
virtual ReturnType hashNotNull(vector_size_t index, SeedType seed) = 0;

VectorHasher(DecodedVector& decoded) : decoded_(decoded) {}
Maybe better to move the constructor and destructor near the class name for readability.
velox/functions/sparksql/Hash.cpp
Outdated
  return hashNotNull(index, seed);
}

virtual ReturnType hashNotNull(vector_size_t index, SeedType seed) = 0;
Maybe rename to hashNotNullAt to correspond with hashAt.
velox/functions/sparksql/Hash.cpp
Outdated
if (baseType->isPrimitiveType()) {
  return VELOX_DYNAMIC_SCALAR_TEMPLATE_TYPE_DISPATCH(
      createPrimitiveVectorHasher, HashClass, baseType->kind(), decoded);
} else if (baseType->isArray()) {
nit: else is not needed after return.
velox/functions/sparksql/Hash.cpp
Outdated
}

auto hasher = createVectorHasher<HashClass>(*decoded);
selected->applyToSelected([&](int row) {
nit: int -> vector_size_t or auto
velox/functions/sparksql/Hash.cpp
Outdated
struct XxHash64;

template <typename HashClass>
struct HashTraits {};
I notice several suites of APIs being added, including HashTraits, hashOne, and VectorHasher. Could we add a comment for each suite of APIs describing its functionality?
Force-pushed 2312470 to f46fa37 (Compare)
Just one comment. Thanks!
velox/functions/sparksql/Hash.cpp
Outdated
@@ -386,6 +618,9 @@ void checkArgTypes(const std::vector<exec::VectorFunctionArg>& args) {
    case TypeKind::DOUBLE:
    case TypeKind::HUGEINT:
    case TypeKind::TIMESTAMP:
    case TypeKind::ARRAY:
    case TypeKind::MAP:
    case TypeKind::ROW:
Do we need to check the element types inside these?
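Checking element types would mean walking the type tree recursively. A minimal standalone sketch of the idea, using a hypothetical stand-in for Velox's actual Type/TypePtr API (names and the enum here are illustrative, not Velox's real definitions):

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Minimal stand-in for a type tree; real code would use Velox's TypePtr.
enum class TypeKind { INTEGER, VARCHAR, UNKNOWN, ARRAY, MAP, ROW };

struct Type {
  TypeKind kind;
  std::vector<std::shared_ptr<Type>> children;  // element/key/value/field types
};

// Recursively rejects unsupported leaf kinds such as UNKNOWN, so a nested
// ROW(..., UNKNOWN) is caught even though ROW itself is supported.
bool checkHashElementType(const std::shared_ptr<Type>& type) {
  switch (type->kind) {
    case TypeKind::ARRAY:
    case TypeKind::MAP:
    case TypeKind::ROW:
      for (const auto& child : type->children) {
        if (!checkHashElementType(child)) {
          return false;
        }
      }
      return true;
    case TypeKind::UNKNOWN:
      return false;
    default:
      return true;
  }
}
```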
@mbasmanova Could you help to review again? Thanks!
}

 private:
  const std::optional<int64_t> seed_;
};

bool checkHashElementType(const TypePtr& type) {
  switch (type->kind()) {
Have we supported all the types? If yes, we don't need this check. Otherwise, check only in debug mode to guard against future data types.
Not all type kinds are supported, such as UNKNOWN.
Is it possible to restructure the code to avoid so many static functions?
Please update the PR title to note the support for complex types in
Curious, is there any downside to using static functions? The hash computations are stateless.
A class whose functions are all static looks strange to me, but the current implementation is also OK with me.
@rui-mo @PHILO-HE @jinchengchenghh Do you have further comments? Thanks!
LGTM |
I have no comment. Thanks!
Thanks. Just two nits.
velox/functions/sparksql/Hash.cpp
Outdated
struct XxHash64;

/// A template struct that contains the seed and return type of the hash
/// function.
nit: /// is used for multi-line comments that are to be exposed. In an anonymous namespace, // is needed.
velox/functions/sparksql/Hash.cpp
Outdated
}

/// Class to compute hashes identical to one produced by Spark.
/// Hashes are computed using the algorithm implemented in HashClass.
Ditto.
Force-pushed edf54dc to 2e928f0 (Compare)
@mbasmanova Could you help to review again? Thanks!
@mbasmanova Any more comments? The function is used by Gluten columnar shuffle.
Force-pushed 2e928f0 to 6744af2 (Compare)
@mbasmanova Could you help to review this patch? Thanks!
// ReturnType can be either int32_t or int64_t
// HashClass contains the function like hashInt32
template <typename ReturnType, typename HashClass, typename SeedType>
template <
Do we need to carry these three template types through all the intermediate classes? Could we instead just pass HashClass, and use the trait class only when we actually need SeedType and ReturnType, which is in the hashOne() functions?
@@ -95,13 +327,13 @@ void applyWithType(

class Murmur3Hash final {
Are there any benefits to defining the SeedType and ReturnType traits elsewhere and not directly in this class? e.g.
class Murmur3Hash final {
using SeedType = int32_t;
using ReturnType = int32_t;
...
@pedroerp Addressed comments. Could you help to review again? Thanks!
velox/functions/sparksql/Hash.cpp
Outdated
@@ -26,21 +26,270 @@ namespace {

const int32_t kDefaultSeed = 42;

// Computes the hash value of input using the hash function in HashClass.
template <typename HashClass, typename SeedType, typename ReturnType>
I think my point was that if you add the traits to the HashClass itself, can't you here just do something like:
template <typename THash>
typename THash::ReturnType hashOne(int32_t input, typename THash::SeedType seed) {
  return THash::hashInt32(input, seed);
}
Force-pushed f597b70 to 9c22758 (Compare)
Add one more metric: branch misprediction per loop:
    StringView input,
    typename HashClass::SeedType seed) {
  return HashClass::hashBytes(input, seed);
}
@marin-ma Do we need to use a template here? Would function overloading work?
Talked with Rong offline. No good solution except overloading functions in HashClass.
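To illustrate the overloading alternative being weighed here: overloads inside the hash class are selected at compile time by argument type, with no extra template layer. This is a hedged sketch only; std::string stands in for StringView, and the mixing is placeholder logic, not the real xxhash64:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

struct XxHash64 {
  using SeedType = int64_t;
  using ReturnType = int64_t;

  // Overload for fixed-width integers; placeholder mixing.
  static ReturnType hashOne(int32_t input, SeedType seed) {
    return input ^ seed;
  }

  // Overload for variable-length bytes; placeholder byte mixing.
  static ReturnType hashOne(const std::string& input, SeedType seed) {
    ReturnType h = seed;
    for (unsigned char c : input) {
      h = h * 31 + c;
    }
    return h;
  }
};
```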
@mbasmanova @pedroerp Can you help to review again? The function is key for Gluten to support complex data types in shuffle.
@FelixYBW @marin-ma Folks, I used your benchmark to compare performance before and after this PR for primitive types. I'm seeing a significant regression. Before:
After:
It's expected: the old solution just inlines the hash function into the loop, which is only 18 cycles for uint64_t input. The new solution uses a virtual function call and maybe one more direct function call; any extra cycle in the loop matters. To optimize this, we could treat primitive types differently from complex types, which adds code complexity. @marin-ma Let's check whether it's possible to replace the virtual function call with a simpler indirect function call. If the data size increases and cache misses grow, the gap will shrink because memory latency will hide the extra cycles. Currently the hash calculation time is very small compared to split, compression, and shuffle write in Gluten. We may not need to put more effort into these optimizations.
@marin-ma Did you test TPCH in Gluten using this PR? Let's run a test if not yet and see what the perf loss is.
auto hasher = createVectorHasher<HashClass>(*decoded);
selected->applyToSelected([&](auto row) {
  result.set(row, hasher->hashNotNullAt(row, result.valueAt(row)));
});
@marin-ma Looks like we only need to keep this switch-case statement for primitive types. Let's keep it to address @mbasmanova's perf concern.
Let's test TPCH first.
@mbasmanova We can't reproduce your perf numbers. The main branch is indeed <10% faster than this PR, and with a larger dataset the gap becomes even smaller. We also tested TPCH and observed no perf difference. The root cause is the extra indirect virtual function call.
But we did find one optimization opportunity for multiple columns; @marin-ma will update. (Edit: the optimization doesn't work.) @mbasmanova Any more comments?
Here is the performance of the simple function vs. the vector function. The simple function can cost up to 3.5x more time than the vector function, because the simple function has to check for nulls row by row in the loop, and since nulls occur randomly this leads to a high branch misprediction ratio. Given the big performance difference, I don't think we should use the simple function. Here is the branch misprediction chart; it's remarkable that Velox maintains a flat branch misprediction ratio in its vectorized processing.
@FelixYBW Binwei, would you share the implementation of the simple function? I'd like to understand what you refer to regarding "has to check the null row by row in the loop". I would assume this to be the same between simple and vector functions.
For this function, we need to pass multiple columns and set defaultNullBehavior=false. Then the code looks like below. The null check is in the inner loop.
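The code sample referenced here did not survive extraction. As a hedged stand-in, this is a minimal sketch of what such a per-row null check in a simple-function-style loop might look like; all names, the column layout, and the mixing are hypothetical:

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical shape of the per-row loop a simple function ends up with when
// defaultNullBehavior=false: the null check sits inside the inner loop, so
// randomly placed nulls become unpredictable branches in the hot path.
int64_t hashRows(
    const std::vector<std::vector<std::optional<int64_t>>>& columns,
    int64_t seed) {
  int64_t acc = 0;
  const size_t numRows = columns.empty() ? 0 : columns[0].size();
  for (size_t row = 0; row < numRows; ++row) {
    int64_t h = seed;
    for (const auto& column : columns) {
      if (column[row].has_value()) {  // null branch taken per row, per column
        h = h * 31 + *column[row];    // placeholder mixing
      }
    }
    acc += h;
  }
  return acc;
}
```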
@FelixYBW Binwei, thank you for explaining. I believe you are saying that VectorFunction handles nulls in bulk and that gives it a 3.5x perf boost over the simple function. Is this right? It would be nice to update the PR description to explain the choice of implementation, e.g. VectorFunction vs. simple function, virtual function call per row vs. type switch per row. Thanks.
Yes. I will take a look at how the in-bulk null handling can decrease the branch misprediction so much. If so, we should always use VectorFunction when a function needs defaultNullBehavior=false.
Force-pushed 2f6bafd to 6a270b3 (Compare)
@mbasmanova Updated the PR description and rebased. PTAL. Thanks!
@pedroerp has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Conbench analyzed the 1 benchmark run on this commit. There were no benchmark performance regressions. 🎉 The full Conbench report has more details.
…kincubator#9414)

Summary: Currently, sparksql hash functions only support primitive types. This patch adds the implementation for complex types, including array, map, and row. The expected results in UT are obtained from Spark's output.

Spark's implementation: https://github.com/apache/spark/blob/a2b7050e0fc5db6ac98db57309e4737acd26bf3a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala#L536-L609

To support hashing for complex types and align with Spark's implementation, this patch uses a per-row virtual function call, and the function is implemented as a vector function rather than a simple function. Below are some notes from the benchmark results.

Virtual function call per row vs. type switch per row: the virtual function call performs 15% better due to having 20% fewer instructions. The switch statement involves more branch instructions, but both methods have a similar branch misprediction rate of 2.8%. The switch statement doesn't show higher branch misprediction because its fixed pattern allows the BPU to handle it effectively. However, if the schema becomes very complex and exceeds the BPU's history track buffer (currently at 1000 levels), the misprediction rate may increase.

VectorFunction vs. simple function: since the function doesn't apply default null behavior, null judgment for each field occurs within the per-row call when using a simple function. In contrast, a vector function first filters the null values per column, avoiding null judgments in the top-level loop. By evaluating the implementation across all null ratios for simple/vector functions, we observed that the simple function can take up to 3.5 times longer than the vector function. Checking for null values row by row within the loop can lead to a high branch misprediction ratio due to the randomness of null values, while the vector function maintains a consistent branch misprediction ratio across all null ratios.

Pull Request resolved: facebookincubator#9414
Reviewed By: mbasmanova
Differential Revision: D56783038
Pulled By: pedroerp
fbshipit-source-id: 0238f0e88f7f395c41e976003a138cddba3bd093
Currently, sparksql hash functions only support primitive types.
This patch adds the implementation for complex types, including array, map, and row.
The expected results in UT are obtained from Spark's output.
Spark's implementation
https://github.com/apache/spark/blob/a2b7050e0fc5db6ac98db57309e4737acd26bf3a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala#L536-L609
Summary:
To support hashing for complex types and align with Spark's implementation,
this patch uses a per-row virtual function call, and the function is implemented
as a vector function rather than a simple function.
Below are some notes from the benchmark results:
Virtual function call per-row vs. type-switch per row:
The virtual function call performs 15% better due to having 20% fewer instructions.
The switch statement involves more branch instructions but both methods have a
similar branch misprediction rate of 2.8%. The switch statement doesn't show
higher branch misprediction because its fixed pattern allows the BPU to handle it
effectively. However, if the schema becomes very complex and exceeds the BPU's
history track buffer (currently at 1000 levels), the misprediction rate may increase.
VectorFunction vs. Simple Function:
Since the function doesn't apply default null behavior, null judgment for each
field occurs within the call per row when using a simple function.
In contrast, a vector function first filters the null values per column, avoiding
null judgments in the top-level loop.
By evaluating the implementation across all null ratios for simple/vector functions,
we observed that the simpler function can take up to 3.5 times longer than the vector
function. Checking for null values row by row within the loop can lead to a high
branch misprediction ratio due to the randomness of null values, while vector function
can maintain a consistent branch misprediction ratio across all null ratios in vector
processes.