Commit
Support complex types in sparksql hash and xxhash64 function (facebookincubator#9414)

Summary: Currently, the sparksql hash functions support only primitive types. This patch adds the implementation for complex types, including array, map and row. The expected results in the unit tests are taken from Spark's output.

Spark's implementation: https://github.com/apache/spark/blob/a2b7050e0fc5db6ac98db57309e4737acd26bf3a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala#L536-L609

To support hashing for complex types and align with Spark's implementation, this patch uses a per-row virtual function call, and the function is implemented as a vector function rather than a simple function. Below are some notes from the benchmark results.

Virtual function call per row vs. type switch per row: The virtual function call performs 15% better because it executes 20% fewer instructions. The switch statement involves more branch instructions, but both methods have a similar branch misprediction rate of 2.8%. The switch statement does not show a higher misprediction rate because its fixed pattern allows the BPU to predict it effectively. However, if the schema becomes very complex and exceeds the BPU's history track buffer (currently at 1000 levels), the misprediction rate may increase.

VectorFunction vs. simple function: Since the function does not apply default null behavior, a simple function has to check each field for null inside the per-row call. In contrast, a vector function first filters out the null values per column, which avoids null checks in the top-level loop. Evaluating both implementations across all null ratios, we observed that the simple function can take up to 3.5 times longer than the vector function. Checking for nulls row by row inside the loop can lead to a high branch misprediction ratio because nulls are randomly distributed, while the vector function maintains a consistent branch misprediction ratio across all null ratios.
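The two design points above (per-row virtual dispatch for nested types, and filtering nulls once per column before the hashing loop) can be illustrated with a minimal, self-contained sketch. This is not the Velox implementation: `RowHasher`, `IntHasher`, `ArrayHasher`, `hashColumns` and `mix` are hypothetical names, `mix` is a placeholder rather than Spark's Murmur3 or xxHash64, and the toy columns stand in for real vectors. It assumes, following Spark's behavior, a seed of 42, that each value is hashed with the running hash as its seed, and that a null leaves the running hash unchanged.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Placeholder mixer for illustration only; NOT Spark's Murmur3 or xxHash64.
inline uint32_t mix(uint32_t seed, uint32_t value) {
  return (seed ^ value) * 0x9E3779B1u + 0x7F4A7C15u;
}

// Hypothetical per-row interface: one hasher object per column or nested child.
struct RowHasher {
  virtual ~RowHasher() = default;
  virtual bool mayHaveNulls() const = 0;
  virtual bool isNullAt(std::size_t row) const = 0;
  // Hashes a non-null row, using the running hash as the seed.
  virtual uint32_t hashAt(std::size_t row, uint32_t seed) const = 0;
};

struct IntHasher final : RowHasher {
  std::vector<int32_t> values;
  std::vector<bool> nulls; // empty means "this column has no nulls"

  explicit IntHasher(std::vector<int32_t> v, std::vector<bool> n = {})
      : values(std::move(v)), nulls(std::move(n)) {
    assert(nulls.empty() || nulls.size() == values.size());
  }
  bool mayHaveNulls() const override { return !nulls.empty(); }
  bool isNullAt(std::size_t row) const override {
    return !nulls.empty() && nulls[row];
  }
  uint32_t hashAt(std::size_t row, uint32_t seed) const override {
    return mix(seed, static_cast<uint32_t>(values[row]));
  }
};

// Arrays are assumed non-null in this sketch; elements may be null.
struct ArrayHasher final : RowHasher {
  std::unique_ptr<RowHasher> elements;
  std::vector<std::size_t> offsets;
  std::vector<std::size_t> sizes;

  ArrayHasher(std::unique_ptr<RowHasher> e,
              std::vector<std::size_t> o,
              std::vector<std::size_t> s)
      : elements(std::move(e)), offsets(std::move(o)), sizes(std::move(s)) {
    assert(offsets.size() == sizes.size());
  }
  bool mayHaveNulls() const override { return false; }
  bool isNullAt(std::size_t) const override { return false; }
  uint32_t hashAt(std::size_t row, uint32_t seed) const override {
    uint32_t h = seed;
    for (std::size_t i = offsets[row]; i < offsets[row] + sizes[row]; ++i) {
      if (!elements->isNullAt(i)) {
        h = elements->hashAt(i, h); // virtual call per element
      }
    }
    return h;
  }
};

// Column-at-a-time driver: nulls are filtered once per column, so the
// hashing loop itself carries no null check.
std::vector<uint32_t> hashColumns(const std::vector<const RowHasher*>& columns,
                                  std::size_t numRows,
                                  uint32_t seed = 42) {
  std::vector<uint32_t> result(numRows, seed);
  for (const RowHasher* column : columns) {
    if (!column->mayHaveNulls()) {
      for (std::size_t row = 0; row < numRows; ++row) {
        result[row] = column->hashAt(row, result[row]);
      }
      continue;
    }
    std::vector<std::size_t> nonNullRows;
    for (std::size_t row = 0; row < numRows; ++row) {
      if (!column->isNullAt(row)) {
        nonNullRows.push_back(row);
      }
    }
    for (std::size_t row : nonNullRows) {
      result[row] = column->hashAt(row, result[row]);
    }
  }
  return result;
}
```

A caller would build one hasher per input column and pass them to `hashColumns`, for example `hashColumns({&a, &b}, numRows)`. Keeping the type dispatch behind `hashAt` means a nested schema only adds more hasher objects, not more cases in a per-row switch.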
Pull Request resolved: facebookincubator#9414

Reviewed By: mbasmanova

Differential Revision: D56783038

Pulled By: pedroerp

fbshipit-source-id: 0238f0e88f7f395c41e976003a138cddba3bd093