
Support complex types in sparksql hash and xxhash64 function #9414

Closed

Conversation

marin-ma
Contributor

@marin-ma marin-ma commented Apr 9, 2024

Currently, sparksql hash functions only support primitive types.
This patch adds implementations for complex types, including array, map, and row.

The expected results in the unit tests are obtained from Spark's output.

Spark's implementation
https://github.com/apache/spark/blob/a2b7050e0fc5db6ac98db57309e4737acd26bf3a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala#L536-L609

Summary:

To support hashing for complex types and align with Spark's implementation,
this patch uses a per-row virtual function call, and the function is implemented
as a vector function rather than a simple function.
Below are some notes from the benchmark results:

Virtual function call per row vs. type switch per row:
The virtual function call performs 15% better due to executing 20% fewer instructions.
The switch statement involves more branch instructions, but both methods have a
similar branch misprediction rate (2.0-2.8%). The switch statement doesn't show
higher branch misprediction because its fixed pattern allows the BPU to handle it
effectively. However, if the schema becomes very complex and exceeds the BPU's
history tracking buffer (currently around 1000 levels), the misprediction rate may increase.

VectorFunction vs. simple function:
Since the function doesn't apply default null behavior, with a simple function a
null check for each field occurs inside the per-row call. In contrast, a vector
function first filters out null values per column, avoiding null checks in the
top-level loop. Evaluating both implementations across all null ratios, we
observed that the simple function can take up to 3.5 times longer than the
vector function. Checking for nulls row by row within the loop leads to a high
branch misprediction ratio due to the randomness of null values, while the
vector function maintains a consistent branch misprediction ratio across all
null ratios.
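The per-row virtual dispatch described above can be sketched as follows. This is a toy illustration: the names (VectorHasherBase, Int32Hasher, hashColumn) are made up for this sketch, and a trivial mixing function stands in for the real Murmur3/xxHash64 algorithms; it is not Velox's actual API.

```cpp
#include <cstdint>
#include <vector>

using vector_size_t = int32_t;

// Base hasher: the top-level loop makes one virtual call per row,
// so no type switch is needed inside the loop.
struct VectorHasherBase {
  virtual ~VectorHasherBase() = default;
  virtual int32_t hashNotNullAt(vector_size_t row, int32_t seed) = 0;
};

// Per-type hasher for an int32 column. A real implementation would
// apply Murmur3 or xxHash64 mixing; a trivial mix keeps the sketch short.
struct Int32Hasher final : VectorHasherBase {
  explicit Int32Hasher(const std::vector<int32_t>& values) : values_(values) {}
  int32_t hashNotNullAt(vector_size_t row, int32_t seed) override {
    return seed * 31 + values_[row];
  }
  const std::vector<int32_t>& values_;
};

// Top-level loop: type resolution happened once when the hasher was
// created, so each row costs only one indirect (virtual) call.
std::vector<int32_t> hashColumn(
    VectorHasherBase& hasher, vector_size_t numRows, int32_t seed) {
  std::vector<int32_t> result(numRows);
  for (vector_size_t row = 0; row < numRows; ++row) {
    result[row] = hasher.hashNotNullAt(row, seed);
  }
  return result;
}
```

The alternative discussed in the thread, a type switch inside the row loop, would replace the virtual call with a `switch (typeKind)` executed once per row.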

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 9, 2024

netlify bot commented Apr 9, 2024

Deploy Preview for meta-velox canceled.

Latest commit: 6a270b3
Latest deploy log: https://app.netlify.com/sites/meta-velox/deploys/664d4e70b02bc80008958071

@marin-ma marin-ma force-pushed the sparksql-complex-hash branch from da3b7d1 to e91764e on April 9, 2024 08:02
@marin-ma
Contributor Author

marin-ma commented Apr 9, 2024

@mbasmanova Could you please help to review? Thanks!

Contributor

@mbasmanova mbasmanova left a comment


@rui-mo @PHILO-HE Folks, would you help take a first pass?

Collaborator

@rui-mo rui-mo left a comment


Thanks! Would you add a reference for Spark's implementation in the PR description?

typename T,
typename SeedType = typename HashTraits<HashClass>::SeedType,
typename ReturnType = typename HashTraits<HashClass>::ReturnType>
ReturnType hashOne(T input, SeedType seed) {
Collaborator


Must T be int32_t for hashInt32?


virtual ReturnType hashNotNull(vector_size_t index, SeedType seed) = 0;

VectorHasher(DecodedVector& decoded) : decoded_(decoded) {}
Collaborator


Maybe better to move the constructor and destructor near the class name for readability.

return hashNotNull(index, seed);
}

virtual ReturnType hashNotNull(vector_size_t index, SeedType seed) = 0;
Collaborator


Maybe rename to hashNotNullAt to correspond with hashAt.

if (baseType->isPrimitiveType()) {
return VELOX_DYNAMIC_SCALAR_TEMPLATE_TYPE_DISPATCH(
createPrimitiveVectorHasher, HashClass, baseType->kind(), decoded);
} else if (baseType->isArray()) {
Collaborator


nit: else is not needed after return.

}

auto hasher = createVectorHasher<HashClass>(*decoded);
selected->applyToSelected([&](int row) {
Collaborator


nit: int -> vector_size_t or auto

struct XxHash64;

template <typename HashClass>
struct HashTraits {};
Collaborator


I notice several suites of APIs being added, including HashTraits, hashOne, and VectorHasher. Could we add a comment for each suite describing its functionality?

@marin-ma marin-ma force-pushed the sparksql-complex-hash branch 2 times, most recently from 2312470 to f46fa37 Compare April 11, 2024 02:47
@marin-ma
Contributor Author

@rui-mo @PHILO-HE Could you help to review again? Thanks!

Contributor

@PHILO-HE PHILO-HE left a comment


Just one comment. Thanks!

@@ -386,6 +618,9 @@ void checkArgTypes(const std::vector<exec::VectorFunctionArg>& args) {
case TypeKind::DOUBLE:
case TypeKind::HUGEINT:
case TypeKind::TIMESTAMP:
case TypeKind::ARRAY:
case TypeKind::MAP:
case TypeKind::ROW:
Contributor


Do we need to check the element types inside these?

@marin-ma
Contributor Author

@mbasmanova Could you help to review again? Thanks!

}

private:
const std::optional<int64_t> seed_;
};

bool checkHashElementType(const TypePtr& type) {
switch (type->kind()) {
Contributor


Have we supported all the types? If yes, we don't need this check. Or check only in debug mode to guard against future data types.

Contributor Author


Not all type kinds are supported, such as UNKNOWN.

@jinchengchenghh
Contributor

Is it possible to restructure the code to avoid so many static functions?

@jinchengchenghh
Contributor

Please update the PR title to note that complex types are supported in both the hash and xxhash64 functions.

@marin-ma marin-ma changed the title Support complex types in sparksql hash function Support complex types in sparksql hash and xxhash64 function Apr 15, 2024
@marin-ma
Contributor Author

Is it possible to restructure the code to avoid so many static functions?

Curious, is there any downside to using static functions? The hash computations are stateless.

@jinchengchenghh
Contributor

Is it possible to restructure the code to avoid so many static functions?

Curious, is there any downside to using static functions? The hash computations are stateless.

A class whose functions are all static looks strange to me, but the current implementation is also fine with me.

@marin-ma
Contributor Author

@rui-mo @PHILO-HE @jinchengchenghh Do you have further comments? Thanks!

@jinchengchenghh
Contributor

LGTM

Contributor

@PHILO-HE PHILO-HE left a comment


I have no comment. Thanks!

Collaborator

@rui-mo rui-mo left a comment


Thanks. Just two nits.

struct XxHash64;

/// A template struct that contains the seed and return type of the hash
/// function.
Collaborator


nit: /// is used for multi-line comments that are to be exposed. In an anonymous namespace, // is needed.

}

/// Class to compute hashes identical to one produced by Spark.
/// Hashes are computed using the algorithm implemented in HashClass.
Collaborator


Ditto.

@marin-ma marin-ma force-pushed the sparksql-complex-hash branch from edf54dc to 2e928f0 on April 15, 2024 07:26
@marin-ma
Contributor Author

@mbasmanova Could you help to review again? Thanks!

@FelixYBW

@mbasmanova Any more comments? The function is used by Gluten's columnar shuffle.

@marin-ma marin-ma force-pushed the sparksql-complex-hash branch from 2e928f0 to 6744af2 on April 24, 2024 08:46
@marin-ma
Contributor Author

@mbasmanova Could you help to review this patch? Thanks!

velox/functions/sparksql/Hash.cpp
// ReturnType can be either int32_t or int64_t
// HashClass contains the function like hashInt32
template <typename ReturnType, typename HashClass, typename SeedType>
template <
Contributor


do we need to carry these three template types through all the intermediate classes? Could we instead just pass HashClass, and use the trait class only where we actually need SeedType and ReturnType, which is in the hashOne() functions?

@@ -95,13 +327,13 @@ void applyWithType(

class Murmur3Hash final {
Contributor


are there any benefits to defining the SeedType and ReturnType traits somewhere else rather than directly in this class? e.g.

class Murmur3Hash final {
  using SeedType = int32_t;
  using ReturnType = int32_t;
  ...

@marin-ma
Contributor Author

@pedroerp Addressed comments. Could you help to review again? Thanks!

@@ -26,21 +26,270 @@ namespace {

const int32_t kDefaultSeed = 42;

// Computes the hash value of input using the hash function in HashClass.
template <typename HashClass, typename SeedType, typename ReturnType>
Contributor


I think my point was that if you add the traits to the HashClass itself, can't you here just do something like:

template <typename THash>
typename THash::ReturnType hashOne(int32_t input, typename THash::SeedType seed) {
  return THash::hashInt32(input, seed);
}

@marin-ma marin-ma force-pushed the sparksql-complex-hash branch from f597b70 to 9c22758 on April 30, 2024 01:18
@FelixYBW

FelixYBW commented May 13, 2024

                               virtual      switch       virtual/switch
Instructions per loop          14,714       18,584       0.792
IPC                            2.54         2.78
Loops per second               533,513      462,064      1.155
Branch misprediction ratio     2.8%         2.0%
Branch mispredictions/1K inst  3.88         3.24
Branches per loop              2,039        3,030

One more metric, branch mispredictions per loop:
virtual function call: 57
switch: 60

StringView input,
typename HashClass::SeedType seed) {
return HashClass::hashBytes(input, seed);
}


@marin-ma Do we need to use a template here? Does function overloading work?


Talked with Rong offline. There's no good solution other than overloading the functions in HashClass.

@FelixYBW

@mbasmanova @pedroerp Can you help review again? This function is key for Gluten to support complex data types in shuffle.

@mbasmanova
Contributor

@FelixYBW @marin-ma Folks, I used your benchmark to compare performance before and after this PR for primitive types. I'm seeing a significant regression.

Before:

============================================================================
[...]hmarks/ExpressionBenchmarkBuilder.cpp     relative  time/iter   iters/s
============================================================================
hash_BIGINT##hash                                         585.43us     1.71K
hash_BIGINT##xxhash64                                     488.31us     2.05K
----------------------------------------------------------------------------
hash_INTEGER##hash                                        488.94us     2.05K
hash_INTEGER##xxhash64                                    471.50us     2.12K
hash_VARCHAR##hash                                          2.21ms    452.07
hash_VARCHAR##xxhash64                                      1.97ms    508.80

After:

============================================================================
[...]hmarks/ExpressionBenchmarkBuilder.cpp     relative  time/iter   iters/s
============================================================================
hash_BIGINT##hash                                         883.61us     1.13K
hash_BIGINT##xxhash64                                     927.77us     1.08K
----------------------------------------------------------------------------
hash_INTEGER##hash                                        808.87us     1.24K
hash_INTEGER##xxhash64                                    815.35us     1.23K
hash_VARCHAR##hash                                          2.67ms    374.56
hash_VARCHAR##xxhash64                                      2.49ms    402.09

@FelixYBW

============================================================================
[...]hmarks/ExpressionBenchmarkBuilder.cpp     relative  time/iter   iters/s
============================================================================
hash_BIGINT##hash                                         883.61us     1.13K
hash_BIGINT##xxhash64                                     927.77us     1.08K
----------------------------------------------------------------------------
hash_INTEGER##hash                                        808.87us     1.24K
hash_INTEGER##xxhash64                                    815.35us     1.23K
hash_VARCHAR##hash                                          2.67ms    374.56
hash_VARCHAR##xxhash64                                      2.49ms    402.09

That's expected: the old solution inlined the hash function into the loop, which takes only about 18 cycles for a uint64_t input. The new solution uses a virtual function call and possibly one more direct function call, and any extra cycle in the loop matters.

To optimize this, we could treat primitive types differently from complex types, which adds code complexity. @marin-ma Let's check whether it's possible to replace the virtual function call with a simpler indirect function call.

As data size increases and cache misses grow, the gap will shrink because memory latency hides the extra cycles.

Currently the hash calculation time is very small compared to split, compression, and shuffle write in Gluten, so we may not need to put more effort into these optimizations.

@FelixYBW

@marin-ma Did you test TPC-H in Gluten using this PR? If not, let's run a test and see what the perf loss is.

auto hasher = createVectorHasher<HashClass>(*decoded);
selected->applyToSelected([&](auto row) {
result.set(row, hasher->hashNotNullAt(row, result.valueAt(row)));
});


@marin-ma Looks like we only need to keep this switch-case statement for primitive types. Let's keep it to address @mbasmanova's perf concern.

Let's test TPC-H first.

@FelixYBW

FelixYBW commented May 16, 2024

@mbasmanova We can't reproduce your perf numbers. The main branch is indeed <10% faster than this PR, and with a larger dataset the gap becomes even smaller. We also tested TPC-H and observed no perf difference.

The root cause is the extra indirect virtual function call.

                               main                this PR
cycles                         11,962,246,551      11,963,496,863
instructions                   69,544,481,778      67,421,807,341
br_inst_retired.indirect       245,381             540,407,083
br_inst_retired.near_call      1,224,994,398       1,680,520,291
br_inst_retired.near_return    1,224,104,407       1,681,002,036

But we did find one optimization opportunity for multiple columns; @marin-ma will update.

Update: the optimization doesn't work. @mbasmanova Any more comments?

@FelixYBW

@mbasmanova @pedroerp

Here is the performance of the simple function vs. the vector function. The simple function can cost up to 3.5x more time than the vector function, because the simple function has to check for nulls row by row in the loop, and since nulls occur randomly, this leads to a high branch misprediction ratio.

Given the big performance difference, I don't think we should use the simple function.

[Chart: simple vs. vector function time across null ratios]

Here is the branch misprediction chart; it's remarkable that Velox can maintain a flat branch misprediction ratio across null ratios in vector processing.

[Chart: branch misprediction ratio across null ratios]

@mbasmanova
Contributor

@FelixYBW Binwei, would you share the implementation of the simple function? I'd like to understand what you refer to regarding "has to check the null row by row in the loop". I would assume this to be the same between simple and vector functions.

@FelixYBW

@FelixYBW Binwei, would you share the implementation of the simple function? I'd like to understand what you refer to regarding "has to check the null row by row in the loop". I would assume this to be the same between simple and vector functions.

For this function, we need to pass multiple columns and set defaultNullBehavior=false. Then the code looks like below; the null check is in the inner loop.

for (int r = 0; r < rowSize; r++) {
  seed = 0;
  for (int c = 0; c < colSize; c++) {
    auto col = rowVec->childAt(c);
    // Null check in the inner loop: mispredicts when nulls are random.
    if (!col->isNullAt(r)) {
      seed = hash(col->valueAt(r), seed);
    }
  }
  rst[r] = seed;
}
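For contrast, the vector-function approach described in this thread can be sketched as follows: nulls are filtered once per column, so the hot loop carries no data-dependent branch. The names (hashOne, hashColumnBulk) and the trivial mixing function are illustrative stand-ins, not Velox's actual API.

```cpp
#include <cstdint>
#include <vector>

// Stand-in for the real Murmur3/xxHash64 mixing step.
int32_t hashOne(int32_t value, int32_t seed) {
  return seed * 31 + value;
}

// Vector-function style: deselect nulls up front (in Velox this is done
// with SelectivityVector::deselectNulls), then hash in a tight loop.
// `result` holds the running seed per row and is updated in place.
void hashColumnBulk(
    const std::vector<int32_t>& values,
    const std::vector<bool>& isNull,
    std::vector<int32_t>& result) {
  // Pass 1: bulk null handling -- collect non-null row numbers once.
  std::vector<int32_t> selected;
  for (int32_t row = 0; row < static_cast<int32_t>(values.size()); ++row) {
    if (!isNull[row]) {
      selected.push_back(row);
    }
  }
  // Pass 2: hot loop over selected rows; its branch pattern no longer
  // depends on the randomness of nulls, keeping misprediction flat.
  for (int32_t row : selected) {
    result[row] = hashOne(values[row], result[row]);
  }
}
```

Null rows simply keep their incoming seed, matching Spark's behavior of skipping null fields in the hash accumulation.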

@mbasmanova
Contributor

@FelixYBW Binwei, thank you for explaining. I believe you are saying that VectorFunction handles nulls in-bulk and that gives it 3.5x perf boost over simple function. Is this right? It would be nice to update the PR description to explain the choice of implementation, e.g. VectorFunction vs. Simple Function, virtual function call per-row vs. type-switch per row. Thanks.

  exec::DecodedArgs decodedArgs(rows, args, context);
  for (auto i = hashIdx; i < args.size(); i++) {
    auto decoded = decodedArgs.at(i);
    const SelectivityVector* selected = &rows;
    if (args[i]->mayHaveNulls()) {
      *selectedMinusNulls.get(rows.end()) = rows;
      selectedMinusNulls->deselectNulls(
          decoded->nulls(&rows), rows.begin(), rows.end());
      selected = selectedMinusNulls.get();
    }

    auto hasher = createVectorHasher<HashClass>(*decoded);
    selected->applyToSelected([&](auto row) {
      result.set(row, hasher->hashNotNullAt(row, result.valueAt(row)));
    });

@mbasmanova mbasmanova added the ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall label May 21, 2024
@FelixYBW

@FelixYBW Binwei, thank you for explaining. I believe you are saying that VectorFunction handles nulls in-bulk and that gives it 3.5x perf boost over simple function. Is this right? It would be nice to update the PR description to explain the choice of implementation, e.g. VectorFunction vs. Simple Function, virtual function call per-row vs. type-switch per row. Thanks.

Yes. I will take a look at how the in-bulk null handling decreases branch misprediction so much. If so, we should always use a VectorFunction when a function needs defaultNullBehavior=false.

@marin-ma marin-ma force-pushed the sparksql-complex-hash branch from 2f6bafd to 6a270b3 on May 22, 2024 01:46
@marin-ma
Contributor Author

@mbasmanova Updated the PR description and rebased. PTAL. Thanks!

@facebook-github-bot
Contributor

@pedroerp has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@pedroerp merged this pull request in 066a72f.


Conbench analyzed the 1 benchmark run on commit 066a72fd.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

Joe-Abraham pushed a commit to Joe-Abraham/velox that referenced this pull request Jun 7, 2024
…kincubator#9414)

Summary:
Currently, sparksql hash functions only support primitive types.
This patch adds the implementation for complex types, including array, map and row.

The expected results in UT are obtained from spark's output.

Spark's implementation
https://github.com/apache/spark/blob/a2b7050e0fc5db6ac98db57309e4737acd26bf3a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala#L536-L609

To support hashing for complex types and align with Spark's implementation,
this patch uses a per-row virtual function call and the function is implemented
as vector function rather than simple function.
Below are some notes from the benchmark results:

Virtual function call per-row vs. type-switch per row:
The virtual function call performs 15% better due to having 20% fewer instructions.
The switch statement involves more branch instructions but both methods have a
similar branch misprediction rate of 2.8%. The switch statement doesn't show
higher branch misprediction because its fixed pattern allows the BPU to handle it
effectively. However, if the schema becomes very complex and exceeds the BPU's
history track buffer (currently at 1000 levels), the misprediction rate may increase.

VectorFunction vs. Simple Function:
Since the function doesn't apply default null behavior, null judgment for each
field occurs within the call per row when using a simple function.
In contrast, a vector function first filters the null values per column, avoiding
null judgments in the top-level loop.
By evaluating the implementation across all null ratios for simple/vector functions,
we observed that the simpler function can take up to 3.5 times longer than the vector
function. Checking for null values row by row within the loop can lead to a high
branch misprediction ratio due to the randomness of null values, while vector function
can maintain a consistent branch misprediction ratio across all null ratios in vector
processes.

Pull Request resolved: facebookincubator#9414

Reviewed By: mbasmanova

Differential Revision: D56783038

Pulled By: pedroerp

fbshipit-source-id: 0238f0e88f7f395c41e976003a138cddba3bd093