
Support complex types in sparksql hash and xxhash64 function #9414

Closed

Conversation

marin-ma
Contributor

@marin-ma marin-ma commented Apr 9, 2024

Currently, sparksql hash functions only support primitive types.
This patch adds implementations for complex types, including array, map, and row.

The expected results in the unit tests are obtained from Spark's output.

Spark's implementation
https://github.com/apache/spark/blob/a2b7050e0fc5db6ac98db57309e4737acd26bf3a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala#L536-L609

Summary:

To support hashing for complex types and align with Spark's implementation,
this patch uses a per-row virtual function call, and the function is implemented
as a vector function rather than a simple function.
Below are some notes from the benchmark results:

Virtual function call per row vs. type switch per row:
The virtual function call performs 15% better due to executing 20% fewer instructions.
The switch statement involves more branch instructions, but both methods have a
similar branch misprediction rate (2.0-2.8%). The switch statement doesn't show
higher branch misprediction because its fixed pattern allows the BPU to handle it
effectively. However, if the schema becomes very complex and exceeds the BPU's
history tracking buffer (currently around 1000 levels), the misprediction rate may increase.

VectorFunction vs. simple function:
Since the function doesn't apply default null behavior, with a simple function a
null check for each field occurs inside the per-row call. In contrast, a vector
function first filters out null values per column, avoiding null checks in the
top-level loop. Evaluating both implementations across all null ratios, we
observed that the simple function can take up to 3.5 times longer than the
vector function. Checking for nulls row by row within the loop leads to a high
branch misprediction ratio due to the randomness of null values, while the
vector function maintains a consistent branch misprediction ratio across all
null ratios.
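The per-row virtual dispatch described above can be sketched as follows. This is a toy illustration: the names (VectorHasherBase, Int32Hasher, hashColumn) are made up for this sketch, and a trivial mixing function stands in for the real Murmur3/xxHash64 algorithms; it is not Velox's actual API.

```cpp
#include <cstdint>
#include <vector>

using vector_size_t = int32_t;

// Base hasher: the top-level loop makes one virtual call per row,
// so no type switch is needed inside the loop.
struct VectorHasherBase {
  virtual ~VectorHasherBase() = default;
  virtual int32_t hashNotNullAt(vector_size_t row, int32_t seed) = 0;
};

// Per-type hasher for an int32 column. A real implementation would
// apply Murmur3 or xxHash64 mixing; a trivial mix keeps the sketch short.
struct Int32Hasher final : VectorHasherBase {
  explicit Int32Hasher(const std::vector<int32_t>& values) : values_(values) {}
  int32_t hashNotNullAt(vector_size_t row, int32_t seed) override {
    return seed * 31 + values_[row];
  }
  const std::vector<int32_t>& values_;
};

// Top-level loop: type resolution happened once when the hasher was
// created, so each row costs only one indirect (virtual) call.
std::vector<int32_t> hashColumn(
    VectorHasherBase& hasher, vector_size_t numRows, int32_t seed) {
  std::vector<int32_t> result(numRows);
  for (vector_size_t row = 0; row < numRows; ++row) {
    result[row] = hasher.hashNotNullAt(row, seed);
  }
  return result;
}
```

The alternative discussed in the thread, a type switch inside the row loop, would replace the virtual call with a `switch (typeKind)` executed once per row.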

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 9, 2024

netlify bot commented Apr 9, 2024

Deploy Preview for meta-velox canceled.

Latest commit: 6a270b3
Latest deploy log: https://app.netlify.com/sites/meta-velox/deploys/664d4e70b02bc80008958071

@marin-ma marin-ma force-pushed the sparksql-complex-hash branch from da3b7d1 to e91764e on April 9, 2024 08:02
@marin-ma
Contributor Author

marin-ma commented Apr 9, 2024

@mbasmanova Could you please help to review? Thanks!

Contributor

@mbasmanova mbasmanova left a comment


@rui-mo @PHILO-HE Folks, would you help take a first pass?

Collaborator

@rui-mo rui-mo left a comment


Thanks! Would you add a reference for Spark's implementation in the PR description?

typename T,
typename SeedType = typename HashTraits<HashClass>::SeedType,
typename ReturnType = typename HashTraits<HashClass>::ReturnType>
ReturnType hashOne(T input, SeedType seed) {
Collaborator


Must T be int32_t for hashInt32?


virtual ReturnType hashNotNull(vector_size_t index, SeedType seed) = 0;

VectorHasher(DecodedVector& decoded) : decoded_(decoded) {}
Collaborator


Maybe better to move the constructor and destructor near the class name for readability.

return hashNotNull(index, seed);
}

virtual ReturnType hashNotNull(vector_size_t index, SeedType seed) = 0;
Collaborator


Maybe rename to hashNotNullAt to correspond with hashAt.

if (baseType->isPrimitiveType()) {
return VELOX_DYNAMIC_SCALAR_TEMPLATE_TYPE_DISPATCH(
createPrimitiveVectorHasher, HashClass, baseType->kind(), decoded);
} else if (baseType->isArray()) {
Collaborator


nit: else is not needed after return.

}

auto hasher = createVectorHasher<HashClass>(*decoded);
selected->applyToSelected([&](int row) {
Collaborator


nit: int -> vector_size_t or auto

struct XxHash64;

template <typename HashClass>
struct HashTraits {};
Collaborator


I notice several suites of APIs being added, including HashTraits, hashOne, and VectorHasher. Could we add a comment for each suite describing its functionality?

@marin-ma marin-ma force-pushed the sparksql-complex-hash branch 2 times, most recently from 2312470 to f46fa37 Compare April 11, 2024 02:47
@marin-ma
Contributor Author

@rui-mo @PHILO-HE Could you help to review again? Thanks!

Contributor

@PHILO-HE PHILO-HE left a comment


Just one comment. Thanks!

@@ -386,6 +618,9 @@ void checkArgTypes(const std::vector<exec::VectorFunctionArg>& args) {
case TypeKind::DOUBLE:
case TypeKind::HUGEINT:
case TypeKind::TIMESTAMP:
case TypeKind::ARRAY:
case TypeKind::MAP:
case TypeKind::ROW:
Contributor


Do we need to check the element types inside these?

@marin-ma
Contributor Author

@mbasmanova Could you help to review again? Thanks!

}

private:
const std::optional<int64_t> seed_;
};

bool checkHashElementType(const TypePtr& type) {
switch (type->kind()) {
Contributor


Have we supported all the types? If yes, we don't need this check. Or check only in debug mode to guard against future data types.

Contributor Author


Not all type kinds are supported, such as UNKNOWN.

@jinchengchenghh
Contributor

Is it possible to restructure the code to avoid so many static functions?

@jinchengchenghh
Contributor

Please update the PR title to note that complex types are supported in both the hash and xxhash64 functions.

@marin-ma marin-ma changed the title Support complex types in sparksql hash function Support complex types in sparksql hash and xxhash64 function Apr 15, 2024
@marin-ma
Contributor Author

Is it possible to restructure the code to avoid so many static functions?

Curious, is there any downside to using static functions? The hash computations are stateless.

@jinchengchenghh
Contributor

Is it possible to restructure the code to avoid so many static functions?

Curious, is there any downside to using static functions? The hash computations are stateless.

A class whose functions are all static looks strange to me, but the current implementation is also fine with me.

@marin-ma
Contributor Author

@rui-mo @PHILO-HE @jinchengchenghh Do you have further comments? Thanks!

@jinchengchenghh
Contributor

LGTM

Contributor

@PHILO-HE PHILO-HE left a comment


I have no comment. Thanks!

Collaborator

@rui-mo rui-mo left a comment


Thanks. Just two nits.

struct XxHash64;

/// A template struct that contains the seed and return type of the hash
/// function.
Collaborator


nit: /// is used for multi-line comments that are to be exposed. In an anonymous namespace, // is needed.

}

/// Class to compute hashes identical to one produced by Spark.
/// Hashes are computed using the algorithm implemented in HashClass.
Collaborator


Ditto.

@marin-ma marin-ma force-pushed the sparksql-complex-hash branch from edf54dc to 2e928f0 on April 15, 2024 07:26
@marin-ma
Contributor Author

@mbasmanova Could you help to review again? Thanks!

@FelixYBW

@mbasmanova Any more comments? The function is used by Gluten's columnar shuffle.

@marin-ma marin-ma force-pushed the sparksql-complex-hash branch from 2e928f0 to 6744af2 on April 24, 2024 08:46
@marin-ma
Contributor Author

@mbasmanova Could you help to review this patch? Thanks!

velox/functions/sparksql/Hash.cpp
// ReturnType can be either int32_t or int64_t
// HashClass contains the function like hashInt32
template <typename ReturnType, typename HashClass, typename SeedType>
template <
Contributor


do we need to carry these three template types through all the intermediate classes? Could we instead just pass HashClass, and use the trait class only where we actually need SeedType and ReturnType, which is in the hashOne() functions?

@@ -95,13 +327,13 @@ void applyWithType(

class Murmur3Hash final {
Contributor


are there any benefits to defining the SeedType and ReturnType traits somewhere else rather than directly in this class? e.g.

class Murmur3Hash final {
  using SeedType = int32_t;
  using ReturnType = int32_t;
  ...

@marin-ma
Contributor Author

@pedroerp Addressed comments. Could you help to review again? Thanks!

@@ -26,21 +26,270 @@ namespace {

const int32_t kDefaultSeed = 42;

// Computes the hash value of input using the hash function in HashClass.
template <typename HashClass, typename SeedType, typename ReturnType>
Contributor


I think my point was that if you add the traits to the HashClass itself, can't you here just do something like:

template <typename THash>
typename THash::ReturnType hashOne(int32_t input, typename THash::SeedType seed) {
  return THash::hashInt32(input, seed);
}

@marin-ma marin-ma force-pushed the sparksql-complex-hash branch from f597b70 to 9c22758 on April 30, 2024 01:18
@FelixYBW

FelixYBW commented May 13, 2024

                               virtual      switch       virtual/switch
Instructions per loop          14,714       18,584       0.792
IPC                            2.54         2.78
Loops per second               533,513      462,064      1.155
Branch misprediction ratio     2.8%         2.0%
Branch mispredictions/1K inst  3.88         3.24
Branches per loop              2,039        3,030

One more metric, branch mispredictions per loop:
virtual function call: 57
switch: 60

StringView input,
typename HashClass::SeedType seed) {
return HashClass::hashBytes(input, seed);
}


@marin-ma Do we need to use a template here? Does function overloading work?


Talked with Rong offline. There's no good solution other than overloading the functions in HashClass.

@FelixYBW

@mbasmanova @pedroerp Can you help review again? This function is key for Gluten to support complex data types in shuffle.

@mbasmanova
Contributor

@FelixYBW @marin-ma Folks, I used your benchmark to compare performance before and after this PR for primitive types. I'm seeing a significant regression.

Before:

============================================================================
[...]hmarks/ExpressionBenchmarkBuilder.cpp     relative  time/iter   iters/s
============================================================================
hash_BIGINT##hash                                         585.43us     1.71K
hash_BIGINT##xxhash64                                     488.31us     2.05K
----------------------------------------------------------------------------
hash_INTEGER##hash                                        488.94us     2.05K
hash_INTEGER##xxhash64                                    471.50us     2.12K
hash_VARCHAR##hash                                          2.21ms    452.07
hash_VARCHAR##xxhash64                                      1.97ms    508.80

After:

============================================================================
[...]hmarks/ExpressionBenchmarkBuilder.cpp     relative  time/iter   iters/s
============================================================================
hash_BIGINT##hash                                         883.61us     1.13K
hash_BIGINT##xxhash64                                     927.77us     1.08K
----------------------------------------------------------------------------
hash_INTEGER##hash                                        808.87us     1.24K
hash_INTEGER##xxhash64                                    815.35us     1.23K
hash_VARCHAR##hash                                          2.67ms    374.56
hash_VARCHAR##xxhash64                                      2.49ms    402.09

@FelixYBW

============================================================================
[...]hmarks/ExpressionBenchmarkBuilder.cpp     relative  time/iter   iters/s
============================================================================
hash_BIGINT##hash                                         883.61us     1.13K
hash_BIGINT##xxhash64                                     927.77us     1.08K
----------------------------------------------------------------------------
hash_INTEGER##hash                                        808.87us     1.24K
hash_INTEGER##xxhash64                                    815.35us     1.23K
hash_VARCHAR##hash                                          2.67ms    374.56
hash_VARCHAR##xxhash64                                      2.49ms    402.09

That's expected: the old solution inlined the hash function into the loop, which takes only about 18 cycles for a uint64_t input. The new solution uses a virtual function call and possibly one more direct function call, and any extra cycle in the loop matters.

To optimize this, we could treat primitive types differently from complex types, which adds code complexity. @marin-ma Let's check whether it's possible to replace the virtual function call with a simpler indirect function call.

As data size increases and cache misses grow, the gap will shrink because memory latency hides the extra cycles.

Currently the hash calculation time is very small compared to split, compression, and shuffle write in Gluten, so we may not need to put more effort into these optimizations.

@FelixYBW

@marin-ma Did you test TPC-H in Gluten using this PR? If not, let's run a test and see what the perf loss is.

auto hasher = createVectorHasher<HashClass>(*decoded);
selected->applyToSelected([&](auto row) {
result.set(row, hasher->hashNotNullAt(row, result.valueAt(row)));
});


@marin-ma Looks like we only need to keep this switch-case statement for primitive types. Let's keep it to address @mbasmanova's perf concern.

Let's test TPC-H first.

@FelixYBW

FelixYBW commented May 16, 2024

@mbasmanova We can't reproduce your perf numbers. The main branch is indeed <10% faster than this PR, and with a larger dataset the gap becomes even smaller. We also tested TPC-H and observed no perf difference.

The root cause is the extra indirect virtual function call.

                               main                this PR
cycles                         11,962,246,551      11,963,496,863
instructions                   69,544,481,778      67,421,807,341
br_inst_retired.indirect       245,381             540,407,083
br_inst_retired.near_call      1,224,994,398       1,680,520,291
br_inst_retired.near_return    1,224,104,407       1,681,002,036

But we did find one optimization opportunity for multiple columns; @marin-ma will update.

Update: the optimization doesn't work. @mbasmanova Any more comments?

@FelixYBW

@mbasmanova @pedroerp

Here is the performance of the simple function vs. the vector function. The simple function can cost up to 3.5x more time than the vector function, because the simple function has to check for nulls row by row in the loop, and since nulls occur randomly, this leads to a high branch misprediction ratio.

Given the big performance difference, I don't think we should use the simple function.

[Chart: simple vs. vector function time across null ratios]

Here is the branch misprediction chart; it's remarkable that Velox can maintain a flat branch misprediction ratio across null ratios in vector processing.

[Chart: branch misprediction ratio across null ratios]

@mbasmanova
Contributor

@FelixYBW Binwei, would you share the implementation of the simple function? I'd like to understand what you refer to regarding "has to check the null row by row in the loop". I would assume this to be the same between simple and vector functions.

@FelixYBW

@FelixYBW Binwei, would you share the implementation of the simple function? I'd like to understand what you refer to regarding "has to check the null row by row in the loop". I would assume this to be the same between simple and vector functions.

For this function, we need to pass multiple columns and set defaultNullBehavior=false. Then the code looks like below; the null check is in the inner loop.

for (int r = 0; r < rowSize; r++) {
  seed = 0;
  for (int c = 0; c < colSize; c++) {
    auto col = rowVec->childAt(c);
    // Null check in the inner loop: mispredicts when nulls are random.
    if (!col->isNullAt(r)) {
      seed = hash(col->valueAt(r), seed);
    }
  }
  rst[r] = seed;
}
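For contrast, the vector-function approach described in this thread can be sketched as follows: nulls are filtered once per column, so the hot loop carries no data-dependent branch. The names (hashOne, hashColumnBulk) and the trivial mixing function are illustrative stand-ins, not Velox's actual API.

```cpp
#include <cstdint>
#include <vector>

// Stand-in for the real Murmur3/xxHash64 mixing step.
int32_t hashOne(int32_t value, int32_t seed) {
  return seed * 31 + value;
}

// Vector-function style: deselect nulls up front (in Velox this is done
// with SelectivityVector::deselectNulls), then hash in a tight loop.
// `result` holds the running seed per row and is updated in place.
void hashColumnBulk(
    const std::vector<int32_t>& values,
    const std::vector<bool>& isNull,
    std::vector<int32_t>& result) {
  // Pass 1: bulk null handling -- collect non-null row numbers once.
  std::vector<int32_t> selected;
  for (int32_t row = 0; row < static_cast<int32_t>(values.size()); ++row) {
    if (!isNull[row]) {
      selected.push_back(row);
    }
  }
  // Pass 2: hot loop over selected rows; its branch pattern no longer
  // depends on the randomness of nulls, keeping misprediction flat.
  for (int32_t row : selected) {
    result[row] = hashOne(values[row], result[row]);
  }
}
```

Null rows simply keep their incoming seed, matching Spark's behavior of skipping null fields in the hash accumulation.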

@mbasmanova
Contributor

@FelixYBW Binwei, thank you for explaining. I believe you are saying that VectorFunction handles nulls in-bulk and that gives it 3.5x perf boost over simple function. Is this right? It would be nice to update the PR description to explain the choice of implementation, e.g. VectorFunction vs. Simple Function, virtual function call per-row vs. type-switch per row. Thanks.

  exec::DecodedArgs decodedArgs(rows, args, context);
  for (auto i = hashIdx; i < args.size(); i++) {
    auto decoded = decodedArgs.at(i);
    const SelectivityVector* selected = &rows;
    if (args[i]->mayHaveNulls()) {
      *selectedMinusNulls.get(rows.end()) = rows;
      selectedMinusNulls->deselectNulls(
          decoded->nulls(&rows), rows.begin(), rows.end());
      selected = selectedMinusNulls.get();
    }

    auto hasher = createVectorHasher<HashClass>(*decoded);
    selected->applyToSelected([&](auto row) {
      result.set(row, hasher->hashNotNullAt(row, result.valueAt(row)));
    });

@mbasmanova mbasmanova added the ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall label May 21, 2024
@FelixYBW

@FelixYBW Binwei, thank you for explaining. I believe you are saying that VectorFunction handles nulls in-bulk and that gives it 3.5x perf boost over simple function. Is this right? It would be nice to update the PR description to explain the choice of implementation, e.g. VectorFunction vs. Simple Function, virtual function call per-row vs. type-switch per row. Thanks.

Yes. I will take a look at how the in-bulk null handling decreases branch misprediction so much. If so, we should always use a VectorFunction when a function needs defaultNullBehavior=false.

@marin-ma marin-ma force-pushed the sparksql-complex-hash branch from 2f6bafd to 6a270b3 on May 22, 2024 01:46
@marin-ma
Contributor Author

@mbasmanova Updated the PR description and rebased. PTAL. Thanks!

@facebook-github-bot
Contributor

@pedroerp has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@pedroerp merged this pull request in 066a72f.


Conbench analyzed the 1 benchmark run on commit 066a72fd.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

Joe-Abraham pushed a commit to Joe-Abraham/velox that referenced this pull request Jun 7, 2024
…kincubator#9414)

Summary:
Currently, sparksql hash functions only support primitive types.
This patch adds the implementation for complex types, including array, map and row.

The expected results in UT are obtained from spark's output.

Spark's implementation
https://github.com/apache/spark/blob/a2b7050e0fc5db6ac98db57309e4737acd26bf3a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala#L536-L609

To support hashing for complex types and align with Spark's implementation,
this patch uses a per-row virtual function call and the function is implemented
as vector function rather than simple function.
Below are some notes from the benchmark results:

Virtual function call per-row vs. type-switch per row:
The virtual function call performs 15% better due to having 20% fewer instructions.
The switch statement involves more branch instructions but both methods have a
similar branch misprediction rate of 2.8%. The switch statement doesn't show
higher branch misprediction because its fixed pattern allows the BPU to handle it
effectively. However, if the schema becomes very complex and exceeds the BPU's
history track buffer (currently at 1000 levels), the misprediction rate may increase.

VectorFunction vs. Simple Function:
Since the function doesn't apply default null behavior, null judgment for each
field occurs within the call per row when using a simple function.
In contrast, a vector function first filters the null values per column, avoiding
null judgments in the top-level loop.
By evaluating the implementation across all null ratios for simple/vector functions,
we observed that the simpler function can take up to 3.5 times longer than the vector
function. Checking for null values row by row within the loop can lead to a high
branch misprediction ratio due to the randomness of null values, while vector function
can maintain a consistent branch misprediction ratio across all null ratios in vector
processes.

Pull Request resolved: facebookincubator#9414

Reviewed By: mbasmanova

Differential Revision: D56783038

Pulled By: pedroerp

fbshipit-source-id: 0238f0e88f7f395c41e976003a138cddba3bd093