-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-32802][SQL] Avoid using SpecificInternalRow in RunLengthEncoding#Encoder #29654
Conversation
Test build #128322 has finished for PR 29654 at commit
|
Test build #128422 has finished for PR 29654 at commit
|
.../src/main/scala/org/apache/spark/sql/execution/columnar/compression/compressionSchemes.scala
Outdated
Show resolved
Hide resolved
.../src/main/scala/org/apache/spark/sql/execution/columnar/compression/compressionSchemes.scala
Show resolved
Hide resolved
Test build #128432 has finished for PR 29654 at commit
|
@maropu could you review this PR? thanks! |
The performance gain and the code changes look good to me. It would be nice to have more reviews. cc: @rednaxelafx @kiszk @srowen |
Have you checked the performance on jdk8, too? Could you update the golden files?
|
I don't know enough to review the logic but it looks plausible. |
if (lastValue.isNullAt(0)) { | ||
columnType.copyField(row, ordinal, lastValue, 0) | ||
if (lastValue == null) { | ||
lastValue = columnType.clone(columnType.getField(row, ordinal)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need clone()
here? While I guess that this is for UTF8String
, lastValue
is used only for keeping a value. Thus, it is ok with just keeping the reference for UTF8String
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kiszk Yes it is because UTF8String
. We reuse InternalRow
during each iteration and since UTF8String
is backed by the same memory region in the InternalRow
, it will be updated each time as we load a new row, which is not correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it, thank you for your clarification.
Looks good. I have one question. |
Haven't checked jdk8. Let me do that and update the golden files. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc: @dongjoon-hyun
cc @cloud-fan |
Thanks. I think the CI failure is not related. Probably it is caused by #29640 |
Wait, where do you see the CI failure? |
@srowen I see this in the "Python lint" section, and this happened to two of my PRs which do not touch Python at all. I think it is just a simple fix. Let me come up with a PR quickly. |
OK right I see it locally anyway. I can submit a fix, I have the PR now. |
Oops. I just filed #29733 |
Test build #128582 has finished for PR 29654 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM. Thank you, @sunchao and all.
Merged to master for Apache Spark 3.1.0 on December 2020.
cc @cloud-fan and @gatorsmile
@@ -182,8 +181,7 @@ private[columnar] case object RunLengthEncoding extends CompressionScheme { | |||
private var _uncompressedSize = 0 | |||
private var _compressedSize = 0 | |||
|
|||
// Using `MutableRow` to store the last value to avoid boxing/unboxing cost. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you mean the overhead of boxing is smaller than the overhead of indirections introduced by SpecificInternalRow
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The benchmark result does. Please let us know if we need other validation, @cloud-fan .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with it, but it's better to spell out this key finding in the PR description, so that people don't need to infer it from the benchmark results by themselves.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, I also think this finding looks interesting. I'm not sure that this rule (the overhead of boxing is smaller than the overhead of indirections
) always hold on any JVM implementations, but I think its importatnt to notice this performance property for future developement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the comments @cloud-fan @dongjoon-hyun and @maropu , and yes my bad for not putting enough info in the PR description. The comment is pretty old and I'm not sure what has changed over the years. By examine flamegraph from a test run I observed boxing both before and after the change:
Before:
After:
Even though the time spent on the boxing part is noticeably less after the change (it could have sth to do with my sample size though).
Perhaps we can open a JIRA to investigate this further? we could have more room for improvement if we can completely eliminate the boxing/unboxing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sunchao It is hard to guess the difference of boxing between two flamegraphs, can you show the percentage of boxing in each flamegraph?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah sorry @viirya just noticed you were asking for percentage. I calculated. Before the change it was 2.76% and after it is 1.31%.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is nice :-)
…ng#Encoder ### What changes were proposed in this pull request? Currently `RunLengthEncoding#Encoder` uses `SpecificInternalRow` as a holder for the current value when calculating compression stats and doing the actual compression. It calls `ColumnType.copyField` and `ColumnType.getField` on the internal row which incurs extra cost comparing to directly operating on the internal type. This proposes to replace the `SpecificInternalRow` with `T#InternalType` to avoid the extra cost. ### Why are the changes needed? Operating on `SpecificInternalRow` carries certain cost and negatively impact performance when using `RunLengthEncoding` for compression. With the change I see some improvements through `CompressionSchemeBenchmark`: ```diff Intel(R) Core(TM) i9-9880H CPU 2.30GHz BOOLEAN Encode: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -PassThrough(1.000) 1 1 0 51957.0 0.0 1.0X -RunLengthEncoding(2.502) 549 555 9 122.2 8.2 0.0X -BooleanBitSet(0.125) 296 301 3 226.6 4.4 0.0X +PassThrough(1.000) 2 2 0 42985.4 0.0 1.0X +RunLengthEncoding(2.517) 487 500 10 137.7 7.3 0.0X +BooleanBitSet(0.125) 348 353 4 192.8 5.2 0.0X OpenJDK 64-Bit Server VM 11.0.8+10-LTS on Mac OS X 10.15.5 Intel(R) Core(TM) i9-9880H CPU 2.30GHz SHORT Encode (Lower Skew): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -PassThrough(1.000) 3 3 0 22779.9 0.0 1.0X -RunLengthEncoding(1.520) 1186 1192 9 56.6 17.7 0.0X +PassThrough(1.000) 3 4 0 21216.6 0.0 1.0X +RunLengthEncoding(1.493) 882 931 50 76.1 13.1 0.0X OpenJDK 64-Bit Server VM 11.0.8+10-LTS on Mac OS X 10.15.5 Intel(R) Core(TM) i9-9880H CPU 2.30GHz SHORT Encode (Higher Skew): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -PassThrough(1.000) 3 4 0 21352.2 0.0 1.0X -RunLengthEncoding(2.009) 1173 1175 3 57.2 17.5 0.0X +PassThrough(1.000) 3 3 0 22388.6 0.0 1.0X +RunLengthEncoding(2.015) 924 941 23 72.6 13.8 0.0X OpenJDK 64-Bit Server VM 11.0.8+10-LTS on Mac OS X 10.15.5 Intel(R) Core(TM) i9-9880H CPU 2.30GHz INT Encode (Lower Skew): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -PassThrough(1.000) 9 10 1 7410.1 0.1 1.0X -RunLengthEncoding(1.000) 1499 1502 4 44.8 22.3 0.0X -DictionaryEncoding(0.500) 621 630 11 108.0 9.3 0.0X -IntDelta(0.250) 134 149 10 502.0 2.0 0.1X +PassThrough(1.000) 9 10 1 7575.9 0.1 1.0X +RunLengthEncoding(1.002) 952 966 12 70.5 14.2 0.0X +DictionaryEncoding(0.500) 561 567 6 119.7 8.4 0.0X +IntDelta(0.250) 129 134 3 521.9 1.9 0.1X OpenJDK 64-Bit Server VM 11.0.8+10-LTS on Mac OS X 10.15.5 Intel(R) Core(TM) i9-9880H CPU 2.30GHz INT Encode (Higher Skew): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -PassThrough(1.000) 9 10 1 7668.3 0.1 1.0X -RunLengthEncoding(1.332) 1561 1685 175 43.0 23.3 0.0X -DictionaryEncoding(0.501) 616 642 21 108.9 9.2 0.0X -IntDelta(0.250) 126 131 2 533.4 1.9 0.1X +PassThrough(1.000) 9 10 1 7494.1 0.1 1.0X +RunLengthEncoding(1.336) 974 987 13 68.9 14.5 0.0X +DictionaryEncoding(0.501) 709 719 10 94.6 10.6 0.0X +IntDelta(0.250) 127 132 4 528.4 1.9 0.1X OpenJDK 64-Bit Server VM 11.0.8+10-LTS on Mac OS X 10.15.5 Intel(R) Core(TM) i9-9880H CPU 2.30GHz LONG Encode (Lower Skew): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -PassThrough(1.000) 18 19 1 3803.0 0.3 1.0X -RunLengthEncoding(0.754) 1526 1540 20 44.0 22.7 0.0X -DictionaryEncoding(0.250) 735 759 33 91.3 11.0 0.0X -LongDelta(0.125) 126 129 2 530.8 1.9 0.1X +PassThrough(1.000) 19 21 1 3543.5 0.3 1.0X +RunLengthEncoding(0.747) 1049 1058 12 63.9 15.6 0.0X +DictionaryEncoding(0.250) 620 634 17 108.2 9.2 0.0X +LongDelta(0.125) 129 132 2 520.1 1.9 0.1X OpenJDK 64-Bit Server VM 11.0.8+10-LTS on Mac OS X 10.15.5 Intel(R) Core(TM) i9-9880H CPU 2.30GHz LONG Encode (Higher Skew): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -PassThrough(1.000) 18 20 1 3705.4 0.3 1.0X -RunLengthEncoding(1.002) 1665 1669 6 40.3 24.8 0.0X -DictionaryEncoding(0.251) 890 901 11 75.4 13.3 0.0X -LongDelta(0.125) 125 130 3 537.2 1.9 0.1X +PassThrough(1.000) 18 20 2 3726.8 0.3 1.0X +RunLengthEncoding(0.999) 1076 1077 2 62.4 16.0 0.0X +DictionaryEncoding(0.251) 904 919 19 74.3 13.5 0.0X +LongDelta(0.125) 125 131 4 536.5 1.9 0.1X OpenJDK 64-Bit Server VM 11.0.8+10-LTS on Mac OS X 10.15.5 Intel(R) Core(TM) i9-9880H CPU 2.30GHz STRING Encode: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ -PassThrough(1.000) 27 30 2 2497.1 0.4 1.0X -RunLengthEncoding(0.892) 3443 3587 204 19.5 51.3 0.0X -DictionaryEncoding(0.167) 2286 2290 6 29.4 34.1 0.0X +PassThrough(1.000) 28 31 2 2430.2 0.4 1.0X +RunLengthEncoding(0.889) 1798 1800 3 37.3 26.8 0.0X +DictionaryEncoding(0.167) 1956 1959 4 34.3 29.1 0.0X ``` In the above diff, new results are with changes in this PR. It can be seen that encoding performance has improved quite a lot especially for string type. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Relies on existing unit tests. Closes apache#29654 from sunchao/SPARK-32802. Authored-by: Chao Sun <sunchao@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
What changes were proposed in this pull request?
Currently
RunLengthEncoding#Encoder
usesSpecificInternalRow
as a holder for the current value when calculating compression stats and doing the actual compression. It callsColumnType.copyField
andColumnType.getField
on the internal row which incurs extra cost comparing to directly operating on the internal type. This proposes to replace theSpecificInternalRow
withT#InternalType
to avoid the extra cost.Why are the changes needed?
Operating on
SpecificInternalRow
carries certain cost and negatively impact performance when usingRunLengthEncoding
for compression.With the change I see some improvements through
CompressionSchemeBenchmark
:In the above diff, new results are with changes in this PR. It can be seen that encoding performance has improved quite a lot especially for string type.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Relies on existing unit tests.