[ISSUE-135][FOLLOWUP][Improvement][AQE] Assign adjacent partitions to the same ShuffleServer #307
Conversation
Spark usually starts tasks one by one. If we have 700 reduce tasks and 7 shuffle servers, tasks 0 - 99 usually start first, and reduce partitions 0 - 99 will all be allocated to shuffle server 0. Won't that put too much pressure on shuffle server 0?
Good idea.
We should allocate n continuous reduce partitions to one shuffle server to avoid too many tasks reading from the same shuffle server. If we have 700 reduce tasks and our concurrency is 100, tasks 0 - 99 usually start first. If we allocate 10 continuous reduce partitions per shuffle server, shuffle server 0 will get reduce partitions [0, 9] and [70, 79], shuffle server 1 will get [10, 19] and [80, 89], shuffle server 2 will get [20, 29] and [90, 99], and so on.
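The grouped round-robin idea described above can be sketched as follows. This is only an illustration of the scheme from the comment, not the actual Uniffle assignment code; the class and method names are made up.

```java
public class ContinuousAssignSketch {
    // Allocate groups of `groupSize` consecutive reduce partitions to
    // servers round-robin, so that concurrently running reduce tasks
    // spread their reads across different shuffle servers.
    static int[] assign(int totalPartitions, int serverNum, int groupSize) {
        int[] serverOf = new int[totalPartitions];
        for (int p = 0; p < totalPartitions; p++) {
            serverOf[p] = (p / groupSize) % serverNum;
        }
        return serverOf;
    }

    public static void main(String[] args) {
        int[] s = assign(700, 7, 10);
        // Partitions [0, 9] land on server 0, [10, 19] on server 1, ...
        assert s[0] == 0 && s[9] == 0;
        assert s[10] == 1 && s[19] == 1;
        // ... and [70, 79] wrap back around to server 0.
        assert s[70] == 0 && s[79] == 0;
    }
}
```

With this layout the first 100 concurrently running tasks hit all 7 servers instead of only server 0.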
How do we decide how many contiguous partitions should be allocated?
In the above example, assigning 7 consecutive partitions might be best, for example [0, 6] assigned to server 1 and [7, 13] assigned to server 2, but we do not know the number of concurrent tasks.
We can read Spark's executor configuration, but the running app may not actually be able to allocate that many resources.
If dynamic allocation is used, we can't know the number of executors. So I think we can add a configuration first and let the user set that value. Similarly, ByteDance's Cloud Shuffle Service derives the task concurrency through an empirical formula; see https://github.com/bytedance/CloudShuffleService/blob/ef0ffb3f43f9f6e96af49629aed2a6ce61a6a2ab/spark-shuffle-manager-2/src/main/scala/org/apache/spark/shuffle/css/CssShuffleManager.scala#L64
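A rough illustration of such an estimate from Spark conf values follows. The heuristic itself is an assumption for illustration, not the actual CSS or Uniffle formula (see the linked Scala source for the real one); the Spark config keys are standard Spark properties.

```java
import java.util.Map;

public class TaskConcurrencyEstimate {
    // Illustrative heuristic only: estimate the number of concurrently
    // running tasks from the Spark configuration. With dynamic allocation
    // the real executor count is unknown, so use the configured maximum.
    static int estimate(Map<String, String> conf) {
        int cores = Integer.parseInt(
            conf.getOrDefault("spark.executor.cores", "1"));
        boolean dynamic = Boolean.parseBoolean(
            conf.getOrDefault("spark.dynamicAllocation.enabled", "false"));
        if (dynamic) {
            int maxExecutors = Integer.parseInt(
                conf.getOrDefault("spark.dynamicAllocation.maxExecutors", "1"));
            return maxExecutors * cores;
        }
        int instances = Integer.parseInt(
            conf.getOrDefault("spark.executor.instances", "1"));
        return instances * cores;
    }

    public static void main(String[] args) {
        // 25 executors * 4 cores = up to 100 concurrent tasks.
        assert estimate(Map.of("spark.executor.instances", "25",
                               "spark.executor.cores", "4")) == 100;
    }
}
```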
Maybe there is no difference between 7 and 10. We need some performance tests here.
Codecov Report
@@ Coverage Diff @@
## master #307 +/- ##
============================================
+ Coverage 60.73% 61.01% +0.28%
- Complexity 1462 1489 +27
============================================
Files 180 185 +5
Lines 9229 9314 +85
Branches 887 900 +13
============================================
+ Hits 5605 5683 +78
- Misses 3325 3326 +1
- Partials 299 305 +6
I will run some performance tests later.
Yes. This optimization has been applied in our internal Uniffle, and it works well.
docs/coordinator_guide.md
Outdated
@@ -101,6 +101,7 @@ This document will introduce how to deploy Uniffle coordinators.
|rss.coordinator.remote.storage.io.sample.access.times|3|The number of times to read and write HDFS files|
|rss.coordinator.startup-silent-period.enabled|false|Enable the startup-silent-period to reject the assignment requests for avoiding partial assignments. To avoid service interruption, this mechanism is disabled by default. Especially it's recommended to use in coordinator HA mode when restarting single coordinator.|
|rss.coordinator.startup-silent-period.duration|20000|The waiting duration(ms) when conf of rss.coordinator.startup-silent-period.enabled is enabled.|
|rss.coordinator.select.partition.strategy|AbstractAssignmentStrategy.SelectPartitionStrategyName.ROUND|There are two strategies for selecting partitions: ROUND and CONTINUOUS. ROUND assigns partitions to ShuffleServers round-robin, while CONTINUOUS tries to assign consecutive partitions to the same ShuffleServer.|
Could we tell users that this config option can optimize the performance of AQE?
@@ -103,6 +103,8 @@ These configurations are shared by all types of clients.
|<client_type>.rss.client.io.compression.codec|lz4|The compression codec is used to compress the shuffle data. Default codec is `lz4`. Other options are `ZSTD` and `SNAPPY`.|
|<client_type>.rss.client.io.compression.zstd.level|3|The zstd compression level, the default level is 3|
|<client_type>.rss.client.shuffle.data.distribution.type|NORMAL|The type of partition shuffle data distribution, including normal and local_order. The default value is normal. Now this config is only valid in Spark3.x|
Could we tell users that this config option can optimize the performance of AQE?
We had better also modify the document https://github.com/apache/incubator-uniffle/blob/master/docs/client_guide.md#support-spark-aqe
I reused the environment in #190 to compare the performance before and after issue#136, as well as the improvement after merging this PR.

Environment: Shuffle Server Num: 5. We measure the performance of get_shuffle_result by the following metrics:

Test Results
Before issue_136
After issue_136
After this pr

Summary: after this PR, the number of interface requests is reduced by 79.9%, and the total time is reduced by 81.8%.
@@ -53,6 +54,59 @@ public static int nextIdx(int idx, int size) {
    return idx;
  }

  /**
   * Assign multiple adjacent partitionRanges to several servers.
   * Suppose totalPartitionNum=52, partitionNumPerRange=2, serverNum=5, estimateTaskConcurrency=20
partitionNumPerRange should be 1. We will remove range partitions in the future, and we can use that to simplify the logic. The current implementation is OK for me, too.
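The javadoc example above (totalPartitionNum=52, partitionNumPerRange=2, serverNum=5, estimateTaskConcurrency=20) can be sketched as follows. This is a simplified illustration, not the actual CoordinatorUtils#generateRangesGroup code; in particular, deriving the group size as estimateTaskConcurrency / serverNum is an assumption.

```java
public class AdjacentRangeSketch {
    // Simplified sketch: with totalPartitionNum=52 and partitionNumPerRange=2
    // there are 26 partition ranges; with serverNum=5 and
    // estimateTaskConcurrency=20, assign groups of 20/5 = 4 adjacent ranges
    // to each server in turn.
    static int[] assignRanges(int totalPartitionNum, int partitionNumPerRange,
                              int serverNum, int estimateTaskConcurrency) {
        int rangeNum =
            (totalPartitionNum + partitionNumPerRange - 1) / partitionNumPerRange;
        int groupSize = Math.max(1, estimateTaskConcurrency / serverNum);
        int[] serverOf = new int[rangeNum];
        for (int r = 0; r < rangeNum; r++) {
            serverOf[r] = (r / groupSize) % serverNum;
        }
        return serverOf;
    }

    public static void main(String[] args) {
        int[] s = assignRanges(52, 2, 5, 20);
        // Ranges 0-3 (partitions 0-7) go to server 0, ranges 4-7 to server 1, ...
        assert s[0] == 0 && s[3] == 0;
        assert s[4] == 1;
        // Ranges 20-23 wrap back to server 0; the tail ranges 24-25 go to server 1.
        assert s[20] == 0 && s[24] == 1;
    }
}
```

A client reading adjacent partitions then needs to contact far fewer distinct servers per getShuffleResultForMultiPart call.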
@@ -110,6 +110,7 @@ public void getShuffleAssignments(
    final int replica = request.getDataReplica();
    final Set<String> requiredTags = Sets.newHashSet(request.getRequireTagsList());
    final int requiredShuffleServerNumber = request.getAssignmentShuffleServerNumber();
    final int estimateTaskConcurrency = request.getEstimateTaskConcurrency();
What will the value be if an old client requests the server?
The value will be 0.
Will it cause an unexpected result? If not, this feature will be backward compatible.
If the estimateTaskConcurrency value is 0 and rss.coordinator.select.partition.strategy=CONTINUOUS, the assignment will be similar to the ROUND strategy; you can check CoordinatorUtils#generateRangesGroup.
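That fallback can be pictured with a small sketch. The shape of the logic here is an assumption for illustration, not the actual CoordinatorUtils#generateRangesGroup code.

```java
public class ContinuousFallbackSketch {
    // Sketch: derive how many adjacent partition ranges one server receives
    // from the estimated task concurrency. An old client sends 0, so the
    // group size degenerates to 1, i.e. a plain round-robin (ROUND-like),
    // which keeps the CONTINUOUS strategy backward compatible.
    static int groupSize(int estimateTaskConcurrency, int serverNum) {
        return Math.max(1, estimateTaskConcurrency / serverNum);
    }

    public static void main(String[] args) {
        assert groupSize(0, 5) == 1;   // old client: behaves like ROUND
        assert groupSize(20, 5) == 4;  // 4 adjacent ranges per server
    }
}
```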
OK.
Maybe we can apply this feature to our community version and estimate the number of ShuffleServers needed according to the number of concurrent tasks.
Would you like to contribute this feature and let @zuston help you review it?
I'm glad to review this feature if you want @leixm
Thank you, I will raise a PR for this feature.
What changes were proposed in this pull request?
Following issue#136, allocate adjacent partitions to the same ShuffleServer. When the client calls getShuffleResultForMultiPart, the number of ShuffleServer requests is minimized.
Why are the changes needed?
It brings some performance improvement.
Does this PR introduce any user-facing change?
No
How was this patch tested?
UT