-
Notifications
You must be signed in to change notification settings - Fork 153
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[#731] feat(spark): Make blockid layout configurable for Spark clients #1528
Conversation
69e87b8
to
0fc6097
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #1528 +/- ##
============================================
+ Coverage 53.63% 54.60% +0.96%
- Complexity 2822 2837 +15
============================================
Files 436 417 -19
Lines 24549 22295 -2254
Branches 2080 2094 +14
============================================
- Hits 13167 12174 -993
+ Misses 10562 9361 -1201
+ Partials 820 760 -60 ☔ View full report in Codecov by Sentry. |
0fc6097
to
e3e9a66
Compare
9fafb5d
to
e499311
Compare
Should I make Spark2, TEZ and MR clients configurable in this PR as well, our should that better be done in a separate PR? |
We can do it in an another pr. |
1f3e6c7
to
6bd203b
Compare
Also moves bit logic from BlockId into BlockIdLayout
3883937
to
262c343
Compare
This is ready for review. |
@@ -53,17 +53,17 @@ | |||
import org.apache.uniffle.client.factory.ShuffleClientFactory; | |||
import org.apache.uniffle.common.ShuffleServerInfo; | |||
import org.apache.uniffle.common.exception.RssException; | |||
import org.apache.uniffle.common.util.BlockId; | |||
import org.apache.uniffle.common.util.BlockIdLayout; | |||
import org.apache.uniffle.common.util.Constants; | |||
|
|||
public class RssTezUtils { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add more cases for Tez?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Logic is unchanged and layout is not configurable for TEZ, so there is nothing new that needs to be tested.
If existing tests are incomplete, I'd suggest someone who has a deeper understanding of the internals of TEZ task attempt ids than I have adds more coverage in a separate PR. I am happy to adjust them afterwards in this PR. Alternatively, the coverage could be increased when adding configurable block id layout for TEZ in a follow up PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Logic is unchanged and layout is not configurable for TEZ, so there is nothing new that needs to be tested.
If existing tests are incomplete, I'd suggest someone who has a deeper understanding of the internals of TEZ task attempt ids than I have adds more coverage in a separate PR. I am happy to adjust them afterwards in this PR. Alternatively, the coverage could be increased when adding configurable block id layout for TEZ in a follow up PR.
OK, I got it . I just feel that you changed the logic of Tez from the title.
@@ -528,6 +529,11 @@ public void getShuffleResult( | |||
String appId = request.getAppId(); | |||
int shuffleId = request.getShuffleId(); | |||
int partitionId = request.getPartitionId(); | |||
BlockIdLayout blockIdLayout = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you consider some compatible issues? If older client connects a new server, the blockIdLayout may be null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could fall back to BlockIdLayout.DEFAULT
here, but that would only work for clients that use the original values in Constants.java
. For cluster setups where Constants.java
has been changed, there is no way to pick the right block id layout here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
request.getBlockIdLayout
may return null if we use an old client to connect the shuffle server. The code will throw a NullPointerException.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, what do you suggest? Use BlockIdLayout.DEFAULT
if request.getBlockIdLayout == null
and mark this a breaking change for users that deployed a cluster with modified bit lengths in Constants.java
? How can we make this compatible for those users?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can use BlockIdLayout.DEFAULT
if layout is null. We can't guarantee the compatibility if users modify our code in the project.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this, done!
+ "This may happen when the data is flushing, please ignore.", | ||
totalLength, | ||
dataFileLen, | ||
BlockId.getPartitionId(blockId)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we remove partitionId here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The actual block id layout is not available here, and wiring it through from the client would mean to add it to many method signature calls. All this only for logging.
At least we should log the long
block id here so it can be de-cyphered manually if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok.
LGTM, @zuston Could you take an another look? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall, left a comment for the compatible problem.
And for the further optimization, the blockIds stored on the shuffle-server for me is not a good design, which will bring too much burden for shuffle-server, especially for many spark jobs, I have raise another issue to track this. #1538
Anyway, the above suggestion is not related with this PR. Thanks for your effort
server/src/main/java/org/apache/uniffle/server/ShuffleServerGrpcService.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks for your contribution! @EnricoMi
What about the |
Make sense for me. WDYT? @jerqi |
OK for me. |
Let me do that in #1566. |
What changes were proposed in this pull request?
Make bit-lengths in block id (block id layout) configurable through dynamic client config from coordinator or client config. Block id layout can be created from RssConf, or where that is not available, is being passed around.
Constants
):Constants
).Default values have moved from
Constants
intoBlockIdLayout
. The following replacements exist:PARTITION_ID_MAX_LENGTH
:BlockIdLayout.DEFAULT.partitionIdBits
TASK_ATTEMPT_ID_MAX_LENGTH
:BlockIdLayout.DEFAULT.taskAttemptIdBits
ATOMIC_INT_MAX_LENGTH
:BlockIdLayout.DEFAULT.sequenceNoBits
Why are the changes needed?
The bit-lengths of sequence number, partition id and task attempt id in block id are defined in
common/src/main/java/org/apache/uniffle/common/util/Constants.java
. Changing these requires recompiling and redeploying the project. Making this configurable incoordinator.conf
,server.conf
or client-side would very useful.Also see #1512, #749.
Fixes #731.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Tests.