[Improvement] Skip blocks when read from memory #294
Conversation
Codecov Report
@@ Coverage Diff @@
## master #294 +/- ##
============================================
- Coverage 59.24% 58.71% -0.53%
- Complexity 1456 1614 +158
============================================
Files 180 194 +14
Lines 9631 11024 +1393
Branches 835 971 +136
============================================
+ Hits 5706 6473 +767
- Misses 3577 4167 +590
- Partials 348 384 +36
Thanks for proposing this PR; overall it will benefit big-memory shuffle servers the most. After a brief look, I have a question about whether the memory size of …
Good idea. I will try.
server/src/main/java/org/apache/uniffle/server/buffer/ShuffleBuffer.java (review comment, outdated, resolved)
I have two suggestions: …
A Bloom filter is not suitable here, because it can only guarantee that an element does not exist. And you can see …
Maybe we do not need precise skipping?
Besides, it is better to add [ISSUE-ID] to the title.
I think precise skipping is better. And I will send one time for a same …
Maybe we can pass the min and max blockId to replace the bitmap?
The client side already has precise skipping. The bitmap of all blockIds can still be very large under data skew.
1. I think min and max blockId is OK, but …
It seems that we don't need the processedBlock. We can reduce the expected blockIds range according to the processed blocks.
Got it.
But the final expected blockIds are also discontinuous. I'm going to use an array to store them, like [start1, end1, start2, end2], and if endN - startN is too small, I will remove that range to reduce the array's size.
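A minimal sketch of that range-array idea (hypothetical helper, not the PR's actual code): to stay safe for a skip filter it merges the ranges separated by the smallest gaps instead of dropping small ones, which over-approximates the expected set rather than losing expected blocks.

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: collapse a sorted array of expected blockIds into a
// flat range array [start1, end1, start2, end2, ...], then merge neighbouring
// ranges until the count fits a budget (assumes maxRanges >= 1).
public final class BlockIdRangeBuilder {
  public static long[] build(long[] sortedBlockIds, int maxRanges) {
    if (sortedBlockIds.length == 0) {
      return new long[0];
    }
    List<long[]> ranges = new ArrayList<>();
    long start = sortedBlockIds[0];
    long end = start;
    for (int i = 1; i < sortedBlockIds.length; i++) {
      if (sortedBlockIds[i] == end + 1) {
        end = sortedBlockIds[i]; // extend the current consecutive run
      } else {
        ranges.add(new long[] {start, end});
        start = end = sortedBlockIds[i];
      }
    }
    ranges.add(new long[] {start, end});
    // Merge the pair of neighbouring ranges with the smallest gap until
    // the range count fits the budget.
    while (ranges.size() > maxRanges) {
      int best = 1;
      for (int i = 2; i < ranges.size(); i++) {
        long gap = ranges.get(i)[0] - ranges.get(i - 1)[1];
        if (gap < ranges.get(best)[0] - ranges.get(best - 1)[1]) {
          best = i;
        }
      }
      ranges.get(best - 1)[1] = ranges.get(best)[1];
      ranges.remove(best);
    }
    long[] flat = new long[ranges.size() * 2];
    for (int i = 0; i < ranges.size(); i++) {
      flat[2 * i] = ranges.get(i)[0];
      flat[2 * i + 1] = ranges.get(i)[1];
    }
    return flat;
  }
}

Membership testing against the flat array is then a binary search over the even-indexed starts.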
We can limit the array size and try our best to filter out more of the data we have already processed.
Maybe we need some POC to verify the effect of each method.
This PR can also optimize AQE performance. @leixm Maybe you are interested.
Maybe we can support multiple filters. Users can choose the filter they like. MinMax may be good for the AQE situation. Bitmap may be good for multiple replicas. We need some extra tests.
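A rough shape of that "multiple filters" idea (illustrative names only; BLOCKID_RANGE is the strategy this PR actually adds, the other enum values and the interface are hypothetical):

public enum BlockSkipStrategy { NONE, BLOCKID_RANGE, TASK_BITMAP }

// Common predicate so a reader can plug in whichever filter fits its job.
interface BlockFilter {
  // Return true if the block may be needed by this reader.
  boolean accept(long blockId);
}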
I have a question: is the MinMax range the min/max of task ids, @jerqi, not of blockIds?
And why not directly use the … Do you mind if I pick up this ticket to improve the AQE skew performance? If you hope this could also support multiple replicas, you could go on. @xianjingfeng
Bitmap is OK. We have concerns about the size of the bitmap. It needs some tests.
The size of the taskIdsBitmap should be small. Actually, it only contains a limited set of task ids.
1,000,000 tasks will occupy about 125 KB of memory (10^6 bits ≈ 125 KB). If we use a blockBitmap instead, it may occupy several MB.
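A hedged sketch of such a taskIdBitmap filter, assuming the low bits of a blockId carry the taskAttemptId; TASK_ID_BITS and the class name are illustrative stand-ins, not Uniffle's actual constants:

import org.roaringbitmap.longlong.Roaring64NavigableMap;

public class TaskIdFilter {
  // Assumed bit width of the taskAttemptId field inside a blockId.
  private static final int TASK_ID_BITS = 21;

  private final Roaring64NavigableMap taskIdBitmap;

  public TaskIdFilter(Roaring64NavigableMap taskIdBitmap) {
    this.taskIdBitmap = taskIdBitmap;
  }

  // Keep a block only if it was written by one of this reader's tasks.
  public boolean accept(long blockId) {
    long taskAttemptId = blockId & ((1L << TASK_ID_BITS) - 1);
    return taskIdBitmap.contains(taskAttemptId);
  }
}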
It's OK for me. I think we could disable this taskIdBitmap filter when AQE is not in use. And especially for AQE, the number of taskIds for one reader should not be large. What do you think? @jerqi @xianjingfeng
How do we combine AQE and multiple replicas? Do we need to pass two filters?
client-spark/spark3/src/main/java/org/apache/spark/shuffle/reader/RssShuffleReader.java (review comment, outdated, resolved)
We just need to pass one of them. I think …
Feel free to do this. I have no plan to support a multiple-replicas filter.
@xianjingfeng A little question: are the min-max blockIds filter and the taskIdBitmap filter mutually exclusive?
Yes.
Could we add some performance tests for this feature with production jobs?
I will.
common/src/main/java/org/apache/uniffle/common/BlockSkipStrategy.java (review comment, outdated, resolved)
Could you modify the description of this PR? Does it only add a range filter strategy for AQE? Will multiple replicas be supported for filtering data in this PR?
Done.
It supports multiple replicas too.
}

@Override
public ShuffleDataResult readShuffleData() {
  if (BlockSkipStrategy.BLOCKID_RANGE.equals(blockSkipStrategy) && lastBlockId == Constants.INVALID_BLOCK_ID) {
Why do we check lastBlockId == Constants.INVALID_BLOCK_ID?
We only need to build the blockId range the first time the handler reads.
OK, got it.
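In other words, the guard in the diff above implements a lazy build; sketched here with hypothetical helper and field names (only the condition itself comes from the PR's code):

@Override
public ShuffleDataResult readShuffleData() {
  // lastBlockId == INVALID_BLOCK_ID means this handler has never read yet,
  // so derive the expected blockId ranges exactly once and reuse them later.
  if (BlockSkipStrategy.BLOCKID_RANGE.equals(blockSkipStrategy)
      && lastBlockId == Constants.INVALID_BLOCK_ID) {
    expectedBlockIdRanges = buildBlockIdRanges(); // hypothetical helper
  }
  // ... subsequent reads skip blocks outside expectedBlockIdRanges ...
}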
LGTM. @zuston Do you have any other suggestions?
Merged. Thanks all, @frankliee @zuston @xianjingfeng. @xianjingfeng Could you raise a follow-up PR to add some docs about this feature?
Yes.
incubator-uniffle/client/src/main/java/org/apache/uniffle/client/util/ClientUtils.java Lines 33 to 37 in ddf8384
Should we remove it or modify the blockId generation rule?
Let's remove it. We can use the taskBitmap as the replica filter. It's hard to modify the blockId generation rule; that would be an incompatible change.
Revert it directly, or keep some of the modification, such as …
I wonder why we put …
This reverts commit 55191c4.
What changes were proposed in this pull request?
Revert #294
Why are the changes needed?
BlockId is discontinuous, so BLOCKID_RANGE is not a good choice to filter memory data.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
No need.
Reduce the size of the RoaringBitmap. We should put the least frequently changing data in the higher bits, so that stored values stay clustered and compress well.
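A small experiment illustrating the point (assumptions only: a 20-bit task-id field and RoaringBitmap's Java 64-bit API; not project code). Clustered values compress into few containers, scattered ones don't:

import org.roaringbitmap.longlong.Roaring64NavigableMap;

public class BitLayoutDemo {
  public static void main(String[] args) {
    Roaring64NavigableMap seqLow = new Roaring64NavigableMap();
    Roaring64NavigableMap seqHigh = new Roaring64NavigableMap();
    long taskId = 42;
    for (long seq = 0; seq < 100_000; seq++) {
      seqLow.addLong((taskId << 20) | seq);   // fast-changing seq in low bits: dense values
      seqHigh.addLong((seq << 20) | taskId);  // fast-changing seq in high bits: scattered values
    }
    seqLow.runOptimize();
    seqHigh.runOptimize();
    System.out.println("seq in low bits:  " + seqLow.getLongSizeInBytes() + " bytes");
    System.out.println("seq in high bits: " + seqHigh.getLongSizeInBytes() + " bytes");
  }
}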
What changes were proposed in this pull request?
Skip blocks that are not in the expected blockId range when reading from memory.
Why are the changes needed?
1. If we use AQE, every task will read data from all partitions.
2. If the data on the first shuffle server is incomplete, we need to read from another server once #276 is merged.
Both of the above situations lead to reading redundant data from the shuffle server.
Does this PR introduce any user-facing change?
Set rss.client.read.block.skip.strategy to BLOCKID_RANGE.
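For a Spark job this would look like the following line (assuming the usual spark. prefix convention for RSS client configs; an illustration, not verified against the docs):

spark.rss.client.read.block.skip.strategy=BLOCKID_RANGE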
How was this patch tested?
Already added.