This repository has been archived by the owner on Sep 18, 2023. It is now read-only.
[NSE-857] Fill destination buffer by reducer #880
Merged
What changes were proposed in this pull request?
The solution is to create an offset array that lists the source offsets for each reducer, like below:
reducer0:
Then we can read the source column randomly and fill the destination column sequentially, one reducer after another. The source data should be small enough to fit in the L1/L2 cache, so that each source and destination cache line is read only once; otherwise the performance will be very bad. Currently the record batch size is 32K rows, so for a double column the size is 128K.
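The approach described above can be sketched roughly as follows. This is a minimal illustration, not the code from this patch: `ReducerOffsets`, `BuildReducerOffsets`, and `FillByReducer` are hypothetical names, and it assumes a single contiguous destination grouped by reducer rather than one buffer per reducer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical layout: row_offsets holds source row indices grouped by
// reducer; reducer_start[r] .. reducer_start[r + 1] delimits reducer r.
struct ReducerOffsets {
  std::vector<uint32_t> reducer_start;  // size: num_reducers + 1
  std::vector<uint32_t> row_offsets;    // size: number of rows
};

// Counting sort over partition ids: one pass to count rows per reducer,
// a prefix sum, then one pass to scatter the row indices.
ReducerOffsets BuildReducerOffsets(const std::vector<uint32_t>& partition_ids,
                                   uint32_t num_reducers) {
  ReducerOffsets out;
  out.reducer_start.assign(num_reducers + 1, 0);
  for (uint32_t pid : partition_ids) {
    assert(pid < num_reducers);
    ++out.reducer_start[pid + 1];
  }
  for (uint32_t r = 0; r < num_reducers; ++r) {
    out.reducer_start[r + 1] += out.reducer_start[r];
  }
  out.row_offsets.resize(partition_ids.size());
  std::vector<uint32_t> cursor(out.reducer_start.begin(),
                               out.reducer_start.end() - 1);
  for (std::size_t row = 0; row < partition_ids.size(); ++row) {
    out.row_offsets[cursor[partition_ids[row]]++] = static_cast<uint32_t>(row);
  }
  return out;
}

// Random reads from src, strictly sequential writes to dst,
// one reducer after another.
std::vector<double> FillByReducer(const std::vector<double>& src,
                                  const ReducerOffsets& offsets) {
  assert(src.size() == offsets.row_offsets.size());
  std::vector<double> dst(src.size());
  std::size_t write_pos = 0;
  for (std::size_t r = 0; r + 1 < offsets.reducer_start.size(); ++r) {
    for (uint32_t i = offsets.reducer_start[r];
         i < offsets.reducer_start[r + 1]; ++i) {
      dst[write_pos++] = src[offsets.row_offsets[i]];
    }
  }
  return dst;
}
```

With this layout the write side touches each destination cache line once, and the random reads stay cheap as long as the source column fits in cache.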
On the write side, we can use NTStore (non-temporal stores) to bypass RFO (read for ownership) and so avoid cache pollution. However, when the reducer count is very large, NTStore doesn't work well because each reducer is filled with only a little data: for a 32K batch and 4000 reducers, each reducer gets only 8 values.
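As a rough illustration of the non-temporal write path, the sketch below gathers values for one reducer and writes them with the SSE2 intrinsic `_mm_stream_si64`. It is an assumption about how such a loop could look, not the implementation in this patch; `StreamGather` is a hypothetical name, and on non-x86-64 targets the code falls back to ordinary stores.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>
#define SKETCH_HAS_NT_STORE 1
#else
#define SKETCH_HAS_NT_STORE 0
#endif

static_assert(sizeof(long long) == sizeof(double), "expects 8-byte doubles");

// Gathers `count` doubles from src at the given row offsets and writes them
// sequentially to dst. On x86-64 the writes are non-temporal, so the
// destination lines are not read into the cache before being overwritten.
void StreamGather(const double* src, const uint32_t* row_offsets,
                  std::size_t count, double* dst) {
  for (std::size_t i = 0; i < count; ++i) {
    long long bits;
    std::memcpy(&bits, &src[row_offsets[i]], sizeof(bits));
#if SKETCH_HAS_NT_STORE
    _mm_stream_si64(reinterpret_cast<long long*>(dst + i), bits);
#else
    std::memcpy(dst + i, &bits, sizeof(bits));
#endif
  }
#if SKETCH_HAS_NT_STORE
  // Non-temporal stores are weakly ordered; fence before the data is read.
  _mm_sfence();
#endif
}
```

With 8 doubles per reducer, each call writes just 64 bytes, which is consistent with the observation above that non-temporal stores pay off poorly at high reducer counts.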
How was this patch tested?
From the benchmark data, the solution partially solves the reducer-count scaling issue. From the chart below, we can see that 4096 and 512 reducers have the same performance.
Remaining work
The AVX implementation doesn't show better performance than the INT solution. NTStore doesn't show better performance either.