[CARBONDATA-3594] Optimize getSplits() during compaction #3475
Conversation
Build Success with Spark 2.1.0, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.1/977/
Build Success with Spark 2.3.2, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/988/
Build Success with Spark 2.2.1, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.2/985/
@@ -359,6 +359,8 @@ class CarbonMergerRDD[K, V](
      loadMetadataDetails = SegmentStatusManager
        .readLoadMetadata(CarbonTablePath.getMetadataPath(tablePath))
    }

    val validSegIds: java.util.List[String] = new util.ArrayList[String]()
    // for each valid segment.
    for (eachSeg <- carbonMergerMapping.validSegments) {
      // In case of range column get the size for calculation of number of ranges
This loop can be optimized: the `if (null != rangeColumn)` condition does not change inside the loop.
The `if (null != rangeColumn)` check calculates the total size for the range column based on the valid segments in the loop, so I think it cannot be optimized further.
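To illustrate the reviewer's point about a loop-invariant condition, here is a minimal sketch. It is not the actual CarbonMergerRDD code; the function and parameter names are hypothetical, and it assumes the size accumulation is only needed when a range column exists.

```scala
// Illustrative sketch (hypothetical names): when a condition such as
// `rangeColumn != null` cannot change inside the loop, the check can be
// hoisted so it runs once instead of once per segment.
object RangeColumnSketch {
  def totalSizeOfValidSegments(rangeColumn: String,
                               segmentSizes: Seq[Long]): Long = {
    // Hoisted null check: evaluated a single time, not per iteration.
    if (rangeColumn != null) segmentSizes.sum else 0L
  }

  def main(args: Array[String]): Unit = {
    println(totalSizeOfValidSegments("id", Seq(10L, 20L, 30L))) // 60
    println(totalSizeOfValidSegments(null, Seq(10L, 20L)))      // 0
  }
}
```

Whether hoisting is worthwhile here depends on what else the loop body does per segment, which is the author's counterpoint above.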
integration/spark-common/src/main/scala/org/apache/carbondata/spark/rdd/CarbonMergerRDD.scala (three review threads, outdated and resolved)
Force-pushed from d9297ea to 843f7fb
Build Success with Spark 2.1.0, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.1/991/
}
carbonInputSplits ++:= filteredSplits
Please do not use `++:=`; it creates a new list. You can use `++=` instead.
ok
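As a side note on the reviewer's suggestion, the sketch below (illustrative, not CarbonData code) shows the difference: with a `var` holding an immutable list, every `++:=` rebuilds a brand-new list, while a mutable buffer's `++=` appends in place.

```scala
// Sketch of ++:= vs ++= (illustrative). `xs ++:= ys` desugars to
// `xs = xs ++: ys`, which allocates a NEW list on every call; a mutable
// buffer's `++=` grows the existing collection instead.
import scala.collection.mutable.ArrayBuffer

object AppendSketch {
  def main(args: Array[String]): Unit = {
    var xs = List(1, 2)
    xs ++:= List(3, 4)   // builds a new List(1, 2, 3, 4) each time
    println(xs)

    val buffer = ArrayBuffer(1, 2)
    buffer ++= Seq(3, 4) // appends in place, same buffer instance
    println(buffer)
  }
}
```

Inside a loop over many splits, the in-place `++=` avoids reallocating the accumulated collection on every iteration.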
Force-pushed from 843f7fb to 392e88e
Build Success with Spark 2.1.0, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.1/992/
Build Success with Spark 2.2.1, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.2/1001/
Build Success with Spark 2.3.2, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1004/
Build Success with Spark 2.3.2, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1003/
Problem:
In CarbonMergerRDD, when compacting n segments per task, getSplits() is called n times.
Solution:
In CarbonMergerRDD, for each compaction task, collect all valid segments and call getSplits() only once for those segments.
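The before/after shape of this change can be sketched as follows. This is a minimal stand-in, not the real CarbonInputFormat API: `getSplits` here is a hypothetical counter-instrumented function used only to show that batching the valid segment IDs reduces n calls to one.

```scala
// Sketch of the optimization (hypothetical names): instead of invoking
// getSplits() once per segment, gather all valid segment IDs and invoke
// it a single time for the whole batch.
object GetSplitsSketch {
  var getSplitsCalls = 0

  // Stand-in for the expensive splits lookup; counts its invocations.
  def getSplits(segmentIds: Seq[String]): Seq[String] = {
    getSplitsCalls += 1
    segmentIds.map(id => s"split-of-$id")
  }

  def main(args: Array[String]): Unit = {
    val validSegments = Seq("0", "1", "2")

    // Before: one getSplits() call per segment (n calls).
    getSplitsCalls = 0
    val before = validSegments.flatMap(seg => getSplits(Seq(seg)))
    println(s"before: $getSplitsCalls calls, ${before.size} splits")

    // After: a single getSplits() call over all valid segments.
    getSplitsCalls = 0
    val after = getSplits(validSegments)
    println(s"after: $getSplitsCalls call, ${after.size} splits")
  }
}
```

Both paths produce the same splits; only the number of metadata lookups changes.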
Any interfaces changed?
Any backward compatibility impacted?
Document update required?
Testing done
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.