[CARBONDATA-3594] Optimize getSplits() during compaction #3475
Conversation
Build Success with Spark 2.1.0, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.1/977/
Build Success with Spark 2.3.2, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/988/
Build Success with Spark 2.2.1, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.2/985/
@@ -359,6 +359,8 @@ class CarbonMergerRDD[K, V](
      loadMetadataDetails = SegmentStatusManager
        .readLoadMetadata(CarbonTablePath.getMetadataPath(tablePath))
    }

    val validSegIds: java.util.List[String] = new util.ArrayList[String]()
    // for each valid segment.
    for (eachSeg <- carbonMergerMapping.validSegments) {
      // In case of range column get the size for calculation of number of ranges
This loop can be optimized: the `if (null != rangeColumn)` condition does not change inside the loop.
The `if (null != rangeColumn)` check calculates the total size for the range column based on the valid segments in the loop, so I think it cannot be optimized further.
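To illustrate the reviewer's point about a loop-invariant condition, here is a minimal sketch. It is not the actual CarbonMergerRDD code; the function and parameter names are hypothetical, and it assumes the size accumulation is only needed when a range column exists.

```scala
// Illustrative sketch (hypothetical names): when a condition such as
// `rangeColumn != null` cannot change inside the loop, the check can be
// hoisted so it runs once instead of once per segment.
object RangeColumnSketch {
  def totalSizeOfValidSegments(rangeColumn: String,
                               segmentSizes: Seq[Long]): Long = {
    // Hoisted null check: evaluated a single time, not per iteration.
    if (rangeColumn != null) segmentSizes.sum else 0L
  }

  def main(args: Array[String]): Unit = {
    println(totalSizeOfValidSegments("id", Seq(10L, 20L, 30L))) // 60
    println(totalSizeOfValidSegments(null, Seq(10L, 20L)))      // 0
  }
}
```

Whether hoisting is worthwhile here depends on what else the loop body does per segment, which is the author's counterpoint above.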
integration/spark-common/src/main/scala/org/apache/carbondata/spark/rdd/CarbonMergerRDD.scala (three review threads, outdated and resolved)
Force-pushed from d9297ea to 843f7fb
Build Success with Spark 2.1.0, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.1/991/
}
carbonInputSplits ++:= filteredSplits
Please do not use `++:=`; it creates a new list. You can use `++=` instead.
ok
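As a side note on the reviewer's suggestion, the sketch below (illustrative, not CarbonData code) shows the difference: with a `var` holding an immutable list, every `++:=` rebuilds a brand-new list, while a mutable buffer's `++=` appends in place.

```scala
// Sketch of ++:= vs ++= (illustrative). `xs ++:= ys` desugars to
// `xs = xs ++: ys`, which allocates a NEW list on every call; a mutable
// buffer's `++=` grows the existing collection instead.
import scala.collection.mutable.ArrayBuffer

object AppendSketch {
  def main(args: Array[String]): Unit = {
    var xs = List(1, 2)
    xs ++:= List(3, 4)   // builds a new List(1, 2, 3, 4) each time
    println(xs)

    val buffer = ArrayBuffer(1, 2)
    buffer ++= Seq(3, 4) // appends in place, same buffer instance
    println(buffer)
  }
}
```

Inside a loop over many splits, the in-place `++=` avoids reallocating the accumulated collection on every iteration.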
Force-pushed from 843f7fb to 392e88e
Build Success with Spark 2.1.0, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.1/992/
Build Success with Spark 2.2.1, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.2/1001/
Build Success with Spark 2.3.2, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1004/
Build Success with Spark 2.3.2, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1003/
Problem:
In CarbonMergerRDD, when compacting n segments per task, getSplits() is called n times.
Solution:
In CarbonMergerRDD, for each compaction task, collect all valid segments and call getSplits() only once for those segments.
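The before/after shape of this change can be sketched as follows. This is a minimal stand-in, not the real CarbonInputFormat API: `getSplits` here is a hypothetical counter-instrumented function used only to show that batching the valid segment IDs reduces n calls to one.

```scala
// Sketch of the optimization (hypothetical names): instead of invoking
// getSplits() once per segment, gather all valid segment IDs and invoke
// it a single time for the whole batch.
object GetSplitsSketch {
  var getSplitsCalls = 0

  // Stand-in for the expensive splits lookup; counts its invocations.
  def getSplits(segmentIds: Seq[String]): Seq[String] = {
    getSplitsCalls += 1
    segmentIds.map(id => s"split-of-$id")
  }

  def main(args: Array[String]): Unit = {
    val validSegments = Seq("0", "1", "2")

    // Before: one getSplits() call per segment (n calls).
    getSplitsCalls = 0
    val before = validSegments.flatMap(seg => getSplits(Seq(seg)))
    println(s"before: $getSplitsCalls calls, ${before.size} splits")

    // After: a single getSplits() call over all valid segments.
    getSplitsCalls = 0
    val after = getSplits(validSegments)
    println(s"after: $getSplitsCalls call, ${after.size} splits")
  }
}
```

Both paths produce the same splits; only the number of metadata lookups changes.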
Any interfaces changed?
Any backward compatibility impacted?
Document update required?
Testing done
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.