[SPARK-4092] [CORE] Fix InputMetrics for coalesce'd Rdds #3120
Conversation
Can one of the admins verify this patch?
import org.scalatest.FunSuite

import org.apache.spark.util.Utils
imports should be ordered alphabetically, so this should go after the org.apache.spark.scheduler ones
Had a few nitpicks. Otherwise, this looks good to me.
Jenkins, test this please
context.taskMetrics.inputMetrics = Some(blockResult.inputMetrics)
context.taskMetrics.inputMetrics.get.bytesRead += prevBytesRead
Can there be a race here, or is this code always called from one thread?
Since this code does not change any state in CacheManager itself, it should not affect the thread safety of the outer object. So what is important is, will multiple threads call getOrCompute and pass in the same TaskContext (two threads operating on the same task). I don't think that happens since only a single thread operates on each task. Please let me know if I'm missing something.
Yeah that's a good point -- you're right.
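Taken together, the lines under review amount to roughly the carry-over logic below. This is a self-contained sketch using simplified stand-in classes — `InputMetrics`, `TaskMetrics`, and `recordBlockRead` here are illustrative, not Spark's actual API surface:

```scala
// Simplified stand-ins for Spark's executor-side metrics classes; field and
// type names are illustrative, not Spark's real classes.
class InputMetrics(val readMethod: String) {
  var bytesRead: Long = 0L
}

class TaskMetrics {
  var inputMetrics: Option[InputMetrics] = None
}

// Carry-over logic from the snippet above: when a task reads another cached
// block, keep the bytes already accumulated for the same read method instead
// of silently resetting them.
def recordBlockRead(tm: TaskMetrics, blockMetrics: InputMetrics): Unit = {
  val prevBytesRead = tm.inputMetrics
    .filter(_.readMethod == blockMetrics.readMethod)
    .map(_.bytesRead)
    .getOrElse(0L)
  tm.inputMetrics = Some(blockMetrics)
  tm.inputMetrics.get.bytesRead += prevBytesRead
}
```

Since `recordBlockRead` only mutates the `TaskMetrics` passed in, the thread-safety argument above holds: as long as a single thread operates on each task's `TaskMetrics`, there is no race.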
Can one of the admins verify this patch?
Jenkins, this is OK to test
Jenkins, test this please.
Test build #23136 has started for PR 3120 at commit
Test build #23136 has finished for PR 3120 at commit
Test FAILed.
@ksakellis it looks like this has a merge conflict now -- would you mind updating this PR?
Force-pushed from f1a615f to 39353cb
@ash211 Just updated the PR.
@pwendell Can you please comment on Kay's suggestion?
Force-pushed from 39353cb to e567029
@kayousterhout The test you pointed out is actually not valid because of how cartesian is implemented. I added a comment to the interleaved-reads unit test to describe the reasoning.
@kayousterhout @pwendell ping?
Force-pushed from e567029 to a2ca793
Jenkins, test this please.
Test build #25404 has started for PR 3120 at commit
val prevBytesRead = existingMetrics
  .filter(_.readMethod == blockResult.inputMetrics.readMethod)
  .map(_.bytesRead)
  .getOrElse(0L)
So what happens if we have input types that intermix here? For instance, what if they interleave between two input sources... will they just keep clobbering each other? It might be better to just choose a single input metric and stick with it, i.e., if we happen to be reading a block that wasn't derived from the same input as the one before it, just ignore it.
val blockInput = blockResult.inputMetrics
context.taskMetrics.inputMetrics match {
  case Some(existingInput) =>
    if (existingInput.readMethod == blockInput.readMethod) {
      existingInput.bytesRead += blockInput.bytesRead
    }
    // NOTE: If we have interleaving of two input types in one task, we
    // currently ignore blocks associated with all but one type (whichever
    // type was seen first). See SPARK-XXX.
  case None =>
    context.taskMetrics.inputMetrics = Some(blockInput)
}
It's easier to document that behavior and also add a unit test for it.
Actually after looking at Hadoop RDD - it might be necessary to just clobber here to preserve consistency with that case. But it could still be nicer to write this with a match.
What if there are 3 input sources that interleave here? Suppose you have (1) input from cache, (2) input from Hadoop, and (3) input from cache. My understanding is that when (2) starts being read, it will clobber the input metrics from (1). Then, when (3) is read, it will again clobber the input metrics, so the metrics won't properly reflect the total data read from cache (they'll only reflect the data read from (3)). Is that right?
Yes, when the cache is being used there will be clobbering. There are a few solutions:
- We could stop filtering on readMethod and just append blindly. That way we don't override any metrics, but the eventual read method will not be correct (either first wins or last wins, whichever we choose).
- We could model input metrics like we do shuffle metrics, where we collect an array of them and finally sum them up. This is a bigger change.
Is it possible to do @pwendell's suggestion, where you check the type of the input metrics and only append if it's the same type? I'd actually be slightly in favor of just returning a list of input metrics, one for each input type, because the other solutions seem a little hacky -- but I defer to @sryza / @pwendell here (who I think had argued in the past that this extra complexity wasn't worth it).
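For comparison, the "list of input metrics, one per input type" alternative discussed here could look roughly like the sketch below. This is hypothetical — it is not what the PR implements, and none of these names exist in Spark:

```scala
import scala.collection.mutable

// Hypothetical per-read-method accumulator: interleaved Memory/Hadoop reads
// each accumulate independently, so nothing is clobbered.
class PerMethodInputMetrics {
  private val bytesByMethod = mutable.LinkedHashMap.empty[String, Long]

  def addBytesRead(readMethod: String, bytes: Long): Unit =
    bytesByMethod(readMethod) = bytesByMethod.getOrElse(readMethod, 0L) + bytes

  def bytesRead(readMethod: String): Long =
    bytesByMethod.getOrElse(readMethod, 0L)

  // Summed at reporting time, as with shuffle metrics.
  def totalBytesRead: Long = bytesByMethod.values.sum
}
```

In the three-source scenario above (cache, Hadoop, cache again), both cache reads would land in the same "Memory" bucket and the Hadoop read in its own, so the totals stay correct.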
Test build #25404 has finished for PR 3120 at commit
Test FAILed.
@@ -153,34 +157,19 @@ class NewHadoopRDD[K, V](
  throw new java.util.NoSuchElementException("End of stream")
}
havePair = false

// Update bytes read metric every few records
if (recordsSinceMetricsUpdate == HadoopRDD.RECORDS_BETWEEN_BYTES_READ_METRIC_UPDATES
This was done intentionally to help keep the callback updates out of the InputMetrics class and isolate them to HadoopRDD. This notion of callbacks makes the InputMetrics class more complicated and mutable. Since it's an exposed class, we really wanted to keep the interface clean and simple, even if it meant some extra engineering in HadoopRDD. So could this part of the change be reverted back to how it was before (without changing the InputMetrics/TaskMetrics classes)?
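The callback design being debated can be sketched as follows: the metrics object pulls the latest counter from a registered supplier (e.g. Hadoop FileSystem statistics) instead of being pushed an update every few records. All names here are assumptions for illustration, not Spark's actual fields:

```scala
// Illustrative bytes-read callback pattern; not Spark's real InputMetrics.
class CallbackInputMetrics {
  var bytesRead: Long = 0L
  private var bytesReadCallback: Option[() => Long] = None

  // Register a supplier for the current bytes-read counter, e.g. backed by
  // Hadoop FileSystem statistics in the real code.
  def setBytesReadCallback(f: () => Long): Unit =
    bytesReadCallback = Some(f)

  // Pull the latest value from the underlying source, if one is registered.
  def updateBytesRead(): Unit =
    bytesReadCallback.foreach(f => bytesRead = f())
}
```

The trade-off in the thread above is exactly this: the callback makes the class mutable and more complex, but lets each HadoopRDD's metrics refresh themselves without clobbering another RDD's counters.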
@pwendell There is a long thread in this PR between @sryza and @kayousterhout about why we need to add the callback to the input metrics. The reason is to prevent clobbering between different HadoopRDDs, for example CartesianRDD - this is why there is a specific unit test for that case. I don't think we can do anything correctly if we don't have the callbacks in the InputMetrics.
Okay, that's fine then. I looked and it's all private[spark] so actually there is no change to visibility.
When calculating the input metrics, there was an assumption that one task only reads from one block - this is not true for some operations, including coalesce. This patch simply increments the task's input metrics if previous ones of the same read method existed. A limitation of this patch is that if a task reads from two blocks with different read methods, one will override the other.
Also added a test for interleaving reads.
Tasks now only store/accumulate input metrics from the same read method. If a task has interleaved reads from more than one block with different read methods, we choose to store the first read method's metrics. https://issues.apache.org/jira/browse/SPARK-5225 addresses keeping track of all input metrics. This change also centralizes this logic in TaskMetrics and gates how InputMetrics can be added to TaskMetrics.
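The "first read method wins" gating described here can be sketched as below. Class and method names are hypothetical stand-ins, not the PR's actual identifiers:

```scala
// Minimal stand-in for an input-metrics holder.
class InputMetricsSketch(val readMethod: String) {
  var bytesRead: Long = 0L
}

// Gated access: TaskMetrics hands out metrics only for the first read method
// this task has seen; reads with any other method are ignored.
class GatedTaskMetrics {
  private var inputMetrics: Option[InputMetricsSketch] = None

  def inputMetricsFor(readMethod: String): Option[InputMetricsSketch] = synchronized {
    inputMetrics match {
      case Some(m) if m.readMethod == readMethod => Some(m)
      case Some(_) => None // different read method: dropped (see SPARK-5225)
      case None =>
        inputMetrics = Some(new InputMetricsSketch(readMethod))
        inputMetrics
    }
  }

  def totalBytesRead: Long = inputMetrics.map(_.bytesRead).getOrElse(0L)
}
```

Centralizing the decision in one synchronized accessor is what "gates how InputMetrics can be added" — callers can no longer overwrite the field directly.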
Force-pushed from a2ca793 to 54e6658
Jenkins, test this please.
LGTM pending tests
Test build #25623 has started for PR 3120 at commit
Test build #25623 has finished for PR 3120 at commit
Test PASSed.
@@ -146,6 +185,10 @@ class TaskMetrics extends Serializable {
  }
  _shuffleReadMetrics = Some(merged)
}

private[spark] def updateInputMetrics() = synchronized {
In your next PR, can you fix this by adding an explicit return type?
So this follows the method above, updateShuffleReadMetrics, which doesn't have a return type either. Should I change both then?
would be great to do that!
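For reference, the requested style change is just making the inferred Unit return type explicit. The signatures below are illustrative, not the actual diff:

```scala
object ReturnTypeExample {
  // Before: return type left to inference (what the review flagged).
  private def updateShuffleReadMetricsInferred() = synchronized { }

  // After: explicit Unit return type, as asked for in the review.
  def updateInputMetrics(): Unit = synchronized { }
}
```

Explicit return types on non-trivial methods are the usual Scala style for library code, since an inferred type can silently change when the body does.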