[SPARK-20244][Core] Handle incorrect bytesRead metrics when using PySpark #17617

Closed
wants to merge 5 commits into from

Conversation

jerryshao
Contributor

What changes were proposed in this pull request?

Hadoop FileSystem's statistics are based on thread-local variables, which is fine if the RDD computation chain runs in a single thread. But if a child RDD creates another thread to consume the iterator obtained from a Hadoop RDD, the bytesRead computation will be wrong, because the iterator's next() and close() may then run in different threads. This can happen when using PySpark with PythonRDD.

So here we build a map to track the bytesRead for each thread and add them together. This method could be used in three RDDs: HadoopRDD, NewHadoopRDD and FileScanRDD. I assume FileScanRDD cannot be called directly, so I only fixed HadoopRDD and NewHadoopRDD.
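
Roughly, the fix keeps a per-thread map and sums it on read. A minimal sketch of the idea (simplified; updateAndSumBytesRead and readCurrentThreadBytes are illustrative names, not the actual patch):

    import java.util.concurrent.ConcurrentHashMap
    import scala.collection.JavaConverters._

    val bytesReadMap = new ConcurrentHashMap[Long, Long]()

    // Record the calling thread's running total, then sum across all
    // threads that have reported so far.
    def updateAndSumBytesRead(readCurrentThreadBytes: () => Long): Long = {
      bytesReadMap.put(Thread.currentThread().getId, readCurrentThreadBytes())
      bytesReadMap.asScala.values.sum
    }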

How was this patch tested?

Unit test and local cluster verification.

@jerryshao jerryshao changed the title [SPARK-20244][Core] Handle get bytesRead from different thread in Hadoop RDD [SPARK-20244][Core] Handle incorrect bytesRead metrics when using PySpark Apr 12, 2017
@SparkQA

SparkQA commented Apr 12, 2017

Test build #75729 has started for PR 17617 at commit d6f3c42.

@jerryshao
Contributor Author

Jenkins, retest this please.

@SparkQA

SparkQA commented Apr 12, 2017

Test build #75738 has finished for PR 17617 at commit d6f3c42.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Contributor

holdenk commented May 2, 2017

Interesting, more accurate reporting is good, but I haven't looked at this block of code in a while; maybe @srowen has the context necessary to take a look?

@jerryshao
Contributor Author

@holdenk, the basic problem is that Spark uses Hadoop FileSystem's statistics API to get bytesRead and bytesWritten per task. This statistics API is implemented with thread-local variables, which is fine for Scala/Java RDD computations, since the computation executes in the same thread as the task. But for PythonRDD, Spark creates another thread to consume the data, so the current way of counting bytesRead yields a wrong number.

This is a generic problem whenever the task thread and the RDD computation thread are not the same: because of the thread-local variables, the calculated bytesRead metric will be wrong.
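
For example, here is a minimal standalone illustration of the problem, not code from the patch (the file path is hypothetical):

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import scala.collection.JavaConverters._

    // Sum of bytesRead recorded against the *calling* thread only.
    def currentThreadBytesRead(): Long =
      FileSystem.getAllStatistics.asScala.map(_.getThreadStatistics.getBytesRead).sum

    val fs = FileSystem.get(new URI("file:///"), new Configuration())

    // Consume the file on a separate thread, like PythonRDD's writer thread does.
    val reader = new Thread(new Runnable {
      override def run(): Unit = {
        val in = fs.open(new Path("/tmp/some-file")) // hypothetical path
        try { while (in.read() != -1) {} } finally { in.close() }
      }
    })
    reader.start()
    reader.join()

    // Expect 0 here: the bytes were recorded against the reader thread's
    // statistics, which the task thread cannot see.
    println(currentThreadBytesRead())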

Contributor

@jiangxb1987 jiangxb1987 left a comment

This change sounds valid, but I tried the test cases on the current master branch and they aren't failing.

val f = () => threadStats.map(_.getBytesRead).sum
val baselineBytesRead = f()
() => f() - baselineBytesRead
val f = () => FileSystem.getAllStatistics.asScala.map(_.getThreadStatistics.getBytesRead).sum
Contributor

Why are you changing this?

Contributor Author

In the previous code, threadStats and the f function could be executed in two different threads, so the metrics we got could be wrong.

context.addTaskCompletionListener { context =>
  // Update the bytes read before closing to make sure lingering bytesRead statistics in
  // this thread get correctly added.
  updateBytesRead()
Contributor

Will this duplicate what we do in close()?

Contributor Author

@jerryshao jerryshao May 27, 2017

close() can be called in another thread, as I recall, so I added this here to avoid lingering bytesRead in the task running thread (some bytes can be read when creating the InputFormat); also, there is no harm in calling updateBytesRead again.

Change-Id: I76c6ff84904211e3fae4dcd11772fb7fa5ec503c
@jerryshao
Contributor Author

@jiangxb1987 the UT I wrote didn't actually reflect this issue; I just updated the UT, please review, thanks!

@SparkQA

SparkQA commented May 27, 2017

Test build #77453 has finished for PR 17617 at commit a23633d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@jiangxb1987 jiangxb1987 left a comment

LGTM except for some nits in the test cases.

    val bytesRead = runAndReturnBytesRead {
      sc.textFile(tmpFilePath, 4).mapPartitions { iter =>
        val buf = new ArrayBuffer[String]()
        val thread = new Thread() {
Contributor

nit: We could use ThreadUtils.runInNewThread() to make this shorter, like:

        ThreadUtils.runInNewThread("TestThread") {
          iter.flatMap(_.split(" ")).foreach(buf.append(_))
        }

    sc.newAPIHadoopFile(tmpFilePath, classOf[NewTextInputFormat], classOf[LongWritable],
      classOf[Text]).mapPartitions { iter =>
      val buf = new ArrayBuffer[String]()
      val thread = new Thread() {
Contributor

nit: Same as above, we could rewrite to:

        ThreadUtils.runInNewThread("TestThread") {
          iter.map(_._2.toString).flatMap(_.split(" ")).foreach(buf.append(_))
        }

@jiangxb1987
Contributor

ping @jerryshao

Change-Id: Ie8cc1f19719956184afea2ba04a59f9221469da7
@SparkQA

SparkQA commented May 31, 2017

Test build #77567 has finished for PR 17617 at commit 8b16017.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987
Contributor

LGTM, cc @cloud-fan @ueshin

@@ -143,14 +144,18 @@ class SparkHadoopUtil extends Logging {
* Returns a function that can be called to find Hadoop FileSystem bytes read. If
* getFSBytesReadOnThreadCallback is called from thread r at time t, the returned callback will
* return the bytes read on r since t.
*
* @return None if the required method can't be found.
*/
private[spark] def getFSBytesReadOnThreadCallback(): () => Long = {
Contributor

let's update the doc to say that the returned function may be called in multiple threads.


() => {
  bytesReadMap.put(Thread.currentThread().getId, f())
  bytesReadMap.asScala.map { case (k, v) =>
Contributor

This is not atomic; shall we synchronize on bytesReadMap when calculating the sum?

Contributor Author

I see. Let me fix it.

val baseline = (Thread.currentThread().getId, f())
val bytesReadMap = new ConcurrentHashMap[Long, Long]()

() => {
Contributor

I think it's better to create an anonymous Function0 instance, treat bytesReadMap as a member variable, and document the multi-thread semantics of the apply method.

Contributor Author

That's a good idea, let me change the code.
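
For reference, a sketch of the suggested shape, reusing the names from the diff above (the merged code may differ in detail):

    import org.apache.hadoop.fs.FileSystem
    import scala.collection.JavaConverters._
    import scala.collection.mutable

    private[spark] def getFSBytesReadOnThreadCallback(): () => Long = {
      // f() returns the bytes read so far on the calling thread.
      val f = () => FileSystem.getAllStatistics.asScala
        .map(_.getThreadStatistics.getBytesRead).sum
      val baseline = (Thread.currentThread().getId, f())

      new Function0[Long] {
        // Written and read from multiple threads, hence the synchronized
        // block in apply().
        private val bytesReadMap = new mutable.HashMap[Long, Long]()

        /**
         * Returns the bytes read by every thread that has called this
         * function so far, minus the creating thread's baseline.
         */
        override def apply(): Long = bytesReadMap.synchronized {
          bytesReadMap.put(Thread.currentThread().getId, f())
          bytesReadMap.map { case (k, v) =>
            v - (if (k == baseline._1) baseline._2 else 0)
          }.sum
        }
      }
    }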

Change-Id: I5eba16903914932392e05ba56c27808c36b033b3
Member

@ueshin ueshin left a comment

LGTM except for a minor comment.

@@ -21,8 +21,10 @@ import java.io.IOException
import java.security.PrivilegedExceptionAction
import java.text.DateFormat
import java.util.{Arrays, Comparator, Date, Locale}
import java.util.concurrent.ConcurrentHashMap
Member

nit: unneeded import.

@SparkQA

SparkQA commented May 31, 2017

Test build #77581 has finished for PR 17617 at commit 1e3fb8a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Change-Id: Id3b501645fca858ec4636cee30163ea39fe7ce4f
@SparkQA

SparkQA commented May 31, 2017

Test build #77591 has finished for PR 17617 at commit 2b08b48.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

private val bytesReadMap = new mutable.HashMap[Long, Long]()

/**
* Returns a function that can be called to calculate Hadoop FileSystem bytes read.
Contributor

@cloud-fan cloud-fan May 31, 2017

move these comments before new Function0[Long] or before def getFSBytesReadOnThreadCallback. The apply here doesn't return a function...

Contributor Author

Done.

        buf.iterator
      }.count()
    }
    assert(bytesRead != 0)
Contributor

this assert is unnecessary.

Contributor Author

Done.

Change-Id: I6e7870698108a52d577a59478a0f88bc645d1133
@SparkQA

SparkQA commented Jun 1, 2017

Test build #77621 has finished for PR 17617 at commit c068f43.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Jun 1, 2017
…park

## What changes were proposed in this pull request?

Hadoop FileSystem's statistics are based on thread-local variables, which is fine if the RDD computation chain runs in a single thread. But if a child RDD creates another thread to consume the iterator obtained from a Hadoop RDD, the bytesRead computation will be wrong, because the iterator's `next()` and `close()` may then run in different threads. This can happen when using PySpark with PythonRDD.

So here we build a map to track the `bytesRead` for each thread and add them together. This method could be used in three RDDs: `HadoopRDD`, `NewHadoopRDD` and `FileScanRDD`. I assume `FileScanRDD` cannot be called directly, so I only fixed `HadoopRDD` and `NewHadoopRDD`.

## How was this patch tested?

Unit test and local cluster verification.

Author: jerryshao <sshao@hortonworks.com>

Closes #17617 from jerryshao/SPARK-20244.

(cherry picked from commit 5854f77)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan
Contributor

thanks, merging to master/2.2!

@asfgit asfgit closed this in 5854f77 Jun 1, 2017
@jerryshao
Contributor Author

Thanks @jiangxb1987 @cloud-fan @ueshin for your review!

@@ -143,14 +144,29 @@ class SparkHadoopUtil extends Logging {
* Returns a function that can be called to find Hadoop FileSystem bytes read. If
* getFSBytesReadOnThreadCallback is called from thread r at time t, the returned callback will
* return the bytes read on r since t.
*
* @return None if the required method can't be found.
Contributor

Why remove this line instead of the doc?

Contributor

this doesn't return a None, but the doc is still correct about the behavior.

yoonlee95 pushed a commit to yoonlee95/spark that referenced this pull request Aug 17, 2017
…park


Author: jerryshao <sshao@hortonworks.com>

Closes apache#17617 from jerryshao/SPARK-20244.

Conflicts:
	core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala