[SPARK-4874] [CORE] Collect record count metrics #4067
Conversation
@@ -17,6 +17,8 @@
package org.apache.spark

import org.apache.spark.util.AfterNextInterceptingIterator
this should go with the other org.apache.spark imports
Test build #25628 has finished for PR 4067 at commit
 * @tparam A the iterable type
 */
private[spark]
class InterceptingIterator[A](sub: Iterator[A]) extends Iterator[A] {
Can we avoid this? Seems fairly expensive by adding a lot more method calls ...
So this is supposed to be a generic way of intercepting iterators. If we don't have this, I'd have to do something custom like CompletionIterator; I was trying to make something reusable.
Yeah, the thing is, there are only a very limited number of places where you'd need to increment the counters. I'm not sure this super generic design is worth it, unless you want to do a lot of performance studies of the differences ...
Can you also paste some screenshots of what the UI changes look like? Thanks.
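For readers following along, here is a rough sketch of the kind of generic intercepting iterator being discussed. The names and shape are illustrative only, not the code that was ultimately merged (the PR later dropped this abstraction in favour of incrementing counters directly at the call sites):

```scala
// Illustrative sketch only: a wrapper that delegates to an underlying iterator
// and gives subclasses a hook to observe each element after next() is called.
private[spark] abstract class AfterNextInterceptingIterator[A](sub: Iterator[A])
  extends Iterator[A] {

  override def hasNext: Boolean = sub.hasNext

  // Delegate to the wrapped iterator, then let the subclass inspect the element
  // (e.g. to bump a recordsRead counter) before returning it unchanged.
  override def next(): A = afterNext(sub.next())

  def afterNext(elem: A): A
}
```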
val readMetrics = context.taskMetrics().createShuffleReadMetricsForDependency()

override def afterNext(next: T): T = {
  readMetrics.recordsRead += 1
  logError("Read record " + next)
this is not intended to be here, is it?
whoops.. nope.
Force-pushed from 571cb69 to 1572054.
 * Total records read.
 */
def recordsRead: Long = _recordsRead.get()
@volatile @transient var bytesReadCallback: Option[() => Long] = None
can you explain what this does?
This change was dependent on #3120, which just got merged, and now there are some merge conflicts. I need to fix those first and will update the PR.
@@ -31,6 +31,8 @@ class BlockObjectWriterSuite extends FunSuite {
  new JavaSerializer(new SparkConf()), 1024, os => os, true, writeMetrics)

writer.write(Long.box(20))
// Record metrics update on every write
assert(writeMetrics.recordsWritten == 1)
you'd want to use === instead of ==
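For context, ScalaTest's === includes both values in the failure message (e.g. "2 did not equal 1"), whereas plain == in older ScalaTest versions only reports that the assertion failed. A minimal example of the suggested form:

```scala
// === produces a descriptive failure message showing both sides of the comparison.
assert(writeMetrics.recordsWritten === 1)
```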
Hey @ksakellis - thanks for working on this. I took a very quick look at the patch. Overall I feel the patch should be fairly straightforward, but the specific implementation might've gone a bit overboard with Scala language features (a lot of Options, orElse, closures, etc.) and design (too many new classes introduced). If we can reduce those, the PR would be a lot easier to understand.
  _bytesRead.addAndGet(bytes)
}

def addRecordsRead(records: Long) = {
maybe incrementRecordsRead and incrementBytesRead are better names?
Force-pushed from 1572054 to 3c2d021.
@rxin I updated the PR after doing a rebase and also incorporated some of your feedback. You made two general comments:
Hi again - I can't find my previous comment since the line is no longer in the diff due to the other PR being merged. Can you still add a comment for that one (the part with Option and orElse and set ...)? Want to make sure that if we read that code a year from now, we can still understand what's going on.
The Scala stuff was mostly about the previous PR that got merged (and is now no longer showing up as part of this diff).
So is this code you were referring to in HadoopRDD?

// Find a function that will return the FileSystem bytes read by this thread. Do this before
// creating RecordReader, because RecordReader's constructor might read some bytes
val bytesReadCallback = inputMetrics.bytesReadCallback.orElse(
  split.inputSplit.value match {
    case split: FileSplit =>
      SparkHadoopUtil.get.getFSBytesReadOnThreadCallback(split.getPath, jobConf)
    case _ => None
  }
)
Test build #25638 timed out for PR 4067 at commit
Test build #25641 has finished for PR 4067 at commit
What about combining the input size and records in the same column? Overall this will help with the expansion in the number of columns. The title could be "Input Size / Records".
If we do that we wouldn't be able to sort on num records and bytes independently.
Yes - you'd only be able to sort on bytes. Wouldn't that be okay? These would likely track closely in most cases.
A big motivation for adding recordsRead/Written was to detect data skew, and in those cases bytes and records might not track very closely. Thinking more about this, I suspect that an "Avg. Record Size" column (bytesRead / recordsRead) would be what you'd want to sort on. We could add this metric to the UI, make it sortable, and then combine the bytesRead and recordsRead metrics into one column. Thoughts?
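A minimal sketch of how such a derived column could be computed from the two existing metrics; this is illustrative only (the helper name is hypothetical), with a guard for the zero-records case:

```scala
// Hypothetical helper: average record size in bytes, or None when no records
// were read (avoids division by zero and keeps the UI cell blank).
def avgRecordSize(bytesRead: Long, recordsRead: Long): Option[Long] =
  if (recordsRead > 0) Some(bytesRead / recordsRead) else None
```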
@@ -238,6 +245,10 @@ case class InputMetrics(readMethod: DataReadMethod.Value) {
  _bytesRead.addAndGet(bytes)
}

def addRecordsRead(records: Long) = {
This should be incRecordsRead, in keeping with SPARK-3288.
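For reference, SPARK-3288 is about making TaskMetrics fields private behind read-only accessors and inc-style mutators. A hypothetical sketch of that naming shape (class and field names illustrative):

```scala
// Hypothetical sketch of the naming convention the review asks for: a private
// mutable field, a read-only accessor, and an inc-style mutator.
class InputMetricsSketch {
  private var _recordsRead: Long = 0L

  def recordsRead: Long = _recordsRead

  def incRecordsRead(records: Long): Unit = {
    _recordsRead += records
  }
}
```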
Test build #26876 has finished for PR 4067 at commit
Jenkins, retest this please
Test build #26896 has finished for PR 4067 at commit
Collects record counts for both Input/Output and Shuffle Metrics. For the input/output metrics, it just increments the counter every time the iterators get accessed. For shuffle on the write side, we count the metrics post aggregation (after a map-side combine) and on the read side we count the metrics pre aggregation. This allows both the bytes read/written metrics and the records read/written to line up. For backwards compatibility, if we deserialize an older event that doesn't have record metrics, we set the metric to -1.
Also made the availability of the # records more complete.
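As a rough sketch of the iterator-wrapping idea described above (illustrative names, not the exact code that was merged), counting happens as the task consumes its record iterator:

```scala
object RecordCounting {
  // Illustrative sketch: wrap a task's record iterator so that every call to
  // next() invokes a counting callback while passing the element through
  // unchanged. The real change wires this into the input/output and shuffle
  // read/write paths.
  def countingIterator[T](delegate: Iterator[T])(onRecord: () => Unit): Iterator[T] =
    new Iterator[T] {
      override def hasNext: Boolean = delegate.hasNext
      override def next(): T = {
        onRecord() // e.g. inputMetrics.incRecordsRead(1L)
        delegate.next()
      }
    }
}

// Usage sketch:
//   val counted = RecordCounting.countingIterator(rawIter)(() => metrics.incRecordsRead(1L))
```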
…e a function call
- Hide columns in executor summary table if no data - revert change to show output metrics for hadoop < 2.4 - other cr feedback.
Force-pushed from e156560 to dad4d57.
@@ -25,7 +25,7 @@ import org.apache.spark._
 import org.apache.spark.serializer.Serializer
 import org.apache.spark.shuffle.FetchFailedException
 import org.apache.spark.storage.{BlockId, BlockManagerId, ShuffleBlockFetcherIterator, ShuffleBlockId}
-import org.apache.spark.util.CompletionIterator
+import org.apache.spark.util.{CompletionIterator}
you don't need braces here if it is a single import.
Test build #26904 has finished for PR 4067 at commit
Test build #26906 has finished for PR 4067 at commit
Jenkins, test this please. This LGTM pending tests.
Test build #26936 has finished for PR 4067 at commit
Merging this, thanks Kos.
Collects record counts for both Input/Output and Shuffle Metrics. For the input/output metrics, it just increments the counter every time the iterators get accessed. For shuffle on the write side, we count the metrics post aggregation (after a map-side combine) and on the read side we count the metrics pre aggregation. This allows both the bytes read/written metrics and the records read/written to line up. For backwards compatibility, if we deserialize an older event that doesn't have record metrics, we set the metric to -1.

Author: Kostas Sakellis <kostas@cloudera.com>

Closes #4067 from ksakellis/kostas-spark-4874 and squashes the following commits:

bd919be [Kostas Sakellis] Changed 'Records Read' in shuffleReadMetrics json output to 'Total Records Read'
dad4d57 [Kostas Sakellis] Add a comment and check to BlockObjectWriter so that it cannot be reopened.
6f236a1 [Kostas Sakellis] Renamed _recordsWritten in ShuffleWriteMetrics to be more consistent
70620a0 [Kostas Sakellis] CR Feedback
17faa3a [Kostas Sakellis] Removed AtomicLong in favour of using Long
b6f9923 [Kostas Sakellis] Merge AfterNextInterceptingIterator with InterruptableIterator to save a function call
46c8186 [Kostas Sakellis] Combined Bytes and # records into one column
57551c1 [Kostas Sakellis] Conforms to SPARK-3288
6cdb44e [Kostas Sakellis] Removed the generic InterceptingIterator and replaced it with specific implementation
1aa273c [Kostas Sakellis] CR Feedback
1bb78b1 [Kostas Sakellis] [SPARK-4874] [CORE] Collect record count metrics

(cherry picked from commit dcd1e42)
Signed-off-by: Patrick Wendell <patrick@databricks.com>
It looks like the "InputOutputMetricsSuite input metrics with mixed read methods" and "InputOutputMetricsSuite input metrics with interleaved reads" tests may have started failing in the hadoop-2.2 build since this patch:
Yikes, @JoshRosen, I'm looking into this.
@ksakellis, @SparkQA, @preaudc How do I collect these metrics on the console (Spark shell or a spark-submit job) right after the task or job is done? We are using Spark to load data from MySQL to Cassandra and it is quite huge (ex: ~200 GB and 600M rows). When the task is done, we want to verify exactly how many rows Spark processed. We can get the number from the Spark UI, but how can we retrieve that number ("Output Records Written") from the Spark shell or in a spark-submit job?

Sample command to load from MySQL to Cassandra:

val pt = sqlcontext.read.format("jdbc").option("url", "jdbc:mysql://...:3306/...").option("driver", "com.mysql.jdbc.Driver").option("dbtable", "payment_types").option("user", "hadoop").option("password", "...").load()

I want to retrieve all the Spark UI metrics for the above task, mainly Output Size and Records Written. Please help. Thanks for your time!
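For the question above, one approach (not the only one) is to register a SparkListener and aggregate the per-task output metrics yourself, then read the totals back in the shell or driver program. A minimal sketch, assuming a Spark 2.x-style API where taskMetrics.outputMetrics is not wrapped in an Option (older versions differ); OutputRecordsListener is an illustrative name:

```scala
import java.util.concurrent.atomic.AtomicLong
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Sum "Output Records Written" and "Output Size" across all finished tasks.
class OutputRecordsListener extends SparkListener {
  val recordsWritten = new AtomicLong(0L)
  val bytesWritten = new AtomicLong(0L)

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val metrics = taskEnd.taskMetrics
    if (metrics != null) { // metrics can be missing for some failed tasks
      recordsWritten.addAndGet(metrics.outputMetrics.recordsWritten)
      bytesWritten.addAndGet(metrics.outputMetrics.bytesWritten)
    }
  }
}

// Usage from the Spark shell or a spark-submit job, before running the write:
//   val listener = new OutputRecordsListener
//   sc.addSparkListener(listener)
//   ... run the job that writes the data ...
//   println(s"records written: ${listener.recordsWritten.get}, bytes: ${listener.bytesWritten.get}")
```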