
[SPARK-5347][CORE] Change FileSplit to InputSplit in update inputMetrics #4150

Closed · wants to merge 1 commit

Conversation

shenh062326
Contributor

When inputFormatClass is set to CombineFileInputFormat, the input metrics show that the input is empty. This problem does not appear in spark-1.1.0. The cause is that in HadoopRDD, inputMetrics are only set when the split is an instance of FileSplit, but CombineFileInputFormat produces splits that are not FileSplits. It is not necessary to check for FileSplit; checking for InputSplit is sufficient.
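For illustration, here is a minimal sketch (not the actual HadoopRDD code) of the check being discussed, using Hadoop's old mapred API; the method names bytesReadBefore and bytesReadAfter are hypothetical:

```scala
import org.apache.hadoop.mapred.{FileSplit, InputSplit}

// Before: the byte count is only recorded when the split is a plain
// FileSplit, so a CombineFileInputFormat job reports empty input.
def bytesReadBefore(split: InputSplit): Option[Long] = split match {
  case fs: FileSplit => Some(fs.getLength)
  case _             => None
}

// After (this PR's idea): getLength() is declared on the InputSplit
// interface itself, so it can be called on any split type.
def bytesReadAfter(split: InputSplit): Long = split.getLength
```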

@SparkQA

SparkQA commented Jan 22, 2015

Test build #25937 has started for PR 4150 at commit 9e04a54.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 22, 2015

Test build #25937 has finished for PR 4150 at commit 9e04a54.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • protected case class Keyword(str: String)
    • class SqlLexical extends StdLexical

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25937/

@srowen
Member

srowen commented Jan 22, 2015

My only question was whether getLength() is indeed defined in the InputSplit interface in older Hadoop versions, but it looks like it is. This change compiles with default Hadoop versions in the build.

@sryza
Contributor

sryza commented Jan 22, 2015

I think this is a duplicate of #4050, which only adds support for CombineFileSplits. We shouldn't add support for generic InputSplits because many input formats do not read from HDFS and having a metrics column for HDFS-bytes-read in these cases would be confusing and a waste of space. getLength should be defined in all versions of Hadoop, but it doesn't necessarily mean bytes for all InputFormats. For example, it means # of records read for DBInputFormat.
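For contrast, a hedged sketch of the narrower approach sryza describes (roughly the shape of #4050, not its exact code); fileBytesRead is a hypothetical name:

```scala
import org.apache.hadoop.mapred.{FileSplit, InputSplit}
import org.apache.hadoop.mapred.lib.CombineFileSplit

// Treat the split length as bytes read only for file-based split types.
// For other splits, getLength() has other meanings; e.g. DBInputFormat's
// split reports a number of rows, not bytes.
def fileBytesRead(split: InputSplit): Option[Long] = split match {
  case fs: FileSplit         => Some(fs.getLength)
  case cfs: CombineFileSplit => Some(cfs.getLength)
  case _                     => None
}
```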

@srowen
Member

srowen commented Jan 23, 2015

Given this reasoning, it does seem like this is a duplicate of SPARK-5199.

@shenh062326
Contributor Author

If we use an InputFormat whose splits are not instances of org.apache.hadoop.mapreduce.lib.input.{CombineFileSplit, FileSplit}, then we can't get any input metrics.

@srowen
Member

srowen commented Feb 6, 2015

@shenh062326 Sandy is saying that in those other cases, the values you are getting are not even in the same units, and so would be invalid. I believe we should close this PR in favor of #4050, which accomplishes the part of this change that is possible.

@ksakellis

I agree with @srowen and @sryza. Also, given #4067, this metric should really just report size.

@andrewor14
Contributor

Hi @shenh062326 since this is a duplicate would you mind closing this PR? The associated JIRA is already closed. Thanks.

asfgit closed this in 46462ff on Feb 22, 2015