
[SPARK-5347][CORE] Change FileSplit to InputSplit in update inputMetrics #4150

Closed · wants to merge 1 commit

Conversation

shenh062326
Contributor

When inputFormatClass is set to CombineFileInputFormat, the input metrics show that the input is empty. This problem does not appear in spark-1.1.0. The cause is that in HadoopRDD, inputMetrics are only set when the split is an instance of FileSplit, but CombineFileInputFormat produces splits that are not FileSplits. It is not necessary to check for FileSplit; checking for InputSplit is sufficient.
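For illustration, here is a minimal sketch (not the actual HadoopRDD code) of the check being discussed, using Hadoop's old mapred API; the method names bytesReadBefore and bytesReadAfter are hypothetical:

```scala
import org.apache.hadoop.mapred.{FileSplit, InputSplit}

// Before: the byte count is only recorded when the split is a plain
// FileSplit, so a CombineFileInputFormat job reports empty input.
def bytesReadBefore(split: InputSplit): Option[Long] = split match {
  case fs: FileSplit => Some(fs.getLength)
  case _             => None
}

// After (this PR's idea): getLength() is declared on the InputSplit
// interface itself, so it can be called on any split type.
def bytesReadAfter(split: InputSplit): Long = split.getLength
```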

@SparkQA

SparkQA commented Jan 22, 2015

Test build #25937 has started for PR 4150 at commit 9e04a54.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 22, 2015

Test build #25937 has finished for PR 4150 at commit 9e04a54.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • protected case class Keyword(str: String)
    • class SqlLexical extends StdLexical

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25937/

@srowen
Member

srowen commented Jan 22, 2015

My only question was whether getLength() is indeed defined in the InputSplit interface in older Hadoop versions, but it looks like it is. This change compiles with default Hadoop versions in the build.

@sryza
Contributor

sryza commented Jan 22, 2015

I think this is a duplicate of #4050, which only adds support for CombineFileSplits. We shouldn't add support for generic InputSplits because many input formats do not read from HDFS and having a metrics column for HDFS-bytes-read in these cases would be confusing and a waste of space. getLength should be defined in all versions of Hadoop, but it doesn't necessarily mean bytes for all InputFormats. For example, it means # of records read for DBInputFormat.
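For contrast, a hedged sketch of the narrower approach sryza describes (roughly the shape of #4050, not its exact code); fileBytesRead is a hypothetical name:

```scala
import org.apache.hadoop.mapred.{FileSplit, InputSplit}
import org.apache.hadoop.mapred.lib.CombineFileSplit

// Treat the split length as bytes read only for file-based split types.
// For other splits, getLength() has other meanings; e.g. DBInputFormat's
// split reports a number of rows, not bytes.
def fileBytesRead(split: InputSplit): Option[Long] = split match {
  case fs: FileSplit         => Some(fs.getLength)
  case cfs: CombineFileSplit => Some(cfs.getLength)
  case _                     => None
}
```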

@srowen
Member

srowen commented Jan 23, 2015

Given this reasoning, it does seem like this is a duplicate of SPARK-5199.

@shenh062326
Contributor Author

If we use an InputFormat whose splits are not instances of org.apache.hadoop.mapreduce.lib.input.{CombineFileSplit, FileSplit}, then we can't get any input metrics.

@srowen
Member

srowen commented Feb 6, 2015

@shenh062326 Sandy is saying that in those other cases, the values you are getting are not even in the same units, and so would be invalid. I believe we should close this PR in favor of #4050, which accomplishes the part of this change that is possible.

@ksakellis

I agree with @srowen and @sryza. Also, given #4067, this metric should really just report size.

@andrewor14
Contributor

Hi @shenh062326 since this is a duplicate would you mind closing this PR? The associated JIRA is already closed. Thanks.

asfgit closed this in 46462ff on Feb 22, 2015