[SPARK-16664][SQL] Fix persist call on Data frames with more than 200… #14324

Closed
breakdawn wants to merge 4 commits into apache:master from breakdawn:master

Conversation

@breakdawn commented Jul 23, 2016

What changes were proposed in this pull request?

Commit f12f11e introduced this bug by using map where foreach was needed.

How was this patch tested?

Test added
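
For readers unfamiliar with the pitfall being fixed here: on a Scala Iterator, map is lazy, so using it purely for side effects silently does nothing unless the result is consumed, whereas foreach runs eagerly. A minimal illustration in plain Scala (not Spark's actual generated-accessor code):

    object IteratorMapPitfall {
      def main(args: Array[String]): Unit = {
        val seen = scala.collection.mutable.ArrayBuffer.empty[Int]

        // Iterator.map is lazy: the body never runs because the mapped iterator
        // is discarded without being consumed, so the side effects are lost.
        Iterator(1, 2, 3).map(x => seen += x)
        assert(seen.isEmpty)

        // foreach evaluates its body eagerly for every element.
        Iterator(1, 2, 3).foreach(x => seen += x)
        assert(seen == Seq(1, 2, 3))
      }
    }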

@rxin (Contributor) commented Jul 23, 2016

Can you add a test case?

@breakdawn (Author)

Yes, working on that

@lw-lin (Contributor) commented Jul 23, 2016

@breakdawn it'd be great to do more testing before opening a pull request. I'm also investigating this, and I found that the same fix works for 201 columns but fails for 8118; the exact limit is 8117.

@breakdawn (Author)

@lw-lin You're right, thanks for your suggestion.

@breakdawn (Author)

The 8118-column limit is due to janino; the exception looks like the following and might be a separate issue:
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:889)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:941)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:938)
at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
... 25 more
Caused by: java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:197)
at java.io.DataInputStream.readFully(DataInputStream.java:169)
at org.codehaus.janino.util.ClassFile.loadAttribute(ClassFile.java:1509)
at org.codehaus.janino.util.ClassFile.loadAttributes(ClassFile.java:644)
at org.codehaus.janino.util.ClassFile.loadFields(ClassFile.java:623)
at org.codehaus.janino.util.ClassFile.<init>(ClassFile.java:280)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:914)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:912)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.recordCompilationStats(CodeGenerator.scala:912)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:884)
... 29 more
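
For reference, a rough repro sketch of this separate >8117-column failure, modeled on the InMemoryColumnarQuerySuite snippet quoted later in this thread; the import path is assumed, and the exact threshold behavior is taken from this discussion, so treat the wiring as illustrative:

    // Run inside InMemoryColumnarQuerySuite (or code in the same package),
    // since it drives GenerateColumnAccessor directly.
    import org.apache.spark.sql.execution.columnar.GenerateColumnAccessor // import path assumed
    import org.apache.spark.sql.types.IntegerType

    // 8117 columns was reported to compile fine...
    GenerateColumnAccessor.generate(List.fill(8117)(IntegerType))

    // ...while 8118 columns was reported to fail with the janino EOFException above.
    GenerateColumnAccessor.generate(List.fill(8118)(IntegerType))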

@lw-lin (Contributor) commented Jul 23, 2016

@breakdawn yes that's a different issue and I've been looking into it.

Regarding what this PR tries to fix, could you run this PR's change against this test case to see whether it's sufficient?

@@ -1571,4 +1571,12 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
checkAnswer(joined, Row("x", null, null))
checkAnswer(joined.filter($"new".isNull), Row("x", null, null))
}

test("SPARK-16664: persist with more than 200 columns") {
val size = 201l
Review comment (Member):

Nit: write 201L for a long literal; it's too easy to read this as 2011.
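
For context, here is a sketch of how the test quoted above might be completed. The row construction and assertion below are illustrative, not necessarily the exact code that was merged, and it assumes DataFrameSuite's existing imports (Row and the types in org.apache.spark.sql.types):

    test("SPARK-16664: persist with more than 200 columns") {
      val size = 201L
      // One row with `size` long columns, just past the 200-column boundary in the PR title.
      val rdd = sparkContext.makeRDD(Seq(Row.fromSeq(Seq.range(0L, size))))
      val schema = StructType(List.range(0, size.toInt).map(i => StructField("name" + i, LongType, nullable = true)))
      val df = spark.createDataFrame(rdd, schema)
      // Before the fix, values read back from the persisted wide DataFrame were lost.
      assert(df.persist.take(1).apply(0).toSeq(100).asInstanceOf[Long] == 100)
    }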

@srowen (Member) commented Jul 23, 2016

There are actually 55 occurrences of this type of problem in the code base. I think I will open a PR separately to fix them. It might or might not cause a problem in practice in other cases, but many are in examples or tests, where we might not observe the consequence.
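
To make the "might or might not cause a problem" point concrete, a small plain-Scala illustration: on strict collections the side effects still happen and only a throwaway result is built, but on lazy iterators they are dropped entirely.

    object EagerVsLazyMap {
      def main(args: Array[String]): Unit = {
        val seen = scala.collection.mutable.ArrayBuffer.empty[Int]

        // On a strict collection such as List, map is eager: the side effects
        // run anyway, and the discarded result list is the only cost.
        List(1, 2, 3).map(x => seen += x)
        assert(seen == Seq(1, 2, 3))

        // On a lazy Iterator, nothing runs until the result is consumed,
        // which is the failure mode this PR fixes.
        Iterator(4, 5, 6).map(x => seen += x)
        assert(seen == Seq(1, 2, 3))
      }
    }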

@breakdawn (Author)

@lw-lin Umm, thanks for pointing it out. Since the limit is 8117, the 10000-column case will fail, so that test needs an update.

@lw-lin (Contributor) commented Jul 24, 2016

@breakdawn what else can we do to actually fix the ≥ 8118-column issue? We're running out of constant pool entries when compiling the generated code. Maybe compile it into multiple classes? Or just fall back to the non-code-gen path? Thanks.

@breakdawn (Author)

@lw-lin Personally, the multiple-classes approach fits the current implementation more smoothly. Either way it's a big change, so it may be better to open a separate JIRA issue to gather more discussion.
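
Purely to illustrate the shape of the "multiple classes" idea (this is not Spark's codegen; every name below is invented): give each helper class a bounded slice of the columns so that no single class, and therefore no single constant pool, has to reference all of them.

    // Hypothetical sketch of chunking a very wide column set across helpers.
    final class AccessorGroup(columnIds: Seq[Int]) {
      // Stand-in for "initialize the accessors this group owns".
      def initialize(): Seq[String] = columnIds.map(i => s"accessor$i")
    }

    object SplitAcrossClasses {
      def main(args: Array[String]): Unit = {
        val numColumns = 10000     // past the reported 8117-column limit
        val columnsPerGroup = 1000 // keeps each helper comfortably small
        val groups = (0 until numColumns).grouped(columnsPerGroup).map(new AccessorGroup(_)).toVector
        val accessors = groups.flatMap(_.initialize())
        assert(accessors.size == numColumns)
      }
    }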

@breakdawn (Author)

@rxin @srowen Is there anything I should follow up on?

@srowen (Member) commented Jul 26, 2016

Jenkins test this please

@SparkQA commented Jul 26, 2016

Test build #62876 has finished for PR 14324 at commit 0d6c29b.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -227,7 +227,8 @@ class InMemoryColumnarQuerySuite extends QueryTest with SharedSQLContext {
val columnTypes1 = List.fill(length1)(IntegerType)
val columnarIterator1 = GenerateColumnAccessor.generate(columnTypes1)

val length2 = 10000
//SPARK-16664: the limit of janino is 8117
Review comment (Member):

Oh, this needs a space after //

Reply (Author):

@srowen Sorry for that.

@srowen (Member) commented Jul 27, 2016

Jenkins retest this please

@srowen (Member) commented Jul 27, 2016

Jenkins add to whitelist

@SparkQA commented Jul 27, 2016

Test build #62920 has finished for PR 14324 at commit b3f60fa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Jul 29, 2016
## What changes were proposed in this pull request?

f12f11e introduced this bug, missed foreach as map

## How was this patch tested?

Test added

Author: Wesley Tang <tangmingjun@mininglamp.com>

Closes #14324 from breakdawn/master.

(cherry picked from commit d1d5069)
Signed-off-by: Sean Owen <sowen@cloudera.com>
@asfgit closed this in d1d5069 on Jul 29, 2016
asfgit pushed a commit that referenced this pull request Jul 29, 2016
f12f11e introduced this bug, missed foreach as map

Test added

Author: Wesley Tang <tangmingjun@mininglamp.com>

Closes #14324 from breakdawn/master.

(cherry picked from commit d1d5069)
Signed-off-by: Sean Owen <sowen@cloudera.com>
@srowen (Member) commented Jul 29, 2016

Merged to master/2.0/1.6

@srowen (Member) commented Jul 29, 2016

Darn, this breaks 1.6, because the test doesn't compile. I'll revert it in 1.6. @breakdawn if you're willing, could you open a PR vs 1.6 that updates the test to work in that branch?

zzcclp pushed a commit to zzcclp/spark that referenced this pull request Jul 29, 2016
f12f11e introduced this bug, missed foreach as map

Test added

Author: Wesley Tang <tangmingjun@mininglamp.com>

Closes apache#14324 from breakdawn/master.

(cherry picked from commit d1d5069)
Signed-off-by: Sean Owen <sowen@cloudera.com>
(cherry picked from commit 15abbf9)
@breakdawn (Author)

@srowen sure, please refer to #14404
