[SPARK-16664][SQL] Fix persist call on Data frames with more than 200… #14324

Closed
breakdawn wants to merge 4 commits into apache:master from breakdawn:master

Conversation

@breakdawn commented Jul 23, 2016

What changes were proposed in this pull request?

Commit f12f11e introduced this bug by using map where foreach was needed.

How was this patch tested?

Test added
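
For readers unfamiliar with the pitfall being fixed here: on a Scala Iterator, map is lazy, so using it purely for side effects silently does nothing unless the result is consumed, whereas foreach runs eagerly. A minimal illustration in plain Scala (not Spark's actual generated-accessor code):

    object IteratorMapPitfall {
      def main(args: Array[String]): Unit = {
        val seen = scala.collection.mutable.ArrayBuffer.empty[Int]

        // Iterator.map is lazy: the body never runs because the mapped iterator
        // is discarded without being consumed, so the side effects are lost.
        Iterator(1, 2, 3).map(x => seen += x)
        assert(seen.isEmpty)

        // foreach evaluates its body eagerly for every element.
        Iterator(1, 2, 3).foreach(x => seen += x)
        assert(seen == Seq(1, 2, 3))
      }
    }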

@rxin (Contributor) commented Jul 23, 2016

Can you add a test case?

@breakdawn (Author)

Yes, working on that

@lw-lin (Contributor) commented Jul 23, 2016

@breakdawn it'd be great to do more testing before opening a pull request. I'm also investigating this, and I found that the same fix works for 201 columns but fails for 8118; the exact limit is 8117.

@breakdawn (Author)

@lw-lin You're right, thanks for your suggestion.

@breakdawn (Author)

The 8118-column limit is due to janino; the exception looks like the following and might be a separate issue:
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:889)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:941)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:938)
at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
... 25 more
Caused by: java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:197)
at java.io.DataInputStream.readFully(DataInputStream.java:169)
at org.codehaus.janino.util.ClassFile.loadAttribute(ClassFile.java:1509)
at org.codehaus.janino.util.ClassFile.loadAttributes(ClassFile.java:644)
at org.codehaus.janino.util.ClassFile.loadFields(ClassFile.java:623)
at org.codehaus.janino.util.ClassFile.<init>(ClassFile.java:280)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:914)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:912)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.recordCompilationStats(CodeGenerator.scala:912)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:884)
... 29 more
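
For reference, a rough repro sketch of this separate >8117-column failure, modeled on the InMemoryColumnarQuerySuite snippet quoted later in this thread; the import path is assumed, and the exact threshold behavior is taken from this discussion, so treat the wiring as illustrative:

    // Run inside InMemoryColumnarQuerySuite (or code in the same package),
    // since it drives GenerateColumnAccessor directly.
    import org.apache.spark.sql.execution.columnar.GenerateColumnAccessor // import path assumed
    import org.apache.spark.sql.types.IntegerType

    // 8117 columns was reported to compile fine...
    GenerateColumnAccessor.generate(List.fill(8117)(IntegerType))

    // ...while 8118 columns was reported to fail with the janino EOFException above.
    GenerateColumnAccessor.generate(List.fill(8118)(IntegerType))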

@lw-lin (Contributor) commented Jul 23, 2016

@breakdawn yes that's a different issue and I've been looking into it.

Regarding what this PR tries to fix, could you run this PR's change against this test case to see whether it's sufficient?

@@ -1571,4 +1571,12 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
checkAnswer(joined, Row("x", null, null))
checkAnswer(joined.filter($"new".isNull), Row("x", null, null))
}

test("SPARK-16664: persist with more than 200 columns") {
val size = 201l
Review comment (Member):

Nit: write 201L for a long literal; it's too easy to read this as 2011.
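
For context, here is a sketch of how the test quoted above might be completed. The row construction and assertion below are illustrative, not necessarily the exact code that was merged, and it assumes DataFrameSuite's existing imports (Row and the types in org.apache.spark.sql.types):

    test("SPARK-16664: persist with more than 200 columns") {
      val size = 201L
      // One row with `size` long columns, just past the 200-column boundary in the PR title.
      val rdd = sparkContext.makeRDD(Seq(Row.fromSeq(Seq.range(0L, size))))
      val schema = StructType(List.range(0, size.toInt).map(i => StructField("name" + i, LongType, nullable = true)))
      val df = spark.createDataFrame(rdd, schema)
      // Before the fix, values read back from the persisted wide DataFrame were lost.
      assert(df.persist.take(1).apply(0).toSeq(100).asInstanceOf[Long] == 100)
    }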

@srowen (Member) commented Jul 23, 2016

There are actually 55 occurrences of this type of problem in the code base. I think I will open a PR separately to fix them. It might or might not cause a problem in practice in other cases, but many are in examples or tests, where we might not observe the consequence.
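
To make the "might or might not cause a problem" point concrete, a small plain-Scala illustration: on strict collections the side effects still happen and only a throwaway result is built, but on lazy iterators they are dropped entirely.

    object EagerVsLazyMap {
      def main(args: Array[String]): Unit = {
        val seen = scala.collection.mutable.ArrayBuffer.empty[Int]

        // On a strict collection such as List, map is eager: the side effects
        // run anyway, and the discarded result list is the only cost.
        List(1, 2, 3).map(x => seen += x)
        assert(seen == Seq(1, 2, 3))

        // On a lazy Iterator, nothing runs until the result is consumed,
        // which is the failure mode this PR fixes.
        Iterator(4, 5, 6).map(x => seen += x)
        assert(seen == Seq(1, 2, 3))
      }
    }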

@breakdawn (Author)

@lw-lin Umm, thanks for pointing it out. Since the limit is 8117, the 10000-column case will fail, so that test needs an update.

@lw-lin (Contributor) commented Jul 24, 2016

@breakdawn what else can we do to actually fix the ≥ 8118-column issue? We're running out of constant pool entries when compiling the generated code. Maybe compile it into multiple classes? Or just fall back to the non-code-gen path? Thanks.

@breakdawn (Author)

@lw-lin Personally, the multiple-classes approach fits the current implementation more smoothly. Either way it's a big change, so it may be better to open a separate JIRA issue to gather more discussion.
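
Purely to illustrate the shape of the "multiple classes" idea (this is not Spark's codegen; every name below is invented): give each helper class a bounded slice of the columns so that no single class, and therefore no single constant pool, has to reference all of them.

    // Hypothetical sketch of chunking a very wide column set across helpers.
    final class AccessorGroup(columnIds: Seq[Int]) {
      // Stand-in for "initialize the accessors this group owns".
      def initialize(): Seq[String] = columnIds.map(i => s"accessor$i")
    }

    object SplitAcrossClasses {
      def main(args: Array[String]): Unit = {
        val numColumns = 10000     // past the reported 8117-column limit
        val columnsPerGroup = 1000 // keeps each helper comfortably small
        val groups = (0 until numColumns).grouped(columnsPerGroup).map(new AccessorGroup(_)).toVector
        val accessors = groups.flatMap(_.initialize())
        assert(accessors.size == numColumns)
      }
    }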

@breakdawn (Author)

@rxin @srowen Is there anything I should follow up on?

@srowen (Member) commented Jul 26, 2016

Jenkins test this please

@SparkQA commented Jul 26, 2016

Test build #62876 has finished for PR 14324 at commit 0d6c29b.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -227,7 +227,8 @@ class InMemoryColumnarQuerySuite extends QueryTest with SharedSQLContext {
val columnTypes1 = List.fill(length1)(IntegerType)
val columnarIterator1 = GenerateColumnAccessor.generate(columnTypes1)

val length2 = 10000
//SPARK-16664: the limit of janino is 8117
Review comment (Member):

Oh, this needs a space after //

Reply (Author):

@srowen Sorry for that.

@srowen (Member) commented Jul 27, 2016

Jenkins retest this please

@srowen (Member) commented Jul 27, 2016

Jenkins add to whitelist

@SparkQA commented Jul 27, 2016

Test build #62920 has finished for PR 14324 at commit b3f60fa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Jul 29, 2016
## What changes were proposed in this pull request?

f12f11e introduced this bug, missed foreach as map

## How was this patch tested?

Test added

Author: Wesley Tang <tangmingjun@mininglamp.com>

Closes #14324 from breakdawn/master.

(cherry picked from commit d1d5069)
Signed-off-by: Sean Owen <sowen@cloudera.com>
@asfgit closed this in d1d5069 on Jul 29, 2016
asfgit pushed a commit that referenced this pull request Jul 29, 2016
f12f11e introduced this bug, missed foreach as map

Test added

Author: Wesley Tang <tangmingjun@mininglamp.com>

Closes #14324 from breakdawn/master.

(cherry picked from commit d1d5069)
Signed-off-by: Sean Owen <sowen@cloudera.com>
@srowen (Member) commented Jul 29, 2016

Merged to master/2.0/1.6

@srowen (Member) commented Jul 29, 2016

Darn, this breaks 1.6, because the test doesn't compile. I'll revert it in 1.6. @breakdawn if you're willing, could you open a PR vs 1.6 that updates the test to work in that branch?

zzcclp pushed a commit to zzcclp/spark that referenced this pull request Jul 29, 2016
f12f11e introduced this bug, missed foreach as map

Test added

Author: Wesley Tang <tangmingjun@mininglamp.com>

Closes apache#14324 from breakdawn/master.

(cherry picked from commit d1d5069)
Signed-off-by: Sean Owen <sowen@cloudera.com>
(cherry picked from commit 15abbf9)
@breakdawn (Author)

@srowen sure, please refer to #14404
