[SPARK-17876] Write StructuredStreaming WAL to a stream instead of materializing all at once #15437

brkyvz · 2016-10-11T18:27:26Z

What changes were proposed in this pull request?

The CompactibleFileStreamLog materializes the whole metadata log in memory as a single String. This can fail when many files are being committed, especially during a compaction batch, because the serialized log can exceed the largest array the JVM will allocate.
You may come across stack traces that look like:

java.lang.OutOfMemoryError: Requested array size exceeds VM limit
    at java.lang.StringCoding.encode(StringCoding.java:350)
    at java.lang.String.getBytes(String.java:941)
    at org.apache.spark.sql.execution.streaming.FileStreamSinkLog.serialize(FileStreamSinkLog.scala:127)

The safer way is to write to an output stream so that we don't have to materialize a huge string.
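
To illustrate the difference, here is a minimal sketch and not the code in this patch. `Entry`, `serializeData`, `serializeAll`, and the `"v1"` version header are hypothetical stand-ins chosen for the example. The first method builds the whole log as one String before encoding it, while the second writes one entry at a time to an `OutputStream`, so only a single entry is held as a String at any moment.

```scala
import java.io.{ByteArrayOutputStream, OutputStream}
import java.nio.charset.StandardCharsets.UTF_8

object StreamingSerializeSketch {
  // Hypothetical stand-in for a metadata log entry.
  final case class Entry(path: String, size: Long)

  // Hypothetical per-entry encoding; the real log uses its own format.
  def serializeData(e: Entry): String = s"${e.path},${e.size}"

  // Materializing approach: one String for the whole log, then one byte array.
  def serializeAll(version: String, entries: Seq[Entry]): Array[Byte] =
    (version +: entries.map(serializeData)).mkString("\n").getBytes(UTF_8)

  // Streaming approach: each entry is encoded and written on its own.
  def serialize(version: String, entries: Seq[Entry], out: OutputStream): Unit = {
    out.write(version.getBytes(UTF_8))
    entries.foreach { e =>
      out.write('\n')
      out.write(serializeData(e).getBytes(UTF_8))
    }
  }
}
```

Both methods produce the same bytes. The streaming version just avoids the intermediate String and the single large array that `getBytes` would allocate for it.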

How was this patch tested?

Existing unit tests

SparkQA · 2016-10-11T20:40:20Z

Test build #66754 has finished for PR 15437 at commit 4d50be5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

brkyvz · 2016-10-11T20:58:53Z

cc @tdas @zsxwing Would one of you want to look at this?

zsxwing · 2016-10-12T20:54:47Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala

 import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.fs._
 import org.apache.hadoop.fs.permission.FsPermission
+import org.apache.hadoop.io.IOUtils


nit: use org.apache.commons.io.IOUtils instead. Hadoop's IOUtils is marked @InterfaceStability.Evolving, so it can break compatibility in a minor release.

zsxwing · 2016-10-12T20:56:59Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CompactibleFileStreamLog.scala

-    val lines = new String(bytes, UTF_8).split("\n")
-    if (lines.length == 0) {
+  override def deserialize(in: InputStream): Array[T] = {
+    val lines = IOUtils.lineIterator(in, UTF_8).asScala


nit: why not use Source.getLines, since this is Scala?
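
A minimal sketch of what this suggestion could look like, assuming the first line is a version header followed by one entry per line. The object name, return type, and error message are hypothetical and not taken from the patch. `scala.io.Source.fromInputStream(...).getLines()` returns a lazy iterator, so lines are decoded as they are consumed.

```scala
import java.io.{ByteArrayInputStream, InputStream}
import scala.io.Source

object StreamingDeserializeSketch {
  // Returns the version header and the remaining lines.
  // The tuple return type is a simplification for this example.
  def deserialize(in: InputStream): (String, Seq[String]) = {
    val lines = Source.fromInputStream(in, "UTF-8").getLines()
    if (!lines.hasNext) {
      // Hypothetical error message; the real log has its own.
      throw new IllegalStateException("log file is empty")
    }
    val version = lines.next()
    // A real implementation would parse each line into an entry here.
    (version, lines.toList)
  }
}
```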

zsxwing · 2016-10-12T20:57:33Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CompactibleFileStreamLog.scala

+      out.write('\n')
+      out.write(serializeData(data).getBytes(UTF_8))
+    }
+    out.flush()


nit: no need to flush, since the stream will be closed right away.
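
As a small illustration of the general point, using a plain `java.io.BufferedOutputStream` as a stand-in (the stream type actually used by the metadata log is not shown in this thread, so this is an assumption): closing a buffered stream flushes it first, which makes an explicit `flush()` immediately before `close()` redundant.

```scala
import java.io.{BufferedOutputStream, ByteArrayOutputStream}
import java.nio.charset.StandardCharsets.UTF_8

object CloseFlushesSketch {
  // Returns (bytes visible in the sink before close, bytes visible after close).
  def writeAndClose(payload: String): (Int, Int) = {
    val sink = new ByteArrayOutputStream()
    val out = new BufferedOutputStream(sink)
    out.write(payload.getBytes(UTF_8)) // small write stays in the buffer
    val before = sink.size()
    out.close()                        // close() flushes the buffer to the sink
    (before, sink.size())
  }
}
```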

zsxwing · 2016-10-12T20:58:02Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSinkLog.scala

@@ -17,6 +17,8 @@

 package org.apache.spark.sql.execution.streaming

+import java.io.OutputStream


nit: unused import

zsxwing · 2016-10-12T20:58:06Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSourceLog.scala

@@ -17,6 +17,7 @@

 package org.apache.spark.sql.execution.streaming

+import java.io.OutputStream


nit: unused import

zsxwing

Looks pretty good. Just some nits

brkyvz · 2016-10-12T23:27:12Z

Thanks @zsxwing, I've addressed your comments.

SparkQA · 2016-10-12T23:33:01Z

Test build #66855 has finished for PR 15437 at commit cece672.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-10-13T01:40:47Z

Test build #66856 has finished for PR 15437 at commit 9c4fe72.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zsxwing · 2016-10-13T04:40:06Z

LGTM. Thanks! Merging to master and 2.0.

…terializing all at once

## What changes were proposed in this pull request?

The CompactibleFileStreamLog materializes the whole metadata log in memory as a String. This can cause issues when there are lots of files that are being committed, especially during a compaction batch.
You may come across stacktraces that look like:

```
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at java.lang.StringCoding.encode(StringCoding.java:350)
at java.lang.String.getBytes(String.java:941)
at org.apache.spark.sql.execution.streaming.FileStreamSinkLog.serialize(FileStreamSinkLog.scala:127)
```

The safer way is to write to an output stream so that we don't have to materialize a huge string.

## How was this patch tested?

Existing unit tests

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #15437 from brkyvz/ser-to-stream.

(cherry picked from commit edeb51a)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>

brkyvz added 3 commits October 11, 2016 09:25

save

5ef06d9

ready for review

988df08

charset

4d50be5

zsxwing reviewed Oct 12, 2016

zsxwing requested changes Oct 12, 2016

address nits

cece672

shade source

9c4fe72

asfgit closed this in edeb51a Oct 13, 2016

brkyvz deleted the ser-to-stream branch February 3, 2019 20:54
