[SPARK-20682][SPARK-15474][SPARK-21791] Add new ORCFileFormat based on ORC 1.4.1 #19651

Closed · wants to merge 22 commits

Conversation

@dongjoon-hyun (Member) commented Nov 3, 2017

What changes were proposed in this pull request?

Since SPARK-2883, Apache Spark has supported Apache ORC inside the sql/hive module with a Hive dependency. This PR aims to add a new ORC data source inside sql/core and eventually replace the old ORC data source. This PR resolves the following three issues.

  • SPARK-20682: Add new ORCFileFormat based on Apache ORC 1.4.1
  • SPARK-15474: ORC data source fails to write and read back empty dataframe
  • SPARK-21791: ORC should support column names with dot

How was this patch tested?

Passes Jenkins with all existing tests and new tests for SPARK-15474 and SPARK-21791.

@SparkQA commented Nov 3, 2017

Test build #83382 has finished for PR 19651 at commit fdde274.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

retest this please

@SparkQA commented Nov 3, 2017

Test build #83407 has started for PR 19651 at commit fdde274.

@dongjoon-hyun (Member Author)

Thank you, @HyukjinKwon !

@SparkQA commented Nov 4, 2017

Test build #83431 has finished for PR 19651 at commit fdde274.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author)

Retest this please

@SparkQA commented Nov 4, 2017

Test build #83433 has finished for PR 19651 at commit fdde274.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author)

Hi, @cloud-fan and @gatorsmile.
According to the decision at #19571, I made an ORCFileFormat under sql/core again.
Could you review this PR?


object OrcUtils extends Logging {

def listOrcFiles(pathStr: String, conf: Configuration): Seq[Path] = {
Member Author:

This is moved from OrcFileOperator in sql/hive.

@@ -67,4 +67,11 @@ object OrcOptions {
"snappy" -> "SNAPPY",
"zlib" -> "ZLIB",
"lzo" -> "LZO")

// The extensions for ORC compression codecs
val extensionsForCompressionCodecNames = Map(
Member Author:

This is moved from object ORCFileFormat in sql/hive.

.filterNot(_.getName.startsWith("_"))
.filterNot(_.getName.startsWith("."))
paths
def setRequiredColumns(
Member Author:

This is moved from object ORCFileFormat inside sql/hive.


private[orc] def readSchema(sparkSession: SparkSession, files: Seq[FileStatus])
: Option[StructType] = {
val conf = sparkSession.sparkContext.hadoopConfiguration
Contributor:

sparkSession.sessionState.newHadoopConf

Member Author:

Sure!
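
For context on this suggestion: `sparkSession.sparkContext.hadoopConfiguration` is the shared, SparkContext-level configuration, while `sessionState.newHadoopConf()` returns a fresh copy with the session's SQL-level overrides applied. A minimal sketch of the swap:

```scala
// Sketch only: prefer the session-scoped configuration so per-session options
// (e.g. spark.hadoop.* set on this SparkSession) are picked up.
val sharedConf  = sparkSession.sparkContext.hadoopConfiguration  // global, shared across sessions
val sessionConf = sparkSession.sessionState.newHadoopConf()      // copy + session-level overrides
```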

private[orc] def readSchema(sparkSession: SparkSession, files: Seq[FileStatus])
: Option[StructType] = {
val conf = sparkSession.sparkContext.hadoopConfiguration
files.map(_.getPath).flatMap(readSchema(_, conf)).headOption.map { schema =>
Contributor:

shouldn't we do schema merging?

Member Author:

Later, I will implement schema merging in a parallel manner like Parquet.
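
A rough sketch of what Parquet-style parallel merging could look like for that follow-up. `mergeSchemasInParallel` and `readSingleFileSchema` are hypothetical names, the naive field-union merge stands in for real type-conflict resolution, and `SerializableConfiguration` is Spark-internal, so this assumes the code lives inside Spark's codebase:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, Path}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType
import org.apache.spark.util.SerializableConfiguration

object OrcSchemaMergeSketch {
  // `readSingleFileSchema` stands in for the per-file schema reader added in this PR.
  def mergeSchemasInParallel(
      spark: SparkSession,
      files: Seq[FileStatus],
      readSingleFileSchema: (Path, Configuration) => Option[StructType]): Option[StructType] = {
    if (files.isEmpty) {
      None
    } else {
      val conf = new SerializableConfiguration(spark.sessionState.newHadoopConf())
      val paths = files.map(_.getPath.toString)
      val numTasks = math.min(paths.length, spark.sparkContext.defaultParallelism)
      val partialSchemas = spark.sparkContext
        .parallelize(paths, numTasks)
        .flatMap(p => readSingleFileSchema(new Path(p), conf.value))
        .collect()
      // Naive merge: union fields by name, keeping the first occurrence of each name.
      // A production version needs Parquet-style type conflict resolution.
      partialSchemas.reduceOption { (left, right) =>
        val seen = left.fieldNames.toSet
        StructType(left.fields ++ right.fields.filterNot(f => seen.contains(f.name)))
      }
    }
  }
}
```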

true
}

override def buildReaderWithPartitionValues(
Contributor:

We should override buildReader and return GenericInternalRow here. Then the parent class will merge in the partition values and output UnsafeRow. This is what the current OrcFileFormat does, so let's keep that behavior for now.

Member Author:

Yep. I see. It was because I preferred to be consistent with ParquetFileFormat here.
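
For reference, the shape being asked for is roughly the fragment below (body elided, placed inside the new OrcFileFormat; not the PR's final code):

```scala
// Emit plain GenericInternalRows containing only the data columns; the default
// FileFormat.buildReaderWithPartitionValues then appends the partition values
// and produces UnsafeRows.
override def buildReader(
    sparkSession: SparkSession,
    dataSchema: StructType,
    partitionSchema: StructType,
    requiredSchema: StructType,
    filters: Seq[Filter],
    options: Map[String, String],
    hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow] = {
  (file: PartitionedFile) => {
    // Open an ORC reader for `file`, push down `filters`, and deserialize each
    // OrcStruct into a GenericInternalRow covering `requiredSchema` only.
    Iterator.empty // placeholder for the actual reading logic
  }
}
```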


val convertibleFilters = for {
filter <- filters
_ <- buildSearchArgument(dataTypeMap, filter, SearchArgumentFactory.newBuilder())
Contributor:

why call this function inside a loop? Can we put it at the beginning?

Member Author:

This is a two-step approach that first validates that each individual filter is convertible on its own.
I'll add a comment referencing SPARK-12218.

@@ -67,4 +67,11 @@ object OrcOptions {
"snappy" -> "SNAPPY",
"zlib" -> "ZLIB",
"lzo" -> "LZO")

// The extensions for ORC compression codecs
val extensionsForCompressionCodecNames = Map(
Contributor:

This doesn't belong in OrcOptions; maybe OrcUtils?

Member Author:

It's moved to OrcUtils.
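
For reference, the moved map looks roughly like this (it selects the file-name extension for each codec, e.g. `part-*.snappy.orc`); see the old sql/hive OrcFileFormat for the authoritative version:

```scala
// Approximate contents, shown for illustration.
val extensionsForCompressionCodecNames = Map(
  "NONE" -> "",
  "SNAPPY" -> ".snappy",
  "ZLIB" -> ".zlib",
  "LZO" -> ".lzo")
```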

override def write(row: InternalRow): Unit = {
recordWriter.write(
NullWritable.get,
OrcUtils.convertInternalRowToOrcStruct(
Contributor:

Ideally we should make this into a function and use it in write, like the old OrcOutputWriter did.
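
A minimal sketch of the suggested shape, assuming the OrcSerializer added later in this PR (names and constructor arguments may differ from the final code):

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapreduce.TaskAttemptContext
import org.apache.orc.mapred.OrcStruct
import org.apache.orc.mapreduce.OrcOutputFormat
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.datasources.OutputWriter
import org.apache.spark.sql.types.StructType

private[orc] class OrcOutputWriter(
    path: String,
    dataSchema: StructType,
    context: TaskAttemptContext) extends OutputWriter {

  // Built once per task and reused for every row, instead of converting inline in write().
  private[this] val serializer = new OrcSerializer(dataSchema)

  private val recordWriter =
    new OrcOutputFormat[OrcStruct]() {
      override def getDefaultWorkFile(ctx: TaskAttemptContext, extension: String): Path =
        new Path(path)
    }.getRecordWriter(context)

  override def write(row: InternalRow): Unit =
    recordWriter.write(NullWritable.get, serializer.serialize(row))

  override def close(): Unit = recordWriter.close(context)
}
```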

import org.apache.spark.internal.Logging
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
import org.apache.spark.sql.execution.datasources.orc.OrcUtils
import org.apache.spark.sql.hive.HiveShim
import org.apache.spark.sql.types.StructType

private[hive] object OrcFileOperator extends Logging {
Contributor:

shall we merge this class to OrcUtils?

Member Author:

OrcFileOperator defines functions depending on Hive. We cannot merge these functions into sql/core.

import org.apache.hadoop.hive.ql.io.orc.{OrcFile, Reader}
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector

* Convert Apache ORC OrcStruct to Apache Spark InternalRow.
* If internalRow is not None, fill into it. Otherwise, create a SpecificInternalRow and use it.
*/
private[orc] def convertOrcStructToInternalRow(
Contributor:

Like the old ORC format, can we create an OrcSerializer to encapsulate this serialization logic?

Member Author:

Thanks. It's done.

@dongjoon-hyun (Member Author)

Thank you so much for the review, @cloud-fan. I'll try to update the PR tonight.

@HyukjinKwon (Member) commented Nov 7, 2017

@dongjoon-hyun, btw, if I understood correctly,

Note that this PR intentionally removes old ORCFileFormat to demonstrate a complete replacement. We will bring back the old ORCFileFormat and make them switchable in SPARK-20728

we don't necessarily need to remove the old sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala itself when this is ready for merging?

(I said this because I'd like to keep the blame easy to track if possible).

@dongjoon-hyun (Member Author)

Right, @HyukjinKwon. I'll follow the final decision on this PR.

@SparkQA commented Nov 7, 2017

Test build #83543 has finished for PR 19651 at commit f644c6a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author)

The PR is updated according to your advice. Thank you again, @cloud-fan !

@dongjoon-hyun (Member Author)

Hi, @cloud-fan and @gatorsmile .
Could you review this PR?

@dongjoon-hyun (Member Author)

Retest this please.

@SparkQA commented Nov 10, 2017

Test build #83669 has finished for PR 19651 at commit f644c6a.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author)

Retest this please.

private[this] val valueWrappers = requiredSchema.fields.map(f => getValueWrapper(f.dataType))

def deserialize(writable: OrcStruct): InternalRow = {
convertOrcStructToInternalRow(writable, dataSchema, requiredSchema,
Contributor:

Can you follow the code style in OrcFileFormat.unwrapOrcStructs? Basically, create an unwrapper for each field, where an unwrapper is a (Any, InternalRow, Int) => Unit.

Contributor:

your implementation here doesn't consider boxing for primitive types at all.

Member Author:

We use a valueWrapper for each field here. Do you mean changing the name?

Contributor:

Your wrapper returns a value, while the old implementation's wrapper sets the value into the InternalRow, which avoids boxing.

Member Author:

Oh. I see.
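
To make the boxing point concrete, here is a small sketch of the two shapes being contrasted, using IntWritable as the ORC-side value of an int column:

```scala
import org.apache.hadoop.io.IntWritable
import org.apache.spark.sql.catalyst.InternalRow

object UnwrapperSketch {
  // Value-returning wrapper: the extracted Int is boxed into Any for every field of every row.
  type ValueWrapper = Any => Any
  val intWrapper: ValueWrapper = value => value.asInstanceOf[IntWritable].get

  // Setter-style unwrapper, as in the old OrcFileFormat.unwrapOrcStructs: write the primitive
  // straight into the target row at the given ordinal, so no boxing happens.
  type Unwrapper = (Any, InternalRow, Int) => Unit
  val intUnwrapper: Unwrapper = (value, row, ordinal) =>
    row.setInt(ordinal, value.asInstanceOf[IntWritable].get)
}
```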


val broadcastedConf =
sparkSession.sparkContext.broadcast(new SerializableConfiguration(hadoopConf))
val resolver = sparkSession.sessionState.conf.resolver
Contributor:

nit: we can use sparkSession.sessionState.conf.isCaseSensitive here, as it's much cheaper than serializing a function.

Member Author:

Sure!


for {
conjunction <- convertibleFilters.reduceOption(org.apache.spark.sql.sources.And)
builder <- buildSearchArgument(dataTypeMap, conjunction, SearchArgumentFactory.newBuilder())
Contributor:

Do you mean that even if each individual filter is convertible, the final filter (combining the filters with And) may be unconvertible?

Member Author:

Your previous question was about line 40.

why call this function inside a loop? Can we put it at the beginning?

+    val convertibleFilters = for {
+      filter <- filters
+      _ <- buildSearchArgument(dataTypeMap, filter, SearchArgumentFactory.newBuilder())

Here, it seems you are asking about a different one.

Contributor:

ah you are just following the previous code:

    // First, tries to convert each filter individually to see whether it's convertible, and then
    // collect all convertible ones to build the final `SearchArgument`.
    val convertibleFilters = for {
      filter <- filters
      _ <- buildSearchArgument(dataTypeMap, filter, SearchArgumentFactory.newBuilder())
    } yield filter

    for {
      // Combines all convertible filters using `And` to produce a single conjunction
      conjunction <- convertibleFilters.reduceOption(And)
      // Then tries to build a single ORC `SearchArgument` for the conjunction predicate
      builder <- buildSearchArgument(dataTypeMap, conjunction, SearchArgumentFactory.newBuilder())
    } yield builder.build()

can you add back those comments?

Member Author:

Sure!

@dongjoon-hyun (Member Author)

Thank you so much, @cloud-fan .

* builder methods mentioned above can only be found in test code, where all tested filters are
* known to be convertible.
*/
private[orc] object OrcFilters {
@cloud-fan (Contributor) commented Dec 1, 2017:

I didn't review this carefully; I just assume it's the same as the old version, with the API usage updated.

Member Author:

Yes, it's logically the same as the old version. Only the API usage is updated here.

@SparkQA commented Dec 1, 2017

Test build #84378 has finished for PR 19651 at commit 74cb053.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class OrcDeserializer(
  • sealed trait CatalystDataUpdater
  • final class RowUpdater(row: InternalRow) extends CatalystDataUpdater
  • final class ArrayDataUpdater(array: ArrayData) extends CatalystDataUpdater
  • class OrcSerializer(dataSchema: StructType)
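
These classes realize the setter-style deserialization discussed above. A rough sketch of the shape (not the exact code in this PR):

```scala
import org.apache.spark.sql.catalyst.InternalRow

// One small interface for "write a value at an ordinal", with a row-backed implementation;
// an ArrayDataUpdater plays the same role for array elements in the real code.
sealed trait CatalystDataUpdater {
  def set(ordinal: Int, value: Any): Unit
  def setInt(ordinal: Int, value: Int): Unit = set(ordinal, value)
  def setLong(ordinal: Int, value: Long): Unit = set(ordinal, value)
}

final class RowUpdater(row: InternalRow) extends CatalystDataUpdater {
  override def set(ordinal: Int, value: Any): Unit = row.update(ordinal, value)
  override def setInt(ordinal: Int, value: Int): Unit = row.setInt(ordinal, value)
  override def setLong(ordinal: Int, value: Long): Unit = row.setLong(ordinal, value)
}
```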

@SparkQA commented Dec 2, 2017

Test build #84393 has finished for PR 19651 at commit 520837f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

Great, all tests pass! Let's restore the old ORC implementation and merge it.

@dongjoon-hyun (Member Author)

Sure, @cloud-fan .

@dongjoon-hyun (Member Author)

Now this PR contains only the new OrcFileFormat-related additions: 1009 insertions(+), 2 deletions(-).

$ git diff master --stat
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcDeserializer.scala | 243 +++++++++++++++++++++++++++++++++++++++++++++++++++++
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala   | 139 +++++++++++++++++++++++++++++-
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala      | 210 +++++++++++++++++++++++++++++++++++++++++++++
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcOutputWriter.scala |  53 ++++++++++++
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcSerializer.scala   | 228 +++++++++++++++++++++++++++++++++++++++++++++++++
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala        | 113 +++++++++++++++++++++++++
sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala                             |  25 ++++++
 7 files changed, 1009 insertions(+), 2 deletions(-)

@SparkQA commented Dec 3, 2017

Test build #84396 has finished for PR 19651 at commit 71be008.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author)

@cloud-fan, it passes Jenkins again. Could you take another look?

val paths = SparkHadoopUtil.get.listLeafStatuses(fs, origPath)
.filterNot(_.isDirectory)
.map(_.getPath)
.filterNot(_.getName.startsWith("_"))
Member:

nit: How about combining the two filterNot calls into one by using a single condition with two startsWith checks?

Member Author:

@kiszk, this comes from the existing code in OrcFileOperator.scala. This PR keeps the original function because I don't want to introduce any possibility of behavior differences. We had better do that kind of improvement later in a separate PR.

Member:

Thank you for your explanation, got it.
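
For reference, the combined form suggested above (deferred to a follow-up so this PR stays a faithful move of OrcFileOperator) would look roughly like:

```scala
val paths = SparkHadoopUtil.get.listLeafStatuses(fs, origPath)
  .filterNot(_.isDirectory)
  .map(_.getPath)
  // Single pass over the names instead of two separate filterNot calls.
  .filterNot(p => p.getName.startsWith("_") || p.getName.startsWith("."))
```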

@cloud-fan (Contributor)

Thanks, merging to master!

Follow-ups:

  1. Add a config to use the new ORC by default
  2. Move the ORC tests to sql/core
  3. Columnar ORC reader

@asfgit closed this in f23dddf Dec 3, 2017
@dongjoon-hyun (Member Author)

Thank you so much for making ORC move forward, @cloud-fan!
Also, thank you, @HyukjinKwon, @gatorsmile, @viirya, @kiszk.

@dongjoon-hyun deleted the SPARK-20682 branch December 3, 2017 17:13
ghost pushed a commit to dbtsai/spark that referenced this pull request Jan 17, 2018
…a sources

## What changes were proposed in this pull request?

After [SPARK-20682](apache#19651), Apache Spark 2.3 is able to read ORC files with a Unicode schema. Previously, it raised `org.apache.spark.sql.catalyst.parser.ParseException`.

This PR adds a Unicode schema test for CSV/JSON/ORC/Parquet file-based data sources. Note that TEXT data source only has [a single column with a fixed name 'value'](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala#L71).

## How was this patch tested?

Pass the newly added test case.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes apache#20266 from dongjoon-hyun/SPARK-23072.
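
A hypothetical sketch of such a test inside a Spark SQL test suite; `withTempDir`, `checkAnswer`, and `testImplicits` come from Spark's test utilities, and the actual SPARK-23072 test may differ:

```scala
test("SPARK-23072: file-based data sources support Unicode column names") {
  import testImplicits._
  Seq("csv", "json", "orc", "parquet").foreach { format =>
    withTempDir { dir =>
      val path = new java.io.File(dir, format).getCanonicalPath
      val df = Seq((1, "a")).toDF("col1", "데이터")  // non-ASCII column name
      df.write.format(format).option("header", true).save(path)
      val readBack = spark.read.format(format)
        .option("header", true).schema(df.schema).load(path)
      // Round-trip check: the Unicode column name and the data survive write/read.
      checkAnswer(readBack, df)
    }
  }
}
```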
asfgit pushed a commit that referenced this pull request Jan 17, 2018 (same commit message as above)
Closes #20266 from dongjoon-hyun/SPARK-23072.

(cherry picked from commit a0aedb0)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>