[SPARK-32364][SQL] Use CaseInsensitiveMap for DataFrameReader/Writer options #29160

dongjoon-hyun · 2020-07-20T06:04:34Z

What changes were proposed in this pull request?

When a user has multiple options such as path, paTH, and PATH for the same key path, the result of option/options is non-deterministic because extraOptions is a HashMap. This PR uses CaseInsensitiveMap instead of HashMap to fix this bug at its root.
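To make the intended behavior concrete, here is a minimal sketch of a map in which keys that differ only by case replace each other, so the last write wins. It assumes a hypothetical `CaseInsensitiveOptions` class written for this illustration; it is not Spark's `CaseInsensitiveMap`, only a model of the idea described above.

```scala
import java.util.Locale

// Hypothetical, simplified model of a case-insensitive options map.
// This is an illustration only, not Spark's CaseInsensitiveMap.
final case class CaseInsensitiveOptions(originalMap: Map[String, String] = Map.empty) {

  private def normalize(key: String): String = key.toLowerCase(Locale.ROOT)

  // Lookup ignores the case of the requested key.
  def get(key: String): Option[String] =
    originalMap.collectFirst { case (k, v) if normalize(k) == normalize(key) => v }

  // Drop any existing key that differs only by case before adding the new
  // pair, so the most recent write always wins.
  def updated(key: String, value: String): CaseInsensitiveOptions =
    CaseInsensitiveOptions(
      originalMap.filterNot { case (k, _) => normalize(k) == normalize(key) } + (key -> value))
}

object CaseInsensitiveOptionsDemo {
  def main(args: Array[String]): Unit = {
    val options = CaseInsensitiveOptions()
      .updated("paTh", "1")
      .updated("PATH", "2")
      .updated("Path", "3")
      .updated("patH", "4")

    // Only the last spelling and value remain, regardless of hashing order.
    assert(options.originalMap == Map("patH" -> "4"))
    assert(options.get("path") == Some("4"))
  }
}
```

Keeping the original spelling of the last key (instead of lower-casing it) is a design choice in this sketch; it lets callers that need case-sensitive key names still see what the user typed.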

Why are the changes needed?

As the following example shows, DataFrame's option/options have been non-deterministic with respect to key case because the options are stored in extraOptions, which is a HashMap.

spark.read
  .option("paTh", "1")
  .option("PATH", "2")
  .option("Path", "3")
  .option("patH", "4")
  .load("5")
...
org.apache.spark.sql.AnalysisException:
Path does not exist: file:/.../1;
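The example above ends up loading path 1 rather than the last value set. The following sketch shows one way such ambiguity can arise, assuming the options are first stored in a case-sensitive mutable `HashMap` and only later collapsed to case-insensitive keys. The object `HashMapOptionsDemo` and its helper names are hypothetical and exist only for this illustration.

```scala
import java.util.Locale
import scala.collection.mutable

// Hypothetical illustration; the names here are not part of Spark.
object HashMapOptionsDemo {

  // Store every option in a case-sensitive mutable HashMap, as a stand-in
  // for the old extraOptions field.
  def collect(pairs: Seq[(String, String)]): mutable.HashMap[String, String] = {
    val extraOptions = new mutable.HashMap[String, String]
    pairs.foreach { case (k, v) => extraOptions += (k -> v) }
    extraOptions
  }

  // Collapse keys case-insensitively. When several keys map to the same
  // lower-cased key, the entry visited last during iteration wins, and
  // HashMap iteration order is unrelated to insertion order.
  def collapse(options: collection.Map[String, String]): Map[String, String] =
    options.iterator.map { case (k, v) => k.toLowerCase(Locale.ROOT) -> v }.toMap

  def main(args: Array[String]): Unit = {
    val stored = collect(Seq("paTh" -> "1", "PATH" -> "2", "Path" -> "3", "patH" -> "4"))
    // All four spellings are kept as distinct keys.
    assert(stored.size == 4)
    // Exactly one value survives, but which one depends on hashing order.
    println(collapse(stored))
  }
}
```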

Does this PR introduce any user-facing change?

Yes. However, this is a bug fix for the non-deterministic cases.

How was this patch tested?

Pass Jenkins or GitHub Actions with the newly added test cases.

dongjoon-hyun · 2020-07-20T06:17:55Z

Could you review this please, @cloud-fan ?

sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala

dongjoon-hyun · 2020-07-22T08:02:40Z

Hi, @cloud-fan and @HyukjinKwon .
This PR aims to fix this problem completely. Please review the approach.

sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala

dongjoon-hyun · 2020-07-22T08:55:06Z

I updated the PR according to your comment, @cloud-fan . It looks much better indeed. Thank you so much.

sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala

sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala

cloud-fan

LGTM except one comment

dongjoon-hyun · 2020-07-22T09:15:50Z

Thank you, @cloud-fan . The PR is updated to use Seq instead of Map.

dongjoon-hyun · 2020-07-22T11:14:45Z

Retest this please.

dongjoon-hyun · 2020-07-22T11:18:04Z

Thank you, @HyukjinKwon .

…options

### What changes were proposed in this pull request?

When a user has multiple options like `path`, `paTH`, and `PATH` for the same key `path`, `option/options` is non-deterministic because `extraOptions` is a `HashMap`. This PR aims to use `CaseInsensitiveMap` instead of `HashMap` to fix this bug fundamentally.

### Why are the changes needed?

Like the following, DataFrame's `option/options` have been non-deterministic in terms of case-insensitivity because the options are stored in `extraOptions`, which is a `HashMap`.

```scala
spark.read
  .option("paTh", "1")
  .option("PATH", "2")
  .option("Path", "3")
  .option("patH", "4")
  .load("5")
...
org.apache.spark.sql.AnalysisException:
Path does not exist: file:/.../1;
```

### Does this PR introduce _any_ user-facing change?

Yes. However, this is a bug fix for the non-deterministic cases.

### How was this patch tested?

Pass the Jenkins or GitHub Action with newly added test cases.

Closes #29160 from dongjoon-hyun/SPARK-32364.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit cd16a10)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

dongjoon-hyun · 2020-07-22T14:59:57Z

Merged to master/3.0 since GitHub Action passed. Thank you all. I'll make a backporting PR to branch-2.4.

SparkQA · 2020-07-22T17:27:36Z

Test build #126332 has finished for PR 29160 at commit 7ca5da6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

…mentation

### What changes were proposed in this pull request?

This is a follow-up of #29160. The non-determinism has already been removed. This PR aims to do the following for the existing code base.

1. Add explicit documentation to `DataFrameReader/DataFrameWriter`.
2. Add `toMap` to `CaseInsensitiveMap` in order to return `originalMap: Map[String, T]` because it's more consistent with the existing `case-sensitive key names` behavior for the existing code pattern like `AppendData.byName(..., extraOptions.toMap)`. Previously, it was `HashMap.toMap`.
3. During (2), we need to change the following to keep the original logic using `CaseInsensitiveMap.++`.

   ```scala
   - val params = extraOptions.toMap ++ connectionProperties.asScala.toMap
   + val params = extraOptions ++ connectionProperties.asScala
   ```
4. Additionally, use `.toMap` in the following because `dsOptions.asCaseSensitiveMap()` is used later.

   ```scala
   - val options = sessionOptions ++ extraOptions
   + val options = sessionOptions.filterKeys(!extraOptions.contains(_)) ++ extraOptions.toMap
     val dsOptions = new CaseInsensitiveStringMap(options.asJava)
   ```

### Why are the changes needed?

`extraOptions.toMap` is used in several places (e.g. `DataFrameReader`) to hand over `Map[String, T]`. In this case, `CaseInsensitiveMap[T] private (val originalMap: Map[String, T])` had better return `originalMap`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the Jenkins or GitHub Action with the existing tests and the newly added test case in `JDBCSuite`.

Closes #29191 from dongjoon-hyun/SPARK-32364-3.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
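The merge of session options with per-call options described in this follow-up can be sketched with plain Scala maps. This is only an approximation under stated assumptions: `MergeOptionsSketch`, `containsIgnoreCase`, and the sample keys and values are hypothetical, the helper stands in for a case-insensitive `contains`, and `filter` is used in place of `filterKeys` so the sketch behaves the same across Scala versions.

```scala
// Hypothetical illustration; the names here are not part of Spark.
object MergeOptionsSketch {

  // Case-insensitive membership test, standing in for `contains` on a
  // case-insensitive map.
  def containsIgnoreCase(options: Map[String, String], key: String): Boolean =
    options.keys.exists(_.equalsIgnoreCase(key))

  // Drop every session option that the user overrode under any spelling,
  // then add the user's options with their original key names.
  def merge(
      sessionOptions: Map[String, String],
      extraOptions: Map[String, String]): Map[String, String] =
    sessionOptions.filter { case (k, _) => !containsIgnoreCase(extraOptions, k) } ++ extraOptions

  def main(args: Array[String]): Unit = {
    val session = Map("Path" -> "/from/session", "mergeSchema" -> "true")
    val extra = Map("path" -> "/from/user")

    // A plain `++` keeps both "Path" and "path", which is ambiguous once the
    // result is read case-insensitively.
    assert((session ++ extra).size == 3)

    // Filtering first leaves a single, user-provided entry for the key.
    assert(merge(session, extra) == Map("mergeSchema" -> "true", "path" -> "/from/user"))
  }
}
```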

…mentation

### What changes were proposed in this pull request?

This is a follow-up of #29160. The non-determinism has already been removed. This PR aims to do the following for the existing code base.

1. Add explicit documentation to `DataFrameReader/DataFrameWriter`.
2. Add `toMap` to `CaseInsensitiveMap` in order to return `originalMap: Map[String, T]` because it's more consistent with the existing `case-sensitive key names` behavior for the existing code pattern like `AppendData.byName(..., extraOptions.toMap)`. Previously, it was `HashMap.toMap`.
3. During (2), we need to change the following to keep the original logic using `CaseInsensitiveMap.++`.

   ```scala
   - val params = extraOptions.toMap ++ connectionProperties.asScala.toMap
   + val params = extraOptions ++ connectionProperties.asScala
   ```
4. Additionally, use `.toMap` in the following because `dsOptions.asCaseSensitiveMap()` is used later.

   ```scala
   - val options = sessionOptions ++ extraOptions
   + val options = sessionOptions.filterKeys(!extraOptions.contains(_)) ++ extraOptions.toMap
     val dsOptions = new CaseInsensitiveStringMap(options.asJava)
   ```

### Why are the changes needed?

`extraOptions.toMap` is used in several places (e.g. `DataFrameReader`) to hand over `Map[String, T]`. In this case, `CaseInsensitiveMap[T] private (val originalMap: Map[String, T])` had better return `originalMap`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the Jenkins or GitHub Action with the existing tests and the newly added test case in `JDBCSuite`.

Closes #29191 from dongjoon-hyun/SPARK-32364-3.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit aed8dba)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

…iter options

### What changes were proposed in this pull request?

This PR is a backport of SPARK-32364 (#29160, #29191).

When a user has multiple options like `path`, `paTH`, and `PATH` for the same key `path`, `option/options` is non-deterministic because `extraOptions` is a `HashMap`. This PR aims to use `CaseInsensitiveMap` instead of `HashMap` to fix this bug fundamentally.

Like the following, DataFrame's `option/options` have been non-deterministic in terms of case-insensitivity because the options are stored in `extraOptions`, which is a `HashMap`.

```scala
spark.read
  .option("paTh", "1")
  .option("PATH", "2")
  .option("Path", "3")
  .option("patH", "4")
  .load("5")
...
org.apache.spark.sql.AnalysisException:
Path does not exist: file:/.../1;
```

Also, this PR adds the following.

1. Add explicit documentation to `DataFrameReader/DataFrameWriter`.
2. Add `toMap` to `CaseInsensitiveMap` in order to return `originalMap: Map[String, T]` because it's more consistent with the existing `case-sensitive key names` behavior for the existing code pattern like `AppendData.byName(..., extraOptions.toMap)`. Previously, it was `HashMap.toMap`.
3. During (2), we need to change the following to keep the original logic using `CaseInsensitiveMap.++`.

   ```scala
   - val params = extraOptions.toMap ++ connectionProperties.asScala.toMap
   + val params = extraOptions ++ connectionProperties.asScala
   ```
4. Additionally, use `.toMap` in the following because `dsOptions.asCaseSensitiveMap()` is used later.

   ```scala
   - val options = sessionOptions ++ extraOptions
   + val options = sessionOptions.filterKeys(!extraOptions.contains(_)) ++ extraOptions.toMap
     val dsOptions = new CaseInsensitiveStringMap(options.asJava)
   ```

`extraOptions.toMap` is used in several places (e.g. `DataFrameReader`) to hand over `Map[String, T]`. In this case, `CaseInsensitiveMap[T] private (val originalMap: Map[String, T])` had better return `originalMap`.

### Why are the changes needed?

This will fix the non-deterministic behavior.

### Does this PR introduce _any_ user-facing change?

Yes.

### How was this patch tested?

Pass the Jenkins with the existing tests and the newly added test cases.

Closes #29209 from dongjoon-hyun/SPARK-32364-2.4.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

### What changes were proposed in this pull request?

This is a follow-up of #29160. This allows Spark SQL project to compile for Scala 2.13.

### Why are the changes needed?

It's needed for #28545

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

I compiled with Scala 2.13. It fails in `Spark REPL` project, which will be fixed by #28545

Closes #29584 from karolchmist/SPARK-32364-scala-2.13.

Authored-by: Karol Chmist <info+github@chmist.com>
Signed-off-by: Sean Owen <srowen@gmail.com>

cloud-fan · 2020-09-09T07:31:21Z

@dongjoon-hyun shall we fix the issue in DataStreamReader/Writer as well? cc @HeartSaVioR

dongjoon-hyun · 2020-09-09T07:42:14Z

Sure, I'll make a PR for that tomorrow, @cloud-fan .

dongjoon-hyun · 2020-09-09T08:53:04Z

BTW, @cloud-fan . Is @HeartSaVioR working on that? I'm wondering why you pinged him on that task.

HeartSaVioR · 2020-09-09T11:36:07Z

I guess he pinged me to ask for a review of your next PR, as it would probably be an SS-related change.

cloud-fan · 2020-09-09T14:41:43Z

Yea, I was wanting @HeartSaVioR to review your PR :)

dongjoon-hyun · 2020-09-09T16:25:36Z

Got it. If the PR is ready, I'll ping both of you. :)

probot-autolabeler bot added the SQL label Jul 20, 2020

HyukjinKwon reviewed Jul 20, 2020

sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala

dongjoon-hyun changed the title ~~[SPARK-32364][SQL] path argument of DataFrame.load/save should override the existing options~~ [SPARK-32364][SQL] Use CaseInsensitiveMap for DataFrameReader/Writer options Jul 22, 2020

[SPARK-32364][SQL] Use CaseInsensitiveMap for DataFrameReader/Writer …

4019789

…options

cloud-fan reviewed Jul 22, 2020

sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala

Address comments

4de7852

cloud-fan reviewed Jul 22, 2020

sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala

cloud-fan reviewed Jul 22, 2020

sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala

cloud-fan approved these changes Jul 22, 2020

Use updated function explicitly for Scala 2.13

2b47cd6

Use Seq instead of Map.

16eaf45

Use originalMap instead of toMap to preserve Hadoop conf

7ca5da6

HyukjinKwon approved these changes Jul 22, 2020

dongjoon-hyun closed this in cd16a10 Jul 22, 2020

dongjoon-hyun mentioned this pull request Jul 22, 2020

[SPARK-32364][SQL][FOLLOWUP] Add toMap to return originalMap and documentation #29191

Closed

dongjoon-hyun mentioned this pull request Jul 23, 2020

[SPARK-32364][SQL][2.4] Use CaseInsensitiveMap for DataFrameReader/Writer options #29209

Closed

karolchmist mentioned this pull request Aug 30, 2020

[SPARK-32756][SQL] Fix CaseInsensitiveMap usage for Scala 2.13 #29584

Closed
