Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-23312][SQL] add a config to turn off vectorized cache reader #20483

Closed
wants to merge 1 commit into from

Conversation

cloud-fan
Copy link
Contributor

What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-23309 reported a performance regression about cached table in Spark 2.3. While the investigating is still going on, this PR adds a conf to turn off the vectorized cache reader, to unblock the 2.3 release.

How was this patch tested?

a new test

@cloud-fan
Copy link
Contributor Author

cc @kiszk @gatorsmile

Copy link
Member

@gatorsmile gatorsmile left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM pending Jenkins.

buildConf("spark.sql.inMemoryColumnarStorage.enableVectorizedReader")
.doc("Enables vectorized reader for columnar caching.")
.booleanConf
.createWithDefault(true)
Copy link
Member

@dongjoon-hyun dongjoon-hyun Feb 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To unblock 2.3, I think we need to disable this with false.
Sorry, I'm taking this back since it's too radical in general.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

internal?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the parquet/orc vectorized reader conf is also public.

@SparkQA
Copy link

SparkQA commented Feb 2, 2018

Test build #86966 has finished for PR 20483 at commit 376c855.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 2, 2018

Test build #86967 has finished for PR 20483 at commit 53d3259.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member

viirya commented Feb 2, 2018

retest this please.

@SparkQA
Copy link

SparkQA commented Feb 2, 2018

Test build #86983 has finished for PR 20483 at commit 53d3259.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@cloud-fan
Copy link
Contributor Author

thanks, merging to master/2.3!

asfgit pushed a commit that referenced this pull request Feb 2, 2018
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-23309 reported a performance regression about cached table in Spark 2.3. While the investigating is still going on, this PR adds a conf to turn off the vectorized cache reader, to unblock the 2.3 release.

## How was this patch tested?

a new test

Author: Wenchen Fan <wenchen@databricks.com>

Closes #20483 from cloud-fan/cache.

(cherry picked from commit b9503fc)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@asfgit asfgit closed this in b9503fc Feb 2, 2018
@kiszk
Copy link
Member

kiszk commented Feb 5, 2018

Thank you for adding this. I will look at the performance regression.

ghost pushed a commit to dbtsai/spark that referenced this pull request Feb 6, 2018
…e reader

## What changes were proposed in this pull request?

apache#20483 tried to provide a way to turn off the new columnar cache reader, to restore the behavior in 2.2. However even we turn off that config, the behavior is still different than 2.2.

If the output data are rows, we still enable whole stage codegen for the scan node, which is different with 2.2, we should also fix it.

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#20513 from cloud-fan/cache.
asfgit pushed a commit that referenced this pull request Feb 6, 2018
…e reader

## What changes were proposed in this pull request?

#20483 tried to provide a way to turn off the new columnar cache reader, to restore the behavior in 2.2. However even we turn off that config, the behavior is still different than 2.2.

If the output data are rows, we still enable whole stage codegen for the scan node, which is different with 2.2, we should also fix it.

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #20513 from cloud-fan/cache.

(cherry picked from commit ac7454c)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
robert3005 pushed a commit to palantir/spark that referenced this pull request Feb 12, 2018
…e reader

## What changes were proposed in this pull request?

apache#20483 tried to provide a way to turn off the new columnar cache reader, to restore the behavior in 2.2. However even we turn off that config, the behavior is still different than 2.2.

If the output data are rows, we still enable whole stage codegen for the scan node, which is different with 2.2, we should also fix it.

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#20513 from cloud-fan/cache.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants