[SPARK-28330][SQL] Support ANSI SQL: result offset clause in query expression #27429

Closed
wants to merge 13 commits

Conversation

@beliefer (Contributor) commented Feb 1, 2020

What changes were proposed in this pull request?

This implements ANSI SQL feature F861 (result offset clause in query expression).

<query expression> ::=
[ <with clause> ] <query expression body>
[ <order by clause> ] [ <result offset clause> ] [ <fetch first clause> ]

<result offset clause> ::=
OFFSET <offset row count> { ROW | ROWS }

For example:

SELECT customer_name, customer_gender FROM customer_dimension 
   WHERE occupation='Dancer' AND customer_city = 'San Francisco' ORDER BY customer_name;
    customer_name     | customer_gender
----------------------+-----------------
 Amy X. Lang          | Female
 Anna H. Li           | Female
 Brian O. Weaver      | Male
 Craig O. Pavlov      | Male
 Doug Z. Goldberg     | Male
 Harold S. Jones      | Male
 Jack E. Perkins      | Male
 Joseph W. Overstreet | Male
 Kevin . Campbell     | Male
 Raja Y. Wilson       | Male
 Samantha O. Brown    | Female
 Steve H. Gauthier    | Male
 William . Nielson    | Male
 William Z. Roy       | Male
(14 rows)

SELECT customer_name, customer_gender FROM customer_dimension 
   WHERE occupation='Dancer' AND customer_city = 'San Francisco' ORDER BY customer_name OFFSET 8;
   customer_name   | customer_gender
-------------------+-----------------
 Kevin . Campbell  | Male
 Raja Y. Wilson    | Male
 Samantha O. Brown | Female
 Steve H. Gauthier | Male
 William . Nielson | Male
 William Z. Roy    | Male
(6 rows)

Several mainstream databases support this syntax:

Druid
https://druid.apache.org/docs/latest/querying/sql.html#offset

Kylin
http://kylin.apache.org/docs/tutorial/sql_reference.html#QUERYSYNTAX

Exasol
https://docs.exasol.com/sql/select.htm

Greenplum
http://docs.greenplum.org/6-8/ref_guide/sql_commands/SELECT.html

MySQL
https://dev.mysql.com/doc/refman/5.6/en/select.html

Monetdb
https://www.monetdb.org/Documentation/SQLreference/SQLSyntaxOverview#SELECT

PostgreSQL
https://www.postgresql.org/docs/11/queries-limit.html

Sqlite
https://www.sqlite.org/lang_select.html

Vertica
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Statements/SELECT/OFFSETClause.htm?zoom_highlight=offset

Design description:
1. Treat OFFSET as a special case of LIMIT. For example:
SELECT * FROM a LIMIT 10; behaves like SELECT * FROM a LIMIT 10 OFFSET 0;
SELECT * FROM a OFFSET 10; behaves like SELECT * FROM a LIMIT -1 OFFSET 10;
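The equivalence above can be sketched outside Spark with plain Python list slicing; `limit_offset` below is a hypothetical helper for illustration only (not part of this PR), with `limit=None` standing in for LIMIT -1 (no limit):

```python
# Illustrative model of LIMIT/OFFSET semantics using list slicing.
# `limit_offset` is a hypothetical helper, not Spark code.
def limit_offset(rows, limit=None, offset=0):
    """Skip `offset` rows, then take up to `limit` rows (None = no limit)."""
    end = None if limit is None else offset + limit
    return rows[offset:end]

rows = list(range(100))
# LIMIT 10 behaves like LIMIT 10 OFFSET 0:
assert limit_offset(rows, limit=10) == limit_offset(rows, limit=10, offset=0)
# OFFSET 10 alone behaves like "no limit" plus OFFSET 10:
assert limit_offset(rows, offset=10) == rows[10:]
```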
2. The current implementation of LIMIT already performs well. For example,
SELECT * FROM a LIMIT 10; is parsed into the logical plan below:

GlobalLimit (limit = 10)
|--LocalLimit (limit = 10)

and then the physical plan as below:

GlobalLimitExec (limit = 10) // Take the first 10 rows globally
|--LocalLimitExec (limit = 10) // Take the first 10 rows locally

These operators avoid a massive shuffle and perform well.
Sometimes the logical plan is instead transformed into the physical plan:

CollectLimitExec (limit = 10) // Take the first 10 rows globally

If the SQL contains ORDER BY, such as SELECT * FROM a ORDER BY c LIMIT 10;,
it is transformed into the physical plan below:

TakeOrderedAndProjectExec (limit = 10) // Take the first 10 rows after sort globally

Based on this situation, this PR produces the following plans. For example,
SELECT * FROM a LIMIT 10 OFFSET 10; is parsed into the logical plan below:

GlobalLimit (limit = 10)
|--LocalLimit (limit = 10)
   |--Offset (offset = 10)

After optimization, the above logical plan is transformed into:

GlobalLimitAndOffset (limit = 10, offset = 10) // Limit clause accompanied by offset clause
|--LocalLimit (limit = 20)   // 10 + offset = 20

and then the physical plan as below:

GlobalLimitAndOffsetExec (limit = 10, offset = 10) // Skip the first 10 rows and take the next 10 rows globally
|--LocalLimitExec (limit = 20) // Take the first 20(limit + offset) rows locally
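A minimal sketch (plain Python, not Spark code; the partition layout below is made up) of why pruning each partition to limit + offset rows is safe: at most limit + offset rows of any single partition can survive the global skip-and-take, so the local step loses nothing:

```python
# Illustrative sketch of LIMIT 10 OFFSET 10 over partitioned data.
limit, offset = 10, 10
# Four hypothetical partitions of 30 rows each.
partitions = [list(range(p * 100, p * 100 + 30)) for p in range(4)]

# Local step: each partition keeps only its first (limit + offset) rows,
# since at most that many of its rows can reach the final answer.
local = [p[: limit + offset] for p in partitions]

# Global step: merge in partition order, skip `offset`, take `limit`.
merged = [row for part in local for row in part]
result = merged[offset : offset + limit]

# Same answer as applying LIMIT/OFFSET to the full, unpruned data:
full = [row for part in partitions for row in part]
assert result == full[offset : offset + limit]
```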

Sometimes the logical plan is instead transformed into the physical plan:

CollectLimitExec (limit = 10, offset = 10) // Skip the first 10 rows and take the next 10 rows globally

If the SQL contains ORDER BY, such as SELECT * FROM a ORDER BY c LIMIT 10 OFFSET 10;,
it is transformed into the physical plan below:

TakeOrderedAndProjectExec (limit = 10, offset = 10) // Skip the first 10 rows and take the next 10 rows after sort globally

3. In addition, there is a special case: only OFFSET with no LIMIT. For example,
SELECT * FROM a OFFSET 10; is parsed into the logical plan below:

Offset (offset = 10) // Only offset clause

If the offset is very large, this generates a lot of overhead. So this PR refuses to run an OFFSET clause without a LIMIT clause, although we can parse, transform and execute it.

A balanced compromise is to add a configuration item, spark.sql.forceUsingOffsetWithoutLimit, to force the query to run when the user knows the offset is small enough; its default value is false. This PR only raises the idea so it can be implemented at a better time in the future.

Note: the original PR for this feature is #25416. Because that PR is too old, it has massive conflicts that are hard to resolve, so I opened this new PR to support the feature.

Why are the changes needed?

New feature.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing and new unit tests.

@beliefer (Contributor, Author) commented Feb 1, 2020

@gatorsmile @cloud-fan I'm sorry for opening a new PR; I made some mistakes while operating on the old one.

@SparkQA commented Feb 1, 2020

Test build #117719 has finished for PR 27429 at commit 7316cca.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class Offset(offsetExpr: Expression, child: LogicalPlan) extends OrderPreservingUnaryNode
  • case class GlobalLimit(
  • case class LocalLimit(
  • case class CollectLimitExec(limit: Int, offset: Int, child: SparkPlan) extends LimitExec
  • case class LocalLimitExec(limit: Int, child: SparkPlan, offset: Int = 0) extends BaseLimitExec
  • case class GlobalLimitExec(limit: Int, child: SparkPlan, offset: Int = 0) extends BaseLimitExec

@SparkQA commented Feb 2, 2020

Test build #117730 has finished for PR 27429 at commit 4f22a7a.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer (Contributor, Author) commented Feb 2, 2020

retest this please

@SparkQA commented Feb 2, 2020

Test build #117736 has finished for PR 27429 at commit 4f22a7a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

LocalLimitExec (limit = 10, offset = 10) // Take the first 20(limit + offset) rows locally

This is LocalLimitExec(20); it seems we don't need to add an "offset" parameter to LocalLimitExec?

GlobalLimitExec (limit = 10, offset = 10) // Skip the first 10 rows and take the next 10 rows globally

Can we create a new operator GlobalLimitAndOffset for it? It's a bad idea to overload the semantics of an existing operator.

... force running query when user knows the offset is small enough.

Let's not add a feature that is disabled by default and not needed by most users. Maintaining a feature has cost.

For OFFSET without LIMIT, we already have an operator Tail, which fails if it's not the root node. I think root node OFFSET is the most common use case and Tail is good enough.

So a solution can be

  1. For LIMIT ... OFFSET ... , create GlobalLimitAndOffset(..., LocalLimit(...))
  2. For OFFSET only, create Tail.

@beliefer (Contributor, Author) commented Feb 3, 2020

@cloud-fan Thanks a lot!
I understand the first suggestion.
About the second suggestion, I have a question: OFFSET and Tail mean different things.
A user who uses only OFFSET just wants to skip some rows and keep the remaining ones. Users cannot use Tail instead of OFFSET because they don't know the number of remaining rows.
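The distinction can be shown with a plain Python sketch (illustrative only, not Spark code): OFFSET n drops the first n rows, while Tail(k) keeps the last k rows, and the k matching a given offset depends on the total row count, which the user does not know:

```python
# OFFSET vs Tail on a small in-memory "result set".
rows = [1, 2, 3, 4, 5, 6]
offset_3 = rows[3:]   # OFFSET 3: skip the first 3 rows -> [4, 5, 6]
tail_3 = rows[-3:]    # Tail(3): keep the last 3 rows   -> [4, 5, 6]
assert offset_3 == tail_3  # they agree only because len(rows) - 3 == 3

rows2 = [1, 2, 3, 4, 5, 6, 7]
assert rows2[3:] != rows2[-3:]  # OFFSET 3 keeps 4 rows, Tail(3) keeps 3
```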

@cloud-fan (Contributor)

Ah sorry, I made a mistake. Let's just not support OFFSET-only for now; its output is kind of unbounded.

@beliefer (Contributor, Author) commented Feb 3, 2020

Ah sorry, I made a mistake. Let's just not support OFFSET-only for now; its output is kind of unbounded.

OK.

@SparkQA commented Feb 4, 2020

Test build #117798 has finished for PR 27429 at commit 1149727.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer (Contributor, Author) commented Feb 4, 2020

@cloud-fan I have refactored the implementation.

@beliefer (Contributor, Author) commented Feb 4, 2020

retest this please

@SparkQA commented Feb 5, 2020

Test build #117867 has finished for PR 27429 at commit 8bcee2f.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class GlobalLimit(limitExpr: Expression, child: LogicalPlan) extends OrderPreservingUnaryNode
  • case class LocalLimit(limitExpr: Expression, child: LogicalPlan) extends OrderPreservingUnaryNode
  • case class GlobalLimitAndOffset(
  • case class LocalLimitExec(limit: Int, child: SparkPlan) extends BaseLimitExec
  • case class GlobalLimitExec(limit: Int, child: SparkPlan) extends BaseLimitExec
  • case class GlobalLimitAndOffsetExec(

@beliefer (Contributor, Author) commented Feb 5, 2020

retest this please

@SparkQA commented Feb 5, 2020

Test build #117896 has finished for PR 27429 at commit 8bcee2f.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class GlobalLimit(limitExpr: Expression, child: LogicalPlan) extends OrderPreservingUnaryNode
  • case class LocalLimit(limitExpr: Expression, child: LogicalPlan) extends OrderPreservingUnaryNode
  • case class GlobalLimitAndOffset(
  • case class LocalLimitExec(limit: Int, child: SparkPlan) extends BaseLimitExec
  • case class GlobalLimitExec(limit: Int, child: SparkPlan) extends BaseLimitExec
  • case class GlobalLimitAndOffsetExec(

@SparkQA commented Feb 5, 2020

Test build #117905 has finished for PR 27429 at commit 8bcee2f.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class GlobalLimit(limitExpr: Expression, child: LogicalPlan) extends OrderPreservingUnaryNode
  • case class LocalLimit(limitExpr: Expression, child: LogicalPlan) extends OrderPreservingUnaryNode
  • case class GlobalLimitAndOffset(
  • case class LocalLimitExec(limit: Int, child: SparkPlan) extends BaseLimitExec
  • case class GlobalLimitExec(limit: Int, child: SparkPlan) extends BaseLimitExec
  • case class GlobalLimitAndOffsetExec(

@SparkQA commented Feb 5, 2020

Test build #117907 has finished for PR 27429 at commit 484e86a.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer (Contributor, Author) commented Feb 5, 2020

retest this please

@SparkQA commented Feb 5, 2020

Test build #117922 has finished for PR 27429 at commit 484e86a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member)

cc @maryannxue @hvanhovell

@SparkQA commented Apr 13, 2020

Test build #121199 has finished for PR 27429 at commit 44b1861.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer (Contributor, Author)

retest this please

@SparkQA commented Apr 13, 2020

Test build #121215 has finished for PR 27429 at commit 44b1861.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 10, 2020

Test build #123718 has finished for PR 27429 at commit b5821d5.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer (Contributor, Author)

retest this please.

@SparkQA commented Jun 10, 2020

Test build #123756 has finished for PR 27429 at commit b5821d5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer (Contributor, Author)

retest this please.

@SparkQA commented Jun 10, 2020

Test build #123764 has finished for PR 27429 at commit b5821d5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer (Contributor, Author)

retest this please.

@SparkQA commented Jul 31, 2020

Test build #126825 has finished for PR 27429 at commit 2820b6c.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer (Contributor, Author)

retest this please.

@SparkQA commented Jul 31, 2020

Test build #126863 has finished for PR 27429 at commit 2820b6c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer (Contributor, Author) commented Aug 5, 2020

cc @cloud-fan

@SparkQA commented Sep 27, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33759/

@SparkQA commented Sep 27, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33759/

@SparkQA commented Sep 27, 2020

Test build #129143 has finished for PR 27429 at commit bbe10d6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 20, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34655/

@SparkQA commented Oct 20, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34655/

@SparkQA commented Nov 10, 2020

Test build #130867 has finished for PR 27429 at commit bbe10d6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer (Contributor, Author)

retest this please

@SparkQA commented Nov 11, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35504/

@SparkQA commented Nov 11, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35504/

@SparkQA commented Nov 11, 2020

Test build #130898 has finished for PR 27429 at commit bbe10d6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@github-actions bot commented Apr 8, 2021

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Apr 8, 2021
@github-actions github-actions bot closed this Apr 9, 2021