[SPARK-38978][SQL] DS V2 supports push down OFFSET operator (#491)
* [SPARK-28330][SQL] Support ANSI SQL: result offset clause in query expression
### What changes were proposed in this pull request?
This implements ANSI SQL feature `F861`:
```
<query expression> ::=
[ <with clause> ] <query expression body>
[ <order by clause> ] [ <result offset clause> ] [ <fetch first clause> ]

<result offset clause> ::=
OFFSET <offset row count> { ROW | ROWS }
```
For example:
```
SELECT customer_name, customer_gender FROM customer_dimension
   WHERE occupation='Dancer' AND customer_city = 'San Francisco' ORDER BY customer_name;
    customer_name     | customer_gender
----------------------+-----------------
 Amy X. Lang          | Female
 Anna H. Li           | Female
 Brian O. Weaver      | Male
 Craig O. Pavlov      | Male
 Doug Z. Goldberg     | Male
 Harold S. Jones      | Male
 Jack E. Perkins      | Male
 Joseph W. Overstreet | Male
 Kevin . Campbell     | Male
 Raja Y. Wilson       | Male
 Samantha O. Brown    | Female
 Steve H. Gauthier    | Male
 William . Nielson    | Male
 William Z. Roy       | Male
(14 rows)

SELECT customer_name, customer_gender FROM customer_dimension
   WHERE occupation='Dancer' AND customer_city = 'San Francisco' ORDER BY customer_name OFFSET 8;
   customer_name   | customer_gender
-------------------+-----------------
 Kevin . Campbell  | Male
 Raja Y. Wilson    | Male
 Samantha O. Brown | Female
 Steve H. Gauthier | Male
 William . Nielson | Male
 William Z. Roy    | Male
(6 rows)
```
Several mainstream databases support this syntax:

**Druid**
https://druid.apache.org/docs/latest/querying/sql.html#offset

**Kylin**
http://kylin.apache.org/docs/tutorial/sql_reference.html#QUERYSYNTAX

**Exasol**
https://docs.exasol.com/sql/select.htm

**Greenplum**
http://docs.greenplum.org/6-8/ref_guide/sql_commands/SELECT.html

**MySQL**
https://dev.mysql.com/doc/refman/5.6/en/select.html

**Monetdb**
https://www.monetdb.org/Documentation/SQLreference/SQLSyntaxOverview#SELECT

**PostgreSQL**
https://www.postgresql.org/docs/11/queries-limit.html

**Sqlite**
https://www.sqlite.org/lang_select.html

**Vertica**
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Statements/SELECT/OFFSETClause.htm?zoom_highlight=offset

The design, in outline:
**1**. Consider `OFFSET` as a special case of `LIMIT`. For example:
`SELECT * FROM a limit 10;` is similar to `SELECT * FROM a limit 10 offset 0;`
`SELECT * FROM a offset 10;` is similar to `SELECT * FROM a limit -1 offset 10;`
**2**. Build on the current implementation of `LIMIT`, which has good performance. For example:
`SELECT * FROM a limit 10;` is parsed into the logical plan below:
```
GlobalLimit (limit = 10)
|--LocalLimit (limit = 10)
```
and then the physical plan as below:
```
GlobalLimitExec (limit = 10) // Take the first 10 rows globally
|--LocalLimitExec (limit = 10) // Take the first 10 rows locally
```
This operator avoids a massive shuffle and has good performance.
Sometimes the logical plan is transformed into the physical plan:
```
CollectLimitExec (limit = 10) // Take the first 10 rows globally
```
If the SQL contains ORDER BY, such as `SELECT * FROM a order by c limit 10;`,
it is transformed into the physical plan below:
```
TakeOrderedAndProjectExec (limit = 10) // Take the first 10 rows after sort globally
```

Based on this, the PR behaves as follows. For example,
`SELECT * FROM a limit 10 offset 10;` is parsed into the logical plan below:
```
GlobalLimit (limit = 10)
|--LocalLimit (limit = 10)
   |--Offset (offset = 10)
```
After optimization, the above logical plan is transformed into:
```
GlobalLimitAndOffset (limit = 10, offset = 10) // Limit clause accompanied by offset clause
|--LocalLimit (limit = 20)   // 10 + offset = 20
```

and then the physical plan as below:
```
GlobalLimitAndOffsetExec (limit = 10, offset = 10) // Skip the first 10 rows and take the next 10 rows globally
|--LocalLimitExec (limit = 20) // Take the first 20(limit + offset) rows locally
```
Sometimes the logical plan is transformed into the physical plan:
```
CollectLimitExec (limit = 10, offset = 10) // Skip the first 10 rows and take the next 10 rows globally
```
If the SQL contains ORDER BY, such as `SELECT * FROM a order by c limit 10 offset 10;`,
it is transformed into the physical plan below:
```
TakeOrderedAndProjectExec (limit = 10, offset = 10) // Skip the first 10 rows and take the next 10 rows after sort globally
```
**3**. In addition, there is a special case with only an offset and no limit. For example,
`SELECT * FROM a offset 10;` is parsed into the logical plan below:
```
Offset (offset = 10) // Only offset clause
```
If the offset is very large, this generates a lot of overhead. So this PR refuses an OFFSET clause without a LIMIT clause, although we can parse, transform, and execute it.

A balanced idea is to add a configuration item `spark.sql.forceUsingOffsetWithoutLimit` (default false) to force running such a query when the user knows the offset is small enough. This PR only raises the idea, so that it can be implemented at a better time in the future; a hypothetical usage is sketched below.
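A hypothetical usage sketch of the proposed flag (not implemented in this PR; assumes an active `SparkSession` named `spark` and a table `a`):
```scala
// Hypothetical: spark.sql.forceUsingOffsetWithoutLimit is only proposed in this
// PR description, not implemented. Shown for illustration only.
spark.conf.set("spark.sql.forceUsingOffsetWithoutLimit", "true")

// With the flag on, an OFFSET without a LIMIT would be allowed to run.
spark.sql("SELECT * FROM a OFFSET 10").show()
```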

Note: the original PR supporting this feature is apache#25416.
Because that PR is too old, it has massive conflicts that are hard to resolve, so I opened this new PR to support the feature.

### Why are the changes needed?
New feature.

### Does this PR introduce any user-facing change?
'No'

### How was this patch tested?
Existing and new unit tests.

Closes apache#35975 from beliefer/SPARK-28330.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-39057][SQL] Offset could work without Limit
### What changes were proposed in this pull request?
Currently, `Offset` must work with `Limit`; the behavior does not allow using `Offset` alone, nor adding an `offset` API to `DataFrame`.

If we use `Offset` alone, there are two situations:
1. If `Offset` is the last operator, collect the result to the driver and then drop/skip the first n (offset value) rows. Users can test or debug `Offset` this way.
2. If `Offset` is an intermediate operator, shuffle all the results to one task, drop/skip the first n (offset value) rows, and pass the result to the downstream operator.

For example, `SELECT * FROM a offset 10;` is parsed into the logical plan below:
```
Offset (offset = 10) // Only offset clause
|--Relation
```

and then the physical plan as below:
```
CollectLimitExec(limit = -1, offset = 10) // Collect the result to the driver and skip the first 10 rows
|--JDBCRelation
```
or
```
GlobalLimitAndOffsetExec(limit = -1, offset = 10) // Collect the result and skip the first 10 rows
|--JDBCRelation
```

After this PR is merged, users can run SQL like the following:
```
SELECT '' AS ten, unique1, unique2, stringu1
FROM onek
ORDER BY unique1 OFFSET 990;
```

Note: apache#35975 supports the OFFSET clause by creating a logical node named
`GlobalLimitAndOffset`. In fact, we can avoid that node and use `Offset` instead, which gives more unified naming.

### Why are the changes needed?
Improves the implementation of the OFFSET clause.

### Does this PR introduce _any_ user-facing change?
'No'.
New feature.

### How was this patch tested?
Existing test cases.

Closes apache#36417 from beliefer/SPARK-28330_followup2.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-39159][SQL] Add new Dataset API for Offset

### What changes were proposed in this pull request?
Spark recently added the `Offset` operator.
This PR adds an `offset` API to `Dataset`, sketched below.
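A minimal usage sketch (assuming a local `SparkSession`; the data is illustrative):
```scala
import org.apache.spark.sql.SparkSession

object OffsetApiDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("offset-demo").getOrCreate()
    import spark.implicits._

    // Skip the first 10 rows, then take the next 5 -- the Dataset equivalent
    // of "... ORDER BY id LIMIT 5 OFFSET 10".
    val df = (1 to 20).toDF("id")
    df.orderBy("id").offset(10).limit(5).show()

    spark.stop()
  }
}
```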

### Why are the changes needed?
The `offset` API is very useful and makes constructing test cases easier.

### Does this PR introduce _any_ user-facing change?
'No'.
New feature.

### How was this patch tested?
New tests.

Closes apache#36519 from beliefer/SPARK-39159.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-39180][SQL] Simplify the planning of limit and offset

### What changes were proposed in this pull request?

This PR simplifies the planning of limit and offset:
1. Unify the semantics of physical plans that need to deal with limit + offset: these physical plans always do limit first, then offset. The planner rule must set limit and offset properly for the different logical shapes, such as limit + offset and offset + limit (see the sketch after this list).
2. Refactor the planner rule `SpecialLimit` to reuse the code of planning `TakeOrderedAndProjectExec`.
3. Let `GlobalLimitExec` handle offset as well, so that we can remove `GlobalLimitAndOffsetExec`. This matches `CollectLimitExec`.
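As a sketch of point 1 (my reading of the rule, not code from the PR): if the physical node always takes the first `limit` rows and then skips the first `offset` of them, its parameters can be derived for both logical shapes like this:
```scala
// Derive (limit, offset) for a hypothetical physical node that limits first,
// then skips -- an illustrative reading of point 1, not code from the PR.
def physicalParams(limit: Int, offset: Int, offsetAppliedFirst: Boolean): (Int, Int) =
  if (offsetAppliedFirst) {
    // SQL "LIMIT l OFFSET m" skips m rows, then takes l:
    // equivalent to taking the first (l + m) rows and then skipping m.
    (limit + offset, offset)
  } else {
    // Dataset-style limit(l).offset(m) takes l rows, then skips m:
    // already in take-first-then-skip form.
    (limit, offset)
  }

// LIMIT 10 OFFSET 5   -> physicalParams(10, 5, offsetAppliedFirst = true)  == (15, 5): rows 6..15
// limit(10).offset(5) -> physicalParams(10, 5, offsetAppliedFirst = false) == (10, 5): rows 6..10
```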
### Why are the changes needed?

code simplification

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes apache#36541 from cloud-fan/offset.

Lead-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-39037][SQL] DS V2 aggregate push-down supports order by expressions

### What changes were proposed in this pull request?
Currently, Spark DS V2 aggregate push-down only supports ORDER BY on plain columns,
but SQL like the query below is very useful and common:
```
SELECT
  CASE
    WHEN "SALARY" > 8000.00
      AND "SALARY" < 10000.00
    THEN "SALARY"
    ELSE 0.00
  END AS key,
  dept,
  name
FROM "test"."employee"
ORDER BY key
```

### Why are the changes needed?
Let DS V2 aggregate push-down support ORDER BY expressions.

### Does this PR introduce _any_ user-facing change?
'No'.
New feature.

### How was this patch tested?
New tests.

Closes apache#36370 from beliefer/SPARK-39037.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-38978][SQL] DS V2 supports push down OFFSET operator

### What changes were proposed in this pull request?
Currently, DS V2 push-down supports `LIMIT` but not `OFFSET`.
If we can push `OFFSET` down to the JDBC data source, performance will be better.
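An illustrative sketch (the `h2` catalog and table names are assumptions, not from the PR):
```scala
import org.apache.spark.sql.SparkSession

// Illustrative only: with a DS V2 JDBC catalog registered as "h2", the LIMIT
// and OFFSET below can be evaluated inside the database instead of in Spark.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.sql("SELECT name FROM h2.test.employee ORDER BY name LIMIT 10 OFFSET 5").show()
```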

### Why are the changes needed?
Pushing down `OFFSET` could improve performance.

### Does this PR introduce _any_ user-facing change?
'No'.
New feature.

### How was this patch tested?
New tests.

Closes apache#36295 from beliefer/SPARK-38978.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* Fix unit tests

* [SPARK-39340][SQL] DS v2 agg pushdown should allow dots in the name of top-level columns

### What changes were proposed in this pull request?

It turns out that I was wrong in apache#36727. We still have the limitation (a column name cannot contain a dot) in the master and 3.3 branches, in a very implicit way: `V2ExpressionBuilder` has a boolean flag `nestedPredicatePushdownEnabled` whose default value is false. When it is false, it uses `PushableColumnWithoutNestedColumn` to match columns, which does not support dots in names.

`V2ExpressionBuilder` is only used in 2 places:
1. `PushableExpression`. This is a pattern match that is only used in v2 agg pushdown.
2. `PushablePredicate`. This is a pattern match that is used in various places, but all the caller sides set `nestedPredicatePushdownEnabled` to true.

This PR removes the `nestedPredicatePushdownEnabled` flag from `V2ExpressionBuilder`, and makes it always support nested fields. `PushablePredicate` is also updated accordingly to remove the boolean flag, as it's always true.
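An illustrative example of the shape this fixes (catalog, table, and column names are assumptions):
```scala
import org.apache.spark.sql.SparkSession

// Illustrative only: a top-level column literally named "salary.amount"
// (backticked, not a nested field) can now participate in aggregate pushdown.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.sql("SELECT dept, MAX(`salary.amount`) FROM h2.test.employee GROUP BY dept").show()
```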

### Why are the changes needed?

Fix a mistake to eliminate an unexpected limitation in DS v2 pushdown.

### Does this PR introduce _any_ user-facing change?

No for end users. Data source developers can trigger aggregate pushdown more often.

### How was this patch tested?

a new test

Closes apache#36945 from cloud-fan/dsv2.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-39453][SQL] DS V2 supports push down misc non-aggregate functions(non ANSI)

### What changes were proposed in this pull request?
apache#36039 makes DS V2 push down the misc non-aggregate functions claimed by the ANSI standard.
Spark has many commonly used misc non-aggregate functions that are not claimed by the ANSI standard; see
https://github.com/apache/spark/blob/2f8613f22c0750c00cf1dcfb2f31c431d8dc1be7/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L362.

Mainstream database support for these functions is shown below.
|  Function name   | PostgreSQL  | ClickHouse  | H2  | MySQL  | Oracle  | Redshift  | Presto  | Teradata  | Snowflake  | DB2  | Vertica  | Exasol  | SqlServer  | Yellowbrick  | Impala  | Mariadb | Druid | Pig | Singlestore | ElasticSearch | SQLite | Influxdata | Sybase |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| `GREATEST` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | No | No | No |
| `LEAST` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | No | No | No |
| `IF` | No | Yes | No | Yes | No | No | Yes | No | Yes | No | No | Yes | No | Yes | Yes | Yes | No | No | Yes | Yes | Yes | No | No |
| `RAND` | No | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | No | Yes |
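
An illustrative query (catalog and table names are assumptions) whose filter could now be compiled into the source's SQL dialect:
```scala
import org.apache.spark.sql.SparkSession

// Illustrative only: GREATEST/LEAST inside a pushed-down filter can now be
// translated into the underlying source's SQL.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.sql("SELECT name FROM h2.test.employee WHERE GREATEST(bonus, 1100) > LEAST(salary, 9000)").show()
```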

### Why are the changes needed?
Let DS V2 push down the misc non-aggregate functions supported by mainstream databases.

### Does this PR introduce _any_ user-facing change?
'No'.
New feature.

### How was this patch tested?
New tests.

Closes apache#36830 from beliefer/SPARK-38761_followup.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-39479][SQL] DS V2 supports push down math functions(non ANSI)

### What changes were proposed in this pull request?
apache#36140 makes DS V2 push down the math functions claimed by the ANSI standard.
Spark has many commonly used math functions that are not claimed by the ANSI standard; see
https://github.com/apache/spark/blob/2f8613f22c0750c00cf1dcfb2f31c431d8dc1be7/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L388.

Mainstream database support for these functions is shown below.
|  Function name   | PostgreSQL  | ClickHouse  | H2  | MySQL  | Oracle  | Redshift  | Presto  | Teradata  | Snowflake  | DB2  | Vertica  | Exasol  | SqlServer  | Yellowbrick  | Impala  | Mariadb | Druid | Pig | Singlestore | ElasticSearch | SQLite | Influxdata | Sybase |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| `SIN` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| `SINH` | Yes | Yes | Yes | No | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | No | No | Yes | No | No | Yes | No | Yes | Yes | Yes | No |
| `COS` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| `COSH` | Yes | Yes | Yes | No | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | No | No | Yes | No | No | Yes | No | Yes | Yes | Yes | No |
| `TAN` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| `TANH` | Yes | No | Yes | No | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | No | No | Yes | No | No | Yes | No | No | No | Yes | No |
| `COT` | Yes | No | Yes | Yes | No | Yes | No | No | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | No | Yes | Yes | No | No | Yes |
| `ASIN` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| `ASINH` | Yes | Yes | No | No | No | No | No | Yes | Yes | No | No | No | No | No | No | No | No | No | No | No | Yes | Yes | No |
| `ACOS` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| `ACOSH` | Yes | Yes | No | No | No | No | No | Yes | Yes | No | No | No | No | No | No | No | No | No | No | No | Yes | Yes | No |
| `ATAN` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| `ATAN2` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes |
| `ATANH` | Yes | Yes | No | No | No | No | No | Yes | Yes | Yes | No | No | No | No | No | No | No | No | No | No | Yes | Yes | No |
| `LOG` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| `LOG10` | Yes | Yes | Yes | Yes | No | No | Yes | Yes | No | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| `LOG2` | No | Yes | No | Yes | No | No | Yes | Yes | No | No | No | Yes | No | No | Yes | Yes | No | No | Yes | No | Yes | Yes | No |
| `CBRT` | Yes | Yes | No | No | No | Yes | Yes | No | Yes | No | Yes | No | No | Yes | No | No | No | Yes | No | Yes | No | Yes | No |
| `DEGREES` | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | No | Yes |
| `RADIANS` | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | No | Yes |
| `ROUND` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes |
| `SIGN` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | No | Yes | Yes | No | No | Yes |
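
Similarly, an illustrative query (names are assumptions) using newly pushable math functions:
```scala
import org.apache.spark.sql.SparkSession

// Illustrative only: SIN and LOG inside a pushed-down filter can now be
// translated into the underlying source's SQL.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.sql("SELECT name FROM h2.test.employee WHERE SIN(bonus) < 0.5 AND LOG(2, salary) > 10").show()
```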

### Why are the changes needed?
Let DS V2 push down the math functions supported by mainstream databases.

### Does this PR introduce _any_ user-facing change?
'No'.
New feature.

### How was this patch tested?
New tests.

Closes apache#36877 from beliefer/SPARK-39479.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

Co-authored-by: Jiaan Geng <beliefer@163.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
4 people authored Jul 6, 2022
1 parent cde6def commit fc63de1
Showing 276 changed files with 2,293 additions and 544 deletions.
2 changes: 1 addition & 1 deletion docs/sql-ref-ansi-compliance.md
@@ -452,7 +452,7 @@ Below is a list of all the keywords in Spark SQL.
|NULL|reserved|non-reserved|reserved|
|NULLS|non-reserved|non-reserved|non-reserved|
|OF|non-reserved|non-reserved|reserved|
-|OFFSET|non-reserved|non-reserved|reserved|
+|OFFSET|reserved|non-reserved|reserved|
|ON|reserved|strict-non-reserved|reserved|
|ONLY|reserved|non-reserved|reserved|
|OPTION|non-reserved|non-reserved|non-reserved|
@@ -106,6 +106,42 @@
* <li>Since version: 3.3.0</li>
* </ul>
* </li>
* <li>Name: <code>GREATEST</code>
* <ul>
* <li>SQL semantic: <code>GREATEST(expr, ...)</code></li>
* <li>Since version: 3.4.0</li>
* </ul>
* </li>
* <li>Name: <code>LEAST</code>
* <ul>
* <li>SQL semantic: <code>LEAST(expr, ...)</code></li>
* <li>Since version: 3.4.0</li>
* </ul>
* </li>
* <li>Name: <code>RAND</code>
* <ul>
* <li>SQL semantic: <code>RAND([seed])</code></li>
* <li>Since version: 3.4.0</li>
* </ul>
* </li>
* <li>Name: <code>LOG</code>
* <ul>
* <li>SQL semantic: <code>LOG(base, expr)</code></li>
* <li>Since version: 3.4.0</li>
* </ul>
* </li>
* <li>Name: <code>LOG10</code>
* <ul>
* <li>SQL semantic: <code>LOG10(expr)</code></li>
* <li>Since version: 3.4.0</li>
* </ul>
* </li>
* <li>Name: <code>LOG2</code>
* <ul>
* <li>SQL semantic: <code>LOG2(expr)</code></li>
* <li>Since version: 3.4.0</li>
* </ul>
* </li>
* <li>Name: <code>LN</code>
* <ul>
* <li>SQL semantic: <code>LN(expr)</code></li>
@@ -142,6 +178,120 @@
* <li>Since version: 3.3.0</li>
* </ul>
* </li>
* <li>Name: <code>ROUND</code>
* <ul>
* <li>SQL semantic: <code>ROUND(expr, [scale])</code></li>
* <li>Since version: 3.4.0</li>
* </ul>
* </li>
* <li>Name: <code>SIN</code>
* <ul>
* <li>SQL semantic: <code>SIN(expr)</code></li>
* <li>Since version: 3.4.0</li>
* </ul>
* </li>
* <li>Name: <code>SINH</code>
* <ul>
* <li>SQL semantic: <code>SINH(expr)</code></li>
* <li>Since version: 3.4.0</li>
* </ul>
* </li>
* <li>Name: <code>COS</code>
* <ul>
* <li>SQL semantic: <code>COS(expr)</code></li>
* <li>Since version: 3.4.0</li>
* </ul>
* </li>
* <li>Name: <code>COSH</code>
* <ul>
* <li>SQL semantic: <code>COSH(expr)</code></li>
* <li>Since version: 3.4.0</li>
* </ul>
* </li>
* <li>Name: <code>TAN</code>
* <ul>
* <li>SQL semantic: <code>TAN(expr)</code></li>
* <li>Since version: 3.4.0</li>
* </ul>
* </li>
* <li>Name: <code>TANH</code>
* <ul>
* <li>SQL semantic: <code>TANH(expr)</code></li>
* <li>Since version: 3.4.0</li>
* </ul>
* </li>
* <li>Name: <code>COT</code>
* <ul>
* <li>SQL semantic: <code>COT(expr)</code></li>
* <li>Since version: 3.4.0</li>
* </ul>
* </li>
* <li>Name: <code>ASIN</code>
* <ul>
* <li>SQL semantic: <code>ASIN(expr)</code></li>
* <li>Since version: 3.4.0</li>
* </ul>
* </li>
* <li>Name: <code>ASINH</code>
* <ul>
* <li>SQL semantic: <code>ASINH(expr)</code></li>
* <li>Since version: 3.4.0</li>
* </ul>
* </li>
* <li>Name: <code>ACOS</code>
* <ul>
* <li>SQL semantic: <code>ACOS(expr)</code></li>
* <li>Since version: 3.4.0</li>
* </ul>
* </li>
* <li>Name: <code>ACOSH</code>
* <ul>
* <li>SQL semantic: <code>ACOSH(expr)</code></li>
* <li>Since version: 3.4.0</li>
* </ul>
* </li>
* <li>Name: <code>ATAN</code>
* <ul>
* <li>SQL semantic: <code>ATAN(expr)</code></li>
* <li>Since version: 3.4.0</li>
* </ul>
* </li>
* <li>Name: <code>ATANH</code>
* <ul>
* <li>SQL semantic: <code>ATANH(expr)</code></li>
* <li>Since version: 3.4.0</li>
* </ul>
* </li>
* <li>Name: <code>ATAN2</code>
* <ul>
* <li>SQL semantic: <code>ATAN2(exprY, exprX)</code></li>
* <li>Since version: 3.4.0</li>
* </ul>
* </li>
* <li>Name: <code>CBRT</code>
* <ul>
* <li>SQL semantic: <code>CBRT(expr)</code></li>
* <li>Since version: 3.4.0</li>
* </ul>
* </li>
* <li>Name: <code>DEGREES</code>
* <ul>
* <li>SQL semantic: <code>DEGREES(expr)</code></li>
* <li>Since version: 3.4.0</li>
* </ul>
* </li>
* <li>Name: <code>RADIANS</code>
* <ul>
* <li>SQL semantic: <code>RADIANS(expr)</code></li>
* <li>Since version: 3.4.0</li>
* </ul>
* </li>
* <li>Name: <code>SIGN</code>
* <ul>
* <li>SQL semantic: <code>SIGN(expr)</code></li>
* <li>Since version: 3.4.0</li>
* </ul>
* </li>
* <li>Name: <code>WIDTH_BUCKET</code>
* <ul>
* <li>SQL semantic: <code>WIDTH_BUCKET(expr)</code></li>
@@ -23,7 +23,8 @@
* An interface for building the {@link Scan}. Implementations can mixin SupportsPushDownXYZ
* interfaces to do operator push down, and keep the operator push down result in the returned
* {@link Scan}. When pushing down operators, the push down order is:
- * sample -&gt; filter -&gt; aggregate -&gt; limit -&gt; column pruning.
+ * sample -&gt; filter -&gt; aggregate -&gt; limit/top-n(sort + limit) -&gt; offset -&gt;
+ * column pruning.
*
* @since 3.0.0
*/
@@ -21,8 +21,8 @@

/**
* A mix-in interface for {@link ScanBuilder}. Data sources can implement this interface to
- * push down LIMIT. Please note that the combination of LIMIT with other operations
- * such as AGGREGATE, GROUP BY, SORT BY, CLUSTER BY, DISTRIBUTE BY, etc. is NOT pushed down.
+ * push down LIMIT. We can push down LIMIT with many other operations if they follow the
+ * operator order we defined in {@link ScanBuilder}'s class doc.
*
* @since 3.3.0
*/
@@ -0,0 +1,36 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package org.apache.spark.sql.connector.read;

import org.apache.spark.annotation.Evolving;

/**
* A mix-in interface for {@link ScanBuilder}. Data sources can implement this interface to
* push down OFFSET. We can push down OFFSET with many other operations if they follow the
* operator order we defined in {@link ScanBuilder}'s class doc.
*
* @since 3.4.0
*/
@Evolving
public interface SupportsPushDownOffset extends ScanBuilder {

/**
* Pushes down OFFSET to the data source.
*/
boolean pushOffset(int offset);
}
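
A hypothetical sketch of a data source implementing this interface alongside `SupportsPushDownLimit` (only the Spark interfaces are real; all other names are illustrative):
```scala
import org.apache.spark.sql.connector.read.{Scan, ScanBuilder, SupportsPushDownLimit, SupportsPushDownOffset}
import org.apache.spark.sql.types.StructType

// Hypothetical Scan that carries the pushed-down values; a real JDBC source
// would splice them into its generated SQL (e.g. "... LIMIT 10 OFFSET 5").
class MyJdbcScan(limit: Int, offset: Int) extends Scan {
  override def readSchema(): StructType = new StructType()
}

class MyJdbcScanBuilder extends ScanBuilder
    with SupportsPushDownLimit with SupportsPushDownOffset {
  private var pushedLimit = -1
  private var pushedOffset = -1

  // Returning true tells Spark the source fully handles the pushed operator.
  override def pushLimit(limit: Int): Boolean = { pushedLimit = limit; true }

  override def pushOffset(offset: Int): Boolean = { pushedOffset = offset; true }

  override def build(): Scan = new MyJdbcScan(pushedLimit, pushedOffset)
}
```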
@@ -22,23 +22,22 @@

/**
* A mix-in interface for {@link ScanBuilder}. Data sources can implement this interface to
- * push down top N(query with ORDER BY ... LIMIT n). Please note that the combination of top N
- * with other operations such as AGGREGATE, GROUP BY, CLUSTER BY, DISTRIBUTE BY, etc.
- * is NOT pushed down.
+ * push down top N(query with ORDER BY ... LIMIT n). We can push down top N with many other
+ * operations if they follow the operator order we defined in {@link ScanBuilder}'s class doc.
*
* @since 3.3.0
*/
@Evolving
public interface SupportsPushDownTopN extends ScanBuilder {

/**
* Pushes down top N to the data source.
*/
boolean pushTopN(SortOrder[] orders, int limit);

/**
* Whether the top N is partially pushed or not. If it returns true, then Spark will do top N
* again. This method will only be called when {@link #pushTopN} returns true.
*/
default boolean isPartiallyPushed() { return true; }
}
@@ -95,12 +95,37 @@ public String build(Expression expr) {
return visitUnaryArithmetic(name, inputToSQL(e.children()[0]));
case "ABS":
case "COALESCE":
case "GREATEST":
case "LEAST":
case "RAND":
case "LOG":
case "LOG10":
case "LOG2":
case "LN":
case "EXP":
case "POWER":
case "SQRT":
case "FLOOR":
case "CEIL":
case "ROUND":
case "SIN":
case "SINH":
case "COS":
case "COSH":
case "TAN":
case "TANH":
case "COT":
case "ASIN":
case "ASINH":
case "ACOS":
case "ACOSH":
case "ATAN":
case "ATANH":
case "ATAN2":
case "CBRT":
case "DEGREES":
case "RADIANS":
case "SIGN":
case "WIDTH_BUCKET":
case "SUBSTRING":
case "UPPER":
@@ -393,20 +393,17 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog {
val offset = offsetExpr.eval().asInstanceOf[Int]
if (Int.MaxValue - limit < offset) {
failAnalysis(
s"""The sum of limit and offset must not be greater than Int.MaxValue,
| but found limit = $limit, offset = $offset.""".stripMargin)
s"""
|The sum of the LIMIT clause and the OFFSET clause must not be greater than
|the maximum 32-bit integer value (2,147,483,647),
|but found limit = $limit, offset = $offset.
|""".stripMargin.replace("\n", " "))
}
case _ =>
}

case Offset(offsetExpr, _) => checkLimitLikeClause("offset", offsetExpr)

-case o if !o.isInstanceOf[GlobalLimit] && !o.isInstanceOf[LocalLimit]
-&& o.children.exists(_.isInstanceOf[Offset]) =>
-failAnalysis(
-s"""Only the OFFSET clause is allowed in the LIMIT clause, but the OFFSET
-| clause found in: ${o.nodeName}.""".stripMargin)

case Tail(limitExpr, _) => checkLimitLikeClause("tail", limitExpr)

case _: Union | _: SetOperation if operator.children.length > 1 =>
@@ -567,7 +564,6 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog {
}
}
checkCollectedMetrics(plan)
-checkOutermostOffset(plan)
extendedCheckRules.foreach(_(plan))
plan.foreachUp {
case o if !o.resolved =>
@@ -578,20 +574,6 @@
plan.setAnalyzed()
}

-/**
-* Validate that the root node of query or subquery is [[Offset]].
-*/
-private def checkOutermostOffset(plan: LogicalPlan): Unit = {
-plan match {
-case Offset(offsetExpr, _) =>
-checkLimitLikeClause("limit", offsetExpr)
-failAnalysis(
-s"""Only the OFFSET clause is allowed in the LIMIT clause, but the OFFSET
-| clause is found to be the outermost node.""".stripMargin)
-case _ =>
-}
-}

/**
* Validates subquery expressions in the plan. Upon failure, returns an user facing error.
*/