Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-30581][DOC] Document SORT BY Clause of SELECT statement in SQLReference #27289

Closed
wants to merge 9 commits into from

Conversation

dilipbiswal
Copy link
Contributor

What changes were proposed in this pull request?

Document SORT BY clause of SELECT statement in SQL Reference Guide.

Why are the changes needed?

Currently Spark lacks documentation on the supported SQL constructs causing
confusion among users who sometimes have to look at the code to understand the
usage. This is aimed at addressing this issue.

Does this PR introduce any user-facing change?

Yes.

Before:
There was no documentation for this.

After.
Screen Shot 2020-01-20 at 1 25 57 AM
Screen Shot 2020-01-20 at 1 26 11 AM
Screen Shot 2020-01-20 at 1 26 28 AM
Screen Shot 2020-01-20 at 1 26 46 AM
Screen Shot 2020-01-20 at 1 27 02 AM

How was this patch tested?

Tested using jykyll build --serve

@SparkQA
Copy link

SparkQA commented Jan 20, 2020

Test build #117108 has finished for PR 27289 at commit 0f37760.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

<dl>
<dt><code><em>SORT BY</em></code></dt>
<dd>
Specifies a comma separated list of expression along with optional parameters sort_direction
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comma-separated
expression -> expresisons
Back-tick things like sort_direction, etc.

<dt><code><em>nulls_sort_order</em></code></dt>
<dd>
Optionally specifies whether NULL values are returned before/after non-NULL values, based on the
sort direction. In spark, NULL values are considered to be lower than any non-NULL values. Therefore
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spark -> Spark. Are you describing a default here? make it explicit if so. Because this option lets you control that behavior, right?

@SparkQA
Copy link

SparkQA commented Jan 20, 2020

Test build #117136 has finished for PR 27289 at commit a849883.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

limitations under the License.
---
The <code>SORT BY</code> clause is used to return the result rows sorted
within each partition in the user specified order. When there are more than one partition
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When there are more than one partition -> When there is more than one partition?

@SparkQA
Copy link

SparkQA commented Jan 21, 2020

Test build #117159 has finished for PR 27289 at commit 338fbc7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


### Syntax
{% highlight sql %}
SORT BY { expression [ sort_direction | nulls_sort_oder ] [ , ...] }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this actually different from ORDER BY or are they aliases? do we need to copy the docs if so?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@srowen Yeah.. they are a bit different. ORDER BY performs a total sort where as SORT BY does sort within a partition. I kept them separate for only reason that to the best of my understanding "SORT BY" is not a frequently used clause. Its a hive compatibility thing. I thought of keeping them separate as with discussion with @gatorsmile we think we should have a top level link for "simple select" and "full select". In the "simple select" link, we will only include the most used clauses (i.e not include things like sort by, distribute by and cluster by).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK same comments here apply to the ORDER BY PR too then, I guess

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If most parts are duplicate, how about working on ORDER BY first, then copying the committed (final) version of ORDER BY here for less review overheads?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good @maropu.

@SparkQA
Copy link

SparkQA commented Jan 22, 2020

Test build #117265 has finished for PR 27289 at commit 154cce4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu maropu changed the title [SPARK-30581][DOC] Document SORT BY Clause of SELECT statement in SQLReference. [SPARK-30581][DOC] Document SORT BY Clause of SELECT statement in SQLReference Jan 23, 2020
@SparkQA
Copy link

SparkQA commented Jan 23, 2020

Test build #117279 has finished for PR 27289 at commit e7b7c59.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 23, 2020

Test build #117293 has finished for PR 27289 at commit e847798.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 23, 2020

Test build #117299 has finished for PR 27289 at commit f5f42e4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 23, 2020

Test build #117302 has finished for PR 27289 at commit ea68204.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal dilipbiswal force-pushed the sql-ref-select-sortby branch from ea68204 to 6177310 Compare January 26, 2020 05:51
@dilipbiswal
Copy link
Contributor Author

@maropu @srowen I have reconciled the duplicate parts between orderby and sortby. Please take a look when you get a chance. Thanks !!

@SparkQA
Copy link

SparkQA commented Jan 26, 2020

Test build #117414 has finished for PR 27289 at commit 6177310.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member

maropu commented Jan 26, 2020

How about explicitly leaving some comments about the difference between ORDER BY and SORT BY behaviours somewhere in this doc? The other parts looks fine to me.

@dilipbiswal
Copy link
Contributor Author

@maropu I had tried to document this in the main description section like this :

The SORT BY clause is used to return the result rows sorted within each partition in the user specified order. When there is more than one partition SORT BY may return result that is partially ordered. This is different than ORDER BY clause which guarantees a total order of the output.

What do you think ?

@maropu
Copy link
Member

maropu commented Jan 27, 2020

@dilipbiswal Yea, that looks ok.

Copy link
Member

@maropu maropu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc: @srowen

@srowen srowen closed this in 7e1b991 Jan 27, 2020
@srowen
Copy link
Member

srowen commented Jan 27, 2020

Merged to master

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants