
[SPARK-30589][DOC] Document DISTRIBUTE BY Clause of SELECT statement in SQL Reference #27298

Closed

Conversation

dilipbiswal
Contributor

What changes were proposed in this pull request?

Document DISTRIBUTE BY clause of SELECT statement in SQL Reference Guide.

Why are the changes needed?

Currently, Spark lacks documentation for its supported SQL constructs, causing
confusion among users, who sometimes have to read the code to understand the
usage. This PR aims to address that issue.

Does this PR introduce any user-facing change?

Yes.

Before:
There was no documentation for this.

After:
(two screenshots of the rendered DISTRIBUTE BY documentation page, taken Jan 20, 2020)

How was this patch tested?

Tested using `jekyll build --serve`

@SparkQA

SparkQA commented Jan 20, 2020

Test build #117140 has finished for PR 27298 at commit e5dc12e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

limitations under the License.
---
The <code>DISTRIBUTE BY</code> clause is used to repartition the data based
on the input expressions. Unlike the `CLUSTER BY` clause, this does not
sort the data within each partition.
Contributor

link to CLUSTER BY?

Contributor Author

@huaxingao will do it in the finalization PR when the links are available.
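To illustrate the contrast discussed in this thread (a sketch using the `person` table from the doc's own examples; not part of the patch itself):

```sql
-- DISTRIBUTE BY only repartitions the rows by age; rows within a
-- partition stay in an arbitrary order.
SELECT age, name FROM person DISTRIBUTE BY age;

-- CLUSTER BY repartitions by age AND sorts by age within each partition,
-- i.e. it behaves like DISTRIBUTE BY followed by SORT BY.
SELECT age, name FROM person CLUSTER BY age;
```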

('John A', 18),
('Jack N', 16);
-- Reduce the number of shuffle partitions to 2 to illustrate the behaviour of `DISTRIBUTE BY`.
-- Its easier to see the clustering and sorting behaviour with less number of partitions.
Contributor

Its -> it's?

SET spark.sql.shuffle.partitions = 2;

-- Select the rows with no ordering. Please note that without any sort directive, the results
-- of the query is not deterministic. Its included here to just contrast it with the
Contributor

Its -> it's?

-- Its easier to see the clustering and sorting behaviour with less number of partitions.
SET spark.sql.shuffle.partitions = 2;

-- Select the rows with no ordering. Please note that without any sort directive, the results
Contributor

the results of the query is... -> the result of the query is...?


-- Select the rows with no ordering. Please note that without any sort directive, the results
-- of the query is not deterministic. Its included here to just contrast it with the
-- behaviour of `DISTRIBUTE BY`. The query below produces rows where age column are not
Contributor

age column are not... -> age columns are not...?
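Putting the quoted fragments together, the doc's example flow looks roughly like this (only the last two `INSERT` rows and the `SET` statement are visible in the excerpts; the rest is an illustrative sketch):

```sql
CREATE TABLE person (name STRING, age INT);
INSERT INTO person VALUES
    ('John A', 18),
    ('Jack N', 16);

-- Reduce the number of shuffle partitions to 2 so the effect of
-- `DISTRIBUTE BY` is easy to see.
SET spark.sql.shuffle.partitions = 2;

-- Without any distribution or sort directive, the row order is
-- not deterministic.
SELECT age, name FROM person;

-- With DISTRIBUTE BY, rows with the same age end up in the same
-- partition (but are not sorted within it).
SELECT age, name FROM person DISTRIBUTE BY age;
```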

@SparkQA

SparkQA commented Jan 21, 2020

Test build #117148 has finished for PR 27298 at commit fbd4096.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


### Syntax
{% highlight sql %}
DISTRIBUTE BY { expression [ , ...] }
Contributor

[ , ...] -> [ , ... ]?
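For reference, the bracketed `[ , ... ]` in the grammar means the clause accepts one or more comma-separated expressions, e.g. (a sketch using the doc's `person` table):

```sql
-- Single partitioning expression
SELECT age, name FROM person DISTRIBUTE BY age;

-- Multiple expressions: repartition by the combination (age, name)
SELECT age, name FROM person DISTRIBUTE BY age, name;
```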

@SparkQA

SparkQA commented Jan 23, 2020

Test build #117278 has finished for PR 27298 at commit 7e40347.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 23, 2020

Test build #117300 has finished for PR 27298 at commit 96b5628.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Jan 27, 2020

Merged to master. @dilipbiswal did you want to make a follow up to link several pages?

@dilipbiswal
Contributor Author

@srowen Thanks a lot, Sean. Yeah, I will do it today.
