Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-7322] [SQL] [WIP] Support Window Function in DataFrame #6104

Closed
wants to merge 12 commits into from

Conversation

chenghao-intel
Copy link
Contributor

This is a WIP PR for early feedback.

The usage is kind of like:

df.select(
        $"key",
        avg("value").over(
          partitionBy($"key")
          .orderBy($"value")
          .rows   // RowFrame
          .between
          .preceding(1)
          .and
          .following(1))

df.select(
        $"key",
        avg("value").over(
          partitionBy($"key")
          .orderBy($"value")
          .range// RangeFrame
          .preceding(1))

// Define a new window
val w = partitionBy("key").orderBy("value")
df.select(
   lead("key").over(w), // binding the predefined window
   lead("value").over(w)) // binding the predefined window
  • API design
  • Scala Doc
  • More specific Unit Tests
  • Python API Support
  • More function added e.g. (NTILE, ROW_NUMBER, DENSE_RANK, RANK, CUME_DIST and PERCENT_RANK, which supported by Hive)

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@SparkQA
Copy link

SparkQA commented May 13, 2015

Test build #32572 has started for PR 6104 at commit 4368a81.

@SparkQA
Copy link

SparkQA commented May 13, 2015

Test build #32572 has finished for PR 6104 at commit 4368a81.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class WindowFunctionDefinition protected[sql](

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32572/
Test PASSed.

@hvanhovell
Copy link
Contributor

Hi,

In the JIRA the following examples is given:

df.select(
  df.store,
  df.date,
  df.sales,
  avg(df.sales).over.partitionBy(df.store)
                    .orderBy(df.store) 
                    .rowsFollowing(0)    // this means from unbounded preceding to current row
)

Is there a reason for why the aggregation operation has moved from the beginning (the style in the JIRA), to the end (style above)? Are both still possible? I'd prefer the former, since it seems a bit shorter, and more recognizable comming from SQL.

On a related note. Is it also an idea to be able to create a seperate window (groupBy/orderBy) definition and use this definition in one or more windowed aggregates. For example:

val window = partitionBy($"store").orderBy($"date)
df.select (
  $"store"
  ,$"date"
  ,sum($"sales").over(window).rowsFollowing(0).as("TotalSales")
  ,sum($"sales").over(window).rowsFollowing(0).rowsPreceding(2).as("SalesLast3M")
)

@rxin
Copy link
Contributor

rxin commented May 13, 2015

It seems to me it'd be easier to have the aggregate function in the front also. @chenghao-intel any reason you designed it this way? Is it to accommodate multiple aggregates for the same window?

@chenghao-intel
Copy link
Contributor Author

Oh, actually it was my bad, the over should be a independent function call, not a method of Column.

The window definition should be quite independent also, and it can be given a name, as @rxin said, it would be simpler if accommodating multiple functions. And, it would be easier to create functions(aggregate/window function) via the object WindowFunctionDefinition instead of the DataFrame, right?

@chenghao-intel
Copy link
Contributor Author

rowsFollowing(0) is a good idea for representing the current row. :)

For .over(window).rowsFollowing(0) V.S. over.rowsFollowing(0).withPartition, the former might make it a little more complicated in code-wise. I can update the code once we get more feedback?

@rxin
Copy link
Contributor

rxin commented May 13, 2015

@chenghao-intel - I think it's better to follow SQL more closely here. I don't think it is much harder to write multiple aggregates, since the user can easily just add a function to create the window statement. On the contrary, it is more clear how many aggregates we are applying if we put this in the front.

@chenghao-intel
Copy link
Contributor Author

Ok, I will update the code soon.

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@SparkQA
Copy link

SparkQA commented May 14, 2015

Test build #32702 has started for PR 6104 at commit d37fbe4.

@chenghao-intel
Copy link
Contributor Author

I've updated the code and description, however, I have no idea if we can remove the toColumn method from the WindownFunctionDefinition.
Any thoughts? @rxin

@SparkQA
Copy link

SparkQA commented May 14, 2015

Test build #32702 has finished for PR 6104 at commit d37fbe4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class WindowFunctionDefinition protected[sql](

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32702/
Test PASSed.

@ash211
Copy link
Contributor

ash211 commented May 14, 2015

There was talk earlier of referencing the window function API that jooq uses when implementing this in SparkSQL. Is it a goal to make this similar to jooq's syntax?

http://blog.jooq.org/2013/11/03/probably-the-coolest-sql-feature-window-functions/

@rxin
Copy link
Contributor

rxin commented May 14, 2015

Yup it is fairly similar.

@chenghao-intel
Copy link
Contributor Author

Thank you @ash211 That's really a cool idea to support a named window in the page that you send. I will update the code.

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@SparkQA
Copy link

SparkQA commented May 15, 2015

Test build #32827 has started for PR 6104 at commit a4da7fe.

@chenghao-intel
Copy link
Contributor Author

@rxin @yhuai , any comments on the API design? Then I can start the rest of the work.

@SparkQA
Copy link

SparkQA commented May 15, 2015

Test build #32827 has finished for PR 6104 at commit a4da7fe.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class WindowFunctionDefinition protected[sql](

@AmplabJenkins
Copy link

Merged build finished. Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32827/
Test FAILed.

@chenghao-intel
Copy link
Contributor Author

@rxin Updated!
But still need to add scaladoc, before I doing this, can you review the interface again?

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

}

@Test
public void saveTableAndQueryIt() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change the function name.

@SparkQA
Copy link

SparkQA commented May 21, 2015

Test build #33254 has started for PR 6104 at commit d625a64.

* based on it.
*/
@Experimental
object Window extends Window()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think about it more -- this is actually problematic for java because none partitionBy, orderBy won't become static methods due to conflicts with the Window class. Here's the fix.

Move "class Window" into "object Window", and rename it to WindowSpec, and then just define the two partitionBy / orderBy top level methods in object Window. If we need another method for a window spec that doesn't have partitionBy/orderBy, we can add another one - I don't have a good name for it yet.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yhuai suggested "allRows"

@SparkQA
Copy link

SparkQA commented May 21, 2015

Test build #33254 has finished for PR 6104 at commit d625a64.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Window

@AmplabJenkins
Copy link

Merged build finished. Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/33254/
Test FAILed.

@rxin
Copy link
Contributor

rxin commented May 21, 2015

Actually I will submit a PR against your branch.

rxin added a commit to rxin/spark that referenced this pull request May 22, 2015
[SPARK-7322] [SQL] [WIP] Support Window Function in DataFrame
asfgit pushed a commit that referenced this pull request May 22, 2015
This closes #6104.

Author: Cheng Hao <hao.cheng@intel.com>
Author: Reynold Xin <rxin@databricks.com>

Closes #6343 from rxin/window-df and squashes the following commits:

026d587 [Reynold Xin] Address code review feedback.
dc448fe [Reynold Xin] Fixed Hive tests.
9794d9d [Reynold Xin] Moved Java test package.
9331605 [Reynold Xin] Refactored API.
3313e2a [Reynold Xin] Merge pull request #6104 from chenghao-intel/df_window
d625a64 [Cheng Hao] Update the dataframe window API as suggsted
c141fb1 [Cheng Hao] hide all of properties of the WindowFunctionDefinition
3b1865f [Cheng Hao] scaladoc typos
f3fd2d0 [Cheng Hao] polish the unit test
6847825 [Cheng Hao] Add additional analystcs functions
57e3bc0 [Cheng Hao] typos
24a08ec [Cheng Hao] scaladoc
28222ed [Cheng Hao] fix bug of range/row Frame
1d91865 [Cheng Hao] style issue
53f89f2 [Cheng Hao] remove the over from the functions.scala
964c013 [Cheng Hao] add more unit tests and window functions
64e18a7 [Cheng Hao] Add Window Function support for DataFrame
asfgit pushed a commit that referenced this pull request May 22, 2015
This closes #6104.

Author: Cheng Hao <hao.cheng@intel.com>
Author: Reynold Xin <rxin@databricks.com>

Closes #6343 from rxin/window-df and squashes the following commits:

026d587 [Reynold Xin] Address code review feedback.
dc448fe [Reynold Xin] Fixed Hive tests.
9794d9d [Reynold Xin] Moved Java test package.
9331605 [Reynold Xin] Refactored API.
3313e2a [Reynold Xin] Merge pull request #6104 from chenghao-intel/df_window
d625a64 [Cheng Hao] Update the dataframe window API as suggsted
c141fb1 [Cheng Hao] hide all of properties of the WindowFunctionDefinition
3b1865f [Cheng Hao] scaladoc typos
f3fd2d0 [Cheng Hao] polish the unit test
6847825 [Cheng Hao] Add additional analystcs functions
57e3bc0 [Cheng Hao] typos
24a08ec [Cheng Hao] scaladoc
28222ed [Cheng Hao] fix bug of range/row Frame
1d91865 [Cheng Hao] style issue
53f89f2 [Cheng Hao] remove the over from the functions.scala
964c013 [Cheng Hao] add more unit tests and window functions
64e18a7 [Cheng Hao] Add Window Function support for DataFrame

(cherry picked from commit f6f2eeb)
Signed-off-by: Reynold Xin <rxin@databricks.com>
@chenghao-intel chenghao-intel deleted the df_window branch May 22, 2015 12:39
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015
This closes apache#6104.

Author: Cheng Hao <hao.cheng@intel.com>
Author: Reynold Xin <rxin@databricks.com>

Closes apache#6343 from rxin/window-df and squashes the following commits:

026d587 [Reynold Xin] Address code review feedback.
dc448fe [Reynold Xin] Fixed Hive tests.
9794d9d [Reynold Xin] Moved Java test package.
9331605 [Reynold Xin] Refactored API.
3313e2a [Reynold Xin] Merge pull request apache#6104 from chenghao-intel/df_window
d625a64 [Cheng Hao] Update the dataframe window API as suggsted
c141fb1 [Cheng Hao] hide all of properties of the WindowFunctionDefinition
3b1865f [Cheng Hao] scaladoc typos
f3fd2d0 [Cheng Hao] polish the unit test
6847825 [Cheng Hao] Add additional analystcs functions
57e3bc0 [Cheng Hao] typos
24a08ec [Cheng Hao] scaladoc
28222ed [Cheng Hao] fix bug of range/row Frame
1d91865 [Cheng Hao] style issue
53f89f2 [Cheng Hao] remove the over from the functions.scala
964c013 [Cheng Hao] add more unit tests and window functions
64e18a7 [Cheng Hao] Add Window Function support for DataFrame
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
This closes apache#6104.

Author: Cheng Hao <hao.cheng@intel.com>
Author: Reynold Xin <rxin@databricks.com>

Closes apache#6343 from rxin/window-df and squashes the following commits:

026d587 [Reynold Xin] Address code review feedback.
dc448fe [Reynold Xin] Fixed Hive tests.
9794d9d [Reynold Xin] Moved Java test package.
9331605 [Reynold Xin] Refactored API.
3313e2a [Reynold Xin] Merge pull request apache#6104 from chenghao-intel/df_window
d625a64 [Cheng Hao] Update the dataframe window API as suggsted
c141fb1 [Cheng Hao] hide all of properties of the WindowFunctionDefinition
3b1865f [Cheng Hao] scaladoc typos
f3fd2d0 [Cheng Hao] polish the unit test
6847825 [Cheng Hao] Add additional analystcs functions
57e3bc0 [Cheng Hao] typos
24a08ec [Cheng Hao] scaladoc
28222ed [Cheng Hao] fix bug of range/row Frame
1d91865 [Cheng Hao] style issue
53f89f2 [Cheng Hao] remove the over from the functions.scala
964c013 [Cheng Hao] add more unit tests and window functions
64e18a7 [Cheng Hao] Add Window Function support for DataFrame
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
This closes apache#6104.

Author: Cheng Hao <hao.cheng@intel.com>
Author: Reynold Xin <rxin@databricks.com>

Closes apache#6343 from rxin/window-df and squashes the following commits:

026d587 [Reynold Xin] Address code review feedback.
dc448fe [Reynold Xin] Fixed Hive tests.
9794d9d [Reynold Xin] Moved Java test package.
9331605 [Reynold Xin] Refactored API.
3313e2a [Reynold Xin] Merge pull request apache#6104 from chenghao-intel/df_window
d625a64 [Cheng Hao] Update the dataframe window API as suggsted
c141fb1 [Cheng Hao] hide all of properties of the WindowFunctionDefinition
3b1865f [Cheng Hao] scaladoc typos
f3fd2d0 [Cheng Hao] polish the unit test
6847825 [Cheng Hao] Add additional analystcs functions
57e3bc0 [Cheng Hao] typos
24a08ec [Cheng Hao] scaladoc
28222ed [Cheng Hao] fix bug of range/row Frame
1d91865 [Cheng Hao] style issue
53f89f2 [Cheng Hao] remove the over from the functions.scala
964c013 [Cheng Hao] add more unit tests and window functions
64e18a7 [Cheng Hao] Add Window Function support for DataFrame
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants