[SPARK-28962][SQL] Provide index argument to filter lambda functions #25666
Conversation
Ok to test
ok to test
Test build #110089 has finished for PR 25666 at commit
Test build #110572 has finished for PR 25666 at commit
retest this please
Test build #110580 has finished for PR 25666 at commit
ok to test
Test build #110695 has finished for PR 25666 at commit
LGTM so far, but we might need to add tests to DataFrameFunctionsSuite
for the new usage.
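As a hedged illustration, a DataFrameFunctionsSuite-style test for the new usage might look like the sketch below; it assumes the Scala filter(Column, (Column, Column) => Column) overload added alongside this change and the usual QueryTest helpers, and is not the PR's actual test:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, filter}

test("filter with (element, index) lambda - DataFrame API") {
  import testImplicits._  // available in QueryTest-based suites
  // Keep only elements strictly greater than their 0-based index.
  val df = Seq(Seq(0, 2, 3)).toDF("arr")
  checkAnswer(
    df.select(filter(col("arr"), (x, i) => x > i)),
    Seq(Row(Seq(2, 3))))
}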
@@ -344,6 +344,8 @@ case class MapFilter(
    Examples:
      > SELECT _FUNC_(array(1, 2, 3), x -> x % 2 == 1);
       [1,3]
      > SELECT _FUNC_(array(0, 2, 3), (x, i) -> x > i);
       [2, 3]
  """,
  since = "2.4.0")
Could you add a note to describe that this can take the index argument since 3.0.0? E.g., from collectionOperations.scala (lines 1049 to 1051 at 7402935):

note = """
  Reverse logic for arrays is available since 2.4.0.
"""
@@ -344,6 +344,8 @@ case class MapFilter(
    Examples:
      > SELECT _FUNC_(array(1, 2, 3), x -> x % 2 == 1);
       [1,3]
      > SELECT _FUNC_(array(0, 2, 3), (x, i) -> x > i);
       [2, 3]
nit: [2,3]?
@transient lazy val LambdaFunction(_, Seq(elementVar: NamedLambdaVariable), _) = function
@transient lazy val (elementVar, indexVar) = {
  val LambdaFunction(_, (elementVar: NamedLambdaVariable) +: tail, _) = function
  val indexVar = if (tail.nonEmpty) {
nit: val indexVar = tail.headOption.map(_.asInstanceOf[NamedLambdaVariable])
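Applied to the diff above, that suggestion would read roughly as follows (a sketch using the names from the snippet; the actual merged code may differ):

@transient lazy val (elementVar, indexVar) = {
  val LambdaFunction(_, (elementVar: NamedLambdaVariable) +: tail, _) = function
  // Present only when the lambda declares a second (index) argument.
  val indexVar = tail.headOption.map(_.asInstanceOf[NamedLambdaVariable])
  (elementVar, indexVar)
}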
case LambdaFunction(_, arguments, _) if arguments.size == 2 =>
  copy(function = f(function, (elementType, containsNull) :: (IntegerType, false) :: Nil))
case _ =>
  copy(function = f(function, (elementType, containsNull) :: Nil))
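For context, a hedged sketch of how the surrounding bind override might look with this match, modeled on ArrayTransform's shape rather than on the exact merged code:

override def bind(
    f: (Expression, Seq[(DataType, Boolean)]) => LambdaFunction): ArrayFilter = {
  val ArrayType(elementType, containsNull) = argument.dataType
  function match {
    case LambdaFunction(_, arguments, _) if arguments.size == 2 =>
      // (element, index) lambda: bind the element type plus a non-nullable integer index.
      copy(function = f(function, (elementType, containsNull) :: (IntegerType, false) :: Nil))
    case _ =>
      // Element-only lambda.
      copy(function = f(function, (elementType, containsNull) :: Nil))
  }
}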
Don't we need to validate the number of arguments here (the case where arguments.size > 2)?
Can you check the current error message for that case?
ArrayTransform doesn't validate arguments.size > 2. I'm not sure what happens in that case either.
Never mind. I checked that the error handling works well for this case.
Yes, it does. See the test here: https://github.com/apache/spark/pull/25666/files#diff-8e1a34391fdefa4a3a0349d7d454d86fR2204.
Should we also provide similar overloads with index arguments in
@nvander1 I'm not sure whether we also need the index argument in
@ueshin comments addressed. I added a test to
Test build #111596 has finished for PR 25666 at commit
Jenkins, retest this please.
LGTM, pending tests.
Test build #111619 has finished for PR 25666 at commit
@@ -369,6 +383,9 @@ case class ArrayFilter(
  var i = 0
  while (i < arr.numElements) {
    elementVar.value.set(arr.get(i, elementVar.dataType))
    if (indexVar.isDefined) {
Can you avoid this per-row check? The current code causes unnecessary runtime overhead.
@maropu do you have a suggestion about how to do this without implementing codegen? I tried rewriting the logic like so:
@transient private lazy val evalFn: (InternalRow, Any) => Any = indexVar match {
  case None => (inputRow, argumentValue) =>
    val arr = argumentValue.asInstanceOf[ArrayData]
    val f = functionForEval
    val buffer = new mutable.ArrayBuffer[Any](arr.numElements)
    var i = 0
    while (i < arr.numElements) {
      elementVar.value.set(arr.get(i, elementVar.dataType))
      if (f.eval(inputRow).asInstanceOf[Boolean]) {
        buffer += elementVar.value.get
      }
      i += 1
    }
    new GenericArrayData(buffer)
  case Some(expr) => (inputRow, argumentValue) =>
    val arr = argumentValue.asInstanceOf[ArrayData]
    val f = functionForEval
    val buffer = new mutable.ArrayBuffer[Any](arr.numElements)
    var i = 0
    while (i < arr.numElements) {
      elementVar.value.set(arr.get(i, elementVar.dataType))
      expr.value.set(i)
      if (f.eval(inputRow).asInstanceOf[Boolean]) {
        buffer += elementVar.value.get
      }
      i += 1
    }
    new GenericArrayData(buffer)
}

override def nullSafeEval(inputRow: InternalRow, argumentValue: Any): Any = {
  evalFn(inputRow, argumentValue)
}
But from some hacky microbenchmarking this doesn't seem to be meaningfully faster and if anything is marginally slower.
This is the benchmark code I was using:
test("ArrayFilter - benchmark") {
import scala.concurrent.duration._
val b = new Benchmark(
"array_filter",
1000,
warmupTime = 5.seconds,
minTime = 5.seconds)
val ai0 = Literal.create(Seq(1, 2, 3), ArrayType(IntegerType, containsNull = false))
val isEven: Expression => Expression = x => x % 2 === 0
b.addCase("filter") { _ =>
var i = 0
while (i < 1000) {
filter(ai0, isEven).eval()
i += 1
}
}
b.run()
}
@maropu @henrydavidge The best-performing way to avoid the per-row check in a non-codegen setting is to introduce a new expression type, say ArrayFilterWithIndex.
The tradeoff between the inline per-row check and the lambda batch solution is that on input arrays that are small (like the one @henrydavidge used in his benchmark), the overhead of the lambda invocation (which is not guaranteed to be inlined and optimized) may exceed the overhead of the per-row check. You'd need a fairly large input array to amortize that.
If we want to make it stay simple for now, I'm okay with the inline per-row check version.
I was thinking of code like this:
@transient lazy val (elementVar, mayFillIndex) = function match {
  case LambdaFunction(_, Seq(elemVar: NamedLambdaVariable), _) =>
    (elemVar, (_: Int) => {})
  case LambdaFunction(_, Seq(elemVar: NamedLambdaVariable, idxVar: NamedLambdaVariable), _) =>
    (elemVar, (i: Int) => idxVar.value.set(i))
}

override def nullSafeEval(inputRow: InternalRow, argumentValue: Any): Any = {
  val arr = argumentValue.asInstanceOf[ArrayData]
  val f = functionForEval
  val buffer = new mutable.ArrayBuffer[Any](arr.numElements)
  var i = 0
  while (i < arr.numElements) {
    elementVar.value.set(arr.get(i, elementVar.dataType))
    mayFillIndex(i)
    if (f.eval(inputRow).asInstanceOf[Boolean]) {
      buffer += elementVar.value.get
    }
    i += 1
  }
  new GenericArrayData(buffer)
}
Ok, tried that as well. It doesn't seem to be significantly different from the others.
Yeah, if there's no big difference, I'd prefer handling it similarly to the others, e.g., https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala#L555
I think this is good enough to go.
How about merging this for now, and addressing it in a separate PR?
transform is implemented the same way, so I think we should do the same thing there if needed, maybe at the same time.
+1 for this is ready to go for now and we can address the optimization separately.
Side-comment on the version that @maropu gave:
The lambda version that @henrydavidge gave (i.e. "batch-wise lambda") would technically have less overhead:
// lambda invocation overhead outside of loop
for each element in array
do specialized filter action
whereas the version that @maropu gave (i.e. "element-wise lambda") would be:
// shared loop between the two versions
for each element in array
// lambda invocation overhead per element
invoke mayFillIndex lambda
With @maropu's version, assuming we're running on the HotSpot JVM and both the with-index and without-index paths have been exercised, the best the HotSpot JIT compiler could do is a profile-guided bimorphic devirtualization on that lambda call site, which would look like the following after devirtualization+inlining:
local_mayFillIndex = this.mayFillIndex
klazz = local_mayFillIndex.klass
for each element in array {
  // ...
  if (klazz == lambda_klass_1) {
    // no-op
  } else if (klazz == lambda_klass_2) {
    idxVar.value.set(i)
  } else {
    uncommon_trap() // aka deoptimize, or potentially a full virtual call
  }
}
The point is that this JIT-optimized version is actually a degenerate version of Henry's hand-written inline per-element check, so I wouldn't want to go down this route.
Thanks, kris! That explanation's very helpful to me.
Thanks all, I'll merge this for now as per the agreement at #25666 (comment).
Thanks. Merging to master.
@@ -344,8 +344,13 @@ case class MapFilter(
    Examples:
      > SELECT _FUNC_(array(1, 2, 3), x -> x % 2 == 1);
       [1,3]
      > SELECT _FUNC_(array(0, 2, 3), (x, i) -> x > i);
Here, the indices start at 0, but it sounds like the other built-in functions start at 1.
I remember there was a (not-merged) PR to standardize one-based column indexes in built-in functions: #24051. Would it be better to fix them up for consistency?
…cala function API filter

### What changes were proposed in this pull request?
This PR is a follow-up of PR #25666, adding the description and example for the Scala function API `filter`.

### Why are the changes needed?
It is hard to tell which parameter is the index column.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
N/A

Closes #27336 from gatorsmile/spark28962.

Authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
What changes were proposed in this pull request?
Lambda functions passed to array `filter` can now take the index as well as the element as input. This behavior matches array `transform`.
Why are the changes needed?
See JIRA. It's generally useful, and particularly so if you're working with fixed length arrays.
Does this PR introduce any user-facing change?
Previously, filter lambdas had to look like `filter(arr, el -> whatever)`.
Now, lambdas can take an index argument as well: `filter(array, (el, idx) -> whatever)`.
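For example, a quick illustration from the Scala shell (a hedged sketch: it assumes an active SparkSession named `spark`, and the index is 0-based):

// Keeps elements strictly greater than their 0-based index; expected result: [2, 3].
spark.sql("SELECT filter(array(0, 2, 3), (x, i) -> x > i) AS kept").show()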
How was this patch tested?
I added unit tests to HigherOrderFunctionsSuite.