Add ARRAY_MIN_BY, ARRAY_MAX_BY functions #18555

sviscaino · 2022-10-24T21:47:26Z

This PR adds 2 functions in order to get the max or min element of an array, while using a transformation function. This is the equivalent of Scala's maxBy/minBy functions.

The current way of obtaining an equivalent behaviour would be to either use ARRAY_SORT with a custom comparator and selecting the first element, or something similar to the following pattern:

ARRAY_MAX_BY(a, f) :=
  REDUCE(
    a,
    (NULL, NULL),
    ((curr_max, curr_max_elt), elt) ->
      IF(curr_max IS NULL OR f(elt) >= curr_max,
         (f(elt), elt),
         (curr_max, curr_max_elt)),
    (curr_max, curr_max_elt) -> curr_max_elt)

These current methods are neither elegant nor optimized.

Test plan - (Please fill in how you tested your changes)
mvn -Dtest="TestArrayMinByFunction,TestArrayMaxByFunction" test

== RELEASE NOTES ==

General Changes
* Add functions :func:`array_min_by`, :func:`array_max_by`, to find the smallest or largest element of an array when applying a custom measuring function

kaikalur · 2022-10-24T21:54:47Z

For infrequent functions like this, good to use SQL udf. Checkout:

https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/operator/scalar/sql/ArraySqlFunctions.java

It will be a lot easier to read and won't be much perf overhead either

sviscaino · 2022-10-24T21:56:57Z

For infrequent functions like this, good to use SQL udf. Checkout:

https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/operator/scalar/sql/ArraySqlFunctions.java

It will be a lot easier to read and won't be much perf overhead either

I had tried writing it with sql udf some time ago but it didn't seem to work with custom type parameters (array<T>)

kaikalur · 2022-10-24T21:59:04Z

For infrequent functions like this, good to use SQL udf. Checkout:
https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/operator/scalar/sql/ArraySqlFunctions.java
It will be a lot easier to read and won't be much perf overhead either

I had tried writing it with sql udf some time ago but it didn't seem to work with custom type parameters (array<T>)

Yeah - if you can fix that, that will be a bonus :) But otherwise, just copy and paste like the rest of them here like ArrayFrequency etc.

sviscaino · 2022-10-26T18:27:38Z

Yeah - if you can fix that, that will be a bonus :) But otherwise, just copy and paste like the rest of them here like ArrayFrequency etc.

So I think it's quite complicated to write with sql udf in this case, it would require a big refactoring of SqlUdf because the function's signature is quite complex:

array_min_by(array(T), function(T, U)) -> T

It takes 2 type parameters (T which can be any type, and U which can be any [comparable] type), and a lambda.

SQL UDF does not seem to work with lambda arguments, for example I tried adding a simple test function

    @SqlInvokedScalarFunction(value = "my_test_sql_udf", deterministic = true, calledOnNullInput = false)
    @Description("test")
    @SqlParameters({
            @SqlParameter(name = "input", type="array(varchar)"),
            @SqlParameter(name = "f", type="function(varchar,integer)"),
            }
    )
    @SqlType("array(bigint)")
    public static String transform2()
    {
        return "RETURN array[CAST(1 AS BIGINT)]";
    }

And am getting an error when making a simple unit test ("my_test_sql_udf(array['a','b'], x -> 2)"):

java.lang.RuntimeException: Error compiling my_test_sql_udf(CAST(ARRAY(Slice{base=[B@3506d826, address=16, length=1}, Slice{base=[B@35dd9ed3, address=16, length=1})), (x) -> 2): No value present
	at com.facebook.presto.operator.scalar.FunctionAssertions.compileFilterProject(FunctionAssertions.java:955)
	at com.facebook.presto.operator.scalar.FunctionAssertions.executeProjectionWithAll(FunctionAssertions.java:610)
	at com.facebook.presto.operator.scalar.FunctionAssertions.selectUniqueValue(FunctionAssertions.java:313)
Caused by: java.util.NoSuchElementException: No value present
	at java.util.Optional.get(Optional.java:135)
	at com.facebook.presto.sql.gen.RowExpressionCompiler$Visitor.visitLambda(RowExpressionCompiler.java:287)
	at com.facebook.presto.sql.gen.RowExpressionCompiler$Visitor.visitLambda(RowExpressionCompiler.java:112)

Also, SQL UDF does not work with type parameters, and here if we want to be completely generic we'd need to list all combinations of both T and U (say varchar,bigint,map,struct,array,double for T and bigint,double,varchar for U ==> 6*3=18 type specializations).

kaikalur · 2022-10-26T18:38:54Z

Yeah - if you can fix that, that will be a bonus :) But otherwise, just copy and paste like the rest of them here like ArrayFrequency etc.

So I think it's quite complicated to write with sql udf in this case, it would require a big refactoring of SqlUdf because the function's signature is quite complex:
array_min_by(array(T), function(T, U)) -> T
It takes 2 type parameters (T which can be any type, and U which can be any [comparable] type), and a lambda.

SQL UDF does not seem to work with lambda arguments, for example I tried adding a simple test function
    @SqlInvokedScalarFunction(value = "my_test_sql_udf", deterministic = true, calledOnNullInput = false)
    @Description("test")
    @SqlParameters({
            @SqlParameter(name = "input", type="array(varchar)"),
            @SqlParameter(name = "f", type="function(varchar,integer)"),
            }
    )
    @SqlType("array(bigint)")
    public static String transform2()
    {
        return "RETURN array[CAST(1 AS BIGINT)]";
    }
And am getting an error when making a simple unit test ("my_test_sql_udf(array['a','b'], x -> 2)"):
java.lang.RuntimeException: Error compiling my_test_sql_udf(CAST(ARRAY(Slice{base=[B@3506d826, address=16, length=1}, Slice{base=[B@35dd9ed3, address=16, length=1})), (x) -> 2): No value present
	at com.facebook.presto.operator.scalar.FunctionAssertions.compileFilterProject(FunctionAssertions.java:955)
	at com.facebook.presto.operator.scalar.FunctionAssertions.executeProjectionWithAll(FunctionAssertions.java:610)
	at com.facebook.presto.operator.scalar.FunctionAssertions.selectUniqueValue(FunctionAssertions.java:313)
Caused by: java.util.NoSuchElementException: No value present
	at java.util.Optional.get(Optional.java:135)
	at com.facebook.presto.sql.gen.RowExpressionCompiler$Visitor.visitLambda(RowExpressionCompiler.java:287)
	at com.facebook.presto.sql.gen.RowExpressionCompiler$Visitor.visitLambda(RowExpressionCompiler.java:112)
Also, SQL UDF does not work with type parameters, and here if we want to be completely generic we'd need to list all combinations of both T and U (say varchar,bigint,map,struct,array,double for T and bigint,double,varchar for U ==> 6*3=18 type specializations).

Yeah - I understand the sentiment. But the issue is supporting these one-off/rarely used builtins adds a long tail causing a) support issues and b) portability - say now velox will have to implement it as well. That's why I'm reluctant to do these things.

sviscaino · 2022-10-26T18:48:10Z

Yeah - I understand the sentiment. But the issue is supporting these one-off/rarely used builtins adds a long tail causing a) support issues and b) portability - say now velox will have to implement it as well. That's why I'm reluctant to do these things.

I see, would having a sql udf reduce this downstream overhead?

kaikalur · 2022-10-26T18:52:27Z

Yeah - I understand the sentiment. But the issue is supporting these one-off/rarely used builtins adds a long tail causing a) support issues and b) portability - say now velox will have to implement it as well. That's why I'm reluctant to do these things.

I see, would having a sql udf reduce this downstream overhead?

We don't need to "port" java to c++ - it will be just SQL - that's the main advantage. And my hope is that it will be easy to prove correctness and also fix any issues upfront than relying on the internal data structures.

sviscaino · 2022-10-26T20:22:09Z

We don't need to "port" java to c++ - it will be just SQL - that's the main advantage. And my hope is that it will be easy to prove correctness and also fix any issues upfront than relying on the internal data structures.

Ok, I fixed both issues:

type parameters were not being correctly passed along to the signature for sql udf functions,
the previous error was due to calledOnNullInput, which probably does a IS NULL check under the hood, which doesn't make sense for a lambda

Lambdas still aren't fully supported since they are being treated as parameters, e.g. this doesn't work (it will result in f is not a registered function)

@SqlParameter(name = "f", type="function(T, U)")
public static String myFunction() {
  return "RETURN f(t)";
}

But this works:

return "RETURN TRANSFORM(ARRAY[t], f)";

Which is why I had to hack around using ZIP and TRANSFORM (I'm using the ordering of a tuple/row (a, b) < (c, d) if a < c), I couldn't apply the (probably more performant) logic with REDUCE that I put in the description.

kaikalur · 2022-10-26T20:37:46Z

We don't need to "port" java to c++ - it will be just SQL - that's the main advantage. And my hope is that it will be easy to prove correctness and also fix any issues upfront than relying on the internal data structures.

Ok, I fixed both issues:

type parameters were not being correctly passed along to the signature for sql udf functions,

the previous error was due to calledOnNullInput, which probably does a IS NULL check under the hood, which doesn't make sense for a lambda

Lambdas still aren't fully supported since they are being treated as parameters, e.g. this doesn't work (it will result in f is not a registered function)
@SqlParameter(name = "f", type="function(T, U)")
public static String myFunction() {
  return "RETURN f(t)";
}
But this works:
return "RETURN TRANSFORM(ARRAY[t], f)";
Which is why I had to hack around using ZIP and TRANSFORM (I'm using the ordering of a tuple/row (a, b) < (c, d) if a < c), I couldn't apply the (probably more performant) logic with REDUCE that I put in the description.

This is great! Please split the PR into two - one that extends the SQL UDF with type params and then the other one using that to implement your udfs?

Thank you very much for taking the pains to do this cool work!

presto-main/src/main/java/com/facebook/presto/operator/scalar/sql/ArraySqlFunctions.java

kaikalur · 2022-10-26T21:21:48Z

Also see the 'apply' function (which is not registered as a builtin but availablle):

https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/operator/scalar/ApplyFunction.java

Maybe you can try and use that? If that works we can enable the use of it.

kaikalur · 2022-10-26T21:30:53Z

Also see the 'apply' function (which is not registered as a builtin but availablle):

https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/operator/scalar/ApplyFunction.java

Maybe you can try and use that? If that works we can enable the use of it.

By this I mean if you can do something like TRANSFORM(input, x->(apply(f, x), x)))

sviscaino · 2022-10-29T16:19:28Z

This is great! Please split the PR into two - one that extends the SQL UDF with type params and then the other one using that to implement your udfs?

Thank you very much for taking the pains to do this cool work!

This is done in #18581. Once the other PR is merged I will rebase this one onto master.

sviscaino · 2022-10-29T16:20:59Z

Also see the 'apply' function (which is not registered as a builtin but availablle):

https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/operator/scalar/ApplyFunction.java

Maybe you can try and use that? If that works we can enable the use of it.

That should work! Do you think we can enable it as a builtin?

kaikalur · 2022-11-01T20:53:08Z

Also see the 'apply' function (which is not registered as a builtin but availablle):
https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/operator/scalar/ApplyFunction.java
Maybe you can try and use that? If that works we can enable the use of it.

That should work! Do you think we can enable it as a builtin?

I don't see why not. It's fully implemented. Then we can actually make it official way to use lambdas in sql functions. Just apply.

sviscaino · 2022-11-04T23:31:45Z

There's another parser issue when trying to use the lambda argument inside a lambda itself, for example when writing a simple "transform" equivalent function:

return "RETURN TRANSFORM(input, x -> apply(x, f))";

I'm getting:

com.facebook.presto.sql.analyzer.SemanticException: line 1:55: Lambda expression should always be used inside a function

	at com.facebook.presto.sql.analyzer.ExpressionAnalyzer$Visitor.visitLambdaExpression(ExpressionAnalyzer.java:1253)
	at com.facebook.presto.sql.analyzer.ExpressionAnalyzer$Visitor.visitLambdaExpression(ExpressionAnalyzer.java:330)

I think we can probably just go ahead with the current array_zip approach and fix the lambda issue later.

kaikalur · 2022-11-04T23:47:07Z

There's another parser issue when trying to use the lambda argument inside a lambda itself, for example when writing a simple "transform" equivalent function:
return "RETURN TRANSFORM(input, x -> apply(x, f))";
I'm getting:
com.facebook.presto.sql.analyzer.SemanticException: line 1:55: Lambda expression should always be used inside a function

	at com.facebook.presto.sql.analyzer.ExpressionAnalyzer$Visitor.visitLambdaExpression(ExpressionAnalyzer.java:1253)
	at com.facebook.presto.sql.analyzer.ExpressionAnalyzer$Visitor.visitLambdaExpression(ExpressionAnalyzer.java:330)
I think we can probably just go ahead with the current array_zip approach and fix the lambda issue later.

Sounds good!

presto-main/src/main/java/com/facebook/presto/operator/scalar/sql/ArraySqlFunctions.java

sviscaino · 2022-11-05T12:07:13Z

I would really like to avoid the second scan. How about:
RETURN input[ ARRAY_MIN(
ZIP_WITH(TRANSFORM(input, f),
SEQUENCE(1, CARDINALITY(input)),
(x, y)->IF(x IS NULL, NULL, (x, y)))]

Yeah that's more clever! I updated the code and unit tests to match that logic

presto-main/src/main/java/com/facebook/presto/operator/scalar/sql/ArraySqlFunctions.java

kaikalur

Looks good!

highker · 2022-11-07T06:19:17Z

Can we enhance the release note to explain more what these two functions do?

sviscaino · 2022-11-07T10:44:31Z

Can we enhance the release note to explain more what these two functions do?

Done!

kaikalur · 2022-11-07T14:54:01Z

This PR adds 2 functions in order to get the max or min element of an array, while using a transformation function. This is the equivalent of Scala's maxBy/minBy functions.

The current way of obtaining an equivalent behaviour would be to either use ARRAY_SORT with a custom comparator and selecting the first element, or something similar to the following pattern:
ARRAY_MAX_BY(a, f) :=
  REDUCE(
    a,
    (NULL, NULL),
    ((curr_max, curr_max_elt), elt) ->
      IF(curr_max IS NULL OR f(elt) >= curr_max,
         (f(elt), elt),
         (curr_max, curr_max_elt)),
    (curr_max, curr_max_elt) -> curr_max_elt)

It's best to explain the functionally than giving this complex expression. Or just give a simple example.

sviscaino · 2022-11-07T15:16:54Z

It's best to explain the functionally than giving this complex expression. Or just give a simple example.

Oh that isn't the release note, that's just the original PR description. I updated the release note to a brief description:

Add functions :func:array_min_by, :func:array_max_by, to find the smallest or largest element of an array when applying a custom measuring function

And I put some examples in the doc

sviscaino requested a review from a team as a code owner October 24, 2022 21:47

sviscaino requested a review from presto-oss October 24, 2022 21:47

sviscaino force-pushed the feature/array_min_by-array_max_by branch from 8190317 to d1c7634 Compare October 24, 2022 21:55

sviscaino force-pushed the feature/array_min_by-array_max_by branch from d1c7634 to 07ff3ae Compare October 26, 2022 20:16

sviscaino force-pushed the feature/array_min_by-array_max_by branch from 07ff3ae to 411495e Compare October 26, 2022 20:24

kaikalur requested changes Oct 26, 2022

View reviewed changes

presto-main/src/main/java/com/facebook/presto/operator/scalar/sql/ArraySqlFunctions.java Outdated Show resolved Hide resolved

sviscaino force-pushed the feature/array_min_by-array_max_by branch 2 times, most recently from 981af19 to 5ff3d22 Compare November 1, 2022 20:16

sviscaino force-pushed the feature/array_min_by-array_max_by branch from 5ff3d22 to c5d7b61 Compare November 4, 2022 23:21

sviscaino force-pushed the feature/array_min_by-array_max_by branch from c5d7b61 to 1a48c11 Compare November 4, 2022 23:43

kaikalur requested changes Nov 5, 2022

View reviewed changes

presto-main/src/main/java/com/facebook/presto/operator/scalar/sql/ArraySqlFunctions.java Outdated Show resolved Hide resolved

sviscaino force-pushed the feature/array_min_by-array_max_by branch from 1a48c11 to bd0049a Compare November 5, 2022 12:04

sviscaino force-pushed the feature/array_min_by-array_max_by branch from bd0049a to 461894d Compare November 5, 2022 12:10

Add ARRAY_MIN_BY, ARRAY_MAX_BY functions

aefa4d9

sviscaino force-pushed the feature/array_min_by-array_max_by branch from 461894d to aefa4d9 Compare November 5, 2022 12:14

kaikalur requested a review from highker November 6, 2022 17:38

kaikalur requested changes Nov 6, 2022

View reviewed changes

presto-main/src/main/java/com/facebook/presto/operator/scalar/sql/ArraySqlFunctions.java Show resolved Hide resolved

kaikalur approved these changes Nov 6, 2022

View reviewed changes

highker approved these changes Nov 7, 2022

View reviewed changes

kaikalur merged commit 63804c7 into prestodb:master Nov 7, 2022

sviscaino deleted the feature/array_min_by-array_max_by branch November 7, 2022 15:55

wanglinsong mentioned this pull request Jan 12, 2023

Add release notes for 0.279 #18920

Merged

30 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add ARRAY_MIN_BY, ARRAY_MAX_BY functions #18555

Add ARRAY_MIN_BY, ARRAY_MAX_BY functions #18555

sviscaino commented Oct 24, 2022 •

edited

Loading

kaikalur commented Oct 24, 2022

sviscaino commented Oct 24, 2022 •

edited

Loading

kaikalur commented Oct 24, 2022 •

edited

Loading

sviscaino commented Oct 26, 2022 •

edited

Loading

kaikalur commented Oct 26, 2022

sviscaino commented Oct 26, 2022

kaikalur commented Oct 26, 2022

sviscaino commented Oct 26, 2022 •

edited

Loading

kaikalur commented Oct 26, 2022

kaikalur commented Oct 26, 2022

kaikalur commented Oct 26, 2022 •

edited

Loading

sviscaino commented Oct 29, 2022 •

edited

Loading

sviscaino commented Oct 29, 2022

kaikalur commented Nov 1, 2022

sviscaino commented Nov 4, 2022

kaikalur commented Nov 4, 2022

sviscaino commented Nov 5, 2022

kaikalur left a comment

highker commented Nov 7, 2022

sviscaino commented Nov 7, 2022

kaikalur commented Nov 7, 2022

sviscaino commented Nov 7, 2022

Add ARRAY_MIN_BY, ARRAY_MAX_BY functions #18555

Add ARRAY_MIN_BY, ARRAY_MAX_BY functions #18555

Conversation

sviscaino commented Oct 24, 2022 • edited Loading

kaikalur commented Oct 24, 2022

sviscaino commented Oct 24, 2022 • edited Loading

kaikalur commented Oct 24, 2022 • edited Loading

sviscaino commented Oct 26, 2022 • edited Loading

kaikalur commented Oct 26, 2022

sviscaino commented Oct 26, 2022

kaikalur commented Oct 26, 2022

sviscaino commented Oct 26, 2022 • edited Loading

kaikalur commented Oct 26, 2022

kaikalur commented Oct 26, 2022

kaikalur commented Oct 26, 2022 • edited Loading

sviscaino commented Oct 29, 2022 • edited Loading

sviscaino commented Oct 29, 2022

kaikalur commented Nov 1, 2022

sviscaino commented Nov 4, 2022

kaikalur commented Nov 4, 2022

sviscaino commented Nov 5, 2022

kaikalur left a comment

Choose a reason for hiding this comment

highker commented Nov 7, 2022

sviscaino commented Nov 7, 2022

kaikalur commented Nov 7, 2022

sviscaino commented Nov 7, 2022

sviscaino commented Oct 24, 2022 •

edited

Loading

sviscaino commented Oct 24, 2022 •

edited

Loading

kaikalur commented Oct 24, 2022 •

edited

Loading

sviscaino commented Oct 26, 2022 •

edited

Loading

sviscaino commented Oct 26, 2022 •

edited

Loading

kaikalur commented Oct 26, 2022 •

edited

Loading

sviscaino commented Oct 29, 2022 •

edited

Loading