Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ARRAY_MIN_BY, ARRAY_MAX_BY functions #18555

Merged

Conversation

sviscaino
Copy link
Contributor

@sviscaino sviscaino commented Oct 24, 2022

This PR adds 2 functions in order to get the max or min element of an array, while using a transformation function. This is the equivalent of Scala's maxBy/minBy functions.

The current way of obtaining an equivalent behaviour would be to either use ARRAY_SORT with a custom comparator and selecting the first element, or something similar to the following pattern:

ARRAY_MAX_BY(a, f) :=
  REDUCE(
    a,
    (NULL, NULL),
    ((curr_max, curr_max_elt), elt) ->
      IF(curr_max IS NULL OR f(elt) >= curr_max,
         (f(elt), elt),
         (curr_max, curr_max_elt)),
    (curr_max, curr_max_elt) -> curr_max_elt)

These current methods are neither elegant nor optimized.

Test plan - (Please fill in how you tested your changes)
mvn -Dtest="TestArrayMinByFunction,TestArrayMaxByFunction" test

== RELEASE NOTES ==

General Changes
* Add functions :func:`array_min_by`, :func:`array_max_by`, to find the smallest or largest element of an array when applying a custom measuring function

@sviscaino sviscaino requested a review from a team as a code owner October 24, 2022 21:47
@sviscaino sviscaino requested a review from presto-oss October 24, 2022 21:47
@kaikalur
Copy link
Contributor

For infrequent functions like this, good to use SQL udf. Checkout:

https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/operator/scalar/sql/ArraySqlFunctions.java

It will be a lot easier to read and won't be much perf overhead either

@sviscaino sviscaino force-pushed the feature/array_min_by-array_max_by branch from 8190317 to d1c7634 Compare October 24, 2022 21:55
@sviscaino
Copy link
Contributor Author

sviscaino commented Oct 24, 2022

For infrequent functions like this, good to use SQL udf. Checkout:

https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/operator/scalar/sql/ArraySqlFunctions.java

It will be a lot easier to read and won't be much perf overhead either

I had tried writing it with sql udf some time ago but it didn't seem to work with custom type parameters (array<T>)

@kaikalur
Copy link
Contributor

kaikalur commented Oct 24, 2022

For infrequent functions like this, good to use SQL udf. Checkout:
https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/operator/scalar/sql/ArraySqlFunctions.java
It will be a lot easier to read and won't be much perf overhead either

I had tried writing it with sql udf some time ago but it didn't seem to work with custom type parameters (array<T>)

Yeah - if you can fix that, that will be a bonus :) But otherwise, just copy and paste like the rest of them here like ArrayFrequency etc.

@sviscaino
Copy link
Contributor Author

sviscaino commented Oct 26, 2022

Yeah - if you can fix that, that will be a bonus :) But otherwise, just copy and paste like the rest of them here like ArrayFrequency etc.

So I think it's quite complicated to write with sql udf in this case, it would require a big refactoring of SqlUdf because the function's signature is quite complex:

array_min_by(array(T), function(T, U)) -> T

It takes 2 type parameters (T which can be any type, and U which can be any [comparable] type), and a lambda.

SQL UDF does not seem to work with lambda arguments, for example I tried adding a simple test function

    @SqlInvokedScalarFunction(value = "my_test_sql_udf", deterministic = true, calledOnNullInput = false)
    @Description("test")
    @SqlParameters({
            @SqlParameter(name = "input", type="array(varchar)"),
            @SqlParameter(name = "f", type="function(varchar,integer)"),
            }
    )
    @SqlType("array(bigint)")
    public static String transform2()
    {
        return "RETURN array[CAST(1 AS BIGINT)]";
    }

And am getting an error when making a simple unit test ("my_test_sql_udf(array['a','b'], x -> 2)"):

java.lang.RuntimeException: Error compiling my_test_sql_udf(CAST(ARRAY(Slice{base=[B@3506d826, address=16, length=1}, Slice{base=[B@35dd9ed3, address=16, length=1})), (x) -> 2): No value present
	at com.facebook.presto.operator.scalar.FunctionAssertions.compileFilterProject(FunctionAssertions.java:955)
	at com.facebook.presto.operator.scalar.FunctionAssertions.executeProjectionWithAll(FunctionAssertions.java:610)
	at com.facebook.presto.operator.scalar.FunctionAssertions.selectUniqueValue(FunctionAssertions.java:313)
Caused by: java.util.NoSuchElementException: No value present
	at java.util.Optional.get(Optional.java:135)
	at com.facebook.presto.sql.gen.RowExpressionCompiler$Visitor.visitLambda(RowExpressionCompiler.java:287)
	at com.facebook.presto.sql.gen.RowExpressionCompiler$Visitor.visitLambda(RowExpressionCompiler.java:112)

Also, SQL UDF does not work with type parameters, and here if we want to be completely generic we'd need to list all combinations of both T and U (say varchar,bigint,map,struct,array,double for T and bigint,double,varchar for U ==> 6*3=18 type specializations).

@kaikalur
Copy link
Contributor

Yeah - if you can fix that, that will be a bonus :) But otherwise, just copy and paste like the rest of them here like ArrayFrequency etc.

So I think it's quite complicated to write with sql udf in this case, it would require a big refactoring of SqlUdf because the function's signature is quite complex:

array_min_by(array(T), function(T, U)) -> T

It takes 2 type parameters (T which can be any type, and U which can be any [comparable] type), and a lambda.

SQL UDF does not seem to work with lambda arguments, for example I tried adding a simple test function

    @SqlInvokedScalarFunction(value = "my_test_sql_udf", deterministic = true, calledOnNullInput = false)
    @Description("test")
    @SqlParameters({
            @SqlParameter(name = "input", type="array(varchar)"),
            @SqlParameter(name = "f", type="function(varchar,integer)"),
            }
    )
    @SqlType("array(bigint)")
    public static String transform2()
    {
        return "RETURN array[CAST(1 AS BIGINT)]";
    }

And am getting an error when making a simple unit test ("my_test_sql_udf(array['a','b'], x -> 2)"):

java.lang.RuntimeException: Error compiling my_test_sql_udf(CAST(ARRAY(Slice{base=[B@3506d826, address=16, length=1}, Slice{base=[B@35dd9ed3, address=16, length=1})), (x) -> 2): No value present
	at com.facebook.presto.operator.scalar.FunctionAssertions.compileFilterProject(FunctionAssertions.java:955)
	at com.facebook.presto.operator.scalar.FunctionAssertions.executeProjectionWithAll(FunctionAssertions.java:610)
	at com.facebook.presto.operator.scalar.FunctionAssertions.selectUniqueValue(FunctionAssertions.java:313)
Caused by: java.util.NoSuchElementException: No value present
	at java.util.Optional.get(Optional.java:135)
	at com.facebook.presto.sql.gen.RowExpressionCompiler$Visitor.visitLambda(RowExpressionCompiler.java:287)
	at com.facebook.presto.sql.gen.RowExpressionCompiler$Visitor.visitLambda(RowExpressionCompiler.java:112)

Also, SQL UDF does not work with type parameters, and here if we want to be completely generic we'd need to list all combinations of both T and U (say varchar,bigint,map,struct,array,double for T and bigint,double,varchar for U ==> 6*3=18 type specializations).

Yeah - I understand the sentiment. But the issue is supporting these one-off/rarely used builtins adds a long tail causing a) support issues and b) portability - say now velox will have to implement it as well. That's why I'm reluctant to do these things.

@sviscaino
Copy link
Contributor Author

Yeah - I understand the sentiment. But the issue is supporting these one-off/rarely used builtins adds a long tail causing a) support issues and b) portability - say now velox will have to implement it as well. That's why I'm reluctant to do these things.

I see, would having a sql udf reduce this downstream overhead?

@kaikalur
Copy link
Contributor

Yeah - I understand the sentiment. But the issue is supporting these one-off/rarely used builtins adds a long tail causing a) support issues and b) portability - say now velox will have to implement it as well. That's why I'm reluctant to do these things.

I see, would having a sql udf reduce this downstream overhead?

We don't need to "port" java to c++ - it will be just SQL - that's the main advantage. And my hope is that it will be easy to prove correctness and also fix any issues upfront than relying on the internal data structures.

@sviscaino sviscaino force-pushed the feature/array_min_by-array_max_by branch from d1c7634 to 07ff3ae Compare October 26, 2022 20:16
@sviscaino
Copy link
Contributor Author

sviscaino commented Oct 26, 2022

We don't need to "port" java to c++ - it will be just SQL - that's the main advantage. And my hope is that it will be easy to prove correctness and also fix any issues upfront than relying on the internal data structures.

Ok, I fixed both issues:

  • type parameters were not being correctly passed along to the signature for sql udf functions,
  • the previous error was due to calledOnNullInput, which probably does a IS NULL check under the hood, which doesn't make sense for a lambda

Lambdas still aren't fully supported since they are being treated as parameters, e.g. this doesn't work (it will result in f is not a registered function)

@SqlParameter(name = "f", type="function(T, U)")
public static String myFunction() {
  return "RETURN f(t)";
}

But this works:

return "RETURN TRANSFORM(ARRAY[t], f)";

Which is why I had to hack around using ZIP and TRANSFORM (I'm using the ordering of a tuple/row (a, b) < (c, d) if a < c), I couldn't apply the (probably more performant) logic with REDUCE that I put in the description.

@sviscaino sviscaino force-pushed the feature/array_min_by-array_max_by branch from 07ff3ae to 411495e Compare October 26, 2022 20:24
@kaikalur
Copy link
Contributor

We don't need to "port" java to c++ - it will be just SQL - that's the main advantage. And my hope is that it will be easy to prove correctness and also fix any issues upfront than relying on the internal data structures.

Ok, I fixed both issues:

  • type parameters were not being correctly passed along to the signature for sql udf functions,
  • the previous error was due to calledOnNullInput, which probably does a IS NULL check under the hood, which doesn't make sense for a lambda

Lambdas still aren't fully supported since they are being treated as parameters, e.g. this doesn't work (it will result in f is not a registered function)

@SqlParameter(name = "f", type="function(T, U)")
public static String myFunction() {
  return "RETURN f(t)";
}

But this works:

return "RETURN TRANSFORM(ARRAY[t], f)";

Which is why I had to hack around using ZIP and TRANSFORM (I'm using the ordering of a tuple/row (a, b) < (c, d) if a < c), I couldn't apply the (probably more performant) logic with REDUCE that I put in the description.

This is great! Please split the PR into two - one that extends the SQL UDF with type params and then the other one using that to implement your udfs?

Thank you very much for taking the pains to do this cool work!

@kaikalur
Copy link
Contributor

Also see the 'apply' function (which is not registered as a builtin but availablle):

https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/operator/scalar/ApplyFunction.java

Maybe you can try and use that? If that works we can enable the use of it.

@kaikalur
Copy link
Contributor

kaikalur commented Oct 26, 2022

Also see the 'apply' function (which is not registered as a builtin but availablle):

https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/operator/scalar/ApplyFunction.java

Maybe you can try and use that? If that works we can enable the use of it.

By this I mean if you can do something like TRANSFORM(input, x->(apply(f, x), x)))

@sviscaino
Copy link
Contributor Author

sviscaino commented Oct 29, 2022

This is great! Please split the PR into two - one that extends the SQL UDF with type params and then the other one using that to implement your udfs?

Thank you very much for taking the pains to do this cool work!

This is done in #18581. Once the other PR is merged I will rebase this one onto master.

@sviscaino
Copy link
Contributor Author

Also see the 'apply' function (which is not registered as a builtin but availablle):

https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/operator/scalar/ApplyFunction.java

Maybe you can try and use that? If that works we can enable the use of it.

That should work! Do you think we can enable it as a builtin?

@sviscaino sviscaino force-pushed the feature/array_min_by-array_max_by branch 2 times, most recently from 981af19 to 5ff3d22 Compare November 1, 2022 20:16
@kaikalur
Copy link
Contributor

kaikalur commented Nov 1, 2022

Also see the 'apply' function (which is not registered as a builtin but availablle):
https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/operator/scalar/ApplyFunction.java
Maybe you can try and use that? If that works we can enable the use of it.

That should work! Do you think we can enable it as a builtin?

I don't see why not. It's fully implemented. Then we can actually make it official way to use lambdas in sql functions. Just apply.

@sviscaino sviscaino force-pushed the feature/array_min_by-array_max_by branch from 5ff3d22 to c5d7b61 Compare November 4, 2022 23:21
@sviscaino
Copy link
Contributor Author

There's another parser issue when trying to use the lambda argument inside a lambda itself, for example when writing a simple "transform" equivalent function:

return "RETURN TRANSFORM(input, x -> apply(x, f))";

I'm getting:

com.facebook.presto.sql.analyzer.SemanticException: line 1:55: Lambda expression should always be used inside a function

	at com.facebook.presto.sql.analyzer.ExpressionAnalyzer$Visitor.visitLambdaExpression(ExpressionAnalyzer.java:1253)
	at com.facebook.presto.sql.analyzer.ExpressionAnalyzer$Visitor.visitLambdaExpression(ExpressionAnalyzer.java:330)

I think we can probably just go ahead with the current array_zip approach and fix the lambda issue later.

@sviscaino sviscaino force-pushed the feature/array_min_by-array_max_by branch from c5d7b61 to 1a48c11 Compare November 4, 2022 23:43
@kaikalur
Copy link
Contributor

kaikalur commented Nov 4, 2022

There's another parser issue when trying to use the lambda argument inside a lambda itself, for example when writing a simple "transform" equivalent function:

return "RETURN TRANSFORM(input, x -> apply(x, f))";

I'm getting:

com.facebook.presto.sql.analyzer.SemanticException: line 1:55: Lambda expression should always be used inside a function

	at com.facebook.presto.sql.analyzer.ExpressionAnalyzer$Visitor.visitLambdaExpression(ExpressionAnalyzer.java:1253)
	at com.facebook.presto.sql.analyzer.ExpressionAnalyzer$Visitor.visitLambdaExpression(ExpressionAnalyzer.java:330)

I think we can probably just go ahead with the current array_zip approach and fix the lambda issue later.

Sounds good!

@sviscaino sviscaino force-pushed the feature/array_min_by-array_max_by branch from 1a48c11 to bd0049a Compare November 5, 2022 12:04
@sviscaino
Copy link
Contributor Author

I would really like to avoid the second scan. How about:
RETURN input[ ARRAY_MIN(
ZIP_WITH(TRANSFORM(input, f),
SEQUENCE(1, CARDINALITY(input)),
(x, y)->IF(x IS NULL, NULL, (x, y)))]

Yeah that's more clever! I updated the code and unit tests to match that logic

@sviscaino sviscaino force-pushed the feature/array_min_by-array_max_by branch from bd0049a to 461894d Compare November 5, 2022 12:10
@sviscaino sviscaino force-pushed the feature/array_min_by-array_max_by branch from 461894d to aefa4d9 Compare November 5, 2022 12:14
@kaikalur kaikalur requested a review from highker November 6, 2022 17:38
Copy link
Contributor

@kaikalur kaikalur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@highker
Copy link
Contributor

highker commented Nov 7, 2022

Can we enhance the release note to explain more what these two functions do?

@sviscaino
Copy link
Contributor Author

Can we enhance the release note to explain more what these two functions do?

Done!

@kaikalur
Copy link
Contributor

kaikalur commented Nov 7, 2022

This PR adds 2 functions in order to get the max or min element of an array, while using a transformation function. This is the equivalent of Scala's maxBy/minBy functions.

The current way of obtaining an equivalent behaviour would be to either use ARRAY_SORT with a custom comparator and selecting the first element, or something similar to the following pattern:

ARRAY_MAX_BY(a, f) :=
  REDUCE(
    a,
    (NULL, NULL),
    ((curr_max, curr_max_elt), elt) ->
      IF(curr_max IS NULL OR f(elt) >= curr_max,
         (f(elt), elt),
         (curr_max, curr_max_elt)),
    (curr_max, curr_max_elt) -> curr_max_elt)

It's best to explain the functionally than giving this complex expression. Or just give a simple example.

@sviscaino
Copy link
Contributor Author

It's best to explain the functionally than giving this complex expression. Or just give a simple example.

Oh that isn't the release note, that's just the original PR description. I updated the release note to a brief description:

Add functions :func:array_min_by, :func:array_max_by, to find the smallest or largest element of an array when applying a custom measuring function

And I put some examples in the doc

@kaikalur kaikalur merged commit 63804c7 into prestodb:master Nov 7, 2022
@sviscaino sviscaino deleted the feature/array_min_by-array_max_by branch November 7, 2022 15:55
@wanglinsong wanglinsong mentioned this pull request Jan 12, 2023
30 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants