Add ARRAY_MIN_BY, ARRAY_MAX_BY functions #18555
For infrequently used functions like this, it's good to use a SQL UDF. Check out: It will be a lot easier to read and won't add much perf overhead either.
I tried writing it as a SQL UDF some time ago, but it didn't seem to work with custom type parameters (
Yeah - if you can fix that, that will be a bonus :) But otherwise, just copy and paste like the rest of them here, like ArrayFrequency etc.
So I think it's quite complicated to write as a SQL UDF in this case; it would require a big refactoring of SqlUdf because the function's signature is quite complex:
It takes two type parameters (T, which can be any type, and U, which can be any [comparable] type) and a lambda. SQL UDF does not seem to work with lambda arguments. For example, I tried adding a simple test function:

    @SqlInvokedScalarFunction(value = "my_test_sql_udf", deterministic = true, calledOnNullInput = false)
    @Description("test")
    @SqlParameters({
        @SqlParameter(name = "input", type = "array(varchar)"),
        @SqlParameter(name = "f", type = "function(varchar,integer)"),
    })
    @SqlType("array(bigint)")
    public static String transform2()
    {
        return "RETURN array[CAST(1 AS BIGINT)]";
    }

And I am getting an error when running a simple unit test (
Also, SQL UDF does not work with type parameters, so to be completely generic here we'd need to list all combinations of T and U (say varchar, bigint, map, struct, array, double for T and bigint, double, varchar for U, i.e. 6*3 = 18 type specializations).
Yeah - I understand the sentiment. But the issue is that supporting these one-off/rarely used builtins adds a long tail, causing a) support issues and b) portability issues - say, Velox will now have to implement it as well. That's why I'm reluctant to do these things.
I see, would having a SQL UDF reduce this downstream overhead?
We don't need to "port" Java to C++ - it will be just SQL - that's the main advantage. And my hope is that it will be easier to prove correctness and to fix any issues upfront than it is when relying on the internal data structures.
Ok, I fixed both issues:
Lambdas still aren't fully supported since they are being treated as parameters, e.g. this doesn't work (it will result in

    @SqlParameter(name = "f", type = "function(T, U)")
    public static String myFunction() {
        return "RETURN f(t)";
    }

But this works:

    return "RETURN TRANSFORM(ARRAY[t], f)";

Which is why I had to hack around using ZIP and TRANSFORM (I'm using the ordering of a tuple/row: (a, b) < (c, d) if a < c). I couldn't apply the (probably more performant) logic with REDUCE that I put in the description.
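To make the trade-off in the comment above concrete, here is a minimal Java sketch (not the PR's code) contrasting the two strategies: pairing each element with its key and taking the smallest pair, which is roughly what the ZIP + TRANSFORM workaround does, versus a single-pass fold in the spirit of REDUCE. The class and method names are hypothetical, and the sketch assumes non-null elements and keys; it says nothing about how the real functions treat nulls or ties.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical illustration only; not part of Presto.
class MinByPairingSketch {
    // Strategy 1: build (key, value) pairs, then pick the pair with the
    // smallest key. This materializes an intermediate list of pairs.
    static <T, U extends Comparable<U>> Optional<T> minByPairs(List<T> input, Function<T, U> f) {
        List<Map.Entry<U, T>> pairs = new ArrayList<>();
        for (T t : input) {
            pairs.add(new SimpleEntry<>(f.apply(t), t));
        }
        return pairs.stream()
                .min((a, b) -> a.getKey().compareTo(b.getKey()))
                .map(Map.Entry::getValue);
    }

    // Strategy 2: a single pass that tracks the best element seen so far,
    // without building any intermediate collection.
    static <T, U extends Comparable<U>> Optional<T> minByReduce(List<T> input, Function<T, U> f) {
        T best = null;
        U bestKey = null;
        for (T t : input) {
            U key = f.apply(t);
            if (bestKey == null || key.compareTo(bestKey) < 0) {
                best = t;
                bestKey = key;
            }
        }
        return Optional.ofNullable(best);
    }
}
```

Both methods return the same element for the same input; the difference is only the intermediate list of pairs that the first one allocates.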
This is great! Please split the PR into two: one that extends the SQL UDF with type params, and another that uses it to implement your UDFs. Thank you very much for taking the pains to do this cool work!
presto-main/src/main/java/com/facebook/presto/operator/scalar/sql/ArraySqlFunctions.java
Also see the 'apply' function (which is not registered as a builtin but is available): Maybe you can try to use that? If that works, we can enable the use of it.
By this I mean if you can do something like
This is done in #18581. Once the other PR is merged I will rebase this one onto master.
That should work! Do you think we can enable it as a builtin?
I don't see why not. It's fully implemented. Then we can actually make it the official way to use lambdas in SQL functions: just use apply.
There's another parser issue when trying to use the lambda argument inside a lambda itself, for example when writing a simple "transform" equivalent function:
I'm getting:
I think we can probably just go ahead with the current array_zip approach and fix the lambda issue later.
Sounds good!
presto-main/src/main/java/com/facebook/presto/operator/scalar/sql/ArraySqlFunctions.java
Yeah, that's more clever! I updated the code and unit tests to match that logic.
presto-main/src/main/java/com/facebook/presto/operator/scalar/sql/ArraySqlFunctions.java
Looks good!
Can we enhance the release note to explain in more detail what these two functions do?
Done!
It's better to explain the functionality than to give this complex expression. Or just give a simple example.
Oh that isn't the release note, that's just the original PR description. I updated the release note to a brief description:
And I put some examples in the doc.
This PR adds two functions that return the max or min element of an array, as ranked by a transformation function applied to each element. They are the equivalent of Scala's maxBy/minBy functions.
The current way of obtaining equivalent behaviour is either to use ARRAY_SORT with a custom comparator and select the first element, or to use something similar to the following pattern:
These current methods are neither elegant nor optimized.
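As a rough illustration of why the sort-based workaround is wasteful, the Java sketch below (hypothetical names, not the PR's implementation) compares sorting the whole list by key and taking the first element with a single linear scan that keeps only the best element. It assumes non-null elements and keys, and returns an empty Optional for an empty input purely as a convenience of the sketch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical illustration only; not part of Presto.
class ArrayMaxBySketch {
    // Workaround: sort a copy of the list by key in descending order and
    // take the first element. Costs O(n log n) comparisons plus a copy.
    static <T, U extends Comparable<U>> Optional<T> maxBySorting(List<T> input, Function<T, U> f) {
        List<T> sorted = new ArrayList<>(input);
        sorted.sort(Comparator.comparing(f).reversed());
        return sorted.stream().findFirst();
    }

    // Direct maxBy: one linear pass that keeps the element with the largest key.
    static <T, U extends Comparable<U>> Optional<T> maxBy(List<T> input, Function<T, U> f) {
        return input.stream().max(Comparator.comparing(f));
    }
}
```

For example, both methods applied to the list ["pear", "fig", "banana"] with String::length as the key function select "banana".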
Test plan
mvn -Dtest="TestArrayMinByFunction,TestArrayMaxByFunction" test