width_bucket() does not seem to process NULL in the array elements properly. #24055
Comments
I would vote for solution 2, as a null in the bins cannot be interpreted in any meaningful way. Checking whether they are null should not be too expensive, as the null bits are contiguous in memory (so you can check 64 or 256 of them at the same time, making it essentially free).
Yes, I would like to use option 2 as well; however, in Presto Java checking nulls might not be as cheap as in Velox.
The null bits should be stored similarly (contiguous bitmasks) in this case. Even if Java cannot check one SIMD register at once, it can still check at least 64 bits at a time, which should be fast enough.
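To make the cost argument concrete, here is a minimal sketch of the kind of bulk null check being described, assuming an Arrow-style validity bitmask where a set bit means non-null; the function name and layout are illustrative only, not the actual Velox or Presto code:

```cpp
#include <cstddef>
#include <cstdint>

// Returns true if any of the first `n` values is null, reading the validity
// bitmask 64 bits (one word) at a time. Assumes Arrow-style encoding:
// bit set = non-null, bit cleared = null.
bool hasNulls(const uint64_t* validity, size_t n) {
  const size_t fullWords = n / 64;
  for (size_t i = 0; i < fullWords; ++i) {
    if (validity[i] != ~uint64_t{0}) {  // any cleared bit in this word is a null
      return true;
    }
  }
  const size_t tail = n % 64;
  if (tail > 0) {
    const uint64_t mask = (uint64_t{1} << tail) - 1;  // low `tail` bits are in range
    if ((validity[fullWords] & mask) != mask) {
      return true;
    }
  }
  return false;
}
```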
Yeah, this seems very clearly to be missing a null check. I'm okay with both options 1 and 2. The function only checks the sort order during the search, so option 1 would be similar to that (it will miss cases that we skip in the binary search). If we do option 2, should we change the sort check as well, and the isfinite check?
@rschlussel Yes, I think we should check both nulls and sort order for the bins. This only needs to be done once per top-level row; then we can loop over the input data array elements doing binary searches like we have now.
@Yuhta, @rschlussel
What do you mean?
@spershin I am not sure how bad it would be if we just iterate over all bin values once; my guess is it's not bad. The binary search does not look important if we only search the bins once. So in that sense I would still vote for checking everything we can on the bins. One optimization we may want to add is for when the bins column is dictionary encoded: then we just need to check the dictionary values once and do the binary search as we iterate over the inputs.
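For illustration, here is a rough sketch of option 2, assuming the bins arrive as a plain vector of doubles plus a per-element null flag, and assuming the bucket number is the count of bins less than or equal to the operand (which matches the outputs listed in this issue). The names validateBins and widthBucket and the exact strictness of the sort check are hypothetical; the real code operates on engine-specific vectors/blocks:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Validate the whole bins array once per top-level row: no nulls, finite,
// and sorted in ascending order.
void validateBins(const std::vector<double>& bins, const std::vector<bool>& isNull) {
  for (size_t i = 0; i < bins.size(); ++i) {
    if (isNull[i]) {
      throw std::invalid_argument("Bin values cannot be NULL");
    }
    if (!std::isfinite(bins[i])) {
      throw std::invalid_argument("Bin values must be finite");
    }
    if (i > 0 && bins[i] < bins[i - 1]) {
      throw std::invalid_argument("Bin values are not sorted in ascending order");
    }
  }
}

// Per input element, the bucket number is the count of bins <= operand,
// found with a plain binary search over the already-validated bins.
int64_t widthBucket(double operand, const std::vector<double>& bins) {
  return static_cast<int64_t>(
      std::upper_bound(bins.begin(), bins.end(), operand) - bins.begin());
}
```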
@Yuhta, @rschlussel
An error would be preferable to me; if the user wants to suppress the error they can use …
My opinion is that we need data to validate whether changing from a binary search to a linear search (with no early exit) will be a performance problem. If someone wants to do this analysis they can; otherwise, we should go with option 1.
The problem with option 1 is that it is not guaranteed to catch the error. Also, if we only check for nulls as we search, it probably won't be faster than checking all the null bits up front. The function itself is likely negligible in terms of CPU cycle cost in an end-to-end query (we already go through all the data when we construct the bins in memory; going through it a second time, in order, will not cost us anything big).
I still think it needs validation before we can switch away from binary search. I agree option 1 will miss some issues, but it's consistent with all the other validation that Velox and Presto currently do for this function (checking that bins are sorted and finite), where they both only validate as they read, and it's possible a different part of the array is invalid.
OK, I think I support option 1 too.
I will try to roll out a Presto PR today to change the behavior and will then alter my Velox diff to reflect that. Thank you, everyone, for participating in the discussion!
So, I keep thinking that returning NULL might be an option when we encounter a null element. A close example is contains().
What do you guys think?
The mentioned PR solves this issue.
These return 0:
SELECT width_bucket(-1, c0) from (VALUES ARRAY[NULL]) t(c0);
SELECT width_bucket(-1, c0) from (VALUES ARRAY[NULL, 1]) t(c0);
SELECT width_bucket(-1, c0) from (VALUES ARRAY[1, NULL, 4]) t(c0);
SELECT width_bucket(-1, c0) from (VALUES ARRAY[NULL, 1, NULL, 4]) t(c0);
This returns 1:
SELECT width_bucket(1, c0) from (VALUES ARRAY[NULL]) t(c0);
These return 2:
SELECT width_bucket(1, c0) from (VALUES ARRAY[NULL, 1]) t(c0);
SELECT width_bucket(1, c0) from (VALUES ARRAY[1, NULL, 4]) t(c0);
This returns 3:
SELECT width_bucket(1, c0) from (VALUES ARRAY[NULL, 1, NULL, 4]) t(c0);
These fail with "Bin values are not sorted in ascending order":
SELECT width_bucket(-1, c0) from (VALUES ARRAY[1, NULL]) t(c0);
SELECT width_bucket(-1, c0) from (VALUES ARRAY[1, NULL, 4, NULL]) t(c0);
SELECT width_bucket(1, c0) from (VALUES ARRAY[1, NULL]) t(c0);
SELECT width_bucket(1, c0) from (VALUES ARRAY[1, NULL, 4, NULL]) t(c0);
And so on.
There is no check for null elements in the code, and it is unclear how the function should behave if it encounters one.
From a strict perspective, it seems like the function should fail with "Bin values are not sorted in ascending order" whenever it finds a null element.
What is interesting is that, depending on the 1st argument and the array, we are not guaranteed to stumble on a particular NULL, due to the binary-search nature of the algorithm.
I am trying to understand whether we should change the function behavior.
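To see why the results are probe-dependent, here is a toy model (not the real implementation; the sorted-order and isfinite checks are omitted, and a NULL element reads as 0.0 to mimic the observed behavior). A binary search inspects only O(log n) positions, and which positions those are depends on the operand, so a NULL at an unvisited index goes unnoticed:

```cpp
#include <iostream>
#include <optional>
#include <vector>

// Simplified search: returns the number of bins considered <= operand,
// treating a NULL bin as 0.0, and prints which indices were probed.
int64_t binNumber(double operand, const std::vector<std::optional<double>>& bins) {
  int64_t lower = 0;
  int64_t upper = static_cast<int64_t>(bins.size());
  while (lower < upper) {
    const int64_t mid = (lower + upper) / 2;
    const double bin = bins[mid].value_or(0.0);  // a NULL element silently reads as 0
    std::cout << "probed index " << mid << "\n";
    if (operand < bin) {
      upper = mid;
    } else {
      lower = mid + 1;
    }
  }
  return lower;
}

int main() {
  // ARRAY[NULL, 1, NULL, 4] from the examples above.
  const std::vector<std::optional<double>> bins = {std::nullopt, 1.0, std::nullopt, 4.0};
  std::cout << binNumber(1, bins) << "\n";   // probes indices 2 and 3 only;
                                             // the NULL at index 0 is never seen; prints 3
  std::cout << binNumber(-1, bins) << "\n";  // probes indices 2, 1, 0 and prints 0
}
```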
Expected Behavior
Unclear.
Current Behavior
See description.
Possible Solution
Option 1: check for null elements only at the positions the binary search visits and fail there (consistent with the existing sorted/isfinite checks).
Option 2: validate the whole bins array up front (no nulls, sorted, finite) before searching.
Steps to Reproduce
In any environment run the example queries.