width_bucket() does not seem to process NULL in the array elements properly. #24055

Closed
spershin opened this issue Nov 15, 2024 · 17 comments

@spershin
Contributor

spershin commented Nov 15, 2024

These return 0:
SELECT width_bucket(-1, c0) from (VALUES ARRAY[NULL]) t(c0);
SELECT width_bucket(-1, c0) from (VALUES ARRAY[NULL, 1]) t(c0);
SELECT width_bucket(-1, c0) from (VALUES ARRAY[1, NULL, 4]) t(c0);
SELECT width_bucket(-1, c0) from (VALUES ARRAY[NULL, 1, NULL, 4]) t(c0);

This returns 1:
SELECT width_bucket(1, c0) from (VALUES ARRAY[NULL]) t(c0);

These return 2:
SELECT width_bucket(1, c0) from (VALUES ARRAY[NULL, 1]) t(c0);
SELECT width_bucket(1, c0) from (VALUES ARRAY[1, NULL, 4]) t(c0);

This returns 3:
SELECT width_bucket(1, c0) from (VALUES ARRAY[NULL, 1, NULL, 4]) t(c0);

These fail with "Bin values are not sorted in ascending order":
SELECT width_bucket(-1, c0) from (VALUES ARRAY[1, NULL]) t(c0);
SELECT width_bucket(-1, c0) from (VALUES ARRAY[1, NULL, 4, NULL]) t(c0);
SELECT width_bucket(1, c0) from (VALUES ARRAY[1, NULL]) t(c0);
SELECT width_bucket(1, c0) from (VALUES ARRAY[1, NULL, 4, NULL]) t(c0);

And so on.
There is no check for null elements in the code, and it is unclear how the function should behave if it encounters one.
From a strict perspective, it seems the function should fail whenever it finds a null element, because "Bin values are not sorted in ascending order".
Interestingly, depending on the first argument and the array, we are not guaranteed to stumble on a particular NULL, due to the binary-search nature of the algorithm.

Trying to understand if we should change the function behavior.

Expected Behavior

Unclear.

Current Behavior

See description.

Possible Solution

  1. If a NULL element is encountered, fail the query (see the sketch after this list).
  2. Scan for null elements beforehand and fail if any are found - could be too expensive.
  3. If a NULL element is encountered, return NULL.
  4. Leave as is and update the documentation.
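
For illustration, here is a minimal Java sketch of option 1, with an assumed simplified signature (a double operand and a list of bins). It is not the actual Presto implementation; it only shows where a null check would sit inside the existing binary search:

```java
import java.util.Arrays;
import java.util.List;

public class WidthBucketSketch
{
    // Hypothetical sketch of option 1: fail as soon as the binary search probes a NULL bin.
    // The signature and error message are illustrative, not the real Presto code.
    static long widthBucket(double operand, List<Double> bins)
    {
        int lower = 0;
        int upper = bins.size();
        while (lower < upper) {
            int index = (lower + upper) >>> 1;
            Double bin = bins.get(index);
            if (bin == null) {
                // The query fails only if the search happens to probe a null element,
                // so some null bins can still be skipped over (see the examples above).
                throw new IllegalArgumentException("Bin values cannot be NULL");
            }
            if (operand < bin) {
                upper = index;
            }
            else {
                lower = index + 1;
            }
        }
        return lower;
    }

    public static void main(String[] args)
    {
        System.out.println(widthBucket(3.0, Arrays.asList(1.0, 2.0, 4.0)));  // prints 2
        System.out.println(widthBucket(3.0, Arrays.asList(1.0, null, 4.0))); // throws
    }
}
```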

Steps to Reproduce

In any environment, run the example queries.

@mbasmanova
Contributor

CC: @aditi-pandit @rschlussel

@tdcmeehan
Contributor

This variation of width_bucket is defined in the SQL spec, which clearly defines the null behavior. However, I cannot find the overload mentioned in this issue in the SQL spec, nor do I see a comparable version in Postgres.

@Yuhta
Contributor

Yuhta commented Nov 15, 2024

I would vote for solution 2, as a null in the bins cannot be interpreted in any meaningful way. Checking whether they are null should not be too expensive, as the null bits are contiguous in memory (so you can check 64 or 256 of them at the same time, making it essentially free).
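
As a rough illustration in Java (assumed names; the real Velox nulls buffer has its own layout and bit polarity), checking a packed null bitmask one 64-bit word at a time could look like this:

```java
public class NullBitsCheck
{
    // Illustrative sketch only: assumes the bins' null flags are packed one bit per element
    // into a long[], with a set bit marking a null element. This is not the Velox API;
    // it just shows why checking 64 elements per iteration is cheap.
    static boolean hasNulls(long[] nullBits, int elementCount)
    {
        int fullWords = elementCount / 64;
        for (int word = 0; word < fullWords; word++) {
            if (nullBits[word] != 0) {
                return true; // at least one null among this group of 64 elements
            }
        }
        int remainder = elementCount % 64;
        if (remainder > 0) {
            long mask = (1L << remainder) - 1; // keep only the bits that map to real elements
            return (nullBits[fullWords] & mask) != 0;
        }
        return false;
    }
}
```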

@spershin
Contributor Author

I would vote for solution 2, as a null in the bins cannot be interpreted in any meaningful way. Checking whether they are null should not be too expensive, as the null bits are contiguous in memory (so you can check 64 or 256 of them at the same time, making it essentially free).

Yes, I would like to use option 2 as well; however, in Presto Java checking nulls might not be as cheap as in Velox.

@Yuhta
Contributor

Yuhta commented Nov 16, 2024

Yes, I would like to use option 2 as well; however, in Presto Java checking nulls might not be as cheap as in Velox.

The null bits should be stored similarly (as contiguous bitmasks) in this case. Even if Java cannot check one SIMD register at once, it can still do at least 64 bits at a time, which should be fast enough.

@rschlussel
Contributor

rschlussel commented Nov 18, 2024

Yeah, this seems very clearly to be missing a null check. I'm okay with both options 1 and 2. The function only checks the sorting during the search, so option 1 would be consistent with that (it will miss cases that we skip in the binary search). If we do option 2, should we change the sort check and the isfinite check as well?

@Yuhta
Contributor

Yuhta commented Nov 18, 2024

@rschlussel Yes, I think we should check both nulls and sortedness for the bins. This only needs to be done once for each top-level row; then we can loop over the input data array elements doing binary searches as we do now.
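
A rough Java sketch of that up-front, once-per-row validation (option 2 from the issue description; names and error messages are assumptions, not the real implementation):

```java
import java.util.List;

public class BinsValidationSketch
{
    // Sketch of option 2: validate the whole bins array once per top-level row,
    // before any binary searches run. Names and messages are illustrative only.
    static void validateBins(List<Double> bins)
    {
        Double previous = null;
        for (Double bin : bins) {
            if (bin == null) {
                throw new IllegalArgumentException("Bin values cannot be NULL");
            }
            if (bin.isNaN() || bin.isInfinite()) {
                throw new IllegalArgumentException("Bin values must be finite");
            }
            if (previous != null && bin < previous) {
                throw new IllegalArgumentException("Bin values are not sorted in ascending order");
            }
            previous = bin;
        }
    }
}
```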

@spershin
Contributor Author

@Yuhta, @rschlussel
So, what have we decided?
Do we want to check sortedness and non-nullness for all elements of the array before running the binary search?
Do we want to check for NaNs and infinities too?

@Yuhta

This only needs to be done once for each top-level row

What do you mean?
Is there a scenario where we would need to check it more than once?

@Yuhta
Contributor

Yuhta commented Nov 18, 2024

@spershin Currently, at least in the Velox code, we do the check while doing the binary search, and each search is repeated for all the elements in the array for one top-level row. Say there are N top-level rows (arrays), each array has M elements, and the bins have K elements. The current checks run O(N*M*log(K)) times, and if we move the check outside the binary search, the checks will run O(N*K) times. Scratch that, we are not taking an array as input.

I am not sure how bad it would be if we just iterate over all bin values once; my guess is that it's not bad. The binary search does not look important if we only search the bins once. So in this sense I would still vote for checking everything we can on the bins. One optimization we may want to add is for when the bins column is dictionary encoded: then we just need to check the dictionary values once and do the binary search as we iterate over the inputs.

@spershin
Contributor Author

@Yuhta, @rschlussel
Btw, I wonder: do we need to return an error for any null elements, or would it be more correct to return NULL instead?

@Yuhta
Contributor

Yuhta commented Nov 18, 2024

An error would be preferable to me; if the user wants to suppress it, they can use TRY.

@rschlussel
Contributor

My opinion is that we need data to validate whether changing from a binary search to a linear search (with no early exit) will be a performance problem. If someone wants to do this analysis they can; otherwise, we should go with option 1.

@Yuhta
Contributor

Yuhta commented Nov 19, 2024

The problem with option 1 is that it is not guaranteed to catch the error. Also, checking nulls only during the search probably won't be faster than just checking all the null bits up front. The function itself is likely negligible in terms of CPU cycle cost in an end-to-end query (we already go through all the data when we construct the bins in memory; going through it a second time, in order, will not cost us anything big).

@rschlussel
Contributor

I still think it needs validation before we can switch from the binary search. I agree option 1 will miss some issues, but it's consistent with all the other validation that Velox and Presto currently do for this function (checking that bins are sorted and finite), where they both only validate as they read, and it's possible a different part of the array is invalid.

@spershin
Contributor Author

Ok, I think I support option 1 too.

  1. It is less intrusive.
  2. It solves the problem of Velox possibly reading some arbitrary number for null entries (Java always sees zeroes no matter what), which makes the query result 'random'.
  3. It makes the result consistent for any input.
  4. It does not add any performance regression.
  5. The array should be sorted, so nulls will be at the end and the algorithm will fail quickly on the null check.
  6. So far we have not encountered any such case in production except one: a recent change in one specific customer's pipeline, and they only have the single-element array [null] case. That's how we discovered this issue in the first place (after several months of triaging!).

I will try to roll out a Presto PR today to change the behavior and then alter my Velox diff to reflect that.

Thank you everyone for participating in the discussion!

@spershin
Contributor Author

@rschlussel @Yuhta @tdcmeehan

So, I keep thinking that returning NULL might be an option when we encounter a null element.
This would simply follow SQL logic: a null element may or may not be a valid element, so it can potentially affect the result, and in such cases SQL returns NULL.

A close example is contains():

presto:di> select contains(ARRAY[null, 5, 3, null, 1, 100, null], 1);
 true

presto:di> select contains(ARRAY[null, 5, 3, null, 1, 100, null], -1);
 NULL

presto:di> select contains(ARRAY[0, 5, 3, 11, 1, 100, 4], -1);
 false

What do you guys think?

@spershin
Contributor Author

The mentioned PR solves this issue.

@github-project-automation github-project-automation bot moved this from 🆕 Unprioritized to ✅ Done in Bugs and support requests Nov 22, 2024