width_bucket() does not seem to process NULL in the array elements properly. #24055

Closed
spershin opened this issue Nov 15, 2024 · 17 comments

@spershin
Contributor

spershin commented Nov 15, 2024

These return 0:
SELECT width_bucket(-1, c0) from (VALUES ARRAY[NULL]) t(c0);
SELECT width_bucket(-1, c0) from (VALUES ARRAY[NULL, 1]) t(c0);
SELECT width_bucket(-1, c0) from (VALUES ARRAY[1, NULL, 4]) t(c0);
SELECT width_bucket(-1, c0) from (VALUES ARRAY[NULL, 1, NULL, 4]) t(c0);

This returns 1:
SELECT width_bucket(1, c0) from (VALUES ARRAY[NULL]) t(c0);

These return 2:
SELECT width_bucket(1, c0) from (VALUES ARRAY[NULL, 1]) t(c0);
SELECT width_bucket(1, c0) from (VALUES ARRAY[1, NULL, 4]) t(c0);

This returns 3:
SELECT width_bucket(1, c0) from (VALUES ARRAY[NULL, 1, NULL, 4]) t(c0);

These fail with "Bin values are not sorted in ascending order":
SELECT width_bucket(-1, c0) from (VALUES ARRAY[1, NULL]) t(c0);
SELECT width_bucket(-1, c0) from (VALUES ARRAY[1, NULL, 4, NULL]) t(c0);
SELECT width_bucket(1, c0) from (VALUES ARRAY[1, NULL]) t(c0);
SELECT width_bucket(1, c0) from (VALUES ARRAY[1, NULL, 4, NULL]) t(c0);

And so on.
There is no check for null elements in the code, and it is unclear how the function should behave if it encounters one.
From a strict perspective, it seems the function should fail whenever it finds a null element, because "Bin values are not sorted in ascending order".
Interestingly, depending on the first argument and the array, we are not guaranteed to stumble on a particular NULL, due to the binary-search nature of the algorithm.

Trying to understand if we should change the function behavior.

Expected Behavior

Unclear.

Current Behavior

See description.

Possible Solution

  1. If a NULL element is encountered, fail the query (see the sketch after this list).
  2. Scan for null elements beforehand and fail if any are found - could be too expensive.
  3. If a NULL element is encountered, return NULL.
  4. Leave as is and update the documentation.
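
For illustration, here is a minimal Java sketch of option 1, with an assumed simplified signature (a double operand and a list of bins). It is not the actual Presto implementation; it only shows where a null check would sit inside the existing binary search:

```java
import java.util.Arrays;
import java.util.List;

public class WidthBucketSketch
{
    // Hypothetical sketch of option 1: fail as soon as the binary search probes a NULL bin.
    // The signature and error message are illustrative, not the real Presto code.
    static long widthBucket(double operand, List<Double> bins)
    {
        int lower = 0;
        int upper = bins.size();
        while (lower < upper) {
            int index = (lower + upper) >>> 1;
            Double bin = bins.get(index);
            if (bin == null) {
                // The query fails only if the search happens to probe a null element,
                // so some null bins can still be skipped over (see the examples above).
                throw new IllegalArgumentException("Bin values cannot be NULL");
            }
            if (operand < bin) {
                upper = index;
            }
            else {
                lower = index + 1;
            }
        }
        return lower;
    }

    public static void main(String[] args)
    {
        System.out.println(widthBucket(3.0, Arrays.asList(1.0, 2.0, 4.0)));  // prints 2
        System.out.println(widthBucket(3.0, Arrays.asList(1.0, null, 4.0))); // throws
    }
}
```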

Steps to Reproduce

In any environment, run the example queries.

@mbasmanova
Contributor

CC: @aditi-pandit @rschlussel

@tdcmeehan
Contributor

This variation of width_bucket is defined in the SQL spec, which clearly defines the null behavior. However, I cannot find the overload mentioned in this issue in the SQL spec, nor do I see a comparable version in Postgres.

@Yuhta
Contributor

Yuhta commented Nov 15, 2024

I would vote for solution 2, as a null in the bins cannot be interpreted in any meaningful way. Checking whether they are null should not be too expensive, as the null bits are contiguous in memory (so you can check 64 or 256 of them at the same time, making it essentially free).
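
As a rough illustration in Java (assumed names; the real Velox nulls buffer has its own layout and bit polarity), checking a packed null bitmask one 64-bit word at a time could look like this:

```java
public class NullBitsCheck
{
    // Illustrative sketch only: assumes the bins' null flags are packed one bit per element
    // into a long[], with a set bit marking a null element. This is not the Velox API;
    // it just shows why checking 64 elements per iteration is cheap.
    static boolean hasNulls(long[] nullBits, int elementCount)
    {
        int fullWords = elementCount / 64;
        for (int word = 0; word < fullWords; word++) {
            if (nullBits[word] != 0) {
                return true; // at least one null among this group of 64 elements
            }
        }
        int remainder = elementCount % 64;
        if (remainder > 0) {
            long mask = (1L << remainder) - 1; // keep only the bits that map to real elements
            return (nullBits[fullWords] & mask) != 0;
        }
        return false;
    }
}
```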

@spershin
Contributor Author

I would vote for solution 2, as a null in the bins cannot be interpreted in any meaningful way. Checking whether they are null should not be too expensive, as the null bits are contiguous in memory (so you can check 64 or 256 of them at the same time, making it essentially free).

Yes, I would like to use option 2 as well; however, in Presto Java checking nulls might not be as cheap as in Velox.

@Yuhta
Contributor

Yuhta commented Nov 16, 2024

Yes, I would like to use option 2 as well; however, in Presto Java checking nulls might not be as cheap as in Velox.

The null bits should be stored similarly (as contiguous bitmasks) in this case. Even if Java cannot check one SIMD register at once, it can still do at least 64 bits at a time, which should be fast enough.

@rschlussel
Contributor

rschlussel commented Nov 18, 2024

Yeah, this seems very clearly to be missing a null check. I'm okay with both options 1 and 2. The function only checks the sorting during the search, so option 1 would be consistent with that (it will miss cases that we skip in the binary search). If we do option 2, should we change the sort check and the isfinite check as well?

@Yuhta
Contributor

Yuhta commented Nov 18, 2024

@rschlussel Yes, I think we should check both nulls and sortedness for the bins. This only needs to be done once for each top-level row; then we can loop over the input data array elements doing binary searches as we do now.
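
A rough Java sketch of that up-front, once-per-row validation (option 2 from the issue description; names and error messages are assumptions, not the real implementation):

```java
import java.util.List;

public class BinsValidationSketch
{
    // Sketch of option 2: validate the whole bins array once per top-level row,
    // before any binary searches run. Names and messages are illustrative only.
    static void validateBins(List<Double> bins)
    {
        Double previous = null;
        for (Double bin : bins) {
            if (bin == null) {
                throw new IllegalArgumentException("Bin values cannot be NULL");
            }
            if (bin.isNaN() || bin.isInfinite()) {
                throw new IllegalArgumentException("Bin values must be finite");
            }
            if (previous != null && bin < previous) {
                throw new IllegalArgumentException("Bin values are not sorted in ascending order");
            }
            previous = bin;
        }
    }
}
```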

@spershin
Contributor Author

@Yuhta, @rschlussel
So, what have we decided?
Do we want to check sortedness and non-nullness for all elements of the array before running the binary search?
Do we want to check for NaNs and infinities too?

@Yuhta

This only needs to be done once for each top-level row

What do you mean?
Is there a scenario where we would need to check it more than once?

@Yuhta
Contributor

Yuhta commented Nov 18, 2024

@spershin Currently, at least in the Velox code, we do the check while doing the binary search, and each search is repeated for all the elements in the array for one top-level row. Say there are N top-level rows (arrays), each array has M elements, and the bins have K elements. The current checks run O(N*M*log(K)) times, and if we move the check outside the binary search, the checks will run O(N*K) times. Scratch that, we are not taking an array as input.

I am not sure how bad it would be if we just iterate over all bin values once; my guess is that it's not bad. The binary search does not look important if we only search the bins once. So in this sense I would still vote for checking everything we can on the bins. One optimization we may want to add is for when the bins column is dictionary encoded: then we just need to check the dictionary values once and do the binary search as we iterate over the inputs.

@spershin
Contributor Author

@Yuhta, @rschlussel
Btw, I wonder: do we need to return an error for any null elements, or would it be more correct to return NULL instead?

@Yuhta
Contributor

Yuhta commented Nov 18, 2024

An error would be preferable to me; if the user wants to suppress it, they can use TRY.

@rschlussel
Contributor

My opinion is that we need data to validate whether changing from a binary search to a linear search (with no early exit) will be a performance problem. If someone wants to do this analysis they can; otherwise, we should go with option 1.

@Yuhta
Contributor

Yuhta commented Nov 19, 2024

The problem with option 1 is that it is not guaranteed to catch the error. Also, checking nulls only during the search probably won't be faster than just checking all the null bits up front. The function itself is likely negligible in terms of CPU cycle cost in an end-to-end query (we already go through all the data when we construct the bins in memory; going through it a second time, in order, will not cost us anything big).

@rschlussel
Contributor

I still think it needs validation before we can switch from the binary search. I agree option 1 will miss some issues, but it's consistent with all the other validation that Velox and Presto currently do for this function (checking that bins are sorted and finite), where they both only validate as they read, and it's possible a different part of the array is invalid.

@spershin
Contributor Author

Ok, I think I support option 1 too.

  1. It is less intrusive.
  2. It solves the problem of Velox possibly reading some arbitrary number for null entries (Java always sees zeroes no matter what), which makes the query result 'random'.
  3. It makes the result consistent for any input.
  4. It does not add any performance regression.
  5. The array should be sorted, so nulls will be at the end and the algorithm will fail quickly on the null check.
  6. So far we have not encountered any such case in production except one: a recent change in one specific customer's pipeline, and they only have the single-element array [null] case. That's how we discovered this issue in the first place (after several months of triaging!).

I will try to roll out a Presto PR today to change the behavior and then alter my Velox diff to reflect that.

Thank you everyone for participating in the discussion!

@spershin
Contributor Author

@rschlussel @Yuhta @tdcmeehan

So, I keep thinking that returning NULL might be an option when we encounter a null element.
This would simply follow SQL logic: a null element may or may not be a valid element, so it can potentially affect the result, and in such cases SQL returns NULL.

A close example is contains():

presto:di> select contains(ARRAY[null, 5, 3, null, 1, 100, null], 1);
 true

presto:di> select contains(ARRAY[null, 5, 3, null, 1, 100, null], -1);
 NULL

presto:di> select contains(ARRAY[0, 5, 3, 11, 1, 100, 4], -1);
 false

What do you guys think?

@spershin
Contributor Author

The mentioned PR solves this issue.

@github-project-automation github-project-automation bot moved this from 🆕 Unprioritized to ✅ Done in Bugs and support requests Nov 22, 2024