Fix bug with handling of null values in dictionaries #70

adriangb · 2025-02-04T01:20:38Z

Currently, given the query json_get_text(col, 'a') on the data ['{"x": 0}', '{"x": 0}', '{"a": 1}'], where col is a dictionary-encoded column with keys [0, 0, 1] and values ['{"x": 0}', '{"a": 1}'], we return a dictionary with keys [0, 0, 1] and values [null, 1]. In other words, the nulls end up in the values rather than in the keys.
But when arrow-rs builds up dictionaries it always puts the nulls in the keys, not the values. The spec does not require this, but I think that some code in arrow-rs or DataFusion assumes it (based on panics I've seen in prod).
This PR works around those bugs elsewhere while we investigate them further.
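To make the distinction concrete, here is a minimal std-only sketch. The `Dict` type and its methods are hypothetical stand-ins, not the arrow-rs `DictionaryArray` API. It shows two encodings of the same logical column, one with nulls in the values and one with nulls in the keys, and how a consumer that only inspects the keys sees a different null count for each.

```rust
/// Simplified stand-in for a dictionary-encoded string column.
/// This is NOT the arrow-rs type; it only models keys and values.
/// `keys[i]` is `None` for a null row, otherwise an index into `values`.
#[derive(Debug, Clone, PartialEq)]
pub struct Dict {
    pub keys: Vec<Option<usize>>,
    pub values: Vec<Option<String>>,
}

impl Dict {
    /// Decode to the logical column: a row is null if its key is null
    /// or if the value it points to is null.
    pub fn decode(&self) -> Vec<Option<String>> {
        self.keys
            .iter()
            .map(|&k| k.and_then(|i| self.values[i].clone()))
            .collect()
    }

    /// Null count as seen by a consumer that only looks at the keys.
    pub fn key_null_count(&self) -> usize {
        self.keys.iter().filter(|k| k.is_none()).count()
    }

    /// Null count of the decoded (logical) column.
    pub fn logical_null_count(&self) -> usize {
        self.decode().iter().filter(|v| v.is_none()).count()
    }
}

/// Nulls stored in the values: every key is valid, value 0 is null.
pub fn nulls_in_values() -> Dict {
    Dict {
        keys: vec![Some(0), Some(0), Some(1)],
        values: vec![None, Some("1".to_string())],
    }
}

/// The same logical column with the nulls stored in the keys instead.
pub fn nulls_in_keys() -> Dict {
    Dict {
        keys: vec![None, None, Some(0)],
        values: vec![Some("1".to_string())],
    }
}
```

Both encodings decode to the same rows, but only the second one reports its nulls through the keys, which is the layout downstream code appears to assume.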

adriangb · 2025-02-04T01:57:48Z

Unfortunately:

- This seems hard to reproduce in pure DataFusion.
- This implementation is buggy and it's hard to implement a generic dictionary rebuild function; ideally we could just use the <Type>DictionaryBuilders.
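For reference, a builder-style rebuild can be sketched roughly as follows. This is a std-only illustration for string values, not the code in this PR and not the arrow-rs builder API. The function name `rebuild_dictionary` is hypothetical. It deduplicates non-null values and records null rows as null keys, so the rebuilt values never contain a null.

```rust
use std::collections::HashMap;

/// Rebuild a dictionary from a logical column of optional strings.
/// Null rows become null keys; each distinct non-null string is stored
/// once in `values`, in order of first appearance.
pub fn rebuild_dictionary(rows: &[Option<&str>]) -> (Vec<Option<usize>>, Vec<String>) {
    // Maps each distinct string to its index in `values`.
    let mut index_of: HashMap<&str, usize> = HashMap::new();
    let mut values: Vec<String> = Vec::new();
    let mut keys: Vec<Option<usize>> = Vec::with_capacity(rows.len());

    for &row in rows {
        match row {
            // A null row is recorded in the keys, never in the values.
            None => keys.push(None),
            Some(s) => {
                let next = values.len();
                let idx = *index_of.entry(s).or_insert_with(|| {
                    values.push(s.to_string());
                    next
                });
                keys.push(Some(idx));
            }
        }
    }
    (keys, values)
}
```

The generic version is harder because the value type, key width, and hashing all vary per Arrow data type, which is presumably why reusing the typed builders would be preferable.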

codecov-commenter · 2025-02-04T02:06:33Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83.02%. Comparing base (999d672) to head (643294d).

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #70      +/-   ##
==========================================
+ Coverage   82.80%   83.02%   +0.22%     
==========================================
  Files          15       15              
  Lines        1128     1143      +15     
  Branches     1128     1143      +15     
==========================================
+ Hits          934      949      +15     
  Misses        132      132              
  Partials       62       62

davidhewitt · 2025-02-04T14:05:39Z

Let's add a test case to show that we return nulls in the dictionary keys in cases when we would produce null values.

I wonder if we should switch to using dictionary builders immediately if we're going through the effort to re-pack dictionaries at the end anyway.

adriangb · 2025-02-04T15:42:29Z

I pushed a test and simplified to just set keys to null as you suggested. I can confirm the test used to fail.
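A rough sketch of the "just set keys to null" idea, using plain vectors rather than Arrow arrays. The helper names `mask_keys_with_null_values` and `decode` are hypothetical and this is not the code in common.rs. The values are left untouched, and any key that points at a null value is itself set to null.

```rust
/// Return a copy of `keys` in which every key that points at a null
/// entry of `values` has been replaced by a null key.
/// Keys are assumed to be in bounds for `values`.
pub fn mask_keys_with_null_values<T>(
    keys: &[Option<usize>],
    values: &[Option<T>],
) -> Vec<Option<usize>> {
    keys.iter()
        .map(|&k| k.filter(|&i| values[i].is_some()))
        .collect()
}

/// Decode keys and values into the logical column, for comparison.
pub fn decode<T: Clone>(keys: &[Option<usize>], values: &[Option<T>]) -> Vec<Option<T>> {
    keys.iter()
        .map(|&k| k.and_then(|i| values[i].clone()))
        .collect()
}
```

The property a test can check is that the decoded column is unchanged by the masking while every remaining non-null key points at a non-null value.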

Fix bug with handling of null values in dictionaries

c0955f3

adriangb requested a review from davidhewitt February 4, 2025 01:20

adriangb added 5 commits February 3, 2025 17:59

fix

7de2ad7

fmt

a1c3f33

simplify

c055526

remove dep

a926639

fix

d69fd4a

adriangb added 2 commits February 3, 2025 18:06

simplify

6073b9e

Update common.rs

f8c71dd

adriangb added 3 commits February 4, 2025 07:27

simplify, add test

be8ba6b

simplify, add test

af6c807

fmt

861f1df

fix

643294d

davidhewitt approved these changes Feb 4, 2025

adriangb merged commit 2fffb96 into main Feb 4, 2025
7 checks passed

adriangb deleted the fix-bug-dicts branch February 4, 2025 16:17
