Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug with handling of null values in dictionaries #70

Merged
merged 12 commits into from
Feb 4, 2025
Merged

Conversation

adriangb
Copy link
Collaborator

@adriangb adriangb commented Feb 4, 2025

Currently given the query json_get_text(col, 'a') on the data ['{'x': 0}', '{'x': 0}', '{'a': 1}'] where col is a dictionary encoded column originally with keys [0, 0, 1] and values ['{'x': 0}', '{'x': 1}'] we return a dictionary with keys [0, 0, 1] and values [null, null, 1].
But if you look at how arrow-rs builds up dictionaries they always put the nulls in the keys, not the values. The spec does not require this, but I think that things in arrow-rs or DataFusion assume it is so (based on panics I've seen in prod).
This PR works around those bugs elsewhere while we investigate them further.

@adriangb adriangb requested a review from davidhewitt February 4, 2025 01:20
@adriangb
Copy link
Collaborator Author

adriangb commented Feb 4, 2025

Unfortunately:

  1. This seems hard to reproduce in pure DataFusion.
  2. This implementation is buggy and it's hard to implement a generic dictionary rebuild function; ideally we could just use the <Type>DictionaryBuilders.

@codecov-commenter
Copy link

codecov-commenter commented Feb 4, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83.02%. Comparing base (999d672) to head (643294d).

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #70      +/-   ##
==========================================
+ Coverage   82.80%   83.02%   +0.22%     
==========================================
  Files          15       15              
  Lines        1128     1143      +15     
  Branches     1128     1143      +15     
==========================================
+ Hits          934      949      +15     
  Misses        132      132              
  Partials       62       62              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@davidhewitt
Copy link
Collaborator

Let's add a test case to show that we return nulls in the dictionary keys in cases when we would produce null values.

I wonder if we should switch to using dictionary builders immediately if we're going through the effort to re-pack dictionaries at the end anyway.

@adriangb
Copy link
Collaborator Author

adriangb commented Feb 4, 2025

I pushed a test and simplified to just set keys to null as you suggested. I can confirm the test used to fail.

@adriangb adriangb merged commit 2fffb96 into main Feb 4, 2025
7 checks passed
@adriangb adriangb deleted the fix-bug-dicts branch February 4, 2025 16:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants