
Databricks tests #218

Merged: 16 commits merged into eakmanrq:main on Dec 15, 2024
Conversation

@zerodarkzone (Contributor)

Hi,

I've added test support to the Databricks backend, and while doing it I also fixed a couple of problems I found.

@eakmanrq (Owner)

Really appreciate your contribution of these tests. 🙏

I made a few comments but can you also summarize some of the key differences you are seeing with the Databricks engine compared to the PySpark API? From what I can tell it was the following:

  • Timestamp differences (I'm currently refactoring timestamp logic due to a bugfix in SQLGlot that may help. What you have currently is fine though)
  • Differences in how arrays/maps are handled

Let me know if there is anything else worth noting.

In terms of the maps issue, is there a way that we can detect a map and have it return a dict? You can see an example with DuckDB, where it checks the value in the response, tries to determine whether it is a map, and if so converts it to a dict:

@classmethod
def _try_get_map(cls, value: t.Any) -> t.Optional[t.Dict[str, t.Any]]:
    if value and isinstance(value, dict):
        # DuckDB < 1.1.0 support
        if "key" in value and "value" in value:
            return dict(zip(value["key"], value["value"]))
        # DuckDB >= 1.1.0 support
        # If a key is not a string then it must not represent a column and therefore must be a map
        if len([k for k in value if not isinstance(k, str)]) > 0:
            return value
    return None
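
For illustration, here is the same logic as a self-contained sketch (a standalone function rather than the classmethod, with made-up inputs) showing the two DuckDB response shapes it distinguishes:

import typing as t

def try_get_map(value: t.Any) -> t.Optional[t.Dict[t.Any, t.Any]]:
    # Same logic as _try_get_map above, pulled out for demonstration.
    if value and isinstance(value, dict):
        # DuckDB < 1.1.0: a map arrives as parallel "key"/"value" lists.
        if "key" in value and "value" in value:
            return dict(zip(value["key"], value["value"]))
        # DuckDB >= 1.1.0: a map arrives as a plain dict; a non-string key
        # cannot be a column name, so the dict must be a map.
        if any(not isinstance(k, str) for k in value):
            return value
    return None

assert try_get_map({"key": ["a", "b"], "value": [1, 2]}) == {"a": 1, "b": 2}
assert try_get_map({1: "x", 2: "y"}) == {1: "x", 2: "y"}
assert try_get_map({"col": 1}) is None  # struct-shaped row: not detected as a map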

@zerodarkzone (Contributor, Author)

> In terms of the maps issue, is there a way that we can detect a map and have it return a dict? […]

Hi, I'll give it a try

@zerodarkzone (Contributor, Author) commented on Dec 12, 2024

Hi,
I've found a problem with this function.

[screenshot of the function]

When you have a map, it converts all the keys to lowercase, so it makes this test fail:

[screenshot of the failing test]

@eakmanrq (Owner)

> When you have a map, it converts all the keys to lowercase, so it makes this test fail:

Yeah, I think I know why. Since upper is uppercasing the columns in a struct and they aren't quoted, normalization treats the columns as case-insensitive (unquoted identifiers are case-insensitive), so the uppercased keys can come back lowercase.

This is a tricky case that is more of an edge case. Want to transform the keys in some other way that can be measured, do an instance check for Databricks, and add a special test just for Databricks? Maybe append a character or something.
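
For anyone curious, the case-folding can be reproduced with sqlglot (which this project is built on). A rough sketch, assuming a recent sqlglot; exact output may vary by version:

import sqlglot
from sqlglot.optimizer.normalize_identifiers import normalize_identifiers

# Unquoted identifiers are folded to the dialect's default case during
# normalization; quoted identifiers keep their casing.
expr = sqlglot.parse_one('SELECT UPPER_KEY, "UPPER_KEY" FROM t', read="duckdb")
print(normalize_identifiers(expr, dialect="duckdb").sql(dialect="duckdb"))
# SELECT upper_key, "UPPER_KEY" FROM t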

@zerodarkzone (Contributor, Author)

> This is a tricky case that is more of an edge case. Want to transform the keys in some other way that can be measured […]? Maybe append a character or something.

Hi,
This is done.
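
For readers following the thread, a hypothetical sketch of the suggested workaround (not the actual PR change, which is in the commits): derive the expected keys with a transformation that case-insensitive normalization cannot alter, such as appending a character, instead of uppercasing them:

import typing as t

def suffix_keys(m: t.Dict[str, t.Any]) -> t.Dict[str, t.Any]:
    # Hypothetical helper: append a character to each key so the expected
    # result is unaffected by unquoted-identifier case folding.
    return {f"{k}_x": v for k, v in m.items()}

assert suffix_keys({"a": 1, "b": 2}) == {"a_x": 1, "b_x": 2}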

@eakmanrq (Owner) left a comment

Thanks @zerodarkzone! Feel free to merge if you are ready.

@zerodarkzone (Contributor, Author)

Hi,
I think everything is ready to merge. I don't have permission to do it. Could you merge it?

@eakmanrq merged commit 4de9375 into eakmanrq:main on Dec 15, 2024
5 checks passed