
Databricks tests #218

Merged: 16 commits merged into eakmanrq:main on Dec 15, 2024
Conversation

@zerodarkzone (Contributor)

Hi,

I've added test support to the Databricks backend, and while doing it I also fixed a couple of problems I found.

@eakmanrq (Owner)

Really appreciate your contribution of these tests. 🙏

I made a few comments but can you also summarize some of the key differences you are seeing with the Databricks engine compared to the PySpark API? From what I can tell it was the following:

  • Timestamp differences (I'm currently refactoring timestamp logic due to a bugfix in SQLGlot that may help. What you have currently is fine though)
  • Differences in how arrays/maps are handled

Let me know if there is anything else worth noting.

In terms of the maps issue, is there a way that we can detect a map and have it return a dict? You can see an example with DuckDB, where it checks the value in the response, tries to determine whether it is a map, and if so converts it to a dict:

@classmethod
def _try_get_map(cls, value: t.Any) -> t.Optional[t.Dict[str, t.Any]]:
    if value and isinstance(value, dict):
        # DuckDB < 1.1.0 support
        if "key" in value and "value" in value:
            return dict(zip(value["key"], value["value"]))
        # DuckDB >= 1.1.0 support
        # If a key is not a string then it must not represent a column and therefore must be a map
        if len([k for k in value if not isinstance(k, str)]) > 0:
            return value
    return None
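
For illustration, here is the same logic as a self-contained sketch (a standalone function rather than the classmethod, with made-up inputs) showing the two DuckDB response shapes it distinguishes:

import typing as t

def try_get_map(value: t.Any) -> t.Optional[t.Dict[t.Any, t.Any]]:
    # Same logic as _try_get_map above, pulled out for demonstration.
    if value and isinstance(value, dict):
        # DuckDB < 1.1.0: a map arrives as parallel "key"/"value" lists.
        if "key" in value and "value" in value:
            return dict(zip(value["key"], value["value"]))
        # DuckDB >= 1.1.0: a map arrives as a plain dict; a non-string key
        # cannot be a column name, so the dict must be a map.
        if any(not isinstance(k, str) for k in value):
            return value
    return None

assert try_get_map({"key": ["a", "b"], "value": [1, 2]}) == {"a": 1, "b": 2}
assert try_get_map({1: "x", 2: "y"}) == {1: "x", 2: "y"}
assert try_get_map({"col": 1}) is None  # struct-shaped row: not detected as a map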

@zerodarkzone (Contributor, Author)

> In terms of the maps issue, is there a way that we can detect a map and have it return a dict? […]

Hi, I'll give it a try

@zerodarkzone (Contributor, Author) commented on Dec 12, 2024

Hi,
I've found a problem with this function.

[screenshot of the function]

When you have a map, it converts all the keys to lowercase, so it makes this test fail:

[screenshot of the failing test]

@eakmanrq (Owner)

> When you have a map, it converts all the keys to lowercase, so it makes this test fail:

Yeah, I think I know why. Since upper is uppercasing the columns in a struct and they aren't quoted, normalization treats the columns as case-insensitive (unquoted identifiers are case-insensitive), so the uppercased keys can come back lowercase.

This is a tricky case that is more of an edge case. Want to transform the keys in some other way that can be measured, do an instance check for Databricks, and add a special test just for Databricks? Maybe append a character or something.
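
For anyone curious, the case-folding can be reproduced with sqlglot (which this project is built on). A rough sketch, assuming a recent sqlglot; exact output may vary by version:

import sqlglot
from sqlglot.optimizer.normalize_identifiers import normalize_identifiers

# Unquoted identifiers are folded to the dialect's default case during
# normalization; quoted identifiers keep their casing.
expr = sqlglot.parse_one('SELECT UPPER_KEY, "UPPER_KEY" FROM t', read="duckdb")
print(normalize_identifiers(expr, dialect="duckdb").sql(dialect="duckdb"))
# SELECT upper_key, "UPPER_KEY" FROM t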

@zerodarkzone (Contributor, Author)

> This is a tricky case that is more of an edge case. Want to transform the keys in some other way that can be measured […]? Maybe append a character or something.

Hi,
This is done.
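
For readers following the thread, a hypothetical sketch of the suggested workaround (not the actual PR change, which is in the commits): derive the expected keys with a transformation that case-insensitive normalization cannot alter, such as appending a character, instead of uppercasing them:

import typing as t

def suffix_keys(m: t.Dict[str, t.Any]) -> t.Dict[str, t.Any]:
    # Hypothetical helper: append a character to each key so the expected
    # result is unaffected by unquoted-identifier case folding.
    return {f"{k}_x": v for k, v in m.items()}

assert suffix_keys({"a": 1, "b": 2}) == {"a_x": 1, "b_x": 2}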

@eakmanrq (Owner) left a comment

Thanks @zerodarkzone! Feel free to merge if you are ready.

@zerodarkzone (Contributor, Author)

Hi,
I think everything is ready to merge. I don't have permission to do it. Could you merge it?

@eakmanrq merged commit 4de9375 into eakmanrq:main on Dec 15, 2024
5 checks passed