Fix boolean literals #357

aholyoke · 2024-02-16T18:11:42Z

Reproducing the issue:

import sqlalchemy
from sqlalchemy import select

engine = sqlalchemy.create_engine(
    f"databricks://token:{os.environ['DATABRICKS_TOKEN']}@{os.environ['DATABRICKS_SERVER_HOSTNAME']}:443/default",
    connect_args={
        "http_path": os.environ["DATABRICKS_HTTP_PATH"],
    },
)
connection = engine.connect()
result = connection.execute(select(True, False))

Go to Query History in the workspace. See that it executes: SELECT 1 AS anon_1, 0 AS anon_2

This becomes a problem when trying to compare to a boolean column:

In [11]: from sqlalchemy import MetaData, Column, Table, Boolean
    ...: metadata_obj = MetaData()
    ...: table = Table("table_with_bool", metadata_obj, Column("value", Boolean()))
    ...: expression = select(1).select_from(table).where(table.c.value == True)
    ...: print(str(expression.compile(dialect=engine.dialect)))
SELECT 1
FROM table_with_bool
WHERE table_with_bool.value = 1

In Databricks this produces the error:
[DATATYPE_MISMATCH.BINARY_OP_DIFF_TYPES] Cannot resolve "(value = 1)" due to data type mismatch: the left and right operands of the binary operator have incompatible types ("BOOLEAN" and "INT").; line 3 pos 6

SQLAlchemy has a flag called supports_native_boolean that controls this behaviour and is set to false by default. Setting this flag to true starts rendering booleans correctly:

In [12]: class FixedDialect(DatabricksDialect):
    ...:     supports_native_boolean: bool = True
    ...:
    ...:
    ...: metadata_obj = MetaData()
    ...: table = Table("table_with_bool", metadata_obj, Column("value", Boolean()))
    ...: expression = select(1).select_from(table).where(table.c.value == True)
    ...: print(str(expression.compile(dialect=FixedDialect())))
SELECT 1
FROM table_with_bool
WHERE table_with_bool.value = true

At growthloop.com we've been using this fix in production for about a month and haven't had any issues.

I'm confused as to why some of the tests are failing on main when the description of _regression.py in README.tests.md implies tests in this file should be passing, maybe I'm running them incorrectly? Nevertheless, this change seems to be fixing a couple of the reusable dialect tests.
Before:

test/_regression.py::BooleanTest_databricks+databricks::test_null PASSED                 [ 25%]
test/_regression.py::BooleanTest_databricks+databricks::test_render_literal_bool FAILED  [ 50%]
test/_regression.py::BooleanTest_databricks+databricks::test_round_trip PASSED           [ 75%]
test/_regression.py::BooleanTest_databricks+databricks::test_whereclause FAILED          [100%]

After:

test/_regression.py::BooleanTest_databricks+databricks::test_null PASSED                 [ 25%]
test/_regression.py::BooleanTest_databricks+databricks::test_render_literal_bool PASSED  [ 50%]
test/_regression.py::BooleanTest_databricks+databricks::test_round_trip PASSED           [ 75%]
test/_regression.py::BooleanTest_databricks+databricks::test_whereclause PASSED          [100%]

Signed-off-by: Alex Holyoke <alexander.holyoke@growthloop.com>

aholyoke · 2024-03-13T20:05:29Z

@susodapop Can you take a look at this when you get a chance?

yunbodeng-db · 2024-03-13T20:17:17Z

@susodapop Can you take a look at this when you get a chance?

@andrefurlan-db @benc-db @jackyhu-db can you take a look? And run the integration tests of our own before approving merge?

susodapop · 2024-03-14T13:24:27Z

Hey @aholyoke thanks for the ping. I'm not working on this repository anymore so what I can do is limited but you are welcome to ping me with questions about the sqlalchemy dialect any time.

What version of SQLAlchemy are you running?

Here's what I can tell you and the Databricks team:

First, consider the tests that are failing. Notice that test_round_trip is passing even before @aholyoke's change. But test_render_literal_bool fails. That's because SQLAlchemy runs this test in two ways: the one that passes is using native parameters (i.e. the parameter value is being bound at the server). The one that fails does so when SQLAlchemy performs its own parameter inlining, which is separate from the connector's own use_inline_params functionality.

The change proposed by @aholyoke in this PR is a good one as it simply instructs SQLAlchemy to render the string literals true and false rather than 1 and 0.

I'm confused as to why some of the tests are failing on main when the description of _regression.py in README.tests.md implies tests in this file should be passing

You understand _regression.py correctly. The test classes in this file should always pass. The BooleanTest was passing the last time I ran this suite against the Databricks internal testing environment. Since I don't have a test environment anymore I can't dig deeper into why this passed before or why it would now be failing. But assuming that it does fail, it should be fixed.

If I had reviewing privileges on this PR I would approve this, but request one change: Add a unit test for this compilation that mirrors your exact reproduction steps. Here's a patch that does this (if you add me as an editor on your branch I can also just push this to it so it's bundled into your PR):

diff --git a/src/databricks/sqlalchemy/test_local/test_ddl.py b/src/databricks/sqlalchemy/test_local/test_ddl.py
index a83ff24..870b4c8 100644
--- a/src/databricks/sqlalchemy/test_local/test_ddl.py
+++ b/src/databricks/sqlalchemy/test_local/test_ddl.py
@@ -1,5 +1,5 @@
 import pytest
-from sqlalchemy import Column, MetaData, String, Table, create_engine
+from sqlalchemy import Column, MetaData, String, Table, create_engine, select
 from sqlalchemy.schema import (
     CreateTable,
     DropColumnComment,
@@ -93,3 +93,18 @@ class TestTableCommentDDL(DDLTestBase):
         stmt = DropTableComment(table_with_comment)
         output = self.compile(stmt)
         assert output == "COMMENT ON TABLE martin IS NULL"
+
+
+class TestBooleanLiteralDDL(DDLTestBase):
+
+    @pytest.mark.parametrize(
+        "select_input, expected_output",
+        [
+            ((True,), "select true"),
+            ((False,), "select false"),
+        ],
+    )
+    def test_literal_rendering(self, select_input, expected_output):
+        stmt = select(*select_input)
+        compiled = self.compile(stmt)
+        assert expected_output in compiled.lower()

susodapop · 2024-03-14T13:34:46Z

Also @aholyoke are you able to share the exact error message you see when test_render_literal_bool fails? Looking through the test definition, I'm actually a little surprised that it's failing at all. Here's the test definition:

    def test_render_literal_bool(self, literal_round_trip):
        literal_round_trip(Boolean(), [True, False], [True, False])

literal_round_trip is a fixture built-in to the SQLAlchemy test suite. It writes a SELECT statement and passes it Python types that will be rendered as string literals. Then it reads the return value of the select and compares the data received to the data sent.

Boolean() is the SQLAlchemy type mapping that will be used for the test. It will use sqlalchemy.types.Boolean to render the value in the query and to marshal the value returned by the query
[True, False] are the values that will be passed to Boolean() for rendering during the emission of the SELECT
[True, False] are the expected return values from the subsequent SELECT

The Boolean() type prior to your change would write True as 1 and False as 0. Which is what causes the type mismatch error you see from Databricks when using .where(some_col == True) in your queries. But I don't think that would be detectable in this test because sqlalchemy.types.Boolean() interprets 1 as True and 0 as False when supports_native_boolean is disabled.

In other words: this test should pass before your change. Thus my question at the top of this comment.

I don't have a databricks environment I can run tests against anymore so I'm pawing around in the dark.

benc-db · 2024-03-14T20:52:28Z

Unit tests pass, trying integration now.

Set supports_native_boolean to True

82a1ca0

Signed-off-by: Alex Holyoke <alexander.holyoke@growthloop.com>

aholyoke requested review from rcypher-databricks, arikfr, yunbodeng-db, andrefurlan-db, jackyhu-db and benc-db as code owners February 16, 2024 18:11

benc-db temporarily deployed to azure-prod March 14, 2024 20:52 — with GitHub Actions Inactive

benc-db approved these changes Mar 14, 2024

View reviewed changes

benc-db merged commit c103352 into databricks:main Mar 14, 2024
13 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix boolean literals #357

Fix boolean literals #357

aholyoke commented Feb 16, 2024

aholyoke commented Mar 13, 2024

yunbodeng-db commented Mar 13, 2024

susodapop commented Mar 14, 2024

susodapop commented Mar 14, 2024

benc-db commented Mar 14, 2024

Fix boolean literals #357

Fix boolean literals #357

Conversation

aholyoke commented Feb 16, 2024

aholyoke commented Mar 13, 2024

yunbodeng-db commented Mar 13, 2024

susodapop commented Mar 14, 2024

susodapop commented Mar 14, 2024

benc-db commented Mar 14, 2024