-
Notifications
You must be signed in to change notification settings - Fork 14.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make schema name for the CTA queries and limit configurable #8867
Conversation
We would appreciate it if you could provide us with more info about this issue/pr! Please do not leave the |
fabcf75
to
99b5b6c
Compare
superset/sql_lab.py
Outdated
@@ -366,6 +368,11 @@ def execute_sql_statements( | |||
payload = handle_query_error(msg, query, session, payload) | |||
return payload | |||
|
|||
# Commit the connection so CTA queries will create the table. | |||
# TODO(bk): consider if it's only needed for postgres. | |||
if conn: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When would conn
be falsy?
@@ -222,7 +222,10 @@ def get_query_with_new_limit(self, new_limit: int) -> str: | |||
limit_pos = pos | |||
break | |||
_, limit = statement.token_next(idx=limit_pos) | |||
if limit.ttype == sqlparse.tokens.Literal.Number.Integer: | |||
# Override the limit only when it exceeds the configured value. | |||
if limit.ttype == sqlparse.tokens.Literal.Number.Integer and new_limit < int( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mmmh, this here in theory changes what the function is expected to do ("returns the query with the specified limit"), so either we change the name/docstring to reflect that, or either we move the conditional logic towards where the function is called.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the docstring, yeah there is a mismatch. It's hard to move this condition outside of this function as it needs to get the existing limit value from the query and would require to parse query twice. I can't think about the usecase where we would want to override lower user limit with the higher configured value, e.g. I would expect to see 1 row when I query select * from bla limit 1 rather than 100.
superset/config.py
Outdated
# Flag that controls if limit should be inforced on the create table as queries. | ||
SQLLAB_CTA_NO_LIMIT = False | ||
|
||
# Function accepts username, schema name and sql that will be run e.g.: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not super clear here what this does or why you'd want to use this config element. I had to read a bit of code to understand it.
This allows you to define custom logic around the "CREATE TABLE AS" or CTAS feautre in SQL Lab that defines where the target schema should be for a given user.
Then add a proper example with type annotation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we want the database object as a param too as the policy may differ across databases. Also could be good to pass in the user object instead of the username in case someone would want to dig into roles/perms.
superset/views/core.py
Outdated
@@ -2591,6 +2600,15 @@ def sql_json_exec( | |||
# Set tmp_table_name for CTA | |||
if select_as_cta and mydb.force_ctas_schema: | |||
tmp_table_name = f"{mydb.force_ctas_schema}.{tmp_table_name}" | |||
elif select_as_cta: | |||
dest_schema_name = get_cta_schema_name( | |||
schema, sql, g.user.username if g.user else None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
g.user
will only be there if you're in sync mode (inside the scope of a web request), won't work on Celery.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
views/core.py
is in the scope of the web request, this happens before query object is created and passed to the celery worker - we should be safe here.
superset/views/core.py
Outdated
@@ -2644,8 +2662,9 @@ def sql_json_exec( | |||
) | |||
|
|||
# set LIMIT after template processing | |||
limits = [mydb.db_engine_spec.get_limit_from_sql(rendered_query), limit] | |||
query.limit = min(lim for lim in limits if lim is not None) | |||
if not config.get("SQLLAB_CTA_NO_LIMIT", False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can assume that the key exists no need for default to False here, and if would be False anyways if the key doesn't exist
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point :)
c52c861
to
47c7e38
Compare
861a68d
to
07787c2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice refactor! My only remaining issue is with the following scenarios:
- user enters a fully qualified name in the table field (highlighted in you TODO comment)
- user has chars in CTAS schema/table name that require quotes in fully qualified name.
I think we can live with these restrictions for now, so leaning towards getting this merged soon. Let me test this locally before final approval.
|
||
def upgrade(): | ||
op.add_column( | ||
"query", sa.Column("tmp_schema_name", sa.String(length=256), nullable=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(not to other reviewers) At first I was confused by the choice of column name here, but turns out there was already tmp_table_name
for the CTAS target target table. A more appropriate name might be target_table_name
or ctas_table_name
, but no point in changing the current convention in this PR.
# sqlite doesn't support schemas | ||
return | ||
tmp_table_name = "tmp_async_22" | ||
expected_full_table_name = f"{CTAS_SCHEMA_NAME}.{tmp_table_name}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Back to the topic of quoting.. select_star
in BaseEngineSpec
quotes schema
and table_name
, and I'm wondering if we shouldn't do that here (views/core.py:Superset.sql_json_exec
), too. If not, then we just need to assume users won't be doing CTAS into schemas/tables with periods, which seems like a reasonable assumption, for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am open to either, quote while generating SQL or quote early on when getting the user input.
I think both approaches are good, the latter one would be a bit involved as table object creation should be quoted for SQLATable, Column and Metric. This test case just demonstrates the existing behavior, modifying it probably would be out of scope of this change, but definitely a useful improvement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's not push too much complexity into this PR, we can deal with this later.
07787c2
to
5117648
Compare
Fixing unit tests Fix table quoting Mypy Split tests out for sqlite Grant more permissions for mysql user Postgres doesn't support if not exists More logging Commit for table creation Priviliges for postgres Update tests Resolve comments Lint No limits for the CTA queries if configures
5117648
to
6661e39
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tested locally and works well. Had missed the type comments before, so would like to get those fixed; beyond that this looks good to go for me.
superset/config.py
Outdated
SQLLAB_CTA_SCHEMA_NAME_FUNC = ( | ||
None | ||
) # type: Optional[Callable[["Database", "models.User", str, str], str]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A bit of a nit, but as Superset has deprecated support for python 2.7, we prefer regular type annotations over type comments, i.e.
SQLLAB_CTA_SCHEMA_NAME_FUNC: Optional[
Callable[["Database", "models.User", str, str], str]
] = None
superset/views/core.py
Outdated
func = config.get( | ||
"SQLLAB_CTA_SCHEMA_NAME_FUNC" | ||
) # type: Optional[Callable[[Database, ab_models.User, str, str], str]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above. Also, we prefer using square brackets when reading configs, i.e. config["SQLLAB_CTA_SCHEMA_NAME_FUNC"]
to avoid having to check for None
(doesn't really apply in this case, but it's a convention nonetheless).
superset/views/core.py
Outdated
# Set tmp_schema_name for CTA | ||
# TODO(bkyryliuk): consider parsing, splitting tmp_schema_name from tmp_table_name if user enters | ||
# <schema_name>.<table_name> | ||
tmp_schema_name = schema # type: Optional[str] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. @mistercrunch @john-bodley @dpgaspar a second opinion would be good here, as the review process has been fairly lengthy with lots of back and forth, possibly resulting in us overlooking something important.
:param overwrite: table_name will be dropped if true | ||
:return: Create table as query | ||
""" | ||
exec_sql = "" | ||
sql = self.stripped() | ||
# TODO(bkyryliuk): quote full_table_name | ||
full_table_name = f"{schema_name}.{table_name}" if schema_name else table_name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn’t we address the TODO? Note the quoter needs to be dialect specific.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@john-bodley this would be an additional feature. I kept the logic as it was before.
It is worth to resolve this todo and it is existing bug in superset, but I think this PR is not a right place for the fix as it is already quite large & hard to review and comprehend.
@@ -64,8 +64,10 @@ jobs: | |||
- redis-server | |||
before_script: | |||
- mysql -u root -e "DROP DATABASE IF EXISTS superset; CREATE DATABASE superset DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci" | |||
- mysql -u root -e "DROP DATABASE IF EXISTS sqllab_test_db; CREATE DATABASE sqllab_test_db DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It’s not clear from this PR why we need these additional two databases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@john-bodley this is purely for testing, e.g. there is a need to have 2 different schemas in the mysql & postgres to test the CTA behavior
As this PR includes a db migration, it would be nice to merge soon if there are no objections. I'll leave this open for a few more days, then merging if no further comments surface. |
Description
This PR implements an ability to configure the logic that generates the schema name for the create table as queries based on the database, user, used schema and executed SQL.
There are 2 new parameters in the config, both are off by default.
Use cases
Other changes
Tests
[x] unit test + integration tests
[x] tested locally on mysql & postgres
[x] is used in production @ dropbox for over a month