refactor(pinot): The python_date_format for a temporal column was not being passed to get_timestamp_expr #24942

Merged
merged 27 commits into from
Aug 27, 2023
Changes from 8 commits
4 changes: 3 additions & 1 deletion superset/connectors/sqla/models.py
@@ -990,7 +990,9 @@ def adhoc_column_to_sqla(  # pylint: disable=too-many-locals
         time_grain = col.get("timeGrain")
         has_timegrain = col.get("columnType") == "BASE_AXIS" and time_grain
         is_dttm = False
+        pdf = None
         if col_in_metadata := self.get_column(expression):
+            pdf = col_in_metadata.python_date_format
             sqla_column = col_in_metadata.get_sqla_col(
                 template_processor=template_processor
             )
@@ -1011,7 +1013,7 @@ def adhoc_column_to_sqla(  # pylint: disable=too-many-locals
         if is_dttm and has_timegrain:
             sqla_column = self.db_engine_spec.get_timestamp_expr(
                 col=sqla_column,
-                pdf=None,
+                pdf=pdf,
Member

@john-bodley john-bodley Aug 10, 2023

I'm concerned that this change may actually break existing logic—given it was explicitly set to None. Would you mind adding a unit test for this which helps not just to provide code coverage, but also helps reviewers et al. grok the consequence of the change.

@zhaoyongjie it seems like you added this logic in #21163 and thus you probably have the most context as to why we historically weren't defining the pdf variable.

Contributor Author

Unfortunately, in my testing, whenever I tried to create a chart or use a dashboard, a column marked as temporal would always reach get_timestamp_expr via adhoc_column_to_sqla, which means that the user-defined date format is never passed to the DB engine spec.

It's possible that the root cause is that get_timestamp_expr is being called through adhoc_column_to_sqla when it should instead be called via TableColumn.get_timestamp_expression (the only other call path to get_timestamp_expr I could find). But all my tests pointed to adhoc_column_to_sqla being the root cause.
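
To make the reported behaviour concrete, here is a minimal, self-contained sketch of the two call paths. It does not use Superset itself: `FakeColumn`, `fake_get_timestamp_expr`, `adhoc_path_before` and `adhoc_path_after` are hypothetical stand-ins, and the only detail taken from this PR is that an engine spec such as Pinot's raises when it receives an empty date format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeColumn:
    """Hypothetical stand-in for a dataset column with a configured date format."""

    name: str
    python_date_format: Optional[str] = None


def fake_get_timestamp_expr(
    col: str, pdf: Optional[str], time_grain: Optional[str]
) -> str:
    """Hypothetical engine-spec hook that, like Pinot's, needs a date format."""
    if not pdf:
        raise NotImplementedError(f"Empty date format for '{col}'")
    return f"{col} [format={pdf}, grain={time_grain}]"


def adhoc_path_before(column: FakeColumn, time_grain: str) -> str:
    # Behaviour before this PR: the configured format is dropped.
    return fake_get_timestamp_expr(column.name, None, time_grain)


def adhoc_path_after(column: FakeColumn, time_grain: str) -> str:
    # Behaviour proposed in this PR: forward the column's own format.
    return fake_get_timestamp_expr(
        column.name, column.python_date_format, time_grain
    )


ts = FakeColumn("event_time", python_date_format="epoch_ms")
print(adhoc_path_after(ts, "PT1H"))
try:
    adhoc_path_before(ts, "PT1H")
except NotImplementedError as exc:
    print(f"before the change: {exc}")
```

Under these assumptions the old path fails for every temporal column, regardless of what the user configured on the dataset, which matches the symptom described above.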

Member

@zhaoyongjie zhaoyongjie Aug 10, 2023

@john-bodley "pdf" is shorthand for the date format (seconds or milliseconds). This code has existed for many years. "pdf" is only used for calculated columns and columns from the database, not for ad-hoc expressions, so we shouldn't make this change.

image

Member

@zhaoyongjie zhaoyongjie Aug 10, 2023

@ege-st

The design of the current Pinot DB spec is completely incorrect. Maintaining our own Pinot driver and db_spec should solve your issue.

class PinotEngineSpec(BaseEngineSpec):  # pylint: disable=abstract-method
    engine = "pinot"
    engine_name = "Apache Pinot"
    allows_subqueries = False
    allows_joins = False
    allows_alias_in_select = True
    allows_alias_in_orderby = False

    # https://docs.pinot.apache.org/users/user-guide-query/supported-transformations#datetime-functions
    _time_grain_expressions = {
        None: "{col}",
        "PT1S": "CAST(DATE_TRUNC('second', CAST({col} AS TIMESTAMP)) AS TIMESTAMP)",
        "PT1M": "CAST(DATE_TRUNC('minute', CAST({col} AS TIMESTAMP)) AS TIMESTAMP)",
        "PT5M": "CAST(ROUND(DATE_TRUNC('minute', CAST({col} AS TIMESTAMP)), 300000) as TIMESTAMP)",
        "PT10M": "CAST(ROUND(DATE_TRUNC('minute', CAST({col} AS TIMESTAMP)), 600000) as TIMESTAMP)",
        "PT15M": "CAST(ROUND(DATE_TRUNC('minute', CAST({col} AS TIMESTAMP)), 900000) as TIMESTAMP)",
        "PT30M": "CAST(ROUND(DATE_TRUNC('minute', CAST({col} AS TIMESTAMP)), 1800000) as TIMESTAMP)",
        "PT1H": "CAST(DATE_TRUNC('hour', CAST({col} AS TIMESTAMP)) AS TIMESTAMP)",
        "P1D": "CAST(DATE_TRUNC('day', CAST({col} AS TIMESTAMP)) AS TIMESTAMP)",
        "P1W": "CAST(DATE_TRUNC('week', CAST({col} AS TIMESTAMP)) AS TIMESTAMP)",
        "P1M": "CAST(DATE_TRUNC('month', CAST({col} AS TIMESTAMP)) AS TIMESTAMP)",
        "P3M": "CAST(DATE_TRUNC('quarter', CAST({col} AS TIMESTAMP)) AS TIMESTAMP)",
        "P1Y": "CAST(DATE_TRUNC('year', CAST({col} AS TIMESTAMP)) AS TIMESTAMP)",
    }

    @classmethod
    def column_datatype_to_string(
        cls, sqla_column_type: TypeEngine, dialect: Dialect
    ) -> str:
        # The Pinot driver infers TIMESTAMP columns as LONG, so this is a quick fix.
        # Once the Pinot driver fixes this bug, this method can be removed.
        if isinstance(sqla_column_type, types.TIMESTAMP):
            return sqla_column_type.compile().upper()
        else:
            return super().column_datatype_to_string(sqla_column_type, dialect)

driver at: https://github.com/BurdaForward/pinot-dbapi/tree/bf_release
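
As a rough illustration of how a `_time_grain_expressions` entry becomes SQL, the sketch below substitutes a column name into two of the templates from the spec above. `render_time_grain` is a hypothetical helper, and plain `str.replace` is an assumption made for the sketch; Superset performs the substitution in its own compile step.

```python
from typing import Optional

# Two templates copied from the _time_grain_expressions mapping above.
TIME_GRAIN_EXPRESSIONS = {
    None: "{col}",
    "PT1H": "CAST(DATE_TRUNC('hour', CAST({col} AS TIMESTAMP)) AS TIMESTAMP)",
}


def render_time_grain(col: str, time_grain: Optional[str]) -> str:
    """Hypothetical helper: substitute the column name into the grain template."""
    template = TIME_GRAIN_EXPRESSIONS[time_grain]
    return template.replace("{col}", col)


print(render_time_grain("ts", "PT1H"))
print(render_time_grain("ts", None))
```

Note that none of these templates depend on a date format: the cast to TIMESTAMP happens in the database.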

Member

@zhaoyongjie I'm aware of what the Python Date Format (PDF) represents, though thanks for clarifying that this shouldn't be used for ad-hoc expressions.

Note we do already have a Pinot DB engine spec, but maybe only adding the column_datatype_to_string method is required.

Member

@ege-st I also wondered if this was an underlying issue with the Pinot SQLAlchemy dialect. You might want to look into the visit_label method.

Contributor Author

@zhaoyongjie I've confirmed that this error happens with the latest versions of Pinot, so Superset can't alias a projection to the same name as a column that already exists. I looked at the diff you provided, but it appears to be against a version of models.py that differs from the one on the master branch.

@john-bodley could you provide some more detail? Is SQLAlchemy generating the alias name used in the projection? If so, then it could be an issue with the dialect, but if Superset generates the alias label then I'm not sure how the dialect can address this.

Member

@zhaoyongjie zhaoyongjie Aug 15, 2023

@ege-st the git diffs are from the Superset 2.1.0 branch. There aren't many changes, so you should apply them manually 🖨️🖨️🖨️

You should change this part of the code on the master branch:

def make_sqla_column_compatible(
    self, sqla_col: ColumnElement, label: str | None = None
) -> ColumnElement:
    """Takes a sqlalchemy column object and adds label info if supported by engine.
    :param sqla_col: sqlalchemy column instance
    :param label: alias/label that column is expected to have
    :return: either a sql alchemy column or label instance if supported by engine
    """
    label_expected = label or sqla_col.name
    # add quotes to tables
    if self.db_engine_spec.allows_alias_in_select:
        label = self.db_engine_spec.make_label_compatible(label_expected)
        sqla_col = sqla_col.label(label)
    sqla_col.key = label_expected
    return sqla_col
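
The effect of the allows_alias_in_select branch in this function can be sketched without SQLAlchemy. `projection` below is a hypothetical, heavily simplified stand-in that only shows whether an `AS` alias ends up in the projected expression; the real method returns a SQLAlchemy column or label object rather than a string.

```python
def projection(column_expr: str, label: str, allows_alias_in_select: bool) -> str:
    """Hypothetical, simplified rendering of one projected column."""
    if allows_alias_in_select:
        # The engine supports aliases, so the expression is labelled.
        return f"{column_expr} AS {label}"
    # No alias is emitted; the expression is projected as-is.
    return column_expr


expr = "DATE_TRUNC('hour', ts)"
print(projection(expr, "ts", allows_alias_in_select=True))
print(projection(expr, "ts", allows_alias_in_select=False))
```

With the flag enabled, the alias reuses the name of an existing column (`ts`), which is the situation reported as failing on Pinot in this thread.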

Contributor Author

@zhaoyongjie so I believe I figured out a workaround for the alias issue. If I just set allows_alias_in_select = False then the query generated by Superset does not use an alias and the query is then compatible with Pinot. So, I don't think any of the additional changes you kindly suggested are necessary.

One question that I have is: what is the purpose of the pdf that gets defined in the dataset configuration? Since it isn't passed into the engine spec when creating a chart, it can't be used in the query generation, so it doesn't seem to serve a purpose?

Member

@zhaoyongjie zhaoyongjie Aug 15, 2023

@ege-st

> so I believe I figured out a workaround for the alias issue. If I just set allows_alias_in_select = False then the query generated by Superset does not use an alias and the query is then compatible with Pinot. So, I don't think any of the additional changes you kindly suggested are necessary.

Sounds good! That should work.

> One question that I have is: what is the purpose of the pdf that gets defined in the dataset configuration? Since it isn't passed into the engine spec when creating a chart, it can't be used in the query generation, so it doesn't seem to serve a purpose?

I think "pdf" was originally designed as a hard-coded way to get a timestamp from a string, but a type-conversion expression is more graceful: the conversion should be pushed down and run in the database rather than calculated on the client.
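
For context, a client-side conversion driven by a date format might look roughly like the sketch below. `parse_with_pdf` is a hypothetical helper, not Superset code; it assumes the format is either one of the `epoch_s` / `epoch_ms` markers that appear in the Pinot spec or a Python `strptime` pattern.

```python
from datetime import datetime, timezone
from typing import Union


def parse_with_pdf(value: Union[int, str], pdf: str) -> datetime:
    """Hypothetical client-side conversion of a raw value using a date format."""
    if pdf == "epoch_s":
        return datetime.fromtimestamp(int(value), tz=timezone.utc)
    if pdf == "epoch_ms":
        return datetime.fromtimestamp(int(value) / 1000, tz=timezone.utc)
    # Anything else is treated as a strptime pattern (an assumption of this sketch).
    return datetime.strptime(str(value), pdf)


print(parse_with_pdf(1500, "epoch_ms"))
print(parse_with_pdf("2023-08-10", "%Y-%m-%d"))
```

The alternative described in the comment above is to leave this work to the database, for example with a `CAST({col} AS TIMESTAMP)` expression, so that no per-column format has to be threaded through query generation.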

                 time_grain=time_grain,
             )
         return self.make_sqla_column_compatible(sqla_column, label)
5 changes: 4 additions & 1 deletion superset/db_engine_specs/pinot.py
@@ -78,7 +78,9 @@ def get_timestamp_expr(
         time_grain: Optional[str],
     ) -> TimestampExpression:
         if not pdf:
+            # If there is no python date format (pdf) given then we cannot determine how to correctly handle the timestamp
             raise NotImplementedError(f"Empty date format for '{col}'")
+
         is_epoch = pdf in ("epoch_s", "epoch_ms")

         # The DATETIMECONVERT pinot udf is documented at
@@ -99,12 +101,13 @@ def get_timestamp_expr(
         else:
             seconds_or_ms = "MILLISECONDS" if pdf == "epoch_ms" else "SECONDS"
             tf = f"1:{seconds_or_ms}:EPOCH"
+
         if time_grain:
             granularity = cls.get_time_grain_expressions().get(time_grain)
             if not granularity:
                 raise NotImplementedError(f"No pinot grain spec for '{time_grain}'")
         else:
-            return TimestampExpression("{{col}}", col)
+            return TimestampExpression("{col}", col)
Member

Would you also mind adding some unit tests for Pinot which cover the get_timestamp_expr function. You can find many other examples of this in the other DB engine specs.

Contributor Author

Sure, not a problem.
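
A rough idea of what such tests could check, written without importing Superset: `timestamp_template` is a hypothetical stand-in that mirrors only the branches visible in this hunk, and `render` assumes the `{col}` placeholder is filled by plain substring replacement, which may differ from how Superset actually compiles a TimestampExpression.

```python
from typing import Optional


def timestamp_template(
    col: str, pdf: Optional[str], time_grain: Optional[str]
) -> str:
    """Hypothetical stand-in mirroring only the branches visible in this hunk."""
    if not pdf:
        raise NotImplementedError(f"Empty date format for '{col}'")
    if not time_grain:
        return "{col}"  # was "{{col}}" before this change
    raise NotImplementedError("time-grain branch not covered by this sketch")


def render(template: str, col: str) -> str:
    """Assumed substitution step: plain replacement of the {col} placeholder."""
    return template.replace("{col}", col)


def test_no_time_grain_returns_bare_column() -> None:
    assert render(timestamp_template("ts", "epoch_ms", None), "ts") == "ts"


def test_old_template_left_stray_braces() -> None:
    # With doubled braces, replacement leaves one brace on each side.
    assert render("{{col}}", "ts") == "{ts}"


def test_empty_pdf_raises() -> None:
    try:
        timestamp_template("ts", None, None)
    except NotImplementedError:
        return
    raise AssertionError("expected NotImplementedError")


test_no_time_grain_returns_bare_column()
test_old_template_left_stray_braces()
test_empty_pdf_raises()
```

Real tests would call PinotEngineSpec.get_timestamp_expr directly, following the pattern used for the other DB engine specs.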


         # In pinot the output is a string since there is no timestamp column like pg
         if cls._use_date_trunc_function.get(time_grain):