Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Apply normalization to all dttm columns #25147

Merged
merged 7 commits into from
Oct 6, 2023

Conversation

kgabryje
Copy link
Member

@kgabryje kgabryje commented Aug 31, 2023

SUMMARY

If python_date_format is set for a column, normalize the column to a datetime format. When that column is filtered by a timestamp value, denormalize the timestamp to the original format.

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

Before:

image image

After:

image image

TESTING INSTRUCTIONS

  1. Create a chart using video_games_sales dataset from sample data
  2. Use column year which has python_date_format set to %Y
  3. Verify that the chart works correctly
  4. Verify that in the "Results" table the original value of year column is a timestamp and formatted value is in a human readable datetime format
  5. Verify that adding a time range filter on column year works correctly
  6. Add the chart to a dashboard and add a VALUE native filter
  7. Select a few values (years), apply the filter and verify that the chart was correctly filtered
  8. Repeat those steps using a different dataset and a dttm column which doesn't have python_date_format set

ADDITIONAL INFORMATION

  • Has associated issue:
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

@kgabryje kgabryje marked this pull request as draft August 31, 2023 16:11
@kgabryje kgabryje requested a review from villebro August 31, 2023 16:12
@john-bodley john-bodley added review:checkpoint Last PR reviewed during the daily review standup and removed review:checkpoint Last PR reviewed during the daily review standup labels Aug 31, 2023
@kgabryje kgabryje force-pushed the fix/normalize-all-dttm-cols branch from ca18dc7 to df2c390 Compare September 26, 2023 14:43
@kgabryje kgabryje marked this pull request as ready for review September 26, 2023 14:58
@kgabryje kgabryje force-pushed the fix/normalize-all-dttm-cols branch from df2c390 to 3fa4dbe Compare September 26, 2023 15:18
Copy link
Member

@villebro villebro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two small style comments. Also, could we add a unit test that at least covers the happy path here?

@@ -284,6 +284,7 @@ def _get_timestamp_format(
label
for label in [
*get_base_axis_labels(query_object.columns),
*[col for col in query_object.columns or [] if isinstance(col, str)],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there risk that labels will contain duplicated columns? If so, maybe we should put a set(...) inside the tuple(...) to dedupe them?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I don't know if there's such risk but can't hurt to make it bulletproof 👍

Comment on lines 78 to 79
and kwargs.get("filters")
and len(kwargs["filters"])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we simplify to

Suggested change
and kwargs.get("filters")
and len(kwargs["filters"])
and kwargs.get("filters", []))

Comment on lines +144 to +146
column = datasource.get_column(filter_col)
if not column:
continue
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mentioned during our call that you've added lines 145-146 to make the linter happy. A better way — more explicit — is to do:

            column = cast(BaseColumn, datasource.get_column(filter_col))

This informs the linter that column is always of type BaseColumn.

Copy link
Member Author

@kgabryje kgabryje Sep 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn’t it safer to check for None since .get_column can return None if the col name doesn't exist? I don't know if such scenario is possible though

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I thought datasource.get_column(filter_col) was guaranteed to return a truthy value. If that's not the case, then this is fine.

@kgabryje kgabryje force-pushed the fix/normalize-all-dttm-cols branch from ccac518 to acc4d2d Compare October 2, 2023 14:21
@pull-request-size pull-request-size bot added size/L and removed size/M labels Oct 2, 2023
@kgabryje
Copy link
Member Author

kgabryje commented Oct 2, 2023

@betodealmeida @villebro Tests added!

@kgabryje kgabryje force-pushed the fix/normalize-all-dttm-cols branch from e638e7a to 1a57cb8 Compare October 3, 2023 08:23
Copy link
Member

@betodealmeida betodealmeida left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@@ -66,6 +73,10 @@ def create( # pylint: disable=too-many-arguments
)
kwargs["from_dttm"] = from_dttm
kwargs["to_dttm"] = to_dttm
if datasource_model_instance and kwargs.get("filters", []):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if datasource_model_instance and kwargs.get("filters", []):
if datasource_model_instance and kwargs.get("filters"):

Comment on lines +144 to +146
column = datasource.get_column(filter_col)
if not column:
continue
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I thought datasource.get_column(filter_col) was guaranteed to return a truthy value. If that's not the case, then this is fine.

@betodealmeida betodealmeida merged commit 58fcd29 into apache:master Oct 6, 2023
@michael-s-molina michael-s-molina added the v3.0 Label added by the release manager to track PRs to be included in the 3.0 branch label Oct 9, 2023
michael-s-molina pushed a commit that referenced this pull request Oct 9, 2023
return value

for query_filter in query_filters:
if query_filter.get("op") == FilterOperator.TEMPORAL_RANGE:
Copy link
Member

@john-bodley john-bodley Oct 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than the continue I think one if statement (based on truthiness and not falseness) is likely more readable. i.e.,

if (
    query_filter.get("op") != FilterOperator.TEMPORAL_RANGE
    and (filter_col := query_filter.get("col"))
    and isinstance(filter_col, str)
    and (column := datasource.get_column(filter_col))
 ):
     ...

Copy link
Member

@john-bodley john-bodley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kgabryje BTW the reason for the comments post merge is at Airbnb we're started seeing a number of issues with filters returning the wrong temporal type, i.e., %Y-%m-%d columns are being retuned as epoch (in milliseconds) which is throwing errors. My current hypothesis is this PR might be related to the issue as it was part of 3.0.1 which we deployed internally a few days ago.

Additionally per the PR description it states,

If python_date_format is set for a column, normalize the column to a datetime format. When that column is filtered by a timestamp value, denormalize the timestamp to the original format.

however I'm struggling to find the logic the normalization (or denormalization) logic, i.e., I only see one transformation which seems to be stringifying epoch.

def _process_filters(
self, datasource: BaseDatasource, query_filters: list[QueryObjectFilterClause]
) -> list[QueryObjectFilterClause]:
def get_dttm_filter_value(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of this logic and the logic defined on lines 156-164 was already defined here. Wouldn't it be better (and potentially less error prone) if we adhered to the DRY principle and reused the same helper function.


def _process_filters(
self, datasource: BaseDatasource, query_filters: list[QueryObjectFilterClause]
) -> list[QueryObjectFilterClause]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we returning the filters where in actuality we're mutating the query_filters in place?

@@ -185,6 +185,7 @@ def _apply_granularity(
filter
for filter in query_object.filter
if filter["col"] != filter_to_remove
or filter["op"] != "TEMPORAL_RANGE"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason for this change, i.e., why are we including a filter which was flagged for removal if it operation is not a temporal range?

value: Any, col: BaseColumn, date_format: str
) -> int | str:
if not isinstance(value, int):
return value
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If value is Any but not an int there's no guarantee that the return type will be a str. Am I missing something here?

@kgabryje
Copy link
Member Author

kgabryje commented Oct 20, 2023

Hey @john-bodley thanks for comments! I'm going to work with @michael-s-molina on reproducing the bug that you encountered and I'll address the rest of your comments when I work on the fix

return value
if date_format in {"epoch_ms", "epoch_s"}:
if date_format == "epoch_s":
value = str(value)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If value is coming from the frontend and is a timestamp in milliseconds shouldn't epoch_s be str(value / 1000) and epoch_ms be str(value)?

saghatelian added a commit to 10webio/superset that referenced this pull request Oct 23, 2023
* fix: is_select with UNION (apache#25290)

(cherry picked from commit bb002d6)

* fix: Add explicit ON DELETE CASCADE for dashboard_roles (apache#25320)

(cherry picked from commit d54e827)

* fix(chart): Supporting custom SQL as temporal x-axis column with filter (apache#25126)

Co-authored-by: Kamil Gabryjelski <kamil.gabryjelski@gmail.com>

* fix: Use RLS clause instead of ID for cache key (apache#25229)

(cherry picked from commit fba66c6)

* fix: Improve the reliability of alerts & reports (apache#25239)

(cherry picked from commit f672d5d)

* fix: DashboardRoles cascade operation (apache#25349)

(cherry picked from commit a971a28)

* fix: datetime with timezone excel export (apache#25318)

Co-authored-by: Michael S. Molina <70410625+michael-s-molina@users.noreply.github.com>
(cherry picked from commit 5ebcd2a)

* fix: Workaround for Cypress ECONNRESET error (apache#25399)

(cherry picked from commit d76ff39)

* fix(sqllab): invalid persisted tab state (apache#25308) (apache#25398)

* fix: Rename on_delete parameter to ondelete (apache#25424)

(cherry picked from commit 893b45f)

* fix: preventing save button from flickering in SQL Lab (apache#25106)

(cherry picked from commit 296ff17)

* fix: chart import (apache#25425)

(cherry picked from commit a4d8f36)

* fix: swagger UI CSP error (apache#25368)

(cherry picked from commit 1716b9f)

* fix: smarter date formatter (apache#25404)

(cherry picked from commit f0080f9)

* fix(sqllab): invalid start date (apache#25437)

* fix(nativeFilters): Speed up native filters by removing unnecessary rerenders (apache#25282)

Co-authored-by: JUST.in DO IT <justin.park@airbnb.com>
(cherry picked from commit a0eeb4d)

* fix(SqlLab): make icon placement even (apache#25372)

(cherry picked from commit 11b49a6)

* fix: Duplicate items when pasting into Select (apache#25447)

(cherry picked from commit 7cf96cd)

* fix: update the SQLAlchemy model definition at json column for Log table (apache#25445)

(cherry picked from commit e83a76a)

* fix(helm chart): set chart appVersion to 3.0.0 (apache#25373)

* fix(mysql): handle string typed decimal results (apache#24241)

(cherry picked from commit 7eab59a)

* fix: Styles not loading because of faulty CSP setting (apache#25468)

(cherry picked from commit 0cebffd)

* fix(sqllab): error with lazy_gettext for tab titles (apache#25469)

(cherry picked from commit ddde178)

* fix: Address Mypy issue which is causing CI to fail (apache#25494)

(cherry picked from commit 36ed617)

* chore: Adds 3.0.1 CHANGELOG

* fix: Unable to sync columns when database or dataset name contains `+` (apache#25390)

(cherry picked from commit dbe0838)

* fix(sqllab): Broken query containing 'children' (apache#25490)

(cherry picked from commit b92957e)

* chore: Expand error detail on screencapture (apache#25519)

(cherry picked from commit ba541e8)

* fix: tags permissions error message (apache#25516)

(cherry picked from commit 50b0816)

* fix: Apply normalization to all dttm columns (apache#25147)

(cherry picked from commit 58fcd29)

* fix: REST API CSRF exempt list (apache#25590)

(cherry picked from commit 549abb5)

* fix(RLS): Fix Info Tooltip + Button Alignment on RLS Modal (apache#25400)

(cherry picked from commit a6d0e6f)

* fix: thubmnails loading - Talisman default config (apache#25486)

(cherry picked from commit 52f631a)

* fix(Presto): catch DatabaseError when testing Presto views (apache#25559)

Co-authored-by: Rui Zhao <zhaorui@dropbox.com>
(cherry picked from commit be3714e)

* fix(Charts): Set max row limit + removed the option to use an empty row limit value (apache#25579)

(cherry picked from commit f556ef5)

* fix(window): unavailable localStorage and sessionStorage (apache#25599)

* fix: finestTemporalGrainFormatter (apache#25618)

(cherry picked from commit 62bffaf)

* fix: revert fix(sqllab): Force trino client async execution (apache#24859) (apache#25541)

(cherry picked from commit e56e0de)

* chore: Updates 3.0.1 CHANGELOG

* fix(sqllab): Mistitled for new tab after rename (apache#25523)

(cherry picked from commit a520124)

* fix(sqllab): template validation error within comments (apache#25626)

(cherry picked from commit b370c66)

* fix: avoid 500 errors with SQLLAB_BACKEND_PERSISTENCE (apache#25553)

(cherry picked from commit 99f79f5)

* fix(import): Make sure query context is overwritten for overwriting imports (apache#25493)

(cherry picked from commit a0a0d80)

* fix: permalink save/overwrites in explore (apache#25112)

Co-authored-by: Elizabeth Thompson <eschutho@gmail.com>
(cherry picked from commit e58a3ab)

* fix(header navlinks): link navlinks to path prefix (apache#25495)

(cherry picked from commit 51c56dd)

* fix: improve upload ZIP file validation (apache#25658)

* fix: warning of nth-child (apache#23638)

(cherry picked from commit 16cc089)

* fix(dremio): Fixes issue with Dremio SQL generation for Charts with Series Limit (apache#25657)

(cherry picked from commit be82657)

---------

Co-authored-by: Beto Dealmeida <roberto@dealmeida.net>
Co-authored-by: John Bodley <4567245+john-bodley@users.noreply.github.com>
Co-authored-by: Zef Lin <zef@preset.io>
Co-authored-by: Kamil Gabryjelski <kamil.gabryjelski@gmail.com>
Co-authored-by: Jack Fragassi <jfragassi98@gmail.com>
Co-authored-by: Michael S. Molina <70410625+michael-s-molina@users.noreply.github.com>
Co-authored-by: JUST.in DO IT <justin.park@airbnb.com>
Co-authored-by: Jack <41238731+fisjac@users.noreply.github.com>
Co-authored-by: Daniel Vaz Gaspar <danielvazgaspar@gmail.com>
Co-authored-by: Stepan <66589759+Always-prog@users.noreply.github.com>
Co-authored-by: Corbin Bullard <corbindbullard@gmail.com>
Co-authored-by: Gyuil Han <cnabro91@gmail.com>
Co-authored-by: Celalettin Calis <celalettin1286@gmail.com>
Co-authored-by: Ville Brofeldt <33317356+villebro@users.noreply.github.com>
Co-authored-by: ʈᵃᵢ <tdupreetan@gmail.com>
Co-authored-by: Michael S. Molina <michael.s.molina@gmail.com>
Co-authored-by: mapledan <mapledan829@gmail.com>
Co-authored-by: Igor Khrol <khroliz@gmail.com>
Co-authored-by: Rui Zhao <105950525+zhaorui2022@users.noreply.github.com>
Co-authored-by: Fabien <18534166+frassinier@users.noreply.github.com>
Co-authored-by: Hugh A. Miles II <hughmil3s@gmail.com>
Co-authored-by: OskarNS <soerensen.oskar@gmail.com>
john-bodley added a commit to john-bodley/superset that referenced this pull request Oct 30, 2023
saghatelian added a commit to 10webio/superset that referenced this pull request Nov 27, 2023
* fix(sqllab): reinstate "Force trino client async execution" (apache#25680)

* fix: remove unnecessary redirect (apache#25679)

(cherry picked from commit da42bf2)

* fix(chore): dashboard requests to database equal the number of slices it has (apache#24709)

(cherry picked from commit 75a7431)

* fix: bump to FAB 4.3.9 remove CSP exception (apache#25712)

(cherry picked from commit 8fb0c8d)

* fix(horizontal filter label): show full tooltip with ellipsis (apache#25732)

(cherry picked from commit e4173d9)

* fix: Revert "fix(Charts): Set max row limit + removed the option to use an empty row limit value" (apache#25753)

(cherry picked from commit e2fe967)

* fix: dataset update uniqueness (apache#25756)

(cherry picked from commit c7f8d11)

* fix(sqllab): slow pop datasource query (apache#25741)

(cherry picked from commit 2a2bc82)

* fix: allow for backward compatible errors (apache#25640)

* fix: DB-specific quoting in Jinja macro (apache#25779)

(cherry picked from commit 5659c87)

* fix: Revert "fix: Apply normalization to all dttm columns (apache#25147)" (apache#25801)

* fix: Resolve issue apache#24195 (apache#25804)

(cherry picked from commit 8737a8a)

* fix(SQL field in edit dataset modal): display full sql query (apache#25768)

(cherry picked from commit 1eba712)

* fix(sqllab): infinite fetching status after results are landed (apache#25814)

(cherry picked from commit 3f28eeb)

* fix: Fires onChange when clearing all values of single select (apache#25853)

(cherry picked from commit 8061d5c)

* fix: the temporal x-axis results in a none time_range. (apache#25429)

Co-authored-by: Elizabeth Thompson <eschutho@gmail.com>
(cherry picked from commit ae619b1)

* fix(table chart): Show Cell Bars correctly apache#25625 (apache#25707)

(cherry picked from commit 916f7bc)

* fix: remove `update_charts_owners` (apache#25843)

* fix(charts): Time grain is None when dataset uses Jinja (apache#25842)

(cherry picked from commit 7536dd1)

* fix: Saving Mixed Chart with dashboard filter applied breaks adhoc_filter_b (apache#25877)

(cherry picked from commit 268c1dc)

* fix: database version field (apache#25898)

(cherry picked from commit 06ffcd2)

* fix: trino cursor (apache#25897)

(cherry picked from commit cdb18e0)

* chore: Updates CHANGELOG.md for 3.0.2

* fix(trino): allow impersonate_user flag to be imported (apache#25872)

Co-authored-by: John Bodley <4567245+john-bodley@users.noreply.github.com>
(cherry picked from commit 458be8c)

* fix(table): Double percenting ad-hoc percentage metrics (apache#25857)

(cherry picked from commit 784a478)

* fix(sqllab): invalid sanitization on comparison symbol (apache#25903)

(cherry picked from commit 581d3c7)

* fix: update flask-caching to avoid breaking redis cache, solves apache#25339 (apache#25947)

Co-authored-by: Ville Brofeldt <33317356+villebro@users.noreply.github.com>

* fix: always denorm column value before querying values (apache#25919)

* chore(colors): Updating Airbnb brand colors (apache#23619)

(cherry picked from commit 6d8424c)

* fix: naming denomalized to denormalized in helpers.py (apache#25973)

(cherry picked from commit 5def416)

* fix(helm): Restart all related deployments when bootstrap script changed (apache#25703)

* fix(rls): Update text from tables to datasets in RLS modal (apache#25997)

(cherry picked from commit 210f1f8)

* fix: Make Select component fire onChange listener when a selection is pasted in (apache#25993)

(cherry picked from commit 5fccf67)

* fix(explore): redandant force param (apache#25985)

(cherry picked from commit e7a1876)

* chore: Optimize fetching samples logic (apache#25995)

(cherry picked from commit 326ac4a)

* fix(native filters): rendering performance improvement by reduce overrendering (apache#25901)

(cherry picked from commit e1d73d5)

* fix: update FAB to 4.3.10, Azure user info fix (apache#26037)

(cherry picked from commit 628cd34)

* chore: Updates CHANGELOG.md for 3.0.2 (rc2)

---------

Co-authored-by: Rob Moore <giftig@users.noreply.github.com>
Co-authored-by: Igor Khrol <igor.khrol@automattic.com>
Co-authored-by: Stepan <66589759+Always-prog@users.noreply.github.com>
Co-authored-by: Daniel Vaz Gaspar <danielvazgaspar@gmail.com>
Co-authored-by: Ross Mabbett <92495987+rtexelm@users.noreply.github.com>
Co-authored-by: Geido <60598000+geido@users.noreply.github.com>
Co-authored-by: Beto Dealmeida <roberto@dealmeida.net>
Co-authored-by: JUST.in DO IT <justin.park@airbnb.com>
Co-authored-by: Elizabeth Thompson <eschutho@gmail.com>
Co-authored-by: John Bodley <4567245+john-bodley@users.noreply.github.com>
Co-authored-by: Michael S. Molina <70410625+michael-s-molina@users.noreply.github.com>
Co-authored-by: mapledan <mapledan829@gmail.com>
Co-authored-by: Arko <90512504+SA-Ark@users.noreply.github.com>
Co-authored-by: Antonio Rivero <38889534+Antonio-RiveroMartnez@users.noreply.github.com>
Co-authored-by: Kamil Gabryjelski <kamil.gabryjelski@gmail.com>
Co-authored-by: Michael S. Molina <michael.s.molina@gmail.com>
Co-authored-by: FGrobelny <150029280+FGrobelny@users.noreply.github.com>
Co-authored-by: Giacomo Barone <46573388+ggbaro@users.noreply.github.com>
Co-authored-by: Ville Brofeldt <33317356+villebro@users.noreply.github.com>
Co-authored-by: Hugh A. Miles II <hughmil3s@gmail.com>
Co-authored-by: josedev-union <70741025+josedev-union@users.noreply.github.com>
Co-authored-by: yousoph <sophieyou12@gmail.com>
Co-authored-by: Jack Fragassi <jfragassi98@gmail.com>
cccs-rc pushed a commit to CybercentreCanada/superset that referenced this pull request Mar 6, 2024
cccs-rc pushed a commit to CybercentreCanada/superset that referenced this pull request Mar 6, 2024
@mistercrunch mistercrunch added 🍒 3.0.1 🍒 3.0.2 🍒 3.0.3 🍒 3.0.4 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 3.1.0 labels Mar 8, 2024
sfirke pushed a commit to sfirke/superset that referenced this pull request Mar 22, 2024
sfirke pushed a commit to sfirke/superset that referenced this pull request Mar 22, 2024
vinothkumar66 pushed a commit to vinothkumar66/superset that referenced this pull request Nov 11, 2024
vinothkumar66 pushed a commit to vinothkumar66/superset that referenced this pull request Nov 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels size/L v3.0 Label added by the release manager to track PRs to be included in the 3.0 branch 🍒 3.0.1 🍒 3.0.2 🍒 3.0.3 🍒 3.0.4 🚢 3.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants