Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ADAP-551] [Bug] Datetime incremental tables scanning more of table than expected #717

Closed
2 tasks done
pcreux opened this issue May 15, 2023 · 7 comments
Closed
2 tasks done
Labels
feature:partitioning Related to creating, replacing, or pruning partitions to avoid full table scans Stale type:bug Something isn't working

Comments

@pcreux
Copy link

pcreux commented May 15, 2023

Is this a new bug in dbt-bigquery?

  • I believe this is a new bug in dbt-bigquery
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

I have a table partitioned by day on a datetime column. When running an incremental run BigQuery performs a full table scan instead of an optimized partition scan.

This is likely due to a BigQuery bug where the function datetime_trunc ignores partitions.

Example:

-- `events` is partitioned by day on the datetime column `created_at`
select * from events where datetime_trunc(created_at, DAY) in (current_date) -- 650GB
select * from events where timestamp_trunc(created_at, DAY) in (current_date) -- 100MB
select * from events where date(created_at) in (current_date) -- 100MB

Expected Behavior

Optimized partition scan with 100mb usage but I'm seeing a full table scan with 650gb usage.

Steps To Reproduce

The events table is partitioned by day on a datetime column with the following config:

{{ config(
    materialized='incremental',
    unique_key='id',
    partition_by={
      "field": "created_at",
      "data_type": "datetime"
    },
    incremental_strategy = 'insert_overwrite',
    partitions = [ 'current_date', 'date_sub(current_date, interval 1 day)' ]
  )
}}

Running an incremental sync will use datetime_trunc resulting in a full table scan. Here is an extract from the run code:

    merge into `zipline-datawarehouse`.`analytics`.`events` as DBT_INTERNAL_DEST
        using (
  --- SQL MODEL CODE
) as DBT_INTERNAL_SOURCE
        on FALSE
    when not matched by source
         and datetime_trunc(DBT_INTERNAL_DEST.created_at, day) in (
              current_date, date_sub(current_date, interval 1 day)
          )
        then delete
    when not matched then insert
        (COLUMNS)
    values
        (COLUMNS)

Relevant log output

No response

Environment

- OS: MacOS
- Python: 3.10.9
- dbt-core: 1.5.0
- dbt-bigquery: 1.5.0

Additional Context

We can work around this issue by setting the partition_by.data_type to timestamp. dbt will happily use timestamp_trunc and BigQuery will only look up the necessary partitions. (Note that this was working in v1.2, it stopped working in v1.4, and it's now working again in v 1.5).

{{ config(
    materialized='incremental',
    unique_key='id',
    partition_by={
      "field": "created_at",
      "data_type": "timestamp"
    },
    incremental_strategy = 'insert_overwrite',
    partitions = [ 'current_date', 'date_sub(current_date, interval 1 day)' ]
  )
}}

A similar issue with date partitioning but it doesn't seem to be the same cause: dbt-labs/dbt-adapters#592

A mention of the bug where datetime_trunc makes BigQuery perform full table scans: dbt-labs/dbt-utils#393

@pcreux pcreux added type:bug Something isn't working triage:product labels May 15, 2023
@github-actions github-actions bot changed the title [Bug] Datetime incremental tables scanning more of table than expected [ADAP-551] [Bug] Datetime incremental tables scanning more of table than expected May 15, 2023
@dbeatty10
Copy link
Contributor

Thanks for reaching out @pcreux !

I'm not a seeing a difference between the config you reported in the "Steps To Reproduce" vs. the workout in the "Additional Context" section.

Is it as simple as this, or is there more involved?

- "data_type": "datetime"
+ "data_type": "timestamp"

Could you take a peek and update as needed?

@pcreux
Copy link
Author

pcreux commented May 15, 2023

Hey @dbeatty10 ! You are right, it's indeed "data_type": "timestamp". I just updated the code snippet.

@dbeatty10
Copy link
Contributor

This is likely due to a BigQuery bug where the function datetime_trunc ignores partitions.

@pcreux do you have a link to this bug report on the Google side of things? It will be good to hear what Google says about it.

If their guidance is to just use timestamp_trunc instead of datetime_trunc when the underlying data type of the partitioned column is datetime, then the fix would fairly simple here:

if self.data_type_should_be_truncated():
return f"{self.data_type}_trunc({column}, {self.granularity})"
else:
return column

Probably something like:

        if self.data_type_should_be_truncated() and self.data_type == "datetime":
            return f"timestamp_trunc({column}, {self.granularity})"
        elif self.data_type_should_be_truncated():
            return f"{self.data_type}_trunc({column}, {self.granularity})"
        else:
            return column

@pcreux
Copy link
Author

pcreux commented May 16, 2023

@pcreux do you have a link to this bug report on the Google side of things? It will be good to hear what Google says about it.

I didn't find one on https://issuetracker.google.com/savedsearches/559654?pli=1&q=(componentid:187149%2B%20status:open)%20OR%20(componentid:187065%2B%20customfield82940:%22BigQuery%22)%20datetime_trunc but I might not be looking at the right place.

I saw mentions of this potential bug in dbt-labs/dbt-utils#393 (comment).

@dbeatty10 dbeatty10 added the feature:partitioning Related to creating, replacing, or pruning partitions to avoid full table scans label May 16, 2023
@github-christophe-oudar
Copy link
Contributor

I had a few changes to solve issues with trunc functions in 7c21644
Could you try 1.6.X version to see if it works better for you?

Copy link
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the Stale label Dec 19, 2023
Copy link
Contributor

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Dec 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature:partitioning Related to creating, replacing, or pruning partitions to avoid full table scans Stale type:bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants