Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BigQuery incremental update with datetime partitioning results in full table scan #795

Closed
1 task done
pcreux opened this issue May 12, 2023 · 3 comments
Closed
1 task done
Labels
bug Something isn't working triage

Comments

@pcreux
Copy link

pcreux commented May 12, 2023

Describe the bug

I have a table partitioned by day on a datetime column. When running an incremental run BigQuery performs a full table scan instead of an optimized partition scan.

This is likely due to a BigQuery bug where the function datetime_trunc ignores partitions.

Example:

-- `events` is partitioned by day on the datetime column `created_at`
select * from events where datetime_trunc(created_at, DAY) in (current_date) -- 650GB
select * from events where timestamp_trunc(created_at, DAY) in (current_date) -- 100MB
select * from events where date(created_at) in (current_date) -- 100MB

Steps to reproduce

The events table is partitioned by day on a datetime column with the following config:

{{ config(
    materialized='incremental',
    unique_key='id',
    partition_by={
      "field": "created_at",
      "data_type": "datetime"
    },
    incremental_strategy = 'insert_overwrite',
    partitions = [ 'current_date', 'date_sub(current_date, interval 1 day)' ]
  )
}}

Running an incremental sync will use datetime_trunc resulting in a full table scan. Here is an extract from the run code:

    merge into `zipline-datawarehouse`.`analytics`.`events` as DBT_INTERNAL_DEST
        using (
  --- SQL MODEL CODE
) as DBT_INTERNAL_SOURCE
        on FALSE
    when not matched by source
         and datetime_trunc(DBT_INTERNAL_DEST.created_at, day) in (
              current_date, date_sub(current_date, interval 1 day)
          )
        then delete
    when not matched then insert
        (COLUMNS)
    values
        (COLUMNS)

Expected results

Optimized partition scan with 100mb usage.

Actual results

Full table scan with 650gb usage.

System information

Which database are you using dbt with?

  • bigquery

The output of dbt --version:

dbt --version
Core:
  - installed: 1.5.0
  - latest:    1.5.0 - Up to date!

Plugins:
  - bigquery: 1.5.0 - Up to date!
  - postgres: 1.5.0 - Up to date!

Additional context

We can work around this issue by setting the partition_by.data_type to timestamp. dbt will happily use timestamp_trunc and BigQuery will only look up the necessary partitions. (Note that this was working in v1.2, it stopped working in v1.4, and it's now working again in v 1.5).

{{ config(
    materialized='incremental',
    unique_key='id',
    partition_by={
      "field": "created_at",
      "data_type": "datetime"
    },
    incremental_strategy = 'insert_overwrite',
    partitions = [ 'current_date', 'date_sub(current_date, interval 1 day)' ]
  )
}}

Related issues

dbt-labs/dbt-core#3386 (comment)
#393

Are you interested in contributing the fix?

I'm using the work around at the moment.

@pcreux pcreux added bug Something isn't working triage labels May 12, 2023
@pgoslatara
Copy link

@pcreux Maybe I'm missing something but why is this issue opened in dbt-utils and not dbt-bigquery?

@pcreux
Copy link
Author

pcreux commented May 15, 2023

@pgoslatara OMG, you're 100% right. I didn't notice that I was opening it in the wrong repo. 🤦

@pcreux pcreux closed this as completed May 15, 2023
@pcreux
Copy link
Author

pcreux commented May 15, 2023

I created an issue in dbt-bigquery: dbt-labs/dbt-bigquery#717

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working triage
Projects
None yet
Development

No branches or pull requests

2 participants