
[ADAP-442] [CT-2400] [Bug] Extra timestamp function cause extra data cost in incremental merge operation #660

Closed
2 tasks done
zhanghongkai opened this issue Apr 11, 2023 · 1 comment
Labels
type:bug Something isn't working

Comments

@zhanghongkai

Is this a new bug in dbt-core?

  • I believe this is a new bug in dbt-core
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

For a timestamp-partitioned incremental model, the merge SQL generated by dbt wraps the destination table's timestamp column in an extra layer of timestamp (it looks like timestamp(timestamp_trunc(DBT_INTERNAL_DEST.partition_column, day)), where the outer timestamp function is unnecessary). I have tested in BigQuery that this causes BigQuery to scan the whole destination table, which incurs extra data usage. In some extreme cases, this merge step scans more data than a full refresh.
This does not cause an error - it just consumes extra data quota in BigQuery.
Looking at older issues like dbt-labs/dbt-core#3284, the run target did not contain this extra layer of timestamp, but I cannot be sure when it first appeared.
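For context, BigQuery can generally only prune timestamp partitions when the filter references the partition column directly (or through a narrow set of pruning-compatible functions such as timestamp_trunc); wrapping the column in a further function call defeats pruning. A hedged sketch of the difference (table and column names are hypothetical):

```sql
-- Can prune partitions: predicate on the partition column via timestamp_trunc only
select count(*) from my_table
where timestamp_trunc(session_start, day) = timestamp(current_date);

-- Scans all partitions: extra timestamp() wrapper around the partition column
select count(*) from my_table
where timestamp(timestamp_trunc(session_start, day)) = timestamp(current_date);
```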

Expected Behavior

I hope this extra timestamp layer can be removed so that BigQuery can actually use partition pruning during the merge operation.
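Concretely, I would expect the delete condition to look something like this (the same condition dbt generates today, minus the outer timestamp() wrapper):

```sql
when not matched by source
     and timestamp_trunc(DBT_INTERNAL_DEST.session_start, day) in (
          timestamp(current_date), timestamp(date_sub(current_date, interval 1 day))
      )
    then delete
```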

Steps To Reproduce

For testing purposes, I created this model by copying the exact static partition config from the v1.4 docs: https://docs.getdbt.com/reference/resource-configs/bigquery-configs#the-insert_overwrite-strategy

{% set partitions_to_replace = [
  'timestamp(current_date)',
  'timestamp(date_sub(current_date, interval 1 day))'
] %}

{{
  config(
    materialized = 'incremental',
    incremental_strategy = 'insert_overwrite',
    partition_by = {'field': 'session_start', 'data_type': 'timestamp'},
    partitions = partitions_to_replace
  )
}}

select current_timestamp() as session_start

The compiled merge statement in the dbt run target is:

merge into my_table_name as DBT_INTERNAL_DEST
    using (
        select current_timestamp() as session_start
    ) as DBT_INTERNAL_SOURCE
    on FALSE

    when not matched by source
         and **timestamp**(timestamp_trunc(DBT_INTERNAL_DEST.session_start, day)) in (
              timestamp(current_date), timestamp(date_sub(current_date, interval 1 day))
          )
        then delete

    when not matched then insert
        (`session_start`)
    values
        (`session_start`)

The highlighted timestamp is the extra layer causing the cost issue.

Relevant log output

No response

Environment

- OS: MacOS 12.4
- Python: 3.9.16
- dbt: 1.4.4

Which database adapter are you using with dbt?

bigquery

Additional Context

No response

@zhanghongkai zhanghongkai added type:bug Something isn't working triage:product labels Apr 11, 2023
@github-actions github-actions bot changed the title [Bug] Extra timestamp function cause extra data cost in incremental merge operation [CT-2400] [Bug] Extra timestamp function cause extra data cost in incremental merge operation Apr 11, 2023
@dbeatty10 dbeatty10 transferred this issue from dbt-labs/dbt-core Apr 11, 2023
@github-actions github-actions bot changed the title [CT-2400] [Bug] Extra timestamp function cause extra data cost in incremental merge operation [ADAP-442] [CT-2400] [Bug] Extra timestamp function cause extra data cost in incremental merge operation Apr 11, 2023
@zhanghongkai
Author

Sorry, my bad - I didn't know I should have searched for issues here. This looks like the same issue as #536, which is already marked as solved.
