
[ADAP-538] [Bug] Date incremental tables scanning more of table than expected #592

dom-devel opened this issue May 10, 2023 · 8 comments
Labels:
- feature:incremental (Issues related to incremental materializations)
- feature:partitioning (Issues related to table partitioning)
- pkg:dbt-bigquery (Issue affects dbt-bigquery)
- type:bug (Something isn't working as documented)

Comments

dom-devel commented May 10, 2023

Is this a new bug in dbt-bigquery?

  • I believe this is a new bug in dbt-bigquery
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

What's the problem?

I think I'm seeing something similar to this issue?

I have a table where every day's partition is approximately 2GB. Config:

{{
    config(
        materialized="incremental",
        incremental_strategy="insert_overwrite",
        partition_by={
            "field": "event_date_dt",
            "data_type": "date",
        },
        cluster_by=["event_name"],
    )
}}

I'm doing incremental builds on this table in order to reduce the size of runs. However, when the incremental build runs, it appears to scan more of the table than I'd expect.

Here's the whole run (with the delete tmp step stripped so I could inspect the tmp table).

[screenshot of the run output]

The initial incremental build picks up 3 days here which is approximately 6GB (as expected).

It then picks the 3 days to be replaced:

[screenshot of the selected partitions]

The merge query, however, then proceeds to scan 15GB? The 3 days it's replacing are approximately 6GB. How is this scaling to 15GB?


merge into `project_id`.`analytics_models`.`ga4_stg_events` as DBT_INTERNAL_DEST
using (
    select *
    from `project_id`.`analytics_models`.`ga4_stg_events__dbt_tmp`
) as DBT_INTERNAL_SOURCE
on FALSE

when not matched by source
    and date(DBT_INTERNAL_DEST.event_date_dt) in unnest(dbt_partitions_for_replacement)
    then delete

when not matched then insert
    (`event_date_dt`, `event_timestamp`, another 30 columns truncated for brevity)
values
    (`event_date_dt`, `event_timestamp`, another 30 columns truncated for brevity)
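
For context, dbt-bigquery's dynamic insert_overwrite strategy declares and populates dbt_partitions_for_replacement earlier in the same script, roughly like this (a sketch, not the verbatim generated SQL):

declare dbt_partitions_for_replacement array<date>;

set (dbt_partitions_for_replacement) = (
    select as struct
        array_agg(distinct date(event_date_dt))
    from `project_id`.`analytics_models`.`ga4_stg_events__dbt_tmp`
);

Note that the partition filter in the merge is driven by this scripting variable, not by literal dates; the pruning discussion below hinges on that.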

The previous issue suggested the problem was dbt wrapping the partition column in date():

and date(DBT_INTERNAL_DEST.event_date_dt) in unnest(dbt_partitions_for_replacement)

However, in this case removing the date wrapper does not change anything.

I think a temporary fix is to turn on copy_partitions, because the first part of the query is approximately 6GB and I believe partition copies are free. (Although with copy_partitions enabled, dbt doesn't record query size for the first steps, so I'm validating by running the queries manually in BigQuery and measuring.)

But is this a bug? It doesn't feel like intended behaviour.

        partition_by={
            "field": "event_date_dt",
            "data_type": "date",
            "copy_partitions": true
        },

Expected Behavior

I would expect only 3 partitions to be scanned, which would cost only ~6GB.

Steps To Reproduce

  • Generate a table partitioned by date
  • Run an incremental build (a minimal model sketch follows)
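
A minimal model along these lines might look like this (a sketch; all names are hypothetical):

-- models/repro_incremental.sql (hypothetical)
{{
    config(
        materialized="incremental",
        incremental_strategy="insert_overwrite",
        partition_by={"field": "event_date_dt", "data_type": "date"},
    )
}}

-- Spread ~1M synthetic rows across 3 date partitions
select
    date_add(date '2023-05-01', interval cast(floor(rand() * 3) as int64) day) as event_date_dt,
    generate_uuid() as event_id
from unnest(generate_array(1, 1000000))

Run it twice (the second run is the incremental one) and compare the bytes billed by the merge step against the size of the affected partitions.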

Relevant log output

No response

Environment

1. dbt-core: 1.5
2. dbt-bigquery: 1.5
3. OS: Windows 10
4. Python: 3.8.6

Additional Context

No response

dom-devel added the type:bug and triage:product labels May 10, 2023
github-actions bot changed the title to [ADAP-538] [Bug] Date incremental tables scanning more of table than expected May 10, 2023
dbeatty10 (Contributor) commented

Thanks for raising this, @dom-devel!

Do you know of any way, other than bytes processed, to inspect how many partitions are pruned vs. scanned in BigQuery?

If there were some method with deterministic results for getting the number of skipped vs. scanned partitions, it would be easier to create functional tests for your report (and others).
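
For reference, one reproducible (if indirect) way to measure actual scan cost after the fact is BigQuery's INFORMATION_SCHEMA jobs views (a sketch; the project and region qualifiers are placeholders):

-- Bytes processed/billed for recent MERGE jobs in this project
select
    job_id,
    creation_time,
    total_bytes_processed,
    total_bytes_billed
from `my-project`.`region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
where creation_time > timestamp_sub(current_timestamp(), interval 1 day)
    and statement_type = 'MERGE'
order by creation_time desc;

This is still bytes-based rather than a partition count, so it doesn't fully answer the deterministic-test question.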

dom-devel (Author) commented

Hmm, no, unfortunately not :(

When we're building models, we always do a full-refresh run and then a non-full-refresh run as part of the build process.

At that point we tend to catch it if the wrong number of partitions is being scanned, but it's definitely not a great process, as these issues often only get spotted on really big tables, when someone notices we're re-running a 40GB query on an incremental run.

github-actions bot added the triage:product label and removed the triage:awaiting-response label May 16, 2023
dbeatty10 added the feature:partitioning label May 16, 2023
dbeatty10 removed the triage:product label May 17, 2023
tnk-ysk (Contributor) commented Oct 28, 2023

The merge query scans both the source table and the target table. I think the combined size of the two tables is probably the 15GB you're seeing.
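
If that's right, the two table sizes should roughly add up to the merge job's bytes. One way to check (a sketch; the region qualifier is a placeholder):

-- Compare the two tables' logical sizes against the merge job's bytes
select
    table_schema,
    table_name,
    total_logical_bytes
from `project_id`.`region-us`.INFORMATION_SCHEMA.TABLE_STORAGE
where table_name in ('ga4_stg_events', 'ga4_stg_events__dbt_tmp');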

ajrheaume commented
I'm also running into this problem of BigQuery running a full scan of the destination table during incremental updates, which is costly when the table is large. I have confirmed that this part of the merge into statement:

and datetime_trunc(DBT_INTERNAL_DEST.dt_column, day) in unnest(dbt_partitions_for_replacement)

has no effect on the amount of data processed by BQ -- BQ says "This query will process X when run", where X equals the current size of the destination table, whether or not the above line is commented out.

My destination table has the following config:

{{
    config(
        materialized='incremental',
        incremental_strategy='insert_overwrite',
        partition_by={
            "field": "dt_column",
            "data_type": "datetime",
            "granularity": "day"
        },
    )
}}

Any update on the issue?

tnk-ysk (Contributor) commented Mar 7, 2024

@ajrheaume
Your problem is probably related to dbt-labs/dbt-bigquery#717; it isn't related to this issue.

The problem occurs when data_type is datetime.

I have submitted a pull request for it, but it hasn't been reviewed yet:
dbt-labs/dbt-bigquery#993

@dbeatty10 dbeatty10 added the feature:incremental Issues related to incremental materializations label May 30, 2024
jx2lee (Contributor) commented Aug 16, 2024

@tnk-ysk
Hi, I have a question.

Where did you find the info below in dbt-labs/dbt-bigquery#993?

| data_type | granularity | transform | description |
| --- | --- | --- | --- |
| datetime | month | datetime_trunc(column, month) | full scan |
| datetime | day | datetime_trunc(column, day) | full scan |
| datetime | hour | datetime_trunc(column, hour) | full scan |
| datetime | month | date_trunc(column, month) | valid partition filter |
| datetime | day | date_trunc(column, day) | valid partition filter |
| datetime | hour | date_trunc(column, hour) | valid partition filter, but not supported |

I think partition pruning works well when the partition column is datetime with day granularity, but I don't understand how pruning partitions with date_trunc works.

Can I find this in the official documentation?

tnk-ysk (Contributor) commented Aug 19, 2024

@jx2lee

Can I find this in the official documentation?

I haven't found any official documentation, but I asked Google's BigQuery team directly to confirm the supported behavior. Google's answer was that only date_trunc is supported, and to submit a feature request if you want more.
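
Concretely, the distinction in the table above looks like this on a day-partitioned DATETIME column (a sketch; table and column names are hypothetical, and the pruning behavior is as reported in this thread, not officially documented):

-- Reported to trigger a full scan of the partitioned table:
select count(*)
from `my-project.my_dataset.events`
where datetime_trunc(dt_column, day) = '2024-08-01';

-- Reported to prune down to the matching partition:
select count(*)
from `my-project.my_dataset.events`
where date_trunc(dt_column, day) = '2024-08-01';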

jx2lee (Contributor) commented Aug 19, 2024

@tnk-ysk Thanks for the reply. I'll take your feedback into consideration! 🙏

mikealfare added the pkg:dbt-bigquery label Jan 14, 2025
mikealfare transferred this issue from dbt-labs/dbt-bigquery Jan 14, 2025