
[ADAP-538] [Bug] Date incremental tables scanning more of table than expected #592

dom-devel opened this issue May 10, 2023 · 8 comments
Labels:
- feature:incremental (Issues related to incremental materializations)
- feature:partitioning (Issues related to table partitioning)
- pkg:dbt-bigquery (Issue affects dbt-bigquery)
- type:bug (Something isn't working as documented)

Comments

dom-devel commented May 10, 2023

Is this a new bug in dbt-bigquery?

  • I believe this is a new bug in dbt-bigquery
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

What's the problem?

I think I'm seeing something similar to this issue?

I have a table where every day's partition is approximately 2GB. Config:

{{
    config(
        materialized="incremental",
        incremental_strategy="insert_overwrite",
        partition_by={
            "field": "event_date_dt",
            "data_type": "date",
        },
        cluster_by=["event_name"],
    )
}}

I'm doing incremental builds on this table in order to reduce the size of runs. However, when the incremental build runs, it appears to scan more of the table than I'd expect.

Here's the whole run (with the delete tmp step stripped so I could inspect the tmp table).

[screenshot of the run output]

The initial incremental build picks up 3 days here which is approximately 6GB (as expected).

It then picks the 3 days to be replaced:

[screenshot of the selected partitions]

The merge query, however, then proceeds to scan 15GB? The 3 days it's replacing are approximately 6GB. How is this scaling to 15GB?


merge into `project_id`.`analytics_models`.`ga4_stg_events` as DBT_INTERNAL_DEST
using (
    select *
    from `project_id`.`analytics_models`.`ga4_stg_events__dbt_tmp`
) as DBT_INTERNAL_SOURCE
on FALSE

when not matched by source
    and date(DBT_INTERNAL_DEST.event_date_dt) in unnest(dbt_partitions_for_replacement)
    then delete

when not matched then insert
    (`event_date_dt`, `event_timestamp`, another 30 columns truncated for brevity)
values
    (`event_date_dt`, `event_timestamp`, another 30 columns truncated for brevity)
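
For context, dbt-bigquery's dynamic insert_overwrite strategy declares and populates dbt_partitions_for_replacement earlier in the same script, roughly like this (a sketch, not the verbatim generated SQL):

declare dbt_partitions_for_replacement array<date>;

set (dbt_partitions_for_replacement) = (
    select as struct
        array_agg(distinct date(event_date_dt))
    from `project_id`.`analytics_models`.`ga4_stg_events__dbt_tmp`
);

Note that the partition filter in the merge is driven by this scripting variable, not by literal dates; the pruning discussion below hinges on that.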

The previous issue suggested the problem was dbt wrapping the partition column in date():

and date(DBT_INTERNAL_DEST.event_date_dt) in unnest(dbt_partitions_for_replacement)

However, in this case removing the date wrapper does not change anything.

I think a temporary fix is to turn on copy_partitions, because the first part of the query is approximately 6GB and I believe partition copies are free. (Although with copy_partitions enabled, dbt doesn't record query size for the first steps, so I'm validating by running the queries manually in BigQuery and measuring.)

But is this a bug? It doesn't feel like intended behaviour.

        partition_by={
            "field": "event_date_dt",
            "data_type": "date",
            "copy_partitions": true
        },

Expected Behavior

I would expect only 3 partitions to be scanned, which would cost only ~6GB.

Steps To Reproduce

  • Generate a table partitioned by date
  • Run an incremental build (a minimal model sketch follows)
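
A minimal model along these lines might look like this (a sketch; all names are hypothetical):

-- models/repro_incremental.sql (hypothetical)
{{
    config(
        materialized="incremental",
        incremental_strategy="insert_overwrite",
        partition_by={"field": "event_date_dt", "data_type": "date"},
    )
}}

-- Spread ~1M synthetic rows across 3 date partitions
select
    date_add(date '2023-05-01', interval cast(floor(rand() * 3) as int64) day) as event_date_dt,
    generate_uuid() as event_id
from unnest(generate_array(1, 1000000))

Run it twice (the second run is the incremental one) and compare the bytes billed by the merge step against the size of the affected partitions.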

Relevant log output

No response

Environment

1. dbt-core: 1.5
2. dbt-bigquery: 1.5
3. OS: Windows 10
4. Python: 3.8.6

Additional Context

No response

dom-devel added the type:bug and triage:product labels May 10, 2023
github-actions bot changed the title to [ADAP-538] [Bug] Date incremental tables scanning more of table than expected May 10, 2023
dbeatty10 (Contributor) commented

Thanks for raising this, @dom-devel!

Do you know of any way, other than bytes processed, to inspect how many partitions are pruned vs. scanned in BigQuery?

If there were some method with deterministic results for getting the number of skipped vs. scanned partitions, it would be easier to create functional tests for your report (and others).
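
For reference, one reproducible (if indirect) way to measure actual scan cost after the fact is BigQuery's INFORMATION_SCHEMA jobs views (a sketch; the project and region qualifiers are placeholders):

-- Bytes processed/billed for recent MERGE jobs in this project
select
    job_id,
    creation_time,
    total_bytes_processed,
    total_bytes_billed
from `my-project`.`region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
where creation_time > timestamp_sub(current_timestamp(), interval 1 day)
    and statement_type = 'MERGE'
order by creation_time desc;

This is still bytes-based rather than a partition count, so it doesn't fully answer the deterministic-test question.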

dom-devel (Author) commented

Hmm, no, unfortunately not :(

When we're building models, we always do a full-refresh run and then a non-full-refresh run as part of the build process.

At that point we tend to catch it if the wrong number of partitions is being scanned, but it's definitely not a great process, as these issues often only get spotted on really big tables, when someone notices we're re-running a 40GB query on an incremental run.

github-actions bot added the triage:product label and removed the triage:awaiting-response label May 16, 2023
dbeatty10 added the feature:partitioning label May 16, 2023
dbeatty10 removed the triage:product label May 17, 2023
tnk-ysk (Contributor) commented Oct 28, 2023

The merge query scans both the source table and the target table. I think the combined size of the two tables is probably the 15GB you're seeing.
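
If that's right, the two table sizes should roughly add up to the merge job's bytes. One way to check (a sketch; the region qualifier is a placeholder):

-- Compare the two tables' logical sizes against the merge job's bytes
select
    table_schema,
    table_name,
    total_logical_bytes
from `project_id`.`region-us`.INFORMATION_SCHEMA.TABLE_STORAGE
where table_name in ('ga4_stg_events', 'ga4_stg_events__dbt_tmp');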

ajrheaume commented
I'm also running into this problem of BigQuery running a full scan of the destination table during incremental updates, which is costly when the table is large. I have confirmed that this part of the merge into statement:

and datetime_trunc(DBT_INTERNAL_DEST.dt_column, day) in unnest(dbt_partitions_for_replacement)

has no effect on the amount of data processed by BQ -- BQ says "This query will process X when run", where X equals the current size of the destination table, whether or not the above line is commented out.

My destination table has the following config:

{{
    config(
        materialized='incremental',
        incremental_strategy='insert_overwrite',
        partition_by={
            "field": "dt_column",
            "data_type": "datetime",
            "granularity": "day"
        },
    )
}}

Any update on the issue?

tnk-ysk (Contributor) commented Mar 7, 2024

@ajrheaume
Your problem is probably related to dbt-labs/dbt-bigquery#717; it isn't related to this issue.

The problem occurs when data_type is datetime.

I have submitted a pull request for it, but it hasn't been reviewed yet:
dbt-labs/dbt-bigquery#993

@dbeatty10 dbeatty10 added the feature:incremental Issues related to incremental materializations label May 30, 2024
jx2lee (Contributor) commented Aug 16, 2024

@tnk-ysk
Hi, I have a question.

Where did you find the info below in dbt-labs/dbt-bigquery#993?

| data_type | granularity | transform | description |
| --- | --- | --- | --- |
| datetime | month | datetime_trunc(column, month) | full scan |
| datetime | day | datetime_trunc(column, day) | full scan |
| datetime | hour | datetime_trunc(column, hour) | full scan |
| datetime | month | date_trunc(column, month) | valid partition filter |
| datetime | day | date_trunc(column, day) | valid partition filter |
| datetime | hour | date_trunc(column, hour) | valid partition filter, but not supported |

I think partition pruning works well when the partition column is datetime with day granularity, but I don't understand how pruning partitions with date_trunc works.

Can I find this in the official documentation?

tnk-ysk (Contributor) commented Aug 19, 2024

@jx2lee

Can I find this in the official documentation?

I haven't found any official documentation, but I asked Google's BigQuery team directly to confirm the supported behavior. Google's answer was that only date_trunc is supported, and to submit a feature request if you want more.
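
Concretely, the distinction in the table above looks like this on a day-partitioned DATETIME column (a sketch; table and column names are hypothetical, and the pruning behavior is as reported in this thread, not officially documented):

-- Reported to trigger a full scan of the partitioned table:
select count(*)
from `my-project.my_dataset.events`
where datetime_trunc(dt_column, day) = '2024-08-01';

-- Reported to prune down to the matching partition:
select count(*)
from `my-project.my_dataset.events`
where date_trunc(dt_column, day) = '2024-08-01';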

jx2lee (Contributor) commented Aug 19, 2024

@tnk-ysk Thanks for the reply. I'll take your feedback into consideration! 🙏

mikealfare added the pkg:dbt-bigquery label Jan 14, 2025
mikealfare transferred this issue from dbt-labs/dbt-bigquery Jan 14, 2025