[ADAP-538] [Bug] Date incremental tables scanning more of table than expected #592
Comments
Thanks for raising this @dom-devel ! Do you know of any way, other than bytes processed, to inspect how many partitions are pruned vs. how many are scanned in BigQuery? If there were some method with deterministic results to get the number of skipped partitions vs. scanned ones, that would make it easier to create functional tests for your report (and others).
Hmm, no, unfortunately not :( When we're building models we always do a full refresh and then a non-full refresh as part of the building process. At that point we tend to catch it if the wrong number of partitions is being scanned, but it's definitely not a great process, as these issues often only get spotted on the really big tables, when someone notices we're re-running a 40GB query on an incremental run.
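For what it's worth, bytes billed per job can also be pulled from INFORMATION_SCHEMA after the fact, which makes the comparison a bit less manual. A minimal sketch, assuming the jobs run in the US multi-region and that filtering on the query text is enough to pick out the merge statements:

```sql
-- Inspect bytes processed/billed for recent merge statements in this project.
-- `region-us` is an assumption; use the region your datasets live in.
SELECT
  creation_time,
  job_id,
  total_bytes_processed,
  total_bytes_billed,
  LEFT(query, 120) AS query_preview
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND LOWER(query) LIKE '%merge into%'   -- crude filter for dbt-generated merges
ORDER BY total_bytes_processed DESC
LIMIT 20;
```

If a merge that should only touch a handful of ~2GB partitions shows up here at the full size of the destination table, pruning almost certainly isn't happening.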
The merge query scans both the source table and the target table.
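For context, the dynamic insert_overwrite strategy generates a script roughly of this shape. This is a simplified sketch, not the exact SQL dbt-bigquery emits; the project/dataset/table names are placeholders, and only the column name comes from the report below:

```sql
declare dbt_partitions_for_replacement array<date>;

-- 1. Build the temp table from the incremental select (scans only the new days).
create or replace table `my-project`.`analytics`.`my_model__dbt_tmp` as (
  select *
  from `my-project`.`analytics`.`events`
  where event_date_dt >= datetime_sub(current_datetime(), interval 3 day)
);

-- 2. Work out which partitions the temp table touches.
set (dbt_partitions_for_replacement) = (
  select as struct array_agg(distinct date(event_date_dt))
  from `my-project`.`analytics`.`my_model__dbt_tmp`
);

-- 3. Replace those partitions in the destination. The DBT_INTERNAL_DEST filter
--    below is the predicate this issue is about: if BigQuery cannot use it for
--    partition pruning, the whole destination table gets scanned.
merge into `my-project`.`analytics`.`my_model` as DBT_INTERNAL_DEST
using `my-project`.`analytics`.`my_model__dbt_tmp` as DBT_INTERNAL_SOURCE
  on false
when not matched by source
  and date(DBT_INTERNAL_DEST.event_date_dt) in unnest(dbt_partitions_for_replacement)
  then delete
when not matched then insert row;
```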
I'm also running into this problem of BigQuery running a full scan of the destination table during incremental updates, which is costly when the table is large. I have confirmed that this part of the merge statement has no effect on the amount of data processed by BQ -- BQ says "This query will process X when run", where X equals the current size of the destination table, whether or not that line is commented out. My destination table has the following config:
Any update on the issue?
@ajrheaume This problem occurs when data_type is datetime. I have submitted a pull request for the issue, but it has not been reviewed yet.
@tnk-ysk Where did you find the info below in dbt-labs/dbt-bigquery#993?
I think partition pruning works well when the partition column is datetime & granularity is day, but can I find that in the official documentation?
No official documentation has been found, but I have asked Google's BigQuery team directly to confirm the supported behavior.
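One way to check this empirically is to dry-run the destination-side predicate in both forms against a day-partitioned datetime table and compare the estimated bytes. A sketch with placeholder table and column names (it deliberately doesn't assert which form prunes; that is exactly the question in the linked PR):

```sql
-- Form currently used in the merge predicate: the partition column wrapped in date().
select count(*)
from `my-project`.`analytics`.`my_table`
where date(event_date_dt) in unnest([date '2023-05-01', date '2023-05-02', date '2023-05-03']);

-- Direct range filter on the datetime column covering the same three days.
select count(*)
from `my-project`.`analytics`.`my_table`
where event_date_dt >= datetime '2023-05-01'
  and event_date_dt <  datetime '2023-05-04';
```

Comparing the "This query will process X when run" estimate in the console (or `bq query --dry_run`) for the two statements shows whether the `date()` wrapper is defeating pruning on a given table.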
@tnk-ysk Thanks for the reply. I'll take your feedback into consideration! 🙏
Is this a new bug in dbt-bigquery?
Current Behavior
What's the problem?
I think I'm seeing a variation of this issue?
I have a table where every day's partition is approximately 2GB. Config:
I'm doing incremental builds on this table in order to reduce the size of runs. However, when the incremental build runs it appears to be scanning more of the table than I'd expect.
Here's the whole run (with the delete tmp step stripped so I could inspect the tmp table).
The initial incremental build picks up 3 days here, which is approximately 6GB (as expected).
It then picks the 3 days to be replaced:
The merge query, however, then proceeds to scan 15GB. The 3 days it's replacing are approximately 6GB, so how is this scaling to 15GB?
The previous issue suggested it was a problem with dbt wrapping the partition column in date():
and date(DBT_INTERNAL_DEST.event_date_dt) in unnest(dbt_partitions_for_replacement)
However, in this case removing the date wrapper does not change anything.
I think a temporary fix is to turn on copy partitions, because the first part of the query is approximately 6GB and I think the partition copy jobs are free. (Although if you use copy partitions, dbt doesn't record query size for the first steps, so I'm validating by running the queries manually in BQ and measuring.)
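Roughly what I mean by turning on copy partitions, as a sketch (the source reference and relation names are illustrative, not my actual model; copy_partitions only applies with the insert_overwrite strategy):

```sql
{{
  config(
    materialized = 'incremental',
    incremental_strategy = 'insert_overwrite',
    partition_by = {
      'field': 'event_date_dt',
      'data_type': 'datetime',
      'granularity': 'day',
      'copy_partitions': true
    }
  )
}}

select *
from {{ source('analytics', 'events') }}
{% if is_incremental() %}
  -- only rebuild the most recent days on incremental runs
  where event_date_dt >= datetime_sub(current_datetime(), interval 3 day)
{% endif %}
```

With copy_partitions enabled, dbt still builds the temp table with a query, but then replaces the destination partitions with copy jobs instead of a merge statement.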
But is this a bug? It doesn't feel like intended behaviour.
Expected Behavior
I would expect only 3 partitions to be scanned which would only cost ~ 6GB.
Steps To Reproduce
Relevant log output
No response
Environment
Additional Context
No response