
BQ partition configs support #3386

Closed
wants to merge 8 commits

Conversation

prratek (Contributor) commented May 23, 2021

resolves #3016

Description

#2928 added support for require_partition_filter and partition_expiration_days in the BigQuery adapter. As outlined in #3016, this PR ensures that the new config options work with all incremental strategies available for BigQuery today.

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt next" section.

cla-bot added the cla:yes label May 23, 2021
{% set predicates = [] %}
{% if is_partition_filter_required %}
{%- set partition_filter -%}
(DBT_INTERNAL_DEST.{{ partition_by.field }} is not null or DBT_INTERNAL_DEST.{{ partition_by.field }} is null)
{%- endset -%}
{% do predicates.append(partition_filter) %}
{% endif %}
prratek (Author) commented:

I wasn't sure if there was a better way to get the DBT_INTERNAL_DEST alias in there than just typing it out. I tried {{ partition_by.render(alias='DBT_INTERNAL_DEST') }} but that renders to timestamp_trunc(my_partition_col, date) when partitioning by date on a timestamp col, and BigQuery doesn't like that as a partition filter.

jtcohen6 (Contributor) commented May 25, 2021

and BigQuery doesn't like that as a partition filter

Really? It works as a partition by expression, but not as a partition-pruning filter? That's... too bad.

In any case, the approach you've taken here seems fine by me

prratek (Author) commented May 23, 2021

Okay, a few notes:

  1. Sorry for the large diff! The inconsistent indentation was driving me nuts.
  2. I'd love some guidance on writing a test case to make sure this works for the merge strategy. I tested this locally by putting together a minimal dbt project with an incremental model that uses these configs. I've worked with the integration test suite once before, but I still don't feel too comfortable navigating everything under test/integration.

prratek marked this pull request as draft May 23, 2021 01:59
prratek changed the title from "Bq partition configs support" to "BQ partition configs support" May 24, 2021
jtcohen6 (Contributor) commented:

@prratek Thanks so much for taking this on!

As far as testing this functionality, the right place might just be an extension of 022_bigquery_test/test_incremental_strategies.py. There are a bunch of models here that mix and match incremental strategies and partition configurations. I think you could extend the TestBigQueryScripting case, with one crucial adjustment: turn on require_partition_filter for all the models.

@property
def project_config(self):
    return {
        'config-version': 2,
        'seeds': {
            '+quote_columns': False,
        },
        'models': {
            '+require_partition_filter': True
        },
    }

If every model / strategy / partition combo runs with that config turned on, we'll know that this change has been successful. (Conversely, without the change in this PR, some of those models should fail.)

prratek (Author) commented May 28, 2021

That makes sense! A couple of thoughts:

  1. I saw your comment in #3016 about having temp tables not require a partition filter more generally, and that seems reasonable. Could that be accomplished by just changing this if statement? Something like:

if config.get('require_partition_filter') and not temporary:

  2. This colon looks suspicious - is that valid Jinja?

jtcohen6 (Contributor) commented Jun 1, 2021

I can't see what line you linked to in item 1; I think we're on the same page, but just to be safe, I'll say what I'm thinking.

The operative logic is going to be in the BigQuery plugin, rather than the default, since BigQuery implements its own bigquery__create_table_as. That macro calls bigquery_table_options, which in turn shells out to a python adapter method, get_table_options. In particular, we'd want to change this line in the way you recommend:
https://github.com/fishtown-analytics/dbt/blob/d89e1d7f850ba5a48e66cfabf0dcfe14afc32ccc/plugins/bigquery/dbt/adapters/bigquery/impl.py#L797-L799

if config.get('require_partition_filter') and not temporary:
    opts['require_partition_filter'] = config.get(
        'require_partition_filter')

There's something funny about the way we've implemented temp tables on BigQuery, because (a) our implementation predated "true" temp tables in BQ, and (b) "true" temp tables are only supported in scripting-style queries, which the snapshot materialization can't quite be today. So instead of "true" temp tables, dbt creates real tables with 12-hour expiration windows. That doesn't change the substance of the change you need to make here, I just wanted to make sure you weren't staring at the code in confusion.
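
For reference, the "fake" temp table amounts to something like this (a sketch with illustrative names; the exact option rendering comes from bigquery_table_options):

-- a real table that cleans itself up after 12 hours, in lieu of a true TEMP table
create or replace table `my_project`.`my_dataset`.`my_model__dbt_tmp`
options (
    expiration_timestamp = timestamp_add(current_timestamp(), interval 12 hour)
)
as (
    select 1 as id, cast('2020-01-01' as datetime) as date_time
);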

For the integration test, rather than turning on require_partition_filter for the existing test case, could you create a new test case that inherits from the existing one, with require_partition_filter turned on? I want to make sure this works both with and without the config for all incremental strategies.
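
Something like the following, reusing the project_config from earlier in the thread (the class name here is illustrative; the parent is the existing case in 022_bigquery_test/test_incremental_strategies.py):

# Sketch: inherit the existing strategies test case and flip the config on;
# the models, seeds, and assertions are all reused from the parent unchanged
class TestBigQueryScriptingRequirePartitionFilter(TestBigQueryScripting):
    @property
    def project_config(self):
        return {
            'config-version': 2,
            'seeds': {
                '+quote_columns': False,
            },
            'models': {
                '+require_partition_filter': True,
            },
        }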

Last but not least, as far as that colon in Jinja: I hear you! It works either way, though.

{% if True: %} select 1 as fun {% endif %}
{% if True %} select 1 as fun {% endif %}

I figure some folks prefer the colon because it looks more like a python conditional:

if True:
  return "select 1 as fun"

prratek (Author) commented Jun 10, 2021

Sorry it took me a while to get back to this, but I think I'm almost there:

  1. For the integration tests, I had to navigate the fact that BigQuery can't use the result of a subquery to prune partitions, and that the query used to get the most recent row from the table must itself also include a partition filter (see the sketch after this list). The test as it stands feels a little hacky, but it works. Open to any suggestions!
  2. @jtcohen6 do you know what _dbt_max_partition is here? A couple of the models in that directory use it, and it doesn't prune partitions, but I'd want to understand what it's doing before changing anything.
  3. Was there anything else you think I should write tests for? You mentioned a possible test case to ensure the static insert_overwrite strategy works, and I think this model should cover that.
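
A sketch of the shape item 1 describes, with illustrative model and column names (events, date_day): the is_incremental() lookup carries its own partition filter, since the subquery result alone wouldn't count as a filter usable for partition elimination.

select * from {{ ref('events') }}

{% if is_incremental() %}
where date_day >= (
    -- the lookup against {{ this }} needs its own partition filter;
    -- "is not null" should be enough to satisfy require_partition_filter
    select max(date_day) from {{ this }}
    where date_day is not null
)
{% endif %}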

jtcohen6 (Contributor) commented Jun 16, 2021

  1. For the integration tests, I had to navigate the fact that BigQuery can't use the result of a subquery to prune partitions and that the query used to get the most recent row from the table must itself also include a partition filter. The test as it stands feels a little hacky but it works. Open to any suggestions!

Nice work on this! Looks similar to what we've done on other projects/packages for BigQuery.

  2. @jtcohen6 do you know what _dbt_max_partition is here? A couple of the models in that directory use it and it doesn't prune partitions but I'd want to understand what it's doing before changing anything.

In the "dynamic" version of insert_overwrite, _dbt_max_partition is included in the script to calculate the max partition value. So the query to calculate _dbt_max_partition may indeed be the guilty party here, not be pruning partitions, since it's just select max(partition_col). I would think that the resulting value should effectively prune partitions within the model SQL. If it doesn't, well... we already know that this approach to _dbt_max_partition is not ideal: https://github.com/fishtown-analytics/dbt/issues/2278

  3. Was there anything else you think I should write tests for? You mentioned a possible test case to ensure the static insert_overwrite strategy works and I think this model should cover that.

I think you've got it covered!

jtcohen6 (Contributor) commented:
@prratek I looked a bit more into this, and I've found something pretty wacky: wrapping a datetime-type partition column in date(...) and timestamp_trunc(..., day) both work just fine for partition filtering and elimination, but when I wrap it in datetime_trunc(...), it doesn't work!

-- models/my_model.sql
{{
    config(
        materialized="incremental",
        incremental_strategy="insert_overwrite",
        partition_by={
            "field": "date_time",
            "data_type": "datetime"
        },
        require_partition_filter = True
    )
}}

select 1 as id, cast('2020-01-01' as datetime) as date_time
-- excerpted from insert_overwrite script, which I copy-pasted into BigQuery console to confirm

merge into `my_model` as DBT_INTERNAL_DEST
        using ( ... ) as DBT_INTERNAL_SOURCE
        on FALSE

    when not matched by source
         and datetime_trunc(DBT_INTERNAL_DEST.date_time, day) in ('2020-01-01')
        then delete

    when not matched then insert
        (`id`, `date_time`)
    values
        (`id`, `date_time`)
  Query error: Cannot query over table 'my_model' without a filter over column(s) 'date_time' that can be used for partition elimination at [42:5]
  compiled SQL at target/run/my_project/models/my_model.sql

This sure looks like a BigQuery bug, doesn't it?

In the meantime, we could work around it by adding another filter:

    when not matched by source
         and datetime_trunc(DBT_INTERNAL_DEST.date_time, day) in ('2020-01-01', '2020-01-02')
         and DBT_INTERNAL_DEST.date_time is not null
        then delete

This enables the query to succeed, but it will still process more bytes than it should, since datetime_trunc doesn't work for partition elimination. To repeat, I'm only seeing this issue with datetime_trunc; both date and timestamp column types/functions seem to work just fine with partition filtering.
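
In other words, leaning on the observation above that wrapping the column in date(...) does work for both filtering and elimination, the delete predicate could instead be rewritten as (a sketch):

    when not matched by source
         and date(DBT_INTERNAL_DEST.date_time) in ('2020-01-01', '2020-01-02')
        then delete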

jtcohen6 (Contributor) commented:
@prratek Is this something you'd still be interested in contributing? I'd be willing to find a way to work around the failing test re: filtering with datetime_trunc, since that really does feel like an undocumented limitation (or bug) with BigQuery.

jtcohen6 (Contributor) commented:
Closing in favor of dbt-labs/dbt-bigquery#65
