
BQ partition configs support #3386

Closed
wants to merge 8 commits

Conversation

prratek (Contributor) commented May 23, 2021

resolves #3016

Description

#2928 added support for require_partition_filter and partition_expiration_days in the BigQuery adapter. As outlined in #3016, this PR ensures that the new config options work with all incremental strategies available for BigQuery today.

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt next" section.

cla-bot added the cla:yes label May 23, 2021
{% set predicates = [] %}
{% if is_partition_filter_required %}
{%- set partition_filter -%}
(DBT_INTERNAL_DEST.{{ partition_by.field }} is not null or DBT_INTERNAL_DEST.{{ partition_by.field }} is null)
{%- endset -%}
{% do predicates.append(partition_filter) %}
{% endif %}
prratek (Author) commented:

I wasn't sure if there was a better way to get the DBT_INTERNAL_DEST alias in there than just typing it out. I tried {{ partition_by.render(alias='DBT_INTERNAL_DEST') }} but that renders to timestamp_trunc(my_partition_col, date) when partitioning by date on a timestamp col, and BigQuery doesn't like that as a partition filter.

jtcohen6 (Contributor) commented May 25, 2021

and BigQuery doesn't like that as a partition filter

Really? It works as a partition by expression, but not as a partition-pruning filter? That's... too bad.

In any case, the approach you've taken here seems fine by me

prratek (Author) commented May 23, 2021

Okay, a few notes:

  1. Sorry for the large diff! The inconsistent indentation was driving me nuts.
  2. I'd love some guidance on writing a test case to make sure this works for the merge strategy. I tested this locally by putting together a minimal dbt project with an incremental model that uses these configs. I've worked with the integration test suite once before, but I still don't feel too comfortable navigating everything under test/integration.

prratek marked this pull request as draft May 23, 2021 01:59
prratek changed the title from "Bq partition configs support" to "BQ partition configs support" May 24, 2021
jtcohen6 (Contributor) commented:

@prratek Thanks so much for taking this on!

As far as testing this functionality, the right place might just be an extension of 022_bigquery_test/test_incremental_strategies.py. There are a bunch of models here that mix and match incremental strategies and partition configurations. I think you could extend the TestBigQueryScripting case, with one crucial adjustment: turn on require_partition_filter for all the models.

@property
def project_config(self):
    return {
        'config-version': 2,
        'seeds': {
            '+quote_columns': False,
        },
        'models': {
            '+require_partition_filter': True
        },
    }

If every model / strategy / partition combo runs with that config turned on, we'll know that this change has been successful. (Conversely, without the change in this PR, some of those models should fail.)

prratek (Author) commented May 28, 2021

That makes sense! A couple of thoughts:

  1. I saw your comment in #3016 about having temp tables not require a partition filter more generally, and that seems reasonable. Could that be accomplished by just changing this if statement? Something like:

if config.get('require_partition_filter') and not temporary:

  2. This colon looks suspicious - is that valid Jinja?

jtcohen6 (Contributor) commented Jun 1, 2021

I can't see what line you linked to in item 1; I think we're on the same page, but just to be safe, I'll say what I'm thinking.

The operative logic is going to be in the BigQuery plugin, rather than the default, since BigQuery implements its own bigquery__create_table_as. That macro calls bigquery_table_options, which in turn shells out to a python adapter method, get_table_options. In particular, we'd want to change this line in the way you recommend:
https://github.com/fishtown-analytics/dbt/blob/d89e1d7f850ba5a48e66cfabf0dcfe14afc32ccc/plugins/bigquery/dbt/adapters/bigquery/impl.py#L797-L799

if config.get('require_partition_filter') and not temporary:
    opts['require_partition_filter'] = config.get(
        'require_partition_filter')

There's something funny about the way we've implemented temp tables on BigQuery, because (a) our implementation predated "true" temp tables in BQ, and (b) "true" temp tables are only supported in scripting-style queries, which the snapshot materialization can't quite be today. So instead of "true" temp tables, dbt creates real tables with 12-hour expiration windows. That doesn't change the substance of the change you need to make here, I just wanted to make sure you weren't staring at the code in confusion.
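
For reference, the "fake" temp table amounts to something like this (a sketch with illustrative names; the exact option rendering comes from bigquery_table_options):

-- a real table that cleans itself up after 12 hours, in lieu of a true TEMP table
create or replace table `my_project`.`my_dataset`.`my_model__dbt_tmp`
options (
    expiration_timestamp = timestamp_add(current_timestamp(), interval 12 hour)
)
as (
    select 1 as id, cast('2020-01-01' as datetime) as date_time
);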

For the integration test, rather than turning on require_partition_filter for the existing test case, could you create a new test case that inherits from the existing one, with require_partition_filter turned on? I want to make sure this works both with and without the config for all incremental strategies.
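
Something like the following, reusing the project_config from earlier in the thread (the class name here is illustrative; the parent is the existing case in 022_bigquery_test/test_incremental_strategies.py):

# Sketch: inherit the existing strategies test case and flip the config on;
# the models, seeds, and assertions are all reused from the parent unchanged
class TestBigQueryScriptingRequirePartitionFilter(TestBigQueryScripting):
    @property
    def project_config(self):
        return {
            'config-version': 2,
            'seeds': {
                '+quote_columns': False,
            },
            'models': {
                '+require_partition_filter': True,
            },
        }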

Last but not least, as far as that colon in Jinja: I hear you! It works either way, though.

{% if True: %} select 1 as fun {% endif %}
{% if True %} select 1 as fun {% endif %}

I figure some folks prefer the colon because it looks more like a python conditional:

if True:
  return "select 1 as fun"

prratek (Author) commented Jun 10, 2021

Sorry it took me a while to get back to this, but I think I'm almost there:

  1. For the integration tests, I had to navigate the fact that BigQuery can't use the result of a subquery to prune partitions, and that the query used to get the most recent row from the table must itself also include a partition filter (see the sketch after this list). The test as it stands feels a little hacky, but it works. Open to any suggestions!
  2. @jtcohen6 do you know what _dbt_max_partition is here? A couple of the models in that directory use it, and it doesn't prune partitions, but I'd want to understand what it's doing before changing anything.
  3. Was there anything else you think I should write tests for? You mentioned a possible test case to ensure the static insert_overwrite strategy works, and I think this model should cover that.
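
A sketch of the shape item 1 describes, with illustrative model and column names (events, date_day): the is_incremental() lookup carries its own partition filter, since the subquery result alone wouldn't count as a filter usable for partition elimination.

select * from {{ ref('events') }}

{% if is_incremental() %}
where date_day >= (
    -- the lookup against {{ this }} needs its own partition filter;
    -- "is not null" should be enough to satisfy require_partition_filter
    select max(date_day) from {{ this }}
    where date_day is not null
)
{% endif %}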

jtcohen6 (Contributor) commented Jun 16, 2021

  1. For the integration tests, I had to navigate the fact that BigQuery can't use the result of a subquery to prune partitions and that the query used to get the most recent row from the table must itself also include a partition filter. The test as it stands feels a little hacky but it works. Open to any suggestions!

Nice work on this! Looks similar to what we've done on other projects/packages for BigQuery.

  2. @jtcohen6 do you know what _dbt_max_partition is here? A couple of the models in that directory use it and it doesn't prune partitions but I'd want to understand what it's doing before changing anything.

In the "dynamic" version of insert_overwrite, _dbt_max_partition is included in the script to calculate the max partition value. So the query to calculate _dbt_max_partition may indeed be the guilty party here, not be pruning partitions, since it's just select max(partition_col). I would think that the resulting value should effectively prune partitions within the model SQL. If it doesn't, well... we already know that this approach to _dbt_max_partition is not ideal: https://github.com/fishtown-analytics/dbt/issues/2278

  3. Was there anything else you think I should write tests for? You mentioned a possible test case to ensure the static insert_overwrite strategy works and I think this model should cover that.

I think you've got it covered!

jtcohen6 (Contributor) commented:
@prratek I looked a bit more into this, and I've found something pretty wacky: wrapping a datetime-type partition column in date(...) and timestamp_trunc(..., day) both work just fine for partition filtering and elimination, but when I wrap it in datetime_trunc(...), it doesn't work!

-- models/my_model.sql
{{
    config(
        materialized="incremental",
        incremental_strategy="insert_overwrite",
        partition_by={
            "field": "date_time",
            "data_type": "datetime"
        },
        require_partition_filter = True
    )
}}

select 1 as id, cast('2020-01-01' as datetime) as date_time
-- excerpted from insert_overwrite script, which I copy-pasted into BigQuery console to confirm

merge into `my_model` as DBT_INTERNAL_DEST
        using ( ... ) as DBT_INTERNAL_SOURCE
        on FALSE

    when not matched by source
         and datetime_trunc(DBT_INTERNAL_DEST.date_time, day) in ('2020-01-01')
        then delete

    when not matched then insert
        (`id`, `date_time`)
    values
        (`id`, `date_time`)
  Query error: Cannot query over table 'my_model' without a filter over column(s) 'date_time' that can be used for partition elimination at [42:5]
  compiled SQL at target/run/my_project/models/my_model.sql

This sure looks like a BigQuery bug, doesn't it?

In the meantime, we could work around it by adding another filter:

    when not matched by source
         and datetime_trunc(DBT_INTERNAL_DEST.date_time, day) in ('2020-01-01', '2020-01-02')
         and DBT_INTERNAL_DEST.date_time is not null
        then delete

This enables the query to succeed, but it will still process more bytes than it should, since datetime_trunc doesn't work for partition elimination. To repeat, I'm only seeing this issue with datetime_trunc; both date and timestamp column types/functions seem to work just fine with partition filtering.
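
In other words, leaning on the observation above that wrapping the column in date(...) does work for both filtering and elimination, the delete predicate could instead be rewritten as (a sketch):

    when not matched by source
         and date(DBT_INTERNAL_DEST.date_time) in ('2020-01-01', '2020-01-02')
        then delete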

jtcohen6 (Contributor) commented:
@prratek Is this something you'd still be interested in contributing? I'd be willing to find a way to work around the failing test re: filtering with datetime_trunc, since that really does feel like an undocumented limitation (or bug) with BigQuery.

jtcohen6 (Contributor) commented:
Closing in favor of dbt-labs/dbt-bigquery#65
