Override the default merge behaviour for incremental table materialisations in a model. #2737
Comments
@AndrewCathcart Thanks for the detailed explanation! To accomplish this today, you've got a couple of options:
IMO that third option requires more duplicative code than it ideally would. There's an open issue (#2366) for enabling users to define their own incremental strategies. I'm going to close this issue because you've done a great job of solving it already, and because any dbt code changes we'd make to better accommodate this use case are already addressed in that open issue.
Thanks very much for your response :-) Apologies for opening a duplicate; I couldn't find anything on the topic when doing a search.
Thank you for opening it, and including a great writeup! If someone has a similar question in the future, they'll be able to find this now :)
I thought I'd come back to this and add a problem that we had to solve, just in case anyone copies the above approach.

We have a Snowpipe loading data into raw tables. This sets a date_loaded column via a CURRENT_TIMESTAMP() function call. In the dbt code above, I'm also using CURRENT_TIMESTAMP to set a "date_aggregated" column's value. To make this an incremental model, we have a where clause like so:

WHERE date_loaded >= (SELECT MAX(date_aggregated) FROM {{ this }})

This is where the problem arises. What looks relatively harmless can cause you great concurrency pain down the line. From the Snowflake docs for CURRENT_TIMESTAMP: "Do not use the returned value for precise time ordering between concurrent queries (processed by the same virtual warehouse) because the queries might be serviced by different servers (in the warehouse)."

We ran into a problem where we were missing some events. Thankfully I wrote a nice data test that checked the raw/aggregate counts matched, and it caught this before we went to prod.

To solve this we removed the "date_aggregated" column and created a "date_loaded_checkpoint" column containing the MAX(date_loaded) of the new data we were merging into the table. This way we don't lose any events to concurrency issues caused by comparing two separate CURRENT_TIMESTAMP calls (the Snowpipe CURRENT_TIMESTAMP that sets date_loaded, and the dbt merge CURRENT_TIMESTAMP that set date_aggregated).

{% if is_incremental() %}
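-- Incremental run: aggregate only the events loaded since the checkpoint stored in this table.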
WITH new_event_aggregates AS (
SELECT
unique_key,
a,
b,
date_loaded,
COUNT(*) AS item_count
FROM
{{ ref('events') }}
WHERE
date_loaded > ( SELECT MAX(date_loaded_checkpoint) FROM {{ this }} )
GROUP BY 1, 2, 3, 4
),
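-- Existing aggregate rows that the new events will update.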
current_aggregates_to_update AS (
SELECT
{{ this }}.*
FROM
{{ this }}
INNER JOIN new_event_aggregates USING(unique_key)
),
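-- Combine: add the new counts on top of any existing counts for the same unique_key.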
event_aggregates AS (
SELECT
new_data.unique_key AS unique_key,
new_data.a AS a,
new_data.b AS b,
new_data.item_count + IFNULL(current_data.item_count, 0) AS item_count,
new_data.date_loaded AS date_loaded
FROM
new_event_aggregates AS new_data
LEFT JOIN current_aggregates_to_update AS current_data USING(unique_key)
)
{% else %}
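-- First run / full refresh: aggregate everything in the source table.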
WITH event_aggregates AS (
SELECT
unique_key,
a,
b,
date_loaded,
COUNT(*) AS item_count
FROM
{{ ref('events') }}
GROUP BY 1, 2, 3, 4
)
{% endif %}
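-- Final select; MAX(date_loaded) becomes the checkpoint for the next incremental run.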
SELECT
unique_key,
a,
b,
SUM(item_count) as item_count,
(SELECT MAX(date_loaded) FROM event_aggregates) AS date_loaded_checkpoint
FROM
event_aggregates
GROUP BY 1, 2, 3

TL;DR: Don't try to compare two separate CURRENT_TIMESTAMP calls. Use CURRENT_TIMESTAMP once, when loading data in, then take the MAX of it in the dbt CTEs and set that as the checkpoint for incremental tables.
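A minimal sketch of the kind of data test mentioned above, which fails if the aggregated counts drift from the raw event counts. The file name and the 'event_aggregates' model name are hypothetical; dbt treats any rows returned by the query as a failure:

-- tests/assert_event_counts_match.sql (hypothetical file name)
WITH raw_events AS (
    SELECT COUNT(*) AS n FROM {{ ref('events') }}
),
aggregated_events AS (
    SELECT SUM(item_count) AS n FROM {{ ref('event_aggregates') }}
)
SELECT
    raw_events.n AS raw_count,
    aggregated_events.n AS aggregated_count
FROM raw_events, aggregated_events
-- Returning a row here means the counts no longer match, so the test fails.
WHERE raw_events.n <> aggregated_events.n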
Right on, @AndrewCathcart. In general, we advise against using relative date/time values (such as CURRENT_TIMESTAMP) to filter incremental models, for exactly the reason you ran into: the value depends on when and where the query runs, rather than on the data itself.
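To make the contrast concrete (column names as in the example above), the fragile and robust filters differ only in what they compare against:

-- Fragile: compares timestamps captured by two different processes' CURRENT_TIMESTAMP calls
WHERE date_loaded >= (SELECT MAX(date_aggregated) FROM {{ this }})

-- Robust: checkpoints on a timestamp that is already stored in the data
WHERE date_loaded > (SELECT MAX(date_loaded_checkpoint) FROM {{ this }})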
Hi everyone,

To be more precise, I needed to check that the version I keep is the most recent one, so I added a macro and hooked it into the get_merge_sql macro. Finally, I can put a config on any model to ensure the new statement is used for the merge strategy.

Is there any reason for this not to be a good idea? I was wondering if there is anything against adding custom config?
@Jeremynadal33 This sounds a lot like the request to add custom predicates (#3293 + #4546), with exactly this idea: limit the table scan of the preexisting table when searching for matches. The mechanism proposed there would add the condition to the merge's matching predicates.
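For reference, later dbt-core releases (1.4+, if I recall correctly) shipped that proposal as the incremental_predicates config. Assuming a version that supports it, the idea might be expressed roughly like this (the predicate and column names are illustrative, and DBT_INTERNAL_DEST is the alias dbt gives the existing table in its generated merge):

{{
    config(
        materialized = 'incremental',
        incremental_strategy = 'merge',
        unique_key = 'unique_key',
        incremental_predicates = [
            "DBT_INTERNAL_DEST.date_loaded > dateadd(day, -7, current_date)"
        ]
    )
}}

Each predicate string is appended to the ON condition of the generated merge, so the scan of the existing table is limited to rows matching the condition.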
Describe the feature
I would like to be able to use a custom merge statement, instead of the default, when using incremental table materialisations. I have late-arriving events, so when aggregating these I need to be able to ADD them to the existing row, instead of replacing the row with only the new aggregated events.
As an example (a sketch of the merge statement I have in mind follows the tables below):
Current -
DAY - ANIMAL - COUNT
1 ---- Sheep ---- 10
New -
DAY - ANIMAL - COUNT
1 ---- Sheep ---- 5
Expected (perform an old.count + new.count) -
DAY - ANIMAL - COUNT
1 ---- Sheep ---- 15
Actual -
DAY - ANIMAL - COUNT
1 ---- Sheep ---- 5
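Roughly, the merge I'd want dbt to generate would look something like this. This is a sketch only; table names are invented, and DAY and COUNT are renamed event_day and animal_count to avoid keyword clashes:

MERGE INTO animal_counts AS target
USING new_animal_counts AS source
    ON target.event_day = source.event_day
   AND target.animal = source.animal
WHEN MATCHED THEN UPDATE SET
    -- Add late-arriving counts to the existing row instead of overwriting it
    target.animal_count = target.animal_count + source.animal_count
WHEN NOT MATCHED THEN INSERT (event_day, animal, animal_count)
    VALUES (source.event_day, source.animal, source.animal_count);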
Describe alternatives you've considered
I'm using the following approach currently, but I feel this muddies the model and will make it difficult for others to contribute. I've redacted/simplified the code, but hopefully it illustrates the general approach. For context, we have late-arriving events that need to be dealt with.
EDIT: DO NOT USE - CORRECT EXAMPLE IN THE COMMENTS
Additional context
We're currently using Snowflake. Previously I was using Snowflake Tasks & Streams, doing addition in the merge statement to update the aggregates as new data arrived on an append-only stream. We're trying to emulate the same thing here using dbt.
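For anyone unfamiliar with that setup, it looked roughly like this (object names are invented for this sketch):

-- Append-only stream over the raw table captures only newly inserted rows
CREATE OR REPLACE STREAM raw_events_stream ON TABLE raw.events APPEND_ONLY = TRUE;

-- Task periodically merges the new rows into the aggregate, adding counts to existing rows
CREATE OR REPLACE TASK aggregate_new_events
    WAREHOUSE = transform_wh
    SCHEDULE = '5 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('RAW_EVENTS_STREAM')
AS
MERGE INTO analytics.animal_counts AS target
USING (
    SELECT event_day, animal, COUNT(*) AS animal_count
    FROM raw_events_stream
    GROUP BY 1, 2
) AS source
    ON target.event_day = source.event_day
   AND target.animal = source.animal
WHEN MATCHED THEN UPDATE SET
    target.animal_count = target.animal_count + source.animal_count
WHEN NOT MATCHED THEN INSERT (event_day, animal, animal_count)
    VALUES (source.event_day, source.animal, source.animal_count);

-- Tasks are created suspended, so remember: ALTER TASK aggregate_new_events RESUME;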
Who will this benefit?
I imagine it will benefit anyone who needs to handle a more complex merge command than the default.
Are you interested in contributing this feature?
Not sure I'd be best placed.