diff --git a/CHANGELOG.md b/CHANGELOG.md index e8e4bbe8..cb6622b4 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,3 +1,12 @@ +# dbt-utils v0.8.5 + +## New features +- Add Postgres specific implementation of `deduplicate()` ([#549](https://github.com/dbt-labs/dbt-utils/pull/549)) +- Add Snowflake specific implementation of `deduplicate()` ([#543](https://github.com/dbt-labs/dbt-utils/pull/543), [#549](https://github.com/dbt-labs/dbt-utils/pull/549)) + +## Quality of life +- Added better documentation for `deduplicate()` ([#542](https://github.com/dbt-labs/dbt-utils/pull/542), [#549](https://github.com/dbt-labs/dbt-utils/pull/549)) + # dbt-utils v0.8.3 ## New features - A macro for deduplicating data, `deduplicate()` ([#335](https://github.com/dbt-labs/dbt-utils/issues/335), [#512](https://github.com/dbt-labs/dbt-utils/pull/512)) @@ -6,7 +15,7 @@ ## Fixes - `get_column_values()` once more raises an error when the model doesn't exist and there is no default provided ([#531](https://github.com/dbt-labs/dbt-utils/issues/531), [#533](https://github.com/dbt-labs/dbt-utils/pull/533)) -- `get_column_values()` raises an error when used with an ephemeral model, instead of getting stuck in a compilation loop ([#358](https://github.com/dbt-labs/dbt-utils/issues/358), [#518](https://github.com/dbt-labs/dbt-utils/pull/518)) +- `get_column_values()` raises an error when used with an ephemeral model, instead of getting stuck in a compilation loop ([#358](https://github.com/dbt-labs/dbt-utils/issues/358), [#518](https://github.com/dbt-labs/dbt-utils/pull/518)) - BigQuery materialized views work correctly with `get_relations_by_pattern()` ([#525](https://github.com/dbt-labs/dbt-utils/pull/525)) ## Quality of life diff --git a/README.md b/README.md index 6502b2b9..dbdb63bc 100644 --- a/README.md +++ b/README.md @@ -595,7 +595,7 @@ This macro returns the unique values for a column in a given [relation](https:// This macro returns an iterable Jinja list of columns for a given [relation](https://docs.getdbt.com/docs/writing-code-in-dbt/class-reference/#relation), (i.e. not from a CTE) - optionally exclude columns - the input values are not case-sensitive (input uppercase or lowercase and it will work!) -> Note: The native [`adapter.get_columns_in_relation` macro](https://docs.getdbt.com/reference/dbt-jinja-functions/adapter#get_columns_in_relation) allows you +> Note: The native [`adapter.get_columns_in_relation` macro](https://docs.getdbt.com/reference/dbt-jinja-functions/adapter#get_columns_in_relation) allows you to pull column names in a non-filtered fashion, also bringing along with it other (potentially unwanted) information, such as dtype, char_size, numeric_precision, etc. **Args:** @@ -737,6 +737,12 @@ This macro returns the sql required to build a date spine. The spine will includ #### deduplicate ([source](macros/sql/deduplicate.sql)) This macro returns the sql required to remove duplicate rows from a model or source. +**Args:** + - `relation` (required): a [Relation](https://docs.getdbt.com/reference/dbt-classes#relation) (a `ref` or `source`) or string which identifies the model to deduplicate. + - `group_by` (required): column names (or expressions) to use to identify a set/window of rows out of which to select one as the deduplicated row. + - `order_by` (optional, default=none): column names (or expressions) that determine the priority order of which row should be chosen if there are duplicates (comma-separated string). + - `relation_alias` (optional, default=none): DEPRECATED - string to select from which references a `ref`, `source` or CTE which contains the same columns as `relation`. + **Usage:** ``` @@ -749,6 +755,21 @@ This macro returns the sql required to remove duplicate rows from a model or sou }} ``` +``` +with my_cte as ( + select * + from {{ source('my_source', 'my_table') }} + where user_id = 1 +) + +{{ dbt_utils.deduplicate( + relation='my_cte', + group_by="user_id, cast(timestamp as day)", + order_by="timestamp desc", + ) +}} +``` + #### haversine_distance ([source](macros/sql/haversine_distance.sql)) This macro calculates the [haversine distance](http://daynebatten.com/2015/09/latitude-longitude-distance-sql/) between a pair of x/y coordinates. @@ -791,19 +812,19 @@ group by 1,2,3 ``` #### star ([source](macros/sql/star.sql)) -This macro generates a comma-separated list of all fields that exist in the `from` relation, excluding any fields -listed in the `except` argument. The construction is identical to `select * from {{ref('my_model')}}`, replacing star (`*`) with -the star macro. -This macro also has an optional `relation_alias` argument that will prefix all generated fields with an alias (`relation_alias`.`field_name`). -The macro also has optional `prefix` and `suffix` arguments. When one or both are provided, they will be concatenated onto each field's alias +This macro generates a comma-separated list of all fields that exist in the `from` relation, excluding any fields +listed in the `except` argument. The construction is identical to `select * from {{ref('my_model')}}`, replacing star (`*`) with +the star macro. +This macro also has an optional `relation_alias` argument that will prefix all generated fields with an alias (`relation_alias`.`field_name`). +The macro also has optional `prefix` and `suffix` arguments. When one or both are provided, they will be concatenated onto each field's alias in the output (`prefix` ~ `field_name` ~ `suffix`). NB: This prevents the output from being used in any context other than a select statement. **Args:** - `from` (required): a [Relation](https://docs.getdbt.com/reference/dbt-classes#relation) (a `ref` or `source`) that contains the list of columns you wish to select from - `except` (optional, default=`[]`): The name of the columns you wish to exclude. (case-insensitive) -- `relation_alias` (optional, default=`''`): will prefix all generated fields with an alias (`relation_alias`.`field_name`). -- `prefix` (optional, default=`''`): will prefix the output `field_name` (`field_name as prefix_field_name`). -- `suffix` (optional, default=`''`): will suffix the output `field_name` (`field_name as field_name_suffix`). +- `relation_alias` (optional, default=`''`): will prefix all generated fields with an alias (`relation_alias`.`field_name`). +- `prefix` (optional, default=`''`): will prefix the output `field_name` (`field_name as prefix_field_name`). +- `suffix` (optional, default=`''`): will suffix the output `field_name` (`field_name as field_name_suffix`). **Usage:** ```sql @@ -1026,7 +1047,7 @@ This macro calculates the difference between two dates. This macro splits a string of text using the supplied delimiter and returns the supplied part number (1-indexed). **Args**: -- `string_text` (required): Text to be split into parts. +- `string_text` (required): Text to be split into parts. - `delimiter_text` (required): Text representing the delimiter to split by. - `part_number` (required): Requested part of the split (1-based). If the value is negative, the parts are counted backward from the end of the string. diff --git a/macros/sql/deduplicate.sql b/macros/sql/deduplicate.sql index 9a3571a2..d6061f56 100644 --- a/macros/sql/deduplicate.sql +++ b/macros/sql/deduplicate.sql @@ -4,9 +4,7 @@ {%- macro default__deduplicate(relation, group_by, order_by=none, relation_alias=none) -%} - select - {{ dbt_utils.star(relation, relation_alias='deduped') | indent }} - from ( + with row_numbered as ( select _inner.*, row_number() over ( @@ -16,8 +14,58 @@ {%- endif %} ) as rn from {{ relation if relation_alias is none else relation_alias }} as _inner - ) as deduped - where deduped.rn = 1 + ) + + select + distinct data.* + from {{ relation }} as data + {# + -- Not all DBs will support natural joins but the ones that do include: + -- Oracle, MySQL, SQLite, Redshift, Teradata, Materialize, Databricks + -- Apache Spark, SingleStore, Vertica + -- Those that do not appear to support natural joins include: + -- SQLServer, Trino, Presto, Rockset, Athena + #} + natural join row_numbered + where row_numbered.rn = 1 + +{%- endmacro -%} + +{# Redshift should use default instead of Postgres #} +{% macro redshift__deduplicate(relation, group_by, order_by=none, relation_alias=none) -%} + + {{ return(dbt_utils.default__deduplicate(relation, group_by, order_by=order_by, relation_alias=relation_alias)) }} + +{% endmacro %} + +{# +-- Postgres has the `DISTINCT ON` syntax: +-- https://www.postgresql.org/docs/current/sql-select.html#SQL-DISTINCT +#} +{%- macro postgres__deduplicate(relation, group_by, order_by=none, relation_alias=none) -%} + + select + distinct on ({{ group_by }}) * + from {{ relation if relation_alias is none else relation_alias }} + order by {{ group_by }}{{ ',' ~ order_by if order_by is not none else '' }} + +{%- endmacro -%} + +{# +-- Snowflake has the `QUALIFY` syntax: +-- https://docs.snowflake.com/en/sql-reference/constructs/qualify.html +#} +{%- macro snowflake__deduplicate(relation, group_by, order_by=none, relation_alias=none) -%} + + select * + from {{ relation if relation_alias is none else relation_alias }} + qualify + row_number() over ( + partition by {{ group_by }} + {% if order_by is not none -%} + order by {{ order_by }} + {%- endif %} + ) = 1 {%- endmacro -%} @@ -29,18 +77,14 @@ {%- macro bigquery__deduplicate(relation, group_by, order_by=none, relation_alias=none) -%} select - {{ dbt_utils.star(relation, relation_alias='deduped') | indent }} - from ( - select - array_agg ( - original - {% if order_by is not none -%} - order by {{ order_by }} - {%- endif %} - limit 1 - )[offset(0)] as deduped - from {{ relation if relation_alias is none else relation_alias }} as original - group by {{ group_by }} - ) + array_agg ( + original + {% if order_by is not none -%} + order by {{ order_by }} + {%- endif %} + limit 1 + )[offset(0)].* + from {{ relation if relation_alias is none else relation_alias }} as original + group by {{ group_by }} {%- endmacro -%}