dbt-labs · judahrand · Apr 14, 2022 · Apr 14, 2022 · Apr 14, 2022 · Apr 14, 2022
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,3 +1,12 @@
+# dbt-utils v0.8.5
+
+## New features
+- Add Postgres specific implementation of `deduplicate()` ([#549](https://github.com/dbt-labs/dbt-utils/pull/549))
+- Add Snowflake specific implementation of `deduplicate()` ([#543](https://github.com/dbt-labs/dbt-utils/pull/543), [#549](https://github.com/dbt-labs/dbt-utils/pull/549))
+
+## Quality of life
+- Added better documentation for `deduplicate()` ([#542](https://github.com/dbt-labs/dbt-utils/pull/542), [#549](https://github.com/dbt-labs/dbt-utils/pull/549))
+
 # dbt-utils v0.8.3
 ## New features
 - A macro for deduplicating data, `deduplicate()` ([#335](https://github.com/dbt-labs/dbt-utils/issues/335), [#512](https://github.com/dbt-labs/dbt-utils/pull/512))
@@ -6,7 +15,7 @@
 
 ## Fixes
 - `get_column_values()` once more raises an error when the model doesn't exist and there is no default provided ([#531](https://github.com/dbt-labs/dbt-utils/issues/531), [#533](https://github.com/dbt-labs/dbt-utils/pull/533))
-- `get_column_values()` raises an error when used with an ephemeral model, instead of getting stuck in a compilation loop ([#358](https://github.com/dbt-labs/dbt-utils/issues/358), [#518](https://github.com/dbt-labs/dbt-utils/pull/518)) 
+- `get_column_values()` raises an error when used with an ephemeral model, instead of getting stuck in a compilation loop ([#358](https://github.com/dbt-labs/dbt-utils/issues/358), [#518](https://github.com/dbt-labs/dbt-utils/pull/518))
 - BigQuery materialized views work correctly with `get_relations_by_pattern()` ([#525](https://github.com/dbt-labs/dbt-utils/pull/525))
 
 ## Quality of life

diff --git a/README.md b/README.md
@@ -595,7 +595,7 @@ This macro returns the unique values for a column in a given [relation](https://
 This macro returns an iterable Jinja list of columns for a given [relation](https://docs.getdbt.com/docs/writing-code-in-dbt/class-reference/#relation), (i.e. not from a CTE)
 - optionally exclude columns
 - the input values are not case-sensitive (input uppercase or lowercase and it will work!)
-> Note: The native [`adapter.get_columns_in_relation` macro](https://docs.getdbt.com/reference/dbt-jinja-functions/adapter#get_columns_in_relation) allows you 
+> Note: The native [`adapter.get_columns_in_relation` macro](https://docs.getdbt.com/reference/dbt-jinja-functions/adapter#get_columns_in_relation) allows you
 to pull column names in a non-filtered fashion, also bringing along with it other (potentially unwanted) information, such as dtype, char_size, numeric_precision, etc.
 
 **Args:**
@@ -737,6 +737,12 @@ This macro returns the sql required to build a date spine. The spine will includ
 #### deduplicate ([source](macros/sql/deduplicate.sql))
 This macro returns the sql required to remove duplicate rows from a model or source.
 
+**Args:**
+ - `relation` (required): a [Relation](https://docs.getdbt.com/reference/dbt-classes#relation) (a `ref` or `source`) or string which identifies the model to deduplicate.
+ - `group_by` (required): column names (or expressions) to use to identify a set/window of rows out of which to select one as the deduplicated row.
+ - `order_by` (optional, default=none): column names (or expressions) that determine the priority order of which row should be chosen if there are duplicates (comma-separated string).
+ - `relation_alias` (optional, default=none): DEPRECATED - string to select from which references a `ref`, `source` or CTE which contains the same columns as `relation`.
+
 **Usage:**
 
 ```
@@ -749,6 +755,21 @@ This macro returns the sql required to remove duplicate rows from a model or sou
 }}
 ```
 
+```
+with my_cte as (
+    select *
+    from {{ source('my_source', 'my_table') }}
+    where user_id = 1
+)
+
+{{ dbt_utils.deduplicate(
+    relation='my_cte',
+    group_by="user_id, cast(timestamp as day)",
+    order_by="timestamp desc",
+   )
+}}
+```
+
 #### haversine_distance ([source](macros/sql/haversine_distance.sql))
 This macro calculates the [haversine distance](http://daynebatten.com/2015/09/latitude-longitude-distance-sql/) between a pair of x/y coordinates.
 
@@ -791,19 +812,19 @@ group by 1,2,3
 ```
 
 #### star ([source](macros/sql/star.sql))
-This macro generates a comma-separated list of all fields that exist in the `from` relation, excluding any fields 
-listed in the `except` argument. The construction is identical to `select * from {{ref('my_model')}}`, replacing star (`*`) with 
-the star macro. 
-This macro also has an optional `relation_alias` argument that will prefix all generated fields with an alias (`relation_alias`.`field_name`). 
-The macro also has optional `prefix` and `suffix` arguments. When one or both are provided, they will be concatenated onto each field's alias 
+This macro generates a comma-separated list of all fields that exist in the `from` relation, excluding any fields
+listed in the `except` argument. The construction is identical to `select * from {{ref('my_model')}}`, replacing star (`*`) with
+the star macro.
+This macro also has an optional `relation_alias` argument that will prefix all generated fields with an alias (`relation_alias`.`field_name`).
+The macro also has optional `prefix` and `suffix` arguments. When one or both are provided, they will be concatenated onto each field's alias
 in the output (`prefix` ~ `field_name` ~ `suffix`). NB: This prevents the output from being used in any context other than a select statement.
 
 **Args:**
 - `from` (required): a [Relation](https://docs.getdbt.com/reference/dbt-classes#relation) (a `ref` or `source`) that contains the list of columns you wish to select from
 - `except` (optional, default=`[]`): The name of the columns you wish to exclude. (case-insensitive)
-- `relation_alias` (optional, default=`''`): will prefix all generated fields with an alias (`relation_alias`.`field_name`). 
-- `prefix` (optional, default=`''`): will prefix the output `field_name` (`field_name as prefix_field_name`). 
-- `suffix` (optional, default=`''`): will suffix the output `field_name` (`field_name as field_name_suffix`). 
+- `relation_alias` (optional, default=`''`): will prefix all generated fields with an alias (`relation_alias`.`field_name`).
+- `prefix` (optional, default=`''`): will prefix the output `field_name` (`field_name as prefix_field_name`).
+- `suffix` (optional, default=`''`): will suffix the output `field_name` (`field_name as field_name_suffix`).
 
 **Usage:**
 ```sql
@@ -1026,7 +1047,7 @@ This macro calculates the difference between two dates.
 This macro splits a string of text using the supplied delimiter and returns the supplied part number (1-indexed).
 
 **Args**:
-- `string_text` (required): Text to be split into parts. 
+- `string_text` (required): Text to be split into parts.
 - `delimiter_text` (required): Text representing the delimiter to split by.
 - `part_number` (required): Requested part of the split (1-based). If the value is negative, the parts are counted backward from the end of the string.
 

diff --git a/macros/sql/deduplicate.sql b/macros/sql/deduplicate.sql
@@ -4,9 +4,7 @@
 
 {%- macro default__deduplicate(relation, group_by, order_by=none, relation_alias=none) -%}
 
-    select
-        {{ dbt_utils.star(relation, relation_alias='deduped') | indent }}
-    from (
+    with row_numbered as (
         select
             _inner.*,
             row_number() over (
@@ -16,8 +14,58 @@
                 {%- endif %}
             ) as rn
         from {{ relation if relation_alias is none else relation_alias }} as _inner
-    ) as deduped
-    where deduped.rn = 1
+    )
+
+    select
+        distinct data.*
+    from {{ relation }} as data
+    {#
+    -- Not all DBs will support natural joins but the ones that do include:
+    -- Oracle, MySQL, SQLite, Redshift, Teradata, Materialize, Databricks
+    -- Apache Spark, SingleStore, Vertica
+    -- Those that do not appear to support natural joins include:
+    -- SQLServer, Trino, Presto, Rockset, Athena
+    #}
+    natural join row_numbered
+    where row_numbered.rn = 1
+
+{%- endmacro -%}
+
+{# Redshift should use default instead of Postgres #}
+{% macro redshift__deduplicate(relation, group_by, order_by=none, relation_alias=none) -%}
+
+    {{ return(dbt_utils.default__deduplicate(relation, group_by, order_by=order_by, relation_alias=relation_alias)) }}
+
+{% endmacro %}
+
+{#
+-- Postgres has the `DISTINCT ON` syntax:
+-- https://www.postgresql.org/docs/current/sql-select.html#SQL-DISTINCT
+#}
+{%- macro postgres__deduplicate(relation, group_by, order_by=none, relation_alias=none) -%}
+
+    select
+        distinct on ({{ group_by }}) *
+    from {{ relation if relation_alias is none else relation_alias }}
+    order by {{ group_by }}{{ ',' ~ order_by if order_by is not none else '' }}
+
+{%- endmacro -%}
+
+{#
+-- Snowflake has the `QUALIFY` syntax:
+-- https://docs.snowflake.com/en/sql-reference/constructs/qualify.html
+#}
+{%- macro snowflake__deduplicate(relation, group_by, order_by=none, relation_alias=none) -%}
+
+    select *
+    from {{ relation if relation_alias is none else relation_alias }}
+    qualify
+        row_number() over (
+            partition by {{ group_by }}
+            {% if order_by is not none -%}
+            order by {{ order_by }}
+            {%- endif %}
+        ) = 1
 
 {%- endmacro -%}
 
@@ -29,18 +77,14 @@
 {%- macro bigquery__deduplicate(relation, group_by, order_by=none, relation_alias=none) -%}
 
     select
-        {{ dbt_utils.star(relation, relation_alias='deduped') | indent }}
-    from (
-        select
-            array_agg (
-                original
-                {% if order_by is not none -%}
-                order by {{ order_by }}
-                {%- endif %}
-                limit 1
-            )[offset(0)] as deduped
-        from {{ relation if relation_alias is none else relation_alias }} as original
-        group by {{ group_by }}
-    )
+        array_agg (
+            original
+            {% if order_by is not none -%}
+            order by {{ order_by }}
+            {%- endif %}
+            limit 1
+        )[offset(0)].*
+    from {{ relation if relation_alias is none else relation_alias }} as original
+    group by {{ group_by }}
 
 {%- endmacro -%}