Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize deduplicate() implementations #549

Closed
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
11 changes: 10 additions & 1 deletion CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,12 @@
# dbt-utils v0.8.5

## New features
- Add Postgres specific implementation of `deduplicate()` ([#549](https://github.com/dbt-labs/dbt-utils/pull/549))
- Add Snowflake specific implementation of `deduplicate()` ([#543](https://github.com/dbt-labs/dbt-utils/pull/543), [#549](https://github.com/dbt-labs/dbt-utils/pull/549))

## Quality of life
- Added better documentation for `deduplicate()` ([#542](https://github.com/dbt-labs/dbt-utils/pull/542), [#549](https://github.com/dbt-labs/dbt-utils/pull/549))

# dbt-utils v0.8.3
## New features
- A macro for deduplicating data, `deduplicate()` ([#335](https://github.com/dbt-labs/dbt-utils/issues/335), [#512](https://github.com/dbt-labs/dbt-utils/pull/512))
Expand All @@ -6,7 +15,7 @@

## Fixes
- `get_column_values()` once more raises an error when the model doesn't exist and there is no default provided ([#531](https://github.com/dbt-labs/dbt-utils/issues/531), [#533](https://github.com/dbt-labs/dbt-utils/pull/533))
- `get_column_values()` raises an error when used with an ephemeral model, instead of getting stuck in a compilation loop ([#358](https://github.com/dbt-labs/dbt-utils/issues/358), [#518](https://github.com/dbt-labs/dbt-utils/pull/518))
- `get_column_values()` raises an error when used with an ephemeral model, instead of getting stuck in a compilation loop ([#358](https://github.com/dbt-labs/dbt-utils/issues/358), [#518](https://github.com/dbt-labs/dbt-utils/pull/518))
- BigQuery materialized views work correctly with `get_relations_by_pattern()` ([#525](https://github.com/dbt-labs/dbt-utils/pull/525))

## Quality of life
Expand Down
41 changes: 31 additions & 10 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -595,7 +595,7 @@ This macro returns the unique values for a column in a given [relation](https://
This macro returns an iterable Jinja list of columns for a given [relation](https://docs.getdbt.com/docs/writing-code-in-dbt/class-reference/#relation), (i.e. not from a CTE)
- optionally exclude columns
- the input values are not case-sensitive (input uppercase or lowercase and it will work!)
> Note: The native [`adapter.get_columns_in_relation` macro](https://docs.getdbt.com/reference/dbt-jinja-functions/adapter#get_columns_in_relation) allows you
> Note: The native [`adapter.get_columns_in_relation` macro](https://docs.getdbt.com/reference/dbt-jinja-functions/adapter#get_columns_in_relation) allows you
to pull column names in a non-filtered fashion, also bringing along with it other (potentially unwanted) information, such as dtype, char_size, numeric_precision, etc.

**Args:**
Expand Down Expand Up @@ -737,6 +737,12 @@ This macro returns the sql required to build a date spine. The spine will includ
#### deduplicate ([source](macros/sql/deduplicate.sql))
This macro returns the sql required to remove duplicate rows from a model or source.

**Args:**
- `relation` (required): a [Relation](https://docs.getdbt.com/reference/dbt-classes#relation) (a `ref` or `source`) or string which identifies the model to deduplicate.
- `group_by` (required): column names (or expressions) to use to identify a set/window of rows out of which to select one as the deduplicated row.
- `order_by` (optional, default=none): column names (or expressions) that determine the priority order of which row should be chosen if there are duplicates (comma-separated string).
- `relation_alias` (optional, default=none): DEPRECATED - string to select from which references a `ref`, `source` or CTE which contains the same columns as `relation`.

**Usage:**

```
Expand All @@ -749,6 +755,21 @@ This macro returns the sql required to remove duplicate rows from a model or sou
}}
```

```
with my_cte as (
select *
from {{ source('my_source', 'my_table') }}
where user_id = 1
)

{{ dbt_utils.deduplicate(
relation='my_cte',
group_by="user_id, cast(timestamp as day)",
order_by="timestamp desc",
)
}}
```

#### haversine_distance ([source](macros/sql/haversine_distance.sql))
This macro calculates the [haversine distance](http://daynebatten.com/2015/09/latitude-longitude-distance-sql/) between a pair of x/y coordinates.

Expand Down Expand Up @@ -791,19 +812,19 @@ group by 1,2,3
```

#### star ([source](macros/sql/star.sql))
This macro generates a comma-separated list of all fields that exist in the `from` relation, excluding any fields
listed in the `except` argument. The construction is identical to `select * from {{ref('my_model')}}`, replacing star (`*`) with
the star macro.
This macro also has an optional `relation_alias` argument that will prefix all generated fields with an alias (`relation_alias`.`field_name`).
The macro also has optional `prefix` and `suffix` arguments. When one or both are provided, they will be concatenated onto each field's alias
This macro generates a comma-separated list of all fields that exist in the `from` relation, excluding any fields
listed in the `except` argument. The construction is identical to `select * from {{ref('my_model')}}`, replacing star (`*`) with
the star macro.
This macro also has an optional `relation_alias` argument that will prefix all generated fields with an alias (`relation_alias`.`field_name`).
The macro also has optional `prefix` and `suffix` arguments. When one or both are provided, they will be concatenated onto each field's alias
in the output (`prefix` ~ `field_name` ~ `suffix`). NB: This prevents the output from being used in any context other than a select statement.

**Args:**
- `from` (required): a [Relation](https://docs.getdbt.com/reference/dbt-classes#relation) (a `ref` or `source`) that contains the list of columns you wish to select from
- `except` (optional, default=`[]`): The name of the columns you wish to exclude. (case-insensitive)
- `relation_alias` (optional, default=`''`): will prefix all generated fields with an alias (`relation_alias`.`field_name`).
- `prefix` (optional, default=`''`): will prefix the output `field_name` (`field_name as prefix_field_name`).
- `suffix` (optional, default=`''`): will suffix the output `field_name` (`field_name as field_name_suffix`).
- `relation_alias` (optional, default=`''`): will prefix all generated fields with an alias (`relation_alias`.`field_name`).
- `prefix` (optional, default=`''`): will prefix the output `field_name` (`field_name as prefix_field_name`).
- `suffix` (optional, default=`''`): will suffix the output `field_name` (`field_name as field_name_suffix`).

**Usage:**
```sql
Expand Down Expand Up @@ -1026,7 +1047,7 @@ This macro calculates the difference between two dates.
This macro splits a string of text using the supplied delimiter and returns the supplied part number (1-indexed).

**Args**:
- `string_text` (required): Text to be split into parts.
- `string_text` (required): Text to be split into parts.
- `delimiter_text` (required): Text representing the delimiter to split by.
- `part_number` (required): Requested part of the split (1-based). If the value is negative, the parts are counted backward from the end of the string.

Expand Down
80 changes: 62 additions & 18 deletions macros/sql/deduplicate.sql
Original file line number Diff line number Diff line change
Expand Up @@ -4,9 +4,7 @@

{%- macro default__deduplicate(relation, group_by, order_by=none, relation_alias=none) -%}

select
{{ dbt_utils.star(relation, relation_alias='deduped') | indent }}
from (
with row_numbered as (
select
_inner.*,
row_number() over (
Expand All @@ -16,8 +14,58 @@
{%- endif %}
) as rn
from {{ relation if relation_alias is none else relation_alias }} as _inner
) as deduped
where deduped.rn = 1
)

select
distinct data.*
from {{ relation }} as data
{#
-- Not all DBs will support natural joins but the ones that do include:
-- Oracle, MySQL, SQLite, Redshift, Teradata, Materialize, Databricks
-- Apache Spark, SingleStore, Vertica
-- Those that do not appear to support natural joins include:
-- SQLServer, Trino, Presto, Rockset, Athena
#}
natural join row_numbered
where row_numbered.rn = 1

{%- endmacro -%}

{# Redshift should use default instead of Postgres #}
{% macro redshift__deduplicate(relation, group_by, order_by=none, relation_alias=none) -%}

{{ return(dbt_utils.default__deduplicate(relation, group_by, order_by=order_by, relation_alias=relation_alias)) }}

{% endmacro %}

{#
-- Postgres has the `DISTINCT ON` syntax:
-- https://www.postgresql.org/docs/current/sql-select.html#SQL-DISTINCT
#}
{%- macro postgres__deduplicate(relation, group_by, order_by=none, relation_alias=none) -%}

select
distinct on ({{ group_by }}) *
from {{ relation if relation_alias is none else relation_alias }}
order by {{ group_by }}{{ ',' ~ order_by if order_by is not none else '' }}

{%- endmacro -%}

{#
-- Snowflake has the `QUALIFY` syntax:
-- https://docs.snowflake.com/en/sql-reference/constructs/qualify.html
#}
{%- macro snowflake__deduplicate(relation, group_by, order_by=none, relation_alias=none) -%}

select *
from {{ relation if relation_alias is none else relation_alias }}
qualify
row_number() over (
partition by {{ group_by }}
{% if order_by is not none -%}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the desire to not make this update a breaking change by requiring a parameter that was previously optional, but if order_by is not specified here, the query will fail on Snowflake.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#548 deals with that half

order by {{ order_by }}
{%- endif %}
) = 1

{%- endmacro -%}

Expand All @@ -29,18 +77,14 @@
{%- macro bigquery__deduplicate(relation, group_by, order_by=none, relation_alias=none) -%}

select
{{ dbt_utils.star(relation, relation_alias='deduped') | indent }}
from (
select
array_agg (
original
{% if order_by is not none -%}
order by {{ order_by }}
{%- endif %}
limit 1
)[offset(0)] as deduped
from {{ relation if relation_alias is none else relation_alias }} as original
group by {{ group_by }}
)
array_agg (
original
{% if order_by is not none -%}
order by {{ order_by }}
{%- endif %}
limit 1
)[offset(0)].*
from {{ relation if relation_alias is none else relation_alias }} as original
group by {{ group_by }}

{%- endmacro -%}