## Column order matters

In Spark, tables store their partition columns last. In the scenario featured in our integration test, given a seed file `seed` and an incremental model `incremental_relation` partitioned on `id`, the resulting table moves the partition column `id` to the end of its schema. In subsequent incremental runs, dbt would attempt to insert the rows of `seed` into `incremental_relation`; since the columns in `seed` are in a different order from the columns in `incremental_relation`, the result is mismatched data.
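A minimal sketch of the reordering behavior, with a hypothetical two-column schema standing in for the real one:

```sql
-- Hypothetical minimal reproduction of Spark moving partition columns last.
create table dbt_jcohen.seed (id int, first_name string) using parquet;

create table dbt_jcohen.incremental_relation
using parquet
partitioned by (id)
as select * from dbt_jcohen.seed;

-- describe dbt_jcohen.incremental_relation now lists:
--   first_name  string
--   id          int     -- partition column, moved to the end

-- A positional insert, as dbt issues on incremental runs, therefore misaligns:
insert into table dbt_jcohen.incremental_relation
select * from dbt_jcohen.seed;  -- (id, first_name) values land in (first_name, id)
```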
## Why hasn't the integration test been failing?

The `equality` test between `seed` and `incremental_relation` has been passing because we didn't have the right quoting character defined for Spark. `"` is the default quoting character in dbt-core; in Spark, `"` encloses a string literal, not a column name.

Therefore, a query like
```sql
-- setup
with a as (
    select * from dbt_jcohen.incremental_relation
),
b as (
    select * from dbt_jcohen.seed
),
a_minus_b as (
    select "first_name", "last_name", "email", "gender", "ip_address", "id", "# Partition Information", "# col_name", "id" from a
    except
    select "first_name", "last_name", "email", "gender", "ip_address", "id", "# Partition Information", "# col_name", "id" from b
),
b_minus_a as (
    select "first_name", "last_name", "email", "gender", "ip_address", "id", "# Partition Information", "# col_name", "id" from b
    except
    select "first_name", "last_name", "email", "gender", "ip_address", "id", "# Partition Information", "# col_name", "id" from a
),
unioned as (
    select * from a_minus_b
    union all
    select * from b_minus_a
),
final as (
    select
        (select count(*) from unioned) +
        (select abs(
            (select count(*) from a_minus_b) -
            (select count(*) from b_minus_a)
        ))
        as count
)
select count from final
```
Looks okay prima facie. There are some metadata/comment column names included (`# Partition Information`, `# col_name`), which weirdly isn't erroring. I thought to run just the snippet
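Presumably this was the quoted select list from the query above, something like:

```sql
select "first_name", "last_name", "email", "gender", "ip_address", "id",
       "# Partition Information", "# col_name", "id"
from dbt_jcohen.incremental_relation
```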
Which returns
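Given that Spark reads each double-quoted name as a string literal, this would presumably be the same constant row repeated once per row scanned, roughly:

```
first_name  last_name  email  gender  ip_address  id  # Partition Information  # col_name  id
first_name  last_name  email  gender  ip_address  id  # Partition Information  # col_name  id
...
```

Both `a` and `b` yield these identical literal rows, so `a_minus_b` and `b_minus_a` are always empty and the equality test can never fail.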
Yeah.
## Solutions
- Use `` ` `` instead of `"` as the quoting character (handled in Pull the owner from the DESCRIBE EXTENDED #39)
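For reference, the difference between the two quoting characters in Spark SQL:

```sql
select "id" from dbt_jcohen.seed;  -- returns the literal string 'id', once per row
select `id` from dbt_jcohen.seed;  -- returns the values of the id column
```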