Skip to content

nsharma-25/dbt-tpch-example

Repository files navigation

This is a dbt sample project for Hive using the TPC-H example dataset and queries. More details about TPC-H can be found on TPC Website and in this specification document.

In this project we are using 1GB TPC-H dataset that is present in the Hive warehouse.

Getting Started

It is recommended to use venv to create a Python virtual environment for the demo.

Requirements

  • 3.8 <= Python <= 3.11
  • dbt-core
  • git

Install

Start by cloning this repo

git clone https://github.com/nsharma-25/dbt-tpch-example.git

Next install dbt-core and dbt-hive

pip install -r requirements.txt

Install the required packages mentioned in package.yml file:

dbt deps

Configure

Create a dbt profile in ~/.dbt/profiles.yml

The sample profile looks like this:

dbt_tpch_test:
  outputs:
    dev_hive:
      type: hive
      use_http_transport: true
      use_ssl: true
      auth_type: ldap 
      schema: <schema_name>
      threads: 4
      user: <user>
      password: <password>
      host: <hive host name>
      port: 443
      http_path: cliservice
  target: dev_hive

Test the profile with dbt debug

Using dbt

dbt seed

To load the data present in the seeds folder, run

dbt seed

22 files are present in the Seeds folder, where each file conatins the expected output of the tpch queries.

Everytime to you make a change in seed data, you must run dbt seed again to update it. You can also run the dbt seed command with the --select flag like

dbt seed --select <seed-name>

dbt run

There are 3 sets of models in this project. Each set of models contain their own schema.yml file

  • Firstly, we have raw. Our raw models make use of Sources. This is data that already exists in our database that dbt needs to refer to.

    Our raw models are defined in models/raw/.

  • Next, we have staging. These are Models. Our staging models use the source() method to refer to the Sources we defined in our raw models. The staging models are intermediate models created over our raw data to rename the columns. The staging models have been materialized as tables.

    Our 7 staging models are defined in models/staging/

  • Lastly, we have tpch_queries. These are Models that use the ref() method to refer to the staging models. These are the models based on the standard tpch_queries and have been materialized as View.

    Our 22 tpch_queries models are present in the models/tpch_queries.

Run the models with

dbt run  

Running a specifc model

dbt run --select <model-name>

dbt test

Our project has been configured with a couple of tests. As of this project, we are using a package called dbt-utils. Macros for the custom tests can also be added to the macros/ folder. More info about tests can be found here.

Few tests that has been included in this project are checking the column values to be unique, not null and also comparing the equality of models in this case the equality of actual output tables and the expected output tables (which have been included in the seeds/expected folder).

These tests can be mentioned in the schema of the models. Specific test can be run with --select flag, like

dbt test --select <test-name>

Example:

tpch_queries_schema.yml

models:
- name: tpch_q01
  tests:
    - dbt_utils.equality:
        compare_model: ref('q01_output')

stg_tpch_schema.yml

models:
  - name: stg_tpch_customer
    columns:
      - name: customer_key
        tests:
          - unique
          - not_null

Run the models again

To get the latest data into our mart model, we must run the models again.

Run the models again with

dbt run

Resources:

  • Learn more about dbt in the docs
  • Check out Discourse for commonly asked questions and answers
  • Join the chat on Slack for live discussions and support
  • Find dbt events near you
  • Check out the blog for the latest news on dbt's development and best practices

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published