Skip to content

Latest commit

 

History

History
124 lines (84 loc) · 7 KB

databricks-workflows.md

File metadata and controls

124 lines (84 loc) · 7 KB

Running a dbt project as a job in Databricks Workflows

Databricks Workflows is a highly-reliable, managed orchestrator that lets you author and schedule DAGs of notebooks, Python scripts as well as dbt projects as production jobs.

In this guide, you will learn how to update an existing dbt project to run as a job, retrieve dbt run artifacts using the Jobs API and debug common issues.

Overview

When you run a dbt project as a Databricks Job, the dbt CLI runs on a single-node Automated Cluster. The SQL generated by dbt runs on a serverless SQL warehouse.

Prerequisites

  • An existing dbt project version controlled in git
  • Access to a Databricks workspace
  • Access to an existing interactive cluster with dbt-core and dbt-databricks libraries installed or CAN_MANAGE permissions to install the dbt-core and dbt-databricks as cluster libraries OR
  • Access to serverless SQL warehouses. See documentation to learn more about this feature and regional availability.
  • Files in Repos must be enabled and is only supported on Databricks Runtime (DBR) 8.4+ or DBR 11+ depending on the configuration. Please make sure the cluster has the appropriate DBR version.
  • Install and configure the Databricks CLI
  • Install jq, a popular open source tool for parsing JSON from the command line

Note: previously dbt tasks on Databricks Workflows could target jobs clusters for compute. That is no longer supported. Job clusters can only be used for running the dbt-cli.

Run dbt as a production job

In this step, you will create a job that will run the dbt project on a schedule.

Connect Databricks to Git

The dbt task only supports retrieve dbt projects from Git. Please follow the documentation to connect Databricks to Git.

Create a job

  1. Log in to your Databricks workspace
  2. Click the Data Science & Engineering persona in the left navigation bar
  3. Click Workflows
  4. Click Create Job
  5. Click Type and choose dbt
  6. Click Edit next to "Git provider"
  7. In the dialog, enter your Git repository URL, and choose the Git provider. Also, choose a branch / tag / commit e.g. main.
  8. If your dbt project is in the root of the git repository, leave the Path field empty. Otherwise, provide the relative path e.g. /my/relative/path.
  9. You can customize dbt commands as needed, including any flag accepted by the dbt CLI.

dbt-task-type

  1. Under SQL warehouse, choose the serverless SQL warehouse where SQL generated by dbt will run. You can optionally choose a custom catalog and schema where tables and views will be created.
  2. By default, Databricks installs a recent version of dbt-databricks from PyPi, which will also install dbt-spark as well as dbt-core. You can customize this version if you wish.
  3. You can customize the Automated Cluster if you wish by clicking Edit in the dbt CLI cluster dropdown.
  4. Click Save

Run the job and view dbt output

You can now run your newly-saved job and see its output.

  1. Click Run Now on the notification that shows up when you save the job
  2. Click the active run and see dbt output. Note that dbt output is not real-time, it lags behind dbt's progress by several seconds to a minute.

Retrieve dbt artifacts using the Jobs API

A dbt run generates useful artifacts which you may want to retrieve for analysis and more. Databricks saves the contents of /logs and /target directories as a compressed archive which you can retrieve using the Jobs API.

It is currently not possible to refer to a previous run's artifacts e.g. using the --state flag. You can, however, include a known good state in your repository.

dbt-artifacts is a popular dbt package for ingesting dbt artifacts into tables. This is currently not supported on Databricks. Please contact us if you are interested in Databricks supporting this package.

Follow these steps to retrieve dbt artifacts from a job run:

  1. Go to a job in Databricks and copy the Task Run ID. It appears in the sidebar under Task run details when you click on a run.
  2. Enter the following command in your terminal:
$ databricks jobs configure --version=2.1
$ databricks runs get --run-id TASK_RUN_ID | jq .tasks
  1. The above command will return an array of tasks with their run_ids. Find the dbt task's run_id and run this command:
$ DBT_ARTIFACT_URL="$(databricks runs get-output --run-id DBT_TASK_RUN_ID | jq -r .dbt_output.artifacts_link)"
$ curl $DBT_ARTIFACT_URL --output artifact.tar.gz

On macOS or Linux, you can run the following command to expand and decompress the archive:

$ tar -xvf artifact.tar.gz

Common issues

Unable to connect to Databricks

  • If you do not use the automatically-generated profiles.yml, check your Personal Access Token (PAT). It must not be expired.
  • Consider adding dbt debug as the first command. This may give you a clue about the failure.

dbt cannot find my dbt_project.yml file

If you have checked out the Git repository before enabling the Files in Repos feature, the checkout might be cached invalidly. You need to push a dummy commit to your repository to force a fresh checkout.

Connecting to different sources (custom profile)

By default the dbt task type will connect to the serverless SQL warehouse specified in the task without any configuration changes or need to check in any secrets. It does so by generating a default profiles.yml and telling dbt to use it. We have no restrictions on connection to any other dbt targets such as Databricks SQL, Amazon Redshift, Google BigQuery, Snowflake, or any other supported adapter.

The automatically generated profile can be overridden by specifying an alternative profiles directory in the dbt command using --profiles-dir <dir>, where the path of the <dir> should be a relative path like . or ./my-directory.

If you'd like to connect to multiple outputs and including the current Interactive Cluster as one of those, the following configuration can be used without exposing any secrets:

Note: You cannot connect to an Automated Cluster.

databricks_demo:
  target: databricks_cluster
  outputs:
    databricks_cluster:
      type: databricks
      connect_retries: 5
      connect_timeout: 180
      schema: "<your-schema>"
      threads: 8 # This can be increased or decreased to control the parallism
      host: "{{ env_var('DBT_HOST') }}"
      http_path: "sql/protocolv1/o/{{ env_var('DBT_ORG_ID') }}/{{ env_var('DBT_CLUSTER_ID') }}"
      token: "{{ env_var('DBT_ACCESS_TOKEN') }}"