Databricks Workflows is a highly reliable, managed orchestrator that lets you author and schedule DAGs of notebooks, Python scripts, and dbt projects as production jobs.
In this guide, you will learn how to update an existing dbt project to run as a job, retrieve dbt run artifacts using the Jobs API, and debug common issues.
When you run a dbt project as a Databricks Job, the dbt CLI runs on a single-node Automated Cluster. The SQL generated by dbt runs on a serverless SQL warehouse.
- An existing dbt project version controlled in Git
- Access to a Databricks workspace
- Access to an existing interactive cluster with the `dbt-core` and `dbt-databricks` libraries installed, or `CAN_MANAGE` permission to install `dbt-core` and `dbt-databricks` as cluster libraries, OR
- Access to serverless SQL warehouses. See the documentation to learn more about this feature and regional availability.
- Files in Repos must be enabled. It is only supported on Databricks Runtime (DBR) 8.4+ or DBR 11+ depending on the configuration, so please make sure the cluster has an appropriate DBR version.
- Install and configure the Databricks CLI
- Install jq, a popular open source tool for parsing JSON from the command line
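For example, on macOS you could install and configure both tools as follows. This is a sketch that assumes the pip-distributed Databricks CLI and Homebrew; adapt it to your platform:
$ pip install databricks-cli   # Databricks CLI
$ databricks configure --token # prompts for your workspace URL and a personal access token
$ brew install jq              # on Debian/Ubuntu: sudo apt-get install jq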
Note: previously, dbt tasks on Databricks Workflows could target job clusters for compute. That is no longer supported; job clusters can only be used to run the dbt CLI.
In this step, you will create a job that will run the dbt project on a schedule.
The dbt task only supports retrieving dbt projects from Git. Please follow the documentation to connect Databricks to Git.
- Log in to your Databricks workspace
- Click the Data Science & Engineering persona in the left navigation bar
- Click Workflows
- Click Create Job
- Click Type and choose dbt
- Click Edit next to "Git provider"
- In the dialog, enter your Git repository URL and choose the Git provider. Also choose a branch / tag / commit, e.g. `main`.
- If your dbt project is in the root of the Git repository, leave the Path field empty. Otherwise, provide the relative path, e.g. `/my/relative/path`.
- You can customize dbt commands as needed, including any flag accepted by the dbt CLI (see the example command list after these steps).
- Under SQL warehouse, choose the serverless SQL warehouse where SQL generated by dbt will run. You can optionally choose a custom catalog and schema where tables and views will be created.
- By default, Databricks installs a recent version of `dbt-databricks` from PyPI, which also installs `dbt-spark` and `dbt-core`. You can customize this version if you wish.
- You can customize the Automated Cluster if you wish by clicking Edit in the dbt CLI cluster dropdown.
- Click Save
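As an illustration, the Commands box for the dbt task might contain something like the following; the model selector `my_first_model+` is hypothetical, so replace it with your own selection:
dbt deps
dbt seed
dbt run --select my_first_model+
dbt test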
You can now run your newly-saved job and see its output.
- Click Run Now on the notification that shows up when you save the job
- Click the active run and see dbt output. Note that dbt output is not real-time; it lags behind dbt's progress by several seconds to a minute.
A dbt run generates useful artifacts which you may want to retrieve for analysis and more. Databricks saves the contents of the `/logs` and `/target` directories as a compressed archive which you can retrieve using the Jobs API.
It is currently not possible to refer to a previous run's artifacts, e.g. using the `--state` flag. You can, however, include a known good state in your repository.
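As a sketch of that approach, if you commit the artifacts of a known good run to a directory in your repository (the name `./prod-run-artifacts` below is only an example), you can point dbt's state selection at it in the task's commands:
dbt run --select state:modified+ --state ./prod-run-artifacts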
dbt-artifacts is a popular dbt package for ingesting dbt artifacts into tables. This is currently not supported on Databricks. Please contact us if you are interested in Databricks supporting this package.
Follow these steps to retrieve dbt artifacts from a job run:
- Go to a job in Databricks and copy the Task Run ID. It appears in the sidebar under Task run details when you click on a run.
- Enter the following command in your terminal:
$ databricks jobs configure --version=2.1
$ databricks runs get --run-id TASK_RUN_ID | jq .tasks
- The above command returns an array of tasks with their `run_id`s. Find the dbt task's `run_id` and run this command:
$ DBT_ARTIFACT_URL="$(databricks runs get-output --run-id DBT_TASK_RUN_ID | jq -r .dbt_output.artifacts_link)"
$ curl $DBT_ARTIFACT_URL --output artifact.tar.gz
On macOS or Linux, you can run the following command to expand and decompress the archive:
$ tar -xvf artifact.tar.gz
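Once extracted, you can inspect the standard dbt artifacts with jq. For example, assuming the archive expands to a target/ directory, this command lists the status of each node from run_results.json:
$ jq '.results[] | {unique_id, status}' target/run_results.json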
- If you do not use the automatically-generated `profiles.yml`, check your Personal Access Token (PAT). It must not be expired.
- Consider adding `dbt debug` as the first command, as shown below. This may give you a clue about the failure.
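For example, a command list that starts with `dbt debug` validates the connection and profile before any models run:
dbt debug
dbt run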
If you checked out the Git repository before enabling the Files in Repos feature, the cached checkout may be invalid. You need to push a dummy commit to your repository to force a fresh checkout.
By default, the dbt task type connects to the serverless SQL warehouse specified in the task without any configuration changes or the need to check in any secrets. It does so by generating a default `profiles.yml` and telling dbt to use it. There are no restrictions on connecting to other dbt targets such as Databricks SQL, Amazon Redshift, Google BigQuery, Snowflake, or any other supported adapter.
The automatically generated profile can be overridden by specifying an alternative profiles directory in the dbt command using `--profiles-dir <dir>`, where `<dir>` should be a relative path like `.` or `./my-directory`.
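For instance, if a custom profiles.yml is checked into a my-profiles directory at the project root (a hypothetical name), the dbt command in the task could be:
dbt run --profiles-dir ./my-profiles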
If you'd like to connect to multiple outputs, including the current Interactive Cluster as one of them, the following configuration can be used without exposing any secrets:
Note: You cannot connect to an Automated Cluster.
databricks_demo:
  target: databricks_cluster
  outputs:
    databricks_cluster:
      type: databricks
      connect_retries: 5
      connect_timeout: 180
      schema: "<your-schema>"
      threads: 8 # This can be increased or decreased to control the parallelism
      host: "{{ env_var('DBT_HOST') }}"
      http_path: "sql/protocolv1/o/{{ env_var('DBT_ORG_ID') }}/{{ env_var('DBT_CLUSTER_ID') }}"
      token: "{{ env_var('DBT_ACCESS_TOKEN') }}"