For the connector to work once installed, your data sources, Azure Databricks, and Microsoft Purview are assumed to be set up and running. Installing the base connector requires that you have already configured the Databricks CLI with the Azure Databricks platform.
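As a quick sanity check (a minimal sketch, assuming the legacy Databricks CLI that also provides the `dbfs` command used later in this guide):

```bash
# One-time setup: prompts for your workspace URL and a personal access token
databricks configure --token

# A successful listing confirms the CLI can authenticate to your workspace
databricks workspace ls /
dbfs ls dbfs:/
```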
- Clone the repository into Azure Cloud Shell
- Run the installation script
- Post Installation
- Download and configure the OpenLineage Spark agent with your Azure Databricks clusters
- Install OpenLineage on Your Databricks Cluster
- Support Extracting Lineage from Databricks Jobs
- (Optional) Configure Global Init Script
From the Azure Portal
- At the top of the page, click the Cloud Shell icon. Click "Confirm" if the "Switch to PowerShell in Cloud Shell" pop-up appears.
- Change directory and clone this repository into the clouddrive directory using the latest release tag (e.g. 2.x.x). If this directory is not available, please follow these steps to mount a new clouddrive.

```bash
cd clouddrive
git clone -b <release_tag> https://github.com/microsoft/Purview-ADB-Lineage-Solution-Accelerator.git
```
Note: We highly recommend cloning from the release tags listed here. Clone the main branch only when using nightly builds. By using a nightly build (i.e. the latest commit on main), you gain access to newer, experimental features; however, those features may change before the next official release. If you are testing a deployment for production, please clone using a release tag.
- Set the Azure subscription you want to use:

```bash
az account set --subscription <SubscriptionID>
```
- If needed, create a new working Resource Group:

```bash
az group create --name <ResourceGroupName> --location <ResourceGroupLocation>
```
- Change into the deployment directory (you should already be in the ~/clouddrive directory):

```bash
cd Purview-ADB-Lineage-Solution-Accelerator/deployment/infra/
```
Note: If your organization requires private endpoints for Azure Storage and Azure Event Hubs, you may need to follow the private endpoint guidance and modify the provided ARM template.
- Deploy solution resources:

```bash
az deployment group create \
  --resource-group <ResourceGroupName> \
  --template-file "./newdeploymenttemp.json" \
  --parameters purviewName=<ExistingPurviewServiceName>
```
- A prompt will ask you to provide the following:

  - Prefix (this is added to service names)
  - Client ID & Secret (from the App ID required as a prerequisite)
  - Resource Tags (optional, in the following format: {"Name":"Value","Name2":"Value2"}; otherwise leave blank)

- This deployment will take approximately 5 minutes.
Note: At this point, you should confirm the resources deployed successfully. In particular, check the Azure Function: on its Functions tab, you should see an OpenLineageIn and a PurviewOut function. If you see an error like `Microsoft.Azure.WebJobs.Extensions.FunctionMetadataLoader: The file 'C:\home\site\wwwroot\worker.config.json' was not found.`, restart (or stop and start) the Function app to resolve the issue. Lastly, check the Azure Function's Configuration tab and confirm that all the Key Vault-referenced app settings have a green check mark. If not, wait an additional 2-5 minutes and refresh the screen. If the Key Vault references are still not all green, check that the Key Vault has an access policy referencing the Azure Function.
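If you prefer to do these checks from the command line, a minimal sketch using the Azure CLI (the resource names below are placeholders):

```bash
# Inspect the Function app's settings, including the Key Vault references
az functionapp config appsettings list \
  --name <FunctionAppName> \
  --resource-group <ResourceGroupName> \
  --output table

# Restart the Function app if the worker.config.json error appears
az functionapp restart --name <FunctionAppName> --resource-group <ResourceGroupName>
```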
- If needed, change into the deployment directory:

```bash
cd ~/clouddrive/Purview-ADB-Lineage-Solution-Accelerator/deployment/infra/
```
- (Manual Configuration) After the installation finishes, you will need to add the service principal to the data curator role in your Purview resource. Follow this documentation to Set up Authentication using Service Principal, with the Application Identity you created as a prerequisite to installation.
- Install the necessary types into your Purview instance by running the following commands in Bash. You will need:

  - your Tenant ID
  - your Client ID (used when you ran the installation script above)
  - your Client Secret (used when you ran the installation script above)
```bash
purview_endpoint="https://<enter_purview_account_name>.purview.azure.com"
TENANT_ID="<TENANT_ID>"
CLIENT_ID="<CLIENT_ID>"
CLIENT_SECRET="<CLIENT_SECRET>"

acc_purview_token=$(curl https://login.microsoftonline.com/$TENANT_ID/oauth2/token \
  --data "resource=https://purview.azure.net&client_id=$CLIENT_ID&client_secret=$CLIENT_SECRET&grant_type=client_credentials" \
  -H Metadata:true -s | jq -r '.access_token')

purview_type_resp_custom_type=$(curl -s -X POST $purview_endpoint/catalog/api/atlas/v2/types/typedefs \
  -H "Authorization: Bearer $acc_purview_token" \
  -H "Content-Type: application/json" \
  -d @Custom_Types.json)

echo $purview_type_resp_custom_type
```
If you need a PowerShell alternative, see the docs.
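To verify the upload, you can read a type definition back from the catalog. This is a sketch reusing the `$purview_endpoint` and `$acc_purview_token` variables above; the type name `databricks_job` is an assumption for illustration, so check Custom_Types.json for the exact names shipped with your release.

```bash
# Fetch one custom type definition by name to confirm it was registered
# (databricks_job is assumed; substitute a name from Custom_Types.json)
curl -s -X GET "$purview_endpoint/catalog/api/atlas/v2/types/typedef/name/databricks_job" \
  -H "Authorization: Bearer $acc_purview_token" | jq .
```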
You will need the default API / Host key configured on your Function app. To retrieve this:
- To retrieve the ADB-WORKSPACE-ID, in the Azure portal, navigate to the Azure Databricks (ADB) service in your resource group. In the Overview section, copy the ADB workspace identifier from the URL.
- To retrieve the FUNCTION_APP_NAME, go back to the resource group view and click on the Azure Function. On the Overview tab, copy the URL and save it for the next steps.
- To retrieve the FUNCTION_APP_DEFAULT_HOST_KEY, go back to the resource group view and click on the Azure Function. In the right pane, go to 'App Keys', then click the show-values icon and copy the default key.
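Alternatively, the host keys can be read from the command line; a sketch using the Azure CLI (names are placeholders):

```bash
# Print the Function app's default host key
az functionapp keys list \
  --name <FUNCTION_APP_NAME> \
  --resource-group <ResourceGroupName> \
  --query "functionKeys.default" \
  --output tsv
```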
Follow the instructions below and refer to the OpenLineage Databricks Install Instructions to enable OpenLineage in Databricks.
- Download the OpenLineage-Spark 0.18.0 jar from Maven Central.
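For example, from a shell (a sketch assuming the standard Maven Central path for the `io.openlineage:openlineage-spark` artifact):

```bash
# Download the OpenLineage-Spark 0.18.0 jar from Maven Central
wget https://repo1.maven.org/maven2/io/openlineage/openlineage-spark/0.18.0/openlineage-spark-0.18.0.jar
```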
- Create an init script named open-lineage-init-script.sh:

```bash
#!/bin/bash
STAGE_DIR="/dbfs/databricks/openlineage"
cp -f $STAGE_DIR/openlineage-spark-*.jar /mnt/driver-daemon/jars || { echo "Error copying Spark Listener library file"; exit 1;}
cat << 'EOF' > /databricks/driver/conf/openlineage-spark-driver-defaults.conf
[driver] {
  "spark.extraListeners" = "io.openlineage.spark.agent.OpenLineageSparkListener"
}
EOF
```
Warning: If you are using a Windows machine, be sure to save the init script with Linux Line Feed (LF) line endings and NOT Microsoft Windows Carriage Return and Line Feed (CRLF) endings. Using a tool like VS Code or Notepad++, you can change the line endings by selecting CRLF/LF in the bottom right-hand corner of the editor. If you do not have line feed endings, your cluster will fail to start due to an init script error.
- Upload the init script and jar to DBFS using the Databricks CLI:

```bash
dbfs mkdirs dbfs:/databricks/openlineage
dbfs cp --overwrite ./openlineage-spark-*.jar dbfs:/databricks/openlineage/
dbfs cp --overwrite ./open-lineage-init-script.sh dbfs:/databricks/openlineage/open-lineage-init-script.sh
```
Note If you choose to use the Databricks Filestore UI instead of the CLI to upload the jar, the UI will replace hyphens (-) with underscores (_). This will cause the init script to fail as it expects hyphens in the jar's file name. Either use the Databricks CLI to ensure the file name is consistent or update the init script to reflect the underscores in the jar name.
- Create or modify an interactive or job cluster in your Databricks workspace. Under Advanced Options, add the following to the Spark Configuration:

```
spark.openlineage.version v1
spark.openlineage.namespace <ADB-WORKSPACE-ID>#<DB_CLUSTER_ID>
spark.openlineage.host https://<FUNCTION_APP_NAME>.azurewebsites.net
spark.openlineage.url.param.code <FUNCTION_APP_DEFAULT_HOST_KEY>
```
- The ADB-WORKSPACE-ID value should be the first part of the URL when navigating to Azure Databricks, not the workspace name. For example, if the URL is https://adb-4630430682081461.1.azuredatabricks.net/, the ADB-WORKSPACE-ID should be adb-4630430682081461.1.
- You should store the FUNCTION_APP_DEFAULT_HOST_KEY in a secure location. If you will be configuring individual clusters with the OpenLineage agent, you can use Azure Databricks secrets to store the key in Azure Key Vault and retrieve it as part of the cluster initialization script. For more information, see the Azure documentation.

After configuring the secret storage, the API key for OpenLineage can be configured in the Spark config, as in the following example:

```
spark.openlineage.url.param.code {{secrets/secret_scope/Ol-Output-Api-Key}}
```
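For reference, a Key Vault-backed secret scope can be created with the Databricks CLI; a minimal sketch (the scope name matches the example above, all other names and IDs are placeholders):

```bash
# Create a Databricks secret scope backed by Azure Key Vault
databricks secrets create-scope --scope secret_scope \
  --scope-backend-type AZURE_KEYVAULT \
  --resource-id "/subscriptions/<SubscriptionID>/resourceGroups/<ResourceGroupName>/providers/Microsoft.KeyVault/vaults/<KeyVaultName>" \
  --dns-name "https://<KeyVaultName>.vault.azure.net/"
```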
- Add a reference to the uploaded init script, dbfs:/databricks/openlineage/open-lineage-init-script.sh, in the Init Scripts section of the Advanced Options.
- At this point, you can run a Databricks notebook on an "all-purpose cluster" in your configured workspace and observe lineage in Microsoft Purview once the notebook has finished running all cells.
- If you do not see any lineage, please follow the steps in the troubleshooting guide.
- To support lineage from Databricks jobs, see the following section.
To support Databricks Jobs, you must add the service principal to your Databricks workspace. To use the scripts below, you must authenticate to Azure Databricks using either a personal access token or an AAD token. The snippets below assume you have generated an access token.
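If you want to use an AAD token instead, one common approach is the Azure CLI (a sketch; the GUID below is the fixed resource ID for the Azure Databricks service):

```bash
# Acquire an AAD access token for the Azure Databricks resource
DATABRICKS_ACCESS_TOKEN=$(az account get-access-token \
  --resource 2ff814a6-3304-4ab8-85cb-cd0e6f879c1d \
  --query accessToken --output tsv)
```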
- Add your Service Principal to Databricks as a User:

  - Create a file named add-service-principal.json that contains:

```json
{
  "schemas": [
    "urn:ietf:params:scim:schemas:core:2.0:ServicePrincipal"
  ],
  "applicationId": "<azure-application-id>",
  "displayName": "<display-name>",
  "groups": [
    { "value": "<group-id>" }
  ],
  "entitlements": [
    { "value": "allow-cluster-create" }
  ]
}
```
  - Provide a group id by executing the Groups Databricks API and extracting a group id:

```bash
curl -X GET \
  https://<databricks-instance>/api/2.0/preview/scim/v2/Groups \
  --header 'Authorization: Bearer DATABRICKS_ACCESS_TOKEN' \
  | jq .
```

You may use the admin group id or create a separate group to isolate the service principal.
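For example, to extract just the id of the built-in admins group (a sketch assuming the SCIM response's Resources array):

```bash
# Pull the id of the "admins" group out of the SCIM Groups response
curl -s -X GET \
  https://<databricks-instance>/api/2.0/preview/scim/v2/Groups \
  --header 'Authorization: Bearer DATABRICKS_ACCESS_TOKEN' \
  | jq -r '.Resources[] | select(.displayName == "admins") | .id'
```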
  - Execute the following bash command after the file above has been created and populated:

```bash
curl -X POST \
  https://<databricks-instance>/api/2.0/preview/scim/v2/ServicePrincipals \
  --header 'Content-type: application/scim+json' \
  --header 'Authorization: Bearer DATABRICKS_ACCESS_TOKEN' \
  --data @add-service-principal.json \
  | jq .
```
- Assign the Service Principal as a contributor to the Databricks Workspace.

- At this point, you can run a Databricks job on a "job cluster" in your configured workspace and observe lineage in Microsoft Purview once the Databricks job has finished.

- If you do not see any lineage, please follow the steps in the troubleshooting guide.
You can also configure the OpenLineage listener to run globally, so that any cluster that is created automatically runs the listener. To do this, you can use a global init script.
Note: Global init scripts cannot currently use values from the Azure Databricks Key Vault integration mentioned above. If you use a global init script, this key would need to be retrieved in the notebooks themselves or hardcoded into the global init script.
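As one possible approach (a sketch against the Databricks Global Init Scripts API; the script name is a placeholder and DATABRICKS_HOST is assumed to hold your workspace URL), the same init script can be registered globally:

```bash
# Register open-lineage-init-script.sh as a global init script
# (the API expects the script body base64-encoded)
curl -X POST "$DATABRICKS_HOST/api/2.0/global-init-scripts" \
  --header 'Authorization: Bearer DATABRICKS_ACCESS_TOKEN' \
  --header 'Content-Type: application/json' \
  --data "{\"name\": \"open-lineage-init-script\", \"script\": \"$(base64 -w0 open-lineage-init-script.sh)\", \"enabled\": true, \"position\": 0}"
```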