Inconsistent lineage between Databricks Notebooks and ADLS assets #123

Closed
svnnl opened this issue Nov 30, 2022 · 2 comments
Labels
bug Something isn't working

Comments


svnnl commented Nov 30, 2022

Describe the bug
For a PoC, we are testing the Lineage Connector and aim to create lineage between Databricks Notebooks and external tables stored in ADLS, similar to the lineage graph shown in the README (see image below).
image

Our goal is to get the lineage ADLS -> Notebook -> ADLS, but we are not able to get this consistently. In most cases, the lineage graph looks like the screenshot below, where the output is the dummy entity. In one attempt, we created the table, ran the ADLS scan so that the Resource Set could be found, and then ran the script again, which transformed the dummy entity into the ADLS resource set. Unfortunately, after several more attempts, we have not been able to reproduce that result. Below is a simple example of what we're trying to achieve.

image

We see that there is an open PR related to this (#69), but it has apparently not been updated for a couple of months, so we are wondering what its status is.

Alternatively, if you know how to achieve our goal in a different way, we are happy to hear it.

To Reproduce
Steps to reproduce the behavior:

  1. Create a .csv file containing random integers as keys and store it in ADLS.
  2. Run the code below in a Notebook:

     %sql CREATE TABLE IF NOT EXISTS customer_keys (key bigint) LOCATION 'abfss://<...>@<..>.dfs.core.windows.net/customer_keys'

     new_keys = spark.read.format('csv').option('header', 'true').load('abfss://<...>@<..>.dfs.core.windows.net/new_keys.csv')
     new_keys.createOrReplaceTempView('new_keys')

     %sql INSERT INTO customer_keys SELECT * FROM new_keys

  3. Run the ADLS scan in Purview so that customer_keys is found as an ADLS asset.
  4. Run the code above again.
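
For convenience, steps 2 and 4 can also be sketched as a single Python cell, so the same code is easy to rerun after the Purview scan. This is only a rough sketch of our notebook: `build_statements` and `run_repro` are hypothetical helper names (not part of the Lineage Connector), `spark` is assumed to be the session the Databricks notebook provides, and the `abfss://` paths stay as placeholders.

```python
def build_statements(table, table_location):
    """Return the two SQL statements from step 2, in execution order.

    Hypothetical helper: the statement text mirrors the notebook cells above.
    """
    create = (
        f"CREATE TABLE IF NOT EXISTS {table} (key bigint) "
        f"LOCATION '{table_location}'"
    )
    insert = f"INSERT INTO {table} SELECT * FROM new_keys"
    return [create, insert]


def run_repro(spark, csv_path, table_location, table="customer_keys"):
    """Run step 2 once; call it again after the ADLS scan for step 4."""
    create, insert = build_statements(table, table_location)

    # Create the external table at the ADLS location if it does not exist yet.
    spark.sql(create)

    # Read the CSV with a header row and expose it as the temp view 'new_keys'.
    new_keys = (
        spark.read.format("csv")
        .option("header", "true")
        .load(csv_path)
    )
    new_keys.createOrReplaceTempView("new_keys")

    # Append the new keys; this is the write we expect to show up as lineage.
    spark.sql(insert)
```

In the notebook this would be called as `run_repro(spark, 'abfss://<...>@<..>.dfs.core.windows.net/new_keys.csv', 'abfss://<...>@<..>.dfs.core.windows.net/customer_keys')`, once before the scan and once after it.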

Expected behavior

We expect the dummy entity to be replaced by the ADLS asset. We achieved this once, but have not been able to reproduce it.

Desktop (please complete the following information):

  • OS: Windows
  • OpenLineage Version: 0.17.0
  • Databricks Runtime Version: 11.3 LTS (Spark 3.3) -> Lineage works!
  • Cluster Type: Interactive
  • Cluster Mode: Shared DS3_v2
  • Using Credential Passthrough: No
svnnl added the bug label Nov 30, 2022
Collaborator

wjohnson commented Dec 2, 2022

@svnnl thank you for your interest in the solution accelerator!

Just to be sure, customer_keys is a Delta table? That would be the default for DBR 11.3 but I wanted to confirm with you.

We are working on a release that improves the matching to existing assets and it should be available by Monday.

If you want to try it out, please see PR #124. You could deploy the Azure Function via VS Code to your existing Azure Function application and see if the newest changes correct your scenario.

Author

svnnl commented Dec 2, 2022

@wjohnson Indeed, customer_keys is a Delta table.
We'll await the release then. Hopefully it solves the issue, thanks!

svnnl closed this as completed Dec 6, 2022