Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CT-1203] [Bug] mutiple references in a model causes duplicate entries in graph.node.depends_on #5877

Closed
2 tasks done
dave-connors-3 opened this issue Sep 19, 2022 · 3 comments
Labels
bug Something isn't working

Comments

@dave-connors-3
Copy link
Contributor

Is this a new bug in dbt-core?

  • I believe this is a new bug in dbt-core
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

For a model that contains two relation references (is that the right term? trying not to say "references" to conflate that with the ref function), the corresponding entry in the manifest.json / graph will have the same node in the depends_on list multiple times:

Model with same source

-- in models/star.sql
select
  {{ dbt_utils.star(source('fake_source', 'orders')) }}
from {{ source('fake_source', 'orders') }}

Corresponding JSON

"nodes": {
        "model.davebt.star": {
            "raw_sql": "select\n  {{ dbt_utils.star(source('fake_source', 'orders')) }}\nfrom {{ source('fake_source', 'orders') }}",
            "compiled": true,
            "resource_type": "model",
            "depends_on": {
                "macros": [
                    "macro.dbt_utils.star"
                ],
                "nodes": [
                    "source.davebt.fake_source.orders",
                    "source.davebt.fake_source.orders"
                ]
            },
...

Model with same {{ ref() }}

-- in models/downstream.sql
-- depends on {{ ref('star') }}


select * from {{ ref('star') }}

Corresponding JSON

...
        "model.davebt.downstream": {
            "raw_sql": "-- depends on {{ ref('star') }}\n\n\nselect * from {{ ref('star') }}",
            "compiled": true,
            "resource_type": "model",
            "depends_on": {
                "macros": [],
                "nodes": [
                    "model.davebt.star",
                    "model.davebt.star"
                ]
            },
...

Expected Behavior

This may in fact be expected! This came up in the dbt_project_evaluator project -- multiple sources joined together is a violation of the recommended best practices (stage those sources!) and the same source called twice with the star macro as above caused an unintentional error.

The fix on the package side is really straightforward, but got us wondering what the intended behavior here is!

Steps To Reproduce

  1. configure a source
  2. reference it twice in any model file
  3. reference model from step 2 in new model file multiple times
  4. check manifest.json entries for these nodes

Relevant log output

No response

Environment

- OS: mac
- Python: 3.8.4
- dbt: 1.2.1

Which database adapter are you using with dbt?

postgres, redshift, snowflake, bigquery, spark

Additional Context

No response

@dave-connors-3 dave-connors-3 added bug Something isn't working triage labels Sep 19, 2022
@github-actions github-actions bot changed the title [Bug] mutiple references in a model causes duplicate entries in graph.node.depends_on [CT-1203] [Bug] mutiple references in a model causes duplicate entries in graph.node.depends_on Sep 19, 2022
@lostmygithubaccount lostmygithubaccount self-assigned this Sep 20, 2022
@jtcohen6
Copy link
Contributor

jtcohen6 commented Jan 6, 2023

@dave-connors-3 Thanks for opening this, and sorry it's taken me forever to respond!

This one is totally in our power to change. So really, the question is: What is the expected behavior? What ought it be?

Here's a halfway position that makes sense to me, but I wouldn't be surprised if someone else found it more confusing:

  • The refs node property is where dbt "captures" calls to ref() at parse time. (Same for sources/source() + metrics/metric().) If those are called multiple times within one model, I'd expect those calls to appear multiple times, and in the same sequence in which they're called.
  • The depends_on node property is the result of some resolution & processing within dbt. It should be a clean, go-to place to quickly see which resources a given model depends on, and so it should be deduplicated — maybe even sorted (alphabetically), but that would just be a bonus.

The deduplication would be consistent with three other things already in the dbt-core codebase:

  • The DependsOn class already has an add_node property that checks to see if a node is already present before adding. But it looks like, in several places, we use depends_on.nodes.append instead of depends_on.nodes.add_node!
  • The depends_on.macros, which (after some quick manual testing to verify) appears to be the set of distinct macros called, even if a given macro is called multiple times
  • same_depends_on for exposures, which already deduplicates depends_on.nodes by wrapping it in set()

example

-- models/model_a.sql
select 1 as id
-- models/model_b.sql
select 1 as id
-- models/model_c.sql
select * from {{ ref('model_a') }}
union all
select * from {{ ref('model_b') }}
union all
select * from {{ ref('model_a') }}

current

    "model.testy.model_c": {
      ...
      "raw_code": "select * from {{ ref('model_a') }}\nunion all\nselect * from {{ ref('model_b') }}\nunion all\nselect * from {{ ref('model_a') }}",
      "language": "sql",
      "refs": [
        [
          "model_a"
        ],
        [
          "model_b"
        ],
        [
          "model_a"
        ]
      ],
      "sources": [],
      "metrics": [],
      "depends_on": {
        "macros": [],
        "nodes": [
          "model.testy.model_a",
          "model.testy.model_b",
          "model.testy.model_a"
        ]
      },
      "compiled_path": null
    },

proposed

    "model.testy.model_c": {
      ...
      "raw_code": "select * from {{ ref('model_a') }}\nunion all\nselect * from {{ ref('model_b') }}\nunion all\nselect * from {{ ref('model_a') }}",
      "language": "sql",
      "refs": [
        [
          "model_a"
        ],
        [
          "model_b"
        ],
        [
          "model_a"
        ]
      ],
      "sources": [],
      "metrics": [],
      "depends_on": {
        "macros": [],
        "nodes": [
          "model.testy.model_b",
          "model.testy.model_a"
        ]
      },
      "compiled_path": null
    },

@jtcohen6 jtcohen6 removed the triage label Jan 6, 2023
@jtcohen6 jtcohen6 removed their assignment Jan 6, 2023
@jtcohen6
Copy link
Contributor

jtcohen6 commented Jan 6, 2023

Gave this a quick spin, but ran into some nasty hologram issues when I tried turning a List into a set: 741e49b

@jtcohen6
Copy link
Contributor

Resolved by #7455

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants