[CT-1353] [Bug] Performance issues with DBT build when nodes more than 2K in a project #6073

china-cse · 2022-10-16T00:45:13Z

Is this a new bug in dbt-core?

I believe this is a new bug in dbt-core
I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

In case project has more than 2K-3K nodes(includes models, tests etc), dbt build taking time to start its first model. Just to build a single model itself takes 15 mins. Have disabled analytics tracking event thought that could have caused but still no luck. However, dbt run is faster but I can't go with just run as tests required immediately before proceed to next model in DAG.

Expected Behavior

Models could be more depends project size but building single model not supposed to take 15 mins and it does not look realistic when we have multiple dbt commands to run.

Steps To Reproduce

version: dbt 1.0 till latest release 1.3

total nodes count: 3K+ including models, tests, snapshots, seeds

build single model:
$dbt build -s test_model

Relevant log output

No response

Environment

- OS: Windows/DBT Cloud
- Python: 3.7.4
- dbt: 1.2

Which database adapter are you using with dbt?

snowflake

Additional Context

No response

jtcohen6 · 2022-10-16T19:33:45Z

Thanks for opening @china-cse!

This is an issue we saw after introducing the dbt build command last year. The fact is that build requires constructing a more complex DAG, with tests as additional upstream dependencies, versus the simpler DAG used for other commands. After the v0.21 release, we decided the right move was to create genuinely different DAGs for build and non-build commands.

Some ideas we could explore to improve performance here:

A more-performant networkx algorithm (in the vein of Perf improvement to subgraph selection #4155)
Caching the networkx DAG between invocations — with cache invalidation for project code / config changes (like partial parsing), or when switching between build and non-build commands
Where possible, perform certain types of node selection before adding test edges to the graph. This either wouldn't work with graph operators, or it would represent a functional change, if we resolve graph operators before adding test edges. (Imagine a case with model_a --> some_test and model_a --> model_b. What should dbt build --select some_test+ resolve to? Today, it includes model_b; if we reordered the logic so graph selection happens before the some_test --> model_b edge is added, it no longer would.) But it could make a big difference in the case of dbt build --select single_model.

These startup costs are no fun anywhere, but they're especially painful in development, when you're trying to build a single model. As a larger topic, we are thinking a lot about the experience for large dbt deployments, which might want to look less like 2-3k nodes in a single project, and more like several projects with several hundred each. There's some lively and ongoing discussion over in #5244.

china-cse · 2022-10-16T21:24:58Z

thank you @jtcohen6 for the response and references to ongoing discussions.

Have gone through code snippet where its taking time to build graph.. I'm intrigued to see below block implementation where its removing all nodes/edges instead of just taking subgraph using networkx module. Is there any specific reason its implemented in such way using loops and going through each nodes check and remove.

As per my understanding of the same logic, it can be achieved through just using subgraph like below.. It just faster and simpler code. have tested this, and it works faster and achieves same result. Would this be amended with logic or you see any issues?

Thank you!!!

jtcohen6 · 2022-10-17T11:59:28Z

@china-cse This is great! Someone from the Core-Execution team can take a look at the proposed logic, when they have some free bandwidth. If it's strictly better/faster at achieving the same result, I see no reason not to go with it.

china-cse · 2022-10-18T03:49:03Z

Great @jtcohen6 , thank you!!

china-cse · 2022-10-19T23:25:13Z

@jtcohen6, Just to remind.. I think this fix should be expedited.. in case organizations reaching the point beyond expected avg nodes count ~2K, you receive lot of performance issue requests..

will-sargent-dbtlabs · 2022-10-27T13:06:21Z

Agree! This is definitely affecting customers with very large projects.

leahwicz · 2022-10-27T17:18:18Z

@iknox-fa after the Hackathon concludes and you are on Support Rotation Monday, could you take a look at the above proposed solution and see if that would work for us please?

iknox-fa · 2022-10-31T22:52:43Z

Hi @china-cse, thanks for the bugreport and the effort put into researching a solution! Unfortunately, in this case graph.subgraph doesn't do exactly what we need it to. It removes all unconnected nodes after selection has occurred, whereas what we need to do is re-construct the original graph creating the edges between nodes that were already there.

As an example, if we applied the logic proposed here like so:

>>> import networkx as nx
>>> G = nx.path_graph(8)
>>> list(G.edges)
[(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7)]
>>> H = G.subgraph([1,2,3,5,7])
>>> list(H.edges)
[(1, 2), (2, 3)]

As you can see we've removed nodes 5 and 7 from our graph even though they were selected!

Here's what we were expecting to happen:

>>> I = Graph(nx.DiGraph(G))
<dbt.graph.graph.Graph object at 0x10f3a9940>
>>> J = I.get_subset_graph([1,2,3,5,7])
>>> list(J.graph.edges)
[(1, 2), (2, 3), (3, 5), (5, 7)]

Now interestingly enough that's not what we get today-- instead we get:

[(1, 2), (2, 1), (2, 3), (3, 2), (3, 5), (5, 3), (5, 7), (7, 5)]

As you can see, we have an extra set of edges being generated pointed in the opposite direction. This definitely represents a bug that I can try to take a closer look at tomorrow.

Also, as noted in the last time I worked on this code-- we might get a better result if we leveraged some DAG specific algo work.

iknox-fa · 2022-11-01T21:53:15Z

@jtcohen6 I'm adding the bug label to this one since there is an unexpected result in the graph edge creation step (although at worse it only effects performance, not correctness).

jeremyyeo · 2022-11-08T22:03:52Z

To add to Will's point - this one is a blocker for development on "large projects" (we do have one such example that we've brought up internally) - you tweak a model and you do dbt build -s +my_changed_model and it takes 10-15 minutes just to get to "Acquiring connection" - basically in 1 hour development cycle - you can only run/test your dbt project ~4 times (because you have to wait for so long).

china-cse · 2022-11-13T21:53:26Z

Hi @china-cse, thanks for the bugreport and the effort put into researching a solution! Unfortunately, in this case graph.subgraph doesn't do exactly what we need it to. It removes all unconnected nodes after selection has occurred, whereas what we need to do is re-construct the original graph creating the edges between nodes that were already there.

As an example, if we applied the logic proposed here like so:
>>> import networkx as nx
>>> G = nx.path_graph(8)
>>> list(G.edges)
[(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7)]
>>> H = G.subgraph([1,2,3,5,7])
>>> list(H.edges)
[(1, 2), (2, 3)]
As you can see we've removed nodes 5 and 7 from our graph even though they were selected!

Here's what we were expecting to happen:
>>> I = Graph(nx.DiGraph(G))
<dbt.graph.graph.Graph object at 0x10f3a9940>
>>> J = I.get_subset_graph([1,2,3,5,7])
>>> list(J.graph.edges)
[(1, 2), (2, 3), (3, 5), (5, 7)]
Now interestingly enough that's not what we get today-- instead we get:
[(1, 2), (2, 1), (2, 3), (3, 2), (3, 5), (5, 3), (5, 7), (7, 5)]
As you can see, we have an extra set of edges being generated pointed in the opposite direction. This definitely represents a bug that I can try to take a closer look at tomorrow.

Also, as noted in the last time I worked on this code-- we might get a better result if we leveraged some DAG specific algo work.

Hi @iknox-fa

thanks for pointing the issue, my bad I was checking only single node when tested.

I had looked into further on the approach. Looping through whole graph to find all selected nodes & removing one by one takes cost & time. It would be good if we trim of all unnecessary nodes from Graph before we start building new Graph based on the selection.

I worked on below approach, this takes less time in finding nodes.. and tested with 10K nodes.. seems its giving faster results.

this solution requires additional function (lets say "trim_graph()") for scalability purpose which main purpose is to remove unnecessary nodes/paths except direct paths

approach is as follow..

for example, graph as below

nodes: [0,1,2,3,4,5,6,7,8]

edges: [(0,1),(1,2),(1,3),(2,4),(2,6),(3,4),(4,5),(5,6),(6,7),(5,8)]

new trim-graph function::

scenario-1:

lets say, selection is [4].. its pretty straightforward in finding parents & children and remove all nodes except selected..

trim-graph outcome:
nodes: [4]
edges: []

cost: O(f(n)) --> its a best short in finding single nodes as its very cost effective.. cost is O(1)

scenario-2:

lets say, selection is [2,5].. will loop through selected nodes and remove unnecessary parents and children falls outside of direct paths

trim-graph outcome:
nodes: [2,4,5]
edges: [(2,4),(4,5)]

cost: O(3*f(n)) --> its thrice the cost of above scenario, it may increase as selection nodes increases.. so average cost would be O(cf(n)) if c specifies number of nodes.

scenario-3:

worst case scenario, selection is [0,7]... here not much to trim except node 8 as this falls outside direct path from 0 -> 7

trim-graph outcome:
nodes: [1,2,3,4,5,6,7]
edges: [(0,1),(1,2),(1,3),(2,4),(2,6),(3,4),(4,5),(5,6),(6,7)]

cost: O(7*f(n)) --> its almost like selecting whole graph

Once above function completes, will continue with existing logic to remove not required nodes based on selection.

this logic fixes bug mentioned by you as well...

please let me know if I can work on this change..

thank you!!

ChenyuLInx · 2022-11-18T06:51:19Z

@china-cse trimming the graph sounds like a good idea to me!
I think You should continue with it and open a PR if you are interested!
I am wondering how do you plan to do the trimming part? Doing a round of BFS from nodes on two sides of the DAG, remove node that isn't selected and stop at selected node?

misteliy · 2022-12-30T13:28:23Z

Dear all, thanks for looking into this because for us this started to become a bottleneck as well. In our company we currently face 18k nodes and thus this becomes unbearable. @china-cse what is the current speed up with the new solution? @jtcohen6 is this something that will find its way to be released soon? Is there something we can help with or should start to engage? Thank you

china-cse · 2022-12-30T14:36:38Z

Hi @misteliy,

Awaiting dbt-core team review on pull request. It’s been a while I don’t see any update. Let me try update and ask for the status.

thanks

simonloach · 2023-01-05T02:50:11Z

@china-cse I had created a PR for some changes that I wanted to suggest: china-cse#1. I had verified and the algorithm works as intended in the case of linear DAG. Will look into writing a test suite with more complex DiGraphs and test performance.

china-cse · 2023-01-06T23:05:06Z

thanks @simonloach for the update.. let me know if any issues... thank you!!

boxysean · 2023-01-20T10:34:38Z

I have a client who is interested in this PR and are struggling with dbt build run durations. (I'm not yet sure if this is impacting their dev, or prod, and what kind of business impact it has for them.) They reported to me that they have about 10K-15K nodes in their project (models+tests) and reported a startup time of about 15 minutes before the first model is built (similar to @china-cse's report). I also understand that dbt Core v1.4 will be released soon, so I would like to assist in profiling the performance of @china-cse's work.

Summary

I did not observe any speed gain from this PR. I tried dbt-core version 1.3.2 with and without the patch from this PR, and observed the same result on a synthetic DAG I created.

That does not mean that this change doesn't improve performance, it just means that I wasn't able to observe the performance gains claimed. I would like to run this test again on a DAG in the same shape (not necessarily the same code) as the DAGs that claim to have a 15 minute startup time.

Benchmark tests

Methodology

I set up a test environment on my laptop with dbt-core version 1.3.2 (editable repo) and dbt-duckdb version 0.6.1. I used these versions because I could relatively easily benchmark dbt build end-to-end with DuckDB within my own system, and dbt-duckdb currently requires dbt-core version <=1.3.2. Then, on a Test DAG (described below), I ran the command dbt build using my local dbt core version 1.3.2 with and without the patch to core/dbt/graph/graph.py (patch file) in this PR.

Test DAG: I built a synthetic DAG in the shape of a binary tree and 1023 models, each model with 5 "no-op" generic tests. The models had SELECT 1 table materializations with SQL-commented refs to their children models, and "no-op" tests were SELECT 1 WHERE 1=0. See this project generation code.

I used hyperfine for command line benchmarking.

Test 1: Vanilla dbt-core v1.3.2

Steps to recreate:

# within a pyenv
git clone git@github.com:dbt-labs/dbt-core.git
cd dbt-core
git checkout v1.3.2 -b v1.3.2
pip install -e ./core && pip install -e ./plugins/postgres && pip install -e ./tests/adapter
pip install duckdb
cd ../my_project
hyperfine --runs 5 --export-json vanilla_v1.3.2.json 'dbt build'  # this vanilla v1.3.2

Results:

➜  my_project $ hyperfine --runs 5 --export-json vanilla_v1.3.2.json 'dbt build'  # this vanilla v1.3.2
Benchmark 1: dbt build
  Time (mean ± σ):     297.611 s ± 16.962 s    [User: 273.805 s, System: 27.789 s]
  Range (min … max):   281.686 s … 326.369 s    5 runs

Test 2: dbt-core v1.3.2, with `core/dbt/graph/graph.py` patch applied

Steps make repo (within same):

# within same pyenv
cd dbt-core
git remote add china-cse git@github.com:china-cse/dbt-core-local.git
git fetch -a
git checkout dbt-build-performance-fix
git diff v1.3.2 -- core/dbt/graph/graph.py > ../dbt-build-performance-fix.patch
git checkout v1.3.2 -b v1.3.2-with-dbt-build-performance-fix
git apply ../dbt-build-performance-fix.patch 
cd ../my_project
hyperfine --warmup 1 --runs 5 'dbt build' # this is with performance fix

Results:

➜  my_project $ hyperfine --warmup 1 --runs 5 'dbt build' # this is with performance fix
Benchmark 1: dbt build
  Time (mean ± σ):     321.686 s ± 21.727 s    [User: 293.544 s, System: 29.333 s]
  Range (min … max):   292.003 s … 347.179 s    5 runs

boxysean · 2023-01-22T21:45:56Z

Hi all, I propose a small improvement in #6694.

For anyone curious, here are some visualizations of where dbt-core is spending its time during a dbt compile on a very large DAG of 8468 models, 17103 tests. Hopefully this gives everyone on this thread ideas where to spend optimization efforts.

Main thread:

Other threads:

I'm personally curious why the graph building code is multithreaded, does anyone know if there are performance gains from that? We see the main thread locked and waiting for the subthread due to the GIL; I don't believe there are I/O wait times in the subthread which would benefit runtime.

ttusing · 2023-03-20T05:41:22Z

Hello - I have a PR to increase dbt build performance. I added a trimming strategy similar to what was described here by @china-cse

The strategy is a bit different than described, I first remove any nodes with degree 0 or 1 (iteratively), then sort the remaining nodes by degree, starting with the lowest, when determining whether to remove them, such that a minimal number of edges are created when removing nodes.

This decreased my build time when selecting a single model from 2 minutes 18 seconds to 10 seconds.

I think this can be slightly improved further by removing any nodes with 0 incoming or outgoing edges, which is a bit broader a definition than degree 1 and will probably make a big difference on large and interconnected DAGs. I'm not familiar enough with networkx to know how to do that easily/efficiently.

Issue: #7195
PR: #7194

kostek-pl · 2023-04-05T12:59:11Z

Hi Tobie
I've checked your version against our dbt project(11251 models and 34993 tests) and indeed I see a significant improvement.
Time between running a single model build and when this single model is being processed decreased from 7 mins to 3 mins

ttusing · 2023-04-05T20:46:15Z

Hi Tobie I've checked your version against our dbt project(11251 models and 34993 tests) and indeed I see a significant improvement. Time between running a single model build and when this single model is being processed decreased from 7 mins to 3 mins

awesome!

Could you post your findings in #7195?

ttusing · 2023-04-08T02:20:05Z

Hi Tobie I've checked your version against our dbt project(11251 models and 34993 tests) and indeed I see a significant improvement. Time between running a single model build and when this single model is being processed decreased from 7 mins to 3 mins

The PR is merged to main and is in likely for 1.5-rc1 which, peeking around, looks like will be released on Monday :).

I have some ideas on how to optimize get_subset_graph even further (see https://github.com/ttusing/dbt-core/pull/3/files). If everyone can share their runtimes on 1.5 and where their runs are hanging, we can iterate further. I think it would be cool to build a community understanding of how many nodes DBT can handle in <1 minute on common hardware, would help me inform how to design my DBT projectors.

ttusing · 2023-05-18T07:59:44Z

@china-cse @kostek-pl @boxysean do you consider this issue to still be open? Can you share your project build times in dbt 1.5?

github-actions · 2024-02-18T01:46:22Z

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

peterallenwebb · 2024-02-22T15:27:04Z

I'm closing this for now, since the problematic performance issues identified are in older versions of dbt, and we believe they are probably resolved. We are always ready and willing to look into this further, if needed.

china-cse added bug Something isn't working triage labels Oct 16, 2022

github-actions bot changed the title ~~[Bug] Performance issues with DBT build when nodes more than 2K in a project~~ [CT-1353] [Bug] Performance issues with DBT build when nodes more than 2K in a project Oct 16, 2022

jtcohen6 added enhancement New feature or request performance Team:Execution and removed bug Something isn't working triage labels Oct 16, 2022

jtcohen6 added the support_rotation A good task to pick up during support rotation label Oct 17, 2022

iknox-fa added the bug Something isn't working label Nov 1, 2022

china-cse mentioned this issue Nov 27, 2022

dbt build performance fix #6322

Closed

6 tasks

iknox-fa mentioned this issue Jan 10, 2023

[SPIKE] Testing PR #6322 (build performance update) #6562

Open

jtcohen6 removed the support_rotation A good task to pick up during support rotation label Jan 15, 2023

This was referenced Jan 22, 2023

Cheeky performance improvement on big DAGs #6694

Merged

Add the yappi profiler for multithread profiling #6695

Closed

This was referenced Jan 23, 2023

[CT-1881] [Feature] When possible, bypass an expensive method during graph construction #6697

Closed

[CT-1927] [Feature] Consider only successors of the deleted nodes when populating the GraphQueue #6759

Closed

boxysean mentioned this issue Feb 3, 2023

Optimize BaseRelation.matches() #6844

Closed

6 tasks

This was referenced Mar 19, 2023

[CT-2315] [Feature] Ignore duplicate edges while building the subgraph for the job queue #7191

Closed

[CT-2316] [Feature] Improve subset_graph selection by removing trivial nodes and improving removal order #7195

Closed

github-actions bot added the triage label May 18, 2023

dbeatty10 removed the triage label May 18, 2023

jtcohen6 removed the Team:Execution label Jul 19, 2023

truls-p mentioned this issue Jan 15, 2024

Change metadata_vars if not to if ... is None #9376

Closed

5 tasks

truls-p mentioned this issue Jan 23, 2024

Change metadata_vars if not to if ... is None dbt-labs/dbt-common#37

Merged

5 tasks

github-actions bot added the stale Issues that have gone stale label Feb 18, 2024

peterallenwebb closed this as completed Feb 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CT-1353] [Bug] Performance issues with DBT build when nodes more than 2K in a project #6073

[CT-1353] [Bug] Performance issues with DBT build when nodes more than 2K in a project #6073

china-cse commented Oct 16, 2022

jtcohen6 commented Oct 16, 2022

china-cse commented Oct 16, 2022

jtcohen6 commented Oct 17, 2022 •

edited

Loading

china-cse commented Oct 18, 2022

china-cse commented Oct 19, 2022

will-sargent-dbtlabs commented Oct 27, 2022 •

edited

Loading

leahwicz commented Oct 27, 2022

iknox-fa commented Oct 31, 2022

iknox-fa commented Nov 1, 2022

jeremyyeo commented Nov 8, 2022

china-cse commented Nov 13, 2022 •

edited

Loading

ChenyuLInx commented Nov 18, 2022

misteliy commented Dec 30, 2022

china-cse commented Dec 30, 2022

simonloach commented Jan 5, 2023

china-cse commented Jan 6, 2023

boxysean commented Jan 20, 2023 •

edited

Loading

boxysean commented Jan 22, 2023 •

edited

Loading

ttusing commented Mar 20, 2023 •

edited

Loading

kostek-pl commented Apr 5, 2023 •

edited

Loading

ttusing commented Apr 5, 2023

ttusing commented Apr 8, 2023

ttusing commented May 18, 2023

github-actions bot commented Feb 18, 2024

peterallenwebb commented Feb 22, 2024

[CT-1353] [Bug] Performance issues with DBT build when nodes more than 2K in a project #6073

[CT-1353] [Bug] Performance issues with DBT build when nodes more than 2K in a project #6073

Comments

china-cse commented Oct 16, 2022

Is this a new bug in dbt-core?

Current Behavior

Expected Behavior

Steps To Reproduce

Relevant log output

Environment

Which database adapter are you using with dbt?

Additional Context

jtcohen6 commented Oct 16, 2022

china-cse commented Oct 16, 2022

jtcohen6 commented Oct 17, 2022 • edited Loading

china-cse commented Oct 18, 2022

china-cse commented Oct 19, 2022

will-sargent-dbtlabs commented Oct 27, 2022 • edited Loading

leahwicz commented Oct 27, 2022

iknox-fa commented Oct 31, 2022

iknox-fa commented Nov 1, 2022

jeremyyeo commented Nov 8, 2022

china-cse commented Nov 13, 2022 • edited Loading

nodes: [0,1,2,3,4,5,6,7,8]

edges: [(0,1),(1,2),(1,3),(2,4),(2,6),(3,4),(4,5),(5,6),(6,7),(5,8)]

ChenyuLInx commented Nov 18, 2022

misteliy commented Dec 30, 2022

china-cse commented Dec 30, 2022

simonloach commented Jan 5, 2023

china-cse commented Jan 6, 2023

boxysean commented Jan 20, 2023 • edited Loading

Summary

Benchmark tests

Methodology

Test 1: Vanilla dbt-core v1.3.2

Test 2: dbt-core v1.3.2, with core/dbt/graph/graph.py patch applied

boxysean commented Jan 22, 2023 • edited Loading

ttusing commented Mar 20, 2023 • edited Loading

kostek-pl commented Apr 5, 2023 • edited Loading

ttusing commented Apr 5, 2023

ttusing commented Apr 8, 2023

ttusing commented May 18, 2023

github-actions bot commented Feb 18, 2024

peterallenwebb commented Feb 22, 2024

jtcohen6 commented Oct 17, 2022 •

edited

Loading

will-sargent-dbtlabs commented Oct 27, 2022 •

edited

Loading

china-cse commented Nov 13, 2022 •

edited

Loading

boxysean commented Jan 20, 2023 •

edited

Loading

Test 2: dbt-core v1.3.2, with `core/dbt/graph/graph.py` patch applied

boxysean commented Jan 22, 2023 •

edited

Loading

ttusing commented Mar 20, 2023 •

edited

Loading

kostek-pl commented Apr 5, 2023 •

edited

Loading