[CT-1353] [Bug] Performance issues with DBT build when nodes more than 2K in a project #6073
Comments
Thanks for opening @china-cse! This is an issue we saw after introducing the … Some ideas we could explore to improve performance here:
These startup costs are no fun anywhere, but they're especially painful in development, when you're trying to …
Thank you @jtcohen6 for the response and the references to ongoing discussions. I've gone through the code where it takes time to build the graph, and I'm intrigued by the block below, which removes nodes/edges one by one instead of just taking a subgraph with the networkx module. Is there a specific reason it's implemented that way, looping through each node to check and remove it? As far as I understand the logic, the same result can be achieved by just using subgraph, as below. It's faster and simpler code; I have tested this, and it runs faster and achieves the same result. Could the logic be amended this way, or do you see any issues? Thank you!
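The snippet referenced in this comment did not survive extraction; the following is a minimal sketch of what the "just take a subgraph" approach looks like with networkx, assuming `selected` is the set of node IDs chosen by the selector (names are illustrative, not dbt-core's API).

```python
# Minimal sketch of the proposed "just take a subgraph" approach (illustrative,
# not the original poster's snippet or dbt-core's code).
import networkx as nx

def get_selected_subgraph(graph: nx.DiGraph, selected: set) -> nx.DiGraph:
    # subgraph() returns a read-only view limited to the selected nodes;
    # copy() materializes it so it can be mutated independently.
    return graph.subgraph(selected).copy()
```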
@china-cse This is great! Someone from the Core-Execution team can take a look at the proposed logic when they have some free bandwidth. If it's strictly better/faster at achieving the same result, I see no reason not to go with it.
Great, @jtcohen6, thank you!
@jtcohen6, just a reminder: I think this fix should be expedited. Once organizations grow beyond the expected average node count of ~2K, you will receive a lot of performance issue requests.
Agree! This is definitely affecting customers with very large projects.
@iknox-fa, after the Hackathon concludes and you are on Support Rotation Monday, could you take a look at the proposed solution above and see whether it would work for us, please?
Hi @china-cse, thanks for the bug report and the effort put into researching a solution! Unfortunately, in this case the simpler approach doesn't give us the result we need. As an example, if we applied the logic proposed here like so: …
As you can see, we've removed nodes 5 and 7 from our graph even though they were selected! Here's what we were expecting to happen: …
Now, interestingly enough, that's not what we get today; instead we get: …
As you can see, we have an extra set of edges being generated, pointed in the opposite direction. This definitely represents a bug that I can try to take a closer look at tomorrow. Also, as I noted the last time I worked on this code, we might get a better result if we leveraged some DAG-specific algorithm work.
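The graph snapshots this comment refers to were lost in extraction. As a hedged illustration of the expectation being described, here is a minimal sketch (not dbt-core's actual implementation) of removing an unselected node while preserving lineage, i.e. bridging its predecessors to its successors; the node numbers follow the comment.

```python
# Hedged sketch: when an unselected node is removed, its predecessors are
# connected directly to its successors so lineage through it is preserved.
import networkx as nx

def remove_node_preserving_lineage(graph: nx.DiGraph, node) -> None:
    # Bridge every upstream/downstream pair across the node being removed.
    for parent in list(graph.predecessors(node)):
        for child in list(graph.successors(node)):
            graph.add_edge(parent, child)
    graph.remove_node(node)

# Illustrative chain 4 -> 5 -> 6 -> 7 with only {5, 7} selected: after removing
# the unselected nodes, 5 should still be upstream of 7.
g = nx.DiGraph([(4, 5), (5, 6), (6, 7)])
for n in (4, 6):
    remove_node_preserving_lineage(g, n)
print(sorted(g.edges()))  # [(5, 7)]
```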
@jtcohen6 I'm adding the bug label to this one, since there is an unexpected result in the graph edge creation step (although at worst it only affects performance, not correctness).
To add to Will's point: this one is a blocker for development on "large projects" (we do have one such example that we've brought up internally). You tweak a model and you do …
Hi @iknox-fa, thanks for pointing out the issue; my bad, I was checking only a single node when I tested. I have looked further into the approach. Looping through the whole graph to find all selected nodes and removing them one by one costs time. It would be good if we trimmed all unnecessary nodes from the graph before we start building the new graph based on the selection. I worked on the approach below; it takes less time to find nodes, and tested with 10K nodes it seems to give faster results. This solution requires an additional function (let's say trim_graph()) for scalability, whose main purpose is to remove unnecessary nodes/paths other than the direct paths (a sketch follows this comment). The approach is as follows, for an example graph with nodes [0,1,2,3,4,5,6,7,8] and edges [(0,1),(1,2),(1,3),(2,4),(2,6),(3,4),(4,5),(5,6),(6,7),(5,8)]:
Scenario 1: the selection is [4]. It is straightforward to find parents and children and remove all nodes except the selected one. Cost: O(f(n)); it is a best shot for a single node, since finding it is very cost effective, essentially O(1).
Scenario 2: the selection is [2,5]. Loop through the selected nodes and remove the unnecessary parents and children that fall outside of the direct paths. Cost: O(3*f(n)), roughly three times the cost of the scenario above; it grows as the selection grows, so the average cost is O(c*f(n)) where c is the number of selected nodes.
Scenario 3: worst case, the selection is [0,7]. There is not much to trim except node 8, since it falls outside the direct path from 0 to 7. Cost: O(7*f(n)), almost like selecting the whole graph.
Once this function completes, we continue with the existing logic to remove the remaining unneeded nodes based on the selection. This logic fixes the bug you mentioned as well. Please let me know if I can work on this change. Thank you!
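The trim-graph outcomes shown in this comment were lost in extraction. Here is a hedged sketch of one way to read the trim_graph() idea, "remove everything outside the direct paths between selected nodes"; it is an interpretation for illustration, not the poster's code or dbt-core's implementation.

```python
# Hedged sketch of a trim_graph() pre-pass under the "direct paths only"
# reading: keep the selected nodes plus any node that is downstream of one
# selected node and upstream of another. Names are made up for illustration.
import networkx as nx
from itertools import permutations

def trim_graph(graph: nx.DiGraph, selected: set) -> nx.DiGraph:
    keep = set(selected)
    # A node lies on a direct path from s to t iff it is a descendant of s and
    # an ancestor of t, so keep descendants(s) & ancestors(t) for every pair.
    for s, t in permutations(selected, 2):
        keep |= nx.descendants(graph, s) & nx.ancestors(graph, t)
    return graph.subgraph(keep).copy()

# Scenario 3 from the comment: selection [0, 7] on the example graph;
# only node 8 falls outside the direct paths and gets trimmed.
g = nx.DiGraph([(0, 1), (1, 2), (1, 3), (2, 4), (2, 6), (3, 4),
                (4, 5), (5, 6), (6, 7), (5, 8)])
print(sorted(trim_graph(g, {0, 7}).nodes()))  # [0, 1, 2, 3, 4, 5, 6, 7]
```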
@china-cse trimming the graph sounds like a good idea to me!
Dear all, thanks for looking into this, because this has started to become a bottleneck for us as well. In our company we currently have 18k nodes, so it has become unbearable. @china-cse, what is the current speed-up with the new solution? @jtcohen6, is this something that will find its way into a release soon? Is there something we can help with, or should we start to engage? Thank you
Hi @misteliy, we are awaiting a dbt-core team review on the pull request. It's been a while and I don't see any update; let me try to ask for the status. Thanks
@china-cse I created a PR with some changes that I wanted to suggest: china-cse#1. I verified that the algorithm works as intended in the case of a linear DAG. I will look into writing a test suite with more complex DiGraphs and test performance.
Thanks @simonloach for the update; let me know if there are any issues. Thank you!
I have a client who is interested in this PR and is struggling with …

Summary
I did not observe any speed gain from this PR. I tried dbt-core version 1.3.2 with and without the patch from this PR, and observed the same result on a synthetic DAG I created. That does not mean that this change doesn't improve performance; it just means that I wasn't able to observe the performance gains claimed. I would like to run this test again on a DAG in the same shape (not necessarily the same code) as the DAGs that are claimed to have a 15-minute startup time.

Benchmark tests

Methodology
I set up a test environment on my laptop with dbt-core version 1.3.2 (editable repo) and dbt-duckdb version 0.6.1. I used these versions because I could relatively easily benchmark …
Test DAG: I built a synthetic DAG in the shape of a binary tree with 1023 models, each model with 5 "no-op" generic tests. The models had …
I used hyperfine for command-line benchmarking.

Test 1: Vanilla dbt-core v1.3.2
Steps to recreate: …
Results: …

Test 2: dbt-core v1.3.2, with …
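The benchmark tables and the DAG-generation script from this comment were lost in extraction. As a rough, hedged sketch of how a binary-tree test DAG of this shape could be generated (not the commenter's actual script, and omitting the generic tests), model i selects from its two children 2*i+1 and 2*i+2, giving 1023 models for a depth-10 tree; the paths and names are illustrative.

```python
# Generate a binary-tree-shaped dbt project: model_i refs model_{2i+1} and
# model_{2i+2}; leaves are constant selects. Illustrative only.
from pathlib import Path

N_MODELS = 1023  # 2**10 - 1 nodes in a full binary tree
models_dir = Path("models/synthetic")
models_dir.mkdir(parents=True, exist_ok=True)

for i in range(N_MODELS):
    children = [c for c in (2 * i + 1, 2 * i + 2) if c < N_MODELS]
    if children:
        body = " union all ".join(
            f"select * from {{{{ ref('model_{c}') }}}}" for c in children
        )
    else:
        body = "select 1 as id"  # leaf models are constants
    (models_dir / f"model_{i}.sql").write_text(body + "\n")
```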
Hi all, I propose a small improvement in #6694. For anyone curious, here are some visualizations of where dbt-core is spending its time during a … (main thread, and other threads). I'm personally curious why the graph-building code is multithreaded; does anyone know if there are performance gains from that? We see the main thread locked and waiting for the subthread due to the GIL, and I don't believe there are I/O wait times in the subthread that would benefit runtime.
Hello, I have a PR to increase … The strategy is a bit different than described: I first remove any nodes with degree 0 or 1 (iteratively), then sort the remaining nodes by degree, starting with the lowest, when determining whether to remove them, so that a minimal number of edges are created when removing nodes. This decreased my build time when selecting a single model from 2 minutes 18 seconds to 10 seconds. I think this can be improved slightly further by removing any nodes with zero incoming or zero outgoing edges, which is a somewhat broader definition than degree 1 and will probably make a big difference on large and interconnected DAGs. I'm not familiar enough with …
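Here is a hedged sketch of the pruning order described in this comment, under my reading of it rather than the PR's actual code: unselected nodes of degree 0 or 1 are dropped first (their removal never requires bridging edges), then the remaining unselected nodes are removed in ascending order of degree so that each removal creates as few new edges as possible. All names are illustrative.

```python
import networkx as nx

def _remove_bridging(graph: nx.DiGraph, node) -> None:
    # Connect each upstream node directly to each downstream node, then drop it.
    for parent in list(graph.predecessors(node)):
        for child in list(graph.successors(node)):
            graph.add_edge(parent, child)
    graph.remove_node(node)

def prune_unselected(graph: nx.DiGraph, selected: set) -> None:
    # Pass 1: iteratively remove unselected nodes with degree 0 or 1; a degree-1
    # node has no predecessor/successor pair to bridge, so removal is cheap.
    while True:
        low_degree = [n for n in graph if n not in selected and graph.degree(n) <= 1]
        if not low_degree:
            break
        graph.remove_nodes_from(low_degree)

    # Pass 2: remove the remaining unselected nodes, lowest degree first, adding
    # bridging edges so lineage through each removed node is preserved.
    for node in sorted((n for n in graph if n not in selected), key=graph.degree):
        _remove_bridging(graph, node)
```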
Hi Tobie
Awesome! Could you post your findings in #7195?
The PR is merged to main and will likely be in 1.5-rc1, which, peeking around, looks like it will be released on Monday :). I have some ideas on how to optimize …
@china-cse @kostek-pl @boxysean do you consider this issue to still be open? Can you share your project build times in …
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue, or else it will be closed in 7 days.
I'm closing this for now, since the problematic performance issues identified are in older versions of dbt, and we believe they have probably been resolved. We are always ready and willing to look into this further if needed.
Is this a new bug in dbt-core?
Current Behavior
If a project has more than 2K-3K nodes (including models, tests, etc.), dbt build takes a long time to start its first model; just building a single model takes 15 minutes. I disabled the analytics tracking event, thinking that could be the cause, but still no luck. dbt run is faster, but I can't use run alone, since tests need to pass immediately before proceeding to the next model in the DAG.
Expected Behavior
The number of models may grow with project size, but building a single model is not supposed to take 15 minutes; that is not realistic when we have multiple dbt commands to run.
Steps To Reproduce
version: dbt 1.0 through the latest release, 1.3
total node count: 3K+, including models, tests, snapshots, seeds
build a single model:
$ dbt build -s test_model
Relevant log output
No response
Environment
Which database adapter are you using with dbt?
snowflake
Additional Context
No response