
[CT-2414] Output Graph Summary Information on When dbt Compiles #7357

Closed
peterallenwebb opened this issue Apr 13, 2023 · 5 comments · Fixed by #7358
Assignees
Labels
enhancement New feature or request performance

Comments

@peterallenwebb
Contributor

peterallenwebb commented Apr 13, 2023

The manifest.json and graph.gpickle files would often be useful for investigating performance issues in Core's graph algorithms, but there are a few issues:

  • They may be large
  • They often contain sensitive proprietary information that cannot be shared outside the user's organization
  • They only record graph structure after test edges are added

I propose that we add an additional output file called graph_summary.json, which would be more compact and anonymized. This file would contain only the edge structure, and the name and type of each node. It would include that information at two separate points in time: immediately after the graph is linked together, and then again after test edges have been added.
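
As a rough illustration (the issue doesn't pin down an actual schema), the summary might replace node names with opaque integer ids and keep only the resource type and successor list of each node, with one snapshot per point in time. The `summarize` helper and the field names below are hypothetical:

```python
import json

# Hypothetical sketch: node names are replaced by opaque integer ids so no
# proprietary names leak; only edge structure and resource type remain.
# The two snapshots correspond to "after linking" and "after test edges".
def summarize(nodes, edges):
    ids = {name: i for i, name in enumerate(sorted(nodes))}
    return {
        str(ids[name]): {
            "type": rtype,
            "succ": sorted(ids[dst] for src, dst in edges if src == name),
        }
        for name, rtype in sorted(nodes.items())
    }

nodes = {"model.a": "model", "model.b": "model", "test.t": "test"}
linked_edges = [("model.a", "model.b")]
with_tests = linked_edges + [("model.a", "test.t"), ("test.t", "model.b")]

summary = {
    "linked": summarize(nodes, linked_edges),
    "with_test_edges": summarize(nodes, with_tests),
}
print(json.dumps(summary, indent=2))
```

Because node names are mapped to integers, the output preserves the graph's shape (which is what the performance analysis needs) while carrying no identifying model names.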

My hope is that this file could be shared with the Core team on a routine basis, and would allow us to build and maintain a library of realistic graph structures from production environments, helping us improve and test our graph algorithms.

Immediate applications include:

  • Building a library of user DAGs for analysis and testing
  • Analyzing whether or not we are executing DAGs as quickly as we reasonably could
  • Checking whether add_test_edges() is doing what we want, and optimizing it
@peterallenwebb peterallenwebb added enhancement New feature or request triage labels Apr 13, 2023
@github-actions github-actions bot changed the title Output Graph Summary Information on When dbt Compiles [CT-2414] Output Graph Summary Information on When dbt Compiles Apr 13, 2023
@dbeatty10
Contributor

Smart idea @peterallenwebb ! 🧠

What would you see as the mechanism for these to be shared with the Core team on a routine basis?

I'm wondering if maybe just a single anonymized and compact graph.json with test edges included might be simple? Those edges could be easily removed post-hoc if/when desired (e.g. using jq).
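
The post-hoc removal dbeatty10 describes could be done with jq or, equivalently, a few lines of Python. Assuming the hypothetical id-keyed format with `type` and `succ` fields, dropping test nodes and any edges touching them might look like:

```python
# Sketch of removing test edges after the fact: drop every node whose
# resource type is "test", plus any successor edges pointing at one.
# The graph format here is a hypothetical illustration, not dbt's actual schema.
graph = {
    "0": {"type": "model", "succ": ["1", "2"]},
    "1": {"type": "test", "succ": ["2"]},
    "2": {"type": "model", "succ": []},
}
tests = {nid for nid, n in graph.items() if n["type"] == "test"}
pruned = {
    nid: {"type": n["type"], "succ": [s for s in n["succ"] if s not in tests]}
    for nid, n in graph.items()
    if nid not in tests
}
print(pruned)
```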

What would you see as the pros/cons?

@peterallenwebb
Contributor Author

@dbeatty10 I don't have a specific mechanism for sharing in mind, and I'm open to ideas. One possibility is that the Cloud Artifacts team could direct these outputs to an S3 bucket where we could take a look. If the Core team thinks the solution I've sketched here is a good idea, I will ask for some further feedback from Artifacts. For non-Cloud dbt users, asking for this graph summary is also less invasive than asking for a full manifest.

I'll give some thought to trying to combine the graphs into a single one. It seems to me that the downsides would be:

  1. It complicates the graph output format slightly
  2. We would need to add some logic to the already-complex function which adds test edges (or alternately, do some potentially slow analysis after the fact to determine which edges were added)
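
The "after the fact" option in point 2 amounts to diffing the edge sets of the two snapshots. A minimal sketch, using plain tuples for edges:

```python
# Recover which edges add_test_edges() introduced by subtracting the
# linked-graph edge set from the with-tests edge set. This is linear in
# the number of edges; the cost is mainly keeping both snapshots around.
linked = {("a", "b"), ("b", "c")}
with_tests = {("a", "b"), ("b", "c"), ("a", "t1"), ("t1", "b")}
added = with_tests - linked
print(sorted(added))
```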

My original hope was that the information would be compact enough to include in the metrics we are collecting with snowplow, but after giving it some thought I don't think that will be workable. I don't know what our snowplow message size limit is, but 40KB was a limit I saw cited, and that seems like a reasonable number to plan around. For clients with large graphs, which are the type we are most interested in, it would be tricky to stay under 40KB.

@dbeatty10
Contributor

I wonder if there's some kind of compact summary that we could include in snowplow messages that is guaranteed to fit under some size X, even for arbitrarily large graphs. E.g., maybe we could choose some representative summary statistics from which we could generate "similar" graphs.

Of course this wouldn't give us an isomorphic representation like graph.json would, but could give us a lightweight complement to it.

Spitballing some ideas of summary stats:

  • total number of nodes (potentially by resource type)
  • "average" node depth*
  • "average" node in degree* (potentially as a function of depth)

* Where "average" is actually location, scale, and shape parameters of a skew normal distribution.
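
As a toy illustration of the depth and in-degree stats above (using a hand-built adjacency list, stdlib only; a skew-normal fit, e.g. `scipy.stats.skewnorm.fit`, could then reduce each distribution to location, scale, and shape parameters):

```python
from collections import defaultdict

# Hypothetical sketch: compute per-node depth (longest path from a root)
# and in-degree on a small DAG a -> {b, c} -> d.
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
pred = defaultdict(list)
indeg = defaultdict(int)
nodes = set()
for src, dst in edges:
    pred[dst].append(src)
    indeg[dst] += 1
    nodes.update((src, dst))

memo = {}
def depth(n):
    # Roots (no predecessors) have depth 0; otherwise 1 + max parent depth.
    if n not in memo:
        memo[n] = 0 if not pred[n] else 1 + max(depth(p) for p in pred[n])
    return memo[n]

depths = {n: depth(n) for n in sorted(nodes)}
print(depths)       # per-node depth
print(dict(indeg))  # in-degree for each non-root node
```

Summary stats over these two distributions would then fit comfortably in a snowplow message regardless of graph size.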

I tried googling a bit for "summary statistics for directed acyclic graph"; most of the hits were fascinating, yet not relevant. Here's one that might be useful for brainstorming purposes:
https://documentation.sas.com/doc/en/pgmsascdc/v_037/casactml/casactml_network_examples86.htm#casactml.network.summaryc

@dbeatty10 dbeatty10 removed the triage label Apr 14, 2023
@peterallenwebb
Contributor Author

I agree that capturing some rudimentary stats is a good idea, with a lot of potentially interesting applications for the results. I'll read up a bit on the topic and try to add a statistical summary to the snowplow tracking.

@boxysean
Contributor

boxysean commented May 2, 2023

Very nice @peterallenwebb 👍 This really smooths out the process of asking users for their DAG to help with our testing. I could imagine wanting to generate some graphs to help us prototype some of the cross-project ref features coming in 1.6 and beyond. Also ➕ to the idea of creating a folder of graphs to refer back to for future work. Thanks!
