
[CT-2414] Output Graph Summary Information on When dbt Compiles #7357

Closed
peterallenwebb opened this issue Apr 13, 2023 · 5 comments · Fixed by #7358
Assignees
Labels
enhancement New feature or request performance

Comments

@peterallenwebb
Contributor

peterallenwebb commented Apr 13, 2023

The manifest.json and graph.gpickle files would often be useful for investigating performance issues in Core's graph algorithms, but there are a few issues:

  • They may be large
  • They often contain sensitive proprietary information that cannot be shared outside the user's organization
  • They only record graph structure after test edges are added

I propose that we add an additional output file called graph_summary.json, which would be more compact and anonymized. This file would contain only the edge structure, and the name and type of each node. It would include that information at two separate points in time: immediately after the graph is linked together, and then again after test edges have been added.
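
As a rough illustration (the issue doesn't pin down an actual schema), the summary might replace node names with opaque integer ids and keep only the resource type and successor list of each node, with one snapshot per point in time. The `summarize` helper and the field names below are hypothetical:

```python
import json

# Hypothetical sketch: node names are replaced by opaque integer ids so no
# proprietary names leak; only edge structure and resource type remain.
# The two snapshots correspond to "after linking" and "after test edges".
def summarize(nodes, edges):
    ids = {name: i for i, name in enumerate(sorted(nodes))}
    return {
        str(ids[name]): {
            "type": rtype,
            "succ": sorted(ids[dst] for src, dst in edges if src == name),
        }
        for name, rtype in sorted(nodes.items())
    }

nodes = {"model.a": "model", "model.b": "model", "test.t": "test"}
linked_edges = [("model.a", "model.b")]
with_tests = linked_edges + [("model.a", "test.t"), ("test.t", "model.b")]

summary = {
    "linked": summarize(nodes, linked_edges),
    "with_test_edges": summarize(nodes, with_tests),
}
print(json.dumps(summary, indent=2))
```

Because node names are mapped to integers, the output preserves the graph's shape (which is what the performance analysis needs) while carrying no identifying model names.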

My hope is that this file could be shared with the Core team on a routine basis, and would allow us to build and maintain a library of realistic graph structures from production environments, helping us improve and test our graph algorithms.

Immediate applications include:

  • Building a library of user DAGs for analysis and testing
  • Analyzing whether or not we are executing DAGs as quickly as we reasonably could
  • Checking whether add_test_edges() is doing what we want, and optimizing it
@peterallenwebb peterallenwebb added enhancement New feature or request triage labels Apr 13, 2023
@github-actions github-actions bot changed the title Output Graph Summary Information on When dbt Compiles [CT-2414] Output Graph Summary Information on When dbt Compiles Apr 13, 2023
@dbeatty10
Contributor

Smart idea @peterallenwebb ! 🧠

What would you see as the mechanism for these to be shared with the Core team on a routine basis?

I'm wondering if maybe just a single anonymized and compact graph.json with test edges included might be simple? Those edges could be easily removed post-hoc if/when desired (e.g. using jq).
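
The post-hoc removal dbeatty10 describes could be done with jq or, equivalently, a few lines of Python. Assuming the hypothetical id-keyed format with `type` and `succ` fields, dropping test nodes and any edges touching them might look like:

```python
# Sketch of removing test edges after the fact: drop every node whose
# resource type is "test", plus any successor edges pointing at one.
# The graph format here is a hypothetical illustration, not dbt's actual schema.
graph = {
    "0": {"type": "model", "succ": ["1", "2"]},
    "1": {"type": "test", "succ": ["2"]},
    "2": {"type": "model", "succ": []},
}
tests = {nid for nid, n in graph.items() if n["type"] == "test"}
pruned = {
    nid: {"type": n["type"], "succ": [s for s in n["succ"] if s not in tests]}
    for nid, n in graph.items()
    if nid not in tests
}
print(pruned)
```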

What would you see as the pros/cons?

@peterallenwebb
Contributor Author

@dbeatty10 I don't have a specific mechanism for sharing in mind, and I'm open to ideas. One possibility is that the Cloud Artifacts team could direct these outputs to an S3 bucket where we could take a look. If the Core team thinks the solution I've sketched here is a good idea, I will ask for some further feedback from Artifacts. For non-Cloud dbt users, asking for this graph summary is also less invasive than asking for a full manifest.

I'll give some thought to trying to combine the graphs into a single one. It seems to me that the downsides would be:

  1. It complicates the graph output format slightly
  2. We would need to add some logic to the already-complex function which adds test edges (or alternately, do some potentially slow analysis after the fact to determine which edges were added)
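
The "after the fact" option in point 2 amounts to diffing the edge sets of the two snapshots. A minimal sketch, using plain tuples for edges:

```python
# Recover which edges add_test_edges() introduced by subtracting the
# linked-graph edge set from the with-tests edge set. This is linear in
# the number of edges; the cost is mainly keeping both snapshots around.
linked = {("a", "b"), ("b", "c")}
with_tests = {("a", "b"), ("b", "c"), ("a", "t1"), ("t1", "b")}
added = with_tests - linked
print(sorted(added))
```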

My original hope was that the information would be compact enough to include in the metrics we are collecting with snowplow, but after giving it some thought I don't think that will be workable. I don't know what our snowplow message size limit is, but 40KB was a limit I saw cited, and that seems like a reasonable number to plan around. For clients with large graphs, which are the type we are most interested in, it would be tricky to stay under 40KB.

@dbeatty10
Contributor

I wonder if there's some kind of compact summary that we could include in snowplow messages that is guaranteed to fit under some size X, even for arbitrarily large graphs. E.g., maybe we could choose some representative summary statistics from which we could generate "similar" graphs.

Of course this wouldn't give us an isomorphic representation like graph.json would, but could give us a lightweight complement to it.

Spitballing some ideas of summary stats:

  • total number of nodes (potentially by resource type)
  • "average" node depth*
  • "average" node in degree* (potentially as a function of depth)

* Where "average" is actually location, scale, and shape parameters of a skew normal distribution.
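
As a toy illustration of the depth and in-degree stats above (using a hand-built adjacency list, stdlib only; a skew-normal fit, e.g. `scipy.stats.skewnorm.fit`, could then reduce each distribution to location, scale, and shape parameters):

```python
from collections import defaultdict

# Hypothetical sketch: compute per-node depth (longest path from a root)
# and in-degree on a small DAG a -> {b, c} -> d.
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
pred = defaultdict(list)
indeg = defaultdict(int)
nodes = set()
for src, dst in edges:
    pred[dst].append(src)
    indeg[dst] += 1
    nodes.update((src, dst))

memo = {}
def depth(n):
    # Roots (no predecessors) have depth 0; otherwise 1 + max parent depth.
    if n not in memo:
        memo[n] = 0 if not pred[n] else 1 + max(depth(p) for p in pred[n])
    return memo[n]

depths = {n: depth(n) for n in sorted(nodes)}
print(depths)       # per-node depth
print(dict(indeg))  # in-degree for each non-root node
```

Summary stats over these two distributions would then fit comfortably in a snowplow message regardless of graph size.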

I tried googling a bit for "summary statistics for directed acyclic graph"; most of the hits were fascinating, yet not relevant. Here's one that might be useful for brainstorming purposes:
https://documentation.sas.com/doc/en/pgmsascdc/v_037/casactml/casactml_network_examples86.htm#casactml.network.summaryc

@dbeatty10 dbeatty10 removed the triage label Apr 14, 2023
@peterallenwebb
Contributor Author

I agree that capturing some rudimentary stats is a good idea, with a lot of potentially interesting applications for the results. I'll read up a bit on the topic and try to add a statistical summary to the snowplow tracking.

@boxysean
Contributor

boxysean commented May 2, 2023

Very nice @peterallenwebb 👍 This really smooths out the process of asking users for their DAG to help with our testing. I could imagine wanting to generate some graphs to help us prototype some of the cross-project ref features coming in 1.6 and beyond. Also ➕ to the idea of creating a folder of graphs to refer back to for future work. Thanks!
