[CT-2414] Output Graph Summary Information on When dbt Compiles #7357
Comments
Smart idea @peterallenwebb ! 🧠 What would you see as the mechanism for these to be shared with the Core team on a routine basis? I'm wondering if maybe just a single anonymized and compact What would you see as the pros/cons?
@dbeatty10 I don't have a specific mechanism for sharing in mind, and I'm open to ideas. One possibility is that the Cloud Artifacts team could direct these outputs to an S3 bucket where we could take a look. If the Core team thinks the solution I've sketched here is a good idea, I will ask for some further feedback from Artifacts. Of course, for non-Cloud dbt users, asking for this graph summary is also less invasive than asking for a full manifest. I'll give some thought to trying to combine the graphs into a single one. It seems to me that the downsides would be:
My original hope was that the information would be compact enough to include in the metrics we are collecting with snowplow, but after giving it some thought I do not think that will be workable. I don't know what our snowplow message size limit is, but 40KB was a limit I saw cited, and that does seem like a reasonable limit for our use case. For clients with large graphs, which are the type we are most interested in, it would be tricky to stay under 40KB.
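To make the size concern concrete, here is a rough sketch of how one might check a summary against such a limit. The 40KB figure is only the number cited above, not a confirmed snowplow limit, and `chain_summary` is a hypothetical stand-in for a real graph summary (a linear chain where each node points to the next).

```python
import json

# Assumption: the 40KB figure cited above, not a confirmed snowplow limit.
ASSUMED_LIMIT_BYTES = 40 * 1024


def encoded_size(summary: dict) -> int:
    """Size in bytes of a compact JSON encoding of the summary."""
    return len(json.dumps(summary, separators=(",", ":")).encode("utf-8"))


def chain_summary(n_nodes: int) -> dict:
    """Hypothetical summary of a linear chain: node i points to node i + 1."""
    return {str(i): [i + 1] if i + 1 < n_nodes else [] for i in range(n_nodes)}


def fits(summary: dict, limit: int = ASSUMED_LIMIT_BYTES) -> bool:
    """True if the encoded summary stays within the assumed limit."""
    return encoded_size(summary) <= limit
```

Even this bare edge list, with no names or types at all, grows linearly with the graph, so a chain of ten thousand nodes is already well past the assumed limit.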
I wonder if there's some kind of compact summary that we could include in snowplow messages that is guaranteed to fit under size X, even for arbitrarily large graphs. e.g., maybe we could choose some representative summary statistics from which we could generate "similar" graphs. Of course this wouldn't give us an isomorphic representation like Spitballing some ideas of summary stats:
* Where "average" is actually location, scale, and shape parameters of a skew normal distribution. I tried googling a bit for "summary statistics for directed acyclic graph"; most of the hits were fascinating, yet not relevant. Here's one that might be useful for brainstorming purposes:
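As a rough illustration of a fixed-size summary, the sketch below computes a handful of statistics for a DAG given as a successor map. The particular statistics (node count, edge count, longest path, degree moments) are assumptions chosen for illustration, not the list from this comment. It also reports sample mean, standard deviation, and skewness as a simple stand-in for fitted skew normal parameters, which would need something like SciPy.

```python
import math
from collections import deque


def moments(values):
    """Return (mean, std, skewness); all zeros for an empty list."""
    n = len(values)
    if n == 0:
        return 0.0, 0.0, 0.0
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        return mean, 0.0, 0.0
    skew = sum((v - mean) ** 3 for v in values) / n / std ** 3
    return mean, std, skew


def dag_stats(succ):
    """succ maps each node to a list of successors; every node must be a key."""
    in_deg = {node: 0 for node in succ}
    for targets in succ.values():
        for t in targets:
            in_deg[t] += 1

    # Kahn's algorithm; depth[n] is the longest path (in edges) ending at n.
    depth = {node: 0 for node in succ}
    remaining = dict(in_deg)
    queue = deque(node for node, d in remaining.items() if d == 0)
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        for t in succ[node]:
            depth[t] = max(depth[t], depth[node] + 1)
            remaining[t] -= 1
            if remaining[t] == 0:
                queue.append(t)
    if visited != len(succ):
        raise ValueError("graph has a cycle")

    return {
        "nodes": len(succ),
        "edges": sum(len(t) for t in succ.values()),
        "longest_path": max(depth.values(), default=0),
        "in_degree": moments(list(in_deg.values())),
        "out_degree": moments([len(t) for t in succ.values()]),
    }
```

The result has the same size regardless of how large the graph is, which is the property needed to stay under a message size limit, at the cost of no longer describing the exact structure.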
I agree that capturing some rudimentary stats is a good idea, with a lot of potentially interesting applications for the results. I'll read up a bit on the topic and try to add a statistical summary to the snowplow tracking.
Very nice @peterallenwebb 👍 This really smooths out the process of asking users for their DAG for our testing purposes. I could imagine wanting to generate some graphs to help us prototype some of the cross-project ref features coming in 1.6 and beyond. Also ➕ to the idea of creating a folder of graphs to refer back to for future work. Thanks!
The `manifest.json` and `graph.gpickle` files would often be useful for investigating performance issues in Core's graph algorithms, but there are a few issues:
I propose that we add an additional output file called `graph_summary.json`, which would be more compact and anonymized. This file would contain only the edge structure and the name and type of each node. It will include that information at two separate points in time: immediately after the graph is linked together, and again after test edges have been added.
My hope is that this file could be shared with the Core team on a routine basis and would allow us to build and maintain a library of realistic graph structures from production environments, helping us improve and test our graph algorithms.
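A minimal sketch of what building such a file might look like, under stated assumptions: the key names `linked`, `with_test_edges`, and `succ`, the successor-map input, and the function names are all hypothetical rather than part of this proposal. The proposal does not say whether names are kept verbatim or replaced, so this sketch substitutes an index-based placeholder for each name.

```python
import json


def summarize(succ, resource_types):
    """Anonymized summary of one graph snapshot.

    succ maps each node name to a list of successor names;
    resource_types maps each node name to its type (e.g. "model").
    """
    index = {name: i for i, name in enumerate(sorted(succ))}
    return {
        str(index[name]): {
            "name": f"node_{index[name]}",  # hypothetical placeholder name
            "type": resource_types[name],
            "succ": sorted(index[t] for t in succ[name]),
        }
        for name in sorted(succ)
    }


def build_graph_summary(linked, with_test_edges, resource_types):
    """Combine the two snapshots described above into one document."""
    return {
        "linked": summarize(linked, resource_types),
        "with_test_edges": summarize(with_test_edges, resource_types),
    }


def write_graph_summary(path, summary):
    """Write the summary as compact JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(summary, f, separators=(",", ":"))
```

Using integer indices for edges keeps the file small and means the edge structure carries no project-specific identifiers.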
Immediate applications include: