[Feature] Include column-level information in dbt's adapter cache, for faster get_columns_in_relation #9595
Comments
❤️ I would love to support caching at the column level. It's not just custom user macros that make covetous use of `get_columns_in_relation`. We already have a "relational" cache, stored on the adapter for the duration of an invocation, and it would be possible to extend this cache to also include information about column names & data types.

However: That alone is not enough. We also need to find & update all macros that change the columns in a relation (adding, removing, resizing, ...). Those macros either need to update the cache (better) or clear it (acceptable) — same as we do for materializations that create relations, or methods that drop them. If we start caching column metadata before we've done that, we will end up with incorrect results, which is what happened when we experimented with this in the past.
So: I think it's doable, I think it's interesting, and I think it will first require a full accounting of all the places where we (potentially) mutate the columns of a model. Who's interested in a little bit of spelunking? :)
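To make the invalidation requirement concrete, here is a minimal sketch of the idea. This is not dbt's actual `RelationsCache` API — the class and method names are hypothetical — but it illustrates why any macro that mutates a relation's columns must update or clear the cache:

```python
# Hypothetical sketch, NOT dbt's real adapter cache: a column-level cache
# keyed by relation, with explicit invalidation for column-mutating DDL.
class ColumnCache:
    def __init__(self):
        # relation key -> list of (column_name, data_type) tuples
        self._columns = {}

    def get(self, relation_key, fetch):
        """Return cached columns for a relation, calling fetch() on a miss."""
        if relation_key not in self._columns:
            self._columns[relation_key] = fetch()
        return self._columns[relation_key]

    def invalidate(self, relation_key):
        """Any macro that adds/drops/resizes columns must call this,
        or subsequent reads will return stale column metadata."""
        self._columns.pop(relation_key, None)


cache = ColumnCache()
# First call hits the warehouse; second call is served from the cache.
cols = cache.get("db.schema.my_model", lambda: [("id", "integer")])
# After an `alter table ... add column`, the cache entry must be dropped:
cache.invalidate("db.schema.my_model")
```

The "update the cache (better)" option from the comment above would replace `invalidate` with a method that patches the stored column list in place, saving the re-fetch.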
@jtcohen6 Great to hear :) Just to understand the problem space better: in which use cases do we have schemas that change retroactively? I thought dbt materialization completely follows the directed acyclic graph, and once a model is materialized, its schema is immutable for all downstream dependencies. Basically, we shouldn't face a changing schema within a dbt run. Or would that happen if we request the schemas of models that aren't materialized yet? (Does that even work?)
@jtcohen6, I'm sort of interested in diving in, but I'm not clear on scope. Does "all the places" include:
@OneCyrus This came up specifically in the

@mike-weinberg In theory, all three; in practice, primarily adapters or packages that roll custom materializations, since that tends to be where dbt is modifying the structure of DWH objects (more than just introspecting the structure/content).

To be clear, this would be spelunking without a guarantee of striking gold. I'm not convinced that our current caching patterns — populating a bunch of batch information upfront that we can get easily/cheaply — are the right fit here.

To simplify scope, I might start with:
Is this your first time submitting a feature request?
Describe the feature
I have some macros that rely heavily on get_columns_in_relation. They operate at the column level, and they make many calls against the same relation to get information about its columns. In the end there are thousands of calls to get_columns_in_relation, while there are actually only 10-100 distinct relations.
As the calls are quite expensive, it would be great to have a shared in-memory cache that survives across macro calls.
It should be something like a key/value store: add the return value of get_columns_in_relation for a specific relation, then on the next macro call check whether the value is already there, and only call get_columns_in_relation if it isn't.
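The check-then-fetch pattern described above is ordinary memoization. A minimal Python sketch, where `fetch_columns_from_warehouse` is a hypothetical stand-in for the expensive warehouse query behind `get_columns_in_relation`:

```python
from functools import lru_cache

# Hypothetical stand-in for the warehouse query behind
# get_columns_in_relation; each call is assumed to be expensive.
def fetch_columns_from_warehouse(relation_name):
    return [("id", "integer"), ("name", "varchar")]

# Memoize by relation name: repeated lookups for the same relation
# return the cached result instead of re-querying the warehouse.
@lru_cache(maxsize=None)
def get_columns_cached(relation_name):
    return fetch_columns_from_warehouse(relation_name)

# Thousands of calls collapse to one warehouse query per distinct relation:
a = get_columns_cached("db.schema.orders")
b = get_columns_cached("db.schema.orders")  # served from the cache
```

As the maintainer comment notes, the hard part is not the cache itself but invalidating it whenever a macro alters a relation's columns mid-run.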
Describe alternatives you've considered
Surely we could call get_columns_in_relation in other places and let the result flow through the codebase, but this would add a lot of complexity and is not a good tradeoff for the performance gain.
Who will this benefit?
Everyone who loves a performant dbt.
Are you interested in contributing this feature?
No response
Anything else?
No response