
[Feature] Include column-level information in dbt's adapter cache, for faster get_columns_in_relation #9595

Open
3 tasks done
OneCyrus opened this issue Feb 17, 2024 · 4 comments
Labels
enhancement (New feature or request), performance

Comments

@OneCyrus

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

I have some macros that rely heavily on get_columns_in_relation. They operate at the column level, and there are many calls against the same relation to get information about its columns. In the end there are thousands of calls to get_columns_in_relation while there are only 10-100 distinct relations. Since these calls are quite expensive, it would be great to have a shared in-memory cache that survives individual macro calls.

It could be something like a key/value store: store the return value of get_columns_in_relation for a specific relation, and on the next macro call check whether the value is already there, only calling get_columns_in_relation if it isn't.
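
A minimal sketch of that key/value store in Python, assuming a plain dict that lives for the duration of a dbt invocation; `column_cache` and `get_columns_cached` are illustrative names, not an existing dbt API:

```python
# Hypothetical memoizing wrapper around the adapter's
# get_columns_in_relation; illustrative only.
column_cache = {}

def get_columns_cached(adapter, relation):
    # Key on the fully qualified name, e.g. "database.schema.table"
    key = relation.render()
    if key not in column_cache:
        # Only hit the warehouse on a cache miss
        column_cache[key] = adapter.get_columns_in_relation(relation)
    return column_cache[key]
```

With thousands of calls against only 10-100 distinct relations, this would collapse the introspective queries to one per relation.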

Describe alternatives you've considered

We could certainly call get_columns_in_relation in other places and let the result flow through the codebase, but that would add a lot of complexity and is not a good trade-off for the performance gain.

Who will this benefit?

Everyone who loves a performant dbt.

Are you interested in contributing this feature?

No response

Anything else?

No response

@OneCyrus added the enhancement and triage labels Feb 17, 2024
@jtcohen6
Contributor

Everyone who loves a performant dbt.

❤️

I would love to support caching at the column level. It's not just custom user macros that make heavy use of get_columns_in_relation; it's our own materializations as well (especially snapshots).

We already have a "relational" cache, stored on the adapter for the duration of an invocation, and it would be possible to extend this cache to also include information about column names & data types.

However: That alone is not enough. We also need to find & update all macros that change the columns in a relation (adding, removing, resizing, ...) — those macros either need to update the cache (better), or clear it (acceptable) — same as we do for materializations that create relations, or methods that drop them.
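
To illustrate, a code path that alters a relation could keep the hypothetical `column_cache` above honest like this; the function name and the generic ALTER statement are made up for the example:

```python
def alter_column_type(adapter, relation, column_name, new_type):
    adapter.execute(
        f"alter table {relation.render()} "
        f"alter column {column_name} set data type {new_type}"
    )
    # Update the cached entry in place (better), or evict it so the
    # next lookup re-introspects the relation (acceptable):
    column_cache.pop(relation.render(), None)
```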

If we start caching column metadata before we've done that, we will end up with incorrect results, which is what happened when we experimented with it in dbt-spark a few years ago.

So - I think it's doable, I think it's interesting, I think it will first require a full accounting of all the places where we (potentially) mutate the columns of a model. Who's interested in a little bit of spelunking? :)

@jtcohen6 removed the triage label Feb 20, 2024
@jtcohen6 changed the title from [Feature] memory cache in macros to [Feature] Include column-level information in dbt's adapter cache, for faster get_columns_in_relation Feb 20, 2024
@OneCyrus
Author

@jtcohen6 great to hear :)

Just to understand the problem space better: in which use cases do schemas change retroactively? I thought dbt materialization strictly follows the directed acyclic graph, and once a model is materialized, its schema is immutable for all downstream dependencies; basically, we shouldn't face a changing schema during a dbt run.

Or would that happen if we request the schemas of models which aren't materialized yet? (Does that even work?)

@mike-weinberg

@jtcohen6, I'm sort of interested in diving in, but I'm not clear on the scope. Does "all the places" include:

  1. adapters?
  2. first party dbt packages?
  3. third party dbt packages on package hub?

@jtcohen6
Contributor

@OneCyrus This came up in the dbt-spark example specifically because we cached the columns of an incremental model at the start of a run / at the start of its materialization, then modified the columns in the model (alter statements for schema evolution as part of the on_schema_change config), then introspected the columns again (a cache lookup) to template out the subsequent insert statement, and got the wrong values. The problem was that the alter statement didn't update or invalidate the cache.
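
In terms of the hypothetical `get_columns_cached` sketch above, the failure sequence was roughly:

```python
cols = get_columns_cached(adapter, relation)  # cache populated early on
# on_schema_change evolution runs an ALTER, but nothing touches the cache:
adapter.execute(f"alter table {relation.render()} add column new_col int")
cols = get_columns_cached(adapter, relation)  # stale: new_col is missing
# ... and the insert statement templated from `cols` is now wrong
```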

@mike-weinberg In theory, all three; in practice, primarily adapters or packages that roll custom materializations, since that tends to be where dbt is modifying the structure of DWH objects (more than just introspecting the structure/content).

To be clear, this would be spelunking without guarantee of striking gold. I'm not convinced that our current caching patterns — populating a bunch of batch information upfront that we can get easily/cheaply (show-type statements), then using the cache to definitively answer simple positive/negative questions about which objects exist — will scale well to additional information that's less cheap/fast to batch up.

To simplify scope, I might start with:

  • just Snowflake
  • just with the goal of making get_columns_in_relation reliably return a cached value if called multiple times in sequence
  • with cache updates/invalidation when the adapter knows it's evolving the schema (a rough sketch of this shape follows below)
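
A rough, non-authoritative sketch of that scoped-down shape; all names here are hypothetical, and a real implementation would live inside the dbt-snowflake adapter rather than a free-standing class:

```python
from threading import Lock

class ColumnCache:
    """Per-invocation cache of get_columns_in_relation results (sketch)."""

    def __init__(self, adapter):
        self.adapter = adapter
        self._columns = {}
        self._lock = Lock()  # dbt runs model nodes on multiple threads

    def get_columns_in_relation(self, relation):
        key = relation.render()
        with self._lock:
            if key not in self._columns:
                self._columns[key] = self.adapter.get_columns_in_relation(relation)
            return self._columns[key]

    def invalidate(self, relation):
        # Called from any code path the adapter knows is evolving the schema
        with self._lock:
            self._columns.pop(relation.render(), None)
```

The invalidate hook is the crux: every column-mutating path on Snowflake would need to call it (or update the entry in place) before the cached values could be trusted.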
