-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize BaseRelation.matches() #6844
Conversation
Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide. |
@boxysean If you have a chance to kick the tires on this change, please do and let me know how it works for you. If it looks good to you, I'll discuss it with other stakeholders. |
@peterallenwebb This is awesome!! Perf bottlenecks in dbtFirst, I want to offer some higher-level context for this change:
This PR: proposed trade-off
I could be open to making this change. The A little history:
A concrete exampleHere's a simple case for reproducing when this exception would be helpful. On Snowflake, the default case is uppercase (both ANSI-compliant and annoying), and quoted identifiers are case-sensitive. This was the bane of my existence in 2018; since then, we've disable quoting for relation identifiers by default, and it's much more pleasant. So if I create a model like: -- models/my_model.sql
select 1 as id I create or replace view analytics.dbt_jcohen.my_model as (select 1 as id); In Snowflake, this relation has a much more boisterous name: # dbt_project.yml
quoting:
identifier: true Now, dbt is going to try to template a SQL statement like: create or replace view analytics.dbt_jcohen."my_model" as (select 1 as id); But we don't even get there, because first dbt populates the adapter cache, then it tries to match up my model with an entry in the cache, and it sees there's there's an almost but not quite matching entry. And we stop the whole thing in its tracks, because we want to avoid an ugly scenario.
If I try the same, having checked out your branch, I don't get that exception—the model succeeds!—because we didn't get a match, and we didn't check for an approximate match either. dbt successfully created a view named select * from analytics.dbt_jcohen.my_model;
select * from analytics.dbt_jcohen."my_model"; It's a gross situation, no question. But, in keeping with what I said above, I'm not convinced that the |
Thanks for the explanation @jtcohen6! I'd be curious to see some real-world results, but I won't have time in the next 1-2 weeks to review due to travel. A similar analysis to what I did here would help us determine the impact of @peterallenwebb's proposed change. I will ask my client to see if they could support. I'd also be curious to see some unit tests on |
@jtcohen6 Yes, thanks very much for this clear explanation of where the practical performance concerns really are! I'll keep it in mind as I inevitably continue to tinker with performance. I definitely don't feel strongly about getting this change in if it is unlikely to make an impact under real world conditions. |
@peterallenwebb If this proves to be a significant perf boost in an "in-the-wild" scenario, I'd be supportive of moving forward! Sounds like the next step here is testing with a real large project. We do have one of these for our own internal analytics :) |
target = self.create(database=database, schema=schema, identifier=identifier) | ||
raise ApproximateMatchError(target, self) | ||
if database is None and schema is None and identifier is None: | ||
raise dbt.exceptions.DbtRuntimeError( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would put this first since you don't need to run self._is_exactish_match()
. I'm assuming that method is the expensive method. I'd also rephrase the if clause:
if not any(database, schema, identifier):
raise dbt.exceptions.DbtRuntimeError(...)
"Tried to match relation, but no search path was passed!" | ||
) | ||
if identifier is not None and not self._is_exactish_match( | ||
ComponentName.Identifier, identifier |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the way it was originally written. I think the performance pickup comes from two places:
- not looking for the approximate match
- not exiting the for loop once a match was found
I think it could look something like this:
if not any(identifier, schema, database):
raise dbt.exceptions.DbtRuntimeError(...)
search = filter_null_values(
{
ComponentName.Identifier: identifier,
ComponentName.Schema: schema,
ComponentName.Database: database
}
)
return any(
(
self._is_exactish_match(existing_components, new_component)
for existing_components, new_component in search.items()
)
)
I'm pretty sure any()
will lazily evaluate each element in the generator and then stop when it finds one, which is what you're trying to do.
@jtcohen6 @boxysean I'm probably going to close this particular PR for now, since there are risks to any effective improvement, and we have not demonstrated that it is needed in the field. That said, I have some interesting parting observations... Our current strategy for looking up relations in the cache is compute-intensive, since it has to account for all the ways a relation name might be quoted or cased. Our strategy also takes time linear in the number of tables in the cache. The entire list of tables in the cache is scanned for a match every time a relation is looked up. A lookup in a cache with 1000 entries will be ten times slower than one with 100 on average. With some development effort we could make this a constant-time lookup with much lower overhead. It's not clear how important this bottleneck is in real-world production scenarios, but the anonymized client project which @boxysean provided me has the following runtimes for
As @jtcohen6 has pointed out to me, our multithreading model might blunt the impact of the compute savings in real-world scenarios. It's still interesting how much overhead is being spent on this operation, though. |
Agree that it's very interesting. If you think the right next step is to close this specific PR for now, given some unknowns in the risks & benefits, I won't argue. I do think we should keep #6842 open as a promising lead to revisit for perf improvements in the future. |
This PR has been marked as Stale because it has been open with no activity as of late. If you would like the PR to remain open, please comment on the PR or else it will be closed in 7 days. |
Although we are closing this PR as stale, it can still be reopened to continue development. Just add a comment to notify the maintainers. |
resolves #6842
Description
This PR optimizes the BaseRelation.matches() function in order to avoid costly string processing and comparison operations which are being done a very large number of times during certain large project runs of
dbt build
. On the large scenario described in #6842, it took my local run time fordbt build
from 23m to 14m.As written, we would lose the ApproximateMatchError exception, since determining whether a relation is a an approximate match was a large part of the time spent. We'll need to determine whether that is justified by the savings, or if there is a better way to avoid performing the check a large number of times.
At any rate, there is a lot of room for improvement in this bottleneck.
Checklist
changie new
to create a changelog entry