Add namespacing for dbt resources #1269
Comments
Hi @drewbanin, is this different from / related to the comment here "be sure to namespace your models"? https://docs.getdbt.com/docs/configuring-models We're looking for a clean way to separate sets of pipelines, e.g. all pipelines related to a consumer ML service, and pipelines related to analytics. This will be particularly useful as our pipelines grow: being able to run dbt run/test on a specific namespace would make processing finish faster, because unrelated models aren't built. |
hey @danimcgoo - yeah - this would be slightly different than the namespacing described in the link you shared above. The type of namespacing described here would make it possible to refer to models with more specific names. Imagine you wanted to have two models with the same name, but materialized in different schemas. Namespacing in dbt would let you do:
And then later:
select * from {{ ref('shopify_uk.sales') }}
union all
select * from {{ ref('shopify_fr.sales') }}
Here, there would be two models named I would say that for the moment, dbt's approach to model selection might be sufficient for your case. Have you checked out tagging yet? You can use Hope some of this is helpful -- lmk if there's anything else I can clarify |
Hey @drewbanin, tagging looks like it'll do the trick. Thanks for the tip! |
Hi @drewbanin Is this "Help Wanted" still current? I'm interested in supporting this feature in our internal projects because we have 100s of sql files that have a hierarchy of subdirectories under the "models" directory to help keep things semantically organized. Also, we have several SQL files with the same name but that exist in different directories (as illustrated above with the sales.sql example). Offhand, I think that this feature needs to be enabled with a flag somewhere to avoid getting an error during the compile phase when two SQL files have the same name (i.e., the "dbt found two resources with the name" CompilationException). One place to add this flag would be in dbt_project.yml at the top-level. Maybe like this: dbt_project.yml: use_reference_namespace = true The next trick is to thread this parameter through to the places where the unique identifiers are generated. As one example, in parser/base_sql.py
select * from {{ ref('shopify_uk/sales') }}
union all
select * from {{ ref('shopify_fr/sales') }} So, I think one solution would approximately be to update the Thoughts? Thanks! |
(Actually, in looking closer at the |
Hey @heisencoder - before we begin any work here, I'd like to better develop our thinking around how namespacing will work in general. In your example ( dbt already has a notion of "fully qualified names" -- this looks like If namespacing is turned on, should it be allowed to ref a model without namespacing it? Or are namespaces solely intended to make it possible to handle models with shared names? Presently, dbt will assert that no two models share the same name. I think we'll need to defer that validation until all of the I'm definitely more in favor of using dots ( What do you think about all of this? |
Thanks @drewbanin for your thoughts! I haven't looked into the "fully qualified names" that you've mentioned for the Here's a proposed approach that hopefully fits in the framework you've outlined:
We could do these changes without adding a namespacing mode, although adding a namespacing mode could help with forward compatibility issues with the manifest.json format. As a proof-of-concept, here are examples of the changes I imagined to
def find_in_list_by_name(haystack, target_name, target_package, nodetype):
    """Find an entry in the given list by name.

    target_name could either be the base of the model name or could include the
    package prefix separated with '.' characters.
    """
    base_name_matches = []
    for model in haystack:
        name = model.get('unique_id')
        # An exact match on the (possibly dotted) name wins immediately.
        if id_matches(name, target_name, target_package, nodetype, model):
            return model
        # Otherwise remember models that match on the base name alone.
        if id_matches(name, target_name.split('.')[-1], target_package, nodetype, model):
            base_name_matches.append(model)
    if len(base_name_matches) >= 2:
        raise_compiler_error(
            'Ambiguous reference %s has multiple matches: %s'
            % (target_name, base_name_matches))
    if len(base_name_matches) == 1:
        return base_name_matches[0]
    return None
(And then also update Note that as a side-effect of including the full package prefix in the target_name, the compile tests in |
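For anyone who wants to try the idea outside of dbt, here is a minimal, self-contained sketch of the same lookup. It is a simplified variant, not the proposed patch: instead of calling `id_matches`, it treats the dotted name as a suffix of each model's FQN. The `resolve_ref` function, the `AmbiguousRefError` class, and the dict shape with an `fqn` list are all hypothetical names chosen for illustration.

```python
class AmbiguousRefError(Exception):
    """Raised when a name matches more than one model."""


def resolve_ref(models, target_name):
    """Resolve a possibly dotted name against each model's FQN.

    `models` is a list of dicts with an 'fqn' list such as
    ['my_project', 'shopify_uk', 'sales'] (a hypothetical shape).
    """
    target_parts = target_name.split('.')
    matches = [
        m for m in models
        # Match when the dotted name is a suffix of the model's FQN.
        if m['fqn'][-len(target_parts):] == target_parts
    ]
    if len(matches) > 1:
        fqns = sorted('.'.join(m['fqn']) for m in matches)
        raise AmbiguousRefError(
            'Ambiguous reference %s has multiple matches: %s'
            % (target_name, ', '.join(fqns)))
    return matches[0] if matches else None


models = [
    {'fqn': ['my_project', 'shopify_uk', 'sales']},
    {'fqn': ['my_project', 'shopify_fr', 'sales']},
    {'fqn': ['my_project', 'shopify_uk', 'orders']},
]

uk_sales = resolve_ref(models, 'shopify_uk.sales')  # qualified: unambiguous
orders = resolve_ref(models, 'orders')              # bare name, only one match
```

With this data, a bare `resolve_ref(models, 'sales')` raises `AmbiguousRefError`, which is the case where the caller would have to add the namespace prefix.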
Hey @heisencoder - I think everything you're saying here is totally reasonable. The big challenge I can imagine is that we've always held as a hard constraint that resource names are unique. As such, there may be pernicious failures in parts of the codebase which rely on these model names being unique. We'll definitely have to figure out how to adjust the I'm supportive of this idea, but we need to figure out:
This is probably going to be a pretty big project! |
Given the large number of edge cases, I think it makes sense to flag control this new code so that old code doesn't have to worry about these edge cases. So I'm thinking that this could be controlled by a The unique_ids for the nodes would then become the FQN path instead of just the base filename. This should make them be unique in all contexts. I can provide some suggested documentation text. I hadn't thought about the
|
Why not just copy what other programming environments do (python) and make the folder the package name? As a newbie, I expected it already worked like this. I think this should be the default with an option to disable it that allows legacy users to migrate. |
Also, I will add that the source selector works about like I'd want from a usage point of view. That may provide a better way to migrate as well. Any time the old ref() is called, the uniqueness of that model name is checked, and an exception is raised if there are duplicates. A new ref that uses packages (or maybe just a new arg!) would limit this check to the namespace that was specified. |
I love the |
hey @rpedela-recurly - we definitely don't want to namespace things by schema name! Destination schemas are configuration in dbt, so the actual schema that a model renders into should be totally different in dev vs. prod. You can actually currently supply two arguments to the
In practice, this isn't super useful though -- model names must currently be unique across all packages included in a project. Increasingly, my thinking is that we should:
If two models share the same name, then you would need to qualify the model name with a package name in the Do you buy that? |
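A rough sketch of the behaviour described above, assuming names only need to be unique within a package and that a bare ref is accepted whenever it is unambiguous. `ModelIndex` and its methods are hypothetical and only mimic the one-arg and two-arg forms of `ref`; this is not how dbt stores its nodes.

```python
from collections import defaultdict


class ModelIndex:
    """Toy index where model names are unique per package only."""

    def __init__(self):
        self._by_name = defaultdict(dict)  # name -> {package: model}

    def add(self, package, name, model):
        # Duplicates are rejected only inside the same package.
        if package in self._by_name[name]:
            raise ValueError(
                'duplicate model %s in package %s' % (name, package))
        self._by_name[name][package] = model

    def ref(self, *args):
        """Mimic one-arg ref(name) and two-arg ref(package, name)."""
        if len(args) == 2:
            package, name = args
            return self._by_name.get(name, {}).get(package)
        (name,) = args
        candidates = self._by_name.get(name, {})
        if len(candidates) > 1:
            raise ValueError(
                'model %s exists in packages %s; qualify the ref'
                % (name, sorted(candidates)))
        return next(iter(candidates.values()), None)


index = ModelIndex()
index.add('shopify_uk', 'sales', 'uk sales model')
index.add('shopify_fr', 'sales', 'fr sales model')
index.add('shopify_uk', 'orders', 'uk orders model')

qualified = index.ref('shopify_fr', 'sales')  # package-qualified lookup
unqualified = index.ref('orders')             # unique name, no package needed
```

Calling `index.ref('sales')` here raises, since the name exists in two packages and the ref has to be qualified.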
I didn't realize you could define
Then namespace is defined in |
@drewbanin I like the two notions of making packages within a project easy, and dropping the name uniqueness across packages. This is similar to ideas that Betterment has been tossing around a lot lately. We have models across a small but slowly growing number of internal domains that all belong in the same project, but have different owners and use cases depending on the domain. We want to relax the requirement for unique model names across these domains (especially for ephemeral models). We were thinking of suggesting custom schema names (not the actual materialized schema names, which may differ from run to run) as a namespace, but I think if we could easily create intra-project packages, that could suffice. The requirements that we've been thinking about are as follows (let's say for these requirements that
I know everyone on our team would also be thrilled if ephemeral models within different folders of the same package could be similarly named without creating conflicts as well, but the points above are more important. |
Also @drewbanin, when you say "make it easier to define multiple packages inside of a single project", what do you imagine that looks like? Something along the lines of the following structure, where DBT perhaps infers the package names of sub-projects, and maybe inherits project config from the "parent"?:
# project/packages.yml
packages:
  - local: domain1
  - local: domain2

# project/dbt_project.yml
name: 'company_project'
version: '1.0'
models:
  domain1:
    schema: d1
  domain2:
    schema: d2
Maybe even allowing the |
I am very interested in being able to namespace objects like models and sources and macros—at least in a rudimentary fashion by removing the requirement that model names be unique even between packages—and I'd be happy to try to put together a PR if someone more familiar with the project can give me a little guidance on where to get started. |
hey @mjumbewu - thanks for writing all of this out! I really like your suggestions on how we should approach model-level namespacing: it's really sensible and well-specified. Re: making namespacing easier: I like the general idea of your suggestion:
The only issue with this approach is that you need to run Instead, I think we could do one of the following: 1. Smarter local packages Pros:
Cons:
2. A new dbt_project.yml config
Alternatively, we could get clever with
But I don't think I love that approach. Pros:
Cons:
3. config-level namespaces This was the one I had in mind when I wrote about making namespacing easier, but increasingly, I don't think this approach is such a good idea. I was picturing a
All of the models in Pros:
These are the things bouncing around in my head. I think approach 2 outlined above is going to be our best bet, but I'm curious what you all think too. |
Hi there, just wanted to continue that discussion. I'm a beginner on DBT, so I would just like to give my opinion from a "beginner" point of view.
Seeds
First I would like to point out that this would be useful for Seeds as well, consider the following simple use-case: 2 pipelines are being worked on, they target 2 different databases,
data/a/calendar.csv
data/b/calendar.csv
The 2 pipelines target 2 different Databases, so their names do not clash when they run against the Warehouse. But Now, as a beginner, my first intuition would be to use
Sub projects
I think the "sub project" abstraction is quite hard to wrap my head around:
Proposal
(I'm aware it's a summary/mix of some other solutions above)
Pros:
Cons:
|
@drewbanin Where does this go from here? Is this the 'DBT extension proposal'? Do you need to make a BDFL pronouncement? Was shocked to discover that this is an outstanding issue still. I've always used multiple packages with qualified |
Hi Drew, I was horrified when I just noticed that I could not just namespace the same model name by putting these in different packages. When one references with Instead of discussing more advanced ways (as in bigger changes) of namespacing as above, would it be possible to at least make this approach work? It seems to do the trick (for me at least). I would think this makes for fewer surprises when using the dbt framework. Is there anything in the way that dbt is structured internally that makes a change like this painful? I haven't looked into the source code in dbt yet, but I might be able to contribute with a PR if you find that this would be a suitable way forward. And sorry for whining :) dbt is a great tool, thanks for your contributions! |
oh geez... I missed this one and am disappointed that I have abdicated my BDFL responsibilities. Thanks for the bump @hhagblom and for getting this back on my radar! Yep.... it is highly weird that dbt supports a two-arg version of
@jtcohen6 curious to get your take on this one too. Will poke around here and report back with findings :) |
ok @hhagblom & co - let's not make a habit out of this, but I did open my first PR to dbt in a couple of months over here https://github.com/fishtown-analytics/dbt/pull/3053/files I'm going to need to spend more time here with testing, but if any of y'all want to take this for a spin locally, feedback is very welcomed :D |
@drewbanin Many thanks! What I'll do is that I'll fork the repo and will do a rebase-onto a stable release and take this for a spin in our project. I'll let you know how it works out for us! |
@Shadowsong27 The work in #3053 got 90% of the way to supporting this, but it ran into an obstacle around defining model properties in |
Sorry for the +1 🙈 but we just ran into this as well. We're moving through different layers, raw -> clean -> common, where the models are refined in each layer. Currently we have to name the models differently coming from each layer to prevent the name collision. |
Is there any news on the status of this issue? This and unit testing are the two big ones for me to make dbt feel like the complete real deal :) |
@louis-vines This is still very much top-of-mind for me. I don't have any immediate plans for development that I can share, but I am thinking that this will be an important ingredient for supporting dbt deployments at larger orgs. Probably part of a larger capability around cross-project lineage, which could exist without this capability, but would be much much more desirable with it. |
For folks following this issue, and interested in namespacing capabilities as a crucial tool for scaling out dbt deployments: I'd be very curious to hear your thoughts on the proposal outlined in #5244 |
How does anyone use dbt at scale without this? My team was hoping to switch over to dbt but without some form of namespacing I don't think we can afford the friction in adjusting our database and workflow! |
By prefixing every model? |
Alternatives:
|
Oh, I rather like the second option. Thank you for the suggestion, @hernanparra. |
I'm not sure I understand point 2. I set up a sample project like
where
This still fails because I have two models with the same name even though I wanted to alias one of the models to a different identifier.
|
@stevenlw-porpoise I think the suggestion is to use alias in the opposite direction, e.g. call your file This is a bit of a hassle for us migrating some dataform pipelines over (one of the few areas that dataform is better), but we do already break things into a bunch of different pipelines (which has its own problems) which mitigates the problem somewhat. |
Is there a fix being implemented or are we left to prefix every model? If there is a fix being implemented, is there an ETA? Thanks! |
@elyobo Tried that but DBT doesn't let me create it. Model name has to be unique both in custom alias and file name for same schema and database. |
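To make the two constraints in this exchange concrete, here is a toy model of them: one check on model names, and a separate check on the resulting (database, schema, alias) relation. It is only an illustration under those assumptions, not dbt's actual validation code, and `find_conflicts` and the dict shape are hypothetical.

```python
def find_conflicts(models):
    """Return (duplicate_names, duplicate_relations) for a list of models.

    Each model is a dict with 'name', 'database', 'schema' and an optional
    'alias' that falls back to 'name' (a hypothetical shape).
    """
    seen_names, seen_relations = set(), set()
    dup_names, dup_relations = set(), set()
    for m in models:
        relation = (m['database'], m['schema'], m.get('alias', m['name']))
        if m['name'] in seen_names:
            dup_names.add(m['name'])
        if relation in seen_relations:
            dup_relations.add(relation)
        seen_names.add(m['name'])
        seen_relations.add(relation)
    return dup_names, dup_relations


# Prefixed file names, same alias, different schemas: no conflict.
prefixed = [
    {'name': 'uk_sales', 'database': 'analytics',
     'schema': 'shopify_uk', 'alias': 'sales'},
    {'name': 'fr_sales', 'database': 'analytics',
     'schema': 'shopify_fr', 'alias': 'sales'},
]

# Same alias in the same schema and database: the relations collide.
same_schema = [
    {'name': 'uk_sales', 'database': 'analytics',
     'schema': 'shopify', 'alias': 'sales'},
    {'name': 'fr_sales', 'database': 'analytics',
     'schema': 'shopify', 'alias': 'sales'},
]

# Same file name in different schemas: the model names collide.
same_name = [
    {'name': 'sales', 'database': 'analytics', 'schema': 'shopify_uk'},
    {'name': 'sales', 'database': 'analytics', 'schema': 'shopify_fr'},
]
```

Under this model only the first layout passes both checks, which matches the "prefix the file, alias the relation" workaround suggested above.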
This is one of the most significant dbt inconveniences. It would be nice to see this issue progress this year, it's been three years already. |
There is some promising looking work as part of the broader cross project work - #5244 |
Maybe I am late to the party, but just like there are macros for generating a schema and a database name, why isn't there a macro to generate the model name? I don't know how complicated it would be to add this feature, but it sure seems like overall, it would be the simplest approach. The previous behavior would be kept as default, not breaking anything and people that wanted to, can customize it to their will. Currently, my team has a structure similar to what others have shown:
And we would like to have:
But have a way to make the second approach be the same from dbt's "point of view". We thought about something like (very rough sketch):
|
There is one, Annoying but I'm sort of used to it now, the unique model names are easier to find in editors and fit nicely with the "how we structure our projects" guide on the dbt site, and if I really want my db to reflect different names then it's doable. I'm keen on seeing this fixed properly but it hasn't ended up being a blocker on us jumping over from |
Placeholder issue for thoughts related to namespacing. When we're able to prioritize work on namespacing, let's convert this to a more actionable issue:
Questions:
Other similar questions / use cases welcomed :)