Add guidance/features for reducing disk space and bandwidth usage #2920
We can point users at TF features such as provider plugin caching, but there is a drawback: Terraform does not guarantee concurrency safety (Terraform issue #31964, and Terragrunt issue #1875), thus we cannot perform initialization in parallel for, say, 50 modules at once.
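For reference, the provider plugin caching mentioned above is enabled with a one-line Terraform CLI configuration (or the `TF_PLUGIN_CACHE_DIR` environment variable); the cache directory path below is just an example:

```hcl
# ~/.terraformrc (Terraform CLI configuration)
# Enables the shared provider plugin cache. As discussed above, this
# cache is NOT safe for concurrent `terraform init` runs.
plugin_cache_dir = "$HOME/.terraform.d/plugin-cache"
```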
The first thing that comes to mind is to create a common cache for all Terragrunt modules, but here we are faced with several issues:
It will be easy. I checked: a shallow clone is roughly one-third to one-half the size of a full clone.
As far as I know, Terraform does not have a module caching feature. But we could implement it the same way as with providers: download modules into a common cache directory, then create symbolic links. I checked, and terraform works fine with module dirs that are links to other dirs. To summarize, I would suggest implementing something like
I don't know how to solve this issue. Any suggestions? |
I just took a quick look at the Terraform code, and unfortunately the code we need is located in the `internal` package. Of course, the obvious disadvantage is that if they suddenly radically change the module and provider loading, we will also need to update our code. But given that this can happen, we can deliver the new caching feature to users not as a default option, but as a deliberate choice. In other words, they should explicitly run |
I'm a bit worried about duplicating much of Terraform's own logic for discovering and downloading providers. Most of that is internal logic and not part of a public API with compatibility guarantees, which may make it tough to keep up to date as Terraform and OpenTofu change. Here's a bit of a zany idea that leverages their public API: in the
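If the zany idea is a network mirror served on localhost (as discussed later in the thread), the documented `provider_installation` CLI block it would lean on looks something like the following; the URL, port, and path are assumptions, not a real Terragrunt endpoint:

```hcl
# ~/.terraformrc — point Terraform at a mirror that Terragrunt would serve.
# Terraform requires network mirrors to be served over HTTPS.
provider_installation {
  network_mirror {
    url = "https://localhost:8080/v1/providers/"
  }
}
```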
We'd probably make this feature opt-in, at least initially. Once you turn it on, you get provider caching automatically, in a way that should be concurrency safe.
Yea, both of these are valid issues. Any ideas on solutions? Do the suggestions in #2923, especially a content-addressable store similar to pnpm with symlinks offer a potential solution?
It's easy for most things, but as I found out in #2893, one issue we hit with shallow clones is that the
I'm a bit worried about duplicating much of Terraform's own logic for discovering and downloading modules. Most of that is internal logic and not part of a public API with compatibility guarantees, which may make it tough to keep up to date as Terraform and OpenTofu change. Are there any hooks for downloading modules? For example, how does Terraform work in an air-gapped environment?
This would mostly be about documenting how to persist data, such as a provider cache, in a K8S cluster: e.g., with persistent volumes. |
Agree.
Great idea. Yes, there are; one of them, https://github.com/terralist/terralist, supports both Module and Provider registries.
In the case of npm, this is justified, since npm itself understands where to get which files from. In our case, terraform needs to be given a regular file structure, and for that we would have to create hundreds or thousands of symlinks. Building such a database is itself a non-trivial task, and considering that modules are not as big as plugins, I'm not sure it's worth spending so much time and resources on it.
Ah, will keep this in mind, thanks.
We can run a private registry locally, but then we need to change the module links to point to this private registry. Perhaps we could do this automatically after cloning repos into the cache directory, but this does not solve the disk usage issue, although compared to plugins this may not be so critical.
Ah, understood. |
If we also enable the plugin cache (which TG could enable via env var automatically when executing
Alright, keep thinking about it in the background to see if you can come up with something. One thing I stumbled across recently that may be of use: hashicorp/terraform#28309
Fair enough.
I suspect disk space isn't as big of a concern with modules, as those are mostly text (whereas providers are binaries in the tens of MBs). The time spent re-downloading (re-cloning) things is probably the bigger concern there. |
I'm not sure how useful this is as far as addressing the issue from within terragrunt, but figured I'd share in the sense that there are approaches users could take to address the issues within their own pipelines/workflows... It is complicated though, and also maybe abuses some implementation details of terraform. Definitely welcome the conversation and would appreciate any features within terragrunt that address these issues more directly! One thing I started doing is maintaining a single terraform config of all the providers and modules that are in use across the whole project. I call it the
Then in the "real" terragrunt and terraform configs, the module
Before running any terragrunt commands, we run (The provider versions in that |
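A sketch of what that single "boilerplate" config might look like; the provider, module names, and versions are made up for illustration. One `terraform init` against this config warms the plugin cache (and module cache) for everything else in the project:

```hcl
# boilerplate/main.tf — one config that pins every provider and module
# used anywhere in the project, so a single `terraform init` here
# downloads everything once into the shared caches.
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "5.31.0"
    }
  }
}

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.4.0"
}
```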
I took a quick look at the terraform code: https://github.com/hashicorp/terraform/blob/main/internal/providercache/installer.go

Briefly, step by step, how the terraform plugin cache works (comments from the installer source):

```go
// Step 1: Which providers might we need to fetch a new version of?
// This produces the subset of requirements we need to ask the provider
// source about. If we're in the normal (non-upgrade) mode then we'll
// just ask the source to confirm the continued existence of what
// was locked, or otherwise we'll find the newest version matching the
// configured version constraint.

// Step 2: Query the provider source for each of the providers we selected
// in the first step and select the latest available version that is
// in the set of acceptable versions.
//
// This produces a set of packages to install to our cache in the next step.

// Step 3: For each provider version we've decided we need to install,
// install its package into our target cache (possibly via the global cache).
```

So the locking idea might work, since Terraform first queries which versions exist in the registry, and then checks which exist in the cache. But there may be issues keeping the connection open: Terraform processes must wait on the private registry until a plugin is downloaded, so a timeout may occur when the user's Internet speed is low and the plugin is large.

@lorengordon suggested an interesting idea. Thanks @lorengordon! On the one hand, we don't need any private registries, which eliminates a huge number of issues we don't yet know about; on the other hand, we will have to implement logic that generates such a config on the fly and replaces the

By the way, I don't know if the symlink approach is workable on Windows OS at all. Should I check it, or does someone already know the answer? :)
Of course I can, but can you please confirm that this request is still relevant?
Sure, will do.
Ah, very interesting. I don't know yet whether this will be useful to us, but I will keep it in mind. Thanks.
Agreed. What's the decision?
|
While we see supporting a single version of a module as a bonus, if I had to support multiple versions, I would do it by changing the module label. That label is what maps to the path in the
and then referencing paths like so:
|
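To illustrate the label idea with a hypothetical example (module names, repo URL, and ref values are made up): each version gets its own label, so each maps to its own path under the modules directory.

```hcl
# Two versions of the same module, distinguished by label.
module "vpc_v1_2_0" {
  source = "git::https://example.com/modules.git//vpc?ref=v1.2.0"
}

module "vpc_v2_0_0" {
  source = "git::https://example.com/modules.git//vpc?ref=v2.0.0"
}
```

Since Terraform keys the on-disk module path by the label, the two versions never collide in the cache.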
One place I know of that my approach does fall over for modules though, is nested modules. If a |
Ah, right, this might work :) We need to weigh whether it's worth parsing all the configs to change the names of the modules and their sources, or accepting duplication of modules as a compromise.
Oh really, this idea won't work with nested terraform modules, since each terraform module creates its own |
I also support air-gapped environments. We only use modules that use For providers, we host an accessible provider mirror and use the |
Modules are tricky overall anyway, since the terraform network mirror only supports providers, not modules. You'd have to use something like the |
@lorengordon, yeah, you are right! I think we shouldn't bother so much with modules, since they usually only take up a few megabytes. I wouldn't touch them. @brikis98, here is a suggestion on how to resolve the duplication of providers, based on @lorengordon's idea.
This way, one terraform process downloads all the providers at the same time, eliminating the concurrency issue, and in the end we have a cache with all the necessary providers. |
One sticking point with that step, even for providers, is any config that uses a module with a remote source. Remote modules may have provider requirements also. "Parsing all tf configs" to figure out all the providers in use and their version constraints necessarily involves retrieving all remote modules. And so we're now reinventing a lot of the plumbing around

There may be an optimization available though, if the |
Ok, but if we also include the modules in the single config, it will also download providers of these modules, right? After that we can just remove these modules as garbage. |
Did you actually test this out and see a timeout issue? Or are you just guessing that it might be an issue?
AFAIK, symlinks work more or less as you'd expect on Win 10/11.
I'll create a separate comment shortly to summarize the options on the table and address this there.
For now, let's gather all ideas, and then decide which ones to test out, and in which order. Reducing provider downloads is definitely a higher priority than the module stuff, so that should be the first thing to focus on. |
Thanks for sharing this approach! Definitely a cool idea. As pointed out in subsequent comments, this approach doesn't quite seem to handle nested modules... And we have a lot of those. So it feels promising, but not quite complete. |
So far, only guesses. |
I could be wrong, but doesn't |
OK, let me summarize the ideas on the table so far:

Problem 1: Providers

Reducing bandwidth and disk space usage with providers is the highest priority and should be the thing we focus on first.

Idea 1: network mirror running on localhost

As described here:
There may be an issue here with timeouts related to step (4), so we'll have to test and see if this is workable.
|
Yes, it does. The "single config" option, using what I called the

It also retrieves all modules, including nested ones. The problem with nested modules is specifically those with remote sources. If the source is local within the nested module, no problem: the local relative path is fine. But a remote source will re-download the remote module when

However, one thing that just occurred to me to address that would be to pre-populate the |
Regarding pre-fetching provider binaries, what if we just have an opt-in configuration (like an environment variable named

If OpenTofu accepts the RFC, we can switch to an opt-out configuration, and rely on the public package to handle the logic for pre-fetching provider binaries in OpenTofu, while using our naive custom logic for Terraform. Would this handle your concerns regarding the potentially changing logic in the

@brikis98 For directories without |
@yhakbar, the concern is not only that the code we are interested in is located in the
@brikis98,
About the connection timeout concern: Terraform terminates connections if the registry does not respond within 10-15 seconds, and exits with the error

A workaround could be to (don't use a private registry at all) run

🤷‍♂️ Honestly, I don't see any other way than to implement the functionality of fetching providers and modules in Terragrunt itself, to ensure maximum performance and predictable behavior. |
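One concrete shape the "no registry at all" workaround could take (the mirror path is illustrative): pre-build a local filesystem mirror once with `terraform providers mirror <dir>` run from an aggregate config, then point Terraform at it, so `init` never opens a registry connection that could time out.

```hcl
# ~/.terraformrc — after running `terraform providers mirror /opt/tf-mirror`
# once, every subsequent init resolves providers from local disk.
provider_installation {
  filesystem_mirror {
    path    = "/opt/tf-mirror"
    include = ["registry.terraform.io/*/*"]
  }
}
```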
@levkohimins I think that only resolved the provider thing. There are many other tasks in this bug, so going to reopen. |
Joining the party, because we are after a solution to the problem explained here. Today I tested the latest Terragrunt ( Have you considered offering the cache server as a standalone service that I can spin up on instance boot and share among all processes? Thank you for working on this! |
@amontalban, That's true. Each Terragrunt instance runs its own cache server. We use file locking to prevent conflicts when multiple Terragrunt instances try to cache the same provider. What do you mean by
Thinking out loud: for the standalone server, we would need connections (like gRPC) between the Terragrunt instances and the Terragrunt Cache Server itself, so instances can be notified by the cache server when the cache is ready. @brikis98, I'd be interested in what you think about this. |
Hi @levkohimins! Some of the plans work and some don't on the same Atlantis PR, and I think it is because all threads (we have parallel configuration in Atlantis running up to 10 at the same time) are trying to lock/download providers at the same time. For example, a working one:
A non working one:
Another error:
And we have the following settings:
Let me know if you want me to open an issue for this. Thanks! |
Hi @amontalban, thanks for the detailed explanation.
Please create a new issue and indicate there the terraform version, your CLI Configuration, and also check if you are using any credentials. Thanks. |
Thanks, I will open an issue then. Regarding the Terragrunt Provider Cache being concurrency safe: I understand it is if it is used in a single

Thanks!
Thanks!
By safe concurrency I meant multiple Terragrunt processes running at the same time. |
@levkohimins is it possible to mount a volume and share cache between multiple Kubernetes pods? |
You can specify a different cache directory
Does that mean I will have problems if I do that? Should each job have its own cache?
The Terragrunt Provider Cache is concurrency safe, so you can run multiple Terragrunt processes with one shared cache directory. The only requirement is that the file system must support File locking. |
If anyone, like me, is looking to use this with AWS EFS: it should work, since EFS supports flock. |
Hi @brikis98 @levkohimins, while downloading the terraform source URLs, https:// is getting replaced by file:///, and the workflow is failing to download the module zips. It was working fine up to TG 0.55.13. Because of this issue, we are not able to use any of the recently delivered features. Can you please look into this as a priority?
Hey there! I have a question regarding how to handle multiple platforms with lock files in order to reduce disk & bandwidth usage. It seems to me that all the caching functionality only works for your own platform.
Hi @RaagithaGummadi, this issue is not related to this subject. If the issue still exists, please let me know there: #3141
Hi @tomaaron, Could you please describe in detail how you create lock files for multiple platforms in your workflow when you do not use Terragrunt Provider Cache feature? |
That's actually what I'm trying to figure out. So far I have unsuccessfully tried the following:
But this seems to download the providers over and over again. |
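For reference, the usual way to record hashes for additional platforms is `terraform providers lock`; the platforms below are examples. Note that it downloads each provider package once per platform to compute its hash, which matches the re-download behavior described above; whether those downloads can be served by the Terragrunt Provider Cache is the open question here.

```shell
# Record lock-file hashes for every platform the team uses.
terraform providers lock \
  -platform=linux_amd64 \
  -platform=darwin_amd64 \
  -platform=darwin_arm64
```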
Yeah it won't work. I'll look into what we can do to make this work through the Terragrunt Provider Cache. |
The provider cache logic is working great for us in

I wanted to quickly touch on an observation regarding modules. I understand the complexity regarding modules sourced from inside the tf code, but what about modules sourced in the

Let's say you are deploying many modules from the same repository (i.e., a central module repository your organization uses to manage all its IaC). This is the current folder structure generated by terragrunt after a
Notice the Since sourcing many modules from the same repo seems like the standard way to use terragrunt, this would be a big disk and network savings for anyone with more than a few modules. Note that we tried to do this on our own by having |
Hi @fullykubed, thanks for the feedback! Your observations are quite convincing and perhaps we could also optimize the modules. I will think it over. |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for raising this issue. |
@levkohimins I don't think this should be marked stale. |
I removed the |
The problem
Many of our customers struggle with TG eating up a ton of disk space and bandwidth: hundreds of gigabytes in some cases! I think this comes from a few sources (note: #2919 may help provide the data we need to understand this better):

- Every time you run `init`, Terraform downloads the providers for that module. This isn't a problem for a single module, but if you do `run-all` in a repo that has, say, 50 `terragrunt.hcl` files, each one runs `init` on a TF module, each module downloads an average of, say, 10 providers, then that's 50 * 10 = 500 provider downloads—even if it's the exact same 10 providers across all 50 modules!
- Every time you use a `source` URL in TG, it downloads the whole repo into a `.terragrunt-cache` folder. If you have 50 `terragrunt.hcl` files with `source` URLs, and do `run-all`, it will download the repos 50 times—even if all 50 repos are the same!
- We currently do a full clone for each `source` URL. We should consider doing a shallow clone, as that would be much faster/smaller.
- Every time you run `init`, Terraform downloads repos to a `.modules` folder. If you have 50 `terragrunt.hcl` files, each of which has a `source` URL pointing to TF code that contains, say, 10 modules, then when you do `run-all`, you'll end up downloading 50 * 10 = 500 repos—even if all the modules are exactly the same!

Goals

We should have the following goals:

- Each `git clone` should be as efficient as possible: e.g., use shallow clones.

The solution

The solution will need to be some mix of:

- Using Terraform's `provider_installation` block to achieve the goals in the previous section.
- Using the `depth=1` param to tell go-getter to do a shallow clone.

Notes