Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Partial/Progressive Configuration Changes #4149

Closed
apparentlymart opened this issue Dec 2, 2015 · 68 comments
Closed

Partial/Progressive Configuration Changes #4149

apparentlymart opened this issue Dec 2, 2015 · 68 comments
Labels
core enhancement proposal providers/protocol Potentially affecting the Providers Protocol and SDKs thinking

Comments

@apparentlymart
Copy link
Contributor

apparentlymart commented Dec 2, 2015

For a while now I've been wringing my hands over the issue of using computed resource properties in parts of the Terraform config that are needed during the refresh and apply phases, where the values are likely to not be known yet.

The two primary situations that I and others have run into are:

After a number of false-starts trying to find a way to make this work better in Terraform, I believe I've found a design that builds on concepts already present in Terraform, and that makes only small changes to the Terraform workflow. I arrived at this solution by "paving the cowpaths" after watching my coworkers and I work around the issue in various ways.


The crux of the proposal is to alter Terraform's workflow to support the idea of partial application, allowing Terraform to apply a complicated configuration over several passes and converging on the desired configuration. So from the user's perspective, it would look something like this:

$ terraform plan -out=tfplan
... (yada yada yada) ...

Terraform is not able to apply this configuration in a single step. The plan above
will partially apply the configuration, after which you should run "terraform plan"
again to plan the next set of changes to converge on the given configuration.

$ terraform apply tfplan
... (yada yada yada) ...

Terraform has only partially-applied the given configuration. To converge on
the final result, run "terraform plan" again to plan the next set of changes.

$ terraform plan -out=tfplan
... (yada yada yada) ...

$ terraform apply
... (yada yada yada) ...

Success! ....

For a particularly-complicated configuration there may be three or more apply/plan cycles, but eventually the configuration should converge.

terraform apply would also exit with a predictable exit status in the "partial success" case, so that Atlas can implement a smooth workflow where e.g. it could immediately plan the next step and repeat the sign-off/apply process as many times as necessary.

This workflow is intended to embrace the existing workaround of using the -target argument to force Terraform to apply only a subset of the config, but improve it by having Terraform itself detect the situation. Terraform can then calculate itself which resources to target to plan for the maximal subset of the graph that can be applied in a single action, rather than requiring the operator to figure this out.

By teaching Terraform to identify the problem and propose a solution itself, Terraform can guide new users through the application of trickier configurations, rather than requiring users to either have deep understanding of the configurations they are applying (so that they can target the appropriate resources to resolve the chicken-and-egg situation), or requiring infrastructures to be accompanied with elaborate documentation describing which resources to target in which order.

Implementation Details

The proposed implementation builds on the existing concept of "computed" values within interpolations, and introduces the new idea of a graph nodes being "deferred" during the plan phase.

Deferred Providers and Resources

A graph node is flagged as deferred if any value it needs for refresh or plan is flagged as "computed" after interpolation. For example:

  • A provider is deferred if any of its configuration block arguments are computed.
  • A resource is deferred if its count value is computed.

Most importantly though, a graph node is always deferred if any of its dependencies are deferred. "Deferred-ness" propagates transitively so that, for example, any resource that belongs to a deferred provider is itself deferred.

After the graph walk for planning, the set of all deferred nodes is included in the plan. A partial plan is therefore signaled by the deferred node set being non-empty.

Partial Application

When terraform apply is given a partial plan, it applies all of the diffs that are included in the plan and then prints a message to inform the user that it was partial before exiting with a non-successful status.

Aside from the different rendering in the UI, applying a partial plan proceeds and terminates just like an error occured on one of the resource operations: the state is updated to reflect what was applied, and then Terraform exits with a nonzero status.

Progressive Runs

No additional state is required to keep track of partial application between runs. Since the state is already resource-oriented, a subsequent refresh will apply to the subset of resources that have already been created, and then plan will find that several "new" resources are present in the configuration, which can be planned as normal. The new resources created by the partial application will cause the set of deferred nodes to shrink -- possibly to empty -- on the follow-up run.


Building on this Idea

The write-up above considers the specific use-cases of computed provider configurations and computed "count". In addition to these, this new concept enables or interacts with some other ideas:

  • Proposal: Generator Plugins for Configuration Generation #3310 proposed one design for supporting "iteration" -- or, more accurately, "fan out" -- to generate a set of resource instances based on data obtained elsewhere. This proposal enables a simpler model where foreach could iterate over arbitrary resource globs or collections within resource attributes, without introducing a new "generator" concept, by deferring the planning of the multiple resource instances until the collection has been computed.
  • Pre-refreshed resources #2976 proposed the idea of allowing certain resources to be refreshed immediately, before they've been created, to allow them to exist during the initial plan. Partial planning reduces the need for this, but supporting pre-refreshed resources would still be valuable to skip an iteration just to, for example, look up a Consul key to configure a provider.
  • Rolling-apply to instances, or individual instance application #2896 talks about rolling updates to sets of resources. This is not directly supported by the above, since it requires human intervention to describe the updates that are required, but the UX of running multiple plan/apply cycles to converge could be used for rolling updates too.
  • The cycles that result when mixing create_before_destroy with not, as documented in Document create_before_destroy limitation #2944, could get a better UX by adding some more cases where nodes are "deferred" such that the "destroy" node for the deposed resource can be deferred to a separate run from the "create" that deposed it.
  • Provider Aliases Interact Poorly with Reusable Modules #1819 considers allowing the provider attribute on resources to be interpolated. It's mainly concerned with interpolating from variables rather than resource attributes, but the partial plan idea allows interpolation to be supported more broadly without special exceptions like "only variables are allowed here", and so it may become easier to implement interpolation of provider.
  • Intermediate variables (OR: add interpolation support to input variables) #4084 requests "intermediate variables", where computed values can be given a symbolic name that can then be used in multiple places within the configuration. One way to support this would be to allow variable defaults to be interpolated and mark the variables themselves as "deferred" when their values are computed, though certainly other implementations are possible.
@apparentlymart
Copy link
Contributor Author

@phinze this proposal brings together several discussions we've had elsewhere and provides a new take on the "static vs. dynamic" problem that seems to be a common theme in Terraform supporting more complex use-cases.

I'd love to work on something in this vein early next year if you guys think this is a reasonable direction to take, but given the work involved I'd love to hear your thoughts before I get stuck in to implementation.

@ketzacoatl
Copy link
Contributor

I could see this being very helpful, and a nice way to simplify situations where we are sometimes otherwise a bit stuck with one or another limitation.

@phinze
Copy link
Contributor

phinze commented Dec 4, 2015

Thanks for this - as usual! - wonderfully clear write up, @apparentlymart. 😀

This solution does seem flexible and fairly straightforward conceptually. As I'm sure you'd agree, introducing the possibility of partial plans / applies would be a pretty central modification to Terraform's workflow, so it deserves careful scrutiny.

You have shown how partial progress could enable more complex configurations, but I'm worried about the cost to the general UX of the tool. The usage pattern would go from from two well-defined discrete steps to "just keep retrying until it goes through."

One of Terraform's most important qualities is in its ability to predict the actions it's going to take. This feature proposes to trade off Terraform's prediction guarantees to enable previously impossible configs, or in other words "enable Terraform to automate configurations it cannot predict (in one step)."

At this point, Partial Applies feels like a heavy hammer to wield before addressing some of the more specific pain points you mention like:

However these are just my initial thoughts - I'll keep thinking and we can keep discussing!

@phinze phinze added the thinking label Dec 4, 2015
@apparentlymart
Copy link
Contributor Author

Thanks for the feedback. It's certainly what I was expecting, and echoes my own reservations about the design.

My rationale for proposing it in spite of those reservations was to observe that users are already effectively doing this workflow to work around what I perceive to be some flaws in Terraform's "ideal" model of being able to operate in a single step.

Here are two workarounds that my team is doing regularly, and that I've seen recommended to others running into similar problems:

  • Use -target to force the creation of a particular resource before the others in order to break a chicken/egg problem around provider dependencies. This is a one-off resolution that we most often use to hack around a chicken/egg problem caused by an error during a Terraform run.
  • Creating multiple, entirely separate root configuration modules that need to be applied in a particular sequence to produce an infrastructure that Terraform would otherwise fail to implement in a single step. terraform_remote_state is used to string them together. In this case the workaround gets baked into the architecture because otherwise almost every run would require the above -target workaround.

Something like #2976 certainly reduces the cases where this is necessary, but not to zero. Consider for example the use-case of spinning up a Rundeck instance as an aws_instance, then using the rundeck provider to write jobs into it. On the first run, starting from nothing, this is fine: we can delay instantiating the rundeck provider until the aws_instance is complete. The problem arises any time the Rundeck instance is respun: now there is the risk that we get ourselves into the situation where the Rundeck server isn't running but Terraform still needs it to refresh the extant rundeck_project and rundeck_job resources in the state. This problem and various problems of this type have tripped me and my co-workers up frequently, each time forcing us to apply the workarounds I described above. Terraform could smooth this situation by noting that the rundeck provider depends on the EC2 instance that doesn't currently exist, and thus deferring the refresh/plan of those resources until the aws_instance has been recreated.

Thus I have concluded that "Terraform can plan everything" is a nice ideal, but it doesn't seem to stand up to reality. This proposal basically embraces the -target workaround and makes it a first-class workflow in order to get as close as possible to the ideal: while Terraform can't plan everything in a single step, you can rely on it to be explicit about what it will do every step of the way and guide the user towards convergence on the desired outcome.

I'd consider this a big improvement over the current situation, where quite honestly my coworkers lose confidence in Terraform every time I have to guide them through a manual workaround to a problem like I've described here, since it feels like hacking around the tool rather than working with it, and they (quite rightly) consider what would happen if they found themselves in such a situation in the middle of the night during an incident and were forced to create a workaround on the fly, working solo.

I think it'd be important to couple this architectual shift with the continued effort to have Terraform detect as many errors as possible at plan time, so that errors during terraform apply are rare and thus it is less likely that you would find yourself "stranded" in the middle of a multi-step application process. (#4170 starts to address this.)

I also think that it's important that Terraform describe the "I need multiple steps" situation very well after a plan, so that users can make an informed decision about whether to proceed with the multi-step process or whether to rework/simplify their configuration to not require it, if the "plan everything in one step" guarantee is important to them and they aren't willing to risk that step 2 might not be what they expected.

With all of that said, I'm just trying to directly address your initial feedback and hopefully aid your analysis; I totally agree that this is a significant change that requires scrutiny, and am happy to let this soak for a while.

@phinze
Copy link
Contributor

phinze commented Dec 5, 2015

This helps a lot, Martin, and it's very persuasive!

I agree that forcing users to -target their way out of a scenario is in no way acceptable and indeed erodes confidence in Terraform as a tool.

I have been silently wondering if Terraform should just encourage separation between "unplannable boundaries" into separate configs, but reading your description:

Creating multiple, entirely separate root configuration modules that need to be applied in a particular sequence to produce an infrastructure

It does seem perhaps too tedious and clunky to be a "recommended architecture".

I'll let this bounce around my brain over the weekend and follow up next week. 🍻

@mitchellh
Copy link
Contributor

This is a lot like something I've been bouncing around with @phinze for awhile. Good to see this to start being formalized a bit more. I actually called this something like "phased graphs" or something terrible, but partial apply makes more sense.

I think internally what we do is actually separate various graph nodes into "phases". Each phase on its own should be its own completely cycle free thing. It has various semantic checks such as: it can only depend on things from phase N or N-1, and it can't contain any cycles within it.

Detecting when a phase increase needs to happen uses the rules you outlined above: computed dependency from a provider, or a computed count.

Then for plan output we note the various phases that exist, and when you run it, we just increment the phase count, then you can plan the next phase, etc etc. Does this make sense?

@mitchellh
Copy link
Contributor

On second thought, we can probably make this a lot simpler: we detect when we have what I described above as "phase change" and just stop the plan/apply there, notifying the user that another run is required to continue forward. Once an apply is done then re-running it on the whole thing shouldn't affect things.

I can see the benefit of making this a more stateful operation but that is something we can do as an improvement later. The idea in the former paragraph can introduce partial applies while not having any huge UX change.

@ketzacoatl
Copy link
Contributor

I sort of use -target in an attempt to limit terraform to "phases". I recently noticed this would still try to rm / delete some resources if TF deemed necessary (even nodes outside the "target").

@apparentlymart
Copy link
Contributor Author

@mitchellh Thanks for that feedback. It's great to hear how you're thinking about this.

Considering your first comment about planning multiple phases, I believe it's not possible for Terraform to plan the full set of apply steps in all cases. Consider the scenario where the provider configuration is computed: we then can't refresh/diff any resources belonging to that provider. I think the best we could do is note in the output that we're deferring certain resources, but I expected that would make the output rather noisy... seems like a good thing to learn via some prototyping.

I think your second comment describes what I had originally tried to propose: go as far as possible, let the user know it's not complete, and then let them re-plan to move to the next step. For me, marking a node as "deferred" was what you called "phase change", assuming I understand correctly. In my approach there are only two "phases" for a given plan: "planned" or "deferred". At each "apply" the state is updated, without having to retain anything new in the state file, just like we'd do if there was an error during the apply phase.

I'm hoping to spend some time prototyping this in the new year. I think it will be easier to talk about the details of this with a strawman implementation to tear apart.

@mitchellh
Copy link
Contributor

@apparentlymart I don't think I was clear enough, but I think we're agreeing.

Only the current phase would be planned (contain a diff). The rest is a no-op, but may have to have some logic associated about it upstream, probably not downstream in the graph (in order to get the state to read things).

@IX-Erich
Copy link

From the peanut gallery:

I'm very relieved to find this thread, and just wanted to offer the perspective of a newbie user in love with this idea, but running into horrible chicken and egg roadblocks trying to spin up what should be a fairly simple AWS VPC.

At this point, after hours (actually days) of trying to manually pull out parts of configs and then put them back in to work around these chicken and egg scenarios that terraform is supposed to handling for me , terraform's state seems so confused to be moribund. Whereas before, I had a working environment, now I can't even figure out how to fix terraforms state to make it understand what's actually there - as it seems totally confused about that now, no doubt due to all my fiddling. I guess I just rip the entire thing out and start from scratch now? Will it even be ale to do that at this point?

If I were already a terraform expert - no doubt I'd recognize typical patterns, and know how to work around them as your remarkably named colleague, apparentlymart, can. But as a novice to your tech, I wouldn't even know where to begin. As he observed above, I am so addled by the experience, given how simple this environment is, that I would be terrified to even try to use this magic on anything remotely complex, and would never dream of suggesting some of my even less mart colleagues try.

The thought of making some minor change to my environment that, basically, completely borks it and leaves me not even knowing what the state is - is the stuff of nightmares, quite literally.

That does seem to be a problem. I am not as mart as apparentlymart, but it seems obvious that having a single pass in a system which must rely on future states for the data to support subsequent steps, is one of those magical ideas that looks great in theoretical language about phase state... but, basically, must fail in practice.

Again, I'm not remotely as mart as any of you, and that's why I'm trying to help you see that mere mortals have a hard time seeing how this reduces effort - quite to the contrary - I can manually make a VPC with a functioning vyetta node in a about fifteen minutes, and not be terrified that the smallest change will take my business offline.

Sorry for the long read and degree of frustration in tone - I have been at this, as I said, for days. And these tools are exciting! I'll be watching as it becomes something predictable and usable.

@IX-Erich
Copy link

Quick follow-up as an example - my most recent plan was hanging when applying trying to delete a network interface.. it would eventually time out and not report the details of the failure beyond that it timed out.

The problem ended up being that I had manually created an instance on that subnet, and so AWS was refusing to delete the subnet prior to deleting the dependent instance/interface.

  1. I doubt the AWS API timed out rather than returning the same error it did in the console when I manually tried to delete the subnet.

  2. I thought terraform was supposed to recognize manually induced dependencies against the plan and deal with them, no?

So - not a huge deal, obviously, and not a chicken and egg scenario - but an example of the sort of workflow detritus that gets really frustrating in repetition.

Again - I'm a huge fan, and don't expect a zero learning curve experience. Just offering the novice perspective. Thanks for the project and all your efforts!

@ketzacoatl
Copy link
Contributor

@erich-comsIO, there are a few assumption one should make, or allow Terraform to make (such as, let Terraform manage its space and interfere as little as possible), and you will run into these types of issues less often. It takes time, sharing, and sometimes bug squashing to improve that learning curve. I have run into similar issues and circular dependencies with template_file, subnets, security groups, etc. Part of this is also due to nuances with the provider, the AWS console smooths out the experience in many ways you do not realize until using Terraform.

Are you able to connect to #terraform-tool on freenode IRC? I would be happy to help you through some of these types of issues. The mailing list is also a great place for this discussion, this ticket will quickly spin out of control if we veer too far away from the central discussion.

@IX-Erich
Copy link

Thanks for the reply and offer of help. Very much appreciated. I’m about to jump into a meeting, but will start hanging out in the irc channel. I do get that I need to stay out of the way as much as possible and let it do its thing. But then I run into these circular dependency issues. I gather I need to start using —target to manage those things terraform seems unaware of or unable to track at AWS regarding these kinds of dependencies.

Departing this thread, and will ask my noob questions in new topics.

Thanks again!

On Dec 16, 2015, at 10:49 AM, ketzacoatl notifications@github.com wrote:

@erich-comsIO https://github.com/erich-comsIO, there are a few assumption one should make, or allow Terraform to make (such as, let Terraform manage its space and interfere as little as possible), and you will run into these types of issues less often. It takes time, sharing, and sometimes bug squashing to improve that learning curve. I have run into similar issues and circular dependencies with template_file, subnets, security groups, etc. Part of this is also due to nuances with the provider, the AWS console smooths out the experience in many ways you do not realize until using Terraform.

Are you able to connect to #terraform-tool on freenode IRC? I would be happy to help you through some of these types of issues. The mailing list is also a great place for this discussion, this ticket will quickly spin out of control if we veer too far away from the central discussion.


Reply to this email directly or view it on GitHub #4149 (comment).

@apparentlymart
Copy link
Contributor Author

A follow up idea, building on this proposal: Terraform will usually happily propagate computed values through the graph as things are created, which means the initial diff often has a whole lot of raw interpolations that don't get resolved until apply time. This also means that certain validations can't be done until apply time, since the value to validate isn't yet known.

With partial apply support it becomes possible to offer a -conservative flag (suggestions for better names are welcome) that will make Terraform defer any resource that has at least one computed attribute, so the config would then be applied across more steps but with the benefit that at every step it is possible to completely inspect all of the values, both against human intuition and against Terraform's built-in validation rules. This would complement the new capability offered by #4348 in allowing Terraform to do more extensive plan-time validation than is possible today.

I often "fake" -conservertive using -target when I'm applying a particularly-complex configuration, or one where extended downtime due to manual error recovery would be bothersome.

I wouldn't tackle this immediately when implementing this feature, but I think this is a further use-case for having the infrastructure to allow partial plan/apply.

@mitchellh
Copy link
Contributor

@apparentlymart Another good idea, but I'd make the same cautious viewpont as above: let's treat that as a separate idea once we do this one.

@bflad bflad added the providers/protocol Potentially affecting the Providers Protocol and SDKs label Dec 1, 2021
@crw
Copy link
Contributor

crw commented Jan 26, 2022

Should this issue be resolved, please check #2253 to see if it is resolved by the resolution to this issue.

@aequitas
Copy link

I think this issue is broader, especially regarding the behaviour of Terraform to automate away what needs to be included/excluded.

For OP:

This workflow is intended to embrace the existing workaround of using the -target argument to force Terraform to apply only a subset of the config, but improve it by having Terraform itself detect the situation. Terraform can then calculate itself which resources to target to plan for the maximal subset of the graph that can be applied in a single action, rather than requiring the operator to figure this out.

@apparentlymart
Copy link
Contributor Author

Hi all! It's been a long time.

I originally opened this issue quite some time before I joined HashiCorp to work on Terraform full-time, and although the underlying problem statement of this issue remains valid, the exact details I described here have become less relevant to modern Terraform over time and so it's been clear to us that we will need to take a fresh start at designing it, taking into account more recent changes to the way providers are developed, the better handling of unknown values in Terraform v0.12 and later, the introduction of data sources in the meantime, and various other situations that are clearer to the Terraform team today than they were to me as an external contributor back in 2015.

With that in mind, I've decided to close this issue and replace it with one that represents just the problem to be solved and not yet any specific solution to it. My hope is that we'll use that new issue to discuss the relevant constraints and challenges and eventually reach a new proposal that makes sense for Terraform as it exists today, which may or may not be similar to what I mocked up in this older issue.

The new issue is #30937. If you're interested in following along with or participating in that discussion, please move your issue subscription over to that issue instead. I'm going to lock this one just to avoid continued additions to this issue and thus a fragmented discussion.

Thanks for the discussion here so far! The history of this issue isn't going anywhere, so we'll still be able to take into account the existing feedback as we consider possible approaches to solve this problem.

@apparentlymart apparentlymart closed this as not planned Won't fix, can't repro, duplicate, stale Apr 26, 2022
@hashicorp hashicorp locked as resolved and limited conversation to collaborators Apr 26, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
core enhancement proposal providers/protocol Potentially affecting the Providers Protocol and SDKs thinking
Projects
None yet
Development

No branches or pull requests