[BUG] - Terraform provider version inconsistency within stages #2614

Open
viniciusdc opened this issue Aug 7, 2024 · 23 comments · May be fixed by #2862

Comments

@viniciusdc
Contributor

Describe the bug

We are not consistent with Terraform provider versions across the different deployment stages, and we should be. This discrepancy can lead to unpredictable behavior and potential issues during deployment. For example, on a recent AWS deployment, I noticed the following in the Terraform deployment logs:

Stage 01 -- Terraform State:

  • hashicorp/aws: v5.12.0

Stage 02:

  • hashicorp/aws: v5.33.0
  • hashicorp/local: v2.5.1
  • hashicorp/tls: v4.0.5

Stage 03:

  • hashicorp/aws: v5.61.0
  • hashicorp/kubernetes: v2.20.0
  • hashicorp/helm: v2.1.2

While we do pin the provider version for the most important infrastructure resources:

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "5.33.0"
    }
  }
  required_version = ">= 1.0"
}

the other stages use terraform.Provider to instantiate the providers across the deployment:

elif self.config.provider == schema.ProviderEnum.aws:
    return [
        terraform.Provider(
            "aws", region=self.config.amazon_web_services.region
        ),
    ]

We should make sure this becomes consistent. Also, the interesting thing is that after stage 3 the versions become consistent across all calls; I guess that comes from the backend already being set up.

Expected behavior

At a minimum, the cloud provider versions should respect the versions declared in their infrastructure modules, as that would be expected.

OS and architecture in which you are running Nebari

Linux

How to Reproduce the problem?

Any cloud provider deployment might lead to the same problem.

Command output

No response

Versions and dependencies used.

No response

Compute environment

AWS

Integrations

No response

Anything else?

No response

@viniciusdc added the type: bug 🐛 and needs: triage 🚦 labels on Aug 7, 2024
@Adam-D-Lewis
Member

I think we could add a consistent version in tf_objects.py instead of in the terraform files directly. That would enforce the same version in all of the stages.

@Adam-D-Lewis removed the needs: triage 🚦 label on Aug 20, 2024
@Adam-D-Lewis
Member

I'm also curious what issues you've seen from this. It seems like it shouldn't cause a problem to use different provider versions in different stages.

@viniciusdc
Contributor Author

viniciusdc commented Aug 20, 2024

I'm also curious what issues you've seen from this. It seems like it shouldn't cause a problem to use different provider versions in different stages.

I haven't noticed any issue directly, but keeping this inconsistency leaves room for bugs that would be difficult to track down; for example, a specific version of the provider might handle certain API requests in a particular order while a newer version does not (same with error messages), or have different internal requirements like region, zones, etc.

@Adam-D-Lewis
Member

Adam-D-Lewis commented Aug 20, 2024

It seems like, because the stages are isolated from each other (isolated Terraform modules), differing provider versions should be okay. That said, I think we should try to keep the versions consistent between the stages, but I'm not sure I would support enforcing it (e.g. for plugins), at least not until we see an issue.

@marcelovilla
Member

@smokestacklightnin will be picking up this issue

@marcelovilla added the good first issue label and removed the needs: follow-up 📫 label on Nov 1, 2024
@marcelovilla
Member

@smokestacklightnin, here's a bit more context to help bring you up to speed on this issue.

Nebari has several Terraform stages, which are run sequentially because some require the output of others as input. For each stage, we have one or multiple versions.tf files where Terraform provider versions are defined. For example: https://github.com/nebari-dev/nebari/blob/9b1310b33e89c2c11c3b39128ec792ca80342486/src/_nebari/stages/infrastructure/template/aws/versions.tf

It also seems we are setting provider versions in other files, for example:

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "5.12.0"
    }
  }
  required_version = ">= 1.0"
}

We need to ensure consistency across all stages by using the same provider versions. For now, I suggest we avoid updating to the latest available versions and instead stick to the most up-to-date versions among the ones we’re currently using.

@smokestacklightnin
Contributor

smokestacklightnin commented Nov 4, 2024

It seems to me that this Issue could be solved by implementing #2814

@smokestacklightnin
Contributor

I think we could add a consistent version in tf_objects.py instead of in the terraform files directly. That would enforce the same version in all of the stages.

This is also a good alternative to #2814.

@marcelovilla
Member

It seems to me that this Issue could be solved by implementing #2814

That's right. However, implementing #2814 means that all providers will get updated, and while that might be the end goal, it might also break things. Thus, in my opinion, a safer approach is to make sure providers are consistent using the versions we already have and then explore automatic upgrades to the providers.

This is also a good alternative to #2814.

I think it is indeed a good alternative but it would probably not work well with tools for automatic upgrades like Renovate.

@smokestacklightnin
Contributor

However, implementing #2814 means that all providers will get updated, and while that might be the end goal, it might also break things.

I'm looking for a way with renovate to pin versions of dependencies

@smokestacklightnin
Contributor

smokestacklightnin commented Nov 4, 2024

However, implementing #2814 means that all providers will get updated, and while that might be the end goal, it might also break things.

I'm looking for a way with renovate to pin versions of dependencies

It appears that it is possible to pin provider versions.

@dcmcand
Contributor

dcmcand commented Nov 6, 2024

@smokestacklightnin I think it would be best to actually pin the provider versions in constants.py since we are generating the TF from templates. That way we have one single source of truth for what provider we are running.
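As a rough sketch of what that single source of truth could look like (the constant names below are hypothetical, not existing Nebari constants; the pins are simply the highest versions already mentioned in this thread):

# Hypothetical additions to constants.py -- names are illustrative only;
# pins taken from provider versions already mentioned in this thread.
AWS_PROVIDER_VERSION = "5.61.0"         # hashicorp/aws
KUBERNETES_PROVIDER_VERSION = "2.20.0"  # hashicorp/kubernetes
HELM_PROVIDER_VERSION = "2.1.2"         # hashicorp/helm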

We talked about this approach in our Nebari meeting this week and there was consensus that this was a good way to go here.

As far as #2814, this would mean that we would need to render the terraform and then run any scanner against the rendered files, but I think that is ok.

@marcelovilla
Member

@smokestacklightnin: before looking into Renovate, can you look into the suggestion from above? As Chuck mentions, it would be good to have a place where we can centralize the versions instead of having them scattered around the stages.

We need to review our rendering logic to make sure we can render the versions.tf files from a template, taking the provider versions from a common place such as constants.py. So before working on #2814, I'd suggest you work on this issue first and get acquainted with the logic behind how Nebari uses the Terraform files to render the actual configuration for a deployment, and how that's used in every stage of the deployment.

@smokestacklightnin
Contributor

We need to review our rendering logic to make sure we can render the versions.tf files from a template, taking the provider versions from a common place such as constants.py.

Both OpenTofu and Terraform require that the provider version constraint be a literal string and not, for example, a variable. This makes it cumbersome to specify the version constraints dynamically in .tf files.

I think it would be best to actually pin the provider versions in constants.py since we are generating the TF from templates. That way we have one single source of truth for what provider we are running.

Since we can't fill in the templates using native Terraform variables like the other fields that are present, we would need to template our templates, which doesn't sit well with me. Maybe others disagree.

Since Nebari basically pulls the template files directly from their respective template directories, it seems to me that keeping the versions in the templates up to date (or pinned as desired) is the most direct way to keep the versions synchronized. Renovate would accomplish this.

it would be good to have a place where we can centralize the versions instead of having them scattered around the stages.

Renovate has a configuration file where versions are specified.

As far as #2814, this would mean that we would need to render the terraform and then run any scanner against the rendered files, but I think that is ok.

As it currently stands, the only way to update Terraform versions is to do so manually, either in the templates or in the rendered files. Updates in the latter option get overwritten every time nebari render is run.

@viniciusdc
Contributor Author

viniciusdc commented Nov 8, 2024

Hi @smokestacklightnin, thanks for the feedback and the ideas above. I have some overall points at the end, but first I wanted to provide some context regarding the templates and the render process.

So, as you correctly mentioned above, Nebari pulls all Terraform files from their respective stage directories during render. As you also mentioned, any manual change to the rendered files will be lost on the next nebari render or nebari deploy (unless --disable-render is passed). However, as part of our internal Python wrapper for rendering the Terraform stage files:

def tf_objects(self) -> List[Dict]:
    return [NebariTerraformState(self.name, self.config)]

def render(self) -> Dict[pathlib.Path, str]:
    contents = {
        (self.stage_prefix / "_nebari.tf.json"): terraform.tf_render_objects(
            self.tf_objects()
        )
    }
    for root, dirs, filenames in os.walk(self.template_directory):
        for filename in filenames:
            root_filename = pathlib.Path(root) / filename
            with root_filename.open("rb") as f:
                contents[
                    pathlib.Path(
                        self.stage_prefix,
                        pathlib.Path.relative_to(
                            pathlib.Path(root_filename), self.template_directory
                        ),
                    )
                ] = f.read()
    return contents

we also inject a _nebari.tf.json file that will be read by Terraform at runtime; this file contains certain information, such as the backend configuration for cloud deployments (since each stage references the same backend):

{
    "provider": {
        "kubernetes": {
            "host": "${data.aws_eks_cluster.default.endpoint}",
            "cluster_ca_certificate": "${base64decode(data.aws_eks_cluster.default.certificate_authority[0].data)}",
            "token": "${data.aws_eks_cluster_auth.default.token}"
        },
        "helm": {
            "kubernetes": {
                "host": "${data.aws_eks_cluster.default.endpoint}",
                "cluster_ca_certificate": "${base64decode(data.aws_eks_cluster.default.certificate_authority[0].data)}",
                "token": "${data.aws_eks_cluster_auth.default.token}"
            }
        },
        "aws": {
            "region": "us-east-1"
        }
    },
    "terraform": {
        "backend": {
            "s3": {
                "bucket": "************",
                "key": "terraform/******/03-kubernetes-initialize.tfstate",
                "region": "us-east-1",
                "encrypt": true,
                "dynamodb_table": "********"
            }
        }
    }
}

That said, there is already one tf_object that we are not using, which is RequiredProvider:

def RequiredProvider(_name, **kwargs):
    return {"terraform": {"required_providers": {_name: kwargs}}}

If we pass this object to all stages as a tf_object during the render above, we could correctly pin all the provider versions and control the versions we pass to it from the current constants.py.
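A minimal sketch of what that wiring could look like (illustrative only: AWS_PROVIDER_VERSION is a hypothetical constant, and calling the helper as terraform.RequiredProvider assumes it lives in the same module as terraform.Provider):

# Sketch only -- not existing Nebari code. Assumes a hypothetical AWS_PROVIDER_VERSION
# constant in constants.py, and that RequiredProvider sits alongside Provider and
# NebariTerraformState as used elsewhere in this thread.
def tf_objects(self) -> List[Dict]:
    return [
        terraform.RequiredProvider(
            "aws", source="hashicorp/aws", version=AWS_PROVIDER_VERSION
        ),
        terraform.Provider("aws", region=self.config.amazon_web_services.region),
        NebariTerraformState(self.name, self.config),
    ]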

As far as #2814, this would mean that we would need to render the terraform and then run any scanner against the rendered files, but I think that is ok.

This means we would run the scans against the already rendered files as part of a CI pipeline.

On the other hand, I am interested in this:

Renovate has a configuration file where versions are specified.

Curious to see what the Renovate version file would look like.

@smokestacklightnin
Contributor

smokestacklightnin commented Nov 9, 2024

Curious to see what the Renovate version file would look like.

I have a minimal (almost complete) working proof of concept here. The version file is complete as a proof of concept.

@smokestacklightnin
Contributor

smokestacklightnin commented Nov 10, 2024

From the perspective of developing the Terraform templates, it's an awkward and somewhat confusing workflow to be specifying which required providers are needed for each deployment platform (aws, gcp, etc.) when writing the Terraform injection Python code, as opposed to the more traditional Terraform workflow of specifying the required providers in a versions.tf or main.tf.

In other words, it seems awkward to remove versions.tf and then scatter the information it contained (which required providers per deployment platform) across the injection logic and constants.py. If there was a more consolidated way to do this, then I would find it less awkward.

It's also worth mentioning that, since the providers are not used uniformly across all stages, if we dynamically generate versions.tf we would also need to keep track of which stages use which providers. This is in contrast to, for example, the Terraform backend, which is basically the same across all stages. To be frank, I would like it if there were a simple way to do this, but because the providers are not the same across all stages, this seems like unnecessary added complexity versus using a tool like Renovate.
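As a rough illustration of the extra bookkeeping this would imply (the stage directory names here are approximate; the provider groupings are the ones listed in the issue description):

# Rough illustration only -- stage names are approximate, groupings taken from the
# provider versions listed in the issue description.
STAGE_REQUIRED_PROVIDERS = {
    "01-terraform-state": ["aws"],
    "02-infrastructure": ["aws", "local", "tls"],
    "03-kubernetes-initialize": ["aws", "kubernetes", "helm"],
}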

@marcelovilla
Member

That said, there is already one tf_object that we are not using, which is RequiredProvider:

@viniciusdc it seems we might be using terraform.Provider for that, though. One example is the one you wrote in the issue description and another one is:

def tf_objects(self) -> List[Dict]:
    if self.config.provider == schema.ProviderEnum.gcp:
        return [
            terraform.Provider(
                "google",
                project=self.config.google_cloud_platform.project,
                region=self.config.google_cloud_platform.region,
            ),
            NebariTerraformState(self.name, self.config),
        ]
    elif self.config.provider == schema.ProviderEnum.do:
        return [
            NebariTerraformState(self.name, self.config),
        ]
    elif self.config.provider == schema.ProviderEnum.azure:
        return [
            NebariTerraformState(self.name, self.config),
        ]
    elif self.config.provider == schema.ProviderEnum.aws:
        return [
            terraform.Provider(
                "aws", region=self.config.amazon_web_services.region
            ),
            NebariTerraformState(self.name, self.config),
        ]
    else:
        return []

Do you know what the difference is between specifying the providers here and having them in a versions.tf file? It seems we're currently doing both, at least for the terraform_state and the infrastructure stages. I'm unsure what defining the provider in the tf_objects file is actually accomplishing.
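For reference, based on the two helpers quoted in this thread, the rendered JSON they produce seems to differ roughly as follows (the exact output of terraform.Provider is an assumption inferred from the RequiredProvider definition above):

# Rough contrast, inferred from the helpers quoted in this thread; the Provider
# output shape is an assumption, not confirmed against the Nebari source.
provider_block = {  # terraform.Provider("aws", region="us-east-1"): configures the provider, no version pin
    "provider": {"aws": {"region": "us-east-1"}}
}
required_providers_block = {  # RequiredProvider("aws", ...): constrains which provider version gets installed
    "terraform": {
        "required_providers": {"aws": {"source": "hashicorp/aws", "version": "5.33.0"}}
    }
}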


From reading the comments in this issue, it seems to me we have two different options:

  1. Centralize the Terraform provider versions in the constants.py file and make sure we use them to render the relevant Terraform files for each stage.
  2. Use Renovate to pin the maximum allowed versions and run that against the versions.tf files we already have.

I agree with @smokestacklightnin that the second option seems like a simpler approach. If we're already planning to use Renovate to manage other dependencies, I would favor it over the first option, especially since it seems we're not actually rendering the versions.tf files at the moment.

Happy to read other thoughts.

@smokestacklightnin
Contributor

smokestacklightnin commented Nov 18, 2024

I've gotten working proofs of concept for a path forward that uses Renovate and one that doesn't. There are advantages and disadvantages to both choices. I'd like to hear others' opinions about whether one path is preferable to the other.

Path 1: Not using Renovate

I have a branch that demonstrates injecting Terraform version constraints and Terraform required providers into the Terraform configuration for the terraform_state and kubernetes_initialize stages using tf_objects. To implement this for the whole repository, we would need to go through every versions.tf or main.tf in the subdirectories of nebari/src/_nebari/stages and move the versioning information to constants.py.

Advantages of this path:

  • Uses already present native Nebari tooling for Terraform injections

  • We have full control over the implementation for tracking versions, as opposed to an external tool.

Disadvantages of this path:

  • It only applies to the Terraform versions

  • It moves the terraform blocks out of Terraform .tf files to a nonstandard location. This is awkward from a Terraform development perspective, but makes sense to do in the context of Python.

  • If we choose this path and change our mind in the future, it would take work to get back to the previous required providers layout.

Path 2: Using Renovate

I have a branch that provides a minimal configuration.

One of the concerns @viniciusdc raised was that Renovate might overwhelm the GitHub CI with too much activity. Renovate has rules for limiting CI activity specifically for this purpose, so I believe this concern can be addressed and mitigated if we go with this path.

Advantages of this path:

  • Works for tracking versions of GitHub Actions and Python packages in addition to Terraform providers.

  • One Renovate instance can, if desired, maintain versions for multiple repositories, which would allow synchronizing all Nebari repos.

Disadvantages of this path:

  • External tool

  • Yet another tool to maintain

I like both paths, so I'm hoping to get others' opinions on how to proceed.

@viniciusdc
Contributor Author

viniciusdc commented Nov 20, 2024

Thanks for the comprehensive overview and the POCs, @smokestacklightnin. I’m leaning towards Path 1: Not using Renovate.

The main goal here is to ensure all providers use the same version across all Nebari stages. Right now, we have inconsistencies, like different versions of hashicorp/aws being used:

  • v5.12.0
  • v5.33.0
  • v5.61.0

While this has yet to cause major issues, it could lead to problems. Path 1 addresses this by injecting Terraform version constraints and required providers directly into our rendered config. We're not changing much from the current setup, since the information that currently lives in versions.tf is preserved, just in a different format.

This approach simplifies things by moving the provider constraints into constants.py. It aligns well with our existing constants like HIGHEST_SUPPORTED_KUBERNETES_VERSION and others, reducing the need to manage multiple versions.tf files. You can see how it integrates here:

HELM_VERSION = "v3.15.3"
KUSTOMIZE_VERSION = "5.4.3"
OPENTOFU_VERSION = "1.8.3"

Regarding the concern about reverting to the old layout later:

If we choose this path and change our mind in the future, it would take work to get back to the previous required providers layout.

If we decide to use Renovate in the future, we can apply it to the rendered templates in _nebari.tf.json, which already contain the necessary data in a different format. Plus, Renovate’s default settings can be adjusted to handle these files:

{
  "commitMessageTopic": "Terraform {{depName}}",
  "fileMatch": [
    "\\.tf$",
    "\\.tf.json$"
  ],
  "pinDigests": false
}

c.c. @dcmcand @marcelovilla @Adam-D-Lewis

@marcelovilla
Member

@smokestacklightnin thank you for outlining these two alternatives!

While I was leaning towards path 2 in the beginning, I think @viniciusdc makes a good point about centralizing the versions in the constants.py file being consistent with the way we're specifying other dependency versions, such as different image tags and binaries that we use. Furthermore, I think we can expand on this approach in the future to also include other versions hard-coded in the Terraform files, such as Helm chart versions. For example:

resource "helm_release" "jupyterhub" {
name = "jupyterhub-${var.namespace}"
namespace = var.namespace
repository = "https://jupyterhub.github.io/helm-chart/"
chart = "jupyterhub"
version = "4.0.0-0.dev.git.6707.h109668fd"

Moving forward with path 1 does not necessarily mean we won't be using Renovate. As you already mentioned, it also tracks other versions, such as GitHub Actions and Python dependencies. Additionally, as @viniciusdc points out, there's the possibility of running Renovate against rendered Terraform files.

Given that you already have a PoC for path 1 (not using Renovate) and it seems you understand what needs to be done to get it fully working, I suggest we move forward with that.

@smokestacklightnin
Contributor

Unless @dcmcand or @Adam-D-Lewis have any opinions in favor of using Renovate, I'll proceed with path 1: not using Renovate

@dcmcand
Contributor

dcmcand commented Nov 22, 2024

Sounds like a good plan @smokestacklightnin, thanks!
