
[BUG] - Upgrading an existing Nebari AWS environment to 2023.7.1 causes the cluster to be destroyed/recreated #1884

Closed
sblair-metrostar opened this issue Aug 3, 2023 · 5 comments
Labels
needs: discussion 💬 Needs discussion with the rest of the team needs: tests ✅ This contribution is missing tests type: bug 🐛 Something isn't working

Comments

@sblair-metrostar
Contributor

sblair-metrostar commented Aug 3, 2023

Describe the bug

Upgrading an existing AWS Nebari deployment from 2023.5.1 to 2023.7.1 resulted in Terraform destroying and recreating the network resources (VPC/subnets), along with everything attached to them, including the EKS cluster. This appears to have been caused by changes in a recent enhancement intended to permit the use of existing subnets.

module.network was renamed to module.network[0] without a corresponding state move, and since Terraform provides no automated handling for this kind of address change, the module was destroyed and recreated without warning.

Expected behavior

Option 1 (Preferred): module.network state records are migrated to the new address before the Terraform changes are applied. The moved block was added in Terraform 1.1, but Nebari is currently pinned to 1.0.5, so a Terraform upgrade would be required to use it.
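For illustration, a minimal sketch of what Option 1 could look like once Nebari's Terraform is bumped to >= 1.1. The stage path matches the workaround further down; the moved.tf file name and its placement are assumptions, not an existing part of Nebari:

```bash
# Sketch only: assumes Terraform >= 1.1 so the `moved` block is available.
# The moved.tf file name is illustrative.
cd stages/02-infrastructure/aws

# Record the address change so Terraform migrates state in place instead of
# destroying and recreating the network module (and everything attached to it).
cat > moved.tf <<'EOF'
moved {
  from = module.network
  to   = module.network[0]
}
EOF

# The plan should now show the network resources being moved, not replaced.
terraform plan
```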

Option 2: A check runs before changes are deployed and blocks the deployment in destructive scenarios such as this one.

OS and architecture in which you are running Nebari

Ubuntu Linux, x64

How to Reproduce the problem?

  1. Install Nebari 2023.5.1 in AWS
  2. Upgrade nebari CLI to 2023.7.1
  3. Run nebari upgrade -c nebari-config.yaml
  4. Run nebari render -c nebari-config.yaml
  5. Run nebari deploy -c nebari-config.yaml


Command output

No response

Versions and dependencies used.

Nebari: 2023.5.1 -> 2023.7.1
Kubectl: 1.25
Conda: 23.5.0

Compute environment

AWS

Integrations

No response

Anything else?

Workaround:

  1. nebari render ... to render the 2023.7.1 update changes
  2. pushd stages/02-infrastructure/aws
  3. Using Terraform 1.0.5, run terraform state mv module.network module.network[0]
  4. popd
  5. nebari deploy ... to deploy the 2023.7.1 upgrade (a consolidated command sketch follows below)
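
A consolidated sketch of the workaround above, assuming the config file is named nebari-config.yaml as in the reproduction steps:

```bash
# Consolidated workaround sketch; assumes nebari-config.yaml as in the
# reproduction steps above.
nebari render -c nebari-config.yaml

pushd stages/02-infrastructure/aws
# terraform init may be needed first if this rendered directory is fresh.
# Quoting the addresses avoids any shell globbing on the [0] index.
terraform state mv 'module.network' 'module.network[0]'
popd

nebari deploy -c nebari-config.yaml
```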
@sblair-metrostar sblair-metrostar added needs: triage 🚦 Someone needs to have a look at this issue and triage type: bug 🐛 Something isn't working labels Aug 3, 2023
@iameskild
Member

Thanks @sblair-metrostar for reporting this! For now, we have updated the release notes to reflect this breaking change.

@costrouc
Member

costrouc commented Aug 3, 2023

@sblair-metrostar regarding "Option 2: A check is made prior to deploying changes which would block deployment in destructive scenarios such as this."

I think this would be helpful to have regardless of what we end up doing, since, as you've said, a single change can have a huge effect. We should probably also establish a set of resources that must never be deleted, e.g.:

  • Kubernetes cluster (node groups are fair game)
  • persistent volumes

These checks would be easy enough to include within the terraform plan/apply logic. We would just inspect a plan, see what actions are going to be performed, and block the apply unless the user acknowledges the destruction or performs a backup, something along those lines.
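
A rough sketch of how such a plan inspection could work (hypothetical, not existing Nebari code; it assumes jq is available and the protected resource types listed are only examples):

```bash
# Hypothetical pre-apply guard, not existing Nebari code. Assumes jq is
# installed; the protected resource types below are examples only.
terraform plan -out=pending.tfplan

# Collect addresses of protected resources the plan would delete.
blocked=$(terraform show -json pending.tfplan | jq -r '
  .resource_changes[]
  | select(.change.actions | index("delete"))
  | select(.type == "aws_eks_cluster" or .type == "aws_efs_file_system")
  | .address')

if [ -n "$blocked" ]; then
  echo "Refusing to apply: this plan would delete protected resources:"
  echo "$blocked"
  exit 1
fi

terraform apply pending.tfplan
```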

@sblair-metrostar
Contributor Author


@costrouc Agreed, I think a destruction safety check would be tremendously valuable. I only referred to it as an option because, at least in my mind, it's the much harder one to implement in the general case; it may not be so bad if it's just a one-off check for known destructive PRs like this.

Unfortunately, just checking for it still leaves the user having to manually remediate whatever the situation is, or brace for a backup/restore, in order to complete the upgrade. Actually handling the state migration for a seamless upgrade, with copious warnings displayed to the user about what's happening, would be ideal.

Being able to generate a plan through the nebari CLI would also be nice so we can inspect the pending changes as part of a PR or something. nebari validate just checks the yaml schema, correct?

@pavithraes pavithraes added needs: discussion 💬 Needs discussion with the rest of the team and removed needs: triage 🚦 Someone needs to have a look at this issue and triage labels Aug 7, 2023
@pavithraes
Member

pavithraes commented Aug 14, 2023

Two additional actions to move ahead:

@iameskild iameskild added the needs: tests ✅ This contribution is missing tests label Aug 28, 2023
@Adam-D-Lewis
Member

Issue addressed by adding to release notes. I opened a new issue about critical resource protections - #2829

@github-project-automation github-project-automation bot moved this from New 🚦 to Done 💪🏾 in 🪴 Nebari Project Management Nov 5, 2024