
[BUG] - Upgrading an existing Nebari AWS environment to 2023.7.1 causes the cluster to be destroyed/recreated #1884

Closed
sblair-metrostar opened this issue Aug 3, 2023 · 5 comments
Labels
needs: discussion 💬 Needs discussion with the rest of the team needs: tests ✅ This contribution is missing tests type: bug 🐛 Something isn't working

Comments

@sblair-metrostar
Contributor

sblair-metrostar commented Aug 3, 2023

Describe the bug

Upgrading an existing AWS Nebari deployment from 2023.5.1 to 2023.7.1 resulted in Terraform destroying and recreating the network resources (VPC/subnets), along with everything attached to them, including the EKS cluster. This appears to have been caused by changes in a recent enhancement intended to permit the use of existing subnets.

module.network was renamed to module.network[0] without a corresponding state move, and since Terraform provides no automated handling for this kind of address change, the module was destroyed and recreated without warning.

Expected behavior

Option 1 (Preferred): module.network state records are migrated to the new address before the Terraform changes are applied. The moved block was added in Terraform 1.1, but Nebari is currently pinned to 1.0.5, so a Terraform upgrade would be required to use it.
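For illustration, a minimal sketch of what Option 1 could look like once Nebari's Terraform is bumped to >= 1.1. The stage path matches the workaround further down; the moved.tf file name and its placement are assumptions, not an existing part of Nebari:

```bash
# Sketch only: assumes Terraform >= 1.1 so the `moved` block is available.
# The moved.tf file name is illustrative.
cd stages/02-infrastructure/aws

# Record the address change so Terraform migrates state in place instead of
# destroying and recreating the network module (and everything attached to it).
cat > moved.tf <<'EOF'
moved {
  from = module.network
  to   = module.network[0]
}
EOF

# The plan should now show the network resources being moved, not replaced.
terraform plan
```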

Option 2: A check runs before changes are deployed and blocks the deployment in destructive scenarios such as this one.

OS and architecture in which you are running Nebari

Ubuntu Linux, x64

How to Reproduce the problem?

  1. Install Nebari 2023.5.1 in AWS
  2. Upgrade nebari CLI to 2023.7.1
  3. Run nebari upgrade -c nebari-config.yaml
  4. Run nebari render -c nebari-config.yaml
  5. Run nebari deploy -c nebari-config.yaml


Command output

No response

Versions and dependencies used.

Nebari: 2023.5.1 -> 2023.7.1
Kubectl: 1.25
Conda: 23.5.0

Compute environment

AWS

Integrations

No response

Anything else?

Workaround:

  1. nebari render ... to render the 2023.7.1 update changes
  2. pushd stages/02-infrastructure/aws
  3. Using Terraform 1.0.5, run terraform state mv module.network module.network[0]
  4. popd
  5. nebari deploy ... to deploy the 2023.7.1 upgrade (a consolidated command sketch follows below)
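
A consolidated sketch of the workaround above, assuming the config file is named nebari-config.yaml as in the reproduction steps:

```bash
# Consolidated workaround sketch; assumes nebari-config.yaml as in the
# reproduction steps above.
nebari render -c nebari-config.yaml

pushd stages/02-infrastructure/aws
# terraform init may be needed first if this rendered directory is fresh.
# Quoting the addresses avoids any shell globbing on the [0] index.
terraform state mv 'module.network' 'module.network[0]'
popd

nebari deploy -c nebari-config.yaml
```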
@sblair-metrostar sblair-metrostar added needs: triage 🚦 Someone needs to have a look at this issue and triage type: bug 🐛 Something isn't working labels Aug 3, 2023
@iameskild
Member

Thanks @sblair-metrostar for reporting this! For now, we have updated the release notes to reflect this breaking change.

@costrouc
Member

costrouc commented Aug 3, 2023

@sblair-metrostar regarding "Option 2: A check is made prior to deploying changes which would block deployment in destructive scenarios such as this."

I think this would be helpful to have regardless of what we end up doing, since, as you've said, a single change can have a huge effect. We should probably also establish a set of resources that must never be deleted, e.g.:

  • Kubernetes cluster (node groups are fair game)
  • persistent volumes

These checks would be easy enough to include within the terraform plan/apply logic. We would just inspect a plan, see what actions are going to be performed, and block the apply unless the user acknowledges the destruction or performs a backup, something along those lines.
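
A rough sketch of how such a plan inspection could work (hypothetical, not existing Nebari code; it assumes jq is available and the protected resource types listed are only examples):

```bash
# Hypothetical pre-apply guard, not existing Nebari code. Assumes jq is
# installed; the protected resource types below are examples only.
terraform plan -out=pending.tfplan

# Collect addresses of protected resources the plan would delete.
blocked=$(terraform show -json pending.tfplan | jq -r '
  .resource_changes[]
  | select(.change.actions | index("delete"))
  | select(.type == "aws_eks_cluster" or .type == "aws_efs_file_system")
  | .address')

if [ -n "$blocked" ]; then
  echo "Refusing to apply: this plan would delete protected resources:"
  echo "$blocked"
  exit 1
fi

terraform apply pending.tfplan
```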

@sblair-metrostar
Contributor Author


@costrouc Agreed, I think a destruction safety check would be tremendously valuable. I only referred to it as an option because, at least in my mind, it's the much harder one to implement in the general case; it may not be so bad if it's just a one-off check for known destructive PRs like this.

Unfortunately, just checking for it still leaves the user having to manually remediate whatever the situation is, or brace for a backup/restore, in order to complete the upgrade. Actually handling the state migration for a seamless upgrade, with copious warnings displayed to the user about what's happening, would be ideal.

Being able to generate a plan through the nebari CLI would also be nice so we can inspect the pending changes as part of a PR or something. nebari validate just checks the yaml schema, correct?

@pavithraes pavithraes added needs: discussion 💬 Needs discussion with the rest of the team and removed needs: triage 🚦 Someone needs to have a look at this issue and triage labels Aug 7, 2023
@pavithraes
Member

pavithraes commented Aug 14, 2023

Two additional actions to move ahead:

@iameskild iameskild added the needs: tests ✅ This contribution is missing tests label Aug 28, 2023
@Adam-D-Lewis
Member

Issue addressed by adding to release notes. I opened a new issue about critical resource protections - #2829

@github-project-automation github-project-automation bot moved this from New 🚦 to Done 💪🏾 in 🪴 Nebari Project Management Nov 5, 2024