Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

eks: Changing securityGroup deletes entire cluster #28584

Closed
Labels
@aws-cdk/aws-eks Related to Amazon Elastic Kubernetes Service bug This issue is a bug. effort/medium Medium work item – several days of effort p1

Comments

@gricey432
Copy link

Describe the bug

Adding a custom securityGroup to eks.Cluster causes a full replacement of the stateful k8s cluster and data loss.

Expected Behavior

securityGroup would be updated in place as it is through the console, SDK, and as indicated in cdk diff.

Current Behavior

The custom resource quietly created a new, empty EKS cluster and then deleted the existing cluster.

Reproduction Steps

Should be reproducible by

  1. Create a barebones eks.Cluster with no securityGroup
  2. Deploy
  3. Add a custom securityGroup to the cluster
  4. Deploy

Possible Solution

Looking at the onEventHandler logs for the custom resource, it looks like the issue is that changing the SG caused this update set

{
    "updates": {
        "replaceName": false,
        "replaceVpc": true,
        "updateAccess": false,
        "replaceRole": false,
        "updateVersion": false,
        "updateEncryption": false,
        "updateLogging": false
    }
}

Looks like the check for replaceVpc is very rough and way too trigger happy

https://github.com/aws/aws-cdk/blob/main/packages/%40aws-cdk/custom-resource-handlers/lib/aws-eks/cluster-resource-handler/cluster.ts#L330

In fact it looks like the API can replace subnets and security groups in place https://docs.aws.amazon.com/eks/latest/APIReference/API_UpdateClusterConfig.html

The eks.Cluster resource takes a vpc prop, I think the changing of that prop should be the only reason that replaceVpc needs to be true...

Additional Information/Context

No response

CDK CLI Version

2.113.0 (build ccd534a)

Framework Version

No response

Node.js Version

v18.17.1

OS

Windows 11

Language

TypeScript

Language Version

4.9.3

Other information

#25544 may have aided disaster recovery here

I would like to have a Stack Policy on this stack but since it's implemented as a custom resource, cloudformation can only tell me that the CR itself isn't going to be replaced, but can't stop / inform me that the CR is going to be destructive.

@gricey432 gricey432 added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Jan 5, 2024
@github-actions github-actions bot added the @aws-cdk/aws-eks Related to Amazon Elastic Kubernetes Service label Jan 5, 2024
@pahud pahud self-assigned this Jan 5, 2024
@pahud pahud added investigating This issue is being investigated and/or work is in progress to resolve the issue. and removed needs-triage This issue or PR still needs to be triaged. labels Jan 5, 2024
@pahud
Copy link
Contributor

pahud commented Jan 5, 2024

Yes I think it's possible to just update the cluster security groups without replacing the cluster. At this moment the cluster construct from aws-eks is using customer resource to create the cluster but if we look at SecurityGroupsIds, it's update with no Interruption.

Agree we should fix replaceVpc.

@pahud pahud added p1 effort/medium Medium work item – several days of effort labels Jan 5, 2024
@pahud pahud removed their assignment Jan 5, 2024
@pahud pahud removed the investigating This issue is being investigated and/or work is in progress to resolve the issue. label Jan 5, 2024
@gricey432
Copy link
Author

gricey432 commented Jan 8, 2024

I've created #28609 to capture the broader idea of removing this custom resource all together.

This smaller improvement could still be implemented though, given the scale of that other ticket.

Update, that's closed in favour of #24059

@mergify mergify bot closed this as completed in #30114 May 10, 2024
mergify bot pushed a commit that referenced this issue May 10, 2024
### Issue # (if applicable)

Closes #28584

### Reason for this change

To have in place updates for EKS clusters when subnets or SG values are changed.

### Description of changes

Removed `replaceVpc` logic and introduced `updateVpc` to track changes and errors to handle multiple updates in one go

### Description of how you validated changes

Have tested the changes by first deploying a cluster with below config: 
```ts
const vpc = ec2.Vpc.fromLookup(stack, 'Vpc', { isDefault: true });
new eks.Cluster(stack, 'Cluster', {
  vpc,
  ...getClusterVersionConfig(stack, eks.KubernetesVersion.V1_24),
  defaultCapacity: 0,
});
``` 
TestCase - 1 Update both subnets and Access at the same time
```ts
new eks.Cluster(stack, 'Cluster', {
  vpc,
  ...getClusterVersionConfig(stack, eks.KubernetesVersion.V1_29),
  defaultCapacity: 0,
  tags: {
    foo: 'bar',
  },
  endpointAccess: eks.EndpointAccess.PUBLIC,
  vpcSubnets: [{ subnetType: ec2.SubnetType.PUBLIC }],
});
```
Error below is thrown for Cluster custom resource -
```
{
    "errorType": "Error",
    "errorMessage": "Only one type of update - VpcConfigUpdate, LoggingUpdate or EndpointAccessUpdate can be allowed",
    "stack": [
        "Error: Only one type of update - VpcConfigUpdate, LoggingUpdate or EndpointAccessUpdate can be allowed",
        "    at Pi.onUpdate (/var/task/index.js:55:651127)",
        "    at Pi.onEvent (/var/task/index.js:55:647590)",
        "    at Runtime.yR [as handler] (/var/task/index.js:55:657995)",
        "    at Runtime.handleOnceNonStreaming (file:///var/runtime/index.mjs:1173:29)"
    ]
}
```

TestCase - 2 Update subnets to public 

```ts
new eks.Cluster(stack, 'Cluster', {
  vpc,
  ...getClusterVersionConfig(stack, eks.KubernetesVersion.V1_29),
  defaultCapacity: 0,
  vpcSubnets: [{ subnetType: ec2.SubnetType.PUBLIC }],
});
```
```
{
    "updates": {
        "replaceName": false,
        "updateVpc": true,
        "updateAccess": false,
        "replaceRole": false,
        "updateVersion": false,
        "updateEncryption": false,
        "updateLogging": false
    }
}
```

```
{
  clientName: 'EKSClient',
  commandName: 'UpdateClusterConfigCommand',
  input: {
    name: 'Cluster9EE0221C-0b6f58b0698348aea43866b93a62b2c9',
    resourcesVpcConfig: { subnetIds: [Array], securityGroupIds: [Array] }
  },
  output: {
    update: {
      createdAt: 2024-05-08T20:55:00.013Z,
      errors: [],
      id: '7d5cd243-5536-3f52-b5ca-4c6e6c044529',
      params: [Array],
      status: 'InProgress',
      type: 'VpcConfigUpdate'
    }
  },
  metadata: {}
}
``` 


### Checklist
- [x] My code adheres to the [CONTRIBUTING GUIDE](https://github.com/aws/aws-cdk/blob/main/CONTRIBUTING.md) and [DESIGN GUIDELINES](https://github.com/aws/aws-cdk/blob/main/docs/DESIGN_GUIDELINES.md)

----

*By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license*
Copy link

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment