
Add config option to enable the encryption of AWS EKS secrets #2788

Merged
20 commits merged into nebari-dev:main on Nov 5, 2024

Conversation

joneszc
Contributor

@joneszc joneszc commented Oct 22, 2024

Reference Issues or PRs

Fixes #2681
Fixes #2746
Modifies PR#2723 (Failing Tests / Pytest)
Modifies PR#2752 (Failing Tests / Pytest)

What does this implement/fix?

Put an x in the boxes that apply

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds a feature)
  • Breaking change (fix or feature that would cause existing features not to work as expected)
  • Documentation Update
  • Code style update (formatting, renaming)
  • Refactoring (no functional changes, no API changes)
  • Build related changes
  • Other (please describe):

Testing

  • Did you test the pull request locally?
  • Did you add new tests?

How to test this PR?

Any other comments?

Allows the user to enable encryption of EKS secrets by specifying a KMS key ARN in nebari-config.yaml:

amazon_web_services:
  eks_kms_arn: 'arn:aws:kms:us-east-1:010101010:key/3xxxxxxx-xxxxx-xxxxx-xxxxx'

The KMS key must meet the following conditions (a boto3 sketch of these checks follows the list):

  • Symmetric
  • Can encrypt and decrypt data
  • Created in the same AWS Region as the cluster
  • If the KMS key was created in a different account, the IAM principal must have access to the KMS key.
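To make these conditions concrete, here is a minimal boto3 sketch of how they can be checked up front (the ARN and region are placeholders, and the cross-account condition is a matter of key policy that is not covered here; the actual validation Nebari performs lives in this PR's validator):

import boto3

def kms_key_meets_eks_requirements(kms_arn: str, region: str) -> bool:
    """Check the symmetric / encrypt-decrypt / same-region conditions listed above."""
    client = boto3.session.Session(region_name=region).client("kms")
    meta = client.describe_key(KeyId=kms_arn)["KeyMetadata"]
    return (
        meta["Enabled"]
        and meta["KeySpec"] == "SYMMETRIC_DEFAULT"  # symmetric key
        and meta["KeyUsage"] == "ENCRYPT_DECRYPT"   # can encrypt and decrypt data
        and kms_arn.split(":")[3] == region         # created in the cluster's region
    )

# Example call with a placeholder ARN:
# kms_key_meets_eks_requirements("arn:aws:kms:us-east-1:010101010:key/3xxxxxxx-xxxxx-xxxxx-xxxxx", "us-east-1")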

@viniciusdc
Contributor

@joneszc, there are two PRs which seem to add the same thing, this one and #2752 -- I assume the first one was the original; can you close this one? (or move any relevant changes back to the other PR?)

@viniciusdc viniciusdc added the needs: follow-up 📫 Someone needs to get back to this issue or PR label Oct 24, 2024
@dcmcand
Contributor

dcmcand commented Oct 24, 2024

@joneszc can we close #2752 and #2723 since we have this one?

@joneszc
Contributor Author

joneszc commented Oct 24, 2024

@joneszc can we close #2752 and #2723 since we have this one?

@dcmcand @viniciusdc
Yes, those two PRs were built on forks of the old develop branch that is now main.
Thanks for helping determine that the branch was not the issue causing the Pytest failures. #2752 and #2723 can be closed.

@joneszc joneszc changed the title UPDATED2: Add config option to enable the encryption of AWS EKS secrets Add config option to enable the encryption of AWS EKS secrets Oct 24, 2024
@joneszc
Contributor Author

joneszc commented Oct 24, 2024

@viniciusdc
I've opened PR#537 to update the docs per your request

Also, in follow-up to your question: it appears that re-deploying to set KMS encryption on an existing Nebari EKS cluster, without encryption previously set, does succeed. However, attempting thereafter to re-deploy to remove the previously set EKS secrets encryption will fail, as Terraform attempts to delete and rebuild the EKS cluster but cannot due to the existing node groups.

@joneszc joneszc self-assigned this Oct 24, 2024
session = aws_session(region=region)
client = session.client("kms")
# https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/kms/client/list_keys.html
paginator = client.get_paginator("list_keys")
Contributor

Do we actually need a paginator here? It doesn't seem like we are using it. It seems like lines 137-141 could be replaced by

kms_keys = [i["KeyId"] for i in  client.list_keys().get("Keys")]

Which seems easier to read to me since it avoids nested loops.

Contributor

or if you want it more explicit

key_id_list = client.list_keys().get("Keys")
kms_keys = [i.get("KeyId") for i in key_id_list]

Contributor Author

updated

for i in paginator.paginate()
for j in i["Keys"]
]
return {i["KeyId"]: {k: i[k] for k in fields} for i in kms_keys if i["Enabled"]}
Contributor
@dcmcand dcmcand Oct 25, 2024

I don't feel like this line is super readable. I would suggest extracting it into a separate function.

from dataclasses import dataclass

@dataclass
class Kms_key:
    Arn: str
    KeyUsage: str
    KeySpec: str

def check_kms_keys(key_ids: list[str], client) -> list[Kms_key]:
    keys = []
    for key_id in key_ids:
        key = client.describe_key(KeyId=key_id).get("KeyMetadata")
        if key.get("Enabled"):
            keys.append(
                Kms_key(
                    Arn=key.get("Arn"),
                    KeyUsage=key.get("KeyUsage"),
                    KeySpec=key.get("KeySpec"),
                )
            )
    return keys

This would be way easier to follow and accomplish the same thing.

Contributor

I do prefer this as well!

Contributor Author
@joneszc joneszc Oct 25, 2024

I've made the function more readable, but I did not separate it into two functions.
Note: the filter on the dictionary serves no purpose other than to avoid passing more data than necessary when validating keys for EKS encryption. This function could be reused in the future for collecting metadata on KMS keys used by services other than EKS encryption.

This function could just as easily be written:

def kms_key_arns(region: str) -> Dict[str, dict]:
    """Return dict of available/enabled KMS key IDs and associated KeyMetadata for the AWS region."""
    session = aws_session(region=region)
    client = session.client("kms")
    kms_keys = {}
    for key in client.list_keys().get("Keys"):
        key_id = key["KeyId"]
        key_data = client.describe_key(KeyId=key_id).get("KeyMetadata")
        if key_data.get("Enabled"):
            kms_keys[key_id] = key_data
    return kms_keys
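For reference, a quick way to exercise this helper locally (assuming AWS credentials are configured; the region is just an example):

keys = kms_key_arns("us-east-1")
for key_id, metadata in keys.items():
    print(key_id, metadata["Arn"], metadata["KeySpec"], metadata["KeyUsage"])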

Contributor

Yeah, this looks much cleaner and more readable. Nice work there.

Comment on lines 559 to 589
# check if kms key is valid
available_kms_keys = amazon_web_services.kms_key_arns(data["region"])
if "eks_kms_arn" in data and data["eks_kms_arn"] is not None:
    key_id = [
        id for id in available_kms_keys.keys() if id in data["eks_kms_arn"]
    ]
    if (
        len(key_id) == 1
        and available_kms_keys[key_id[0]]["Arn"] == data["eks_kms_arn"]
    ):
        key_id = key_id[0]
        # Symmetric KMS keys with Encrypt and decrypt key-usage have the SYMMETRIC_DEFAULT key-spec
        # EKS cluster encryption requires a Symmetric key that is set to encrypt and decrypt data
        if available_kms_keys[key_id]["KeySpec"] != "SYMMETRIC_DEFAULT":
            if available_kms_keys[key_id]["KeyUsage"] == "GENERATE_VERIFY_MAC":
                raise ValueError(
                    f"Amazon Web Services KMS Key with ID {key_id} does not have KeyUsage set to 'Encrypt and decrypt' data"
                )
            elif available_kms_keys[key_id]["KeyUsage"] != "ENCRYPT_DECRYPT":
                raise ValueError(
                    f"Amazon Web Services KMS Key with ID {key_id} is not of type Symmetric, and KeyUsage not set to 'Encrypt and decrypt' data"
                )
            else:
                raise ValueError(
                    f"Amazon Web Services KMS Key with ID {key_id} is not of type Symmetric"
                )
    else:
        raise ValueError(
            f"Amazon Web Services KMS Key with ARN {data['eks_kms_arn']} not one of available/enabled keys={[v['Arn'] for v in available_kms_keys.values()]}"
        )

Contributor

You could avoid a lot of nesting here by flipping the logic on these if statements. Something like:

        available_kms_keys = amazon_web_services.kms_key_arns(data["region"])
        # don't check if eks_kms_arn is not set
        if "eks_kms_arn" not in data or data["eks_kms_arn"] is None:
            return data
        key_id = [
            id for id in available_kms_keys.keys() if id in data["eks_kms_arn"]
        ]
        # Raise error if key_id is not found in available_kms_keys
        if (
            len(key_id) != 1
            or available_kms_keys[key_id[0]]["Arn"] != data["eks_kms_arn"]
        ):
            raise ValueError(
                f"Amazon Web Services KMS Key with ARN {data['eks_kms_arn']} not one of available/enabled keys={[v['Arn'] for v in available_kms_keys.values()]}"
            )

        key_id = key_id[0]
        # EKS cluster encryption requires a Symmetric key that is set to encrypt and decrypt data
        if available_kms_keys[key_id]["KeySpec"] != "SYMMETRIC_DEFAULT":
            raise ValueError(
                f"Amazon Web Services KMS Key with ID {key_id} is not of type Symmetric"
            )
        if available_kms_keys[key_id]["KeyUsage"] != "ENCRYPT_DECRYPT":
            raise ValueError(
                f"Amazon Web Services KMS Key with ID {key_id} KeyUsage not set to 'Encrypt and decrypt' data"
            )

This has less deeply nested logic and is therefore easier to read and test.

Contributor Author
@joneszc joneszc Oct 25, 2024

@dcmcand
My hesitation to invoke `return data` in numerous places in this section is that it could cause issues in the future if additional validators are appended after the KMS-specific lines under `_check_input`; those later validators could then be skipped prematurely for AWS services.

Contributor

@joneszc I can see that. It might make sense to pull every validator into its own method then, and have _check_input just call a list of validators. That way each could exit early if they can and nothing would get skipped. I think that would also be cleaner for readability and avoid too much responsibility for a single method.
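A minimal sketch of that pattern (function names here are hypothetical, not Nebari's actual methods):

def _validate_region(data: dict) -> dict:
    # ... existing region checks ...
    return data

def _validate_kms_key(data: dict) -> dict:
    if not data.get("eks_kms_arn"):
        return data  # early return skips only the KMS checks, not the other validators
    # ... KMS-specific checks ...
    return data

def _check_input(data: dict) -> dict:
    for validate in (_validate_region, _validate_kms_key):
        data = validate(data)
    return data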

Contributor Author
@joneszc joneszc Oct 28, 2024

@dcmcand
That would make sense. I could open a separate PR to break _check_input into multiple methods, if you'd like. For now, I've updated the logic for the first check based on your recommendation. I've also added a check to ensure the KMS key is customer-managed rather than AWS-managed. I've left the final check in place with conditionals to triage the error precisely; for example, there is a case where a SYMMETRIC_DEFAULT key could seem adequate but will still fail if its KeyUsage is "GENERATE_VERIFY_MAC" rather than "ENCRYPT_DECRYPT".
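For context, the customer-managed check can be read off DescribeKey's KeyManager field, which is "AWS" for AWS-managed keys and "CUSTOMER" for customer-managed ones; a hedged sketch (variable names are illustrative, not the PR's exact code):

key_metadata = client.describe_key(KeyId=data["eks_kms_arn"])["KeyMetadata"]
if key_metadata["KeyManager"] != "CUSTOMER":
    raise ValueError(
        f"KMS key {data['eks_kms_arn']} is AWS-managed; a customer-managed key is required for EKS secrets encryption"
    )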

Contributor

Can you open an issue for breaking out the _check_input methods so that doesn't get lost?

Contributor Author

yes!

Contributor

Hi @joneszc, once you open the issue, can you reference this one as well, so we have a cross-reference for context?

Contributor Author
@joneszc joneszc Nov 1, 2024

@viniciusdc @dcmcand
yes, here is the issue

@viniciusdc
Contributor

viniciusdc commented Oct 25, 2024

However, attempting thereafter to re-deploy to remove the previously set EKS secrets encryption will fail as terraform attempts to delete and rebuild the EKS cluster but cannot due to existing node groups.

Hi @joneszc, thanks for checking that out! I was already expecting it to fail, but I had another thing in mind: they might be connected. Can you post a sanitized output of the terraform error and any error messages you might encounter in the CloudTrail history? I suspect you will find something related to the KMS key in there.

The main reason for this request is to validate whether it would be beneficial to make this an immutable field or whether, depending on the error, we can add manual steps to our docs for the user to disable it.

@joneszc
Contributor Author

joneszc commented Oct 28, 2024

However, attempting thereafter to re-deploy to remove the previously set EKS secrets encryption will fail as terraform attempts to delete and rebuild the EKS cluster but cannot due to existing node groups.

Hi @joneszc, thanks for checking that out! I was already expecting it to fail, but I had another thing in mind: they might be connected. Can you post a sanitized output of the terraform error and any error messages you might encounter in the CloudTrail history? I suspect you will find something related to the KMS key in there.

The main reason for this request is to validate whether it would be beneficial to make this an immutable field or whether, depending on the error, we can add manual steps to our docs for the user to disable it.

@viniciusdc

Nebari output after a failed attempt to re-deploy to remove the EKS cluster's envelope encryption of secrets:

[terraform]:   # module.kubernetes.aws_eks_cluster.main must be replaced
[terraform]: -/+ resource "aws_eks_cluster" "main" {
[terraform]:       ~ arn                       = "arn:aws:eks:us-east-1:<account-id>:cluster/nebari-test-dev" -> (known after apply)
[terraform]:       ~ certificate_authority     = [
[terraform]:           - {
[terraform]:               - data = "<>"
[terraform]:             },
[terraform]:         ] -> (known after apply)
[terraform]:       + cluster_id                = (known after apply)
[terraform]:       ~ created_at                = "2024-10-28 15:25:47.172 +0000 UTC" -> (known after apply)
[terraform]:       - enabled_cluster_log_types = [] -> null
[terraform]:       ~ endpoint                  = "https://0000000000000000000000000.gr7.us-east-1.eks.amazonaws.com" -> (known after apply)
[terraform]:       ~ id                        = "nebari-test-dev" -> (known after apply)
[terraform]:       ~ identity                  = [
[terraform]:           - {
[terraform]:               - oidc = [
[terraform]:                   - {
[terraform]:                       - issuer = "https://oidc.eks.us-east-1.amazonaws.com/id/0000000000000000"
[terraform]:                     },
[terraform]:                 ]
[terraform]:             },
[terraform]:         ] -> (known after apply)
[terraform]:         name                      = "nebari-test-dev"
[terraform]:       ~ platform_version          = "eks.17" -> (known after apply)
[terraform]:       ~ status                    = "ACTIVE" -> (known after apply)
[terraform]:         tags                      = {
[terraform]:             "Environment" = "dev"
[terraform]:             "Name"        = "nebari-test-dev"
[terraform]:             "Owner"       = "terraform"
[terraform]:             "Project"     = "nebari-test"
[terraform]:         }
[terraform]:         # (3 unchanged attributes hidden)
[terraform]:
[terraform]:       - access_config {
[terraform]:           - authentication_mode                         = "CONFIG_MAP" -> null
[terraform]:           - bootstrap_cluster_creator_admin_permissions = false -> null
[terraform]:         }
[terraform]:
[terraform]:       - encryption_config { # forces replacement
[terraform]:           - resources = [
[terraform]:               - "secrets",
[terraform]:             ] -> null
[terraform]:
[terraform]:           - provider {
[terraform]:               - key_arn = "arn:aws:kms:us-east-1:<account-id>:key/0000000000000000" -> null
[terraform]:             }
[terraform]:         }
[terraform]:
[terraform]:       - kubernetes_network_config {
[terraform]:           - ip_family         = "ipv4" -> null
[terraform]:           - service_ipv4_cidr = "172.20.0.0/16" -> null
[terraform]:         }
[terraform]:
[terraform]:       ~ vpc_config {
[terraform]:           ~ cluster_security_group_id = "sg-xxxxxxxxxxxxxxxxxx" -> (known after apply)
[terraform]:           ~ vpc_id                    = "vpc-xxxxxxxxxxxxxxxx" -> (known after apply)
[terraform]:             # (5 unchanged attributes hidden)
[terraform]:         }
[terraform]:     }
[terraform]:
[terraform]:   # module.kubernetes.aws_iam_openid_connect_provider.oidc_provider must be replaced
[terraform]: -/+ resource "aws_iam_openid_connect_provider" "oidc_provider" {
[terraform]:       ~ arn             = "arn:aws:iam::<account-id>:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/0000000000000000000000000" -> (known after apply)
[terraform]:       ~ id              = "arn:aws:iam::<account-id>:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/0000000000000000000000000" -> (known after apply)
[terraform]:         tags            = {
[terraform]:             "Environment" = "dev"
[terraform]:             "Name"        = "nebari-test-dev-eks-irsa"
[terraform]:             "Owner"       = "terraform"
[terraform]:             "Project"     = "nebari-test"
[terraform]:         }
[terraform]:       ~ thumbprint_list = [
[terraform]:           - "9e99a48a9960b14926bb7f3b02e22da2b0ab7280",
[terraform]:           - "06b25927c42a721631c1efd9431e648fa62e1e39",
[terraform]:           - "d9fe0a65fa00cabf61f5120d373a8135e1461f15",
[terraform]:           - "7f3682e963aa03a7bcd67f11b0fedae315af49d4",
[terraform]:         ] -> (known after apply)
[terraform]:       ~ url             = "oidc.eks.us-east-1.amazonaws.com/id/0000000000000000000000000" # forces replacement -> (known after apply) # forces replacement
[terraform]:         # (2 unchanged attributes hidden)
[terraform]:     }
[terraform]:
[terraform]:   # module.kubernetes.aws_iam_policy.cluster_encryption[0] will be destroyed
[terraform]:   # (because index [0] is out of range for count)
[terraform]:   - resource "aws_iam_policy" "cluster_encryption" {
[terraform]:       - arn         = "arn:aws:iam::<account-id>:policy/nebari-test-dev-eks-encryption-policy" -> null
[terraform]:       - description = "IAM policy for EKS cluster encryption" -> null
[terraform]:       - id          = "arn:aws:iam::<account-id>:policy/nebari-test-dev-eks-encryption-policy" -> null
[terraform]:       - name        = "nebari-test-dev-eks-encryption-policy" -> null
[terraform]:       - path        = "/" -> null
[terraform]:       - policy      = jsonencode(
[terraform]:             {
[terraform]:               - Statement = [
[terraform]:                   - {
[terraform]:                       - Action   = [
[terraform]:                           - "kms:ListGrants",
[terraform]:                           - "kms:Encrypt",
[terraform]:                           - "kms:DescribeKey",
[terraform]:                           - "kms:Decrypt",
[terraform]:                         ]
[terraform]:                       - Effect   = "Allow"
[terraform]:                       - Resource = "arn:aws:kms:us-east-1:<account-id>:key/3zzzzzzzzzzzzz"
[terraform]:                     },
[terraform]:                 ]
[terraform]:               - Version   = "2012-10-17"
[terraform]:             }
[terraform]:         ) -> null
[terraform]:       - policy_id   = "ANPARM6PEZIZXIYANUQUT" -> null
[terraform]:       - tags        = {} -> null
[terraform]:       - tags_all    = {} -> null
[terraform]:     }
[terraform]:
[terraform]:   # module.kubernetes.aws_iam_role_policy_attachment.cluster_encryption[0] will be destroyed
[terraform]:   # (because index [0] is out of range for count)
[terraform]:   - resource "aws_iam_role_policy_attachment" "cluster_encryption" {
[terraform]:       - id         = "nebari-test-dev-eks-cluster-role-00000000000000000" -> null
[terraform]:       - policy_arn = "arn:aws:iam::<account-id>:policy/nebari-test-dev-eks-encryption-policy" -> null
[terraform]:       - role       = "nebari-test-dev-eks-cluster-role" -> null
[terraform]:     }
[terraform]:
[terraform]: Plan: 3 to add, 0 to change, 5 to destroy.
[terraform]:
[terraform]: Changes to Outputs:
[terraform]:   ~ cluster_oidc_issuer_url = "https://oidc.eks.us-east-1.amazonaws.com/id/0000000000000000000000000" -> (known after apply)
[terraform]:   ~ kubernetes_credentials  = (sensitive value)
[terraform]:   ~ oidc_provider_arn       = "arn:aws:iam::<account-id>:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/0000000000000000000000000" -> (known after apply)
[terraform]: local_file.kubeconfig[0]: Destroying... [id=ebb9ba2900716cbac8f3zzzzzzzzzzzzz]
[terraform]: local_file.kubeconfig[0]: Destruction complete after 0s
[terraform]: module.kubernetes.aws_iam_openid_connect_provider.oidc_provider: Destroying... [id=arn:aws:iam::<account-id>:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/0000000000000000000000000]
[terraform]: module.kubernetes.aws_iam_openid_connect_provider.oidc_provider: Destruction complete after 0s
[terraform]: module.kubernetes.aws_eks_cluster.main: Destroying... [id=nebari-test-dev]
[terraform]: module.kubernetes.aws_eks_cluster.main: Still destroying... [id=nebari-test-dev, 10s elapsed]
[terraform]: module.kubernetes.aws_eks_cluster.main: Still destroying... [id=nebari-test-dev, 20s elapsed]
[terraform]: module.kubernetes.aws_eks_cluster.main: Still destroying... [id=nebari-test-dev, 30s elapsed]
[terraform]:
[terraform]: Error: deleting EKS Cluster (nebari-test-dev): operation error EKS: DeleteCluster, https response error StatusCode: 409, RequestID: de8f18ba-0abe-42ae-961f-86d8865fbcf3, ResourceInUseException: Cluster has nodegroups attached
[terraform]: 
[terraform]: 
[terraform]:
Traceback (most recent call last)
/home/ssm-user/nebari_private_test/nebari/src/_nebari/subcommands/deploy.py:92 in deploy
    
89   msg = "Digital Ocean support is currently being deprecated and will be removed
90   typer.confirm(msg)                                                     
91                                                                              
92   deploy_configuration(                                                      
93   config,                                                                
94   stages,                                                                
95   disable_prompt=disable_prompt,                                         
    
/home/ssm-user/nebari_private_test/nebari/src/_nebari/deploy.py:55 in deploy_configuration
    
52     s: hookspecs.NebariStage = stage(                                  
53     output_directory=pathlib.Path.cwd(), config=config             
54     )                                                                  
55     stack.enter_context(s.deploy(stage_outputs, disable_prompt))       
56                                                                        
57     if not disable_checks:                                             
58     s.check(stage_outputs, disable_prompt)                         
    
/usr/lib64/python3.11/contextlib.py:505 in enter_context                                
    
502   except AttributeError:                                                    
503   raise TypeError(f"'{cls.__module__}.{cls.__qualname__}' object does " 
504   f"not support the context manager protocol") from None
505   result = _enter(cm)                                                       
506   self._push_cm_exit(cm, _exit)                                             
507   return result                                                             
508 
    
/usr/lib64/python3.11/contextlib.py:137 in __enter__                                    
    
134   # they are only needed for recreation, which is not possible anymore      
135   del self.args, self.kwds, self.func                                       
136   try:                                                                      
137   return next(self.gen)                                                 
138   except StopIteration:                                                     
139   raise RuntimeError("generator didn't yield") from None                
140 
    
/home/ssm-user/nebari_private_test/nebari/src/_nebari/stages/infrastructure/__init__.py:961 in deploy
    
958 def deploy(                                                                   
959   self, stage_outputs: Dict[str, Dict[str, Any]], disable_prompt: bool = False
960 ):                                                                            
961   with super().deploy(stage_outputs, disable_prompt):                       
962   with kubernetes_provider_context(                                     
963     stage_outputs["stages/" + self.name]["kubernetes_credentials"]["value"]    
964   ):                                                                    
    
/usr/lib64/python3.11/contextlib.py:137 in __enter__                                    
    
134   # they are only needed for recreation, which is not possible anymore      
135   del self.args, self.kwds, self.func                                       
136   try:                                                                      
137   return next(self.gen)                                                 
138   except StopIteration:                                                     
139   raise RuntimeError("generator didn't yield") from None                
140 
    
/home/ssm-user/nebari_private_test/nebari/src/_nebari/stages/base.py:298 in deploy      
    
295   deploy_config["terraform_import"] = True                              
296   deploy_config["state_imports"] = state_imports                        
297                                                                             
298   self.set_outputs(stage_outputs, terraform.deploy(**deploy_config))        
299   self.post_deploy(stage_outputs, disable_prompt)                           
300   yield                                                                     
301 
    
/home/ssm-user/nebari_private_test/nebari/src/_nebari/provider/terraform.py:71 in deploy
    
 68     )                                                                 
 69                                                                             
 70   if terraform_apply:                                                       
 71   apply(directory, var_files=[f.name])                                  
 72                                                                             
 73   if terraform_destroy:                                                     
 74   destroy(directory, var_files=[f.name])

/home/ssm-user/nebari_private_test/nebari/src/_nebari/provider/terraform.py:153 in apply
    
150   + ["-var-file=" + _ for _ in var_files]                                   
151 )                                                                             
152 with timer(logger, "terraform apply"):                                        
153   run_terraform_subprocess(command, cwd=directory, prefix="terraform")      
154 
155
156 def output(directory=None):

/home/ssm-user/nebari_private_test/nebari/src/_nebari/provider/terraform.py:119 in run_terraform_subprocess                                                                
    
116 logger.info(f" terraform at {terraform_path}")                                
117 exit_code, output = run_subprocess_cmd([terraform_path] + processargs, **kwargs)
118 if exit_code != 0:                                                            
119   raise TerraformException("Terraform returned an error")                   
120 return output                                                                 
121 
122 
TerraformException: Terraform returned an error

Additional Error details from CloudTrail:
[screenshot of CloudTrail error details]

@joneszc joneszc requested review from dcmcand and viniciusdc October 30, 2024 17:53
@dcmcand
Contributor

dcmcand commented Oct 31, 2024

So @joneszc am I reading that correctly that enabling this option will destroy and replace your cluster? We should probably go ahead and make this field immutable then. We definitely don't want anyone accidentally destroying their deploy. The docs should reflect that this should only be used on fresh deploys too.

Contributor
@dcmcand dcmcand left a comment

Hi @joneszc, this is looking really good. I think that the config option needs to be immutable though.

@@ -121,6 +121,27 @@ def instances(region: str) -> Dict[str, str]:
return {t: t for t in instance_types}


@functools.lru_cache()
Contributor

This is stylistic, so it is up to you whether to adopt this. I would probably add a Kms_Key_Info class or something.

from dataclasses import dataclass

@dataclass
class Kms_Key_Info:
    Arn: str
    KeyUsage: str
    KeySpec: str
    KeyManager: str

The advantage here is more clarity in your type annotations. Rather than returning Dict[str, dict], you can return Dict[str, Kms_Key_Info], which makes it clearer what is actually returned.

Contributor Author

@dcmcand
Updated, per your suggestion

@@ -174,6 +174,7 @@ class AWSInputVars(schema.Base):
eks_endpoint_access: Optional[
Literal["private", "public", "public_and_private"]
] = "public"
eks_kms_arn: Optional[str] = None
Contributor

I think this field should be immutable to prevent clusters from being deleted.

See src/_nebari/stages/kubernetes_services/__init__.py:66 for an example of an immutable field.

Contributor Author

So @joneszc am I reading that correctly that enabling this option will destroy and replace your cluster? We should probably go ahead and make this field immutable then. We definitely don't want anyone accidentally destroying their deploy. The docs should reflect that this should only be used on fresh deploys too.

Hello @dcmcand
Enabling the eks_kms_arn option, including doing so as a re-deploy on an existing cluster, is not a problem. The issue arises when one attempts to disable the cluster encryption on an existing cluster, which causes a failed attempt to destroy the cluster. I've added a warning for this behavior in the docs update. Making the field immutable would block users from setting cluster encryption on an existing cluster, which currently works. Is there a mechanism for allowing the field to be set if it is not already set, but keeping it immutable once it has been set? The risk of destroying the cluster is in any case blocked by the error:
Error: deleting EKS Cluster (nebari-test-dev): operation error EKS: DeleteCluster, https response error StatusCode: 409, RequestID: de8f18ba-0abe-42ae-961f-86d8865fbcf3, ResourceInUseException: Cluster has nodegroups attached

Contributor

ok, so:
Fresh install with this enabled = no problem
Fresh install with this disabled = no problem
Enabling on existing install where it wasn't previously enabled = no problem
Disabling on existing install where it was previously enabled = tries to destroy cluster, but fails

What if someone tries to disable encryption, it fails, and then they re-enable it in the config and try to deploy again? Does that succeed without error? If so, then I think we are good to go.

It would be nice to have one-way gates as an option on immutable fields for situations like this.
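One possible shape for such a one-way gate, sketched as a generic previous-vs-new config comparison (purely hypothetical; this is not an existing Nebari mechanism):

def check_one_way_field(previous: dict, new: dict, field: str = "eks_kms_arn") -> None:
    """Allow unset -> set, but forbid changing or clearing a value that was already set."""
    old_value = previous.get(field)
    new_value = new.get(field)
    if old_value is not None and new_value != old_value:
        raise ValueError(
            f"{field} was previously set to {old_value!r} and cannot be changed or removed"
        )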

Contributor

@joneszc can you test the above scenario and update with the latest main?

Contributor Author

@dcmcand
Yes, that is the correct logic flow. I'm testing now. I also need to test how the components behave after enabling encryption and then actually encrypting the secrets, since the docs mention this important step, which might have additional repercussions on Nebari components beyond cluster functionality.

Contributor Author

@dcmcand
Following a failed attempt to disable encryption (by setting a previously configured amazon_web_services.eks_kms_arn value to null or removing the field altogether), a follow-up re-deploy that re-enables the original encryption setting succeeds without issues. Also, an attempt to switch KMS keys by replacing a previously set KMS key ARN with a new ARN will succeed in a re-deploy, but the KMS key is not actually changed on the EKS cluster; essentially, no change is made and the cluster functions as before. I'll make a note of this in the docs PR.

Contributor
@viniciusdc viniciusdc Nov 1, 2024

That's good to hear, @joneszc. If I understood correctly: if by any chance the user removes the encryption key from the config and redeploys, they would first face an error, but if they re-attempt the deployment (I assume a few minutes later), it will succeed?

If so, please add this as an admonition to the docs under a warning.

Contributor Author

Hello @viniciusdc

That is correct.
I've added the admonitions to the docs PR#537, per your recommendation.

@joneszc joneszc requested a review from dcmcand November 4, 2024 15:53
@joneszc joneszc merged commit 8b2ffb9 into nebari-dev:main Nov 5, 2024
11 checks passed
Labels
needs: follow-up 📫 Someone needs to get back to this issue or PR
Projects
Status: Done 💪🏾
3 participants