Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

testsys: change deletionpolicy for ECS clusters to 'onDeletion' #2670

Closed
wants to merge 1 commit into from

Conversation

etungsten
Copy link
Contributor

@etungsten etungsten commented Dec 17, 2022

Issue number:

N/A

Description of changes:

    testsys: change deletionpolicy for ecs cluster to 'onDeletion'
    
    Prior to this change, there was a chance of a race-condition between a
    ECS cluster destruction job and a ECS instance destruction job where ECS
    cluster desctruction would fail because the ECS instances aren't fully
    deleted yet.

When the default deletion policy was set to onTestSuccess all resources destruction jobs would trigger at the same time and the testsys controller seems to ignore the dependencies between the ECS instance resource and the ECS cluster resource. This would in turn cause a race condition within ECS resource deletion where ECS cluster destruction might trigger before ECS instances are fully cleaned-up and cause the process to fail.

The end result is an ECS cluster testsys resource that have to be manually removed.

Testing done:
Tests kick off fine:

[cargo-make] INFO - cargo make 0.36.3
[cargo-make] INFO - Build File: Makefile.toml
[cargo-make] INFO - Task: test
[cargo-make] INFO - Profile: development
[cargo-make] INFO - Running Task: setup
[cargo-make] INFO - Running Task: fetch-sources
[cargo-make] INFO - Running Task: test-tools
[cargo-make] INFO - Running Task: test
[2022-12-17T00:07:35Z INFO  testsys::run] Successfully added 'x86-64-aws-ecs-1'
[2022-12-17T00:07:35Z INFO  testsys::run] Successfully added 'x86-64-aws-ecs-1-instances-migration'
[2022-12-17T00:07:36Z INFO  testsys::run] Successfully added 'x86-64-aws-ecs-1-test-1-initial'
[2022-12-17T00:07:36Z INFO  testsys::run] Successfully added 'x86-64-aws-ecs-1-2-migrate'
[2022-12-17T00:07:36Z INFO  testsys::run] Successfully added 'x86-64-aws-ecs-1-test-3-migrated'
[2022-12-17T00:07:36Z INFO  testsys::run] Successfully added 'x86-64-aws-ecs-1-4-migrate'
[2022-12-17T00:07:36Z INFO  testsys::run] Successfully added 'x86-64-aws-ecs-1-test-5-final'
[cargo-make] INFO - Build Done in 3.69 seconds.
[cargo-make] INFO - cargo make 0.36.3
[cargo-make] INFO - Build File: Makefile.toml
[cargo-make] INFO - Task: test
[cargo-make] INFO - Profile: development
[cargo-make] INFO - Running Task: setup
[cargo-make] INFO - Running Task: fetch-sources
[cargo-make] INFO - Running Task: test-tools
[cargo-make] INFO - Running Task: test
[2022-12-17T00:07:40Z INFO  testsys::run] Successfully added 'x86-64-aws-ecs-1-nvidia'
[2022-12-17T00:07:40Z INFO  testsys::run] Successfully added 'x86-64-aws-ecs-1-nvidia-instances-migration'
[2022-12-17T00:07:40Z INFO  testsys::run] Successfully added 'x86-64-aws-ecs-1-nvidia-test-1-initial'
[2022-12-17T00:07:41Z INFO  testsys::run] Successfully added 'x86-64-aws-ecs-1-nvidia-2-migrate'
[2022-12-17T00:07:41Z INFO  testsys::run] Successfully added 'x86-64-aws-ecs-1-nvidia-test-3-migrated'
[2022-12-17T00:07:41Z INFO  testsys::run] Successfully added 'x86-64-aws-ecs-1-nvidia-4-migrate'
[2022-12-17T00:07:41Z INFO  testsys::run] Successfully added 'x86-64-aws-ecs-1-nvidia-test-5-final'
[cargo-make] INFO - Build Done in 3.76 seconds.
[cargo-make] INFO - cargo make 0.36.3
[cargo-make] INFO - Build File: Makefile.toml
[cargo-make] INFO - Task: test
[cargo-make] INFO - Profile: development
[cargo-make] INFO - Running Task: setup
[cargo-make] INFO - Running Task: fetch-sources
[cargo-make] INFO - Running Task: test-tools
[cargo-make] INFO - Running Task: test
[2022-12-17T00:07:45Z INFO  testsys::run] Successfully added 'aarch64-aws-ecs-1'
[2022-12-17T00:07:45Z INFO  testsys::run] Successfully added 'aarch64-aws-ecs-1-instances-migration'
[2022-12-17T00:07:45Z INFO  testsys::run] Successfully added 'aarch64-aws-ecs-1-test-1-initial'
[2022-12-17T00:07:46Z INFO  testsys::run] Successfully added 'aarch64-aws-ecs-1-2-migrate'
[2022-12-17T00:07:46Z INFO  testsys::run] Successfully added 'aarch64-aws-ecs-1-test-3-migrated'
[2022-12-17T00:07:46Z INFO  testsys::run] Successfully added 'aarch64-aws-ecs-1-4-migrate'
[2022-12-17T00:07:46Z INFO  testsys::run] Successfully added 'aarch64-aws-ecs-1-test-5-final'

After all the tests passes, the resources stay around as expected:

[cargo-make] INFO - Running Task: testsys
 NAME                                                    TYPE                 STATE                PASSED            SKIPPED            FAILED          
 aarch64-aws-ecs-1                                       Resource             completed                                                                 
 aarch64-aws-ecs-1-2-migrate                             Test                 pass                 2                 0                  0               
 aarch64-aws-ecs-1-4-migrate                             Test                 pass                 2                 0                  0               
 aarch64-aws-ecs-1-instances-migration                   Resource             completed                                                                 
 aarch64-aws-ecs-1-test-1-initial                        Test                 pass                 1                 0                  0               
 aarch64-aws-ecs-1-test-3-migrated                       Test                 pass                 1                 0                  0               
 aarch64-aws-ecs-1-test-5-final                          Test                 pass                 1                 0                  0               
 x86-64-aws-ecs-1                                        Resource             completed                                                                 
 x86-64-aws-ecs-1-2-migrate                              Test                 pass                 2                 0                  0               
 x86-64-aws-ecs-1-4-migrate                              Test                 pass                 2                 0                  0                x86-64-aws-ecs-1-instances-migration                    Resource             completed                                                                 
 x86-64-aws-ecs-1-nvidia                                 Resource             completed                                                                 
 x86-64-aws-ecs-1-nvidia-2-migrate                       Test                 pass                 2                 0                  0               
 x86-64-aws-ecs-1-nvidia-4-migrate                       Test                 pass                 2                 0                  0               
 x86-64-aws-ecs-1-nvidia-instances-migration             Resource             completed                                                                 
 x86-64-aws-ecs-1-nvidia-test-1-initial                  Test                 pass                 1                 0                  0               
 x86-64-aws-ecs-1-nvidia-test-3-migrated                 Test                 pass                 1                 0                  0               
 x86-64-aws-ecs-1-nvidia-test-5-final                    Test                 pass                 1                 0                  0               
 x86-64-aws-ecs-1-test-1-initial                         Test                 pass                 1                 0                  0               
 x86-64-aws-ecs-1-test-3-migrated                        Test                 pass                 1                 0                  0               
 x86-64-aws-ecs-1-test-5-final                           Test                 pass                 1                 0                  0 

Then test deletion works as expected. TestSys properly first kicks off instance deletion and waits until that's done before deleting the ECS clusters. Everything gets cleaned up in the end as expected:

[cargo-make] INFO - cargo make 0.36.3
[cargo-make] INFO - Build File: Makefile.toml
[cargo-make] INFO - Task: clean-test
[cargo-make] INFO - Profile: development
[cargo-make] INFO - Running Task: setup
[cargo-make] INFO - Running Task: fetch-sources
[cargo-make] INFO - Running Task: test-tools
[cargo-make] INFO - Running Task: clean-test
Starting delete for x86-64-aws-ecs-1-test-3-migrated
Starting delete for aarch64-aws-ecs-1-test-1-initial
Starting delete for x86-64-aws-ecs-1-nvidia-4-migrate
Starting delete for x86-64-aws-ecs-1-nvidia-test-5-final
Starting delete for x86-64-aws-ecs-1-test-5-final
Starting delete for aarch64-aws-ecs-1-4-migrate
Starting delete for aarch64-aws-ecs-1-test-3-migrated
Starting delete for x86-64-aws-ecs-1-test-1-initial
Starting delete for x86-64-aws-ecs-1-nvidia-test-3-migrated
Starting delete for x86-64-aws-ecs-1-nvidia-2-migrate
Starting delete for x86-64-aws-ecs-1-2-migrate
Starting delete for x86-64-aws-ecs-1-nvidia-test-1-initial
Starting delete for aarch64-aws-ecs-1-2-migrate
Starting delete for aarch64-aws-ecs-1-test-5-final
Starting delete for x86-64-aws-ecs-1-4-migrate
Delete finished for x86-64-aws-ecs-1-test-3-migrated
Delete finished for aarch64-aws-ecs-1-test-1-initial
Delete finished for x86-64-aws-ecs-1-nvidia-4-migrate
Delete finished for x86-64-aws-ecs-1-nvidia-test-5-final
Delete finished for x86-64-aws-ecs-1-test-5-final
Delete finished for aarch64-aws-ecs-1-4-migrate
Delete finished for aarch64-aws-ecs-1-test-3-migrated
Delete finished for x86-64-aws-ecs-1-test-1-initial
Delete finished for x86-64-aws-ecs-1-nvidia-test-3-migrated
Delete finished for x86-64-aws-ecs-1-nvidia-2-migrate
Delete finished for x86-64-aws-ecs-1-2-migrate
Delete finished for x86-64-aws-ecs-1-nvidia-test-1-initial
Delete finished for aarch64-aws-ecs-1-2-migrate
Delete finished for aarch64-aws-ecs-1-test-5-final
Delete finished for x86-64-aws-ecs-1-4-migrate
Starting delete for x86-64-aws-ecs-1-nvidia-instances-migration
Starting delete for x86-64-aws-ecs-1-instances-migration
Starting delete for aarch64-aws-ecs-1-instances-migration
Delete finished for aarch64-aws-ecs-1-instances-migration
Delete finished for x86-64-aws-ecs-1-instances-migration
Delete finished for x86-64-aws-ecs-1-nvidia-instances-migration
Starting delete for x86-64-aws-ecs-1
Starting delete for x86-64-aws-ecs-1-nvidia
Starting delete for aarch64-aws-ecs-1
Delete finished for x86-64-aws-ecs-1
Delete finished for x86-64-aws-ecs-1-nvidia
Delete finished for aarch64-aws-ecs-1
[2022-12-17T00:06:53Z INFO  testsys::delete] Delete finished

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

Copy link
Contributor

@stmcginnis stmcginnis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@@ -169,7 +169,7 @@ pub(crate) async fn ec2_crd<'a>(
.set_labels(Some(labels))
.set_conflicts_with(conflicting_resources.into())
.set_secrets(Some(bottlerocket_input.crd_input.config.secrets.clone()))
.destruction_policy(DestructionPolicy::OnTestSuccess);
.destruction_policy(DestructionPolicy::OnDeletion);
Copy link
Contributor

@ecpullen ecpullen Dec 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should still be OnTestSuccess.

@@ -118,7 +118,7 @@ impl CrdCreator for VmwareK8sCreator {
.vcenter_workload_folder(&self.datacenter.folder)
.mgmt_cluster_kubeconfig_base64(&self.encoded_mgmt_cluster_kubeconfig)
.set_conflicts_with(Some(existing_clusters))
.destruction_policy(DestructionPolicy::OnTestSuccess)
.destruction_policy(DestructionPolicy::OnDeletion)
Copy link
Contributor

@ecpullen ecpullen Dec 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should stay as OnTestSuccess.

@@ -189,7 +189,7 @@ impl CrdCreator for VmwareK8sCreator {
})
.assume_role(bottlerocket_input.crd_input.config.agent_role.clone())
.set_conflicts_with(Some(existing_clusters))
.destruction_policy(DestructionPolicy::OnTestSuccess)
.destruction_policy(DestructionPolicy::OnDeletion)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one too.

@ecpullen
Copy link
Contributor

ecpullen commented Dec 19, 2022

When the default deletion policy was set to onTestSuccess all resources destruction jobs would trigger at the same time and the testsys controller seems to ignore the dependencies between the ECS instance resource and the ECS cluster resource. This would in turn cause a race condition within ECS resource deletion where ECS cluster destruction might trigger before ECS instances are fully cleaned-up and cause the process to fail.

The controller follows the dependencies between resources, it seems like the ec2 agent or the ecs test agent may not be fully cleaned up when they report they are.

ecs-cluster-resource deletion will not start until ec2-instance-resource deletion has finished.

Prior to this change, there was a chance of a race-condition between a
ECS cluster destruction job and a ECS instance destruction job where ECS
cluster desctruction would fail because the ECS instances aren't fully
deleted yet.
@etungsten etungsten changed the title testsys: change deletionpolicy for test resources to 'onDeletion' testsys: change deletionpolicy for ECS clusters to 'onDeletion' Dec 19, 2022
@etungsten
Copy link
Contributor Author

We longer need this since fixes were implemented in testsys EC2 and ECS resource agents so that the race condition shouldn't occur again.
bottlerocket-os/bottlerocket-test-system#720
bottlerocket-os/bottlerocket-test-system#716

@etungsten etungsten closed this Dec 20, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants