Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Include region/partition criteria for test instance selection #933

Open
wants to merge 2 commits into
base: tests_strategy_improvement
Choose a base branch
from

Conversation

apozsuse
Copy link
Collaborator

This PR includes the partition/region as a criteria for the selection of the instances that will be used for the tests.

The mechanism will try to test all the instance feature combinations possible that can be covered for each partition with the test instance catalog configured. If no test can be executed in some partition (with exceptions as arm64 images in aws-cn or aws-us-gov partitions) an exception is raised (as there's some config issue in the instance catalog).

I still left the actual test execution commented. I'll test it and include it in the following PR.

Also updated the provided configuration example with actual values of instances for each region.

mash/services/test/ec2_job.py Show resolved Hide resolved
mash/services/test/ec2_job.py Outdated Show resolved Hide resolved
f'feature combinations: {self.feature_combinations}'
# mash will try to test all the possible features supported in the
# available test regions for each partition
for partition, test_regions in self.partitions:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to do some smoke testing with cn partition in particular. If this will start to cause a lot of unnecessary false positives. The networking in CN region can be troublesome which leads to false job failures.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I follow your comment.
With the code as in this PR, all the instance features are tested in every partition, and that could be a problem, as you say, in CN region if tests fail per the unreliable behavior of the partition itself. Furthermore, a exception is raised if some feature combination is NOT possible to be tested in some partition.

Giving it a second thought, I was wondering if it would be better, to avoid too much job false failures, to try to test the features in a best effort basis (removing this exception raise and just writing a warning log if some feature cannot be tested in some partition) for us to control what is tested or not by means of the test instance catalog config.
That way we'd control, for example, if SEV is tested in CN or not, just removing the feature as supported in the CN test instance catalog... and we'd just a warning in the log stating SEV cannot be tested in CN partition with the current config...

wdyt?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The trouble with only treating failures as warning level is that something could slip through the cracks. Technically the partitions are different reference implementations and thus there's a chance an image fails on one and no on another. Probably the best we can do is to note this concern when we deploy. That way if we notice an uptick in false failures we are aware of the reason.

@@ -84,6 +85,13 @@ def run_job(self):
)
return

self.partitions = get_partition_test_regions(self.test_regions)
if not self.partitions:
msg = 'At least one partition is required for tests.'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really a configuration issue, i.e. the partition key missing in the config. The message should express that this needs to be fixed in the configuration.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

self.cloud_architecture == 'aarch64':
# Skip test aarch64 images in China and GovCloud.
# There are no aarch64 based instance types available.
continue
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is no longer correct, gov-cloud has M6g and T4g [1] we should fix this, meaning push aarch64 images to the gov-cloud regions. T4g is also available in China [2]

[1] https://docs.aws.amazon.com/ec2/latest/instancetypes/ec2-instance-regions.html
[2] https://www.amazonaws.cn/en/ec2/instance-types/

# for the partition that can cover a single feat combination
msg = (
f'No instances in the instance catalog for {partition}'
f' partition can cover the any feature combination: '
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/any//

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants