AL2022 Neuron Support #76

Merged
merged 3 commits into from
Sep 14, 2022

Conversation

Realmonia
Contributor

@Realmonia Realmonia commented Sep 6, 2022

Summary

Merge feature/al2022neuron branch to main

Implementation details

  • Add new target to Makefile
  • Create al2022neu.pkr.hcl and modify al2022.pkr.hcl build sources to include al2022neu
  • Modify scripts/enable-ecs-agent-inferentia-support.sh to also run when AMI_TYPE is al2022neu, and skip the Neuron tools installation on AL2022.
  • Run scripts/enable-ecs-agent-inferentia-support.sh to install the Neuron packages on the AL2022 base instance and enable INF on the instance.
  • Add a reboot before the enable-ecs-agent-inferentia-support.sh script so that a possible kernel upgrade takes effect (when the release distribution on the base AMI is not the latest).
  • Some other updates
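
The AMI_TYPE gating described for scripts/enable-ecs-agent-inferentia-support.sh could look roughly like the sketch below. This is not the script from this PR: the function names are hypothetical, and `al2inf` is an assumed name for the pre-existing Inferentia AMI type. Only `al2022neu` and the "skip Neuron tools on AL2022" behavior come from the description above.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not taken from the PR.

# Succeeds when the Neuron setup should run for the given AMI type.
# "al2inf" is an assumed name for the pre-existing Inferentia AMI type.
should_run_neuron_setup() {
    case "$1" in
        al2inf | al2022neu) return 0 ;;
        *) return 1 ;;
    esac
}

# Succeeds when the Neuron tools should be installed as well.
# Per the PR description, tools installation is skipped on AL2022.
should_install_neuron_tools() {
    should_run_neuron_setup "$1" && [ "$1" != "al2022neu" ]
}

# Prints the plan for an AMI type: "skip", "packages-only" or "packages+tools".
neuron_setup_plan() {
    if ! should_run_neuron_setup "$1"; then
        echo "skip"
    elif should_install_neuron_tools "$1"; then
        echo "packages+tools"
    else
        echo "packages-only"
    fi
}
```

With these helpers, `neuron_setup_plan "$AMI_TYPE"` prints `packages-only` for `al2022neu` and `skip` for any AMI type that is not listed.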

Testing

New tests cover the changes: No

Manual Tests:
Ran REGION=us-west-2 make al2022neu and successfully created a private AMI named unofficial-amzn2022-ami-ecs-neu-hvm-2022.0.20220831-x86_64-ebs (ID ami-0cb05523683ed2eb6).

Launched inf1/trn1 instances with this AMI and ran a busybox task on this instance.
trn1.2xlarge instance gets 1 neuron device
trn1.32xlarge instance gets 16 neuron devices
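
One way to check the device counts above on a running instance is to count the Neuron device nodes. This is a minimal sketch, assuming the driver exposes devices as /dev/neuron0, /dev/neuron1, and so on. The function name and the optional directory argument are hypothetical; the argument exists only so the logic can be tried against a scratch directory.

```shell
#!/usr/bin/env bash
# Sketch only: counts entries named neuron<N>... in a directory (default /dev).
count_neuron_devices() {
    local dir="${1:-/dev}"
    local count=0
    local path
    for path in "$dir"/neuron[0-9]*; do
        # An unmatched glob stays literal, so confirm the entry really exists.
        if [ -e "$path" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}
```

Based on the results reported here, `count_neuron_devices` would be expected to print 1 on a trn1.2xlarge and 16 on a trn1.32xlarge.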

Functional test:

./bin/run-test -domain desktop -realm us-west-2 -platform al2022neu --imageId ami-0cb05523683ed2eb6
  • --- FAIL: TestContainerAccessIntrospection (21.73s). This is a known AL2022 base AMI issue with the route table and is not related to the specific changes made in this PR.

      run.go:147: Expected callIntrospection to exit with 42; actually exited with 11. logGroup=macis-application-logs logStream=application-logs/callIntrospection/73871a3a91624c1c804ab7a175a8850c
    
  • --- FAIL: TestEFS/test-efs-rw-awsvpc (166.33s). Error:

    StoppedReason: "Error response from daemon: create ecs-ecsftest-test-efs-rw-awsvpc-2799b080208bbcb584a2f18b29920839-1-task-efs-write-c0dec88888cedca66e00: VolumeDriver.Create: mounting volume failed: b'mount.nfs4: Failed to resolve server fs-4eb22fe4.efs.us-west-2.amazonaws.com: Name or service not known'",
    

    However, on the affected host, the EFS hostname resolves correctly:

    sh-5.1$ ping fs-4eb22fe4.efs.us-west-2.amazonaws.com
    PING fs-4eb22fe4.efs.us-west-2.amazonaws.com (10.0.65.39) 56(84) bytes of data.
    ^C
    --- fs-4eb22fe4.efs.us-west-2.amazonaws.com ping statistics ---
    1 packets transmitted, 0 received, 100% packet loss, time 0ms
    

    This is likely due to a virtual interface issue in ec2-net-utils for AL2022: "udev rules configuration incorrectly handles virtual interfaces" (amazonlinux/amazon-ec2-net-utils#67).

  • --- FAIL: TestEFS/test-efs-iam-awsvpc (166.30s). This is the sad-case test for TestEFS/test-efs-rw-awsvpc, so it is expected to fail whenever TestEFS/test-efs-rw-awsvpc fails.

    StoppedReason: "Error response from daemon: create ecs-ecsftest-test-efs-iam-awsvpc-ad6ab14d62cc80293a953381263afeac-4-task-efs-iam-fc9adcaba8c7e1e88e01: Post \"http://%2Frun%2Fdocker%2Fplugins%2Famazon-ecs-volume-plugin.sock/VolumeDriver.Create\": context deadline exceeded"
    

The functional tests passed for a 20220610 AMI I built earlier, but that test AMI has unfortunately been deregistered, as have the base AL2022 AMI and its distribution release.

AMI test:

./bin/ami-tests -domain desktop -realm us-west-2 -platform al2022neu --imageId ami-0cb05523683ed2eb6

passed.

Description for the changelog

AL2022 Neuron Support

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@Realmonia Realmonia marked this pull request as ready for review September 13, 2022 03:26
@Realmonia Realmonia merged commit 4aa6da7 into main Sep 14, 2022
@YashdalfTheGray YashdalfTheGray mentioned this pull request Sep 21, 2022