Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: automate performance benchmarking #2

Closed
wants to merge 15 commits into from

Conversation

galargh
Copy link

@galargh galargh commented May 25, 2023

This PR updates infrastructure setup and creates a GitHub Action workflow for performance testing (libp2p#183).

Not covered (yet)

It doesn't cover the following items from libp2p#183:

They trigger the automation via a GitHub comment.

In this iteration, the automation can be triggered by clicking Run workflow in https://github.com/libp2p/test-plans/actions/workflows/perf.yml.

Push perf/runner/benchmark-results.json to the pull request.

Right now the results are saved to workflow artifacts on GitHub and S3 bucket. Pushing it to the branch the workflow runs on is no problem either. I missed this.

Implementation

Infrastructure

I split the infra into long-lived and short-lived parts. This makes starting/stopping instances in CI easier. I updated the README to reflect the changes I introduced.

Long-lived

Everything that's defined in terraform. The new things I added there:

  • I replaced aws_instance resources with aws_instance_launch_template
  • I added a configuration for S3 bucket (used for storing result JSONs)
  • I added a configuration for a bot user (its credentials are used in CI)
  • I added a configuration for a lambda which periodically checks for running instances (runs every hour, deletes perf EC2 isntances that have been running for more than 50 minutes)

This part of infra is supposed to incur only negligible cost.

Short-lived

It's only EC2 instances. That's it. Everything else can be long-lived. We don't have to wait for it to come up every time. We don't have to worry about deleting it every time.

I added a Makefile which takes care of managing short-lived infra.

GHA

The workflow is defined in the perf.yml. For now, I configured it so that it can be triggered on demand or called from another workflow.

To run the workflow in a repository, additional configuration has to be done - set up of the long-lived infrastructure. I described how to do it exactly in the workflow file - https://github.com/galorgh/test-plans/blob/f5a25258021b36a1851f1c9719b54395a9733ba2/.github/workflows/perf.yml#L3-L12.

Testing

Other

For testing, I set up the long-lived infrastructure in my AWS account. I'd suggest setting it up in the libp2p's AWS account.

You'll be able to trigger the workflow only after it is merged to the default branch.

@mxinden
Copy link
Owner

mxinden commented May 26, 2023

@galargh to make collaboration easier, I moved to an upstream branch. Would you mind re-opening this pull request against libp2p#184, more specifically https://github.com/libp2p/test-plans/tree/perf?

@mxinden
Copy link
Owner

mxinden commented May 26, 2023

Thank you @galargh for this work!

I split the infra into long-lived and short-lived parts.

Where would the .tfstate file of the long-lived terraform-provisioned infrastructure live?

Right now the results are saved to workflow artifacts on GitHub and S3 bucket.

I don't think this is necessary. For the sake of simplicity, I suggest only pushing the results to the branch of the pull request. Gives us a single source-of-truth, fully version controlled.

You'll be able to trigger the workflow only after it is merged to the default branch.

It is necessary to trigger the workflow before merging into the default branch (i.e. master). Otherwise how would one test a release candidate. Are there any larger blockers to triggering it before the merge?

I successfully ran the GHA workflow in my org - https://github.com/galorgh/test-plans/actions/runs/5080167881/attempts/1#summary-13756441047

😎


# How to configure a repository for running this workflow:
# 1. Run 'make ssh-keygen' in 'perf' to generate a new SSH key pair named 'user' in 'perf/terraform/region/files'
# 2. Export AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY for the account of your choice
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# 2. Export AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY for the account of your choice
# 2. Configure your AWS credentials, e.g. by exporting AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY or writing `~/.aws/credentials` for the account of your choice

Either should still work, right?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, of course. I'll link to https://registry.terraform.io/providers/hashicorp/aws/latest/docs#authentication-and-configuration instead which has all the options listed.

Comment on lines +62 to +65
- name: Configure SSH
uses: webfactory/ssh-agent@d4b9b8ff72958532804b70bbe600ad43b36d5f2e # v0.8.0
with:
ssh-private-key: ${{ secrets.PERF_SSH_PRIVATE_KEY }}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess with the move to AWS launch templates there is no easy way for ephemeral SSH keys? Ephemeral keys would eliminate the need to manage long lived credentials. Please ignore in case ephemeral keys would add more complexity.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could keep generating keys on the fly. We still do have to keep managing the long-lived AWS credentials. And if SSH keys are ephemeral, then those AWS creds have to be allowed to create AWS key pairs.

I'll add instructions on what it'd take to switch between the two next to the aws_key_pair resource and we can decide then. I don't feel strongly either way.

Comment on lines +95 to +108
- name: Archive results
uses: actions/upload-artifact@v2
with:
name: results
path: perf/runner/benchmark-results.json
- id: s3
name: Upload results
env:
AWS_BUCKET: ${{ inputs.aws-bucket || vars.PERF_AWS_BUCKET }}
AWS_BUCKET_PATH: ${{ github.repository }}/${{ github.run_id }}/${{ github.run_attempt }}/benchmark-results.json
run: |
aws s3 cp benchmark-results.json s3://$AWS_BUCKET/$AWS_BUCKET_PATH --acl public-read --region us-west-2
echo "url=https://$AWS_BUCKET.s3.amazonaws.com/$AWS_BUCKET_PATH" >> $GITHUB_OUTPUT
working-directory: perf/runner
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think these steps are necessary. See comment above.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure! I'll replace it with a push.

Comment on lines +24 to +26
# https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/serverless-sam-cli-install.html
scale-down:
sam local invoke ScaleDown --template terraform/common/files/scale_down.yml --event terraform/common/files/scale_down.json
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is this Make target called?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used it for testing the lambda code. I'll move this next to the where all the other lambda files are and add a more descriptive description.

@@ -10,21 +10,51 @@ Benchmark results can be visualized with https://observablehq.com/@mxinden-works

## Provision infrastructure

1. `cd terraform`
2. Save your public SSH key as the file `./user.pub`.
### Bootstrap
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
### Bootstrap
### Bootstrap (long-lived resources)

export AWS_SECRET_ACCESS_KEY=$(cat perf_accessKeys.csv | tail -n 1 | cut -d, -f2)
```

### Nodes
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
### Nodes
### Start nodes (short-lived resources)

@mxinden
Copy link
Owner

mxinden commented May 26, 2023

Concerns splitting into long-lived and short-lived

I am not sure the here introduced split into short-lived and long-lived resources is a good idea.

  • It adds complexity. I.e. one now has to reason about two sets of resources.
  • It adds an additional step to testing against a personal AWS account. (Spin up both long-lived and short-lived resources.)
  • Testing against a personal AWS account doesn't require the AWS Lambda.
  • It speeds things up, as everything but the machines no longer need to be spun up per run. Though I don't think the reduced time-to-run matters much for us. As far as I can tell, spinning up the machines is the dominant time factor anyways.
  • As far as I can tell, it introduces the need for a long lived SSH keypair through the AWS launch template.
  • It requires us to persist the .tfstate file for the long-lived resources somewhere. (Though arguably we have to do so for the Lambda anyways.)

Alternative approach

Have two terraform projects.

  1. Again, long-lived, deploying the Lambda only.
  2. Short-lived deploying all other resources, introducing the machines themselves.

Benefits:

  • Allows easy testing against personal AWS account. One does not have to bother with the terraform project launching the Lambda. A single terraform apply is enough.
  • No long-lived SSH credentials.
  • Only necessary to persist the .tfstate file of the terraform project deploying the Lamdba.

@galargh let me know what you think. I am in no way an expert. Ultimate goal is to keep things simple. In case the above just adds complexity, please ignore.

@galargh
Copy link
Author

galargh commented May 26, 2023

@galargh to make collaboration easier, I moved to an upstream branch. Would you mind re-opening this pull request against libp2p#184, more specifically https://github.com/libp2p/test-plans/tree/perf?

Sure! I'll reopen the PR against that branch and I'll address all the non-inline comments there. Closing this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants