ci3 #10775

Open
2 of 29 tasks
ludamad opened this issue Dec 16, 2024 · 0 comments
ludamad commented Dec 16, 2024

Full CI3:

Barretenberg

  • Get rid of barretenberg/cpp/srs_db. Modify code to assume the data is in ~/.bb-crs (in flat format, i.e. a single file with no header). Will save 50s on build and probably some $.
  • Make bb/.js crs download only download what's needed to extend an existing CRS. Will make the process idempotent, and then we don't need to worry about process races.
  • barretenberg bindgen tool should be run as part of build. Currently manually run and committed.
  • Cleanup all the barretenberg codespace vscode stuff (proving systems in root?)
  • Cleanup barretenberg/cpp/scripts.
  • Look into the 2 bbup's. Also bbup needs to release.
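
The "only download what's needed to extend an existing CRS" item can be sketched as a byte-range calculation over the flat file. This is a minimal sketch, not bb's actual code: the function name `crs_missing_range` and the default of 64 bytes per point are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: the function name and the 64-byte point size are assumptions.

# Print the inclusive byte range still missing from a flat CRS file,
# or print nothing when the file already holds enough points.
crs_missing_range() {
  local have_bytes=$1          # current size of the local CRS file in bytes
  local want_points=$2         # number of points the caller needs
  local point_size=${3:-64}    # assumed bytes per point
  local want_bytes=$(( want_points * point_size ))
  if (( have_bytes >= want_bytes )); then
    return 0                   # nothing to fetch, so a rerun is a no-op
  fi
  echo "${have_bytes}-$(( want_bytes - 1 ))"
}

crs_missing_range 64 4   # prints 64-255
```

The printed range is in the form that curl's `--range` option accepts, so a caller could fetch just those bytes and append them to the existing file. When the function prints nothing, the download step can be skipped entirely.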

The rest

  • zstd for compression and cache node_modules with yarn.lock as hash.
  • Push both arches/manifest of the aztecprotocol:dind image so we don't have to build it (or use devcontainer?).
  • Make the memsuspend default a calculation, e.g. 80%.
  • Use a new source_resource script for getting system resources.
  • Run flakes and have owners associated. Slack log link to owner on every flake, but don't fail build.
  • acir_tests benchmark script needs updating.
  • Benchmarks dashboard (tinystats?).
  • Add loud warnings/alerts around unexpectedly long build times. Both arches. e.g. currently we have a painful 10-20m build issue on arm with avm relations/poseidon2.cpp.
  • We may need to expand instance scope when requesting instances. Currently just m6a.32xlarge. But I'd like to see how we get on first. However need to prepare to respond if issues. Probably just loop instance by instance in order of preference. Slow but simple. It's an edge case.
  • All formatting/linting should be part of build, not test. Thus once built and cached, we don't do it again.
  • Improve our rebuild patterns. Support e.g. inversions !some/script/that/should/not/trigger
  • Enable "merge queues" and have it follow a different "masterly" workflow.
  • Delete every Dockerfile, Earthfile, bash script, and Yaml file, that isn't needed, or can be replaced with something simpler.
  • Do a pass over all repository secrets. Remove unneeded. Lock down releases.
  • Make a fast failure with no local changes print a large announcement to alert the CI team.
  • Skipping tests matching test_caches_open|requests in noir tests.
  • Maintain a fleet of spots that will live for up to e.g. 1 hour idle. Any spots request will come from the fleet if available. Will give faster startup times.
  • Put time -v / ulimit on tests and alert/kill on mem usage.
  • TXE should exit with 0 on SIGTERM.
  • Better secret scoping. e.g. only tagged releases should have access to dockerhub key. Tags can only be created by authorized flow (i.e. at least a PR with an approval).
  • Handle redis connection failures gracefully. e.g. at present the test filter, if it fails to find redis, just runs no tests.
  • More log header metadata, link to top level log, arch.
  • Remove starterkit devcontainers. Just have one on the aztec-starter-kit published repo.
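
For the rebuild-pattern inversions item, one possible shape is sketched below. It assumes patterns are extended regexes and that a leading `!` marks an inversion; `needs_rebuild` and the example paths are hypothetical, not the existing ci3 scripts.

```shell
#!/usr/bin/env bash
# Sketch only: needs_rebuild and its pattern syntax are assumptions.

# Read changed file paths on stdin; arguments are extended regexes.
# A pattern with a leading "!" is an inversion: files matching it never
# trigger a rebuild. Exit status 0 means "rebuild", 1 means "no rebuild".
needs_rebuild() {
  local -a include=() exclude=()
  local pattern file skip
  for pattern in "$@"; do
    if [[ $pattern == '!'* ]]; then
      exclude+=("${pattern:1}")
    else
      include+=("$pattern")
    fi
  done
  while IFS= read -r file; do
    skip=0
    for pattern in "${exclude[@]}"; do
      if [[ $file =~ $pattern ]]; then skip=1; fi
    done
    if (( skip )); then continue; fi
    for pattern in "${include[@]}"; do
      if [[ $file =~ $pattern ]]; then return 0; fi
    done
  done
  return 1
}

# Example: a change under scripts/ alone does not trigger a rebuild.
if printf '%s\n' 'barretenberg/cpp/scripts/fmt.sh' |
    needs_rebuild '^barretenberg/cpp/' '!^barretenberg/cpp/scripts/'; then
  echo "rebuild"
else
  echo "no rebuild"
fi
```

Exclusions are checked before inclusions, so an inverted pattern always wins over a broader include for the same file.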
charlielye added a commit that referenced this issue Dec 17, 2024
CI3 is a conceptual goal for uniting the CI flow and the dev flow as
much as possible, adding more depth to the bootstrap and build scripts
to be able to handle our needs.

This PR introduces all the work on CI3 so far, but still has an earthly
caller shell to make sure we can minimize the number of variables that
have changed at once.

There are a lot of changes in this PR.
See https://github.com/AztecProtocol/aztec-packages/pull/10711/files for
a subset of the changes without yarn.lock etc noise.

The big picture:
- The CI build has been made much less stateful. ci.yml now uses the ci3
bootstrap pattern, without fully moving off the earthly targets just
yet.
- The S3 cache mechanism is now the main cache mechanism. Note there is
no persistent disk now supporting the build.
There is a global cache on S3, readable without auth, that keeps entries
for 10 days. We no longer think of the build in terms of docker/buildkit
layers but instead as chunks that have different rebuild patterns that
match files in the monorepo.
- Moving to yarn 4.5.2. 
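
To illustrate the "chunks" idea above, here is a minimal sketch of deriving a cache key from the contents of the files a chunk depends on. `content_hash` is a hypothetical helper, not the script this PR adds, and it assumes GNU coreutils `sha256sum` is available.

```shell
#!/usr/bin/env bash
# Sketch only: content_hash is a hypothetical helper, not part of this PR.
# Assumes GNU coreutils sha256sum is on PATH.

# Print a 16-character key derived from the paths and contents of the
# given files. Paths are sorted first so argument order does not matter.
content_hash() {
  printf '%s\n' "$@" | LC_ALL=C sort | while IFS= read -r path; do
    sha256sum "$path"
  done | sha256sum | cut -c1-16
}
```

Two checkouts with identical files produce the same key, so a chunk's build output could be looked up under that key and the build skipped on a hit. Any edit to a matched file changes the key and forces a rebuild of that chunk only.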

Niceties:
- faster builds due to script improvements and distributed cache
uploading by default
- work is more properly isolated in chunks from the above effort
- spot recovery is implemented, retrying with on-demand
- we no longer use github runners, side-stepping lots of edge-cases, and
instead rely on our builder realizing there is no work to do / hitting a
timeout via shutdown -P
- Docker images are no longer copied from the builder, meaning a large
class of flake is gone.
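
The spot recovery item above could look roughly like the following. This is a sketch under stated assumptions: `request_instance` is a hypothetical stand-in for the real launcher, and `SIMULATE_EVICTION` exists only so the fallback path can be exercised locally.

```shell
#!/usr/bin/env bash
# Sketch only: request_instance and SIMULATE_EVICTION are hypothetical.

# Stand-in for the real instance launcher: "spot" fails when
# SIMULATE_EVICTION=1, everything else succeeds.
request_instance() {
  local kind=$1
  if [[ $kind == spot && ${SIMULATE_EVICTION:-0} == 1 ]]; then
    return 1
  fi
  return 0
}

# Try the job on a spot instance first; on any failure, retry once
# on an on-demand instance.
run_with_spot_fallback() {
  local kind
  for kind in spot on-demand; do
    if request_instance "$kind" "$@"; then
      echo "succeeded on $kind"
      return 0
    fi
    echo "failed on $kind" >&2
  done
  return 1
}

SIMULATE_EVICTION=1 run_with_spot_fallback build   # prints "succeeded on on-demand"
```

Note that this sketch retries on any spot failure, not only on eviction. A real implementation would probably want to tell the two apart so that genuine build failures are not rerun at on-demand prices.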

Non-niceties:
- The earthly setup is much less granular. There are two stages that have
their own one-layer builds. The earthly cache is fairly redundant, as
the S3 cache handles most meaningful caching. (earthly will not be used in
ci.yml in the future)
- Some CI files are now duplicated, we will do a follow-on pass to get
rid of earthly helpers, build-system, etc
- CI currently also downloads the CI image fresh each time, will change
- we are currently pushing images to dockerhub with no expiration,
should move to ECR
- noir-projects currently retries once in the Earthfile as a last minute
issue was hit, will be fixed in a follow-up
- Issues: #10775

---- 
WORKFLOW AFTER THIS PR:
- Run ./bootstrap.sh in root to bootstrap with cache, ./bootstrap.sh
full otherwise
- Run earthly +ci in root to 
- Put ci3 in your cache and note the ways to interact with ci in that
folder
- Note the new commands in ./bootstrap.sh like test-kind-network
- In yarn-project to run a single e2e test now use `test:e2e`. `test`
just runs the unit tests as per other projects.

---------

Co-authored-by: MirandaWood <miranda@aztecprotocol.com>
Co-authored-by: Charlie Lye <karl.lye@gmail.com>
Co-authored-by: Tom French <15848336+TomAFrench@users.noreply.github.com>
ludamad added a commit that referenced this issue Dec 19, 2024
---- 
WORKFLOW AFTER THIS PR:
- Run ./bootstrap.sh in root to bootstrap using the (publicly available)
S3 cache, or ./bootstrap.sh full to force a full build
- Run ./bootstrap.sh ci to test the in-progress 'full CI3' locally
- Run ./ci.sh ec2 to test the in-progress 'full CI3' on an isolated
runner
- For ci2.5, use earthly +ci in root to simulate ci.yml. This shares the
S3 cache. Make sure to alias earthly to scripts/earthly_local.
- Put ci3 in your cache and note the ways to interact with ci in that
folder
- Recommended workflow, as commits are now needed to run earthly or
bootstrap_ec2:
```
# in repo root
./ci.sh draft && git commit -am "blobs work" && git push && earthly +ci
```
Another useful ci.sh command is gha-url, which shows the last github job
associated with your branch.
- Notable commands in ./bootstrap.sh are test-kind-network, test-e2e,
images-e2e
- In yarn-project to run a single e2e test now use `test:e2e`. `test`
just runs the unit tests as per other projects.

---------

Co-authored-by: ludamad <adam.domurad@gmail.com>
Co-authored-by: MirandaWood <miranda@aztecprotocol.com>
Co-authored-by: Tom French <15848336+TomAFrench@users.noreply.github.com>
Co-authored-by: ludamad <domuradical@gmail.com>
AztecBot pushed a commit to AztecProtocol/barretenberg that referenced this issue Dec 20, 2024
charlielye added a commit that referenced this issue Feb 14, 2025
[CI3 introduction.](https://hackmd.io/bTnKHtTHT8mAdTtD0t7JvA?view)

This is a major step towards the vision of CI3; the main piece still
missing is the merge queue.

New features:
- Grinding flakes in master. We run all tests on 5 separate runners to
report on flakes at the source.
- External contributors can now have CI run just by approving their PR. 
- Ability to debug CI entirely from commandline from any machine. Get
dropped into a productive shell right after the CI failure by doing
`./ci.sh ec2` while your PR is a draft (note: do not do this if pushing
to a non-draft PR).
- Add tests to CI by adding tests to bootstrap. Target a rich
environment with no differences from running inside the dev container.
- Releases that are fully dry-runnable and deployable from a single
command. See above hackmd for details.
- Recovery from spot eviction (finally implemented correctly).
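
As a rough illustration of flake grinding, the sketch below runs a command several times and classifies the outcome. `classify_test` is a hypothetical helper, and it repeats runs locally, whereas the PR describes running the tests on 5 separate runners.

```shell
#!/usr/bin/env bash
# Sketch only: classify_test is hypothetical and repeats runs locally;
# the PR itself uses separate runners.

# Usage: classify_test RUNS COMMAND [ARGS...]
# Prints "pass" if every run succeeds, "fail" if none do, else "flake".
classify_test() {
  local runs=$1; shift
  local passed=0 i
  for (( i = 0; i < runs; i++ )); do
    if "$@" >/dev/null 2>&1; then
      passed=$(( passed + 1 ))
    fi
  done
  if (( passed == runs )); then
    echo pass
  elif (( passed == 0 )); then
    echo fail
  else
    echo flake
  fi
}

classify_test 5 true   # prints "pass"
```

A mixed result is what separates a flake from a real failure, which is what would let a flake be reported to its owner without failing the build.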

Some remaining items are tracked here.
#10775

---------

Co-authored-by: ludamad <domuradical@gmail.com>
Co-authored-by: ludamad <adam.domurad@gmail.com>
Co-authored-by: thunkar <gregojquiros@gmail.com>
AztecBot pushed a commit to AztecProtocol/barretenberg that referenced this issue Feb 15, 2025
AztecBot pushed a commit to AztecProtocol/aztec-nr that referenced this issue Feb 15, 2025