- If a task is prefixed with `[Track]`, it means it should be ensured that this task is done, but the folks with the corresponding role are not responsible for doing it themselves.
- Signal:
  - Responsibility for the quality of the release
  - Continuously monitor CI signal, so a release can be cut at any time
  - Add CI signal for new release branches
The goal of this task is to have test coverage for the new release branch and results in testgrid. This task is performed after the new release branch is cut by the release workflow during the final weeks of the release cycle. While we add test coverage for the new release branch, we will also drop the tests for old release branches if necessary. The examples below assume the new release branch is `release-1.8`.
- Create new jobs based on the jobs running against our `main` branch (a sketch of the resulting branch entries follows this list):
  - Copy the `main` branch entry as `release-1.8` in the `cluster-api-prowjob-gen.yaml` file in test-infra.
  - Modify the following at the `release-1.8` branch entry:
    - Change intervals (let's use the same as for `release-1.7`).
- Create a new dashboard for the new branch in `test-infra/config/testgrids/kubernetes/sig-cluster-lifecycle/config.yaml` (`dashboard_groups` and `dashboards`); see the testgrid sketch after this list.
- Remove old release branches and unused versions from the `cluster-api-prowjob-gen.yaml` file in test-infra according to our policy documented in Support and guarantees. As we just added `release-1.8`, we can now drop test coverage for the `release-1.5` branch.
- Regenerate the prowjob configuration by running the `make generate-test-infra-prowjobs` command from the cluster-api repository. Before running this command, ensure to export the `TEST_INFRA_DIR` variable, specifying the location of the test-infra repository in your environment (see the command sketch after this list). For further information, refer to this link.

  `TEST_INFRA_DIR=../../k8s.io/test-infra make generate-test-infra-prowjobs`

- Verify the jobs and dashboards a day later by taking a look at: https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api-1.8
- Update the PR markdown link checker accordingly (e.g. `main` -> `release-1.8`).
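To make step 1 and step 3 above more concrete, here is a rough sketch of how the per-branch entries in `cluster-api-prowjob-gen.yaml` evolve. The field names used below (`branches`, `interval`) are assumptions for illustration only; copy the actual structure from the existing `main` and `release-1.7` entries in test-infra.

```yaml
# Illustrative sketch of cluster-api-prowjob-gen.yaml; field names are
# assumptions - mirror the real main/release-1.7 entries in test-infra.
branches:
  main:
    interval: 2h            # main keeps its existing settings
  release-1.8:              # new entry, copied from main ...
    interval: 6h            # ... with intervals changed to match release-1.7
  release-1.7:
    interval: 6h
  # the release-1.5 entry is removed per the support policy (step 3)
```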
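For step 2, the new dashboard is registered in the testgrid config. A minimal sketch, assuming the usual `dashboard_groups`/`dashboards` layout; copy the exact entry names from the existing `sig-cluster-lifecycle-cluster-api-*` dashboards rather than from here.

```yaml
# Sketch of test-infra/config/testgrids/kubernetes/sig-cluster-lifecycle/config.yaml;
# follow the existing release dashboards for the exact entry names.
dashboard_groups:
- name: cluster-api
  dashboard_names:
  - sig-cluster-lifecycle-cluster-api
  - sig-cluster-lifecycle-cluster-api-1.8   # new dashboard for release-1.8
dashboards:
- name: sig-cluster-lifecycle-cluster-api-1.8
```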
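For step 4, the regeneration boils down to the documented `make` target; the only assumption in the sketch below is where test-infra is cloned, so adjust `TEST_INFRA_DIR` to your environment.

```bash
# Run from the root of the cluster-api checkout.
# Assumes test-infra is cloned at ../../k8s.io/test-infra; adjust as needed.
export TEST_INFRA_DIR=../../k8s.io/test-infra
make generate-test-infra-prowjobs
```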
Prior art:
The goal of the continuous CI signal monitoring task is to keep our tests running in CI stable.
Note: To be very clear, this is not meant to be an on-call role for Cluster API tests.
- Add yourself to the Cluster API alert mailing list <br>Note: An alternative to the alert mailing list is manually monitoring the testgrid dashboards (also dashboards of previous releases). Using the alert mailing list has proven to be a lot less effort though.
- Subscribe to `CI Activity` notifications for the Cluster API repo.
- Check the existing failing-test and flaking-test issue templates under the `.github/ISSUE_TEMPLATE/` folder of the repo, which are used to create issues for failing or flaking tests respectively. Please make sure they are up-to-date and, if not, send a PR to update or improve them.
- Check if there are any existing jobs that got stuck (have been running for more than 12 hours) in a 'pending' state:
  - If that is the case, notify the maintainers and ask them to manually cancel and re-run the stuck jobs.
- Triage CI failures reported by mail alerts or found by monitoring the testgrid dashboards:
  - Create an issue using an appropriate template (failing-test) in the Cluster API repository to surface the CI failure (an optional CLI sketch follows this list).
  - Identify if the issue is a known issue, a new issue, or a regression.
  - Mark the issue as `release-blocking` if applicable.
- Triage periodic GitHub Actions failures, with special attention to image scan results; open issues as described above if necessary.
- Run periodic deep-dive sessions with the CI team to investigate failing and flaking tests. Example session recording: https://www.youtube.com/watch?v=YApWftmiDTg
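Issues for failing or flaking tests are normally opened via the templates in `.github/ISSUE_TEMPLATE/`. As a purely optional convenience, something like the following GitHub CLI invocation can do the same from a terminal; the template file name, the issue title format, and the availability of the `--template` flag in your `gh` version are assumptions, so verify them before relying on this.

```bash
# Hypothetical convenience command; verify the template file name in
# .github/ISSUE_TEMPLATE/ and that your gh version supports --template.
gh issue create \
  --repo kubernetes-sigs/cluster-api \
  --template "failing_test.md" \
  --title "[Failing test] <job name>: <failing test name>"
```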
Note: Maintaining the health of the project is a community effort. The CI team should use all of the tools available to them to attempt to keep the CI signal clean; however, the #cluster-api Slack channel should be used to increase visibility of release-blocking interruptions to the CI signal and to seek help from the community. This should be additive to the steps described above. When in doubt, err on the side of overcommunication to promote awareness and drive disruptions to resolution.
The Cluster API tests are pretty stable, but there are still some flaky tests from time to time.
To reduce the number of flakes, please periodically:
- Take a look at recent CI failures via `k8s-triage`.
- Open issues using an appropriate template (flaking-test) for occurring flakes and ideally fix them or find someone who can.
Note: Given resource limitations in the Prow cluster, it might not be possible to fix all flakes. Let's just try to pragmatically keep the number of flakes pretty low.