Add a KEP at sig-testing: continuously deploy k8s prow
chaodaiG committed Feb 23, 2021
1 parent 20b2158 commit 30faaa1
Showing 2 changed files with 175 additions and 0 deletions.
157 changes: 157 additions & 0 deletions keps/sig-testing/2539-continuously-deploy-k8s-prow/README.md
@@ -0,0 +1,157 @@
# KEP-2539: Continuously Deploy K8s Prow

<!-- toc -->
- [KEP-2539: Continuously Deploy K8s Prow](#kep-2539-continuously-deploy-k8s-prow)
  - [Release Signoff Checklist](#release-signoff-checklist)
  - [Summary](#summary)
  - [Motivation](#motivation)
    - [Goals](#goals)
      - [Prow Users](#prow-users)
      - [Prow Oncall](#prow-oncall)
    - [Non-Goals](#non-goals)
  - [Proposal](#proposal)
    - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
      - [Breaking Changes in Prow](#breaking-changes-in-prow)
  - [Design Details](#design-details)
    - [Automated Merging of Prow Autobump PRs](#automated-merging-of-prow-autobump-prs)
    - [Roll Back Process](#roll-back-process)
  - [Implementation History](#implementation-history)
  - [Alternatives](#alternatives)
    - [A new tool merges autobump PRs](#a-new-tool-merges-autobump-prs)
      - [Pros:](#pros)
      - [Cons:](#cons)
<!-- /toc -->

## Release Signoff Checklist

Items marked with (R) are required *prior to targeting to a milestone / release*.

- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
- [ ] (R) Design details are appropriately documented
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- [ ] (R) Graduation criteria is in place
- [ ] (R) Production readiness review completed
- [ ] (R) Production readiness review approved
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes


[kubernetes.io]: https://kubernetes.io/
[kubernetes/enhancements]: https://git.k8s.io/enhancements
[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
[kubernetes/website]: https://git.k8s.io/website

## Summary

This document proposes changing the deployment of k8s prow from a manual process to a continuous, automated one.

## Motivation

Currently, deploying k8s prow consists of the following steps:

1. Updates made to prow are released as container images.
1. An automated process creates PRs that update the prow image tags.
1. Oncall inspects prow logs to make sure prow is safe to upgrade.
1. Oncall inspects the PRs created in step 2, approves them, and posts a message on Slack.
1. Oncall waits for step 4 to complete and for a postsubmit job to apply the changes to the prow cluster.
1. Once step 5 is done, oncall performs several manual inspections to make sure prow works.

This is a very time-consuming process (roughly 3 hours per week for oncall). In particular, the context switching between the waits makes it hard for oncall to focus on their day job. It is therefore desirable to streamline the process so that we can save time.

Historically, the biggest reason the manual steps were introduced was:

- Errors in prow were not easily discoverable; most of the time they were reported by prow users, which was bad.

This problem has been largely solved by the introduction of prow monitoring and alerting, built on the Grafana, Prometheus, and Prometheus Alertmanager stack. Based on our experience in the past quarter, we haven’t had a single case where prow errors were discovered by humans earlier than by prow alerts. This gives us reasonable confidence to proceed with continuous delivery, on the assumption that the following hold:

- Prow is stable as long as there is no alert. (This means there is no need to inspect prow logs before bumping.)
- Errors caused by prow upgrades are discovered by prow alerts in a timely manner.


### Goals

The proposed change switches from daily manual deployment to hourly automated deployment.

#### Prow Users

Prow users shouldn’t see any change. Prow breakage should be discovered by the prow monitoring system, and a rollback will be performed. The chance of prow breaking is almost identical to what we have today (assuming there is no more than one breaking change per day).

#### Prow Oncall

- What’s Not Changed
  - React to prow alerts and take actions.
- What’s Changed
  - No more manual inspection of prow health.
  - No more manual lgtm/approve/retest of autobump PRs.
  - No more manual Slack posting.


### Non-Goals

Change how prow is released.


## Proposal

Prow autobump PRs are merged automatically every hour, but only during working hours on working days.
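
As an illustration only, the scheduling rule above can be sketched as a small predicate. The working-hours window (09:00 to 17:00) and the names `WORK_START_HOUR`, `WORK_END_HOUR`, and `may_merge_bump` are assumptions made for this sketch; the KEP does not define them, and the real schedule would live in the autobump job configuration.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumed values for illustration; the KEP does not define working hours.
WORK_START_HOUR = 9
WORK_END_HOUR = 17
MIN_INTERVAL = timedelta(hours=1)


def may_merge_bump(now: datetime, last_merge: Optional[datetime]) -> bool:
    """Return True if an autobump PR may be merged at `now`."""
    # Monday is 0 and Sunday is 6, so 5 and 6 are the weekend.
    if now.weekday() >= 5:
        return False
    # Only merge inside the assumed working-hours window.
    if not (WORK_START_HOUR <= now.hour < WORK_END_HOUR):
        return False
    # Never merge more often than once per hour.
    if last_merge is not None and now - last_merge < MIN_INTERVAL:
        return False
    return True
```

For example, a bump at 10:30 on a Tuesday is rejected if the previous merge happened at 10:00, and accepted again from 11:00 onward.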

### Notes/Constraints/Caveats (Optional)

#### Breaking Changes in Prow
Breaking changes in prow will require manual intervention. Currently prow isn’t able to handle these intelligently: it was not designed with API versioning in mind, so a kubernetes conversion webhook cannot help with breaking changes across major API versions.
One possible way of dealing with breaking changes is:
- Prow oncall inspects prow logs and breaking-change announcements once per week, and takes action based on deprecation warnings in the prow logs and breaking changes listed in ANNOUNCEMENTS.md.
- [Stretch Goal][Push left] Discover breaking changes, especially config or flag changes, in the prow integration test. (This requires the prow integration test to use the same set of deployment configs as prod.)
- [Optional] (This is not very reliable.) Prow TLs inspect new PRs, manually identify possible breaking changes, and inform oncall. Either a prow TL or oncall can take a deeper look at the new PR and decide whether to take action.

## Design Details

#### Automated Merging of Prow Autobump PRs

- The prow autobump job is already configured to run on work days only. Change its schedule so that runs are at least one hour apart, which ensures it doesn’t bump more often than once per hour.
- Tide blindly trusts PRs from the bot that does the autobump, merging a PR as long as all tests pass. Flaky test failures will be covered by later autobump runs, which push new changes automatically.

This approach uses the tide auto-merge feature, so there is no need to worry about repo requirements such as needing more than one approver.
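
The merge rule described above can be sketched as follows. This is a simplified model, not tide's actual implementation: the `PullRequest` type, the `should_auto_merge` function, and the bot account name `autobump-bot` are hypothetical. The check for the `do-not-merge/hold` label is included because the roll back process relies on `/hold` to stop auto-deployment.

```python
from dataclasses import dataclass, field
from typing import Dict, Set

TRUSTED_BOT = "autobump-bot"  # hypothetical account name, for illustration


@dataclass
class PullRequest:
    author: str
    test_results: Dict[str, bool]  # test name -> whether it passed
    labels: Set[str] = field(default_factory=set)


def should_auto_merge(pr: PullRequest) -> bool:
    """Model of the rule: trust the bot, require all tests to pass."""
    # Only PRs from the autobump bot are trusted blindly.
    if pr.author != TRUSTED_BOT:
        return False
    # A `/hold` comment adds this label and must block the merge.
    if "do-not-merge/hold" in pr.labels:
        return False
    # Require at least one test result, and every result to be a pass.
    return bool(pr.test_results) and all(pr.test_results.values())
```

A PR that fails because of a flaky test is simply not merged in this model; a later autobump run gets another chance.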

```
<<[UNRESOLVED (spiffxp) ]>>
Suggestion: annotate prow job with prow versions.
<<[/UNRESOLVED]>>
```


```
<<[UNRESOLVED (spiffxp) ]>>
Suggestion: how to keep slack reports on each automated bump.
<<[/UNRESOLVED]>>
```

#### Roll Back Process

When prow stops functioning after a bump, prow oncall should:
- Stop auto-deployment by commenting `/hold` on the latest autobump PR.
- Manually create a rollback PR that rolls back to a known good version.
- Manually apply the changes from the rollback PR.

```
<<[UNRESOLVED]>>
Which version to roll back to. This is generally not a problem due to the low release volume of prow. @alvaroaleman suggested 6-hour intervals.
<<[/UNRESOLVED]>>
```
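
Since the choice of rollback version is still unresolved, the following is only one possible heuristic, not a decision made by this KEP: treat the version that was live when the first alert fired as the suspect, and roll back to the version deployed immediately before it. The `Deployment` type, the `rollback_target` function, and the idea of a recorded deployment history are assumptions made for this sketch.

```python
from datetime import datetime
from typing import List, Optional, Tuple

# (image tag, time it was deployed); a hypothetical record format.
Deployment = Tuple[str, datetime]


def rollback_target(history: List[Deployment],
                    first_alert: datetime) -> Optional[str]:
    """Pick a rollback candidate from `history`, sorted oldest first."""
    # Keep only versions that were deployed before the alert fired.
    deployed = [tag for tag, at in history if at <= first_alert]
    # The last of these is the suspect; the one before it is the candidate.
    if len(deployed) < 2:
        return None  # no earlier version is known, so oncall must decide
    return deployed[-2]
```

With hourly bumps, several versions may have been deployed between the breaking change and the first alert, so oncall would still need to confirm that the candidate is actually a known good version.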

## Implementation History


## Alternatives


#### A new tool merges autobump PRs
This method is independent of tide, which ensures it works on every prow instance.

##### Pros:
Does not rely on tide, so it works well with prow instances that don't have tide.

##### Cons:
Would probably have significantly divergent code paths for finding and approving PRs on Gerrit versus PRs on GitHub.
18 changes: 18 additions & 0 deletions keps/sig-testing/2539-continuously-deploy-k8s-prow/kep.yaml
@@ -0,0 +1,18 @@
title: Continuously Deploy K8s Prow
kep-number: 2539
authors:
- "@chaodaiG"
owning-sig: sig-testing
participating-sigs:
- sig-testing
- sig-release
status: provisional
creation-date: 2021-02-23
reviewers:
- "@spiffxp" # Sig-testing chair
- "@ameukam" # Sig-release chair
- "@alvaroaleman" # Prow approver
approvers:
- "@spiffxp" # Sig-testing chair
- "@ameukam" # Sig-release chair
- "@alvaroaleman" # Prow approver
