[SHIPA-2066] ketch to monitor deployment better #177
Conversation
rm message filter; use recorder
I think the loop in watchDeployEvents looks like a good solution.
Yesterday @DavisFrench merged a PR to send events with annotations; maybe we can annotate events in this PR too, so it'll be easier to consume events in shipa.
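Something along these lines, maybe; a minimal sketch using client-go's AnnotatedEventf, where the annotation keys are placeholders rather than whatever Davis's PR settled on:

// Sketch: attach structured annotations to the event so shipa can consume
// them without parsing the message text. Keys below are placeholders.
annotations := map[string]string{
	"ketch.io/app-name":     app.Name,     // placeholder annotation key
	"ketch.io/process-name": process.Name, // placeholder annotation key
}
recorder.AnnotatedEventf(app, annotations, v1.EventTypeNormal, appReconcileStarted,
	"Updating units [%s]", process.Name)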
"time" | ||
|
||
"github.com/go-logr/logr" | ||
"github.com/pkg/errors" | ||
"helm.sh/helm/v3/pkg/release" | ||
appsv1 "k8s.io/api/apps/v1" | ||
apiv1 "k8s.io/api/core/v1" |
it's already imported as v1
var dep appsv1.Deployment
if err := r.Get(ctx, client.ObjectKey{
	Namespace: framework.Spec.NamespaceName,
	Name:      fmt.Sprintf("%s-%s-%d", app.GetName(), process.Name, len(app.Spec.Deployments)),
Should it be fmt.Sprintf("%s-%s-%d", app.GetName(), process.Name, latestDeployment.Version)?
Ketch generates a k8s Deployment name using the <app-name>-<process-name>-<deploymentVersion> template.
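A rough sketch of the lookup with that change, assuming latestDeployment is the last entry in app.Spec.Deployments (the variable name is mine):

// Sketch: derive the k8s Deployment name from the latest deployment's version
// instead of the slice length.
latestDeployment := app.Spec.Deployments[len(app.Spec.Deployments)-1]
var dep appsv1.Deployment
if err := r.Get(ctx, client.ObjectKey{
	Namespace: framework.Spec.NamespaceName,
	Name:      fmt.Sprintf("%s-%s-%d", app.GetName(), process.Name, latestDeployment.Version),
}, &dep); err != nil {
	return err
}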
Do we need this check?
if dep.Status.ObservedGeneration >= dep.Generation {
	continue
}
Inside watchDeployEvents we already wait while the dep.Status.ObservedGeneration < dep.Generation condition holds, and only then start monitoring things.
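For context, the waiting inside watchDeployEvents looks roughly like this; a sketch based on the shipa code linked in the description, not the exact ketch implementation:

// Sketch: wait until the deployment controller has observed the latest spec
// before starting to monitor rollout progress.
for dep.Status.ObservedGeneration < dep.Generation {
	select {
	case <-time.After(100 * time.Millisecond):
	case <-ctx.Done():
		return ctx.Err()
	}
	d, err := cli.AppsV1().Deployments(namespace).Get(ctx, dep.Name, metav1.GetOptions{})
	if err != nil {
		return errors.Wrap(err, "failed to get deployment")
	}
	dep = d
}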
Generation check removed.
if dep.Status.ObservedGeneration >= dep.Generation {
	continue
}
err = watchDeployEvents(ctx, app, framework.Spec.NamespaceName, &dep, &process, r.Recorder)
running this function in the app reconciler's goroutine blocks all other deployments.
We can either run it in a dedicated goroutine or set MaxConcurrentReconciles
to something more suitable.
func (r *AppReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&ketchv1.App{}).
		WithOptions(controller.Options{MaxConcurrentReconciles: 10}).
		Complete(r)
}
Moreover, ketch doesn't use ObservedGeneration and Generation.
A nice write-up about it: https://alenkacz.medium.com/kubernetes-operator-best-practices-implementing-observedgeneration-250728868792
When ketch starts up, it goes through all apps and updates their states, meaning it'll run this function for all apps.
I modified the MaxConcurrentReconciles
}

opts := listOptsForPodEvent(app)
opts.Watch = true
I think we can use opts.ResourceVersion here
opts := listOptsForPodEvent(app)
opts.Watch = true
opts.ResourceVersion = app.ResourceVersion
https://kubernetes.io/docs/reference/using-api/api-concepts/#resource-versions
defer func() {
	watch.Stop()
	if watchCh != nil {
		// Drain watch channel to avoid goroutine leaks.
Not sure I get it. Why can't we go with:
watch, err := cli.CoreV1().Events(namespace).Watch(ctx, opts)
if err != nil {
	return err
}
defer watch.Stop()
loop:
for {
	select {
	case <-time.After(100 * time.Millisecond):
	case msg, isOpen := <-watch.ResultChan():
		if !isOpen {
			break loop // labeled break so we exit the loop, not just the select
		}
		_ = msg // ... process the event ...
	}
}
Me neither. I added the deferred watch.Stop().
// allNewPodsRunning returns true if a list of pods contains the same number of running pods with <app>-<process>-<deploymentVersion> as the
// process.Units requires.
func allNewPodsRunning(ctx context.Context, cli *kubernetes.Clientset, app *ketchv1.App, process *ketchv1.ProcessSpec, depRevision string) (bool, error) {
	pods, err := cli.CoreV1().Pods(app.GetNamespace()).List(ctx, listOptsForPodEvent(app))
When we update a k8s Deployment, the k8s controller creates a new ReplicaSet. The previous ReplicaSet starts removing pods, the new one starts creating pods.
We are interested in new pods, right?
idk, maybe there is another way to get them, but here's a working solution (sketched in Go right after this list):
1. get the k8s Deployment's deployment.kubernetes.io/revision annotation
2. find a ReplicaSet that has the same annotation with the same value
3. find all pods that have a link in their ownerReference list pointing to the ReplicaSet from step 2.
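Roughly what I have in mind, as a sketch; the helper name podsForCurrentRevision is made up and the usual client-go/apimachinery imports are assumed:

// Sketch of the revision/ownerReference approach described above.
func podsForCurrentRevision(ctx context.Context, cli kubernetes.Interface, dep *appsv1.Deployment) ([]apiv1.Pod, error) {
	// Step 1: read the deployment's revision annotation.
	revision := dep.Annotations["deployment.kubernetes.io/revision"]

	// Step 2: find the ReplicaSet owned by this deployment with the same revision.
	rsList, err := cli.AppsV1().ReplicaSets(dep.Namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	var current *appsv1.ReplicaSet
	for i := range rsList.Items {
		rs := &rsList.Items[i]
		if metav1.IsControlledBy(rs, dep) && rs.Annotations["deployment.kubernetes.io/revision"] == revision {
			current = rs
			break
		}
	}
	if current == nil {
		return nil, nil // the new ReplicaSet may not exist yet
	}

	// Step 3: keep only pods whose ownerReference points at that ReplicaSet.
	podList, err := cli.CoreV1().Pods(dep.Namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	var pods []apiv1.Pod
	for i := range podList.Items {
		if metav1.IsControlledBy(&podList.Items[i], current) {
			pods = append(pods, podList.Items[i])
		}
	}
	return pods, nil
}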
More about ownerReference: https://kubernetes.io/docs/concepts/overview/working-with-objects/owners-dependents/
I removed this func and used the existing checkPodStatus.
opts := metav1.ListOptions{
	FieldSelector: "involvedObject.kind=Pod",
}
pods, err := cli.CoreV1().Pods(app.GetNamespace()).List(ctx, opts)
it looks like we get all pods here even ones not related to the app.
Good catch. I added a LabelSelector to limit by AppName.
and it might be that we need to check the pods of the current deployment's revision.
A use-case: as a user I deployed an application and it is failing now, several of the app's pods are not ready. I have built a new image and am deploying it right now.
I updated the label selector to LabelSelector: fmt.Sprintf("%s/app-name=%s,%s/app-deployment-version=%d", group, app.Name, group, deploymentVersion), so it also limits by deployment version. Is this what you're thinking?
eventsInterface := cli.CoreV1().Events(namespace)
selector := eventsInterface.GetFieldSelector(&pod.Name, &namespace, nil, nil)
options := metav1.ListOptions{FieldSelector: selector.String()}
events, err := eventsInterface.List(context.TODO(), options)
Should we use the passed-in ctx?
Good call. Changed.
deadlineExeceededProgressCond = "ProgressDeadlineExceeded"
DefaultPodRunningTimeout      = 10 * time.Minute
maxWaitTimeDuration           = time.Duration(120) * time.Second
maxConcurrentReconciles       = 10
so we can deploy only 10 apps simultaneously?
I removed this and opted to run all of the watch in a goroutine to prevent blocking.
config, err := GetRESTConfig()
if err != nil {
	return err
}
How about creating a new field in AppReconciler and using that instead of the in-cluster config getter?
type AppReconciler struct {
	...
	Config *restclient.Config
}
and instantiate it in main with
AppReconciler{
	Config: ctrl.GetConfigOrDie(),
}
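If we go that route, the reconciler can then build a typed clientset from the injected config wherever it needs one; a minimal sketch:

// Sketch: build a clientset from the injected REST config instead of calling
// an in-cluster config getter inside the reconcile loop.
cli, err := kubernetes.NewForConfig(r.Config)
if err != nil {
	return err
}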
There is also a way given here:
https://github.com/shipa-corp/ketch/blob/master/cmd/ketch/configuration/configuration.go#L93-L96
This would introduce a third way to initialize the config; do we want that?
Good call. I utilized the existing function in the config and initialized it in main as you suggest.
message: fmt.Sprintf("failed to get deployment: %v", err),
}
}
err = r.watchDeployEvents(ctx, app, framework.Spec.NamespaceName, &dep, &process, r.Recorder)
Is this line blocking? Meaning, should we watch processes in parallel, or with this do we wait for the first process, then monitor the second, and so on?
I believe it should be parallel. What do you guys think?
Good call. I put all of the Event watch code in a goroutine. I was struggling a lot with an issue where the reconciler would complete before creating all events, but realized it was fixable with a requeue parameter rather than blocking everything.
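Roughly what that looks like in the reconciler; a sketch only, where the exact wiring of watchFunc's arguments matches the call shown further down and the requeue interval is an arbitrary value I picked for illustration:

// Sketch: start watching in the background, then ask controller-runtime to
// requeue so the reconciler revisits the app instead of blocking until the
// rollout completes.
go r.watchFunc(ctx, app, namespace, dep, process, recorder, watcher, cli, timeout)
return ctrl.Result{RequeueAfter: 10 * time.Second}, nil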
}

opts := metav1.ListOptions{
	FieldSelector: "involvedObject.kind=Pod",
should it be for the given app?
I'm not sure there is a way to filter Pod events by App, since events don't have labels. isDeploymentEvent filters them by app name prefix before emitting them as App Events.
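Something like this is the shape I mean; a sketch of the prefix check, not necessarily the exact isDeploymentEvent implementation in the diff:

// Sketch: keep only events whose involved object name carries the
// <app>-<process>- prefix of this deployment.
func isDeploymentEvent(msg watch.Event, depName string) bool {
	evt, ok := msg.Object.(*apiv1.Event)
	if !ok {
		return false
	}
	return strings.HasPrefix(evt.InvolvedObject.Name, depName)
}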
oldUpdatedReplicas := int32(-1)
oldReadyUnits := int32(-1)
oldPendingTermination := int32(-1)
Is this always -1? Should we read the old status instead?
It gets set to pendingTermination each cycle here: https://github.com/theketchio/ketch/pull/177/files#diff-01570fd749623ed07d5f2e0b7097495a4ef86b6f3419378180bc43fc73c9223eR443. I'm okay with changing it, but was trying to keep output similar to Shipa's: https://github.com/shipa-corp/shipa/blob/master/provision/kubernetes/app_manager.go#L445
}
}()

for {
Will it always break? Do we always either get an error, or is the following condition always met?
readyUnits == specReplicas &&
	dep.Status.Replicas == specReplicas
Should we have a timeout?
Good call. I re-added the bit that checks for timeouts.
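For reference, the timeout handling is roughly this shape; a sketch where the error message and polling interval are assumptions, and the success check that breaks the loop is elided:

// Sketch: give the rollout a deadline instead of looping forever. The timeout
// channel is the <-chan time.Time passed into watchFunc, created from
// DefaultPodRunningTimeout.
for {
	select {
	case <-timeout:
		return errors.New("timeout waiting for deployment to finish")
	case <-ctx.Done():
		return ctx.Err()
	case <-time.After(100 * time.Millisecond):
		// fall through and re-check the deployment status below
	}
	// ... re-fetch dep and return once readyUnits == specReplicas &&
	// dep.Status.Replicas == specReplicas ...
}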
}

// stringifyEvent accepts an event and returns relevant details as a string
func stringifyEvent(watchEvent watch.Event) string {
Maybe using annotations, similar to CanaryEvents, is a better solution. What do you think?
That sounds good. I removed this and added an AppDeploymentEvent type that includes annotations and String(), similar to the Canary work. Hoping it's close to what you are expecting.
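In case it helps review, the shape is roughly this; field names and annotation keys here are approximations of what ended up in the diff, not a verbatim copy:

// Sketch of the AppDeploymentEvent shape and its constructor.
type AppDeploymentEvent struct {
	Reason      string
	Description string
	Annotations map[string]string
}

func newAppDeploymentEvent(app *ketchv1.App, reason, desc, processName string) *AppDeploymentEvent {
	return &AppDeploymentEvent{
		Reason:      reason,
		Description: desc,
		Annotations: map[string]string{
			"app-name":           app.Name, // placeholder keys
			"process-name":       processName,
			"deployment-version": fmt.Sprintf("%d", len(app.Spec.Deployments)),
		},
	}
}

It gets consumed the way the later snippet shows: recorder.AnnotatedEventf(app, event.Annotations, v1.EventTypeNormal, event.Reason, event.Description).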
internal/controllers/k8s_config.go (Outdated)
// GetRESTConfig returns a rest.Config. It uses the presence of KUBERNETES_SERVICE_HOST
// to determine whether to use an InClusterConfig or the user's config.
func GetRESTConfig() (*rest.Config, error) {
There is a nice ctrl.GetConfigOrDie() function.
Yep. I removed this whole file as it's not used anymore.
internal/controllers/k8s_config.go (Outdated)
package controllers

import (
	"os"
	"path/filepath"

	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/clientcmd"
)

// GetRESTConfig returns a rest.Config. It uses the presence of KUBERNETES_SERVICE_HOST
// to determine whether to use an InClusterConfig or the user's config.
func GetRESTConfig() (*rest.Config, error) {
	if os.Getenv("KUBERNETES_SERVICE_HOST") == "" {
		return externalConfig()
	}
	return rest.InClusterConfig()
}

// externalConfig returns a REST config to be run external to the cluster, e.g. testing locally.
func externalConfig() (*rest.Config, error) {
	home, err := os.UserHomeDir()
	if err != nil {
		return nil, err
	}

	configStr := filepath.Join(home, ".kube", "config")
	return clientcmd.BuildConfigFromFlags("", configStr)
}
If you load the config from ctrl, we don't need this file anymore, right?
Yep. Removed this dead code.
}
}

go r.watchFunc(ctx, app, namespace, dep, process, recorder, watcher, cli, timeout)
Do processes share a watcher? We instantiate a watcher for each process, but from the code I'd say we get the same events for each process, we just print them differently ("Updating units [%s]", process.Name).
How do we know that certain events belong to a certain process?
Each process has its own watcher (is that a good idea 🤷). The watcher does 2 things: 1) watch Pod Events and 2) keep checking the appsv1.Deployment for updates.
- The Pod Events are filtered by process here because the appsv1.Deployment name includes the process name. An appsv1.Deployment is 1-to-1 with a Ketch Process (which confuses because Ketch has Deployments too).
- The appsv1.Deployments that we keep checking are also filtered by appsv1.Deployment.Name, e.g. here, which is the Ketch Process Name.
At least, I think that's what's going on. I'm new here.
func (r *AppReconciler) watchFunc(ctx context.Context, app *ketchv1.App, namespace string, dep *appsv1.Deployment, process *ketchv1.ProcessSpec, recorder record.EventRecorder, watcher watch.Interface, cli kubernetes.Interface, timeout <-chan time.Time) error {
	var err error
	watchCh := watcher.ResultChan()
	// recorder.Eventf(app, v1.EventTypeNormal, appReconcileStarted, "Updating units [%s]", process.Name)
Can we remove this commented-out line?
}
}

func (a *AppDeploymentEvent) String() string {
do we need this function if we use annotations?
Good idea. I removed this func and moved the string presentation logic into the corresponding shipa PR: https://github.com/shipa-corp/shipa/pull/1028
…od-getter; improves annotations
awesome job
@@ -319,7 +320,8 @@ func TestNewApplicationChart(t *testing.T) {
 	Version: "0.0.1",
 	AppName: tt.application.Name,
 }
-client := HelmClient{cfg: &action.Configuration{KubeClient: &fake.PrintingKubeClient{}, Releases: storage.Init(driver.NewMemory())}, namespace: tt.framework.Spec.NamespaceName}
+client := HelmClient{cfg: &action.Configuration{KubeClient: &fake.PrintingKubeClient{}, Releases: storage.Init(driver.NewMemory())}, namespace: tt.framework.Spec.NamespaceName, c: clientfake.NewClientBuilder().Build()}
This test was panicking without a client.
reconcileStartedEvent := newAppDeploymentEvent(app, ketchv1.AppReconcileStarted, fmt.Sprintf("Updating units [%s]", process.Name), process.Name)
recorder.AnnotatedEventf(app, reconcileStartedEvent.Annotations, v1.EventTypeNormal, reconcileStartedEvent.Reason, reconcileStartedEvent.Description)
go r.watchFunc(ctx, app, namespace, dep, process.Name, recorder, watcher, cli, timeout, watcher.Stop)
Watch asynchronously.
looks good!
Description
A lot of this was borrowed from https://github.com/shipa-corp/shipa/blob/master/provision/kubernetes/app_manager.go#L404, as ticket 2066 indicates. Essentially, this change creates an event watcher and repeatedly checks deployments for updates. It then records new app events with relevant info.
Fixes #2066
Permits Shipa to generate the output shown below during an App Deployment using the Ketch provisioner. Corresponding shipa PR - we may want to remove the AppReconcileOutcome in the Shipa PR.
To use, make docker-build and docker-push this branch, run shipa specifying this new docker image for ketch in the shipa.yaml, and create & deploy a ketch-provisioned App.
Type of change
Testing
Documentation
Final Checklist: