🕵️ Introduce lib for gardener-node-agent (gardener#8249)
* Introduce lib for gardener-node-agent gardener#8023

* Apply suggestions from code review

Co-authored-by: Johannes Scheerer <johannes.scheerer@sap.com>

* Revisit first parts of the node agent concept

* Rephrase reason comparision

* Speed benefits mostly in large clusters

* Remove commented imports

* refactor(nodeagent): rename extractTarGz

* fix(nodeagent): pick newest file from layers

* fix(nodeagent): dropped projected info for token

That the token is projected doesn't matter.

* fix(nodeagent): removed empty tests

Nothing to test here

* fix(nodeagent): mirror v1alpha1 changes

* refactor(nodeagent): dbus logs, events and naming

* feat(nodeagent): validate for supported kubernetesversion

* feat(nodeagent): improved coverage for config validation

* revert(nodeagent): fake dbus tests did not provide any value

* docs(nodeagent): fix config registration docs

* fix(docs): reorder the basic design and postpone installation

* docs(nodeagent): binary path

* docs(nodeagent): be more explicit between cloud config and osc

* docs(nodeagent): link operatingsystemconfig extension

* docs(nodeagent): future development section

Removes the TODO inside the Scalability section and appends it in a separate section.

* fix(codegen): generate nodeagent

* fix(nodeagent): fix checks

* Update pkg/nodeagent/apis/config/validation/validation_test.go

Co-authored-by: Oliver Götz <47362717+oliver-goetz@users.noreply.github.com>

* docs(nodeagent): rename architecture svg

* docs(nodeagent): improved wording

* fix(nodeagent): camel cased validation

* fix(nodeagent): wording

* docs(nodeagent): prefer `kubelet`

* docs(nodeagent): `kube-apiserver`

* fix(nodeagent): validation test specs

* fix(dbus): remove empty suite

* refactor(dbus): typo and formatting

* fix(nodeagent): extract secure from remote

* Apply suggestions from code review

Co-authored-by: Rafael Franzke <rafael.franzke@sap.com>

* docs(nodeagent): rephrase gardener community

* docs(nodeagent): remove mentioning of supported archs

* refactor(nodeagent): rename api types

* fix(nodeagent): lowercase kubelet data volume size

* refactor(nodeagent): validation naming and formatting

* refactor(nodeagent): binary cabundle

* refactor(nodeagent): use semver for Kubernetes Version

* fix(nodeagent): remove unused fake dbus

Currently it is unused. In an upcoming PR it will be reintroduced by future controllers.

* Apply suggestions from code review

Co-authored-by: Johannes Scheerer <johannes.scheerer@sap.com>

* tmp: delete controller-registration to regenerate

* chore: generate

* docs(nodeagent): corrected architecture diagram

* revert: controller registration due to tar incompatabilities

* docs(nodeagent): wording architecture diagram

* fix(generate): add trailing newline for controller registration

* feat(nodeagent): test registry extraction

* fix(nodeagent): lint

* Update pkg/nodeagent/apis/config/validation/validation.go

Co-authored-by: Johannes Scheerer <johannes.scheerer@sap.com>

* PR review feedback

---------

Co-authored-by: Johannes Scheerer <johannes.scheerer@sap.com>
Co-authored-by: Oliver Götz <47362717+oliver-goetz@users.noreply.github.com>
Co-authored-by: Rafael Franzke <rafael.franzke@sap.com>
4 people authored Oct 4, 2023
1 parent ba583aa commit b485706
Showing 375 changed files with 71,558 additions and 462 deletions.
1 change: 1 addition & 0 deletions docs/README.md
@@ -17,6 +17,7 @@
* [Gardener Admission Controller](concepts/admission-controller.md)
* [Gardener Resource Manager](concepts/resource-manager.md)
* [Gardener Operator](concepts/operator.md)
* [Gardener Node Agent](concepts/node-agent.md)
* [Gardenlet](concepts/gardenlet.md)
* [Backup Restore](concepts/backup-restore.md)
* [etcd](concepts/etcd.md)
310 changes: 310 additions & 0 deletions docs/concepts/images/gardener-nodeagent-architecture.svg
63 changes: 63 additions & 0 deletions docs/concepts/node-agent.md
@@ -0,0 +1,63 @@
# Gardener Node Agent

The goal of the `gardener-node-agent` is to bootstrap a machine into a worker node and maintain node-specific components, which run on the node and are unmanaged by Kubernetes (e.g. the `kubelet` service, systemd units, ...).

It effectively is a Kubernetes controller deployed onto the worker node.

## Architecture and Basic Design

![Design](./images/gardener-nodeagent-architecture.svg)

This figure visualizes the overall architecture of the `gardener-node-agent`. On the left side, it starts with an [`OperatingSystemConfig`](../extensions/operatingsystemconfig.md) resource (`OSC`) with a corresponding worker pool specific `cloud-config-<worker-pool>` secret being passed by reference through the userdata to a machine by the `machine-controller-manager` (MCM).

On the right side, the `gardener-node-agent` extracts and uses the `cloud-config` secret once it has been installed. Details on this can be found in the next section.

Finally, the `gardener-node-agent` runs as a systemd service that watches secret resources located in the `kube-system` namespace, such as the `cloud-config` secret that contains the `OperatingSystemConfig`. When the `gardener-node-agent` applies the OSC, it installs the `kubelet` and its configuration on the worker node.

## Installation and Bootstrapping

This section describes how the `gardener-node-agent` is initially installed onto the worker node.

In the beginning, there is a very small bash script called [`gardener-node-init.sh`](../../pkg/component/extensions/operatingsystemconfig/original/components/containerd/templates/scripts/init.tpl.sh), which will be copied to `/var/lib/gardener-node-agent/gardener-node-init.sh` on the node with cloud-init data. This script's sole purpose is downloading and starting the `gardener-node-agent`. The binary artifact is extracted from an [OCI artifact](https://github.com/opencontainers/image-spec/blob/main/manifest.md) and lives at `/usr/local/bin/gardener-node-agent`. The `kubelet` should also be contained in the same OCI artifact.
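
As a rough illustration of the extraction step, the following sketch pulls a single file out of a gzip-compressed tar stream, which is the on-disk format of a typical OCI image layer. It uses only the Go standard library. The names `extractFile` and `buildLayer` and the entry names inside the layer are assumptions made for this example, pulling the layer from a registry is left out, and this is not the implementation introduced by this commit.

```go
package main

import (
	"archive/tar"
	"bytes"
	"compress/gzip"
	"errors"
	"fmt"
	"io"
)

// extractFile scans a gzip-compressed tar stream and returns the content of
// the regular file whose name equals wanted. (Hypothetical helper.)
func extractFile(layer io.Reader, wanted string) ([]byte, error) {
	gz, err := gzip.NewReader(layer)
	if err != nil {
		return nil, fmt.Errorf("opening gzip stream: %w", err)
	}
	defer gz.Close()

	tr := tar.NewReader(gz)
	for {
		hdr, err := tr.Next()
		if errors.Is(err, io.EOF) {
			return nil, fmt.Errorf("file %q not found in layer", wanted)
		}
		if err != nil {
			return nil, fmt.Errorf("reading tar entry: %w", err)
		}
		if hdr.Typeflag == tar.TypeReg && hdr.Name == wanted {
			return io.ReadAll(tr)
		}
	}
}

// buildLayer creates a small in-memory layer so the sketch can run without a
// registry. The file names and contents are made up.
func buildLayer(files map[string]string) (*bytes.Buffer, error) {
	buf := &bytes.Buffer{}
	gz := gzip.NewWriter(buf)
	tw := tar.NewWriter(gz)
	for name, content := range files {
		hdr := &tar.Header{Name: name, Mode: 0o755, Size: int64(len(content)), Typeflag: tar.TypeReg}
		if err := tw.WriteHeader(hdr); err != nil {
			return nil, err
		}
		if _, err := tw.Write([]byte(content)); err != nil {
			return nil, err
		}
	}
	if err := tw.Close(); err != nil {
		return nil, err
	}
	if err := gz.Close(); err != nil {
		return nil, err
	}
	return buf, nil
}

func main() {
	layer, err := buildLayer(map[string]string{
		"gardener-node-agent": "fake-binary",
		"kubelet":             "fake-kubelet",
	})
	if err != nil {
		panic(err)
	}
	data, err := extractFile(layer, "gardener-node-agent")
	if err != nil {
		panic(err)
	}
	fmt.Printf("extracted %d bytes\n", len(data))
}
```

In a real setup, the returned bytes would be written to a path such as `/usr/local/bin/gardener-node-agent` with executable permissions.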

Along with the init script, a configuration for the `gardener-node-agent` is carried over to the worker node at `/var/lib/gardener-node-agent/configuration.yaml`. This configuration contains things like the shoot's `kube-apiserver` endpoint, the corresponding certificates to communicate with it, the bootstrap token for the `kubelet`, and so on.

In a bootstrapping phase, the `gardener-node-agent` sets itself up as a systemd service. It also executes tasks that need to be executed before any other components are installed, e.g. formatting the data device for the `kubelet`.

## Reasoning

The `gardener-node-agent` is a replacement for what was called the `cloud-config-downloader` and the `cloud-config-executor`, both written in `bash`. The `gardener-node-agent` implements this functionality as a regular controller and feels more uniform in terms of maintenance.

The new architecture brings several gains. The most important ones are described here.

### Developer Productivity

Since the Gardener community develops in Go day by day, writing business logic in `bash` is difficult, hard to maintain, and almost impossible to test. Getting rid of almost all `bash` scripts which are currently in use for this very important part of the cluster creation process will enhance the speed of adding new features and fixing bugs.

### Speed

Until now, the `cloud-config-downloader` has run in a loop every `60s` to check whether something changed on the shoot that requires modifications on the worker node. This produces a lot of unneeded traffic on the API server and wastes time: it can take up to `60s` until a desired modification is started on the worker node.
By writing a "real" Kubernetes controller, we can watch the `Node`, the `OSC` in the `Secret`, and the shoot-access token in the `Secret`. If any of these objects changes, and only then, the required action takes effect immediately.
This speeds up operations and reduces the load on the API server of the shoot, especially for large clusters.
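
The difference between polling and watching can be sketched with a plain Go channel standing in for the watch stream. The `event` and `reconciler` types below are assumptions for illustration only; a real controller would typically rely on a controller framework rather than a hand-written loop. The point is that work happens when an event arrives, not on a timer, and that a re-delivered event with an unchanged version causes no action.

```go
package main

import "fmt"

// event is a hypothetical change notification for a watched object.
type event struct {
	kind            string // e.g. "Node" or "Secret"
	name            string
	resourceVersion string
}

// reconciler remembers the last version it acted on for each object.
type reconciler struct {
	seen map[string]string
	runs int
}

// handle reports whether the event led to a reconciliation. Events that carry
// an already processed version are ignored.
func (r *reconciler) handle(e event) bool {
	key := e.kind + "/" + e.name
	if r.seen[key] == e.resourceVersion {
		return false
	}
	r.seen[key] = e.resourceVersion
	r.runs++
	return true
}

func main() {
	events := make(chan event)

	// Producer: stands in for the watch connection to the API server.
	go func() {
		defer close(events)
		events <- event{kind: "Secret", name: "osc", resourceVersion: "1"}
		events <- event{kind: "Secret", name: "osc", resourceVersion: "1"} // re-delivery, no change
		events <- event{kind: "Node", name: "worker-0", resourceVersion: "7"}
		events <- event{kind: "Secret", name: "osc", resourceVersion: "2"}
	}()

	r := &reconciler{seen: map[string]string{}}
	for e := range events {
		if r.handle(e) {
			fmt.Printf("reconciling %s/%s at version %s\n", e.kind, e.name, e.resourceVersion)
		}
	}
	fmt.Println("reconciliations:", r.runs)
}
```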

### Scalability

The `cloud-config-downloader` adds a random wait time before restarting the `kubelet` in case the `kubelet` was updated or a configuration change was made to it. This is required to reduce the load on the API server and the traffic on the internet uplink. It also reduces the overall downtime of the services in the cluster because every `kubelet` restart puts a node into the `NotReady` state for several seconds, which potentially interrupts service availability.

The decision was made to keep the existing jitter mechanism, which calculates the `kubelet-download-and-restart-delay-seconds` on the controller itself.

### Correctness

The configuration of the `cloud-config-downloader` is actually done by placing a file for every configuration item on the disk on the worker node. This was done because parsing the content of a single file and using this as a value in `bash` reduces to something like `VALUE=$(cat /the/path/to/the/file)`. This is simple, but it lacks validation and type safety.
With the `gardener-node-agent` we introduce a new API which is stored in the `gardener-node-agent` `secret` and persisted on disk in a single YAML file for comparison with the previously known state. This brings all the benefits of type-safe configuration.
Because the actual and the previous configuration are compared, files and units that are removed from the `OSC` are also removed from the worker node, and the removed units are stopped.
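
To illustrate what typed, validated configuration buys over `cat`-ing files in `bash`, here is a minimal sketch. The struct mirrors a few field names from the example component config; the type name `nodeAgentConfig`, the `validate` and `checkVersion` helpers, and the exact rules are assumptions for this example and not the validation code of this commit.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// nodeAgentConfig is a hypothetical, trimmed-down configuration type.
type nodeAgentConfig struct {
	OperatingSystemConfigSecretName string
	AccessTokenSecretName           string
	Image                           string
	HyperkubeImage                  string
	KubernetesVersion               string
}

// checkVersion accepts versions of the form MAJOR.MINOR.PATCH with numeric parts.
func checkVersion(v string) error {
	parts := strings.Split(v, ".")
	if len(parts) != 3 {
		return fmt.Errorf("kubernetesVersion %q must have the form MAJOR.MINOR.PATCH", v)
	}
	for _, p := range parts {
		if _, err := strconv.ParseUint(p, 10, 32); err != nil {
			return fmt.Errorf("kubernetesVersion %q has a non-numeric part %q", v, p)
		}
	}
	return nil
}

// validate collects all problems instead of stopping at the first one.
func validate(c nodeAgentConfig) error {
	var errs []error
	required := []struct{ field, value string }{
		{"operatingSystemConfigSecretName", c.OperatingSystemConfigSecretName},
		{"accessTokenSecretName", c.AccessTokenSecretName},
		{"image", c.Image},
		{"hyperkubeImage", c.HyperkubeImage},
	}
	for _, r := range required {
		if strings.TrimSpace(r.value) == "" {
			errs = append(errs, fmt.Errorf("%s must not be empty", r.field))
		}
	}
	if err := checkVersion(c.KubernetesVersion); err != nil {
		errs = append(errs, err)
	}
	return errors.Join(errs...)
}

func main() {
	cfg := nodeAgentConfig{
		OperatingSystemConfigSecretName: "name-of-osc-secret",
		AccessTokenSecretName:           "name-of-access-token-secret",
		Image:                           "gardener-node-agent-image:v1",
		KubernetesVersion:               "1.28", // patch version missing on purpose
	}
	if err := validate(cfg); err != nil {
		fmt.Println("invalid configuration:")
		fmt.Println(err)
	}
}
```

A misconfiguration is reported before anything is applied to the node, which is hard to achieve with one file per value read from `bash`.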

### Availability

Previously, the `cloud-config-downloader` simply restarted the systemd units on every change to the `OSC`, regardless of which services changed. The `gardener-node-agent` first checks which systemd units were changed, and only restarts these. This prevents unneeded `kubelet` restarts.

### Future Development

The `gardener-node-agent` opens up the possibility for further improvements.

Necessary restarts of the `kubelet` could be deterministic instead of the aforementioned random jittering. In that case, the `gardenlet` could add annotations across all nodes. As the `gardener-node-agent` watches the `Node` object, it could either defer `kubelet` restarts and OSC changes or react immediately. Critical changes could be performed in chunks of nodes in serial order, but an equal time spread is possible, too.
19 changes: 19 additions & 0 deletions example/node-agent/10-componentconfig.yaml
@@ -0,0 +1,19 @@
---
apiVersion: nodeagent.config.gardener.cloud/v1alpha1
kind: NodeAgentConfiguration
clientConnection:
  qps: 100
  burst: 130
  kubeconfig: path/to/kubeconfig
logLevel: info
logFormat: text
debugging:
  enableProfiling: false
  enableContentionProfiling: false
featureGates: {}
operatingSystemConfigSecretName: name-of-osc-secret
accessTokenSecretName: name-of-access-token-secret
image: gardener-node-agent-image:v1
hyperkubeImage: hyperkube-image:v2
kubernetesVersion: 1.28.2
# kubeletDataVolumeSize: 50Gi
19 changes: 16 additions & 3 deletions go.mod
@@ -18,19 +18,22 @@ require (
github.com/gogo/protobuf v1.3.2
github.com/google/gnostic-models v0.6.8
github.com/google/go-cmp v0.5.9
github.com/google/go-containerregistry v0.15.2
github.com/hashicorp/go-multierror v1.1.1
github.com/kubernetes-csi/external-snapshotter/client/v4 v4.2.0
github.com/mitchellh/hashstructure/v2 v2.0.2
github.com/onsi/ginkgo/v2 v2.11.0
github.com/onsi/gomega v1.27.10
github.com/prometheus/client_golang v1.16.0
github.com/robfig/cron v1.2.0
github.com/spf13/afero v1.9.5
github.com/spf13/cobra v1.7.0
github.com/spf13/pflag v1.0.5
github.com/spf13/viper v1.16.0
github.com/texttheater/golang-levenshtein v1.0.1
go.uber.org/automaxprocs v1.5.3
go.uber.org/goleak v1.2.1
go.uber.org/mock v0.2.0
go.uber.org/zap v1.26.0
golang.org/x/crypto v0.13.0
golang.org/x/text v0.13.0
@@ -68,7 +71,7 @@ require (
)

require (
github.com/BurntSushi/toml v1.0.0 // indirect
github.com/BurntSushi/toml v1.2.1 // indirect
github.com/Masterminds/goutils v1.1.1 // indirect
github.com/Masterminds/semver v1.5.0 // indirect
github.com/Masterminds/sprig v2.22.0+incompatible // indirect
@@ -79,9 +82,14 @@ require (
github.com/blang/semver/v4 v4.0.0 // indirect
github.com/cenkalti/backoff/v4 v4.2.1 // indirect
github.com/cespare/xxhash/v2 v2.2.0 // indirect
github.com/containerd/stargz-snapshotter/estargz v0.14.3 // indirect
github.com/coreos/go-semver v0.3.1 // indirect
github.com/cyphar/filepath-securejoin v0.2.2 // indirect
github.com/davecgh/go-spew v1.1.1 // indirect
github.com/docker/cli v23.0.5+incompatible // indirect
github.com/docker/distribution v2.8.1+incompatible // indirect
github.com/docker/docker v23.0.5+incompatible // indirect
github.com/docker/docker-credential-helpers v0.7.0 // indirect
github.com/emicklei/go-restful/v3 v3.10.1 // indirect
github.com/evanphx/json-patch v5.6.0+incompatible // indirect
github.com/evanphx/json-patch/v5 v5.6.0 // indirect
@@ -98,6 +106,7 @@ require (
github.com/go-task/slim-sprig v0.0.0-20230315185526-52ccab3ef572 // indirect
github.com/gobuffalo/flect v1.0.2 // indirect
github.com/gobwas/glob v0.2.3 // indirect
github.com/godbus/dbus/v5 v5.0.4 // indirect
github.com/golang/groupcache v0.0.0-20210331224755-41bb18bfe9da // indirect
github.com/golang/protobuf v1.5.3 // indirect
github.com/google/cel-go v0.16.1 // indirect
@@ -113,30 +122,35 @@ require (
github.com/inconshreveable/mousetrap v1.1.0 // indirect
github.com/josharian/intern v1.0.0 // indirect
github.com/json-iterator/go v1.1.12 // indirect
github.com/klauspost/compress v1.16.5 // indirect
github.com/magiconair/properties v1.8.7 // indirect
github.com/mailru/easyjson v0.7.7 // indirect
github.com/mattn/go-colorable v0.1.13 // indirect
github.com/mattn/go-isatty v0.0.17 // indirect
github.com/matttproud/golang_protobuf_extensions v1.0.4 // indirect
github.com/mitchellh/copystructure v1.2.0 // indirect
github.com/mitchellh/go-homedir v1.1.0 // indirect
github.com/mitchellh/mapstructure v1.5.0 // indirect
github.com/mitchellh/reflectwalk v1.0.2 // indirect
github.com/moby/spdystream v0.2.0 // indirect
github.com/modern-go/concurrent v0.0.0-20180306012644-bacd9c7ef1dd // indirect
github.com/modern-go/reflect2 v1.0.2 // indirect
github.com/munnerz/goautoneg v0.0.0-20191010083416-a7dc8b61c822 // indirect
github.com/opencontainers/go-digest v1.0.0 // indirect
github.com/opencontainers/image-spec v1.1.0-rc3 // indirect
github.com/pelletier/go-toml/v2 v2.0.8 // indirect
github.com/pkg/errors v0.9.1 // indirect
github.com/prometheus/client_model v0.4.0 // indirect
github.com/prometheus/common v0.44.0 // indirect
github.com/prometheus/procfs v0.10.1 // indirect
github.com/russross/blackfriday/v2 v2.1.0 // indirect
github.com/shopspring/decimal v1.2.0 // indirect
github.com/spf13/afero v1.9.5 // indirect
github.com/sirupsen/logrus v1.9.0 // indirect
github.com/spf13/cast v1.5.1 // indirect
github.com/spf13/jwalterweatherman v1.1.0 // indirect
github.com/stoewer/go-strcase v1.2.0 // indirect
github.com/subosito/gotenv v1.4.2 // indirect
github.com/vbatts/tar-split v0.11.3 // indirect
go.etcd.io/etcd/api/v3 v3.5.9 // indirect
go.etcd.io/etcd/client/pkg/v3 v3.5.9 // indirect
go.etcd.io/etcd/client/v3 v3.5.9 // indirect
@@ -150,7 +164,6 @@ require (
go.opentelemetry.io/otel/sdk v1.10.0 // indirect
go.opentelemetry.io/otel/trace v1.10.0 // indirect
go.opentelemetry.io/proto/otlp v0.19.0 // indirect
go.uber.org/mock v0.2.0
go.uber.org/multierr v1.11.0 // indirect
golang.org/x/exp v0.0.0-20230321023759-10a507213a29 // indirect
golang.org/x/mod v0.12.0 // indirect