Skip to content

Commit

Permalink
KEP 2258: Node service log viewer
Browse files Browse the repository at this point in the history
Co-authored-by: Christian Glombek <cglombek@redhat.com>
  • Loading branch information
aravindhp and LorbusChris committed Mar 3, 2021
1 parent 1dedf5b commit 2abb586
Show file tree
Hide file tree
Showing 3 changed files with 442 additions and 0 deletions.
3 changes: 3 additions & 0 deletions keps/prod-readiness/sig-windows/2258.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
kep-number: 2258
alpha:
approver: "@johnbelamaric"
395 changes: 395 additions & 0 deletions keps/sig-windows/2258-node-service-log-viewer/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,395 @@
# KEP-2258: Node service log viewer

<!-- toc -->
- [Release Signoff Checklist](#release-signoff-checklist)
- [Summary](#summary)
- [Motivation](#motivation)
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [Proposal](#proposal)
- [Implement client for logs endpoint viewer (OS agnostic)](#implement-client-for-logs-endpoint-viewer-os-agnostic)
- [Linux distros with systemd / journald](#linux-distros-with-systemd--journald)
- [Linux distributions without systemd / journald](#linux-distributions-without-systemd--journald)
- [Windows](#windows)
- [User Stories](#user-stories)
- [Risks and Mitigations](#risks-and-mitigations)
- [Large log files and events](#large-log-files-and-events)
- [Wider access to all node level service logs](#wider-access-to-all-node-level-service-logs)
- [Design Details](#design-details)
- [kubelet](#kubelet)
- [kubectl](#kubectl)
- [Test Plan](#test-plan)
- [Graduation Criteria](#graduation-criteria)
- [Alpha -&gt; Beta Graduation](#alpha---beta-graduation)
- [Beta -&gt; GA Graduation](#beta---ga-graduation)
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
- [Version Skew Strategy](#version-skew-strategy)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
- [Feature Enablement and Rollback](#feature-enablement-and-rollback)
- [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
- [Monitoring Requirements](#monitoring-requirements)
- [Dependencies](#dependencies)
- [Scalability](#scalability)
- [Troubleshooting](#troubleshooting)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
<!-- /toc -->

## Release Signoff Checklist

Items marked with (R) are required *prior to targeting to a milestone / release*.

- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
- [x] (R) Design details are appropriately documented
- [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- [x] (R) Graduation criteria is in place
- [x] (R) Production readiness review completed
- [ ] (R) Production readiness review approved
- [x] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

[kubernetes.io]: https://kubernetes.io/
[kubernetes/enhancements]: https://git.k8s.io/enhancements
[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
[kubernetes/website]: https://git.k8s.io/website

## Summary

A Kubernetes cluster administrator has to log in to the relavant control-plane
or worker nodes to view the logs of the API server, kubelet etc. Or they would
have to implement a client side reader. A simpler and more elegant method would
be to allow them to use the kubectl CLI to also view these logs similar to
using it for other interactions with the cluster. Given the sensitive nature of
the information in node logs, this feature will only be available to cluster
administrators.

## Motivation

Troubleshooting issues with control-plane and worker nodes typically requires
a cluster administrator to SSH into the nodes for debugging. While certain
issues will require being on the node, issues with the kube-proxy or kubelet,
to name a couple, could be solved by perusing their logs. However this
too requires the administrator to SSH access into the nodes. Having a way for
them to view the logs using kubectl will significantly simplify their
troubleshooting.


### Goals
Provide a cluster administrator with a streaming view of logs using kubectl
without them having to implement a client side reader or logging into the node.
This would work for:
- Services on Linux worker and control plane nodes:
- That have systemd / journald support.
- That have services that log to `/var/log/`
- Windows worker nodes (all supported variants) that log to `C:\var\log`,
System and Application logs, Windows Event Logs and Event Tracing (ETW).

### Non-Goals
- Providing support for non-systemd Linux distributions.
- Reporting logs for nodes that have config or connection issues with the
cluster.
- Getting logs from services that do not use /var/log/.

## Proposal

### Implement client for logs endpoint viewer (OS agnostic)
- Extend `kubectl logs` to work with node objects.
- Implement a client for the `/var/log/` kubelet endpoint viewer.

### Linux distros with systemd / journald
Supplement the the `/var/log/` endpoint viewer on the kubelet with a thin shim
over the `journal` directory that shells out to journalctl. Then extend
`kubectl logs` to also work with node objects.

### Linux distributions without systemd / journald
Running the new "kubectl logs nodes" command against services on nodes that do
not use systemd / journald should return "OS not supported". However getting
logs from `/var/log/` should work on all systems.

### Windows
Reuse the kubelet API for querying the Linux journal for invoking the
`Get-WinEvent` cmdlet in a PowerShell.

### User Stories

Consider a scenario where pods / containers are refusing to come up on certain
nodes. As mentioned in the motivation section, troubleshooting this scenario
involves the cluster administrator to SSH into nodes to scan the logs. Allowing
them to use `kubectl logs` to do the same as they would to debug issues with a
pod / container would greatly simply their debug workflow. This also opens up
opportunities for tooling and simplifying automated log gathering. The feature
can also be used to debug issues with Kubernetes services especially in Windows
nodes that run as native Windows services and not as DaemonSets or Deployments.

Here are some example of how a cluser administrator would use this feature:
```
# See what logs are available in masters in /var/log/
kubectl logs nodes --role master --path=/
# Display cron log file (/var/log/cron) from all masters
kubectl logs nodes --role master --path=cron
# Show kubelet and crio journal logs from all masters
kubectl logs nodes --role master --path journal -s kubelet -s crio
# Show kubelet log file (/var/log/kubelet/kubelet.log) from all Windows worker nodes
kubectl logs nodes --label kubernetes.io/os=windows --path kubelet/kubelet.log
# Display docker runtime WinEvent log entries from a specific Windows worker node
kubectl logs nodes <node-name> --service docker
```

### Risks and Mitigations

#### Large log files and events
If the log that is attempted to be viewed is very large (GBs) there is
potential for the node performance to be degraded. To mitigate this we can
document that node logs should always be rotated in clusters that enable this
feature. We should also take into account nodes that don't take advantage of
journald's rate limiting options. We can then take real world feedback around
this for better mitigation when graduating the feature from alpha to beta.

#### Wider access to all node level service logs
The cluster administrator can now view all logs in /var/log/, systemd/journald
services and Windows services. Given that the cluster administrator can log
into the nodes and view the same information this should not be an issue.
However there is potential for scenarios where the cluster administrator does
not have access to the infrastructure. This again would benefit from real world
usage feedback.

## Design Details

#### kubelet

The kubelet already has a `/var/log/` [endpoint viewer](https://github.com/kubernetes/kubernetes/blob/b184272e278571d1e6650605dd4c39be897eaaa2/pkg/kubelet/kubelet.go#L1403)
that is lacking a client. Given its existence we can supplement that with a
wafer thin shim over the /journal directory that shells out to journalctl. This
allows us to extend the endpoint for getting logs from the system journal on
Linux systems that support systemd. To enable filtering of logs, we can reuse
the existing filters supported by journalctl. The `kubectl logs` will have
command line options for specifying these filters when interacting with node
objects.

On the Windows side viewing of logs from services that use `C:\var\log` will
be supported by the existing endpoint. For Windows services that log to the
the System and Application logs, Windows Event Logs and Event Tracing (ETW),
we can leverage the [Get-WinEvent cmdlet](https://docs.microsoft.com/en-us/powershell/module/microsoft.powershell.diagnostics/get-winevent?view=powershell-7.1)
that supports getting logs from all these sources. The cmdlet has filtering
options that can be leveraged to filter the logs in the same manner we do
with the journal logs.

Please note that filtering will not be available for logs in `/var/log/` or
`C:\var\log\`.

On clusters where the same service using different logging mechanisms,
example: kubelet logs to `/var/log` on some nodes and uses the journal on other
nodes, the cluster administrator will have to use `kubectl logs` with different
set of options to read both types of logs.

The feature now enables the cluster administrator to interrogate all services.
This could be prevented by having a whitelist of allowed services. But this
comes with severe disadvantages as there could be nodes (especially with
Windows) that have other services to support networking and monitoring.
These services are variable and will depend on how the nodes have been
configured. Here are some examples:
- [hybrid-overlay-node](https://github.com/ovn-org/ovn-kubernetes/tree/master/go-controller/hybrid-overlay)
- [windows-exporter](https://github.com/prometheus-community/windows_exporter).


The `/var/log/` endpoint is enabled using the `enableSystemLogHandler` kubelet
configuration options. To gain access to this new feature this option needs to
be enabled. In addition when introducing this feature it will be hidden behind a
`NodeLogs` feature gate in the kubelet that needs to be explicitly enabled. So
you need to enable both options to get access to this new feature and disabling
`enableSystemLogHandler` will disable the new feature irrespective of the
`NodeLogs` feature gate.

A reference implementation of this feature without the feature gate is
available [here](https://github.com/kubernetes/kubernetes/pull/96120).

#### kubectl

`kubectl` has an existing `logs` command that is used to view the logs for a
container in a pod or a specified resource. The sub-command looks at resource
types, so can be extended to work with node objects to view the logs of services
on the nodes. Given that the `logs` command depends on RBAC policies for access
to appropriate resource type and associated endpoints, it will allow us to
restrict node logs access to only cluster administrators as long as the cluster
is setup in that manner. Access to the `node/logs` sub-resource needs to be
explicitly granted as a user with access to `nodes` will not automatically have
access to `node/logs`.

The default mode for `logs` sub-command for node objects would be to query
the systemd journal on supported Linux nodes and Application, System, and ETW
on Windows nodes, allowing searching, time and service based filtering. A
`--path` argument will be added to see a list of log files available under
`/var/logs/` or `C:\var\logs\` and view those contents directly.

Here are some examples and explanation of the options that will be added.
```
Examples:
# Show kubelet logs from all masters
kubectl logs nodes --role master -s kubelet
# See what logs are available in masters in /var/logs
kubectl logs nodes --role master --path=/
# Show docker logs from Windows nodes
kubectl logs nodes -l kubernetes.io/os=windows --path=journal -s docker
# Show logs from all applications logging to the event logs on Windows nodes
kubectl logs nodes -l kubernetes.io/os=windows --path=journal
Options:
--case-sensitive=true: Filters are case sensitive by default. Pass --case-sensitive=false to do a case insensitive filter.
-g, --grep='': Filter log entries by the provided regex pattern. Only applies to node journal logs.
-o, --output='': Display journal logs in an alternate format (short, cat, json, short-unix). Only applies to node journal logs.
--path='journal': Retrieve the specified path within the node's /var/logs/ folder. The 'journal' value will allow querying the journal or Get-WinEvent on supported operating systems.
--raw=false: Perform no transformation of the returned data.
--role='': Set a label selector by node role.
-l, --selector='': Selector (label query) to filter on.
--since='': Return logs after a specific ISO timestamp or relative date. Only applies to node journal or Get-WinEvent logs.
--tail=0: Return up to this many lines (not more than 100k) from the end of the log. Only applies to node journal or Get-WinEvent logs.
--sort=timestamp: Interleave logs by sorting the output. Defaults on when viewing node journal logs.
-s, --service=[]: Return log entries from the specified service(s). Only applies to node journal or Get-WinEvents logs.
--until='': Return logs before a specific ISO timestamp or relative date.
```

The `--sort=timestamp` feature will introduce log unification across node
objects by timestamps which can be extended to pod logs. This will allow users
to see logs across nodes from the same time. Similarly for pods, it will allow
seeing logs across containers aligned by time.

Given that the feature will be introduced behind a feature gate, by default
`kubectl logs nodes` will return a feature not enabled message. When the
feature is enabled in alpha phase, `kubectl logs nodes` will display a
warning message that the feature is in alpha. When the `--service` option
is used against Linux nodes that do not support systemd/journald, an OS
not supported message will be returned.

A reference implementation of this feature can be see in the OpenShift
[oc adm node-logs](https://github.com/openshift/origin/pull/21701/commits/d9dc6892d395c239b19831641218ca047e74ad94)
sub-command.

### Test Plan
Add unit tests to kubelet and kubectl that exercise the new arguments that
have been added. A reference implementation of the tests can be seen
[here](https://github.com/kubernetes/kubernetes/pull/96120/commits/c606a38ec38ccfe486033495a1dc433279ce71f8#diff-1d703a87c6d6156adf2d0785ec0174bb365855d4883f5758c05fda1fee8f7f1bR1)

### Graduation Criteria

The plan is to introduce the feature as alpha in the v1.21 time frame behind the
`NodeLogs` feature gate.

#### Alpha -> Beta Graduation

The plan is to graduate the feature to beta in the v1.22 time frame. At that
point we would have collected feedback from cluster administrators and
developers who have enabled the feature. Based on this feedback and issues
opened we should consider adding a kubelet side throttle for the viewing the
logs.

#### Beta -> GA Graduation

The plan is to graduate the feature to GA in the v1.23 time frame at which point
any major issues would have been surface during the alpha and beta phases.

### Upgrade / Downgrade Strategy

### Version Skew Strategy

If a kubectl version that has the new `logs nodes --service` option is used
against a node that is using a kubelet that does not have the extended
`/var/log` endpoint viewer, the result should be "feature not supported".

## Production Readiness Review Questionnaire

### Feature Enablement and Rollback

* **How can this feature be enabled / disabled in a live cluster?**
- [x] Feature gate
- Feature gate name: NodeLogs
- Components depending on the feature gate: kubelet

* **Does enabling the feature change any default behavior?** No

* **Can the feature be disabled once it has been enabled (i.e. can we roll back
the enablement)?** Yes. It can be disabled by disabling the `NodeLogs` feature
gate in the kubelet.

* **What happens if we reenable the feature if it was previously rolled back?**
There will be no adverse effects of enabling the feature gate after it was
disabled.

* **Are there any tests for feature enablement/disablement?** No

### Rollout, Upgrade and Rollback Planning

_This section must be completed when targeting beta graduation to a release._

### Monitoring Requirements

_This section must be completed when targeting beta graduation to a release._

### Dependencies

_This section must be completed when targeting beta graduation to a release._

* **Does this feature depend on any specific services running in the cluster?**
- kubelet
- Usage description:
- Impact of its outage on the feature: If kubelet is not running on the
node this feature will not work.
- Impact of its degraded performance or high-error rates on the feature:
If the kubelet is degraded this feature will also be degraded i.e. the
node logs will not be returned.

### Scalability

* **Will enabling / using this feature result in any new API calls?**
No

* **Will enabling / using this feature result in introducing new API types?**
Yes. We will need to add a `NodeLogOptions` counterpart to
[PodLogOptions](https://github.com/kubernetes/kubernetes/blob/548ad1b8d35d51e6d33ea21dcc75d60a789b00e6/pkg/apis/core/types.go#L4409)

* **Will enabling / using this feature result in any new calls to the cloud
provider?**
No

* **Will enabling / using this feature result in increasing size or count of
the existing API objects?**
No

* **Will enabling / using this feature result in increasing time taken by any
operations covered by [existing SLIs/SLOs]?**
No

* **Will enabling / using this feature result in non-negligible increase of
resource usage (CPU, RAM, disk, IO, ...) in any components?**
In the case of large logs, there is potential for an increase in RAM and CPU
usage on the node when an attempt is made to stream them. Feedback from the
field during alpha will provide more clarity as we graduate from alpha to
beta.

### Troubleshooting

## Implementation History

- Created on Jan 14, 2021

## Drawbacks

## Alternatives

Alternatively we could use a client side reader on the nodes to redirect the
logs. The Windows side would require privileged container support. However this
would not help scenarios where containers are not launching successfully on the
nodes.

For the kubectl changes an alternative to extending `kubect logs` would be to
introduce a plugin or add a new sub-command under `kubectl alpha`.
Loading

0 comments on commit 2abb586

Please sign in to comment.