Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP 2258: Node service log viewer #2271

Merged
merged 1 commit into from
May 7, 2021

Conversation

aravindhp
Copy link
Contributor

KEP to introduce kubectl option for viewing logs of system services on Windows and Linux nodes.

Co-authored-by: Christian Glombek cglombek@redhat.com

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 14, 2021
@k8s-ci-robot
Copy link
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 14, 2021
@k8s-ci-robot
Copy link
Contributor

Welcome @aravindhp!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Copy link
Contributor

Hi @aravindhp. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory size/L Denotes a PR that changes 100-499 lines, ignoring generated files. sig/windows Categorizes an issue or PR as relevant to SIG Windows. labels Jan 14, 2021
@aravindhp aravindhp mentioned this pull request Jan 14, 2021
9 tasks
@aravindhp
Copy link
Contributor Author

/cc @immuzz

@k8s-ci-robot k8s-ci-robot requested a review from immuzz January 14, 2021 23:19
@aravindhp
Copy link
Contributor Author

I signed it

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jan 15, 2021
@marosset
Copy link
Contributor

/sig node
Since this has proposed kubelet changes

@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Jan 15, 2021
@k8s-ci-robot
Copy link
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. and removed cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 15, 2021
@marosset
Copy link
Contributor

/check-cla

@aravindhp
Copy link
Contributor Author

I signed it

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jan 15, 2021
Copy link
Member

@ddebroy ddebroy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great enhancement! I realize it's WIP but had a few early comments.

- Windows worker nodes (all supported variants)

### Non-Goals
Providing support for non-systemd Linux distributions.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is obvious but maybe you can point out here that reporting logs for nodes/kubelets that have config/connection issues with the cluster is out-of-scope.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

systemd / journald should return "OS not supported".

### Windows
Reuse the kubelet API for querying the Linux journal for invoking the `Get-WinEvent`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Windows kubelets may log to just a regular file. Would there be some logic here to figure out whether the log type configured and stream them out based on the configuration?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we've planned for any detection logic for this case.
One thing to note is that if the kubelet (or any other process) logs to C:\var\log\somepath on Windows, those log files can be streamed by the kubelet's existing log server, and one part of this proposal is implementing a client command for it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ddebroy I have sort of addressed this in my next commit. PTAL.

- "@LorbusChris"
owning-sig: sig-windows
participating-sigs:
- sig-node
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the kubectl enhancements, I guess sig-cli needs to be involved as well and needs to approve?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably want someone form sig-auth to approve here as well.
At least from the windows side exposing the ability to query arbitrary event logs can expose a lot more system information than what is exposed in kubelet logs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added sig-auth to participating sigs.

@aravindhp
Copy link
Contributor Author

/sig cli

@k8s-ci-robot k8s-ci-robot added the sig/cli Categorizes an issue or PR as relevant to SIG CLI. label Jan 19, 2021
@smarterclayton
Copy link
Contributor

I don't see a clear principle that delineates this feature from "we made a general log-query API". How far is too far and why ?

Does Kubelet know about the node? Yes
Does Kubelet have a rough idea of an opinion of the shape of logging of process it runs on the node? Yes
Does Kubelet have a goal of making debugging workloads on the node? Yes
Does Kubelet have a goal of working well on Linux and Windows machines such that admins can correlate failures outside the Kubelet but inside the Kubernetes system? 75% yes

<---

Does Kubelet own the machine? No
Should individual distributions of Kubernetes / services be required to be conformant on how to troubleshoot the node? Maybe not?
Would we allow a lot of other types of Linux systems to inject this same logging interface? Maybe not?

For OpenShift all of the answers turned out to be yes, and so it was a win to add something like this (because we explicitly took choice about node configuration away). I don't think that's universal. I don't think this endpoint should be required for conformance. Would most Kube distros have this, find value in it, and lead to happier and healthy clusters, without requiring us to solve completely? Probably.

I remember us having this same discussion about whether pod logs should be viewable, but we had the bar of Docker log to contend with, so we just went ahead and did it. Same for exec - exec was so useful we decided not to fallback on SSH and just put exec in. I can't imagine Kubernetes without pod logs and pod exec.

@thockin
Copy link
Member

thockin commented Feb 18, 2021

For OpenShift all of the answers turned out to be yes, and so it was a win to add something like this (because we explicitly took choice about node configuration away).

Agree, and I think it would be a win for most installations. I am just trying to find the edges of it.

I don't think this endpoint should be required for conformance. Would most Kube distros have this, find value in it, and lead to happier and healthy clusters, without requiring us to solve completely? Probably.

I think I agree with that.

My only remaining concern is forcing users to know and spell out files vs journals - can we find some way to help that?

@immuzz
Copy link

immuzz commented Feb 19, 2021

@aravindhp Thank you for driving this KEP. This is one of the most asked feature by Windows Containers customers for scenarios like gMSA where they have to individually log into each node to view logs just to set it up. I have reviewed the KEP thoroughly and have asked Windows engineers to review as well. Look good to us. This will be very useful in making K8s more managed. Cant wait to try this and happy to help along the way! Thanks again!

@jayunit100
Copy link
Member

... FWIW we'd take advantage of this if it was available :)

@aravindhp
Copy link
Contributor Author

My only remaining concern is forcing users to know and spell out files vs journals - can we find some way to help that?

@thockin, I can't think of an easy way to do this with making assumptions about the service and how it would log. Moreover given this feature is restricted to cluster admins and is advanced, won't having knowledge of whether they should asking for a file or journal be fine? Especially given they need this knowledge today when they have to log into the node and inspect it.

@thockin
Copy link
Member

thockin commented Apr 27, 2021

@aravindhp

I can't think of an easy way to do this with making assumptions about the service and how it would log

Why can't this be an abstraction? If I ask for kubectl logs nodes foobar have kubelet first look for a "foobar" journal, then look under "/var/log" for "foobar" files and heuristically choose which one to serve.

Why push this on users?

@thockin thockin self-assigned this Apr 27, 2021
@aravindhp
Copy link
Contributor Author

aravindhp commented Apr 27, 2021

@thockin

If I ask for kubectl logs nodes foobar have kubelet first look for a "foobar" journal, then look under "/var/log" for "foobar" files and heuristically choose which one to serve

This was the assumption I was hesitant to make i.e. a service named foobar implies that it will log to /var/log/foobar* or /var/log/foobar/*. Are you suggesting that we do a best effort but still have the option of being granular for services like foobar that log to /var/log/barfoo?

Also services like kube-proxy logs to multiple files like kube-proxy.exe.ERROR, kube-proxy.exe.INFO. In those cases do we output all the files?

@marosset
Copy link
Contributor

marosset commented Apr 28, 2021

To me, the biggest value that this enhancement adds to K8s is that it makes it possible to grab logs off nodes without needed to remotely connect to said nodes - this is especially important for Windows where configuring the nodes for a remote access is more complicated than adding SSH keys to a file.

Today in order to debug Windows nodes users need to first enable remote access (which is usually done by interacting with vendor specific infrastructure to manipulate the node) and then start looking for the logs they want to inspect.
This enhancement as is would solve one of these two problems.

Why can't this be an abstraction? If I ask for kubectl logs nodes foobar have kubelet first look for a "foobar" journal, then look under "/var/log" for "foobar" files and heuristically choose which one to serve.

Something like this may help but I have a small concern that this could increase complexity and cause some hinderances.
My feeling is that anyone using this feature to debug issues will most likely already know where the logs are on their nodes.

I wouldn't be opposed to adding in some heuristics or something to help in finding logs but do think it would be important to let users get at the exact logs they are looking for and/or handle cases like

Are you suggesting that we do a best effort but still have the option of being granular for services like foobar that log to /var/log/barfoo?

@thockin
Copy link
Member

thockin commented Apr 28, 2021

This was the assumption I was hesitant to make i.e. a service named foobar implies that it will log to /var/log/foobar* or /var/log/foobar/*. Are you suggesting that we do a best effort but still have the option of being granular for services like foobar that log to /var/log/barfoo?

Yeah, I am saying that, on average, I expect this sort of a heuristic to work pretty well. If we want to add more specific manual overrides, those are the exception, rather than the norm.

Also services like kube-proxy logs to multiple files like kube-proxy.exe.ERROR, kube-proxy.exe.INFO. In those cases do we output all the files?

Yay glog. Those are additive - the INFO file is inclusive of the ERROR file. If we want to serve both through this API, maybe we have an extra (optional) arg that factors into the heuristic.

And if I am wrong that the heuristic approach works, then I will waive further objections.

So how can we prove whether it does or doesn't work?

@aravindhp
Copy link
Contributor Author

@thockin

Yay glog. Those are additive - the INFO file is inclusive of the ERROR file. If we want to serve both through this API, maybe we have an extra (optional) arg that factors into the heuristic.

Yay glog indeed 😃 In fact there are three files including the WARNING file which is also included in the INFO file. So in heuristic approach by default we only server the INFO file or do we serve all files starting with kube-proxy i.e. starting with the service name? Then we can have the specific file option if the user wants to look at only one of them.

And if I am wrong that the heuristic approach works, then I will waive further objections.
So how can we prove whether it does or doesn't work?

One way would be to call out the heuristic approach in the KEP. Then implement the feature without heuristics for alpah and get feedback from the users. Alternatively implement the feature with the heuristics approach and without the journal vs file options for alpha and get feedback from the users. Then react to the feedback appropriately when moving from alpha to beta. I am open to other options too.

@thockin
Copy link
Member

thockin commented Apr 28, 2021

in heuristic approach by default we only server the INFO file or do we serve all files

as a heuristic, if we see 3 such files, we can assume they are glog output and choose INFO. If there's really demand for only error messages, we can expand the heuristic.

My fear is that if we implement with a lot of control we will have a hard time removing them, whereas if we start with relatively little control, we may not need them at all.

I should caveat - this is armchair architecture. From my distance I can handwave about it, but I don't have to implement all the crazy suggestions. It may be that I am asking for dumb things, and I just need people to show me why.

Co-authored-by: Christian Glombek <cglombek@redhat.com>
@aravindhp
Copy link
Contributor Author

@thockin Thank you for the patient feedback. I have updated the KEP to reflect the heuristic approach you suggested. Please take a look.

Copy link
Member

@thockin thockin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty happy - small nits only from me

/lgtm


Options:
--case-sensitive=true: Filters are case sensitive by default. Pass --case-sensitive=false to do a case insensitive filter.
-g, --grep='': Filter log entries by the provided regex pattern. Only applies to node journal logs.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do some flags only work on some implementations? Can we at least clarify somewhere that "some capabilities are optional, and may not be supported by some sources of log information", rather than specifically "journal"?

Copy link
Contributor Author

@aravindhp aravindhp May 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add a note to call this out during the implementation. Would that suffice or should I open a follow up PR to amend the enhancement?

--raw=false: Perform no transformation of the returned data.
--role='': Set a label selector by node role.
-l, --selector='': Selector (label query) to filter on.
--since='': Return logs after a specific ISO timestamp or relative date. Only applies to node journal or Get-WinEvent logs.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same for these

to get logs from these, it will attempt to get logs from `/var/log/foobar.log`,
`/var/log/foobar/foobar.log`, `/var/log/foobar*INFO` or
`/var/log/foobar/foobar*INFO` in that order.
Here are some examples and explanation of the options that will be added.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd also say that the heuristic can grow and evolve as we get real feedback.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 7, 2021
@k8s-ci-robot k8s-ci-robot merged commit 5e445c5 into kubernetes:master May 7, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.22 milestone May 7, 2021
Copy link
Contributor Author

@aravindhp aravindhp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank your for the review, @thockin


Options:
--case-sensitive=true: Filters are case sensitive by default. Pass --case-sensitive=false to do a case insensitive filter.
-g, --grep='': Filter log entries by the provided regex pattern. Only applies to node journal logs.
Copy link
Contributor Author

@aravindhp aravindhp May 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add a note to call this out during the implementation. Would that suffice or should I open a follow up PR to amend the enhancement?

to get logs from these, it will attempt to get logs from `/var/log/foobar.log`,
`/var/log/foobar/foobar.log`, `/var/log/foobar*INFO` or
`/var/log/foobar/foobar*INFO` in that order.
Here are some examples and explanation of the options that will be added.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed

@carlisia
Copy link

Hello @aravindhp 👋, 1.22 Docs Shadow here.
This enhancement is marked as ‘Needs Docs’ for the 1.22 release.

Please follow the steps detailed in the documentation to open a PR against dev-1.22 branch in the k/website repo. This PR can be just a placeholder at this time and must be created before Fri July 9, 11:59 PM PDT.

Also, take a look at Documenting for a release to familiarize yourself with the docs requirement for the release.

Thank you! 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-review Categorizes an issue or PR as actively needing an API review. approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. sig/cli Categorizes an issue or PR as relevant to SIG CLI. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/windows Categorizes an issue or PR as relevant to SIG Windows. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
Status: API review completed, 1.22
Development

Successfully merging this pull request may close these issues.