RFC: structured, contextual logging #639

pohly · 2021-05-31T11:22:48Z

What type of PR is this?
/kind cleanup

What this PR does / why we need it:

Embracing go-logr as the logger has several advantages:

- adding a certain prefix and/or set of objects can be done consistently with less code duplication
- it's even possible for the caller to do that, something that doesn't work when the prefix is part of the format string
- in Go tests, all output can be associated with the currently running test via kubernetes/klog#240 (RFC: testinglogger: per-test, structured logging)

The latter was needed to debug #638
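
To illustrate the first two points, here is a minimal sketch. The `Logger` type below is a hypothetical stand-in that only mimics the `WithName`/`WithValues` shape of logr; it is not the real go-logr API and not code from this PR. It shows the caller attaching a prefix and a key/value pair while the callee logs only a fixed message.

```go
package main

import (
	"fmt"
	"strings"
)

// Logger is a hypothetical, minimal stand-in for a logr-style logger.
// It is an illustration only, not the real go-logr type.
type Logger struct {
	names []string
	kvs   []any
}

// WithName returns a copy of the logger with one more name segment.
func (l Logger) WithName(name string) Logger {
	l.names = append(append([]string(nil), l.names...), name)
	return l
}

// WithValues returns a copy of the logger with additional key/value pairs.
func (l Logger) WithValues(kvs ...any) Logger {
	l.kvs = append(append([]any(nil), l.kvs...), kvs...)
	return l
}

// Format renders one log line: name prefix, fixed message, key/value pairs.
func (l Logger) Format(msg string, kvs ...any) string {
	var b strings.Builder
	if len(l.names) > 0 {
		b.WriteString(strings.Join(l.names, "/") + ": ")
	}
	b.WriteString(msg)
	all := append(append([]any(nil), l.kvs...), kvs...)
	for i := 0; i+1 < len(all); i += 2 {
		fmt.Fprintf(&b, " %v=%q", all[i], fmt.Sprint(all[i+1]))
	}
	return b.String()
}

// removeStorageClass knows nothing about prefixes; it logs a fixed message.
func removeStorageClass(logger Logger) string {
	return logger.Format("removed")
}

func main() {
	// The caller decides which context to attach.
	logger := Logger{}.WithName("capacity").WithValues("storageclass", "triple-sc")
	fmt.Println(removeStorageClass(logger))
}
```

Running the sketch prints `capacity: removed storageclass="triple-sc"`. With a format string such as "Capacity Controller: storage class %s was removed", the prefix and the object would have to be repeated in every call site instead.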

Does this PR introduce a user-facing change?:

Log messages are structured (fixed message, additional information in key/value pairs).

pkg/capacity/capacity.go

xing-yang · 2021-06-08T13:45:43Z

go.mod

@@ -61,3 +61,6 @@ replace k8s.io/component-base => k8s.io/component-base v0.21.0
 replace k8s.io/component-helpers => k8s.io/component-helpers v0.21.0

 replace k8s.io/csi-translation-lib => k8s.io/csi-translation-lib v0.21.0
+
+// WIP
+replace k8s.io/klog/v2 => github.com/pohly/klog/v2 v2.4.1-0.20210527141230-ac596814502c


What's the plan for this?

I proposed to have the code in klog: kubernetes/klog#240

It's currently on hold because the logr API changes need to be dealt with first.

xing-yang · 2021-06-08T13:49:59Z

pkg/capacity/capacity.go


-	klog.Info("Started Capacity Controller")
+	logger.Info("started controller")


What's the reason for starting an info log message with lower case?

It's not a full sentence, so an initial capital letter looked odd. I don't know whether there is any guidance on this.

It's also consistent with error messages. For those the official guidance is to start with lower case because the error might get wrapped.
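
A small sketch of why wrapping favors lower case. The function names and the sentinel error are made up for illustration:

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical sentinel error for this sketch.
var errNotFound = errors.New("storage class not found")

// lookup returns an error whose message starts with lower case.
func lookup(name string) error {
	return fmt.Errorf("lookup %q: %w", name, errNotFound)
}

// refresh wraps the error once more, so the inner text ends up mid-sentence.
func refresh(name string) error {
	if err := lookup(name); err != nil {
		return fmt.Errorf("refresh capacity: %w", err)
	}
	return nil
}

func main() {
	err := refresh("triple-sc")
	fmt.Println(err)
	// The wrapped sentinel is still detectable.
	fmt.Println(errors.Is(err, errNotFound))
}
```

The first line printed is `refresh capacity: lookup "triple-sc": storage class not found`. A capitalized inner message would appear in the middle of that chain, which is the reason given for the lower-case convention.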

I'd like to see whether the structured logging KEP has any official guidance on this. The official guidance for error messages only covers returned errors, not error messages in the logs.

The logr example uses lower case, incidentally also with a "starting" message:
https://github.com/go-logr/logr#typical-usage

But the Kubernetes documentation says "Start from a capital letter": https://github.com/kubernetes/community/blob/master/contributors/devel/sig-instrumentation/migration-to-structured-logging.md#remove-string-formatting-from-log-message

It also has some other recommendations. Once we agree to go further with this and the klog PR is merged, I'll revisit the log messages and update them accordingly.

xing-yang · 2021-06-08T13:51:48Z

pkg/capacity/capacity.go


-	klog.V(3).Infof("Capacity Controller: storage class %s was removed", sc.Name)
+	logger.V(3).Info("removed")


Can you show an example of the output of this changed log msg vs the original log msg?

Looks like I have a coverage gap in capacity_test.go - onSCDelete is not called. Will fix.

In the meantime, here's the corresponding output from onSCAddOrUpdate, as printed by go test. With these changes:

capacity.go:373: INFO onSCAddOrUpdate: updated or added storageclass="triple-sc"
...
capacity.go:481: INFO onSCAddOrUpdate: enqueuing storageclass="triple-sc" workitem={segment:0x27ef430 storageClassName:triple-sc}

Without them:

I0608 17:48:40.721547 866697 capacity.go:361] Capacity Controller: storage class triple-sc was updated or added
...
I0608 17:48:40.721526 866697 capacity.go:468] Capacity Controller: enqueuing {segment:0x27e7350 storageClassName:triple-sc}

Note that the "enqueuing" messages without this PR lack context: it's not clear why addWorkItem was called. With contextual logging, the onSCAddOrUpdate function name and the storage class get passed down and are added to the log message.

This becomes important once things start to happen in parallel. When everything is sequential, one can read the log from top to bottom and remember which values were logged earlier. With parallel execution, it is no longer clear whether a log line actually belongs to the one printed directly above it.

This is a problem in our driver logs where it is hard to associate a gRPC error response with the corresponding call.
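
The following sketch shows the idea with goroutines. It uses the standard library's `log/slog` (Go 1.21 or newer) purely as a stand-in for logr/klog, and the context key and helper names are hypothetical. Each goroutine attaches its handler name and storage class to a logger, stores it in a context, and the callee logs through whatever logger it finds there.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"log/slog"
	"strings"
	"sync"
)

// loggerKey is a hypothetical context key used only in this sketch.
type loggerKey struct{}

// withLogger stores a logger in the context.
func withLogger(ctx context.Context, l *slog.Logger) context.Context {
	return context.WithValue(ctx, loggerKey{}, l)
}

// fromContext retrieves the logger, falling back to the default one.
func fromContext(ctx context.Context) *slog.Logger {
	if l, ok := ctx.Value(loggerKey{}).(*slog.Logger); ok {
		return l
	}
	return slog.Default()
}

// addWorkItem does not know who called it; the context carries that information.
func addWorkItem(ctx context.Context, segment string) {
	fromContext(ctx).Info("enqueuing", "segment", segment)
}

// run handles several storage classes in parallel and returns the log lines.
func run(storageClasses []string) []string {
	var buf bytes.Buffer
	opts := &slog.HandlerOptions{
		// Drop the timestamp so that the output is stable.
		ReplaceAttr: func(groups []string, a slog.Attr) slog.Attr {
			if len(groups) == 0 && a.Key == slog.TimeKey {
				return slog.Attr{}
			}
			return a
		},
	}
	base := slog.New(slog.NewTextHandler(&buf, opts))

	var wg sync.WaitGroup
	for _, sc := range storageClasses {
		wg.Add(1)
		go func(sc string) {
			defer wg.Done()
			logger := base.With("handler", "onSCAddOrUpdate", "storageclass", sc)
			addWorkItem(withLogger(context.Background(), logger), "segment-0")
		}(sc)
	}
	wg.Wait()
	return strings.Split(strings.TrimSpace(buf.String()), "\n")
}

// countWithContext counts the lines that carry both handler and storage class.
func countWithContext(lines []string) int {
	n := 0
	for _, line := range lines {
		if strings.Contains(line, "handler=") && strings.Contains(line, "storageclass=") {
			n++
		}
	}
	return n
}

func main() {
	lines := run([]string{"direct-sc", "triple-sc"})
	fmt.Println(strings.Join(lines, "\n"))
	fmt.Println("lines with context:", countWithContext(lines))
}
```

The order of the two "enqueuing" lines can differ between runs, but each line identifies its handler and storage class on its own, so it no longer matters what was printed directly above it.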

Here's the log output for a complete testcase:

=== RUN TestRefresh
=== RUN TestRefresh/truncated_topology
=== PAUSE TestRefresh/truncated_topology
=== CONT TestRefresh/truncated_topology
capacity.go:338: INFO onTopologyChanges: topology changed added=[0x27ef430 = layer0: foo+ layer1: X+ layer2: A 0x27ef450 = layer0: foo+ layer1: X+ layer2: B] removed=[]
capacity.go:373: INFO onSCAddOrUpdate: updated or added storageclass="direct-sc"
capacity.go:481: INFO onSCAddOrUpdate: enqueuing storageclass="direct-sc" workitem={segment:0x27ef430 storageClassName:direct-sc}
capacity.go:481: INFO onSCAddOrUpdate: enqueuing storageclass="direct-sc" workitem={segment:0x27ef450 storageClassName:direct-sc}
capacity.go:373: INFO onSCAddOrUpdate: updated or added storageclass="triple-sc"
capacity.go:481: INFO onSCAddOrUpdate: enqueuing storageclass="triple-sc" workitem={segment:0x27ef430 storageClassName:triple-sc}
capacity.go:481: INFO onSCAddOrUpdate: enqueuing storageclass="triple-sc" workitem={segment:0x27ef450 storageClassName:triple-sc}
capacity.go:338: INFO onTopologyChanges: topology changed added=[0x27ef430 = layer0: foo+ layer1: X+ layer2: A 0x27ef450 = layer0: foo+ layer1: X+ layer2: B] removed=[]
capacity.go:481: INFO onTopologyChanges: enqueuing workitem={segment:0x27ef430 storageClassName:direct-sc}
capacity.go:481: INFO onTopologyChanges: enqueuing workitem={segment:0x27ef450 storageClassName:direct-sc}
capacity.go:481: INFO onTopologyChanges: enqueuing workitem={segment:0x27ef430 storageClassName:triple-sc}
capacity.go:481: INFO onTopologyChanges: enqueuing workitem={segment:0x27ef450 storageClassName:triple-sc}
capacity.go:276: INFO prepare: initial state topology segments=2 storage classes=2 potential CSIStorageCapacity objects=4
capacity.go:288: INFO prepare: checking for existing CSIStorageCapacity objects
--- PASS: TestRefresh (0.00s)

Note that this would be impossible to do without this PR because log output from different test cases would be mixed.
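
The per-test association can be sketched as follows. This is not the klog testing logger from kubernetes/klog#240; `TL`, `testLogger`, and `recorder` are hypothetical names. The point is only that a logger bound to one test's `Log` method cannot leak output into another test.

```go
package main

import "fmt"

// TL is the part of *testing.T that this sketch needs.
type TL interface {
	Log(args ...any)
}

// testLogger is a hypothetical logger bound to exactly one test.
type testLogger struct{ t TL }

// Info formats a fixed message plus key/value pairs and hands it to the test.
func (l testLogger) Info(msg string, kvs ...any) {
	line := "INFO " + msg
	for i := 0; i+1 < len(kvs); i += 2 {
		line += fmt.Sprintf(" %v=%q", kvs[i], fmt.Sprint(kvs[i+1]))
	}
	l.t.Log(line)
}

// recorder is a fake TL for this sketch; a real test would pass its *testing.T.
type recorder struct{ lines []string }

// Log stores the line instead of printing it.
func (r *recorder) Log(args ...any) {
	r.lines = append(r.lines, fmt.Sprint(args...))
}

func main() {
	testA, testB := &recorder{}, &recorder{}
	loggerA := testLogger{t: testA}
	loggerB := testLogger{t: testB}

	// Each message ends up only in the test that owns the logger.
	loggerA.Info("enqueuing", "storageclass", "direct-sc")
	loggerB.Info("topology changed")

	fmt.Println(testA.lines)
	fmt.Println(testB.lines)
}
```

In a real test, `go test` attributes everything passed to `t.Log` to the test that owns `t`, which is what keeps the output of parallel test cases apart.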

k8s-triage-robot · 2021-09-07T14:05:52Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · 2021-10-07T15:05:37Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

pohly · 2021-10-07T16:38:01Z

/remove-lifecycle rotten

k8s-triage-robot · 2022-01-05T16:50:34Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · 2022-02-04T17:13:25Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

The k8s.io/component-base/logs API is used to add several new command line flags and the corresponding implementation:

--feature-gates:
    ContextualLogging=true|false (ALPHA - default=true)
    LoggingAlphaOptions=true|false (ALPHA - default=false)
    LoggingBetaOptions=true|false (BETA - default=true)
--log-flush-frequency duration
    Maximum number of seconds between log flushes (default 5s)
--log-json-info-buffer-size quantity
    [Alpha] In JSON format with split output streams, the info messages can be buffered for a while to increase performance. The default value of zero bytes disables buffering. The size can be specified as number of bytes (512), multiples of 1000 (1K), multiples of 1024 (2Ki), or powers of those (3M, 4G, 5Mi, 6Gi). Enable the LoggingAlphaOptions feature gate to use this.
--log-json-split-stream
    [Alpha] In JSON format, write error messages to stderr and info messages to stdout. The default is to write a single stream to stdout. Enable the LoggingAlphaOptions feature gate to use this.
35a42
--logging-format string
    Sets the log format. Permitted formats: "json" (gated by LoggingBetaOptions), "text". (default "text")

In contrast to the defaults in the (pretty conservative) Kubernetes components, contextual logging gets enabled by default. That has the advantage that code can be rewritten with the assumption that WithValues and WithName calls really have an effect. Users can still disable the feature, but logs will be less informative in that case.

k8s-triage-robot · 2023-01-22T17:45:45Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

pohly · 2023-01-22T22:13:28Z

/remove-lifecycle rotten

k8s-ci-robot · 2023-02-03T11:36:35Z

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-triage-robot · 2023-05-04T12:17:58Z

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as fresh with /remove-lifecycle stale
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · 2023-06-03T12:45:39Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as fresh with /remove-lifecycle rotten
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

pohly · 2023-06-05T06:01:44Z

/remove-lifecycle rotten

I was looking for a volunteer to continue with this, but so far without luck. I'll probably finish this myself.

bells17 · 2023-08-12T09:38:51Z

Hi @pohly

I was looking for a volunteer to continue with this, but so far without luck. I'll probably finish this myself.

Would it be acceptable if I were to take on this development task?

pohly · 2023-08-21T17:37:34Z

@bells17: help with this would be very welcome. Feel free to take my branch, rebase it and continue in a new PR.

I think with this PR and kubernetes-csi/node-driver-registrar#259 it is technically clear how to use component-base/logs. The rest of the conversion can go as described in https://github.com/kubernetes/community/blob/master/contributors/devel/sig-instrumentation/migration-to-structured-logging.md

k8s-ci-robot · 2023-10-02T17:20:35Z

@pohly: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
pull-kubernetes-csi-external-provisioner-1-22-on-kubernetes-1-22	`c4a1889`	link	true	`/test pull-kubernetes-csi-external-provisioner-1-22-on-kubernetes-1-22`
pull-kubernetes-csi-external-provisioner-distributed-on-kubernetes-1-21	`c4a1889`	link	true	`/test pull-kubernetes-csi-external-provisioner-distributed-on-kubernetes-1-21`
pull-kubernetes-csi-external-provisioner-1-23-on-kubernetes-1-23	`c4a1889`	link	false	`/test pull-kubernetes-csi-external-provisioner-1-23-on-kubernetes-1-23`
pull-kubernetes-csi-external-provisioner-1-21-on-kubernetes-1-21	`c4a1889`	link	true	`/test pull-kubernetes-csi-external-provisioner-1-21-on-kubernetes-1-21`
pull-kubernetes-csi-external-provisioner-distributed-on-kubernetes-1-23	`c4a1889`	link	true	`/test pull-kubernetes-csi-external-provisioner-distributed-on-kubernetes-1-23`
pull-kubernetes-csi-external-provisioner-distributed-on-kubernetes-1-24	`c4a1889`	link	true	`/test pull-kubernetes-csi-external-provisioner-distributed-on-kubernetes-1-24`
pull-kubernetes-csi-external-provisioner-distributed-on-kubernetes-1-25	`c4a1889`	link	true	`/test pull-kubernetes-csi-external-provisioner-distributed-on-kubernetes-1-25`
pull-kubernetes-csi-external-provisioner-1-27-on-kubernetes-1-27	`a2d6a4d`	link	true	`/test pull-kubernetes-csi-external-provisioner-1-27-on-kubernetes-1-27`

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

k8s-triage-robot · 2024-01-22T00:21:55Z

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as fresh with /remove-lifecycle stale
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · 2024-02-21T00:41:36Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as fresh with /remove-lifecycle rotten
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot · 2024-03-22T01:20:51Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Reopen this PR with /reopen
Mark this PR as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-ci-robot · 2024-03-22T01:20:54Z

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied

After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied

After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Reopen this PR with /reopen

Mark this PR as fresh with /remove-lifecycle rotten

Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot requested review from gnufied and xing-yang May 31, 2021 11:23

k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label May 31, 2021

pohly force-pushed the structured-logging branch 2 times, most recently from a3d1606 to 6f3e4f8 Compare May 31, 2021 11:30

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 31, 2021

pohly mentioned this pull request May 31, 2021

test flake: capacity metrics check fails #638

Closed

pohly force-pushed the structured-logging branch from 6f3e4f8 to c4a1889 Compare May 31, 2021 13:39

pohly commented May 31, 2021

pohly mentioned this pull request May 31, 2021

RFC: testinglogger: per-test, structured logging kubernetes/klog#240

Closed

xing-yang reviewed Jun 8, 2021

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 8, 2021

xing-yang mentioned this pull request Jun 17, 2021

migrate volume/csi/csi_plugin.go logs to structured logging kubernetes/kubernetes#100323

Closed

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 7, 2021

k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 7, 2021

k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Oct 7, 2021

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 5, 2022

pohly mentioned this pull request Jan 12, 2022

KEP-3077: contextual logging kubernetes/enhancements#3078

Merged

k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 4, 2022

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 23, 2022

pohly added 2 commits December 23, 2022 17:43

WIP: structured, contextual logging for capacity

a2d6a4d

pohly force-pushed the structured-logging branch from a2eef0f to a2d6a4d Compare December 23, 2022 16:44

k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 22, 2023

k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jan 22, 2023

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 3, 2023

pohly mentioned this pull request May 2, 2023

implement logging best practices kubernetes-sigs/dra-example-driver#10

Closed

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 4, 2023

k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 3, 2023

k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jun 5, 2023

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 22, 2024

k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 21, 2024

k8s-ci-robot closed this Mar 22, 2024
