-
Notifications
You must be signed in to change notification settings - Fork 104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
metrics: collect the metrics of nydusd events #263
Conversation
431824b
to
b2fc494
Compare
Codecov ReportBase: 33.30% // Head: 33.61% // Increases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## main #263 +/- ##
==========================================
+ Coverage 33.30% 33.61% +0.31%
==========================================
Files 29 29
Lines 3099 3106 +7
==========================================
+ Hits 1032 1044 +12
+ Misses 1958 1953 -5
Partials 109 109
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
e76fe24
to
b1e3197
Compare
* Copyright (c) 2021. Alibaba Cloud. All rights reserved. | ||
* | ||
* SPDX-License-Identifier: Apache-2.0 | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should use license claims like:
/*
* Copyright (c) 2022. Nydus Developers. All rights reserved.
*
* SPDX-License-Identifier: Apache-2.0
*/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, already fixed it.
Help: "nydusd lifetime event times.", | ||
}, | ||
[]string{daemonIDLabel, timeLabel, eventLabel}, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Snapshot package is an implementation of Containerd remote snapshotter. It does not know about nydusd daemons backing the nydus meta layer snapshot. Can we move the code to the daemon package?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, done.
NydusdEvent = prometheus.NewCounterVec( | ||
prometheus.CounterOpts{ | ||
Name: "nydusd_lifetime_event_times", | ||
Help: "nydusd lifetime event times.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about renaming it to nydusd_lifetime_events
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please follow Prom name suggestions https://prometheus.io/docs/instrumenting/writing_exporters/#naming
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, thanks.
pkg/filesystem/fs/fs.go
Outdated
@@ -113,6 +114,10 @@ func NewFileSystem(ctx context.Context, opt ...NewFSOpt) (*Filesystem, error) { | |||
if err := d.WaitUntilState(types.DaemonStateRunning); err != nil { | |||
return nil, errors.Wrapf(err, "wait for daemon %s", d.ID()) | |||
} | |||
if err := exporter.ExportNydusdEventMetric(d.States.ID, string(types.DaemonStateRunning)); err != nil { | |||
log.L.Errorf("export nydusd event metric failed, daemon ID: %s, event: %s, error: %v", d.States.ID, string(types.DaemonStateRunning), err) | |||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about recording nydusd lifetime event in function WaitUntilState
thus not calling it right after calling WaitUntilState
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
pkg/metrics/exporter/export.go
Outdated
} | ||
|
||
func (mexp *MutexExporter) Get() *Exporter { | ||
mexp.mu.Lock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why need a lock here? Is there a race condition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no race condition now. I already removed it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you help check nydus-snapshotter resource consumption if enabling metrics after running more than 10 minutes with 5 container images?
containers: nerdctl ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
3a45e9a3c9b1 alpine:nydusv6 "sleep 1200" 16 minutes ago Up alpine-3a45e
6e21ffbcb33f python:nydusv6 "sleep 1200" 16 minutes ago Up python-6e21f
c116554c0584 debian:nydusv6 "sleep 1200" 16 minutes ago Up debian-c1165
c23ca6ff0674 busybox:nydusv6 "sleep 1200" 16 minutes ago Up busybox-c23ca
c7068ea2882c golang:nydusv6 "sleep 1200" 16 minutes ago Up golang-c7068 enable metrics: {"cpu_times": {"user": 0.22, "system": 0.07}, "cpu_percent": 0.0, "memory": {"rss": 32247808, "vms": 1511743488}, "num_fds": 19, "num_threads": 11, "time": "2022-11-28 03:20:28"}
{"cpu_times": {"user": 0.22, "system": 0.07}, "cpu_percent": 0.0, "memory": {"rss": 32247808, "vms": 1511743488}, "num_fds": 19, "num_threads": 11, "time": "2022-11-28 03:20:29"}
{"cpu_times": {"user": 0.22, "system": 0.07}, "cpu_percent": 0.0, "memory": {"rss": 32247808, "vms": 1511743488}, "num_fds": 19, "num_threads": 11, "time": "2022-11-28 03:20:30"}
{"cpu_times": {"user": 0.22, "system": 0.07}, "cpu_percent": 0.0, "memory": {"rss": 32247808, "vms": 1511743488}, "num_fds": 24, "num_threads": 11, "time": "2022-11-28 03:20:31"}
{"cpu_times": {"user": 0.22, "system": 0.07}, "cpu_percent": 0.0, "memory": {"rss": 32247808, "vms": 1511743488}, "num_fds": 24, "num_threads": 11, "time": "2022-11-28 03:20:32"}
{"cpu_times": {"user": 0.22, "system": 0.07}, "cpu_percent": 0.0, "memory": {"rss": 32247808, "vms": 1511743488}, "num_fds": 24, "num_threads": 11, "time": "2022-11-28 03:20:33"}
{"cpu_times": {"user": 0.22, "system": 0.07}, "cpu_percent": 0.0, "memory": {"rss": 32247808, "vms": 1511743488}, "num_fds": 24, "num_threads": 11, "time": "2022-11-28 03:20:34"}
{"cpu_times": {"user": 0.22, "system": 0.07}, "cpu_percent": 0.0, "memory": {"rss": 32247808, "vms": 1511743488}, "num_fds": 24, "num_threads": 11, "time": "2022-11-28 03:20:35"}
{"cpu_times": {"user": 0.22, "system": 0.07}, "cpu_percent": 0.0, "memory": {"rss": 32247808, "vms": 1511743488}, "num_fds": 24, "num_threads": 11, "time": "2022-11-28 03:20:36"}
{"cpu_times": {"user": 0.22, "system": 0.07}, "cpu_percent": 0.0, "memory": {"rss": 32247808, "vms": 1511743488}, "num_fds": 24, "num_threads": 11, "time": "2022-11-28 03:20:37"}
{"cpu_times": {"user": 0.22, "system": 0.07}, "cpu_percent": 0.0, "memory": {"rss": 32247808, "vms": 1511743488}, "num_fds": 24, "num_threads": 11, "time": "2022-11-28 03:20:38"}
{"cpu_times": {"user": 0.22, "system": 0.07}, "cpu_percent": 0.0, "memory": {"rss": 32247808, "vms": 1511743488}, "num_fds": 24, "num_threads": 11, "time": "2022-11-28 03:20:39"}
{"cpu_times": {"user": 0.22, "system": 0.07}, "cpu_percent": 0.0, "memory": {"rss": 32247808, "vms": 1511743488}, "num_fds": 24, "num_threads": 11, "time": "2022-11-28 03:20:40"}
{"cpu_times": {"user": 0.22, "system": 0.07}, "cpu_percent": 0.0, "memory": {"rss": 32247808, "vms": 1511743488}, "num_fds": 19, "num_threads": 11, "time": "2022-11-28 03:20:41"}
{"cpu_times": {"user": 0.22, "system": 0.07}, "cpu_percent": 0.0, "memory": {"rss": 32247808, "vms": 1511743488}, "num_fds": 19, "num_threads": 11, "time": "2022-11-28 03:20:42"}
{"cpu_times": {"user": 0.22, "system": 0.07}, "cpu_percent": 0.0, "memory": {"rss": 32247808, "vms": 1511743488}, "num_fds": 19, "num_threads": 11, "time": "2022-11-28 03:20:43"}
{"cpu_times": {"user": 0.22, "system": 0.07}, "cpu_percent": 0.0, "memory": {"rss": 32247808, "vms": 1511743488}, "num_fds": 19, "num_threads": 11, "time": "2022-11-28 03:20:44"}
{"cpu_times": {"user": 0.22, "system": 0.07}, "cpu_percent": 0.0, "memory": {"rss": 32247808, "vms": 1511743488}, "num_fds": 19, "num_threads": 11, "time": "2022-11-28 03:20:45"}
{"cpu_times": {"user": 0.22, "system": 0.07}, "cpu_percent": 0.0, "memory": {"rss": 32247808, "vms": 1511743488}, "num_fds": 19, "num_threads": 11, "time": "2022-11-28 03:20:46"} disable metrics: {"cpu_times": {"user": 0.16, "system": 0.05}, "cpu_percent": 0.0, "memory": {"rss": 30568448, "vms": 1435717632}, "num_fds": 18, "num_threads": 10, "time": "2022-11-28 03:42:57"}
{"cpu_times": {"user": 0.16, "system": 0.05}, "cpu_percent": 0.0, "memory": {"rss": 30568448, "vms": 1435717632}, "num_fds": 18, "num_threads": 10, "time": "2022-11-28 03:42:58"}
{"cpu_times": {"user": 0.16, "system": 0.05}, "cpu_percent": 0.0, "memory": {"rss": 30568448, "vms": 1435717632}, "num_fds": 18, "num_threads": 10, "time": "2022-11-28 03:42:59"}
{"cpu_times": {"user": 0.16, "system": 0.05}, "cpu_percent": 0.0, "memory": {"rss": 30568448, "vms": 1435717632}, "num_fds": 18, "num_threads": 10, "time": "2022-11-28 03:43:00"}
{"cpu_times": {"user": 0.16, "system": 0.05}, "cpu_percent": 0.0, "memory": {"rss": 30568448, "vms": 1435717632}, "num_fds": 18, "num_threads": 10, "time": "2022-11-28 03:43:01"}
{"cpu_times": {"user": 0.16, "system": 0.05}, "cpu_percent": 0.0, "memory": {"rss": 30568448, "vms": 1435717632}, "num_fds": 18, "num_threads": 10, "time": "2022-11-28 03:43:02"}
{"cpu_times": {"user": 0.16, "system": 0.05}, "cpu_percent": 0.0, "memory": {"rss": 30568448, "vms": 1435717632}, "num_fds": 18, "num_threads": 10, "time": "2022-11-28 03:43:03"}
{"cpu_times": {"user": 0.16, "system": 0.05}, "cpu_percent": 0.0, "memory": {"rss": 30568448, "vms": 1435717632}, "num_fds": 18, "num_threads": 10, "time": "2022-11-28 03:43:04"}
{"cpu_times": {"user": 0.16, "system": 0.05}, "cpu_percent": 0.0, "memory": {"rss": 30568448, "vms": 1435717632}, "num_fds": 18, "num_threads": 10, "time": "2022-11-28 03:43:05"}
{"cpu_times": {"user": 0.16, "system": 0.05}, "cpu_percent": 0.0, "memory": {"rss": 30568448, "vms": 1435717632}, "num_fds": 18, "num_threads": 10, "time": "2022-11-28 03:43:06"}
{"cpu_times": {"user": 0.16, "system": 0.05}, "cpu_percent": 0.0, "memory": {"rss": 30568448, "vms": 1435717632}, "num_fds": 18, "num_threads": 10, "time": "2022-11-28 03:43:07"}
{"cpu_times": {"user": 0.16, "system": 0.05}, "cpu_percent": 0.0, "memory": {"rss": 30568448, "vms": 1435717632}, "num_fds": 18, "num_threads": 10, "time": "2022-11-28 03:43:08"}
{"cpu_times": {"user": 0.16, "system": 0.05}, "cpu_percent": 0.0, "memory": {"rss": 30568448, "vms": 1435717632}, "num_fds": 18, "num_threads": 10, "time": "2022-11-28 03:43:09"}
{"cpu_times": {"user": 0.16, "system": 0.05}, "cpu_percent": 0.0, "memory": {"rss": 30568448, "vms": 1435717632}, "num_fds": 18, "num_threads": 10, "time": "2022-11-28 03:43:10"}
{"cpu_times": {"user": 0.16, "system": 0.05}, "cpu_percent": 0.0, "memory": {"rss": 30568448, "vms": 1435717632}, "num_fds": 18, "num_threads": 10, "time": "2022-11-28 03:43:11"} |
404c2d3
to
03d467d
Compare
@@ -16,6 +16,7 @@ import ( | |||
|
|||
var ( | |||
defaultCleanUpPeriod = 10 * time.Minute | |||
DefaultTTL = 3 * time.Minute | |||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For snapshots daemons events, it's better to keep the metrics 2 days or longer. It does not occupy so much memory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I do not use TTL for snapshots daemons events
. It just works for FS Metrics
.
19f0a66
to
d79e469
Compare
pkg/daemon/daemon.go
Outdated
@@ -201,6 +202,10 @@ func (d *Daemon) WaitUntilState(expected types.DaemonState) error { | |||
_, err, shared := d.stateGetterGroup.Do(d.ID(), stateGetter) | |||
log.L.Debugf("Get daemon %s with shared result: %v ", d.ID(), shared) | |||
|
|||
if exportErr := exporter.ExportNydusdEventMetric(d.States.ID, string(expected)); exportErr != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should check err
if it is nil. Then record the metric
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
d79e469
to
1335cb3
Compare
pkg/metrics/exporter/export.go
Outdated
|
||
daemon.NydusdEvent.WithLabelValues(daemonID, time.Now().Format("2006-01-02 15:04:05.000"), event).Inc() | ||
|
||
return e.output() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a bad idea to write metrics to files each time this function is called. It will harm snapshotter performance.
I suppose only write them out when querying metrics via HTTP api
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
pkg/metrics/exporter/export.go
Outdated
@@ -69,6 +88,17 @@ func (e *Exporter) ExportFsMetrics(m *types.FsMetrics, imageRef string) error { | |||
return e.output() | |||
} | |||
|
|||
func ExportNydusdEventMetric(daemonID string, event string) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps, we can rename this function to RecordDaemonEvent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Export means the metrics is consumed and passed to another subsystem
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is right. I have Fixed it.
I can get two duplicated event records applying this PR
|
Can enable nydus-snapshotter metrics by default? |
Um... Seems you forgot to record the event when nydusd exits normally in |
1335cb3
to
79e38dc
Compare
pkg/metrics/exporter/export.go
Outdated
@@ -40,32 +40,33 @@ func WithOutputFile(metricsFile string) Opt { | |||
} | |||
} | |||
|
|||
func NewExporter(opts ...Opt) (*Exporter, error) { | |||
func NewExporter(opts ...Opt) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to initialize the exporter in the package init() function which guarantees that it must be successfully initialized.
Then RecordDaemonEvent() can't fail which makes it neater without checking its results. we can ignore its absence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
pkg/metrics/exporter/record.go
Outdated
func RecordDaemonEvent(daemonID string, event string) error { | ||
if _, err := getExporter(); err != nil { | ||
return err | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If fails to get the global exporter, it just ignore its absence and return success to caller
pkg/metrics/exporter/record.go
Outdated
@@ -0,0 +1,41 @@ | |||
package exporter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should have license claim
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
pkg/daemon/daemon.go
Outdated
@@ -188,6 +189,10 @@ func (d *Daemon) WaitUntilState(expected types.DaemonState) error { | |||
d.ID(), expected, state) | |||
} | |||
|
|||
if exportErr := exporter.RecordDaemonEvent(d.ID(), string(expected)); exportErr != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we rename exporter to collector, which means the metrics is not exported to other system yet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, finished.
pkg/metrics/exporter/export.go
Outdated
@@ -26,6 +24,8 @@ type Exporter struct { | |||
outputFile string | |||
} | |||
|
|||
var globalExporter *Exporter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The exporter should be defined as a interface to other system to fetch metrics via HTTP or RPC
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, please take a look.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise, looks good to me
05c4878
to
ef8dc62
Compare
Collect the metrics of nydus daemon events, including INIT, RUNNING and DIED. Signed-off-by: Bin Tang <tangbin.bin@bytedance.com>
ef8dc62
to
fe511b8
Compare
Collect the metrics of nydus daemon events, including INIT, RUNNING and DIED.
Result:
Signed-off-by: Bin Tang tangbin.bin@bytedance.com