Add a metric to provide missed events per type #1674

Merged: 2 commits, Nov 2, 2023

@tpapagian tpapagian commented Oct 30, 2023

Example:

$ curl localhost:2112/metrics 2> /dev/null | grep 'sent_events_total\|missed_events_total\|ringbuf_perf_event_lost_total\|ringbuf_queue_lost_total\|msg_op_total\|ringbuf_queue_received_total'
# HELP tetragon_missed_events_total The total number of Tetragon events per type that are failed to sent from the kernel.
# TYPE tetragon_missed_events_total counter
tetragon_missed_events_total{msg_op="13"} 73300
tetragon_missed_events_total{msg_op="23"} 28
tetragon_missed_events_total{msg_op="24"} 606
tetragon_missed_events_total{msg_op="5"} 20
tetragon_missed_events_total{msg_op="7"} 22
# HELP tetragon_msg_op_total The total number of times we encounter a given message opcode. For internal use only.
# TYPE tetragon_msg_op_total counter
tetragon_msg_op_total{msg_op="13"} 4.268532e+06
tetragon_msg_op_total{msg_op="23"} 12444
tetragon_msg_op_total{msg_op="24"} 2110
tetragon_msg_op_total{msg_op="5"} 11908
tetragon_msg_op_total{msg_op="7"} 12447
# HELP tetragon_ringbuf_perf_event_lost_total The total number of Tetragon ringbuf perf events lost.
# TYPE tetragon_ringbuf_perf_event_lost_total counter
tetragon_ringbuf_perf_event_lost_total 73976
# HELP tetragon_ringbuf_queue_lost_total The total number of Tetragon events ring buffer queue lost.
# TYPE tetragon_ringbuf_queue_lost_total counter
tetragon_ringbuf_queue_lost_total 0
# HELP tetragon_ringbuf_queue_received_total The total number of Tetragon events ring buffer queue received.
# TYPE tetragon_ringbuf_queue_received_total counter
tetragon_ringbuf_queue_received_total 4.307441e+06

This PR adds an eBPF map collector that reads metrics directly from an eBPF map. The map records information about the return values of all perf_event_output calls, i.e. whether each call failed. This lets us determine the number of missed events per type, which the new metric tetragon_missed_events_total exposes.

In the example output above, user space reports 73976 lost events (tetragon_ringbuf_perf_event_lost_total). This matches the sum of all tetragon_missed_events_total values gathered from the kernel: 73300 + 28 + 606 + 20 + 22 = 73976.

netlify bot commented Oct 30, 2023

Deploy Preview for tetragon ready!

Name Link
🔨 Latest commit 097b35f
🔍 Latest deploy log https://app.netlify.com/sites/tetragon/deploys/6540b5a1ff11ee0007cad97a
😎 Deploy Preview https://deploy-preview-1674--tetragon.netlify.app

@tpapagian tpapagian added the area/metrics (Related to prometheus metrics) and release-note/minor (This PR introduces a minor user-visible change) labels Oct 30, 2023
@tpapagian tpapagian force-pushed the pr/apapag/ebpf_metrics branch 2 times, most recently from 1c5ca91 to 0f24125 Compare October 30, 2023 12:50
@tpapagian tpapagian changed the title Add a metric to provide per-event missed events Add a metric to provide missed events per type Oct 30, 2023
pkg/metrics/metricsconfig/initmetrics.go (outdated, resolved)
pkg/bpfmetrics/metrics.go (outdated, resolved)
@tpapagian tpapagian force-pushed the pr/apapag/ebpf_metrics branch from 0f24125 to a2ab538 Compare October 30, 2023 14:35
@tpapagian tpapagian marked this pull request as ready for review October 30, 2023 14:53
@tpapagian tpapagian requested a review from a team as a code owner October 30, 2023 14:53
@tpapagian tpapagian requested a review from kkourt October 30, 2023 14:53
@lambdanis lambdanis left a comment

I left some comments :) The most important one being that it looks like we want to expose two different metric names here.

bpf/lib/process.h (outdated, resolved)
bpf/process/types/basic.h (outdated, resolved)
pkg/metrics/eventmetrics/collector.go (resolved)
pkg/metrics/eventmetrics/collector.go (outdated, resolved)
pkg/metrics/eventmetrics/collector.go (outdated, resolved)
Signed-off-by: Anastasios Papagiannis <tasos.papagiannnis@gmail.com>
@tpapagian tpapagian force-pushed the pr/apapag/ebpf_metrics branch from a2ab538 to 097b35f Compare October 31, 2023 08:06
@tpapagian

> I left some comments :) The most important one being that it looks like we want to expose two different metric names here.

Thanks for the review! I have made the changes that you proposed.

@lambdanis lambdanis left a comment

looks good to me

@kkourt kkourt left a comment

Nice! I think this will help with debugging.
I have some small comments, PTAL.

bpf/cgroup/bpf_cgroup_events.h (outdated, resolved)
bpf/lib/process.h (outdated, resolved)
bpf/process/types/basic.h (outdated, resolved)
@tpapagian tpapagian force-pushed the pr/apapag/ebpf_metrics branch 2 times, most recently from 1c293ad to f294f5d Compare November 1, 2023 09:13
bpf/lib/process.h (outdated, resolved)
@tpapagian tpapagian force-pushed the pr/apapag/ebpf_metrics branch 3 times, most recently from ad0ea1c to d1c041a Compare November 1, 2023 09:38
Signed-off-by: Anastasios Papagiannis <tasos.papagiannnis@gmail.com>
@tpapagian tpapagian force-pushed the pr/apapag/ebpf_metrics branch from d1c041a to 19778a4 Compare November 1, 2023 10:30
@kkourt kkourt added the needs-backport/1.0 (This PR needs backporting to 1.0) label Nov 2, 2023

kkourt commented Nov 2, 2023

Thanks! I think this will be very useful to have, which is why I added a backport label for 1.0.

@kkourt kkourt merged commit d5a7ee2 into main Nov 2, 2023
@kkourt kkourt deleted the pr/apapag/ebpf_metrics branch November 2, 2023 09:26
@tpapagian

Backport PR: #1702

@tpapagian tpapagian added the backport-pending/1.0 label and removed the needs-backport/1.0 (This PR needs backporting to 1.0) label Nov 2, 2023
Labels
area/metrics (Related to prometheus metrics), backport-done/1.0, release-note/minor (This PR introduces a minor user-visible change)