Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

metrics: Improve monitoring BPF maps and userspace caches #1950

Merged
merged 5 commits into from
Mar 1, 2024

Conversation

sadath-12
Copy link
Contributor

@sadath-12 sadath-12 commented Jan 9, 2024

  • Remove tetragon_map_drops_total
  • Remove tetragon_map_in_use_gauge{map="eventcache"}
  • Replace tetragon_map_in_use_gauge{map="processLru"} with tetragon_process_cache_size
  • Improve doc comments and help texts for some metrics
Fix metrics related to monitoring BPF maps and userspace caches. Remove `tetragon_map_drops_total` (it was duplicating `tetragon_errors_total{type="process_cache_evicted"}`). Remove `tetragon_map_in_use_gauge{map="eventcache"}` (event cache is not a BPF map). Replace `tetragon_map_in_use_gauge{map="processLru"}` with `tetragon_process_cache_size` (process cache is not a BPF map).

Ref #1774

@sadath-12 sadath-12 requested a review from a team as a code owner January 9, 2024 12:01
@sadath-12 sadath-12 requested a review from tpapagian January 9, 2024 12:01
@sadath-12
Copy link
Contributor Author

WIP !

@kkourt kkourt marked this pull request as draft January 9, 2024 12:03
@kkourt
Copy link
Contributor

kkourt commented Jan 9, 2024

Moved to draft since it seems to be WIP.

@sadath-12
Copy link
Contributor Author

Moved to draft since it seems to be WIP.

yes thank you @kkourt

@sadath-12 sadath-12 mentioned this pull request Jan 9, 2024
6 tasks
@sadath-12 sadath-12 changed the title metrics: correction Fix: monitoring BPF maps Jan 9, 2024
@sadath-12 sadath-12 force-pushed the metrics-1 branch 2 times, most recently from 3fa3d38 to dce5732 Compare January 9, 2024 13:05
@sadath-12 sadath-12 marked this pull request as ready for review January 9, 2024 13:28
@sadath-12
Copy link
Contributor Author

ready to review

@lambdanis
Copy link
Contributor

@sadath-12 I see your PR has conflicts with the main branch, can you rebase?

@lambdanis lambdanis added area/metrics Related to prometheus metrics release-note/bug This PR fixes an issue in a previous release of Tetragon. labels Jan 10, 2024
@sadath-12
Copy link
Contributor Author

Done @lambdanis

@lambdanis
Copy link
Contributor

lambdanis commented Jan 12, 2024

@sadath-12 It seems commits are still a bit off. You can check this in the checkpath action run (here).

There should be no merge commits, and ideally different tasks you implemented in this PR should be separate commits - they're quite independent of each other, so separate commits would make it easier to review the git history. I think it would be good to rewrite the git history here:

git reset HEAD~2
git add -p
# add only changes for a single task
git commit
# repeat git add, git commit for other tasks

@lambdanis lambdanis self-assigned this Jan 12, 2024
Copy link

netlify bot commented Jan 15, 2024

Deploy Preview for tetragon ready!

Name Link
🔨 Latest commit 5305df2
🔍 Latest deploy log https://app.netlify.com/sites/tetragon/deploys/65e161822f87730008423a67
😎 Deploy Preview https://deploy-preview-1950--tetragon.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@sadath-12 sadath-12 force-pushed the metrics-1 branch 4 times, most recently from cee5c96 to c691609 Compare January 16, 2024 15:24
@sadath-12 sadath-12 requested a review from mtardy as a code owner January 16, 2024 15:24
@willfindlay willfindlay requested a review from lambdanis January 22, 2024 14:52
@willfindlay
Copy link
Contributor

Please fix the merge commit to have a proper description. Also <90619459+sadath-12@users.noreply.github.com> seems to be messing up our signed-off-by detection. Could you replace it with your email like you did for the other commit?

Copy link
Contributor

@lambdanis lambdanis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates! I left a few comments requesting changes.

Could you also update the commit messages to be a bit more precise, describing which metric is modified exactly? Messages like removed eventcache are confusing, as eventcache itself stays there.

pkg/metrics/mapmetrics/mapmetrics.go Outdated Show resolved Hide resolved
pkg/process/cache.go Show resolved Hide resolved
pkg/metrics/eventcachemetrics/eventcachemetrics.go Outdated Show resolved Hide resolved
pkg/metrics/eventcachemetrics/eventcachemetrics.go Outdated Show resolved Hide resolved
pkg/metrics/eventcachemetrics/eventcachemetrics.go Outdated Show resolved Hide resolved
@sadath-12 sadath-12 force-pushed the metrics-1 branch 6 times, most recently from 981ac45 to 51688bb Compare February 22, 2024 12:45
@lambdanis
Copy link
Contributor

@sadath-12 I see the CI is failing because of a commit message being too long:
https://github.com/cilium/tetragon/actions/runs/8004683889/job/21862554217?pr=1950

Could you rephrase it to fit the characters limit?

@sadath-12
Copy link
Contributor Author

@sadath-12 I see the CI is failing because of a commit message being too long: https://github.com/cilium/tetragon/actions/runs/8004683889/job/21862554217?pr=1950

Could you rephrase it to fit the characters limit?

ya reworded that . Tried to explain in description without making much sense (but enough) in the title of commit msg

@jrfastab
Copy link
Contributor

@lambdanis @willfindlay could you take a look and merge if its ready to go? Thanks. CI failure there is something unrelated, I see this failure from time to time.

@sadath-12 sadath-12 requested a review from lambdanis February 29, 2024 12:54
Copy link
Contributor

@lambdanis lambdanis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One request for a fix in a metrics description, otherwise looks good.

pkg/metrics/mapmetrics/mapmetrics.go Outdated Show resolved Hide resolved
@sadath-12 sadath-12 requested a review from lambdanis March 1, 2024 05:03
@lambdanis lambdanis force-pushed the metrics-1 branch 2 times, most recently from 4c8ca78 to e520fbd Compare March 1, 2024 11:41
sadath-12 and others added 5 commits March 1, 2024 11:50
Signed-off-by: sadath-12 <sadathsadu2002@gmail.com>
This metric was counting process cache evictions, so effectively duplicating
tetragon_errors_total{type="process_cache_evicted"}. Additionally, having it
in the mapmetrics package was confusing, as it's monitoring a userspace
process cache, not BPF maps. Let's remove it.

Co-authored-by: sadath-12 <sadathsadu2002@gmail.com>
Signed-off-by: sadath-12 <sadathsadu2002@gmail.com>
Signed-off-by: Anna Kapuscinska <anna@isovalent.com>
Using the map size metric for reporting the event cache size was confusing, as
event cache is not a BPF map. Let's remove it.

A metric reporting the event cache size will be added in a follow-up commit.

Co-authored-by: sadath-12 <sadathsadu2002@gmail.com>
Signed-off-by: sadath-12 <sadathsadu2002@gmail.com>
Signed-off-by: Anna Kapuscinska <anna@isovalent.com>
Some of them were incorrect.

Co-authored-by: sadath-12 <sadathsadu2002@gmail.com>
Signed-off-by: sadath-12 <sadathsadu2002@gmail.com>
Signed-off-by: Anna Kapuscinska <anna@isovalent.com>
Replace tetragon_map_in_use_gauge{map="processLru"} with
tetragon_process_cache_size. It's reporting the process cache size. Using the
map size metric for this purpose was confusing, as process cache is not a BPF
map.

Co-authored-by: sadath-12 <sadathsadu2002@gmail.com>
Signed-off-by: sadath-12 <sadathsadu2002@gmail.com>
Signed-off-by: Anna Kapuscinska <anna@isovalent.com>
@lambdanis lambdanis changed the title Fix: monitoring BPF maps metrics: Improve monitoring caches and BPF maps Mar 1, 2024
@lambdanis
Copy link
Contributor

Thanks @sadath-12 for the updates. The code changes look good now. I rewrote the commit messages to describe exactly which metrics changed and split some commits to not mix up different changes. I dropped the _gauge suffix removal, as it's a breaking change that's hard to justify as a bug, so it shouldn't be mixed up in one PR with unrelated changes IMO. I also rewrote the PR description to include a release note.

@lambdanis lambdanis changed the title metrics: Improve monitoring caches and BPF maps metrics: Improve monitoring BPF maps and userspace caches Mar 1, 2024
@lambdanis
Copy link
Contributor

The CI failure is a known issue: #2010. I'm merging this PR.

@lambdanis lambdanis merged commit e34fcec into cilium:main Mar 1, 2024
33 of 34 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/metrics Related to prometheus metrics release-note/bug This PR fixes an issue in a previous release of Tetragon.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants