feat(ethtool): Gather statistics from namespaces #11895
bb4b613 to d64989e
Update: I'm not sure what changed, but the PR checks are all passing now.
Thanks for the PR, some questions and comments below!
if err := netns.Set(initialNamespace); err != nil {
	c.Log.Errorf("Could not return to initial namespace: %s", err)
	return nil, err
}
My concern with this is that if an existing user has created some additional network namespaces, they will suddenly start getting "operation not permitted" errors from Telegraf, right?
As an example, I created one namespace and ran again without elevated permissions, and it errored out, whereas before I was collecting metrics without issue:
2022-09-28T14:39:04Z W! [inputs.ethtool] Could not switch to namespace vnet0: operation not permitted
2022-09-28T14:39:04Z E! [inputs.ethtool] Could not return to initial namespace: operation not permitted
2022-09-28T14:39:04Z E! [inputs.ethtool] Error in plugin: operation not permitted
Hmm. To satisfy the CI linter, I did make some changes that I didn't put through my local testing. I wonder if that changed things. The intention is that unless at least one of the new namespace filters is configured, behavior should be unchanged from before this change.
Let me investigate...
Found and fixed the bug. It should be working as intended now (extra warnings in the logs, but functions as before otherwise).
9861d16 to a1db0ff
namespaces, err := os.ReadDir("/var/run/netns")
if err != nil {
	c.Log.Warnf("Could not find namespace directory: %s", err)
	return allInterfaces, nil
}
Should the filtering of the interfaces happen before we hit this point? That way Telegraf only reads this directory and attempts to change namespaces if we have a list of namespaces to actually review.
That's one option.
It would move even more of the logic out of the abstraction layer and into the implementation layer, which also makes the unit tests less useful because they test less of the runtime logic. That is why I didn't go with that approach initially.
But it would be more performant overall to filter the interfaces and namespaces sooner and, in the namespace case, it would reduce the annoying log warning spam.
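For illustration, here is a rough sketch of the "filter first" option discussed in this thread. It is not the code from this PR: `selectNamespaces`, `matchesAny`, and the injected `list` callback are hypothetical names, and it uses `path.Match` globs from the standard library as a stand-in for whatever filter the plugin really uses. The point it shows is that the namespace directory is only listed when at least one filter is configured.

```go
package main

import (
	"fmt"
	"path"
)

// matchesAny reports whether name matches at least one glob pattern.
// Malformed patterns are treated as non-matching in this sketch.
func matchesAny(patterns []string, name string) bool {
	for _, p := range patterns {
		if ok, err := path.Match(p, name); err == nil && ok {
			return true
		}
	}
	return false
}

// selectNamespaces returns the namespaces to visit. The list callback
// (hypothetical; e.g. a wrapper around reading the netns directory) is
// called only when at least one filter is configured.
func selectNamespaces(include, exclude []string, list func() ([]string, error)) ([]string, error) {
	if len(include) == 0 && len(exclude) == 0 {
		// No filters: keep the default behavior and skip the directory read.
		return nil, nil
	}
	names, err := list()
	if err != nil {
		return nil, err
	}
	var selected []string
	for _, n := range names {
		if len(include) > 0 && !matchesAny(include, n) {
			continue
		}
		if matchesAny(exclude, n) {
			continue
		}
		selected = append(selected, n)
	}
	return selected, nil
}

func main() {
	// Fixed list standing in for the real directory contents.
	list := func() ([]string, error) {
		return []string{"vnet0", "vnet1", "mgmt"}, nil
	}
	selected, err := selectNamespaces([]string{"vnet*"}, []string{"vnet1"}, list)
	fmt.Println(selected, err)
}
```

Injecting the lister keeps the selection logic testable without touching the filesystem, which may ease the unit-test concern raised above.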
Further testing revealed that this approach has a very noticeable performance impact on my workload. I'm going to see what can be done to improve performance.
a1db0ff to a0b6bfb
Okay, I think this is ready again. It's a more major refactor, but it now creates an OS thread per namespace. This saves significant CPU time at the cost of some extra memory. It also makes it easier to have cleaner logs in the case where the default (non-namespaced) behavior is desired.
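To make the thread-per-namespace idea concrete, here is a minimal sketch of the general pattern, not the code in this PR. `nsWorker`, `startWorker`, and the `enterFunc` hook are hypothetical names; a real `enterFunc` would switch the calling thread into the namespace, while the one in `main` only prints. The sketch pins one goroutine to an OS thread with `runtime.LockOSThread`, enters the namespace once, and then runs submitted jobs on that thread, so no namespace switching is needed per gather.

```go
package main

import (
	"fmt"
	"runtime"
)

// enterFunc switches the calling OS thread into the named namespace.
// Hypothetical hook: a real implementation might wrap a setns(2) call.
type enterFunc func(namespace string) error

// nsWorker runs submitted functions on one OS thread dedicated to a namespace.
type nsWorker struct {
	jobs chan func()
}

// startWorker launches the dedicated thread and waits until it has entered
// the namespace (or failed to).
func startWorker(namespace string, enter enterFunc) (*nsWorker, error) {
	w := &nsWorker{jobs: make(chan func())}
	ready := make(chan error, 1)
	go func() {
		// Pin this goroutine to its OS thread. It is never unlocked, so the
		// Go runtime terminates the thread when the goroutine exits instead
		// of reusing a thread that sits in another namespace.
		runtime.LockOSThread()
		if err := enter(namespace); err != nil {
			ready <- err
			return
		}
		ready <- nil
		for job := range w.jobs {
			job()
		}
	}()
	if err := <-ready; err != nil {
		return nil, err
	}
	return w, nil
}

// Do runs fn on the worker's thread and blocks until it returns.
func (w *nsWorker) Do(fn func()) {
	done := make(chan struct{})
	w.jobs <- func() {
		defer close(done)
		fn()
	}
	<-done
}

// Stop ends the worker; its OS thread is discarded by the runtime.
func (w *nsWorker) Stop() { close(w.jobs) }

func main() {
	// Stand-in for a real namespace switch; it only logs.
	fakeEnter := func(ns string) error {
		fmt.Println("entering", ns)
		return nil
	}
	w, err := startWorker("vnet0", fakeEnter)
	if err != nil {
		fmt.Println("could not start worker:", err)
		return
	}
	defer w.Stop()
	w.Do(func() { fmt.Println("gathering stats inside vnet0") })
}
```

The trade-off matches the one described above: each worker holds an OS thread for its lifetime (extra memory), but gathers avoid repeated namespace switches.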
Thanks for the updates! I have a couple minor comments.
I do want to chat with others on the team about whether this should possibly be a separate plugin. I don't want the additional namespace tooling and library to get in the way of what is a rather simple plugin. But this might be fine as is.
I think ideally, this should become a more global configuration option with an easy way for plugins to take advantage of the namespace logic to do their thing. Many more plugins than just ethtool have the same limitation/issue. This is just (currently) the only networking plugin used in my project, so it's the only one where we need namespace support.
a0b6bfb to 2acdc38
5792e40 to 58d7abd
To support monitoring an entire host with one telegraf process, or any number of other uses of namespaced network interfaces, the ethtool plugin now supports gathering statistics from interfaces in additional namespaces. This functionality must be enabled by adding (at least) one of the new namespace filters to the plugin configuration, and requires adding CAP_SYS_ADMIN to the telegraf process. Resolves influxdata#11754
3be205e to d75d718
Looks good to me. Thanks for your work @zeffron!
Required for all PRs
resolves #11754
Added network namespace support to the ethtool plugin. By default, it continues to operate as it always has, gathering metrics from the initial namespace only. To gather metrics from additional namespaces, CAP_SYS_ADMIN must be added to the telegraf process, and at least one of the new namespace filters must be configured.
Unit tests for the new namespace support also require running with CAP_SYS_ADMIN. They will be skipped if CAP_SYS_ADMIN is missing. The original unit tests have been updated to do minimal namespace support testing and do not require CAP_SYS_ADMIN.
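As a rough illustration of how a process can tell whether it holds CAP_SYS_ADMIN, the sketch below reads the effective capability mask (the CapEff line in /proc/self/status on Linux) and tests bit 21, which is the CAP_SYS_ADMIN number in linux/capability.h. This is not necessarily how the tests in this PR detect the capability; `hasCap` and `effectiveCaps` are hypothetical helper names.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// capSysAdmin is the bit number of CAP_SYS_ADMIN in linux/capability.h.
const capSysAdmin = 21

// hasCap reports whether the given bit is set in a hexadecimal capability
// mask, such as the CapEff value from /proc/self/status.
func hasCap(hexMask string, bit uint) (bool, error) {
	mask, err := strconv.ParseUint(strings.TrimSpace(hexMask), 16, 64)
	if err != nil {
		return false, err
	}
	return mask&(1<<bit) != 0, nil
}

// effectiveCaps returns the CapEff value of the current process (Linux only).
func effectiveCaps() (string, error) {
	f, err := os.Open("/proc/self/status")
	if err != nil {
		return "", err
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "CapEff:") {
			return strings.TrimSpace(strings.TrimPrefix(line, "CapEff:")), nil
		}
	}
	if err := sc.Err(); err != nil {
		return "", err
	}
	return "", fmt.Errorf("CapEff not found in /proc/self/status")
}

func main() {
	mask, err := effectiveCaps()
	if err != nil {
		fmt.Println("cannot read capabilities:", err)
		return
	}
	ok, err := hasCap(mask, capSysAdmin)
	fmt.Println("CAP_SYS_ADMIN present:", ok, "error:", err)
}
```

In a test, a check like this could guard a call to `t.Skip` so that the namespace tests are skipped rather than failed when the capability is absent.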