add per_cpu option to load scraper for hostmetricsreceiver #5243
Conversation
Hello @jpkrohling, here is an output with more information:
@jpkrohling sorry again, sadly the command eventually ends with an error for the opencensus receiver:
Co-authored-by: Mark Stumpf <mstumpf@splunk.com>
I tried to add a test for the perCPU option. The tests pass locally but the CI fails: https://app.circleci.com/pipelines/github/open-telemetry/opentelemetry-collector-contrib/19852/workflows/f3ff8a45-0d7f-4a8f-8cde-4dd89af2bf5a/jobs/160845 — it seems unrelated to my changes, but please tell me if I missed something.
receiver/hostmetricsreceiver/internal/scraper/loadscraper/load_scraper.go
receiver/hostmetricsreceiver/internal/scraper/loadscraper/load_scraper_test.go
This PR was marked stale due to lack of activity. It will be closed in 7 days.
```diff
@@ -19,4 +19,7 @@ import "github.com/open-telemetry/opentelemetry-collector-contrib/receiver/hostm
 // Config relating to Load Metric Scraper.
 type Config struct {
 	internal.ConfigSettings `mapstructure:",squash"` // squash ensures fields are correctly decoded in embedded struct
+
+	// If true, metrics will be average load per cpu
+	PerCPU bool `mapstructure:"per_cpu"`
```
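For illustration only, here is a minimal sketch of how this option would be enabled in a collector configuration, assuming it is exposed under the load scraper with the `per_cpu` key suggested by the mapstructure tag above (the surrounding hostmetrics keys mirror the workaround config shown later in this thread):

```yaml
receivers:
  hostmetrics:
    collection_interval: 10s
    scrapers:
      load:
        per_cpu: true   # report load averaged over the number of logical CPUs
```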
This is not really "per_cpu" (which to me means broken down by CPU core); it is more or less the average per CPU, correct?
This is absolutely correct. I confess I simply copied the existing option from the Smart Agent: https://docs.signalfx.com/en/latest/integrations/agent/monitors/load.html#configuration
Do you want me to change it to something like `average_per_cpu`?
```diff
@@ -63,6 +64,13 @@ func (s *scraper) scrape(_ context.Context) (pdata.MetricSlice, error) {
 		return metrics, scrapererror.NewPartialScrapeError(err, metricsLen)
 	}
 
+	if s.config.PerCPU {
+		divisor := float64(runtime.NumCPU())
```
Would it be better to expose "num_cpu" as an individual metric, so that people interested in this can use the backend of their choice, or a "processor", to compute it? What do you think?
Hi @bogdandrutu, this is a legitimate but difficult question ^^
The best I can do is explain the full goal of this PR and let the maintainers of this project decide.
context
We are a customer of the former SignalFx product, acquired by Splunk more than 2 years ago I think.
We are trying to migrate from the deprecated Smart Agent to the OpenTelemetry Collector.
In my opinion this deprecation by Splunk was a bit rushed and left its customers with a migration that is difficult to achieve, given the lack of documentation for SignalFx / Splunk specific requirements and common usages (which is not your problem!).
This PR is a tiny fragment of the work I am doing to make this migration seamless and transparent.
goal
About the load metrics, we come from the Smart Agent load monitor: https://docs.signalfx.com/en/latest/integrations/agent/monitors/load.html which provides an option to average the load metrics over the number of CPUs.
This PR aims to bring a similar feature, for parity between the SignalFx Smart Agent and the OpenTelemetry Collector, so that we can use the otel receiver while keeping all existing dependent resources (charts, detectors, ...) working.
workaround
For now, as a workaround, I still use the Smart Agent monitor instead of the hostmetrics receiver:
```yaml
receivers:
  hostmetrics:
    collection_interval: 10s
    scrapers:
      cpu:
      disk:
      filesystem:
      memory:
      network:
      #load:
      paging:
      processes:
  smartagent/load:
    type: load
    perCPU: true
```
but I would prefer to drop the Smart Agent monitors in favor of the otel receivers if possible.
your suggestion
Your suggestion obviously makes sense, and in fact I could already compute the load average per CPU outside the receiver, because I already have everything I need.
Indeed, the SignalFx exporter for the OTel Collector already exposes a metric for the number of CPUs here:
opentelemetry-collector-contrib/exporter/signalfxexporter/internal/translation/constants.go, line 123 in 4f91eb0 (`# compute cpu.num_processors`)
so I can basically divide the load metric from the hostmetrics receiver by this `cpu.num_processors` and get a value similar to what I had with the Smart Agent load monitor.
the problem
Doing it this way would force users to update all existing resources to use a more complex query to calculate the load averaged per CPU.
The problem is that we have a lot of customers across different SignalFx organizations and we do not manage every resource ourselves. Beyond the complexity of tracking, detecting and updating resources out of our scope, this is also tricky due to contractual / responsibility / permission considerations.
Even for the resources that we manage properly ourselves (IaC, git, terraform, etc.) and that are our responsibility, it can be tedious to update existing resources without introducing new sources of mistakes. For example, our detector that creates alerts on load metrics is very simple: https://github.com/claranet/terraform-signalfx-detectors/blob/master/modules/smart-agent_system-common/detectors-gen.tf#L80
We explicitly instruct users of this module to enable the `perCPU: true` option and everything is good.
We could change this module to modify the query and perform the division with `cpu.num_processors`. That way, the detector would be compatible with both the otel collector and the Smart Agent.
Nevertheless, if I do this, I know some users will upgrade to the new version of the module without removing the `perCPU` option from the agent configuration and so will silently break their detector ...
conclusion
It is a mess!
Honestly, I fully understand if you do not want to integrate a "useless" feature in the otel collector, especially if it is only for a specific vendor.
But given this full explanation, this seems to me to be the least disruptive, safest and most straightforward way to handle my need.
I would also like to argue that:
- this option can still be useful to others (non-Splunk customers)
- it could ease the setup of monitoring in general
- it results in fewer required metrics (1 instead of 2), which can matter both for vendors like Splunk, where each metric can increase the price, and for open-source based solutions, where storage of metrics remains the most difficult part

OK, I am not sure of the relevance of these arguments, but nothing ventured, nothing gained :)
**Description:** Following up #5243, the `make` command fails on the `opencensus` receiver, caused by a timeout. **Link to tracking Issue:** #5243 (comment) **Testing:** Run the `make` command **Documentation:** no change **Additional information:** here is an example of the duration on my laptop:
```
time make -C ./receiver/opencensusreceiver test
make: Entering directory '/home/qmanfroi/git/signalfx/opentelemetry-collector-contrib/receiver/opencensusreceiver'
go test -race -timeout 300s --tags=containers_image_openpgp,exclude_graphdriver_btrfs,exclude_graphdriver_devicemapper ./...
ok      github.com/open-telemetry/opentelemetry-collector-contrib/receiver/opencensusreceiver                      140.310s
ok      github.com/open-telemetry/opentelemetry-collector-contrib/receiver/opencensusreceiver/internal/ocmetrics   0.329s
ok      github.com/open-telemetry/opentelemetry-collector-contrib/receiver/opencensusreceiver/internal/octrace     0.614s
make: Leaving directory '/home/qmanfroi/git/signalfx/opentelemetry-collector-contrib/receiver/opencensusreceiver'

real    2m21.503s
user    0m4.198s
sys     0m1.047s
```
here is the specification of my laptop:
* CPU: Intel(R) Core(TM) i7-8550U
* Memory: 16GB
* Disk: PM981 NVMe Samsung 512GB
* OS: Arch Linux
@dmitryax can we provide this as a translation rule in the signalfx exporter?
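For context, such a translation rule would presumably look something like the sketch below. This is only an illustration built on assumptions: it relies on the exporter's `calculate_new_metric` action and the hostmetrics load metric name `system.cpu.load_average.1m`, and the output metric name is hypothetical; the exact action and field names should be checked against the exporter's constants.go and documentation.

```yaml
exporters:
  signalfx:
    translation_rules:
      # Assumed action/fields modeled on the exporter's default rules; verify before use.
      - action: calculate_new_metric
        metric_name: system.cpu.load_average.1m.per_cpu   # hypothetical output metric name
        operand1_metric: system.cpu.load_average.1m
        operand2_metric: cpu.num_processors
        operator: /
```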
This PR was marked stale due to lack of activity. It will be closed in 7 days.
This PR was marked stale due to lack of activity. It will be closed in 7 days.
This PR was marked stale due to lack of activity. It will be closed in 7 days.
receiver/hostmetricsreceiver/internal/scraper/loadscraper/load_scraper.go
+1 to this. The sysinfo load stats are already aggregates, so adding a helpful option to avoid pushing further computation to user config or providers seems reasonably in line with the overall intention of this scraper to me.
Co-authored-by: Ryan Fitzpatrick <rmfitzpatrick@users.noreply.github.com>
This PR was marked stale due to lack of activity. It will be closed in 7 days.
Hello @bogdandrutu @jpkrohling
I can do another review, but looks like there's still a pending discussion with @bogdandrutu: #5243 (comment). Once that is resolved, ping me and I'll review this.
This PR was marked stale due to lack of activity. It will be closed in 7 days.
Closed as inactive. Feel free to reopen if this PR is still being worked on.
@xp-1000 would you mind rebasing this to keep the PR active? If I'm able to push to your branch I'd be happy to help with this as well. @bogdandrutu your review would be appreciated. This flag is a helpful way to avoid offloading metric transformation to the end user and is one that Splunk GDI would like to see land.
Hi @rmfitzpatrick I rebased my fork and invited you to the project. I don't think I can reopen this PR myself though.
@xp-1000 apologies for the delay (I've been afk); I missed your invitation, which is now expired :(. Would you mind trying once again? (edit: I'm opting not to duplicate this PR so as not to lose your context or the conversation, but I can if you'd prefer that route.)
Hello @rmfitzpatrick, the new invitation is sent.
Hello @bogdandrutu and @dmitryax, while waiting for your decision between this solution and using the signalfx exporter translation rules, could you reopen this PR please?
Looks like I can't reopen this, it says that "the branch was force-pushed or recreated". Are you able to send in a new PR?
ok @jpkrohling thanks for trying, I opened a new one here: #6999
**Description:** Add an optional `per_cpu` option to the load scraper of the hostmetricsreceiver to produce the average load per CPU, which should ease the creation of alerts for users, without them having to divide the metric themselves (when another metric is available for the CPU count).

**Testing:** no test added. I read https://github.com/open-telemetry/opentelemetry-collector/blob/main/CONTRIBUTING.md but I confess it is not clear to me, help is welcome:
- hostmetricsreceiver is not in the list https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/Makefile#L18
- the `make` command currently fails in my env from the main branch
- the `test` target does not seem to support running the tests for only one receiver (to bypass the existing error), which would allow me to test: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/Makefile.Common#L49

**Documentation:** added a config block for load including the new option