add per_cpu option to load scraper for hostmetricsreceiver #5243

Closed · wants to merge 7 commits · Changes from 6 commits
9 changes: 9 additions & 0 deletions receiver/hostmetricsreceiver/README.md
@@ -64,6 +64,15 @@ filesystem:
match_type: <strict|regexp>
```

### Load

`per_cpu` divides the load averages by the number of CPUs (default: `false`).

```yaml
load:
per_cpu: <false|true>
```

### Network

```yaml
@@ -19,4 +19,7 @@ import "github.com/open-telemetry/opentelemetry-collector-contrib/receiver/hostm
// Config relating to Load Metric Scraper.
type Config struct {
internal.ConfigSettings `mapstructure:",squash"` // squash ensures fields are correctly decoded in embedded struct

// If true, the reported metrics will be the average load per CPU
PerCPU bool `mapstructure:"per_cpu"`
Member:
This is not really "per_cpu" (which to me means broken down by CPU core); it is more like an average per CPU, correct?

Contributor Author:
This is absolutely correct. I confess I simply copied the existing option from the Smart Agent: https://docs.signalfx.com/en/latest/integrations/agent/monitors/load.html#configuration

Do you want me to change it to something like `average_per_cpu`?

}
@@ -16,6 +16,7 @@ package loadscraper

import (
"context"
"runtime"
"time"

"github.com/shirou/gopsutil/load"
@@ -63,6 +64,13 @@ func (s *scraper) scrape(_ context.Context) (pdata.MetricSlice, error) {
return metrics, scrapererror.NewPartialScrapeError(err, metricsLen)
}

if s.config.PerCPU {
divisor := float64(runtime.NumCPU())
Member:

Would it be better to expose "num_cpu" as an individual metric? People interested in this could then compute the ratio using the backend of their choice, or a "processor". What do you think?

Contributor Author (@xp-1000), Sep 29, 2021:

Hi @bogdandrutu, this is a legitimate but difficult question ^^
The best I can do is explain the full goal of this PR and let the maintainers of this project decide.

context

We are a customer of the former SignalFx product, acquired by Splunk more than 2 years ago I think.
We are trying to migrate from the deprecated Smart Agent to the OpenTelemetry Collector.

In my opinion this deprecation by Splunk was a bit rushed and left its customers with a migration that is difficult to achieve, given the lack of documentation for SignalFx / Splunk specific requirements and common usages (which is not your problem!).

This PR is a tiny fragment of the work I am doing to make this migration seamless and transparent.

goal

Regarding the load metrics, we come from the Smart Agent load monitor (https://docs.signalfx.com/en/latest/integrations/agent/monitors/load.html), which provides an option to average the load metrics over the number of CPUs.

This PR aims to bring a similar feature, for parity between the SignalFx Smart Agent and the OpenTelemetry Collector, so that we can use the otel receiver while keeping all existing dependent resources (charts, detectors, ...) working.

workaround

For now, as a workaround, I still use the Smart Agent monitor instead of the hostmetrics load scraper:

```yaml
receivers:
  hostmetrics:
    collection_interval: 10s
    scrapers:
      cpu:
      disk:
      filesystem:
      memory:
      network:
      #load:
      paging:
      processes:
  smartagent/load:
    type: load
    perCPU: true
```

But I would prefer to drop the Smart Agent monitors in favor of the otel receivers if possible.

your suggestion

Your suggestion obviously makes sense, and in fact I could already compute the average load per CPU outside the receiver, because I already have everything I need.

Indeed, the SignalFx exporter for the otel Collector already exposes a metric for the number of CPUs here:

so I can basically divide the load metric from the hostmetrics receiver by this `cpu.num_processors` and get a value similar to what I had with the Smart Agent load monitor.
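
For illustration only, a minimal SignalFlow sketch of that division (the metric name `system.cpu.load_average.1m` and the published label are assumptions on my part; `cpu.num_processors` is the exporter metric mentioned above):

```python
# Hypothetical SignalFlow program, not part of this PR: approximate the Smart
# Agent's perCPU behavior by dividing the load average by the CPU count.
load = data('system.cpu.load_average.1m')
cpus = data('cpu.num_processors')
(load / cpus).publish('cpu.load_average_per_cpu')
```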

the problem

Doing it this way will force users to update all existing resources to use a more complex query to calculate this load averaged per CPU.

The problem is that we have lots of customers across different SignalFx organizations, and we do not manage every resource ourselves. Leaving aside the complexity of tracking, detecting, and updating resources out of our scope, this is also tricky due to contractual / responsibility / permission considerations.

Even for the resources we manage ourselves properly (IaC, git, terraform, etc.) and that are our responsibility, it can be tedious to update existing resources without introducing new sources of mistakes. For example, our detector that creates alerts on load metrics is very simple: https://github.com/claranet/terraform-signalfx-detectors/blob/master/modules/smart-agent_system-common/detectors-gen.tf#L80
We explicitly instruct users of this module to enable the `perCPU: true` option and everything is good.
We could change this module to modify the query and do the division by `cpu.num_processors`; that way, the detector would be compatible with both the otel collector and the Smart Agent.
Nevertheless, if I do this, I know some users will upgrade to the new version of the module without removing the `perCPU` option from their agent configuration, and so will break their detector silently...

conclusion

It is a mess!

Honestly, I fully understand if you do not want to integrate a "useless" feature in the otel collector, especially if it only serves a specific vendor.

But here is the full explanation: this seems to me the least disruptive, safest, and most straightforward way to handle my need.

I would also like to argue that:

  • this option can still be useful to others (non-Splunk customers)
  • it could ease the setup of monitoring in general
  • this approach requires fewer metrics (1 instead of 2), which can matter both for a vendor like Splunk, where each metric can increase the price, and for open-source based solutions, where metric storage remains the most difficult part

OK, I am not sure of the relevance of these arguments, but nothing ventured, nothing gained :)

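// Report each load average divided by the number of logical CPUs.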
avgLoadValues.Load1 = avgLoadValues.Load1 / divisor
avgLoadValues.Load5 = avgLoadValues.Load5 / divisor
avgLoadValues.Load15 = avgLoadValues.Load15 / divisor
xp-1000 marked this conversation as resolved.
}

metrics.EnsureCapacity(metricsLen)

initializeLoadMetric(metrics.AppendEmpty(), metadata.Metrics.SystemCPULoadAverage1m, now, avgLoadValues.Load1)
@@ -17,6 +17,7 @@ package loadscraper
import (
"context"
"errors"
"runtime"
"testing"

"github.com/shirou/gopsutil/load"
@@ -31,27 +32,41 @@ import (
"github.com/open-telemetry/opentelemetry-collector-contrib/receiver/hostmetricsreceiver/internal/scraper/loadscraper/internal/metadata"
)

const (
testStandard = "Standard"
testAverage = "PerCPUEnabled"
)

func TestScrape(t *testing.T) {
type testCase struct {
name string
loadFunc func() (*load.AvgStat, error)
expectedErr string
perCPU bool
saveMetrics bool
}

testCases := []testCase{
{
name: "Standard",
name: testStandard,
saveMetrics: true,
},
{
name: testAverage,
perCPU: true,
saveMetrics: true,
},
{
name: "Load Error",
loadFunc: func() (*load.AvgStat, error) { return nil, errors.New("err1") },
expectedErr: "err1",
},
}
results := make(map[string]pdata.MetricSlice)

for _, test := range testCases {
t.Run(test.name, func(t *testing.T) {
scraper := newLoadScraper(context.Background(), zap.NewNop(), &Config{})
scraper := newLoadScraper(context.Background(), zap.NewNop(), &Config{PerCPU: test.perCPU})
xp-1000 marked this conversation as resolved.
if test.loadFunc != nil {
scraper.load = test.loadFunc
}
@@ -83,11 +98,34 @@ func TestScrape(t *testing.T) {
assertMetricHasSingleDatapoint(t, metrics.At(2), metadata.Metrics.SystemCPULoadAverage15m.New())

internal.AssertSameTimeStampForAllMetrics(t, metrics)

// save metrics for additional tests if flag is enabled
if test.saveMetrics {
results[test.name] = metrics
}
})
}

// Additional test for average per CPU
numCPU := runtime.NumCPU()
for i := 0; i < results[testStandard].Len(); i++ {
assertCompareAveragePerCPU(t, results[testAverage].At(i), results[testStandard].At(i), numCPU)
}
}

func assertMetricHasSingleDatapoint(t *testing.T, metric pdata.Metric, descriptor pdata.Metric) {
internal.AssertDescriptorEqual(t, descriptor, metric)
assert.Equal(t, 1, metric.Gauge().DataPoints().Len())
}

func assertCompareAveragePerCPU(t *testing.T, average pdata.Metric, standard pdata.Metric, numCPU int) {
valAverage := average.Gauge().DataPoints().At(0).DoubleVal()
valStandard := standard.Gauge().DataPoints().At(0).DoubleVal()
if numCPU == 1 {
// For hardware with only 1 cpu, results must be very close
assert.InDelta(t, valAverage, valStandard, 0.1)
} else {
// For hardware with multiple CPUs, the per-CPU average is necessarily less than the standard value
assert.Less(t, valAverage, valStandard)
}
}