[outputs/stackdriver] Allow to group metrics to bypass MetricDescriptor quota (500) #5567

fean5959a · 2019-03-11T09:49:46Z

Required for all PRs:

Signed CLA.
Associated README.md updated.
Has appropriate unit tests.

Hello, I propose adding the ability to group metrics by tag in order to stay within the Stackdriver quota.
MetricDescriptors are limited to 500, so this PR makes it possible to register metrics the way the Stackdriver Agent does and to organize metrics by tags.
The initial logic is preserved.

I didn't write unit tests, but I tested on my GCP environment.

danielnelson · 2019-03-11T19:55:29Z

It seems that 500 metric descriptors should be enough when metrics are using the measurement name and field names properly, such as in the net input:

net,interface=eth0 bytes_recv=453254750i,bytes_sent=23934425i,drop_in=0i,drop_out=0i,err_in=0i,err_out=0i,packets_recv=366455i,packets_sent=191985i 1552333511000000000

However, we have some plugins that do not lay out the metrics like this. Here is an example of what the output from the prometheus input looks like:

cpu_usage_user,cpu=cpu0,url=http://example.org:9273/metrics gauge=1.513622603430151 1505776751000000000

This creates many unique measurement names, which requires many metric descriptors. However, I don't think the right fix is to move the measurement name into the field name; we need to address these issues at the metric creation points (such as #4415). Also, I don't want to have multiple ways that the output can lay out the data, as that is a maintenance and usability headache. We need to pick a style and use it across the board.
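To make the difference between the two layouts concrete, here is a minimal sketch that counts how many descriptors a batch of metrics would need. It assumes one descriptor per (measurement, field) pair and a naming scheme of `custom.googleapis.com/<namespace>/<measurement>/<field>`; the function name and the sample data are hypothetical and only for illustration.

```python
# Sketch only: assumes one metric descriptor per (measurement, field) pair.
# The naming scheme and the helper name are illustrative assumptions.

def descriptor_names(metrics, namespace="telegraf"):
    """Return the set of descriptor names a batch of metrics would need.

    Each metric is a (measurement, tags, fields) tuple. Tags are ignored
    on purpose: they become labels and do not create new descriptors.
    """
    names = set()
    for measurement, _tags, fields in metrics:
        for field in fields:
            names.add(f"custom.googleapis.com/{namespace}/{measurement}/{field}")
    return names


# "net" style: one measurement, several fields, the interface is a tag.
net_style = [
    ("net", {"interface": "eth0"}, {"bytes_recv": 453254750, "bytes_sent": 23934425}),
    ("net", {"interface": "eth1"}, {"bytes_recv": 1024, "bytes_sent": 2048}),
]

# "prometheus" style: every metric name becomes its own measurement.
prometheus_style = [
    ("cpu_usage_user", {"cpu": "cpu0"}, {"gauge": 1.5}),
    ("cpu_usage_system", {"cpu": "cpu0"}, {"gauge": 0.4}),
    ("cpu_usage_idle", {"cpu": "cpu0"}, {"gauge": 98.1}),
]

print(len(descriptor_names(net_style)))
print(len(descriptor_names(prometheus_style)))
```

Under these assumptions, adding more interfaces to the first batch does not add descriptors, while every new measurement name in the second batch does.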

@fean5959a What is the layout of the metrics that are causing you to go over the 500 metric descriptor limit?

fean5959a · 2019-03-12T07:59:05Z

You're right about the approach, so let me explain my context.

I have 2 Vault clusters and 4 Consul clusters (28 VM instances in GCP). I enabled HashiCorp telemetry and use the Stackdriver Agent provided by Google. Vault and Consul send metrics through a StatsD-compatible agent. This is a collectd-like agent which organizes metrics into 3 MetricDescriptors (derive, gauge, latency).

Here is why I reach the quota with the native Telegraf agent: consider, for example, the metric "consul.session_ttl.active". In reality I have consul..session_ttl.active, so one per VM instance, and so on. Some others are like the "net" metric in your example, with many field values. For these reasons I can't use the Telegraf agent, which creates 1 measurement per metric and field value (unless I pay to extend my quota). Another example is CPU usage: I have 10 MetricDescriptors per VM instance with the native Telegraf agent.

The Stackdriver Agent natively registers metrics using the method in my pull request.

After many tests to understand how Stackdriver works, I found that a MetricDescriptor has only one value type (integer, double, etc.), so we can't create a metric with many fields of different value types. If I understood correctly, each MetricDescriptor can have:

- 1 value type
- Many tags (10, I think)

The goal of my pull request is just to work like the Stackdriver Agent.

Perhaps my pull request should just be another output plugin?

I agree with you and with your comments. I just want a good solution for my case, and I think other people will run into this problem with the Stackdriver quota.

Perhaps a good solution is to create a measurement per value type and then add tags per field ?

fean5959a · 2019-03-12T15:53:01Z

I performed some more tests today and I confirm that I reach the Stackdriver quota with the latest release of the Telegraf agent.

I also tested another version of my code that is more in line with your description: I create a MetricDescriptor per measurement/value type and put the field in tags.

Example :
.../custom.googleapis.com/telegraf/swap-counter
.../custom.googleapis.com/telegraf/swap-derive
.../custom.googleapis.com/telegraf/swap-gauge
-> field = used_percent

danielnelson · 2019-03-24T02:51:15Z

Is it even possible to have a single MetricDescriptor that contains multiple time series with different labels? From looking at the documentation it seems to me that each descriptor can only have a single time series.

fean5959a · 2019-03-25T08:57:43Z

Well, from what I understood and tested, yes, I think that each descriptor can only have a single time series, but a descriptor is characterized by:

Name
Kind
Value type
Labels
All of these determine whether the descriptor is unique. For example, consider CPU metrics (idle, user, system, etc.).
If all the values are floats with kind "gauge", then we can have a metric descriptor with:
Name: custom.googleapis.com/cpu-gauge
Kind: GAUGE
Value type: float
And for each metric, different labels (tags):
- host (for all metrics)
- name (with value: idle, user, system, etc.)

So for the same timestamp and a single MetricDescriptor (Name: custom.googleapis.com/cpu-gauge, Value type: float, Kind: GAUGE), we can retrieve all CPU metrics.
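The grouping described above can be sketched roughly as follows. This is not the code from the pull request; the function name and the dictionary shape are hypothetical, and it assumes every field of the measurement shares the same kind and value type.

```python
# Sketch only: illustrates grouping fields under one descriptor by moving
# the field name into a "name" label. Not the pull request's actual code.

def group_fields(measurement, tags, fields, kind="gauge"):
    """Turn one metric with several fields into several labelled series
    that all share a single descriptor name.

    Assumes all fields have the same kind and value type (floats here).
    """
    metric_type = f"custom.googleapis.com/{measurement}-{kind}"
    series = []
    for field, value in sorted(fields.items()):
        labels = dict(tags)      # copy the original tags (e.g. host)
        labels["name"] = field   # the field name becomes a label value
        series.append({"metric_type": metric_type, "labels": labels, "value": value})
    return series


cpu = group_fields("cpu", {"host": "vm-1"}, {"idle": 90.0, "user": 7.5, "system": 2.5})
for s in cpu:
    print(s["metric_type"], s["labels"], s["value"])
```

All three values end up under `custom.googleapis.com/cpu-gauge` and are told apart only by the `name` label.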

I'm not sure about the vocabulary, that is, whether this is considered a single TimeSeries or multiple TimeSeries, but it works and it makes it possible to stay within the Stackdriver quota.

I confirm that the native Stackdriver agent with the statsd configuration works like this, because when I tested it I had only 3 MetricDescriptors and all my metric data was organized by labels (tags).

Currently I'm not running the PR as submitted because I changed the code a little, but it works very well.

I also noticed another problem: when there are a lot of metrics, the API doesn't allow writing more than 1 point for a MetricDescriptor (Name + Kind + Value type + Labels) per request. Following my example, it is not possible to write CPU idle for more than 1 timestamp value in the same request.
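One way to live with that limit is to split buffered points across several requests so that no request carries two points for the same series. The sketch below only shows the splitting logic; the function name and the `(series_key, timestamp, value)` tuple shape are hypothetical, and no API call is made.

```python
# Sketch only: split points so each request has at most one point per series.
from collections import defaultdict


def split_into_requests(points):
    """points: iterable of (series_key, timestamp, value) in time order.

    The n-th point seen for a given series goes into the n-th request,
    so per-series ordering is preserved across requests.
    """
    seen = defaultdict(int)   # how many points we have placed per series
    requests = []
    for key, ts, value in points:
        index = seen[key]
        seen[key] += 1
        if index == len(requests):
            requests.append([])   # open a new request when needed
        requests[index].append((key, ts, value))
    return requests


points = [
    ("cpu-gauge/idle", 1, 90.0),
    ("cpu-gauge/user", 1, 7.0),
    ("cpu-gauge/idle", 2, 91.0),
    ("cpu-gauge/idle", 3, 92.0),
]
for request in split_into_requests(points):
    print(request)
```

The number of requests equals the largest number of buffered points for any single series, so flushing more often keeps that number small.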

danielnelson · 2019-03-25T17:25:57Z

Thanks for the info. This is certainly a clever way to get Stackdriver to accept more data, but I'm very hesitant to add this layout since it doesn't feel like we would be structuring the data properly.

Maybe it is best to contact Google support about an increased limit as suggested here?

danielnelson · 2019-04-03T00:03:21Z

@fean5959a I'm going to close this pull request; I don't think this layout is something we will want to change. If you do contact support about an increased limit, I'd be interested in knowing how it goes, though.

fean5959a and others added 4 commits March 11, 2019 10:29

Add grouping metrics

672e5d5

Add metrics grouping

e567458

Add metrics grouping

3f0b256

Merge branch 'master' of github.com:fean5959a/telegraf

c85a93d

fean5959a changed the title ~~[Stackdriver] Allow to group metrics to bypass MetricDescriptor quota (500)~~ [outputs/stackdriver] Allow to group metrics to bypass MetricDescriptor quota (500) Mar 11, 2019

danielnelson added the area/gcp Google Cloud plugins including cloud_pubsub, cloud_pubsub_push, stackdriver label Mar 11, 2019

danielnelson closed this Apr 3, 2019
