[outputs/stackdriver] Allow to group metrics to bypass MetricDescriptor quota (500) #5567

Closed
wants to merge 4 commits into from


fean5959a

Required for all PRs:

  • Signed CLA.
  • Associated README.md updated.
  • Has appropriate unit tests.

Hello, I propose adding the ability to group metrics by tag in order to stay under the Stackdriver quota.
The number of MetricDescriptors is limited to 500, so this PR adds an option to register metrics the way the Stackdriver Agent does and to organize them by tags.
The initial logic is preserved.

I didn't write unit tests, but I tested this in my GCP environment.

@fean5959a fean5959a changed the title [Stackdriver] Allow to group metrics to bypass MetricDescriptor quota (500) [outputs/stackdriver] Allow to group metrics to bypass MetricDescriptor quota (500) Mar 11, 2019
@danielnelson
Contributor

It seems that 500 metric descriptors should be enough when metrics are using the measurement name and field names properly, such as in the net input:

net,interface=eth0 bytes_recv=453254750i,bytes_sent=23934425i,drop_in=0i,drop_out=0i,err_in=0i,err_out=0i,packets_recv=366455i,packets_sent=191985i 1552333511000000000

However, we have some plugins that do not lay out the metrics like this. Here is an example of what the output from the prometheus input looks like:

cpu_usage_user,cpu=cpu0,url=http://example.org:9273/metrics gauge=1.513622603430151 1505776751000000000

This creates many unique measurement names, which requires many metric descriptors. However, I don't think the right fix is to move the measurement name into the field name; we need to address these issues at the metric creation points (such as #4415). Also, I don't want to have multiple ways for the output to lay out the data, as that is a maintenance and usability headache. We need to pick a style and use it across the board.
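To make the difference between the two layouts concrete, here is a minimal sketch that counts how many distinct descriptors a set of samples would need. It assumes, for illustration only, that the output creates one descriptor per `<measurement>_<field>` pair and that tags become labels; the `eth1` and `cpu_usage_system` samples and both helper functions are hypothetical.

```python
from typing import Iterable


def parse_line(line: str) -> tuple[str, list[str]]:
    """Split one line-protocol sample into (measurement, field names).

    Simplified parser: assumes no escaped spaces or commas.
    """
    head, fields, _timestamp = line.split(" ")
    measurement = head.split(",")[0]
    field_names = [pair.split("=")[0] for pair in fields.split(",")]
    return measurement, field_names


def descriptor_names(lines: Iterable[str]) -> set[str]:
    """Collect the distinct descriptor names the samples would need.

    ASSUMPTION: one descriptor per "<measurement>_<field>" pair;
    tags become labels and do not create new descriptors.
    """
    names = set()
    for line in lines:
        measurement, field_names = parse_line(line)
        names.update(f"{measurement}_{field}" for field in field_names)
    return names


# net-style layout: a second interface (hypothetical) reuses the same descriptors.
net_style = [
    "net,interface=eth0 bytes_recv=453254750i,bytes_sent=23934425i 1552333511000000000",
    "net,interface=eth1 bytes_recv=1i,bytes_sent=2i 1552333511000000000",
]
# prometheus-style layout: every metric name is its own measurement.
prometheus_style = [
    "cpu_usage_user,cpu=cpu0,url=http://example.org:9273/metrics gauge=1.5 1505776751000000000",
    "cpu_usage_system,cpu=cpu0,url=http://example.org:9273/metrics gauge=0.5 1505776751000000000",
]

print(sorted(descriptor_names(net_style)))
print(sorted(descriptor_names(prometheus_style)))
```

Under that assumption, adding interfaces or hosts does not add descriptors in the net-style layout, while every new metric name in the prometheus-style layout does.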

@fean5959a What is the layout of the metrics that are causing you to go over the 500 metric descriptor limit?

@danielnelson danielnelson added the area/gcp Google Cloud plugins including cloud_pubsub, cloud_pubsub_push, stackdriver label Mar 11, 2019
@fean5959a
Author

fean5959a commented Mar 12, 2019

You're right about the approach, so let me explain my context.

I have 2 Vault clusters and 4 Consul clusters (28 VM instances in GCP). I enabled HashiCorp telemetry and use the Stackdriver Agent provided by Google. Vault and Consul send metrics through a StatsD-compliant agent. It is a collectd-like agent that organizes metrics into 3 MetricDescriptors (derive, gauge, latency).

Here is why I reach the quota with the native Telegraf agent. Take the metric "consul.session_ttl.active": in reality I have consul..session_ttl.active, so one per VM instance, and so on. Some others are like the "net" metric in your example, with many field values. For these reasons I can't use the Telegraf agent, which creates one measurement per metric and field value (unless I pay to extend my quota). Another example is CPU usage: I get 10 MetricDescriptors per VM instance with the native Telegraf agent.

The Stackdriver Agent natively registers metrics using the method in my pull request.

After many tests to understand how Stackdriver works, I found that a MetricDescriptor has only one value type (integer, double, etc.), so we can't create a metric with many fields of different value types. If I understood correctly, each MetricDescriptor can have:

  • 1 type of value
  • Many tags (10 I think)

The goal of my pull request is just to work like the Stackdriver Agent.

Perhaps my pull request should just be another output plugin?

I agree with you and your comments. I just want a good solution for my case, and I think other people will run into this problem with the Stackdriver quota.

Perhaps a good solution is to create a measurement per value type and then add a tag per field?

@fean5959a
Author

I ran some more tests today and I confirm that I reach the Stackdriver quota with the latest release of the Telegraf agent.

I also tested another version of my code that is more in line with your description: I create a MetricDescriptor per measurement and value type, and put the field in the tags.

Example:
.../custom.googleapis.com/telegraf/swap-counter
.../custom.googleapis.com/telegraf/swap-derive
.../custom.googleapis.com/telegraf/swap-gauge
-> field = used_percent
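The mapping in this example can be sketched roughly as follows. This is only an illustration of the idea, not the code from the PR: `Point`, `grouped_points`, the `host` tag, the `free` field, and the way `kind` is passed in by the caller are all assumptions made for the sketch.

```python
from dataclasses import dataclass

PREFIX = "custom.googleapis.com/telegraf"  # prefix taken from the example paths


@dataclass(frozen=True)
class Point:
    metric_type: str
    labels: tuple  # sorted (key, value) pairs
    value: float


def grouped_points(measurement, kind, tags, fields):
    """Map one metric to points that all share a single descriptor.

    ASSUMPTION: `kind` (counter, derive, gauge) is supplied by the caller.
    """
    metric_type = f"{PREFIX}/{measurement}-{kind}"
    points = []
    for field, value in fields.items():
        labels = dict(tags)
        labels["field"] = field  # the field name moves into a label
        # A descriptor accepts one value type, so this sketch coerces to float.
        points.append(Point(metric_type, tuple(sorted(labels.items())), float(value)))
    return points


# Hypothetical sample: two swap fields end up under one descriptor.
points = grouped_points(
    "swap", "gauge", {"host": "vm-1"}, {"used_percent": 12.5, "free": 1024}
)
for p in points:
    print(p.metric_type, dict(p.labels), p.value)
```

Every field of the measurement lands under the same metric type and is told apart only by its `field` label, which is what keeps the descriptor count low.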

@danielnelson
Contributor

Is it even possible to have a single MetricDescriptor that contains multiple time series with different labels? From looking at the documentation it seems to me that each descriptor can only have a single time series.

@fean5959a
Author

Well, from what I understood and tested, yes, I think each descriptor can only have a single time series, but a descriptor is characterized by:

  • Name
  • Kind
  • Value type
  • Labels

Together these determine whether the descriptor is unique. For example, consider CPU metrics (idle, user, system, etc.). If every value is a float with kind "gauge", then we can have a metric descriptor with:

  • Name: custom.googleapis.com/cpu-gauge
  • Kind: GAUGE
  • Value type: float
  • And, for each metric, different labels (tags):
    • host (for all metrics)
    • name (with values idle, user, system, etc.)

So for the same timestamp and a single MetricDescriptor (Name: custom.googleapis.com/cpu-gauge, Value type: float, Kind: GAUGE), we can retrieve all the CPU metrics.

I'm not sure about the vocabulary, whether this counts as a single TimeSeries or multiple TimeSeries, but it works and it makes it possible to stay within the Stackdriver quota.
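On the vocabulary question, my understanding of the Cloud Monitoring data model is that a time series is identified by the metric type together with its label values (plus the monitored resource, which is left out here). The sketch below counts both for the CPU example; `series_key` and the host names are hypothetical.

```python
def series_key(metric_type, labels):
    """Identity of a time series in this sketch: metric type plus label values.

    ASSUMPTION: the monitored resource is ignored for simplicity.
    """
    return (metric_type, tuple(sorted(labels.items())))


METRIC_TYPE = "custom.googleapis.com/cpu-gauge"

# Hypothetical samples: two hosts, two CPU states each.
samples = [
    (METRIC_TYPE, {"host": "vm-1", "name": "idle"}),
    (METRIC_TYPE, {"host": "vm-1", "name": "user"}),
    (METRIC_TYPE, {"host": "vm-2", "name": "idle"}),
    (METRIC_TYPE, {"host": "vm-2", "name": "user"}),
]

descriptors = {metric_type for metric_type, _ in samples}
series = {series_key(metric_type, labels) for metric_type, labels in samples}

print(len(descriptors), "descriptor(s),", len(series), "time series")
```

Read this way, the layout uses one descriptor that carries several time series, one per distinct combination of label values.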

I confirm that the native Stackdriver agent with the statsd configuration works like this: when I tested it, I had only 3 MetricDescriptors, and all my metric data was organized by labels (tags).

Currently I'm not running the code from this PR because I changed it a little, but it works very well.

I also noticed another problem: when there are a lot of metrics, the API doesn't allow writing more than 1 point for a given MetricDescriptor (Name + Kind + Value type + Labels) per request. Following my example, it is not possible to write CPU idle for more than 1 timestamp in a single request.
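One way to live with that restriction is to split buffered points into several requests so that each combination appears at most once per request. The following is a minimal sketch under that assumption; `split_into_requests` and the sample keys are hypothetical, and it does not call the Stackdriver API.

```python
from collections import defaultdict


def split_into_requests(points):
    """Group (series_key, timestamp, value) tuples into request batches.

    Each batch holds at most one point per series key, and the points of a
    given series are spread over the batches oldest first.
    """
    per_series = defaultdict(list)
    for key, timestamp, value in points:
        per_series[key].append((timestamp, value))
    for queue in per_series.values():
        queue.sort(key=lambda item: item[0])  # order by timestamp

    batches = []
    round_index = 0
    while True:
        batch = [
            (key, *queue[round_index])
            for key, queue in per_series.items()
            if round_index < len(queue)
        ]
        if not batch:
            break
        batches.append(batch)
        round_index += 1
    return batches


# Hypothetical buffer: two timestamps for CPU idle, one for CPU user.
buffered = [
    ("cpu-gauge/idle", 2, 97.0),
    ("cpu-gauge/idle", 1, 98.0),
    ("cpu-gauge/user", 1, 1.5),
]
for request in split_into_requests(buffered):
    print(request)
```

The cost is more requests when a series has several buffered points, since the number of batches equals the largest number of points held for any single series.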

@danielnelson
Contributor

Thanks for the info, for sure this is a clever way to get Stackdriver to accept more data, but I'm very hesitant to add this layout since it doesn't feel like we would be structuring the data properly.

Maybe it is best to contact Google support about an increased limit as suggested here?

@danielnelson
Contributor

@fean5959a I'm going to close this pull request, I don't think this layout is something we will want to change. If you do contact support about an increased limit I'd be interested in knowing how it goes though.
