Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Histogram buckets are not summed when converted to prometheus v2 #60

Closed
tbolon opened this issue Mar 29, 2022 · 4 comments · Fixed by #184 or open-telemetry/opentelemetry-collector-contrib#19454
Assignees

Comments

@tbolon
Copy link

tbolon commented Mar 29, 2022

In the otel specification, it is stated that buckets are not cumulative : "The sum of the bucket_counts must equal the value in the count field."

// bucket_counts is an optional field contains the count values of histogram
// for each bucket.
//
// The sum of the bucket_counts must equal the value in the count field.
//
// The number of elements in bucket_counts array must be by one greater than
// the number of elements in explicit_bounds array.
repeated fixed64 bucket_counts = 6;

In your example in the collector-contrib repo you seem to expect that the buckets are cumulative (the values are always higher on each successive bucket), on both v1 and v2 example.

http_request_duration_seconds 0.05=24054,0.1=33444,0.2=100392,0.5=129389,1=133988,sum=53423,count=144320
rpc_duration_seconds 0.01=3102,0.05=3272,0.5=4773,0.9=9001,0.99=76656,sum=1.7560473e+07,count=2693
prometheus,le=0.05 http_request_duration_seconds_bucket=24054
prometheus,le=0.1  http_request_duration_seconds_bucket=33444
prometheus,le=0.2  http_request_duration_seconds_bucket=100392
prometheus,le=0.5  http_request_duration_seconds_bucket=129389
prometheus,le=1    http_request_duration_seconds_bucket=133988
prometheus         http_request_duration_seconds_count=144320,http_request_duration_seconds_sum=53423

prometheus,quantile=0.01 rpc_duration_seconds=3102
prometheus,quantile=0.05 rpc_duration_seconds=3272
prometheus,quantile=0.5  rpc_duration_seconds=4773
prometheus,quantile=0.9  rpc_duration_seconds=9001
prometheus,quantile=0.99 rpc_duration_seconds=76656
prometheus               rpc_duration_seconds_count=1.7560473e+07,rpc_duration_seconds_sum=2693

The prometheus exporter takes care of summing buckets before sending them to prometheus.

	for _, bucket := range buckets {
		index := indicesMap[bucket]
		var countPerBucket uint64
		if len(ip.ExplicitBounds()) > 0 && index < len(ip.ExplicitBounds()) {
			countPerBucket = ip.BucketCounts()[index]
		}
                // **** here ****
		cumCount += countPerBucket
		points[bucket] = cumCount
	}

I don't see a cumulative sum in the prometheus_v2 exporter, so I suppose the data will be wrong in influxdb: each bucket will only have its own values, and not include the count from lower buckets.

		for i, explicitBound := range explicitBounds {
			t := make(map[string]string, len(tags)+1)
			for k, v := range tags {
				t[k] = v
			}
			f := make(map[string]interface{}, len(fields)+1)
			for k, v := range fields {
				f[k] = v
			}


			boundTagValue := strconv.FormatFloat(explicitBound, 'f', -1, 64)
			t[common.MetricHistogramBoundKeyV2] = boundTagValue
			f[measurement+common.MetricHistogramBucketSuffix] = float64(bucketCounts[i])


			if err = w.WritePoint(ctx, common.MeasurementPrometheus, t, f, ts, common.InfluxMetricValueTypeHistogram); err != nil {
				return fmt.Errorf("failed to write point for histogram: %w", err)
			}
		}

Please note that I didn't check prometheus_v1, but I suspect the problem to be the same.

@jacobmarble jacobmarble self-assigned this Apr 1, 2022
@jacobmarble
Copy link
Contributor

Updated link from the proto definition cited above:

  // bucket_counts is an optional field contains the count values of histogram
  // for each bucket.
  //
  // The sum of the bucket_counts must equal the value in the count field.
  //
  // The number of elements in bucket_counts array must be by one greater than
  // the number of elements in explicit_bounds array.
  repeated fixed64 bucket_counts = 6;

@jacobmarble
Copy link
Contributor

I've confirmed that this is a bug. Not sure how I missed this detail for almost two years!

@jacobmarble
Copy link
Contributor

I'll keep the issue open until the fixed behavior is merged into otel collector contrib

@jacobmarble
Copy link
Contributor

jacobmarble commented Mar 11, 2023

This fix should land in the next release of OpenTelemetry Collector Contrib, which should happen within two weeks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment