Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Google Cloud exporter reports unexpected error #36602

Closed
AlexisBRENON opened this issue Nov 29, 2024 · 13 comments
Closed

Google Cloud exporter reports unexpected error #36602

AlexisBRENON opened this issue Nov 29, 2024 · 13 comments

Comments

@AlexisBRENON
Copy link

AlexisBRENON commented Nov 29, 2024

Component(s)

exporter/googlecloud
processor/interval

What happened?

Description

I send HTTP server requests duration metrics to an OTel collector which send them to Google Cloud Monitoring.
The Google Cloud Exporter may report errors while the Debug exporter outputs the metrics as expected.

Steps to Reproduce

Use the Python WSGI instrumentation to report HTTP requests duration.

from fastapi import FastAPI
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

app = FastAPI()
FastAPIInstrumentor.instrument_app(app)

@app.post("/")
async def post():
    return {"message": "Hello POST"}

@app.put("/")
async def put():
    return {"message": "Hello PUT"}

Expected Result

I expect to export the metrics to Google Cloud Monitoring without any error in the logs.

Actual Result

There is some cases (not always) where Google cloud exporter sends error messages about out of order data points.

Collector version

v0.114.0 custom distribution to add the interval processor

Environment information

Environment

Kubernetes
Collector deployed with the Kubernetes Operator.

OpenTelemetry Collector configuration

receivers:
  otlp:
    protocols:
      http:
processors:
  resourcedetection:
    detectors: [gcp]
    timeout: 10s
  batch:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 65
    spike_limit_percentage: 20
  resource/merge_instances:
    attributes:
      - key: service.instance.id
        action: delete
  interval:
    interval: 15s
  filter/in_duration_only:
    error_mode: ignore
    metrics:
      metric:
        - name != "http.server.duration"
  transform/interval_set_timestamp:
    metric_statements:
      - context: datapoint
        statements:
          - set(time, Now())
exporters:
  debug:
    verbosity: detailed # basic # detailed for debug
    sampling_initial: 5
  googlecloud/new:
    project: myproject
service:
  pipelines:
    metrics/new:
      receivers: [otlp]
      processors:
      - memory_limiter
      - batch
      - filter/in_duration_only
      # Aggregate metrics from different instances
      - resource/merge_instances
      - interval
      # Try to fix the out of order error
      - transform/interval_set_timestamp
      # Add GCP attributes
      - resourcedetection
      exporters:
      - googlecloud/new
      - debug

Log output

[...]
2024-11-29T14:18:01.905Z	info	Metrics	{"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 0, "metrics": 0, "data points": 0}
2024-11-29T14:18:01.906Z	info		{"kind": "exporter", "data_type": "metrics", "name": "debug"}
2024-11-29T14:18:01.937Z	info	Metrics	{"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 1, "data points": 2}
2024-11-29T14:18:01.938Z	info	ResourceMetrics #0
Resource SchemaURL: https://opentelemetry.io/schemas/1.6.1
Resource attributes:
     -> telemetry.sdk.language: Str(python)
     -> telemetry.sdk.name: Str(opentelemetry)
     -> telemetry.sdk.version: Str(1.27.0)
     -> service.name: Str(my-service-name)
     -> service.namespace: Str(greenbids)
     -> service.version: Str(0.2.5)
     -> cloud.provider: Str(gcp)
     -> cloud.account.id: Str(my-gcp-project)
     -> cloud.platform: Str(gcp_kubernetes_engine)
     -> cloud.region: Str(europe-west1)
     -> k8s.cluster.name: Str(my-gke-cluster)
     -> host.id: Str(7813299496046632564)
     -> host.name: Str(gke-my-gke-cluster-system-blue-8be1a7a8-x775)
ScopeMetrics #0
ScopeMetrics SchemaURL:
InstrumentationScope opentelemetry.instrumentation.fastapi 0.48b0
Metric #0
Descriptor:
     -> Name: http.server.duration
     -> Description: Duration of HTTP server requests.
     -> Unit: ms
     -> DataType: Histogram
     -> AggregationTemporality: Cumulative
HistogramDataPoints #0
Data point attributes:
     -> http.scheme: Str(http)
     -> http.host: Str(127.0.0.1:8000)
     -> net.host.port: Int(8000)
     -> http.flavor: Str(1.1)
     -> http.method: Str(PUT)
     -> http.server_name: Str(localhost:8000)
     -> http.status_code: Int(200)
     -> http.target: Str(/)
StartTimestamp: 2024-11-29 13:58:12.570889662 +0000 UTC
Timestamp: 2024-11-29 14:18:01.937084789 +0000 UTC
Count: 6690
Sum: 3500.000000
Min: 0.000000
Max: 2.000000
ExplicitBounds #0: 0.000000
ExplicitBounds #1: 5.000000
ExplicitBounds #2: 10.000000
ExplicitBounds #3: 25.000000
ExplicitBounds #4: 50.000000
ExplicitBounds #5: 75.000000
ExplicitBounds #6: 100.000000
ExplicitBounds #7: 250.000000
ExplicitBounds #8: 500.000000
ExplicitBounds #9: 750.000000
ExplicitBounds #10: 1000.000000
ExplicitBounds #11: 2500.000000
ExplicitBounds #12: 5000.000000
ExplicitBounds #13: 7500.000000
ExplicitBounds #14: 10000.000000
Buckets #0, Count: 3200
Buckets #1, Count: 3490
Buckets #2, Count: 0
Buckets #3, Count: 0
Buckets #4, Count: 0
Buckets #5, Count: 0
Buckets #6, Count: 0
Buckets #7, Count: 0
Buckets #8, Count: 0
Buckets #9, Count: 0
Buckets #10, Count: 0
Buckets #11, Count: 0
Buckets #12, Count: 0
Buckets #13, Count: 0
Buckets #14, Count: 0
Buckets #15, Count: 0
HistogramDataPoints #1
Data point attributes:
     -> http.scheme: Str(http)
     -> http.host: Str(127.0.0.1:8000)
     -> net.host.port: Int(8000)
     -> http.flavor: Str(1.1)
     -> http.method: Str(POST)
     -> http.server_name: Str(localhost:8000)
     -> http.status_code: Int(200)
     -> http.target: Str(/)
StartTimestamp: 2024-11-29 13:58:12.577578762 +0000 UTC
Timestamp: 2024-11-29 14:18:01.93709106 +0000 UTC
Count: 1382
Sum: 1023.000000
Min: 0.000000
Max: 2.000000
ExplicitBounds #0: 0.000000
ExplicitBounds #1: 5.000000
ExplicitBounds #2: 10.000000
ExplicitBounds #3: 25.000000
ExplicitBounds #4: 50.000000
ExplicitBounds #5: 75.000000
ExplicitBounds #6: 100.000000
ExplicitBounds #7: 250.000000
ExplicitBounds #8: 500.000000
ExplicitBounds #9: 750.000000
ExplicitBounds #10: 1000.000000
ExplicitBounds #11: 2500.000000
ExplicitBounds #12: 5000.000000
ExplicitBounds #13: 7500.000000
ExplicitBounds #14: 10000.000000
Buckets #0, Count: 361
Buckets #1, Count: 1021
Buckets #2, Count: 0
Buckets #3, Count: 0
Buckets #4, Count: 0
Buckets #5, Count: 0
Buckets #6, Count: 0
Buckets #7, Count: 0
Buckets #8, Count: 0
Buckets #9, Count: 0
Buckets #10, Count: 0
Buckets #11, Count: 0
Buckets #12, Count: 0
Buckets #13, Count: 0
Buckets #14, Count: 0
Buckets #15, Count: 0

2024-11-29T14:18:01.962Z	error	internal/queue_sender.go:92	Exporting failed. Dropping data.	{"kind": "exporter", "data_type": "metrics", "name": "googlecloud/new", "error": "rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: timeSeries[0,1]: write for resource=k8s_cluster{cluster_name:my-gke-cluster,location:europe-west1} failed with: Points must be written in order. One or more of the points specified had an older start time than the most recent point.\nerror details: name = Unknown  desc = total_point_count:2  errors:{status:{code:3}  point_count:2}", "dropped_items": 2}
go.opentelemetry.io/collector/exporter/exporterhelper/internal.NewQueueSender.func1
	go.opentelemetry.io/collector/exporter@v0.114.0/exporterhelper/internal/queue_sender.go:92
go.opentelemetry.io/collector/exporter/internal/queue.(*Consumers[...]).Start.func1
	go.opentelemetry.io/collector/exporter@v0.114.0/internal/queue/consumers.go:43
2024-11-29T14:18:03.711Z	info	Metrics	{"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 0, "metrics": 0, "data points": 0}
2024-11-29T14:18:03.711Z	info		{"kind": "exporter", "data_type": "metrics", "name": "debug"}
[...]

Additional context

As you may see, I have two timeseries, differentiated with the http.method data point attribute (POST and PUT). So I don't understand why the google cloud exporter complain about out of order points.

There are some cases where the error logs do not shows up. On the few examples that I found, it is when my datapoints have ordered timestamp.

But are the timestamps expected to be ordered across different timeseries ? If so, how can I make sure that the exporter sends the timeseries in the right order ?

This seems linked to the addition of the interval processor (and the resource/merge_instance one to merge multiple instances metrics) as I don't have such kind of errors if I remove them.

@AlexisBRENON AlexisBRENON added bug Something isn't working needs triage New item requiring triage labels Nov 29, 2024
Copy link
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@AlexisBRENON
Copy link
Author

/label processor/interval

Copy link
Contributor

Pinging code owners for processor/interval: @RichieSams @sh0rez @djaglowski. See Adding Labels via Comments if you do not have permissions to add labels yourself. For example, comment '/label priority:p2 -needs-triaged' to set the priority and remove the needs-triaged label.

@dashpole
Copy link
Contributor

dashpole commented Dec 2, 2024

The out-of-order error can happen within a batch of metrics, or between two batches. Given the current batch doesn't conflict, it likely means that there was a previous batch sent with a later start time.

Note that this does not "merge" the instance label--it just deletes it:

  resource/merge_instances:
    attributes:
      - key: service.instance.id
        action: delete

That is probably part of the problem. If the instance id is required to differentiate between two series and you drop it, you will get errors like the one you got. You will need to use a different processor, like the metrics transform processor, to aggregate away the attribute, rather than removing it: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/metricstransformprocessor#aggregate-labels

I also see: resource=k8s_cluster{cluster_name:my-gke-cluster,location:europe-west1}. I don't see a namespace, or pod name, which will cause collisions if you have multiple pods writing the same metrics. If you use the k8sattributes processor, you can get these for all pods in the cluster.

@AlexisBRENON
Copy link
Author

@dashpole Thanks for the feedback.
Yes, the resource/merge_instances actually just delete the attributes, however, the series are then merged through the interval processor.
The metricstransformer processor seems to only affect the datapoints labels and not the ones of the resource (or maybe I didn't understand how to use it properly), that's why I tried to use the interval processor (let's see my original StackOverflow post)
What I finally want is to have timeseries aggregated across service.instance.id (to reduce the cardinality, because I don't actually need to plot charts splitted by instance).

Nice catch about the missing namespace or pod name. This should not have any impact for the moment, as I have only one pod, but I will fix it to have a clean setup.

@dashpole
Copy link
Contributor

dashpole commented Dec 2, 2024

I'm not sure if the interval processor aggregates over resource attributes. Maybe @RichieSams @sh0rez @djaglowski know

@RichieSams
Copy link
Contributor

The interval processor will take the "latest" datapoint over the aggregation period. And drops all others. It doesn't sum, average, or anything the values.

It uses the unique combination of scope info, resource info, metric info, and datapoint info to identify a given datapoint. If any of the above are different, then interval processor will not aggregate them. It will keep both.

@RichieSams
Copy link
Contributor

You could temporarily add a text exporter, in addition to the google cloud exporter. Which could help with debugging what values are being exported.

@AlexisBRENON
Copy link
Author

The interval processor will take the "latest" datapoint over the aggregation period. And drops all others. It doesn't sum, average, or anything the values.

Ah... So I completely misuse it 😅

You could temporarily add a text exporter, in addition to the google cloud exporter. Which could help with debugging what values are being exported.

This is what I did. You may see in the first post of the issue that there is the debug exporter in detailed mode.

@AlexisBRENON
Copy link
Author

To give a little bit more context about my use case. I have a service on which I don't have a lot of control (I cannot change the sent telemetry, neither the PeriodicExporter interval). Moreover, the service may scale horizontally, sometimes having a single instance, sometimes having +100. I'm looking for a way to control cardinality and time granularity on my end (through the collector) to reduce cost.
What may be the best processors to use ? Maybe the GitHub issue is not the right place to discuss this ?

@dashpole
Copy link
Contributor

dashpole commented Dec 2, 2024

I think you can move a resource attribute to a metric attribute using the transform processor, and then use the metrics transform processor to aggregate. The transform processor would use the datapoint context, and set attributes["service.instance.id"] to resource.attributes["service.instance.id"]

@crobert-1
Copy link
Member

@AlexisBRENON Has this issue been resolved, or is there something else necessary here?

Removing needs triage as it looks like a valid solution has been proposed here.

@crobert-1 crobert-1 removed the needs triage New item requiring triage label Jan 7, 2025
@AlexisBRENON
Copy link
Author

I currently have a setup that seems to work fine. the pipeline is as follow:

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 65
    spike_limit_percentage: 20
  batch:
  interval:
    interval: 10s
  resourcedetection:
    detectors: [gcp]
    timeout: 10s
  cumulativetodelta:
    max_staleness: 24h
  transform/resource:
    error_mode: ignore
    metric_statements:
    - context: "resource"
      statements:
      - set(attributes["service.instance.id"], attributes["service.namespace"])  # Override instance ID to allow aggregation
      - set(attributes["k8s.namespace.name"], "my-namespace")  # Set namespace for Managed Prometheus export
  groupbyattrs:  # Group all metrics from the same client
    keys:
      - service.name
      - service.version
      - service.namespace
  transform/aggregate:  # Aggregate metrics from the same client
    error_mode: ignore
    metric_statements:
    - context: datapoint
      statements:
      - set(time, TruncateTime(Now(), Duration("10s")))  # Align timestamps to allow aggregation
      - set(start_time, TruncateTime(start_time, Duration("10s")))  # Align timestamps to allow aggregation
      - delete_key(attributes, "http.host")
      - delete_key(attributes, "net.host.port")
      - delete_key(attributes, "http.server_name")
      - delete_key(attributes, "server.address")
      - delete_key(attributes, "server.port")
    - context: metric
      statements:
      - aggregate_on_attributes("sum") where type != METRIC_DATA_TYPE_GAUGE
      - aggregate_on_attributes("mean") where type == METRIC_DATA_TYPE_GAUGE
  deltatocumulative:
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors:
      - memory_limiter
      - batch
      - interval
      - resourcedetection
      - cumulativetodelta
      - transform/resource
      - groupbyattrs
      - transform/aggregate
      - deltatocumulative
      exporters:
      - googlemanagedprometheus

I think we can close this issue 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants