<Question> Getting "One or more TimeSeries could not be written" when attempting to emit metrics following the Java MetricsExporter example here #339

Closed
ZiggyZaggyy opened this issue May 1, 2024 · 8 comments
Labels: priority: p1, question (Further information is requested)


ZiggyZaggyy commented May 1, 2024

Hi, I am using the exporter in a Cloud Function, and I'm seeing the following two errors in the log, particularly after any new deployment of the function:

One or more TimeSeries could not be written: Points must be written in order. One or more of the points specified had an older start time than the most recent point or
One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.: generic_node{namespace:,node_id:,location:global} timeSeries[0,1]

The rough idea of my metrics and exporter configuration is as follows:

import com.google.cloud.opentelemetry.metric.GoogleCloudMetricExporter;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.Meter;
import io.opentelemetry.sdk.metrics.SdkMeterProvider;
import io.opentelemetry.sdk.metrics.export.MetricExporter;
import io.opentelemetry.sdk.metrics.export.PeriodicMetricReader;
import io.opentelemetry.sdk.resources.Resource;
import java.util.HashMap;
import java.util.Map;

public class MetricService {
    Meter meter;
    SdkMeterProvider meterProvider;
    private static MetricService instance;
    private final Map<String, LongCounter> counters;

    private MetricService() {
        MetricExporter metricExporter = GoogleCloudMetricExporter.createWithDefaultConfiguration();
        meterProvider = SdkMeterProvider.builder()
            // The following resource attributes were added manually later to see if
            // they help with the exception issue. MetadataConfig is a GCP metadata
            // helper; its import was omitted in the original snippet.
            .setResource(Resource.create(Attributes.of(
                AttributeKey.stringKey("cloud.availability_zone"), "us-east1",
                AttributeKey.stringKey("service.namespace"), "XYZ",
                AttributeKey.stringKey("service.name"), "abcd",
                AttributeKey.stringKey("service.instance.id"), MetadataConfig.getInstanceId()
            )))
            // Export metrics every 20 seconds on a background thread.
            .registerMetricReader(
                PeriodicMetricReader.builder(metricExporter)
                    .setInterval(java.time.Duration.ofSeconds(20))
                    .build())
            .build();

        meter = meterProvider
            .meterBuilder("XYZ-metric-service")
            .setInstrumentationVersion("1.0.0")
            .build();
        counters = new HashMap<>();
    }

    public static MetricService getInstance() {
        if (instance == null) {
            instance = new MetricService();
        }
        return instance;
    }

    public void increment(final String metricName, final long count) {
        // computeIfAbsent builds the counter only on first use; getOrDefault would
        // evaluate its default argument (a fresh builder) on every call.
        LongCounter counter = counters.computeIfAbsent(metricName,
            name -> meter.counterBuilder(name).build());
        counter.add(count);
    }

    public void increment(final String metricName) {
        increment(metricName, 1);
    }

    public void flush() {
        meterProvider.forceFlush();
    }
}

The Cloud Function is scheduled to run every minute and is guaranteed to finish within that minute; in most cases it finishes within the first few seconds. But the scheduled MetricReader apparently keeps firing on its own thread every 20s, even after the main program has finished its work.

The errors happen most frequently within the first 15 minutes after a new deployment. Typically 3 or 4 such exceptions are thrown per execution: 3 from the scheduled MetricReader firing every 20s within the 60s window before the next execution takes over, and 1 more from the forceFlush() I call from shutdown() inside the main function body. Sometimes, after the function has been running for a while, the frequency drops to once per execution, that one coming from forceFlush() before shutdown.
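For reference, the function entry point is roughly this shape (a simplified sketch; the class and metric names here are placeholders):

import com.google.cloud.functions.HttpFunction;
import com.google.cloud.functions.HttpRequest;
import com.google.cloud.functions.HttpResponse;

public class ScheduledJobFunction implements HttpFunction {
    @Override
    public void service(HttpRequest request, HttpResponse response) throws Exception {
        MetricService metrics = MetricService.getInstance();
        metrics.increment("job.executions");
        // ... main work, normally done within the first few seconds ...
        metrics.flush(); // the forceFlush() mentioned above, called before returning
    }
}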

I'm not sure it actually helped anything, but adding the manual resource attribute configuration (the setResource block shown in the code above) didn't exactly drive the exceptions away; it did, however, consistently cap them at 1 exception per execution, on forceFlush(), whereas previously the frequency sometimes dropped after a while and sometimes didn't.

Also, the error type seems to alternate between "One or more points were written more frequently than the maximum sampling period" and "One or more of the points specified had an older start time than the most recent point" depending on new deployments (maybe related to some slight change in metric setup/dependency configuration, but probably not; I can try to isolate this further if it's a useful troubleshooting route).

If I change the periodic metric reading interval from 20s to 120s, I no longer see the above exceptions (see the snippet below). Because my Cloud Function is bound to finish within one minute, setting the interval to 120s means metrics are only emitted during forceFlush(). So it seems the exceptions are due to conflicts between the scheduled metric reporting runs and forceFlush()? I'd have thought the exporter would take care of multiple consecutive runs emitting metrics for the same name and instance id (but it looks like every call invokes createTimeSeries, and according to Google's docs you can't do this multiple times with the same name on the same instance).
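For clarity, the only change was the interval on the reader builder shown above:

.registerMetricReader(
    PeriodicMetricReader.builder(metricExporter)
        .setInterval(java.time.Duration.ofSeconds(120))
        .build())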

As for the metric itself, it does appear to be published just fine: comparing it side by side with my logging-based metrics, they largely align, even during executions with 3 or 4 exceptions. So I'm not sure there is any actual metric loss. Of course the two won't align exactly, due to accounting differences or general background loss, so I can't tell whether those exceptions account for any very minor discrepancies; suffice to say the difference is minimal, if there is one at all.

Other than the example usage in the MetricExporter docs, I can't find any other examples to help me understand the root cause of these exceptions. Internet searches typically point to the Cloud Monitoring API itself or to confusion between multiple instances, but in our case it's one Cloud Function and one instance, no overlap, and I even attempted above to set an execution-specific instance.id on the resource. Hope you can help with some insights, thank you!

P.S. I found this issue attempting to add a unique identifier to the Java SDK, but it was abandoned because they considered setting service.instance.id on the SdkMeterProvider to be equivalent. As shown above, I tried that, but it didn't make the exceptions go away.

dashpole added the bug (Something isn't working) label May 6, 2024

aabmass commented Jul 15, 2024

@psx95 can you take a look?


dtd2512 commented Sep 6, 2024

Hi, I'm also facing the same issue.

Basically, I took the example here and modified it a bit so that it runs in a simple Spring application and increments the counter every time a POST request comes in.

// Simplified from the example: METER is an io.opentelemetry.api.metrics.Meter
// and RANDOM a java.util.Random, both initialized elsewhere in the class.
private static void myUseCase(int n) {
    LongCounter counter =
            METER
                    .counterBuilder("test/long_counter")
                    .setDescription("Long Counter")
                    .setUnit("1")
                    .build();
    counter.add(n);
}

@RequestMapping(value = "/test", method = RequestMethod.POST)
public ResponseEntity<?> receiveRest(@RequestBody String body) throws InterruptedException {
    myUseCase(body.length());
    Thread.sleep(RANDOM.nextInt(1000)); // simulate variable request handling time
    return new ResponseEntity<>(HttpStatus.OK);
}

When I deploy this app to Cloud Run, I always get a ton of these errors:

com.google.api.gax.rpc.InvalidArgumentException: io.grpc.StatusRuntimeException: INVALID_ARGUMENT: One or more TimeSeries could not be written: Points must be written in order. One or more of the points specified had an older start time than the most recent point.: generic_node{location:global,node_id:,namespace:} timeSeries[0]

I tried setting the "maximum number of Cloud Run instances" to 1, but the error still occurs.

Hope we have a solution soon. Thank you.


psx95 commented Sep 12, 2024

Hi @ZiggyZaggyy & @dtd2512,

Thank you for your detailed reports on this issue. I did some investigation into using the exporter in serverless environments (particularly Cloud Run), and based on your reports and my investigation, the root cause is probably that the timeseries emitted from multiple instances of serverless environments (like Cloud Run/Cloud Functions) are indistinguishable.

This is due to the lack of an instance_id or similar identifier that can uniquely identify a given Cloud Run or Cloud Function instance, since these instances are supposed to be short-lived.

When timeseries from two different instances are indistinguishable, points either appear out of order or are written too frequently; these are the errors you were seeing.

@dtd2512 your workaround of setting the max number of Cloud Run instances to 1 should technically have worked, since there would be only a single instance. However, Cloud Run does not make any hard guarantees, and particularly at the time of deploying/re-deploying a service there might be more than one instance active. If the metric export happens around that time, you could still experience this error. There is a metric that shows the number of active instances in Cloud Run. Did you confirm there was indeed only a single active instance?

This can be fixed with proper resource detection. I was able to replicate this issue on my side when running a sample Spring application with no resource detection.

However, after adding proper resource detection, I could see a resource attribute named task_id that differentiates points coming from different instances.
[Screenshot (2024-09-11): Cloud Monitoring showing the task_id resource attribute]
I ran the sample application with a minimum of 3 Cloud Run instances, which resulted in multiple task_id values. I also did not see any TimeSeries errors.

I have a draft PR with a sample that showcases this with a manually instrumented Spring application.

TL;DR - Proper resource detection is required to prevent such errors. This example shows resource detection in action.
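For reference, a minimal sketch of wiring this up (assuming the io.opentelemetry.contrib:opentelemetry-gcp-resources artifact is on the classpath, where the detector registers itself via SPI):

import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.autoconfigure.AutoConfiguredOpenTelemetrySdk;

// With the GCP resource detector on the classpath, the autoconfigured SDK
// discovers it via SPI and merges attributes such as task_id into the
// resource automatically.
OpenTelemetrySdk sdk = AutoConfiguredOpenTelemetrySdk.initialize().getOpenTelemetrySdk();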

@ZiggyZaggyy I haven't tried this personally with Cloud Functions yet, but the root cause should be the same. Did you set up proper resource detection?

Looking at your report, differentiating between the different instances of the Cloud Function was the right intuition, but I'm not sure what value MetadataConfig.getInstanceId() yields. Could it be returning the same value for all instances of your Cloud Function?
Proper resource detection is supposed to attach such labels to your instrumentation automatically.

I will also try running a sample in Cloud Function once to make sure it's not something else.

Let me know if this helps.


psx95 commented Sep 17, 2024

@ZiggyZaggyy, regarding Cloud Functions: per the current recommendations, the reliable way to get metrics out is log-based metrics.

I did some experiments with Cloud Functions, and with proper resource detection you should be able to avoid "Points must be written in order. One or more of the points specified had an older start time than the most recent point."

However, "One or more points were written more frequently than the maximum sampling period configured for the metric." might still occur.

This is because, when using the exporter with Cloud Functions, you need to force flush prior to exiting, since you may not have CPU allocated to perform the export once the function returns.
This works fine and you will see the metrics in Cloud Monitoring, but if your function is called in quick succession (at a rate higher than Cloud Monitoring allows), you will run into the "points written too frequently" error.
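As a rough sketch, the end of the function body would block on the flush before returning (the 10-second timeout is arbitrary):

import java.util.concurrent.TimeUnit;

// Block until the export completes (or times out); Cloud Functions may
// throttle CPU as soon as the invocation returns.
meterProvider.forceFlush().join(10, TimeUnit.SECONDS);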

Currently, the minimum resolution supported by Cloud Monitoring is 10 seconds. This means that if you attempt to write metrics more than once within 10 seconds, you can run into this issue.

This is why the current recommendation is to use log-based metrics.

Let me know if this makes sense.

@dtd2512, I have updated the metrics example you were referring to with manual resource detection in #370. You should refer to that, as well as the sample added in #368.


psx95 commented Sep 17, 2024

Caveats when using SDK exporters in Cloud Run Service / Jobs

It is important to note that while the SDK exporters do work in the Cloud Run environment if resource detection is set up properly, this is not the recommended approach.

The current recommendation for Cloud Run is to use the OpenTelemetry Sidecar approach.

Why do the SDK exporters work?

Using SDK exporters in Cloud Run works, but it relies on reporting the Cloud Run monitored resource as a generic task. The GCP resource detector does this automatically because the cloud_run_revision and cloud_run_job monitored resource types have no field that can uniquely identify the Cloud Run instance from which the telemetry is generated. In contrast, the generic_task MR type has a task_id field that can uniquely identify that instance.

psx95 added the question (Further information is requested) label and removed the bug (Something isn't working) label Sep 19, 2024

psx95 commented Sep 28, 2024

Closing this issue; please feel free to reopen if the comments above do not resolve the query or if there are any follow-up questions.

psx95 closed this as completed Sep 28, 2024

psx95 commented Oct 2, 2024

For posterity: here is an example of exporting metrics to Google Cloud from a Google Cloud Functions environment: #377


dtd2512 commented Dec 9, 2024

Hi @psx95,
thank you very much for the research and the suggestion regarding resource detection.
I was on a long vacation and only now got the chance to try your suggestion.

After testing with 100 Cloud Run instances simultaneously, the metrics I set up were exported perfectly to GCP Monitoring without any errors 💯
