Cortex cluster generating "Duplicate sample for timestamp" errors constantly #2832
Can you also share your Prometheus config please, especially the remote write section?
Is it possible that your Prometheus is restarting often?
Here's the beginning of our Prometheus config. I need to look through the scrape rules and make sure I'm not giving away anything that might be confidential, though. And our Prometheus server is definitely not restarting frequently, since it takes 2+ hours to replay the WAL.
Thank you. I think the scrape rules won't be necessary. I am slightly confused by one thing I'm seeing, though. Do you use an HA setup with sample deduplication on the Cortex side?
Sorry, you're seeing the config snapshotted at different times while I've been debugging this. I had the HA tracker enabled, with an identically configured pair of Prometheus servers both writing to Cortex, using the default label of "cluster". I noticed that using "cluster" as the
I also tried changing the replication factor from 3 down to 1, and significantly shrinking the cluster from 15 ingesters to 3. This seemed to improve things, reducing the frequency of the "Duplicate" errors from ~1 per minute to ~1 per 10 minutes, but the errors are still appearing.
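For reference, a hedged sketch of where the replication factor is set in a Cortex (blocks storage) YAML config; the kvstore choice and values here are illustrative assumptions, not taken from this issue's actual config:

```yaml
# Illustrative only: the replication factor lives on the ingester ring.
ingester:
  lifecycler:
    ring:
      kvstore:
        store: consul        # assumption; could be etcd, memberlist, etc.
      replication_factor: 1  # default is 3; lowered to 1 while debugging
```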
Note that the configs were identical except for the value of

Edit: I also tried changing the

That change massively improved the performance of the Cortex cluster in terms of resources required to keep up with the rate of remote writes, but had no impact on the "Duplicate" errors.
I think this still points to an invalid HA configuration. Typically a misconfigured HA setup produces "out of order" errors rather than duplicate-timestamp errors; however, it turns out that the cAdvisor metrics exporter also exports timestamps, which explains how different Prometheus servers could scrape the same timestamp and end up with a "duplicate timestamp" error instead. To help with HA debugging, you can go to the distributor's web port and check the /ha-tracker page. This should tell you whether Cortex is finding the correct cluster and replica labels.
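For anyone following along, a minimal sketch of what an HA pair's labels and the distributor's HA-tracker settings could look like, assuming Cortex's default cluster/replica label names (`cluster` and `__replica__`); the replica names and kvstore are made-up placeholders:

```yaml
# Prometheus replica A (replica B is identical except __replica__: prom-b):
global:
  external_labels:
    cluster: prod          # must match on both replicas
    __replica__: prom-a    # must differ between replicas

# Cortex distributor side, enabling deduplication (separate file):
# limits:
#   accept_ha_samples: true
# distributor:
#   ha_tracker:
#     enable_ha_tracker: true
#     kvstore:
#       store: etcd        # assumption
```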
Not sure if this may be related: prometheus/prometheus#7213 (still reading) |
I will also link to this excellent answer from Marco on a similar issue: #2662 (comment)
I've tried disabling one of the two paired Prometheus servers, and just now I also tried disabling the HA tracker entirely (while keeping one of the two Prometheus servers disabled), with no effect.
I did notice that almost all the error messages were coming from the
Have you had any luck with dropping the cAdvisor samples? Would it be possible to try with the most recent Prometheus, 2.19.2?
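Two ways of dropping or defusing the cAdvisor samples could be sketched as below, assuming a scrape job named `cadvisor` and the `container_` metric-name prefix (both assumptions; adjust to the real config):

```yaml
# Option 1: stop trusting exporter-supplied timestamps on that job.
scrape_configs:
  - job_name: cadvisor
    honor_timestamps: false   # Prometheus assigns its own scrape timestamps

# Option 2: drop those series from remote write entirely.
# remote_write:
#   - url: http://cortex.example/api/prom/push   # placeholder URL
#     write_relabel_configs:
#       - source_labels: [__name__]
#         regex: 'container_.*'
#         action: drop
```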
It's surprising to me that "duplicate sample for timestamp" is the root cause of empty or (even worse) corrupted blocks. In my experience, "duplicate sample for timestamp" shouldn't cause corrupted blocks. A few questions, please:
Trying this today; will let you know.
This would actually be kind of painful, but if there's a post-2.18.1 change that you think would help, we can do that upgrade.
I don't think this is related to my issue, but I have no idea why the following config is failing to load with an unmarshaling error:
I'm getting this in the logs every time I send a
I figured out the problem in my last comment: I was using
It seems like the initial corrupted blocks I encountered were unrelated to this timestamp issue, because I haven't seen any recurrences of the errors I was seeing from the compactor. The only recent log messages above level
But this might be expected?
We're going to try this as well, but it will take some work to roll this out without temporarily taking down our alerting/metrics. I'll let you know when it's done.
Unfortunately, I deleted that block since it appeared to be blocking the compactor from making progress. If I see a repeat of that particular error, I'll file a different issue. If I were able to capture a corrupt block "in the wild", it would be very unlikely that I could share it outside my org unless someone on your team was willing to sign an NDA. It would probably be easier to debug over Slack, where I could send you the output of whatever commands you wanted me to run on the block, but I won't worry about this unless the error recurs.
The "duplicate timestamp" error in Cortex is not a critical error. You can live with it if you can't find a way to fix it in the exporter. All duplicate-timestamp samples within a single remote-write request will be discarded, while valid samples will be correctly ingested.
Yes. Prometheus drops them too, but it does it more silently. Could you check
Overlapping blocks are expected in Cortex, because each ingester stores received samples in a block and blocks generated by different ingesters will have overlapping time range.
Sure. Please keep a backup of the corrupted block (if it happens again) and then ping me on Slack. I will build an ad-hoc analysis tool to run on it.
Our Cortex cluster (running the TSDB storage engine) is receiving all its samples from a single Prometheus instance configured with a remote_write section, but despite a sampling rate of ~300k metrics per second, most blocks end up empty or corrupt because the ingesters report "duplicate sample for timestamp" several times per minute. From the code, it looks like this error stops processing of the block, which explains why we're missing data, but I don't know why Prometheus is pushing that bad data: https://github.com/cortexproject/cortex/blob/master/vendor/github.com/prometheus/prometheus/scrape/scrape.go#L1252

Example logs below:
Compactor logs:
Prometheus version:
Running the Cortex 1.2.0 Docker image.
Config: https://pastebin.com/rCissGtr
Jsonnet build file: https://pastebin.com/7EK19zAG
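Since the pastebin links may rot, here is a hedged sketch of the general shape of a remote_write section at this kind of volume; the URL and all values are illustrative assumptions, not the actual config from the links above:

```yaml
remote_write:
  - url: http://cortex.example/api/prom/push   # placeholder endpoint
    queue_config:
      max_shards: 200            # illustrative values for ~300k samples/s
      capacity: 2500
      max_samples_per_send: 500
```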