Cortex cluster generating "Duplicate sample for timestamp" errors constantly #2832

Closed
amckinley opened this issue Jul 2, 2020 · 18 comments

@amckinley
Contributor

amckinley commented Jul 2, 2020

Our Cortex cluster (running the TSDB storage engine) receives all of its samples from a single Prometheus instance configured with remote_write. Despite an ingest rate of ~300k samples per second, most blocks end up empty or corrupt because the ingesters report "duplicate sample for timestamp" several times per minute. From the code, it looks like this error stops processing of the block, which explains why we're missing data, but I don't know why Prometheus is pushing that bad data: https://github.com/cortexproject/cortex/blob/master/vendor/github.com/prometheus/prometheus/scrape/scrape.go#L1252

Example logs below:

level=error ts=2020-07-02T21:47:43.927056908Z caller=push.go:38 org_id=fake traceID=23a9fee81dfed9d1 msg="push error" err="rpc error: code = Code(400) desc = user=fake: series={__name__=\"container_network_receive_packets_dropped_total\", cluster=\"development\", ec2_id=\"0af8b09933bb13827\", host=\"10.2.4.120\", id=\"/\", instance=\"10.2.4.120:9116\", interface=\"br-31821752497f\", job=\"cadvisor_exporter\", role=\"jenkins-agent\"}, timestamp=2020-07-02T21:47:24.904Z: duplicate sample for timestamp"
level=error ts=2020-07-02T21:49:23.396602468Z caller=push.go:38 org_id=fake traceID=2307cf3b3646379e msg="push error" err="rpc error: code = Code(400) desc = user=fake: series={__name__=\"container_network_transmit_errors_total\", cluster=\"development\", ec2_id=\"08b7dd66367c5c5f2\", host=\"10.2.4.39\", id=\"/\", instance=\"10.2.4.39:9116\", interface=\"br-14aa5caa581f\", job=\"cadvisor_exporter\", role=\"jenkins-agent\"}, timestamp=2020-07-02T21:49:06.516Z: duplicate sample for timestamp"
level=error ts=2020-07-02T21:49:23.400356092Z caller=push.go:38 org_id=fake traceID=92eb5c45e159496 msg="push error" err="rpc error: code = Code(400) desc = user=fake: series={__name__=\"container_spec_memory_swap_limit_bytes\", cluster=\"development\", ec2_id=\"08b7dd66367c5c5f2\", host=\"10.2.4.39\", id=\"/system.slice/snap.amazon-ssm-agent.amazon-ssm-agent.service\", instance=\"10.2.4.39:9116\", job=\"cadvisor_exporter\", role=\"jenkins-agent\"}, timestamp=2020-07-02T21:49:06.516Z: duplicate sample for timestamp"
level=error ts=2020-07-02T21:49:23.442017342Z caller=push.go:38 org_id=fake traceID=1c68c424d0ee55a9 msg="push error" err="rpc error: code = Code(400) desc = user=fake: series={__name__=\"container_network_receive_bytes_total\", cluster=\"development\", ec2_id=\"08b7dd66367c5c5f2\", host=\"10.2.4.39\", id=\"/\", instance=\"10.2.4.39:9116\", interface=\"br-1f66a6c86297\", job=\"cadvisor_exporter\", role=\"jenkins-agent\"}, timestamp=2020-07-02T21:49:06.516Z: duplicate sample for timestamp"
level=error ts=2020-07-02T21:49:23.481361649Z caller=push.go:38 org_id=fake traceID=1148fdd822977f5f msg="push error" err="rpc error: code = Code(400) desc = user=fake: series={__name__=\"container_spec_memory_limit_bytes\", cluster=\"development\", ec2_id=\"08b7dd66367c5c5f2\", host=\"10.2.4.39\", id=\"/system.slice/kmod-static-nodes.service\", instance=\"10.2.4.39:9116\", job=\"cadvisor_exporter\", role=\"jenkins-agent\"}, timestamp=2020-07-02T21:49:06.516Z: duplicate sample for timestamp"

Compactor logs:

level=info ts=2020-07-02T21:28:53.424880459Z caller=compact.go:943 component=compactor org_id=fake msg="start sync of metas"
level=info ts=2020-07-02T21:28:54.418034878Z caller=fetcher.go:452 component=compactor org_id=fake component=block.BaseFetcher msg="successfully synchronized block metadata" duration=993.11421ms cached=149 returned=149 partial=0
level=info ts=2020-07-02T21:28:54.418070948Z caller=compact.go:948 component=compactor org_id=fake msg="start of GC"
level=info ts=2020-07-02T21:28:54.418303502Z caller=compact.go:960 component=compactor org_id=fake msg="start of compactions"
level=info ts=2020-07-02T21:28:54.631891Z caller=compact.go:693 component=compactor org_id=fake group="0@{__org_id__=\"fake\"}" groupKey=0@5679675083797525161 msg="compaction available and planned; downloading blocks" plan="[/data/compact/0@5679675083797525161/01EC6TJJXG6RZGSJYY92B4HDMT /data/compact/0@5679675083797525161/01EC6TJVB2WSJQJ9HFS2JE2BAM /data/compact/0@5679675083797525161/01EC6TPA03TJQFDSRHCWBX2PFG /data/compact/0@5679675083797525161/01EC6TR4FCEWA14EQ8R1FKZXFF /data/compact/0@5679675083797525161/01EC6TS1NRKVCM9CHP4HHBTVJ0 /data/compact/0@5679675083797525161/01EC6TVAGKANNCHYZYRBJ1193D /data/compact/0@5679675083797525161/01EC6TZT3Q8MT8HF7B54WKVTXF /data/compact/0@5679675083797525161/01EC6V0VCRASFE7GQSD6JP89KK /data/compact/0@5679675083797525161/01EC6V66WDVC0A8YZWADZYFTPK /data/compact/0@5679675083797525161/01EC6VCVDJ87D4TS32YZY1BX86 /data/compact/0@5679675083797525161/01EC6VDJTJFBFFDAAQ3X98XB53 /data/compact/0@5679675083797525161/01EC6VG9T4W1RXCPH59SGDY7GS /data/compact/0@5679675083797525161/01EC6VN39GDC3242ZXMJ11F8M3 /data/compact/0@5679675083797525161/01EC6VNWYB7XN9VKR8KVHBZKNR]"
level=error ts=2020-07-02T21:29:35.751332714Z caller=compactor.go:339 component=compactor msg="failed to compact user blocks" user=fake err="compaction: group 0@5679675083797525161: block with not healthy index found /data/compact/0@5679675083797525161/01EC6TVAGKANNCHYZYRBJ1193D; Compaction level 1; Labels: map[__org_id__:fake]: 1/2394773 series have an average of 1.000 out-of-order chunks: 2.000 of these are exact duplicates (in terms of data and time range)"
level=info ts=2020-07-02T21:30:03.015963385Z caller=compactor.go:308 component=compactor msg="discovering users from bucket"
level=info ts=2020-07-02T21:30:03.067965858Z caller=compactor.go:314 component=compactor msg="discovered users from bucket" users=1
level=info ts=2020-07-02T21:30:03.068003282Z caller=compactor.go:336 component=compactor msg="starting compaction of user blocks" user=fake
level=info ts=2020-07-02T21:30:03.251748413Z caller=compact.go:943 component=compactor org_id=fake msg="start sync of metas"
level=info ts=2020-07-02T21:30:04.709152076Z caller=fetcher.go:452 component=compactor org_id=fake component=block.BaseFetcher msg="successfully synchronized block metadata" duration=1.457369126s cached=149 returned=149 partial=0
level=info ts=2020-07-02T21:30:04.709216088Z caller=compact.go:948 component=compactor org_id=fake msg="start of GC"
level=info ts=2020-07-02T21:30:04.710063714Z caller=compact.go:960 component=compactor org_id=fake msg="start of compactions"
level=info ts=2020-07-02T21:30:04.928369559Z caller=compact.go:693 component=compactor org_id=fake group="0@{__org_id__=\"fake\"}" groupKey=0@5679675083797525161 msg="compaction available and planned; downloading blocks" plan="[/data/compact/0@5679675083797525161/01EC6TJJXG6RZGSJYY92B4HDMT /data/compact/0@5679675083797525161/01EC6TJVB2WSJQJ9HFS2JE2BAM /data/compact/0@5679675083797525161/01EC6TPA03TJQFDSRHCWBX2PFG /data/compact/0@5679675083797525161/01EC6TR4FCEWA14EQ8R1FKZXFF /data/compact/0@5679675083797525161/01EC6TS1NRKVCM9CHP4HHBTVJ0 /data/compact/0@5679675083797525161/01EC6TVAGKANNCHYZYRBJ1193D /data/compact/0@5679675083797525161/01EC6TZT3Q8MT8HF7B54WKVTXF /data/compact/0@5679675083797525161/01EC6V0VCRASFE7GQSD6JP89KK /data/compact/0@5679675083797525161/01EC6V66WDVC0A8YZWADZYFTPK /data/compact/0@5679675083797525161/01EC6VCVDJ87D4TS32YZY1BX86 /data/compact/0@5679675083797525161/01EC6VDJTJFBFFDAAQ3X98XB53 /data/compact/0@5679675083797525161/01EC6VG9T4W1RXCPH59SGDY7GS /data/compact/0@5679675083797525161/01EC6VN39GDC3242ZXMJ11F8M3 /data/compact/0@5679675083797525161/01EC6VNWYB7XN9VKR8KVHBZKNR]"
level=error ts=2020-07-02T21:30:45.864103545Z caller=compactor.go:339 component=compactor msg="failed to compact user blocks" user=fake err="compaction: group 0@5679675083797525161: block with not healthy index found /data/compact/0@5679675083797525161/01EC6TVAGKANNCHYZYRBJ1193D; Compaction level 1; Labels: map[__org_id__:fake]: 1/2394773 series have an average of 1.000 out-of-order chunks: 2.000 of these are exact duplicates (in terms of data and time range)"
level=info ts=2020-07-02T21:41:56.742995614Z caller=blocks_cleaner.go:101 component=cleaner msg="started hard deletion of blocks marked for deletion"
level=info ts=2020-07-02T21:41:58.544633244Z caller=fetcher.go:452 component=cleaner org_id=fake component=block.BaseFetcher msg="successfully synchronized block metadata" duration=1.733125256s cached=149 returned=149 partial=0
level=info ts=2020-07-02T21:41:58.544674692Z caller=blocks_cleaner.go:43 component=cleaner org_id=fake msg="started cleaning of blocks marked for deletion"
level=info ts=2020-07-02T21:41:58.544683213Z caller=blocks_cleaner.go:57 component=cleaner org_id=fake msg="cleaning of blocks marked for deletion done"
level=info ts=2020-07-02T21:41:58.544692096Z caller=blocks_cleaner.go:105 component=cleaner msg="successfully completed hard deletion of blocks marked for deletion"

Prometheus version:

prometheus, version 2.18.1 (branch: HEAD, revision: ecee9c8abfd118f139014cb1b174b08db3f342cf)
  build user:       root@2117a9e64a7e
  build date:       20200507-16:51:47
  go version:       go1.14.2

Running the Cortex 1.2.0 Docker image.

Config: https://pastebin.com/rCissGtr
Jsonnet build file: https://pastebin.com/7EK19zAG

@pstibrany
Contributor

Can you also share your Prometheus config please, especially the remote_write section?

@pstibrany
Contributor

Is it possible that your Prometheus is restarting often?

@amckinley
Contributor Author

Here's the beginning of our Prometheus config. I still need to look through the scrape rules and make sure I'm not giving away anything that might be confidential, though. And our Prometheus server is definitely not restarting frequently, since it takes 2+ hours to replay the WAL.

global:
  scrape_interval:     15s # Scrape targets every 15 seconds.
  evaluation_interval: 15s # Evaluate rules every 15 seconds.
  scrape_timeout: 15s # Scrape timeout, default is 10

  external_labels:
    cortex_ha_cluster: development
    __replica__: i-0025c357aa6e352eb

remote_write:
  - url: "http://cortex-tsdb.dev.robinhood.com/api/prom/push"

rule_files:
  - /etc/prometheus/recording_rules/*.yml
  - /etc/prometheus/alerting_rules/*.yml

@pstibrany
Contributor

Thank you. I think scrape rules won’t be necessary.

I am slightly confused: I see cortex_ha_cluster in your external labels, but not in the Cortex config (ha_cluster_label is set to cluster there) and not in the log lines. On the other hand, a cluster label with the value development is visible.

Do you use an HA setup with sample deduplication on the Cortex side?

@amckinley
Contributor Author

Sorry, you're seeing the config snapshotted at different times while I've been debugging this. I had the HA tracker enabled, with an identically configured pair of Prometheus servers both writing to Cortex, using the default label of "cluster". I noticed that using "cluster" as the ha_cluster_label wasn't working because we already use that label internally, which resulted in the HA tracker thinking we had a bunch of different clusters, each of which got its own leader. I tried both disabling one of the servers doing the remote_writes and changing ha_cluster_label to a new value ("cortex_ha_cluster"); neither changed the frequency of the "duplicate sample for timestamp" errors.
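
For context, the Cortex-side knobs involved here are roughly the following (a minimal sketch of the relevant distributor/limits fields, not our full config, which is linked above):

distributor:
  ha_tracker:
    enable_ha_tracker: true

limits:
  accept_ha_samples: true
  ha_cluster_label: cortex_ha_cluster
  ha_replica_label: __replica__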

@amckinley
Contributor Author

I also tried changing the replication factor from 3 down to 1, and significantly shrinking the cluster, from 15 ingesters to 3. This seemed to improve things by reducing the frequency of the "Duplicate" errors from ~1 per minute to ~1 per 10 minutes, but the errors are still appearing.

@amckinley
Contributor Author

amckinley commented Jul 3, 2020

...with an identically configured pair of Prometheus servers both writing to Cortex

Note that the configs were identical except for the value of __replica__, which is written into our Prometheus config dynamically by Salt.

Edit: I also tried changing the remote_write parameters by following some of the suggestions in this thread: prometheus/prometheus#5166 (comment)

That change massively improved the performance of the Cortex cluster in terms of resources required to keep up with the rate of remote writes, but had no impact on the "Duplicate" errors.
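
For anyone following along, the suggestions in that thread mostly concern the remote_write queue_config block; the change was of roughly this shape (illustrative values only, not our exact settings):

remote_write:
  - url: "http://cortex-tsdb.dev.robinhood.com/api/prom/push"
    queue_config:
      capacity: 10000
      max_shards: 30
      max_samples_per_send: 1000
      batch_send_deadline: 5s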

@pstibrany
Contributor

I think this still points to invalid HA configuration.

Typically a misconfigured HA setup produces an "out of order" error rather than a duplicate timestamp error. However, it turns out that the cAdvisor metrics exporter also exports timestamps, which explains how different Prometheus servers could scrape the same timestamp and end up with the "duplicate timestamp" error instead.

To help with HA debugging, you can go to the distributor's web port and check the /ha-tracker page. This should tell you whether Cortex is finding the correct cluster and replica labels.
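
Concretely, that's just an HTTP GET against the distributor (placeholder host and port, adjust to your deployment):

  http://<distributor-address>:<http-port>/ha-tracker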

@pstibrany
Contributor

Not sure if this may be related: prometheus/prometheus#7213 (still reading)

@pstibrany
Contributor

I will also link to this excellent answer from Marco on a similar issue: #2662 (comment)

@amckinley
Contributor Author

I think this still points to invalid HA configuration.

I've tried disabling one of the two paired Prometheus servers, and just now I also tried disabling the HA tracker entirely (while keeping one of the two Prometheus servers disabled), with no effect.

it turns out that the cAdvisor metrics exporter also exports timestamps

I did notice that almost all the error messages were coming from the cadvisor_exporter job. Maybe I can work around this temporarily by dropping those samples in the remote_write relabel config...

@pstibrany
Contributor

Have you had any luck with dropping CAdvisor samples?

Would it be possible to try with most recent Prometheus 2.19.2?

@pracucci
Contributor

pracucci commented Jul 6, 2020

most blocks end up empty or corrupt because the ingesters report "duplicate sample for timestamp" several times per minute

It's surprising to me that "duplicate sample for timestamp" is the root cause of empty or (even worse) corrupted blocks. In my experience, "duplicate sample for timestamp" shouldn't cause corrupted blocks.

A few questions, please:

  1. How many corrupted blocks have you got over time? I don't need an exact number, just want to understand whether it happens frequently or not.
  2. Would you be able to privately share the index of the block 01EC6TVAGKANNCHYZYRBJ1193D to run an analysis on it? Please keep in mind the index contains all metrics label names and values, so sharing it may reveal sensitive information. We can talk about it privately on CNCF Slack workspace, I'm "Marco Pracucci" there.

@amckinley
Contributor Author

amckinley commented Jul 6, 2020

Have you had any luck with dropping CAdvisor samples?

Trying this today; will let you know.

Would it be possible to try with most recent Prometheus 2.19.2?

This would actually be kind of painful, but if there's a post-2.18.1 change that you think would help, we can do that upgrade.

@amckinley
Contributor Author

Have you had any luck with dropping CAdvisor samples?

I don't think this is related to my issue, but I have no idea why the following config is failing to load with an unmarshaling error:

global:
  scrape_interval:     15s # Scrape targets every 15 seconds.
  evaluation_interval: 15s # Evaluate rules every 15 seconds.
  scrape_timeout: 15s # Scrape timeout, default is 10


  external_labels:
    cortex_ha_cluster: development
    __replica__: i-0025c357aa6e352eb

remote_write:
  - url: "http://cortex-tsdb.dev.robinhood.com/api/prom/push"
    writeRelabelConfigs:
      - sourceLabels: ["job"]
        regex: 'cadvisor_exporter'
        action: drop

rule_files:
  - /etc/prometheus/recording_rules/*.yml
  - /etc/prometheus/alerting_rules/*.yml

I'm getting this in the logs every time I send a SIGHUP:

level=error ts=2020-07-07T01:01:57.485Z caller=main.go:604 msg="Error reloading config" err="couldn't load configuration (--config.file=\"/etc/prometheus/config.yml\"): parsing YAML file /etc/prometheus/config.yml: yaml: unmarshal errors:\n  line 15: field writeRelabelConfigs not found in type config.plain"

@amckinley
Contributor Author

Have you had any luck with dropping CAdvisor samples?

I figured out the problem in my last comment: I was using camelCase instead of underscore_case. Dropping all the cadvisor_exporter samples has, as expected, prevented the "duplicate timestamp" errors, but in practice this would block us from adopting Cortex because we rely on metrics from that exporter. But it sounds like the issue with cadvisor_exporter is a Prometheus issue, where these samples get written to the WAL but dropped by the local TSDB?
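
For reference, the corrected relabel block (underscore_case field names, same URL as before) looks like this:

remote_write:
  - url: "http://cortex-tsdb.dev.robinhood.com/api/prom/push"
    write_relabel_configs:
      - source_labels: ["job"]
        regex: 'cadvisor_exporter'
        action: drop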

How many corrupted blocks have you got over time? I don't need an exact number, just want to understand whether it happens frequently or not.

It seems like the initial corrupted blocks I encountered were unrelated to this timestamp issue, because I haven't seen any recurrences of the errors I was seeing from the compactor. The only recent log messages above level info are related to overlapping blocks:

level=warn ts=2020-07-07T21:19:16.411656604Z caller=compact.go:677 msg="Found overlapping blocks during compaction" ulid=01ECNJBB3SK9TA50AYKX74RYQF
level=warn ts=2020-07-07T23:14:34.980559457Z caller=compact.go:677 msg="Found overlapping blocks during compaction" ulid=01ECNRYFQM1JJJHTYYDW5WM73R
level=warn ts=2020-07-08T01:10:43.085754519Z caller=compact.go:677 msg="Found overlapping blocks during compaction" ulid=01ECNZK4HDG7WV0TPGQRCD66QG
level=warn ts=2020-07-08T03:34:56.09410845Z caller=compact.go:677 msg="Found overlapping blocks during compaction" ulid=01ECP7V6S7SN47TJTAWES9ZXCJ
level=warn ts=2020-07-08T05:18:29.592474752Z caller=compact.go:677 msg="Found overlapping blocks during compaction" ulid=01ECPDRTNBBQPKRQZW3JBDJ5R5
level=warn ts=2020-07-08T08:00:05.078755664Z caller=compact.go:677 msg="Found overlapping blocks during compaction" ulid=01ECPQ0PXFWNB2PTX13XT9SAND
level=warn ts=2020-07-08T09:21:00.147599384Z caller=compact.go:677 msg="Found overlapping blocks during compaction" ulid=01ECPVMVZ11HBF10347X17V88C
level=warn ts=2020-07-08T11:16:11.335137859Z caller=compact.go:677 msg="Found overlapping blocks during compaction" ulid=01ECQ27SBQBAK18XKFJS41YZQC
level=warn ts=2020-07-08T13:11:54.256543898Z caller=compact.go:677 msg="Found overlapping blocks during compaction" ulid=01ECQ8VNH6ZMAMVJ0ETZHY2DSS
level=warn ts=2020-07-08T15:07:43.384554185Z caller=compact.go:677 msg="Found overlapping blocks during compaction" ulid=01ECQFFQVHQ83256FATPMGHE53
level=warn ts=2020-07-08T17:32:15.995584506Z caller=compact.go:677 msg="Found overlapping blocks during compaction" ulid=01ECQQRD6NDJ3M1PR6WHB4YW6A

But this might be expected?

Would it be possible to try with most recent Prometheus 2.19.2?

We're going to try this as well, but it will take some work to roll this out without temporarily taking down our alerting/metrics. I'll let you know when it's done.

Would you be able to privately share the index of the block 01EC6TVAGKANNCHYZYRBJ1193D to run an analysis on it?

Unfortunately, I deleted that block since it appeared to be blocking the compactor from making progress. If I see a repeat of that particular error, I'll file a different issue. If I was able to capture a corrupt block "in the wild", it would be very unlikely that I could share it outside my org unless someone on your team was willing to sign an NDA. It would probably be easier to debug over Slack where I could send you the output of whatever commands you wanted me to run on the block, but I won't worry about this unless the error recurs.

@pracucci
Contributor

pracucci commented Jul 9, 2020

I figured out the problem in my last comment: I was using camelCase instead of underscore_case. Dropping all the cadvisor_exporter samples has, as expected, prevented the "duplicate timestamp" errors, but in practice this would block us from adopting Cortex because we rely on metrics from that exporter.

The "duplicate timestamp" error in Cortex is not a critical error. You can live with it if you can't find a way to fix it in the exporter. All duplicate-timestamp samples within a single remote write request will be discarded, while valid samples will be correctly ingested.

But it sounds like the issue with cadvisor_exporter is a Prometheus issue, where these samples get written to the WAL but dropped by the local TSDB?

Yes. Prometheus drops them too, but more silently. Could you check prometheus_tsdb_out_of_order_samples_total in your Prometheus?
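
For example, a quick way to check whether that counter is increasing:

  rate(prometheus_tsdb_out_of_order_samples_total[5m])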

The only recent log messages above level info are related to overlapping blocks:

Overlapping blocks are expected in Cortex, because each ingester stores the received samples in its own blocks, and blocks generated by different ingesters will have overlapping time ranges.

Unfortunately, I deleted that block since it appeared to be blocking the compactor from making progress. If I see a repeat of that particular error, I'll file a different issue. If I was able to capture a corrupt block "in the wild", it would be very unlikely that I could share it outside my org unless someone on your team was willing to sign an NDA. It would probably be easier to debug over Slack where I could send you the output of whatever commands you wanted me to run on the block, but I won't worry about this unless the error recurs.

Sure. Please keep a backup of the corrupted block (if it happens again) and then ping me on Slack. I will build an ad-hoc analysis tool to run on it.

@stale

stale bot commented Sep 7, 2020

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.
