Upgrade Prometheus version to fix race condition in Prometheus receiver #2121
Description:
There is a race condition in the Prometheus receiver which causes the problem described in #1909. After investigating with @hdj630, we found that the race condition occurs when Prometheus scrape targets are dropped in the middle of a scrape loop. In this scenario, a deadlock occurs on the `scrapePool` mutex. This deadlock causes the "Discovery receiver's channel was full so will retry the next cycle" error mentioned in the issue above. More details of this error can be found below.

This PR upgrades the Prometheus version to `v2.22.1` to include an upstream commit that makes the mutex locks more granular, which resolves this issue. However, due to the upgraded Prometheus version, there was an error with the `github.com/shirou/gopsutil` dependency, which had to be resolved by upgrading its version to `v3.20.10`.
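For context on why the deadlock surfaces as a discovery-side error: once the receiver is blocked it stops draining target updates, and a non-blocking send on a full channel is roughly the shape of the pattern that produces a "channel was full so will retry the next cycle" style message. A minimal sketch of that pattern (simplified names, not the actual Prometheus discovery code):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Discovery side sends target-group updates; the receiver side consumes them.
	updates := make(chan string, 1)

	// Stand-in for the consumer that is stuck, e.g. behind the deadlocked
	// scrapePool mutex: it never gets around to draining the channel.
	block := make(chan struct{})
	go func() {
		<-block // never unblocked in this sketch
		for u := range updates {
			fmt.Println("applied", u)
		}
	}()

	for i := 0; i < 3; i++ {
		select {
		case updates <- fmt.Sprintf("target-group-%d", i):
			fmt.Println("sent update", i)
		default:
			// A non-blocking send on a full channel is what produces a
			// "channel was full, will retry" style message.
			fmt.Println("Discovery receiver's channel was full so will retry the next cycle")
		}
		time.Sleep(10 * time.Millisecond)
	}
}
```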
Testing:
Ran `make` to ensure all unit tests pass.

Further details of race condition
Hypothesis

I tried replicating the issue and realized that the error mentioned might be caused by a race condition that occurs when scrape targets are removed. This does not always happen, but it does in the specific situation that @hdj630 described here. A more complete walkthrough of the error: `ScrapeManager` is alerted of dropped targets via `ScrapeManager.reload()`, and then:

1. `scrapePool.Sync()` is called on each scrape pool, which acquires the scrape pool mutex lock.
2. `scrapePool.sync()` is called, which iterates over all old targets and stops the scrape loops whose targets were dropped, calling `scrapeLoop.stop()` on them.
3. `scrapeLoop.stop()` calls `scrapeLoop.cancel()`, which is received in `scrapeLoop.run()`. However, if the scrape loop is in the middle of `scrapeLoop.scrapeAndReport`, this is when the deadlock occurs.
4. `scrapeAndReport` creates a new `storage.Appender`, which is an `ocaStore` defined in Otel Collector. This creates a new `Transaction`.
5. `scrapeAndReport` then calls append on the new `storage.Appender` to add metrics to it.
6. `Transaction.Add` is called, and since the transaction is new, it calls `transaction.initTransaction`. This gets the target from the metadata service, which calls `scrapeManager.TargetsAll`.
7. `scrapeManager.TargetsAll` iterates through each scrape pool and gets the active and dropped targets for each. However, `scrapePool.ActiveTargets` wants to acquire the scrape pool mutex lock, which is still being held from step 1 and won't be released until all scrape loops are done. This causes the deadlock.

Note that in this hypothesis, the deadlock occurs with the scrape pool's mutex lock and not the scrape manager's.
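To make the lock ordering concrete, here is a stripped-down reproduction of the cycle above. The types and method names are simplified stand-ins for `scrapePool.Sync`, `scrapeLoop.stop`, and `scrapePool.ActiveTargets`, not the real Prometheus implementation:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// pool is a stand-in for Prometheus's scrapePool.
type pool struct {
	mtx     sync.Mutex
	stopped chan struct{} // closed when the scrape loop has fully stopped
}

// Sync mirrors scrapePool.Sync/sync: it holds the pool mutex while waiting
// for the scrape loop to stop.
func (p *pool) Sync() {
	p.mtx.Lock()
	defer p.mtx.Unlock()
	fmt.Println("Sync acquired sp lock")
	<-p.stopped                           // wait for the scrape loop to stop (scrapeLoop.stop)
	fmt.Println("Sync released sp lock")  // never reached
}

// ActiveTargets mirrors scrapePool.ActiveTargets: it needs the same mutex.
func (p *pool) ActiveTargets() {
	fmt.Println("ActiveTargets trying to acquire sp lock")
	p.mtx.Lock()
	defer p.mtx.Unlock()
	fmt.Println("ActiveTargets acquired sp lock") // never reached
}

func main() {
	p := &pool{stopped: make(chan struct{})}

	// The in-flight scrape: scrapeAndReport -> Transaction.Add ->
	// TargetsAll -> ActiveTargets, which blocks on the pool mutex,
	// so close(p.stopped) is never reached and Sync never returns.
	go func() {
		time.Sleep(100 * time.Millisecond)
		p.ActiveTargets()
		close(p.stopped)
	}()

	done := make(chan struct{})
	go func() { p.Sync(); close(done) }()

	select {
	case <-done:
		fmt.Println("no deadlock")
	case <-time.After(2 * time.Second):
		fmt.Println("deadlock: Sync holds the lock while waiting for a loop that needs it")
	}
}
```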
Proof
Since we claim that the issue arises when scrape targets are dropped and not when they are restarted, it’s sufficient to test what happens when we stop scrape targets while the collector is in the middle of its scrape cycle. This doesn’t occur all the time since the deadlock happens in a very specific scenario listed above. To increase the chances of triggering it, set your scrape interval low (e.g. 15s) and the number of scrape targets high (e.g. 15 replicas).
Setup
I have a cluster with the collector and 15 replicas of a sample app that emits Prometheus metrics. The Prometheus receiver is configured with a scrape interval of 15s. I also use a logging exporter to verify that metrics are being received. I wait for the receiver to be right in the middle of a scrape loop before I scale down the number of replicas to 0. If we are lucky, we can trigger the race condition and get the following logs:
Proof 1

2020-11-11T01:14:38.114Z INFO scrape/scrape.go:397 {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "Sync acquired sp lock": "(MISSING)"}

This is a custom log message I placed right after `scrapePool.Sync` acquires the lock in step 1 of my hypothesis.

2020-11-11T01:14:45.768Z INFO internal/metadata.go:45 TargetsAll acquired lock {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus"}

This is a custom log message I placed right after `scrapeManager.TargetsAll` acquires the `scrapeManager` mutex lock.

2020-11-11T01:14:45.768Z INFO scrape/scrape.go:268 {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "ActiveTargets trying to acquire sp lock": "(MISSING)"}

This is a custom log message I placed right before `scrapePool.ActiveTargets` tries to acquire the scrape pool's mutex lock. We should then expect to see the log message "ActiveTargets acquired sp lock", but we don't. Likewise, we should also get the log message "Sync released sp lock" when `scrapePool.Sync` releases the lock it acquired, but we don't.
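The custom messages in this proof are just log lines wrapped around the Lock/Unlock calls. A self-contained sketch of that technique, using generic names and the standard library logger rather than the actual patch to scrape.go:

```go
package main

import (
	"log"
	"sync"
	"time"
)

// loggedLock wraps a mutex with before/after log lines, the same technique
// used for the custom "trying to acquire"/"acquired" messages above.
func loggedLock(mu *sync.Mutex, who string) {
	log.Printf("%s trying to acquire sp lock", who)
	mu.Lock()
	log.Printf("%s acquired sp lock", who)
}

func loggedUnlock(mu *sync.Mutex, who string) {
	mu.Unlock()
	log.Printf("%s released sp lock", who)
}

func main() {
	var mtx sync.Mutex

	// The holder keeps the lock for a while, as Sync does while waiting on the
	// scrape loops; the second caller's "acquired" line only appears once the
	// lock is released, which is exactly the signal used in Proof 1.
	go func() {
		loggedLock(&mtx, "Sync")
		time.Sleep(500 * time.Millisecond)
		loggedUnlock(&mtx, "Sync")
	}()

	time.Sleep(100 * time.Millisecond)
	loggedLock(&mtx, "ActiveTargets")
	loggedUnlock(&mtx, "ActiveTargets")
}
```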
Proof 2
Within `scrapeLoop.stop`, I added logs before and after `<-sl.stopped`, which will execute when the scrape loop is successfully stopped by `scrapeLoop.run`. We should expect to see both "scrape loop waiting to stop..." and "scrape loop stopped!" if all scrape loops stop successfully. However, we get the following logs:

Each scrape loop is identified by its target. We notice that all scrape loops log out the pair of messages except the one with target 192.168.63.125. We wait for some time and still don't get the second message. Moreover, we get:

This might point to the fact that `sl.cancel()` was called but the scrape loop didn't receive it with `case <-sl.ctx.Done()` in the run loop.
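Proof 2 relies on how `scrapeLoop.stop` and `scrapeLoop.run` coordinate: `stop` cancels the loop's context and then blocks on `<-sl.stopped`, which `run` only closes after it returns. A simplified stand-in for that pattern (not the real `scrapeLoop` type) shows why a loop stuck in `scrapeAndReport` leaves `stop` hanging:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// loop is a stand-in for Prometheus's scrapeLoop.
type loop struct {
	ctx     context.Context
	cancel  context.CancelFunc
	stopped chan struct{}
}

func (l *loop) run(scrapeAndReport func()) {
	defer close(l.stopped) // signals stop() that the loop has exited
	ticker := time.NewTicker(50 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-l.ctx.Done():
			return
		case <-ticker.C:
			// If scrapeAndReport blocks (e.g. on the scrape pool mutex), the
			// loop never gets back to this select, so the cancellation is
			// never observed and stopped is never closed.
			scrapeAndReport()
		}
	}
}

func (l *loop) stop() {
	fmt.Println("scrape loop waiting to stop...")
	l.cancel()
	<-l.stopped
	fmt.Println("scrape loop stopped!")
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	l := &loop{ctx: ctx, cancel: cancel, stopped: make(chan struct{})}

	// A scrape that hangs forever stands in for the deadlocked scrapeAndReport.
	go l.run(func() { select {} })

	time.Sleep(100 * time.Millisecond)
	done := make(chan struct{})
	go func() { l.stop(); close(done) }()

	select {
	case <-done:
	case <-time.After(1 * time.Second):
		fmt.Println("stop() is blocked on <-sl.stopped, matching the missing \"scrape loop stopped!\" log")
	}
}
```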
Post-fix
After the fix, we don't see the above errors occurring, since the new version of Prometheus makes the mutex lock acquisition more granular. Specifically, we see the following logs:

Note that `scrapePool.Sync` releases the lock it acquired right away, allowing `scrapePool.ActiveTargets` to pick up the lock, resolving the race condition.
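Schematically, the upstream change the upgrade pulls in narrows the critical section: instead of holding the scrape pool mutex across the whole wait for the loops, the lock can be released before waiting, which is why `ActiveTargets` now acquires it. The sketch below contrasts the two locking shapes with simplified stand-in types; it is not the actual upstream diff:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// pool is a stand-in for Prometheus's scrapePool.
type pool struct {
	mtx     sync.Mutex
	stopped chan struct{}
}

// syncCoarse models the pre-fix shape: the pool mutex is held across the wait
// for the scrape loops, so activeTargets below could never get it.
func (p *pool) syncCoarse() {
	p.mtx.Lock()
	defer p.mtx.Unlock()
	<-p.stopped // wait for the loops while still holding the lock
}

// syncGranular models the post-fix shape: update state under the lock, release
// it, then wait, which is what "Sync releases the lock it acquired right away"
// refers to in the logs above.
func (p *pool) syncGranular() {
	p.mtx.Lock()
	// ...update active/dropped targets under the lock...
	p.mtx.Unlock()
	<-p.stopped // wait for the loops with the lock released
}

func (p *pool) activeTargets() {
	p.mtx.Lock()
	defer p.mtx.Unlock()
	fmt.Println("ActiveTargets acquired sp lock")
}

func main() {
	p := &pool{stopped: make(chan struct{})}
	go p.syncGranular() // with syncCoarse here, activeTargets would block forever
	time.Sleep(50 * time.Millisecond)
	p.activeTargets() // succeeds because the lock was released before the wait
	close(p.stopped)
}
```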