Upgrade Prometheus version to fix race condition in Prometheus receiver #2121

Merged

Conversation

kohrapha
Contributor

@kohrapha kohrapha commented Nov 11, 2020

Description:
There is a race condition in the Prometheus receiver that causes the problem described in #1909. After investigating with @hdj630, we found that it is triggered when Prometheus scrape targets are dropped in the middle of a scrape loop. In this scenario, a deadlock occurs on the scrapePool mutex. The deadlock blocks the receiver and produces the "Discovery receiver's channel was full so will retry the next cycle" error mentioned in the issue above. More details on the error can be found below.

This PR upgrades the Prometheus version to v2.22.1 to pull in an upstream commit that makes the mutex locking more granular, which resolves the issue. However, the upgraded Prometheus version caused an error with the github.com/shirou/gopsutil dependency, which had to be resolved by upgrading it to v3.20.10.

Testing:
Ran make to ensure all unit tests pass.

Further details of race condition

Hypothesis

While trying to replicate the issue, I realized that the error above might be caused by a race condition that occurs when scrape targets are removed. It does not happen every time, only in the specific situation that @hdj630 described here. We provide a more complete walkthrough of the error:

  1. ScrapeManager is alerted of dropped targets via ScrapeManager.reload(). scrapePool.Sync() is called on each scrape pool, which acquires the scrape pool's mutex lock.
  2. scrapePool.sync() is called, which iterates over all old targets and stops the scrape loops of dropped targets by calling scrapeLoop.stop() on them.
  3. The scrape pool then waits for all dropped scrape loops to stop before releasing the mutex it acquired in step 1.
  4. scrapeLoop.stop() calls scrapeLoop.cancel(), which is received in scrapeLoop.run().
  5. However, if the scrape loop happens to be in scrapeLoop.scrapeAndReport, this is when the deadlock occurs:
  6. The scrape loop creates a new storage.Appender, which is an ocaStore defined in the OTel Collector. This creates a new Transaction.
  7. scrapeAndReport then appends metrics to the new storage.Appender.
  8. Transaction.Add is called, and since the transaction is new, it calls transaction.initTransaction. This gets the target from the metadata service, which calls scrapeManager.TargetsAll.
  9. scrapeManager.TargetsAll iterates through each scrape pool and gets its active and dropped targets. However, scrapePool.ActiveTargets needs to acquire the scrape pool's mutex, which is still held from step 1, since Sync won't release it until all scrape loops are done. This causes the deadlock.

Note that in this hypothesis, the deadlock is on the scrape pool's mutex and not the scrape manager's. A minimal, self-contained sketch of this lock ordering follows.
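
To make the lock ordering concrete, here is a minimal, self-contained Go sketch of the same pattern. The pool and loop types are hypothetical stand-ins for scrapePool and scrapeLoop (this is not the Prometheus code itself), and the syncActive channel exists only to force the unlucky interleaving that otherwise happens by chance:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// pool and loop are hypothetical stand-ins for Prometheus's scrapePool and
// scrapeLoop; only the lock ordering matters here, not the real scrape logic.
type pool struct {
	mtx   sync.Mutex
	loops []*loop
}

type loop struct {
	p          *pool
	cancel     chan struct{} // closed by Sync, like scrapeLoop.cancel()
	stopped    chan struct{} // closed when run() returns, like scrapeLoop.stopped
	syncActive chan struct{} // test hook: forces the unlucky interleaving
}

// ActiveTargets mirrors scrapePool.ActiveTargets: it needs the pool mutex.
func (p *pool) ActiveTargets() {
	p.mtx.Lock()
	defer p.mtx.Unlock()
}

// Sync mirrors the pre-fix scrapePool.Sync/sync: it stops dropped loops and
// waits for them while still holding the pool mutex (steps 1-3 above).
func (p *pool) Sync() {
	p.mtx.Lock()
	defer p.mtx.Unlock()
	for _, l := range p.loops {
		close(l.syncActive) // tell the loop the mutex is now held
		close(l.cancel)     // scrapeLoop.stop() -> sl.cancel()
		<-l.stopped         // blocks forever if the loop is stuck mid-scrape
	}
}

// run mirrors scrapeLoop.run/scrapeAndReport: mid-scrape, the transaction
// calls TargetsAll, which in turn calls ActiveTargets (steps 6-9 above).
func (l *loop) run() {
	defer close(l.stopped)
	// The loop is already past its cancel check when Sync starts...
	<-l.syncActive
	// ...so the next thing it does is call back into the pool, whose mutex
	// Sync is holding while it waits for us. Deadlock.
	l.p.ActiveTargets()
}

func main() {
	p := &pool{}
	l := &loop{
		p:          p,
		cancel:     make(chan struct{}),
		stopped:    make(chan struct{}),
		syncActive: make(chan struct{}),
	}
	p.loops = append(p.loops, l)
	go l.run()

	done := make(chan struct{})
	go func() { p.Sync(); close(done) }()
	select {
	case <-done:
		fmt.Println("Sync returned (no deadlock)")
	case <-time.After(2 * time.Second):
		fmt.Println("Sync still blocked after 2s: scrape pool mutex deadlock reproduced")
	}
}
```

Running this prints the deadlock message after the two-second timeout: Sync is parked on <-l.stopped while the loop is parked inside ActiveTargets on the same mutex.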

Proof

Since we claim that the issue arises when scrape targets are dropped, not when they are restarted, it's sufficient to test what happens when we stop scrape targets while the collector is in the middle of its scrape cycle. The deadlock doesn't occur every time, since it requires the very specific scenario listed above. To increase the chances of triggering it, set your scrape interval low (e.g. 15s) and the number of scrape targets high (e.g. 15 replicas).

Setup

I have a cluster with the collector and 15 replicas of a sample app that emits Prometheus metrics. The Prometheus receiver is configured with a scrape interval of 15s. I also use a logging exporter to verify that metrics are being received. I wait for the receiver to be right in the middle of a scrape loop before I scale down the number of replicas to 0. If we are lucky, we can trigger the race condition and get the following logs:

Proof 1

  1. 2020-11-11T01:14:38.114Z INFO scrape/scrape.go:397 {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "Sync acquired sp lock": "(MISSING)"} This is a custom log message I placed right after scrapePool.Sync acquires the lock in step 1 of my hypothesis. A sketch of this instrumentation is shown after this list.
  2. 2020-11-11T01:14:45.768Z INFO internal/metadata.go:45 TargetsAll acquired lock {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus"} This is a custom log message I placed right after scrapeManager.TargetsAll acquires the scrapeManager mutex lock.
  3. 2020-11-11T01:14:45.768Z INFO scrape/scrape.go:268 {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "ActiveTargets trying to acquire sp lock": "(MISSING)"} This is a custom log message I placed right before scrapePool.ActiveTargets tries to acquire the scrape pool’s mutex lock.
  4. If there is no deadlock, we should get the log message: "ActiveTargets acquired sp lock", but we don’t. Likewise, we should also get the log message: "Sync released sp lock" when scrapePool.Sync releases the lock it acquired, but we don’t.
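
For reference, the instrumentation behind Proof 1 was roughly of the following shape: log lines dropped around the existing lock calls in the vendored scrape package, plus a similar line around the TargetsAll call in the receiver's metadata service. This is a sketch only; the upstream method bodies are elided, and the single-key Log calls are likely why the values show up as "(MISSING)" in the output above.

```go
// Sketch only: custom log lines added around the existing locking in the
// vendored scrape package; "..." marks the unchanged upstream bodies.
func (sp *scrapePool) Sync(tgs []*targetgroup.Group) {
	sp.mtx.Lock()
	level.Info(sp.logger).Log("Sync acquired sp lock")
	defer func() {
		sp.mtx.Unlock()
		level.Info(sp.logger).Log("Sync released sp lock")
	}()
	// ... unchanged Sync body ...
}

func (sp *scrapePool) ActiveTargets() []*Target {
	level.Info(sp.logger).Log("ActiveTargets trying to acquire sp lock")
	sp.mtx.Lock()
	level.Info(sp.logger).Log("ActiveTargets acquired sp lock")
	defer sp.mtx.Unlock()
	// ... unchanged ActiveTargets body ...
	return nil // placeholder; the real method returns the active targets
}
```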

Proof 2

Within scrapeLoop.stop, I added logs before and after <-sl.stopped, the receive that completes once the scrape loop has been successfully stopped by scrapeLoop.run. We should expect to see both "scrape loop waiting to stop..." and "scrape loop stopped!" for every scrape loop if they all stop successfully. However, we get the following logs:

2020-11-11T01:14:38.114Z        INFO    scrape/scrape.go:397            {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "Sync acquired sp lock": "(MISSING)"}
2020-11-11T01:14:38.114Z        INFO    scrape/scrape.go:1203           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.63.125:8081/metrics", "scrape loop waiting to stop...": "(MISSING)"}
2020-11-11T01:14:38.114Z        INFO    scrape/scrape.go:1203           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.28.85:8081/metrics", "scrape loop waiting to stop...": "(MISSING)"}
2020-11-11T01:14:38.114Z        INFO    scrape/scrape.go:1205           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.28.85:8081/metrics", "scrape loop stopped!": "(MISSING)"}
2020-11-11T01:14:38.114Z        INFO    scrape/scrape.go:1203           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.65.35:8081/metrics", "scrape loop waiting to stop...": "(MISSING)"}
2020-11-11T01:14:38.114Z        INFO    scrape/scrape.go:1205           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.65.35:8081/metrics", "scrape loop stopped!": "(MISSING)"}
2020-11-11T01:14:38.114Z        INFO    scrape/scrape.go:1203           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.37.44:8081/metrics", "scrape loop waiting to stop...": "(MISSING)"}
2020-11-11T01:14:38.114Z        INFO    scrape/scrape.go:1205           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.37.44:8081/metrics", "scrape loop stopped!": "(MISSING)"}
2020-11-11T01:14:38.114Z        INFO    scrape/scrape.go:1203           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.6.71:8081/metrics", "scrape loop waiting to stop...": "(MISSING)"}
2020-11-11T01:14:38.114Z        INFO    scrape/scrape.go:1203           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.60.191:8081/metrics", "scrape loop waiting to stop...": "(MISSING)"}
2020-11-11T01:14:38.114Z        INFO    scrape/scrape.go:1205           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.60.191:8081/metrics", "scrape loop stopped!": "(MISSING)"}
2020-11-11T01:14:38.114Z        INFO    scrape/scrape.go:1203           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.66.239:8081/metrics", "scrape loop waiting to stop...": "(MISSING)"}
2020-11-11T01:14:38.114Z        INFO    scrape/scrape.go:1205           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.66.239:8081/metrics", "scrape loop stopped!": "(MISSING)"}
2020-11-11T01:14:38.114Z        INFO    scrape/scrape.go:1203           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.89.78:8081/metrics", "scrape loop waiting to stop...": "(MISSING)"}
2020-11-11T01:14:38.114Z        INFO    scrape/scrape.go:1205           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.89.78:8081/metrics", "scrape loop stopped!": "(MISSING)"}
2020-11-11T01:14:38.114Z        INFO    scrape/scrape.go:1203           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.26.65:8081/metrics", "scrape loop waiting to stop...": "(MISSING)"}
2020-11-11T01:14:38.114Z        INFO    scrape/scrape.go:1205           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.26.65:8081/metrics", "scrape loop stopped!": "(MISSING)"}
2020-11-11T01:14:38.114Z        INFO    scrape/scrape.go:1203           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.38.172:8081/metrics", "scrape loop waiting to stop...": "(MISSING)"}
2020-11-11T01:14:38.114Z        INFO    scrape/scrape.go:1205           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.38.172:8081/metrics", "scrape loop stopped!": "(MISSING)"}
2020-11-11T01:14:38.114Z        INFO    scrape/scrape.go:1203           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.5.195:8081/metrics", "scrape loop waiting to stop...": "(MISSING)"}
2020-11-11T01:14:38.114Z        INFO    scrape/scrape.go:1205           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.5.195:8081/metrics", "scrape loop stopped!": "(MISSING)"}
2020-11-11T01:14:38.114Z        INFO    scrape/scrape.go:1203           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.77.212:8081/metrics", "scrape loop waiting to stop...": "(MISSING)"}
2020-11-11T01:14:38.114Z        INFO    scrape/scrape.go:1205           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.77.212:8081/metrics", "scrape loop stopped!": "(MISSING)"}
2020-11-11T01:14:38.114Z        INFO    scrape/scrape.go:1203           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.79.71:8081/metrics", "scrape loop waiting to stop...": "(MISSING)"}
2020-11-11T01:14:38.114Z        INFO    scrape/scrape.go:1205           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.79.71:8081/metrics", "scrape loop stopped!": "(MISSING)"}
2020-11-11T01:14:38.114Z        INFO    scrape/scrape.go:1203           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.30.40:8081/metrics", "scrape loop waiting to stop...": "(MISSING)"}
2020-11-11T01:14:38.114Z        INFO    scrape/scrape.go:1205           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.30.40:8081/metrics", "scrape loop stopped!": "(MISSING)"}
2020-11-11T01:14:38.114Z        INFO    scrape/scrape.go:1203           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.48.248:8081/metrics", "scrape loop waiting to stop...": "(MISSING)"}
2020-11-11T01:14:38.115Z        INFO    scrape/scrape.go:1205           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.48.248:8081/metrics", "scrape loop stopped!": "(MISSING)"}

Each scrape loop is identified by its target. We notice that every scrape loop logs both messages except the one with target 192.168.63.125. We wait for some time and still don't get the second message. Moreover, we get:

2020-11-11T01:14:47.260Z DEBUG scrape/scrape.go:1096 Scrape failed {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.6.71:8081/metrics", "err": "Get \"http://192.168.6.71:8081/metrics\": context deadline exceeded"}

This suggests that sl.cancel() was called but the scrape loop never observed it at case <-sl.ctx.Done() in the run loop.
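
This is consistent with the shape of the stop/run pair in the vendored scrape package, paraphrased below (not a verbatim copy; the Proof 2 log lines sat around the <-sl.stopped receive). run only closes sl.stopped after leaving its main loop, and it can only leave that loop at a select point, never while scrapeAndReport is blocked inside the appender:

```go
// Paraphrased shape of scrapeLoop.stop and scrapeLoop.run (not verbatim).
func (sl *scrapeLoop) stop() {
	sl.cancel()
	// "scrape loop waiting to stop..." was logged here
	<-sl.stopped
	// "scrape loop stopped!" was logged here
}

func (sl *scrapeLoop) run(interval, timeout time.Duration, errc chan<- error) {
	// ... setup elided ...
mainLoop:
	for {
		select {
		case <-sl.ctx.Done():
			break mainLoop
		default:
		}
		// If scrapeAndReport blocks here (in our case inside the transaction's
		// TargetsAll -> ActiveTargets call), run never returns to a select,
		// never observes ctx.Done(), and never reaches close(sl.stopped).
		sl.scrapeAndReport( /* ... */ )
		select {
		case <-sl.ctx.Done():
			break mainLoop
		case <-ticker.C: // next scrape interval
		}
	}
	close(sl.stopped)
	// ... shutdown elided ...
}
```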

Post-fix

After the fix, we don't see the above errors occurring, since the new version of Prometheus makes the mutex lock acquisition more granular. Specifically, we see the following logs:

2020-11-11T17:46:10.617Z        INFO    scrape/scrape.go:416            {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "Sync acquired sp lock": "(MISSING)"}
2020-11-11T17:46:10.617Z        INFO    scrape/scrape.go:434            {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "Sync released sp lock": "(MISSING)"}
2020-11-11T17:46:10.617Z        INFO    scrape/scrape.go:1236           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.5.195:8081/metrics", "scrape loop waiting to stop...": "(MISSING)"}
2020-11-11T17:46:10.617Z        INFO    scrape/scrape.go:1238           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.5.195:8081/metrics", "scrape loop stopped!": "(MISSING)"}
2020-11-11T17:46:10.617Z        INFO    scrape/scrape.go:1236           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.26.65:8081/metrics", "scrape loop waiting to stop...": "(MISSING)"}
2020-11-11T17:46:10.617Z        INFO    scrape/scrape.go:1238           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.26.65:8081/metrics", "scrape loop stopped!": "(MISSING)"}
2020-11-11T17:46:10.617Z        INFO    scrape/scrape.go:1236           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.30.40:8081/metrics", "scrape loop waiting to stop...": "(MISSING)"}
2020-11-11T17:46:10.617Z        INFO    scrape/scrape.go:1238           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.30.40:8081/metrics", "scrape loop stopped!": "(MISSING)"}
2020-11-11T17:46:10.617Z        INFO    scrape/scrape.go:1236           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.69.22:8081/metrics", "scrape loop waiting to stop...": "(MISSING)"}
2020-11-11T17:46:10.617Z        INFO    scrape/scrape.go:1238           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.69.22:8081/metrics", "scrape loop stopped!": "(MISSING)"}
2020-11-11T17:46:10.617Z        INFO    scrape/scrape.go:1236           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.65.35:8081/metrics", "scrape loop waiting to stop...": "(MISSING)"}
2020-11-11T17:46:10.617Z        INFO    scrape/scrape.go:1238           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.65.35:8081/metrics", "scrape loop stopped!": "(MISSING)"}
2020-11-11T17:46:10.617Z        INFO    scrape/scrape.go:1236           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.37.228:8081/metrics", "scrape loop waiting to stop...": "(MISSING)"}
2020-11-11T17:46:10.617Z        INFO    scrape/scrape.go:1236           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.58.117:8081/metrics", "scrape loop waiting to stop...": "(MISSING)"}
2020-11-11T17:46:10.617Z        INFO    scrape/scrape.go:1238           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.58.117:8081/metrics", "scrape loop stopped!": "(MISSING)"}
2020-11-11T17:46:10.617Z        INFO    scrape/scrape.go:1236           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.60.191:8081/metrics", "scrape loop waiting to stop...": "(MISSING)"}
2020-11-11T17:46:10.618Z        INFO    scrape/scrape.go:1238           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.60.191:8081/metrics", "scrape loop stopped!": "(MISSING)"}
2020-11-11T17:46:10.618Z        INFO    scrape/scrape.go:1236           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.89.78:8081/metrics", "scrape loop waiting to stop...": "(MISSING)"}
2020-11-11T17:46:10.618Z        INFO    scrape/scrape.go:1236           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.37.44:8081/metrics", "scrape loop waiting to stop...": "(MISSING)"}
2020-11-11T17:46:10.618Z        INFO    scrape/scrape.go:1238           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.37.44:8081/metrics", "scrape loop stopped!": "(MISSING)"}
2020-11-11T17:46:10.618Z        INFO    scrape/scrape.go:1236           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.21.139:8081/metrics", "scrape loop waiting to stop...": "(MISSING)"}
2020-11-11T17:46:10.618Z        INFO    scrape/scrape.go:1236           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.14.190:8081/metrics", "scrape loop waiting to stop...": "(MISSING)"}
2020-11-11T17:46:10.618Z        INFO    scrape/scrape.go:1238           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.14.190:8081/metrics", "scrape loop stopped!": "(MISSING)"}
2020-11-11T17:46:10.618Z        INFO    scrape/scrape.go:1236           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.69.101:8081/metrics", "scrape loop waiting to stop...": "(MISSING)"}
2020-11-11T17:46:10.618Z        INFO    scrape/scrape.go:1238           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.69.101:8081/metrics", "scrape loop stopped!": "(MISSING)"}
2020-11-11T17:46:10.618Z        INFO    scrape/scrape.go:1236           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.93.90:8081/metrics", "scrape loop waiting to stop...": "(MISSING)"}
2020-11-11T17:46:10.618Z        INFO    scrape/scrape.go:1238           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.93.90:8081/metrics", "scrape loop stopped!": "(MISSING)"}
2020-11-11T17:46:10.618Z        INFO    scrape/scrape.go:1236           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.48.248:8081/metrics", "scrape loop waiting to stop...": "(MISSING)"}
2020-11-11T17:46:10.618Z        INFO    scrape/scrape.go:1238           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.48.248:8081/metrics", "scrape loop stopped!": "(MISSING)"}
2020-11-11T17:46:11.263Z        DEBUG   scrape/scrape.go:1129   Scrape failed   {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.37.228:8081/metrics", "err": "Get \"http://192.168.37.228:8081/metrics\": dial tcp 192.168.37.228:8081: connect: no route to host"}
2020-11-11T17:46:11.264Z        INFO    internal/metadata.go:45 TargetsAll acquired lock        {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus"}
2020-11-11T17:46:11.264Z        INFO    scrape/scrape.go:282            {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "ActiveTargets trying to acquire sp lock": "(MISSING)"}
2020-11-11T17:46:11.264Z        INFO    scrape/scrape.go:284            {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "ActiveTargets acquired sp lock": "(MISSING)"}
2020-11-11T17:46:11.264Z        INFO    scrape/scrape.go:292            {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "ActiveTargets released sp lock": "(MISSING)"}
2020-11-11T17:46:11.264Z        INFO    internal/metadata.go:45 TargetsAll released lock        {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus"}
2020-11-11T17:46:11.264Z        WARN    scrape/scrape.go:1096   Appending scrape report failed  {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.37.228:8081/metrics", "err": "unable to find a target with job=test, and instance=192.168.37.228:8081"}
github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport.func2
        github.com/prometheus/prometheus@v1.8.2-0.20200827201422-1195cc24e3c8/scrape/scrape.go:1096
github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport
        github.com/prometheus/prometheus@v1.8.2-0.20200827201422-1195cc24e3c8/scrape/scrape.go:1155
github.com/prometheus/prometheus/scrape.(*scrapeLoop).run
        github.com/prometheus/prometheus@v1.8.2-0.20200827201422-1195cc24e3c8/scrape/scrape.go:1041
2020-11-11T17:46:11.264Z        INFO    scrape/scrape.go:1238           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.37.228:8081/metrics", "scrape loop stopped!": "(MISSING)"}
2020-11-11T17:46:13.063Z        DEBUG   scrape/scrape.go:1129   Scrape failed   {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.89.78:8081/metrics", "err": "Get \"http://192.168.89.78:8081/metrics\": dial tcp 192.168.89.78:8081: connect: no route to host"}
2020-11-11T17:46:13.064Z        INFO    internal/metadata.go:45 TargetsAll acquired lock        {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus"}
2020-11-11T17:46:13.064Z        INFO    scrape/scrape.go:282            {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "ActiveTargets trying to acquire sp lock": "(MISSING)"}
2020-11-11T17:46:13.064Z        INFO    scrape/scrape.go:284            {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "ActiveTargets acquired sp lock": "(MISSING)"}
2020-11-11T17:46:13.064Z        INFO    scrape/scrape.go:292            {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "ActiveTargets released sp lock": "(MISSING)"}
2020-11-11T17:46:13.064Z        INFO    internal/metadata.go:45 TargetsAll released lock        {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus"}
2020-11-11T17:46:13.064Z        WARN    scrape/scrape.go:1096   Appending scrape report failed  {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.89.78:8081/metrics", "err": "unable to find a target with job=test, and instance=192.168.89.78:8081"}
github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport.func2
        github.com/prometheus/prometheus@v1.8.2-0.20200827201422-1195cc24e3c8/scrape/scrape.go:1096
github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport
        github.com/prometheus/prometheus@v1.8.2-0.20200827201422-1195cc24e3c8/scrape/scrape.go:1155
github.com/prometheus/prometheus/scrape.(*scrapeLoop).run
        github.com/prometheus/prometheus@v1.8.2-0.20200827201422-1195cc24e3c8/scrape/scrape.go:1041
2020-11-11T17:46:13.064Z        INFO    scrape/scrape.go:1238           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.89.78:8081/metrics", "scrape loop stopped!": "(MISSING)"}
2020-11-11T17:46:18.506Z        DEBUG   scrape/scrape.go:1129   Scrape failed   {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.21.139:8081/metrics", "err": "Get \"http://192.168.21.139:8081/metrics\": context deadline exceeded"}
2020-11-11T17:46:18.507Z        INFO    internal/metadata.go:45 TargetsAll acquired lock        {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus"}
2020-11-11T17:46:18.507Z        INFO    scrape/scrape.go:282            {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "ActiveTargets trying to acquire sp lock": "(MISSING)"}
2020-11-11T17:46:18.507Z        INFO    scrape/scrape.go:284            {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "ActiveTargets acquired sp lock": "(MISSING)"}
2020-11-11T17:46:18.507Z        INFO    scrape/scrape.go:292            {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "ActiveTargets released sp lock": "(MISSING)"}
2020-11-11T17:46:18.507Z        INFO    internal/metadata.go:45 TargetsAll released lock        {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus"}
2020-11-11T17:46:18.507Z        WARN    scrape/scrape.go:1096   Appending scrape report failed  {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.21.139:8081/metrics", "err": "unable to find a target with job=test, and instance=192.168.21.139:8081"}
github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport.func2
        github.com/prometheus/prometheus@v1.8.2-0.20200827201422-1195cc24e3c8/scrape/scrape.go:1096
github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport
        github.com/prometheus/prometheus@v1.8.2-0.20200827201422-1195cc24e3c8/scrape/scrape.go:1155
github.com/prometheus/prometheus/scrape.(*scrapeLoop).run
        github.com/prometheus/prometheus@v1.8.2-0.20200827201422-1195cc24e3c8/scrape/scrape.go:1041
2020-11-11T17:46:18.507Z        INFO    scrape/scrape.go:1238           {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "target": "http://192.168.21.139:8081/metrics", "scrape loop stopped!": "(MISSING)"}

Note that scrapePool.Sync now releases the lock it acquired right away, allowing scrapePool.ActiveTargets to acquire it in turn and resolving the race condition.
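
As I understand the upstream change, the target maps get their own, narrower locking so that ActiveTargets no longer competes for the lock that is held while loops shut down. In terms of the toy sketch from the hypothesis section, the effect is roughly equivalent to the following (a sketch, not the upstream diff):

```go
// Post-fix behaviour, expressed against the toy pool from the earlier sketch:
// decide what to stop while holding the lock, but wait for the loops to stop
// only after releasing it, so a loop that is mid-scrape can still finish its
// ActiveTargets call and then shut down cleanly.
func (p *pool) Sync() {
	p.mtx.Lock()
	toStop := make([]*loop, len(p.loops))
	copy(toStop, p.loops)
	p.mtx.Unlock() // released before blocking on loop shutdown

	for _, l := range toStop {
		close(l.syncActive)
		close(l.cancel)
		<-l.stopped // now completes: ActiveTargets can acquire p.mtx
	}
}
```

With this version of Sync, the demo program above prints "Sync returned (no deadlock)" instead of timing out.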

@kohrapha kohrapha requested a review from a team November 11, 2020 19:06
@codecov

codecov bot commented Nov 11, 2020

Codecov Report

Merging #2121 (2d94ad9) into master (fcc5852) will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master    #2121   +/-   ##
=======================================
  Coverage   92.11%   92.11%           
=======================================
  Files         279      279           
  Lines       16992    16992           
=======================================
  Hits        15652    15652           
  Misses        921      921           
  Partials      419      419           
Impacted Files Coverage Δ
translator/internaldata/resource_to_oc.go 91.54% <0.00%> (ø)


Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@kohrapha kohrapha force-pushed the kohrapha/prom-version-upgrade branch from 3e73d38 to 2d94ad9 on November 11, 2020 20:04
@bogdandrutu bogdandrutu merged commit 699005f into open-telemetry:master Nov 12, 2020
@kohrapha kohrapha deleted the kohrapha/prom-version-upgrade branch November 12, 2020 16:44