Upgrade Prometheus version to fix race condition in Prometheus receiver #2121
Description:
There is a race condition in the Prometheus receiver which causes the problem described in #1909. After investigating with @hdj630, we found that the race condition occurs when Prometheus scrape targets are dropped in the middle of a scrape loop. In this scenario, a deadlock occurs on the `scrapePool` mutex. This deadlock causes the "Discovery receiver's channel was full so will retry the next cycle" error mentioned in the issue above. More details of this error can be found below.

This PR upgrades the Prometheus version to `v2.22.1` to include an upstream commit that makes the mutex locks more granular, which resolves this issue. However, due to the upgraded Prometheus version, there was an error with the `github.com/shirou/gopsutil` dependency, which had to be resolved by upgrading its version to `v3.20.10`.
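For context on why the deadlock surfaces as a discovery-side error: once the receiver is blocked it stops draining target updates, and a non-blocking send on a full channel is roughly the shape of the pattern that produces a "channel was full so will retry the next cycle" style message. A minimal sketch of that pattern (simplified names, not the actual Prometheus discovery code):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Discovery side sends target-group updates; the receiver side consumes them.
	updates := make(chan string, 1)

	// Stand-in for the consumer that is stuck, e.g. behind the deadlocked
	// scrapePool mutex: it never gets around to draining the channel.
	block := make(chan struct{})
	go func() {
		<-block // never unblocked in this sketch
		for u := range updates {
			fmt.Println("applied", u)
		}
	}()

	for i := 0; i < 3; i++ {
		select {
		case updates <- fmt.Sprintf("target-group-%d", i):
			fmt.Println("sent update", i)
		default:
			// A non-blocking send on a full channel is what produces a
			// "channel was full, will retry" style message.
			fmt.Println("Discovery receiver's channel was full so will retry the next cycle")
		}
		time.Sleep(10 * time.Millisecond)
	}
}
```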
Testing:
Ran `make` to ensure all unit tests pass.

Further details of race condition
Hypothesis

I tried replicating the issue and realized that the error mentioned might be caused by a race condition that occurs when scrape targets are removed. This does not always happen, but it does in the specific situation that @hdj630 described here. A more complete walkthrough of the error: `ScrapeManager` is alerted of dropped targets via `ScrapeManager.reload()`, and then:

1. `scrapePool.Sync()` is called on each scrape pool, which acquires the scrape pool mutex lock.
2. `scrapePool.sync()` is called, which iterates over all old targets and stops the scrape loops whose targets were dropped, calling `scrapeLoop.stop()` on them.
3. `scrapeLoop.stop()` calls `scrapeLoop.cancel()`, which is received in `scrapeLoop.run()`. However, if the scrape loop is in the middle of `scrapeLoop.scrapeAndReport`, this is when the deadlock occurs.
4. `scrapeAndReport` creates a new `storage.Appender`, which is an `ocaStore` defined in Otel Collector. This creates a new `Transaction`.
5. `scrapeAndReport` then calls append on the new `storage.Appender` to add metrics to it.
6. `Transaction.Add` is called, and since the transaction is new, it calls `transaction.initTransaction`. This gets the target from the metadata service, which calls `scrapeManager.TargetsAll`.
7. `scrapeManager.TargetsAll` iterates through each scrape pool and gets the active and dropped targets for each. However, `scrapePool.ActiveTargets` wants to acquire the scrape pool mutex lock, which is still being held from step 1 and won't be released until all scrape loops are done. This causes the deadlock.

Note that in this hypothesis, the deadlock occurs with the scrape pool's mutex lock and not the scrape manager's.
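To make the lock ordering concrete, here is a stripped-down reproduction of the cycle above. The types and method names are simplified stand-ins for `scrapePool.Sync`, `scrapeLoop.stop`, and `scrapePool.ActiveTargets`, not the real Prometheus implementation:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// pool is a stand-in for Prometheus's scrapePool.
type pool struct {
	mtx     sync.Mutex
	stopped chan struct{} // closed when the scrape loop has fully stopped
}

// Sync mirrors scrapePool.Sync/sync: it holds the pool mutex while waiting
// for the scrape loop to stop.
func (p *pool) Sync() {
	p.mtx.Lock()
	defer p.mtx.Unlock()
	fmt.Println("Sync acquired sp lock")
	<-p.stopped                           // wait for the scrape loop to stop (scrapeLoop.stop)
	fmt.Println("Sync released sp lock")  // never reached
}

// ActiveTargets mirrors scrapePool.ActiveTargets: it needs the same mutex.
func (p *pool) ActiveTargets() {
	fmt.Println("ActiveTargets trying to acquire sp lock")
	p.mtx.Lock()
	defer p.mtx.Unlock()
	fmt.Println("ActiveTargets acquired sp lock") // never reached
}

func main() {
	p := &pool{stopped: make(chan struct{})}

	// The in-flight scrape: scrapeAndReport -> Transaction.Add ->
	// TargetsAll -> ActiveTargets, which blocks on the pool mutex,
	// so close(p.stopped) is never reached and Sync never returns.
	go func() {
		time.Sleep(100 * time.Millisecond)
		p.ActiveTargets()
		close(p.stopped)
	}()

	done := make(chan struct{})
	go func() { p.Sync(); close(done) }()

	select {
	case <-done:
		fmt.Println("no deadlock")
	case <-time.After(2 * time.Second):
		fmt.Println("deadlock: Sync holds the lock while waiting for a loop that needs it")
	}
}
```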
Proof
Since we claim that the issue arises when scrape targets are dropped and not when they are restarted, it’s sufficient to test what happens when we stop scrape targets while the collector is in the middle of its scrape cycle. This doesn’t occur all the time since the deadlock happens in a very specific scenario listed above. To increase the chances of triggering it, set your scrape interval low (e.g. 15s) and the number of scrape targets high (e.g. 15 replicas).
Setup
I have a cluster with the collector and 15 replicas of a sample app that emits Prometheus metrics. The Prometheus receiver is configured with a scrape interval of 15s. I also use a logging exporter to verify that metrics are being received. I wait for the receiver to be right in the middle of a scrape loop before I scale down the number of replicas to 0. If we are lucky, we can trigger the race condition and get the following logs:
Proof 1

2020-11-11T01:14:38.114Z INFO scrape/scrape.go:397 {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "Sync acquired sp lock": "(MISSING)"}

This is a custom log message I placed right after `scrapePool.Sync` acquires the lock in step 1 of my hypothesis.

2020-11-11T01:14:45.768Z INFO internal/metadata.go:45 TargetsAll acquired lock {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus"}

This is a custom log message I placed right after `scrapeManager.TargetsAll` acquires the `scrapeManager` mutex lock.

2020-11-11T01:14:45.768Z INFO scrape/scrape.go:268 {"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "scrape_pool": "test", "ActiveTargets trying to acquire sp lock": "(MISSING)"}

This is a custom log message I placed right before `scrapePool.ActiveTargets` tries to acquire the scrape pool's mutex lock. We should then expect to see the log message "ActiveTargets acquired sp lock", but we don't. Likewise, we should also get the log message "Sync released sp lock" when `scrapePool.Sync` releases the lock it acquired, but we don't.
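The custom messages in this proof are just log lines wrapped around the Lock/Unlock calls. A self-contained sketch of that technique, using generic names and the standard library logger rather than the actual patch to scrape.go:

```go
package main

import (
	"log"
	"sync"
	"time"
)

// loggedLock wraps a mutex with before/after log lines, the same technique
// used for the custom "trying to acquire"/"acquired" messages above.
func loggedLock(mu *sync.Mutex, who string) {
	log.Printf("%s trying to acquire sp lock", who)
	mu.Lock()
	log.Printf("%s acquired sp lock", who)
}

func loggedUnlock(mu *sync.Mutex, who string) {
	mu.Unlock()
	log.Printf("%s released sp lock", who)
}

func main() {
	var mtx sync.Mutex

	// The holder keeps the lock for a while, as Sync does while waiting on the
	// scrape loops; the second caller's "acquired" line only appears once the
	// lock is released, which is exactly the signal used in Proof 1.
	go func() {
		loggedLock(&mtx, "Sync")
		time.Sleep(500 * time.Millisecond)
		loggedUnlock(&mtx, "Sync")
	}()

	time.Sleep(100 * time.Millisecond)
	loggedLock(&mtx, "ActiveTargets")
	loggedUnlock(&mtx, "ActiveTargets")
}
```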
Proof 2
Within `scrapeLoop.stop`, I added logs before and after `<-sl.stopped`, which will execute when the scrape loop is successfully stopped by `scrapeLoop.run`. We should expect to see both "scrape loop waiting to stop..." and "scrape loop stopped!" if all scrape loops stop successfully. However, we get the following logs:

Each scrape loop is identified by its target. We notice that all scrape loops log out the pair of messages except the one with target 192.168.63.125. We wait for some time and still don't get the second message. Moreover, we get:

This might point to the fact that `sl.cancel()` was called but the scrape loop didn't receive it with `case <-sl.ctx.Done()` in the run loop.
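Proof 2 relies on how `scrapeLoop.stop` and `scrapeLoop.run` coordinate: `stop` cancels the loop's context and then blocks on `<-sl.stopped`, which `run` only closes after it returns. A simplified stand-in for that pattern (not the real `scrapeLoop` type) shows why a loop stuck in `scrapeAndReport` leaves `stop` hanging:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// loop is a stand-in for Prometheus's scrapeLoop.
type loop struct {
	ctx     context.Context
	cancel  context.CancelFunc
	stopped chan struct{}
}

func (l *loop) run(scrapeAndReport func()) {
	defer close(l.stopped) // signals stop() that the loop has exited
	ticker := time.NewTicker(50 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-l.ctx.Done():
			return
		case <-ticker.C:
			// If scrapeAndReport blocks (e.g. on the scrape pool mutex), the
			// loop never gets back to this select, so the cancellation is
			// never observed and stopped is never closed.
			scrapeAndReport()
		}
	}
}

func (l *loop) stop() {
	fmt.Println("scrape loop waiting to stop...")
	l.cancel()
	<-l.stopped
	fmt.Println("scrape loop stopped!")
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	l := &loop{ctx: ctx, cancel: cancel, stopped: make(chan struct{})}

	// A scrape that hangs forever stands in for the deadlocked scrapeAndReport.
	go l.run(func() { select {} })

	time.Sleep(100 * time.Millisecond)
	done := make(chan struct{})
	go func() { l.stop(); close(done) }()

	select {
	case <-done:
	case <-time.After(1 * time.Second):
		fmt.Println("stop() is blocked on <-sl.stopped, matching the missing \"scrape loop stopped!\" log")
	}
}
```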
Post-fix
After the fix, we don't see the above errors occurring, since the new version of Prometheus makes the mutex lock acquisition more granular. Specifically, we see the following logs:

Note that `scrapePool.Sync` releases the lock it acquired right away, allowing `scrapePool.ActiveTargets` to pick up the lock, resolving the race condition.
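Schematically, the upstream change the upgrade pulls in narrows the critical section: instead of holding the scrape pool mutex across the whole wait for the loops, the lock can be released before waiting, which is why `ActiveTargets` now acquires it. The sketch below contrasts the two locking shapes with simplified stand-in types; it is not the actual upstream diff:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// pool is a stand-in for Prometheus's scrapePool.
type pool struct {
	mtx     sync.Mutex
	stopped chan struct{}
}

// syncCoarse models the pre-fix shape: the pool mutex is held across the wait
// for the scrape loops, so activeTargets below could never get it.
func (p *pool) syncCoarse() {
	p.mtx.Lock()
	defer p.mtx.Unlock()
	<-p.stopped // wait for the loops while still holding the lock
}

// syncGranular models the post-fix shape: update state under the lock, release
// it, then wait, which is what "Sync releases the lock it acquired right away"
// refers to in the logs above.
func (p *pool) syncGranular() {
	p.mtx.Lock()
	// ...update active/dropped targets under the lock...
	p.mtx.Unlock()
	<-p.stopped // wait for the loops with the lock released
}

func (p *pool) activeTargets() {
	p.mtx.Lock()
	defer p.mtx.Unlock()
	fmt.Println("ActiveTargets acquired sp lock")
}

func main() {
	p := &pool{stopped: make(chan struct{})}
	go p.syncGranular() // with syncCoarse here, activeTargets would block forever
	time.Sleep(50 * time.Millisecond)
	p.activeTargets() // succeeds because the lock was released before the wait
	close(p.stopped)
}
```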