scrape + discovery: Required changes for serverless use #123

bwplotka · 2023-11-22T14:17:13Z

Rebases on top of @ridwanmsharif changes #122 with following updates:

Simplified discovery, I think we can always skip wait and make it more robust.
Adjusted TestManagerStopAfterScrapeAttempt to simulate bit more how external users might use it.
Removed one test, we test it TestManagerStopAfterScrapeAttempt already
Cleaned up comments.
Fixed lock bug that was here already.

More details on locking:

Checked -race on TestManagerStopAfterScrapeAttempt and indeed without your locks.

It produced RACE, I would argue it's not possible as everything is guarded by mtxScrape:

// reload -> ApplyConfig mtxScrape
// stop -> Stop/StopAfterScrapeAttempt/ApplyConfig mtxScrape
// sync -> Sync -> manager reload mtxScrape

Turned out - there was bug in our locking (mtxScrape Unlock was too soon). Fixed, and now no more locks are needed.

BTW @ridwanmsharif epic work on DiscoveryReloadOnStartup and SkipInitialWait in discovery. Super nice solutions! I changed it only slightly, but feel free to reject 🤗

This config option is quite useful on serverless environments where we are sensitive to the start up latencies of the scraper. The serverless instance might only last for a few seconds and may not be able to afford the minimum 5s reload interval before the scrape pools are created. Signed-off-by: Ridwan Sharif <ridwanmsharif@google.com>

This change adds an option that will allow users to skip the initial wait before sending target sets to the scrape manager. This is particularly useful in environments where the startup latency is required to be low just as in serverless deployments. Signed-off-by: Ridwan Sharif <ridwanmsharif@google.com>

* Simplified discovery, I think we can always skip wait and make it more robust. * Adjusted TestManagerStopAfterScrapeAttempt to simulate bit more how external users might use it. * Removed one test, we test it TestManagerStopAfterScrapeAttempt already * Cleaned up comments. * Fixed lock bug that was here already. More details on locking: Checked -race on TestManagerStopAfterScrapeAttempt and indeed without your locks. It produces RACE e.g. 0c00011c868 by goroutine 47: github.com/prometheus/prometheus/scrape.(*Manager).StopAfterScrapeAttempt() /Users/bwplotka/Repos/prometheus/scrape/manager.go:311 +0x238 github.com/prometheus/prometheus/scrape.TestManagerStopAfterScrapeAttempt.func8() /Users/bwplotka/Repos/prometheus/scrape/manager_test.go:781 +0x38 github.com/prometheus/prometheus/scrape.TestManagerStopAfterScrapeAttempt.func10() /Users/bwplotka/Repos/prometheus/scrape/manager_test.go:840 +0xa08 testing.tRunner() /Users/bwplotka/.gvm/gos/go1.20.3/src/testing/testing.go:1576 +0x188 testing.(*T).Run.func1() /Users/bwplotka/.gvm/gos/go1.20.3/src/testing/testing.go:1629 +0x40 Previous write at 0x00c00011c868 by goroutine 53: github.com/prometheus/prometheus/scrape.(*scrapePool).sync() /Users/bwplotka/Repos/prometheus/scrape/scrape.go:585 +0x834 github.com/prometheus/prometheus/scrape.(*scrapePool).Sync() /Users/bwplotka/Repos/prometheus/scrape/scrape.go:522 +0x4ac github.com/prometheus/prometheus/scrape.(*Manager).reload.func1() /Users/bwplotka/Repos/prometheus/scrape/manager.go:255 +0x48 github.com/prometheus/prometheus/scrape.(*Manager).reload.func2() /Users/bwplotka/Repos/prometheus/scrape/manager.go:257 +0x6c Goroutine 47 (running) created at: testing.(*T).Run() /Users/bwplotka/.gvm/gos/go1.20.3/src/testing/testing.go:1629 +0x5e4 github.com/prometheus/prometheus/scrape.TestManagerStopAfterScrapeAttempt() /Users/bwplotka/Repos/prometheus/scrape/manager_test.go:792 +0x130 testing.tRunner() /Users/bwplotka/.gvm/gos/go1.20.3/src/testing/testing.go:1576 +0x188 testing.(*T).Run.func1() /Users/bwplotka/.gvm/gos/go1.20.3/src/testing/testing.go:1629 +0x40 Goroutine 53 (finished) created at: github.com/prometheus/prometheus/scrape.(*Manager).reload() /Users/bwplotka/Repos/prometheus/scrape/manager.go:254 +0x774 github.com/prometheus/prometheus/scrape.(*Manager).reloader() /Users/bwplotka/Repos/prometheus/scrape/manager.go:210 +0x128 github.com/prometheus/prometheus/scrape.(*Manager).Run.func1() /Users/bwplotka/Repos/prometheus/scrape/manager.go:183 +0x34 ================== I would argue it's not possible as everything is guarded by mtxScrape. // reload -> ApplyConfig mtxScrape // stop -> Stop/StopAfterScrapeAttempt/ApplyConfig mtxScrape // sync -> Sync -> manager reload mtxScrape Indeed--there was a bug in previous code, those locks are now not needed. Signed-off-by: bwplotka <bwplotka@gmail.com>

bwplotka · 2023-11-22T14:20:33Z

Fuzzing issue seems unrelated, I would ignore.

bwplotka · 2023-11-22T14:39:29Z

Looks like races on my discovery changes. They might also occur on your changes, just we don't test those with this special flag 🤔

bwplotka · 2023-11-22T18:11:40Z

Merged #122 for now as this needs some love.

The race is on test (using discoveryManager.targets without lock while .Run is till there), but I need to verify we won't have some other consequences for non sidecar cases... t

ridwanmsharif and others added 4 commits November 20, 2023 21:26

scrape: use locks when shutting down the loops

4cddaf0

This was referenced Nov 22, 2023

scrape: use locks when shutting down the loops #122

Merged

scrape: add option to Manager to skip initial discovery reload #118

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

scrape + discovery: Required changes for serverless use #123

scrape + discovery: Required changes for serverless use #123

bwplotka commented Nov 22, 2023

bwplotka commented Nov 22, 2023

bwplotka commented Nov 22, 2023

bwplotka commented Nov 22, 2023

scrape + discovery: Required changes for serverless use #123

Are you sure you want to change the base?

scrape + discovery: Required changes for serverless use #123

Conversation

bwplotka commented Nov 22, 2023

bwplotka commented Nov 22, 2023

bwplotka commented Nov 22, 2023

bwplotka commented Nov 22, 2023