Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

scrape + discovery: Required changes for serverless use #123

Open
wants to merge 4 commits into
base: feature/shutdownscrape
Choose a base branch
from

Conversation

bwplotka
Copy link
Collaborator

Rebases on top of @ridwanmsharif changes #122 with following updates:

  • Simplified discovery, I think we can always skip wait and make it more robust.
  • Adjusted TestManagerStopAfterScrapeAttempt to simulate bit more how external users might use it.
  • Removed one test, we test it TestManagerStopAfterScrapeAttempt already
  • Cleaned up comments.
  • Fixed lock bug that was here already.

More details on locking:

Checked -race on TestManagerStopAfterScrapeAttempt and indeed without your locks.

It produced RACE, I would argue it's not possible as everything is guarded by mtxScrape:

// reload -> ApplyConfig mtxScrape
// stop -> Stop/StopAfterScrapeAttempt/ApplyConfig mtxScrape
// sync -> Sync -> manager reload mtxScrape

Turned out - there was bug in our locking (mtxScrape Unlock was too soon). Fixed, and now no more locks are needed.

BTW @ridwanmsharif epic work on DiscoveryReloadOnStartup and SkipInitialWait in discovery. Super nice solutions! I changed it only slightly, but feel free to reject 🤗

ridwanmsharif and others added 4 commits November 20, 2023 21:26
This config option is quite useful on serverless environments where
we are sensitive to the start up latencies of the scraper. The
serverless instance might only last for a few seconds and may not be
able to afford the minimum 5s reload interval before the scrape pools
are created.

Signed-off-by: Ridwan Sharif <ridwanmsharif@google.com>
This change adds an option that will allow users to skip the initial
wait before sending target sets to the scrape manager. This is
particularly useful in environments where the startup latency is
required to be low just as in serverless deployments.

Signed-off-by: Ridwan Sharif <ridwanmsharif@google.com>
* Simplified discovery, I think we can always skip wait and make it more robust.
* Adjusted TestManagerStopAfterScrapeAttempt to simulate bit more how external users might use it.
* Removed one test, we test it TestManagerStopAfterScrapeAttempt already
* Cleaned up comments.
* Fixed lock bug that was here already.

More details on locking:

Checked -race on TestManagerStopAfterScrapeAttempt and indeed without your locks.

It produces RACE e.g.

0c00011c868 by goroutine 47:
  github.com/prometheus/prometheus/scrape.(*Manager).StopAfterScrapeAttempt()
      /Users/bwplotka/Repos/prometheus/scrape/manager.go:311 +0x238
  github.com/prometheus/prometheus/scrape.TestManagerStopAfterScrapeAttempt.func8()
      /Users/bwplotka/Repos/prometheus/scrape/manager_test.go:781 +0x38
  github.com/prometheus/prometheus/scrape.TestManagerStopAfterScrapeAttempt.func10()
      /Users/bwplotka/Repos/prometheus/scrape/manager_test.go:840 +0xa08
  testing.tRunner()
      /Users/bwplotka/.gvm/gos/go1.20.3/src/testing/testing.go:1576 +0x188
  testing.(*T).Run.func1()
      /Users/bwplotka/.gvm/gos/go1.20.3/src/testing/testing.go:1629 +0x40

Previous write at 0x00c00011c868 by goroutine 53:
  github.com/prometheus/prometheus/scrape.(*scrapePool).sync()
      /Users/bwplotka/Repos/prometheus/scrape/scrape.go:585 +0x834
  github.com/prometheus/prometheus/scrape.(*scrapePool).Sync()
      /Users/bwplotka/Repos/prometheus/scrape/scrape.go:522 +0x4ac
  github.com/prometheus/prometheus/scrape.(*Manager).reload.func1()
      /Users/bwplotka/Repos/prometheus/scrape/manager.go:255 +0x48
  github.com/prometheus/prometheus/scrape.(*Manager).reload.func2()
      /Users/bwplotka/Repos/prometheus/scrape/manager.go:257 +0x6c

Goroutine 47 (running) created at:
  testing.(*T).Run()
      /Users/bwplotka/.gvm/gos/go1.20.3/src/testing/testing.go:1629 +0x5e4
  github.com/prometheus/prometheus/scrape.TestManagerStopAfterScrapeAttempt()
      /Users/bwplotka/Repos/prometheus/scrape/manager_test.go:792 +0x130
  testing.tRunner()
      /Users/bwplotka/.gvm/gos/go1.20.3/src/testing/testing.go:1576 +0x188
  testing.(*T).Run.func1()
      /Users/bwplotka/.gvm/gos/go1.20.3/src/testing/testing.go:1629 +0x40

Goroutine 53 (finished) created at:
  github.com/prometheus/prometheus/scrape.(*Manager).reload()
      /Users/bwplotka/Repos/prometheus/scrape/manager.go:254 +0x774
  github.com/prometheus/prometheus/scrape.(*Manager).reloader()
      /Users/bwplotka/Repos/prometheus/scrape/manager.go:210 +0x128
  github.com/prometheus/prometheus/scrape.(*Manager).Run.func1()
      /Users/bwplotka/Repos/prometheus/scrape/manager.go:183 +0x34
==================

I would argue it's not possible as everything is guarded by mtxScrape.

// reload -> ApplyConfig mtxScrape
// stop -> Stop/StopAfterScrapeAttempt/ApplyConfig mtxScrape
// sync -> Sync -> manager reload mtxScrape

Indeed--there was a bug in previous code, those locks are now not needed.

Signed-off-by: bwplotka <bwplotka@gmail.com>
@bwplotka
Copy link
Collaborator Author

Fuzzing issue seems unrelated, I would ignore.

@bwplotka
Copy link
Collaborator Author

Looks like races on my discovery changes. They might also occur on your changes, just we don't test those with this special flag 🤔

@bwplotka
Copy link
Collaborator Author

Merged #122 for now as this needs some love.

The race is on test (using discoveryManager.targets without lock while .Run is till there), but I need to verify we won't have some other consequences for non sidecar cases... t

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants