Don't start HTTP server in Scheduler.__init__ #4928
base: main
@@ -248,8 +248,8 @@ def test_Client_unused_kwargs_with_address(loop):


 def test_Client_twice(loop):
-    with Client(loop=loop, silence_logs=False, dashboard_address=None) as c:
-        with Client(loop=loop, silence_logs=False, dashboard_address=None) as f:
+    with Client(loop=loop, silence_logs=False, dashboard_address=":0") as c:
+        with Client(loop=loop, silence_logs=False, dashboard_address=":0") as f:
             assert c.cluster.scheduler.port != f.cluster.scheduler.port
@@ -1048,7 +1048,9 @@ async def test_no_workers(cleanup):


 @pytest.mark.asyncio
 async def test_cluster_names():
-    async with LocalCluster(processes=False, asynchronous=True) as unnamed_cluster:
+    async with LocalCluster(
+        processes=False, asynchronous=True, dashboard_address=":0"
+    ) as unnamed_cluster:
         async with LocalCluster(
             processes=False, asynchronous=True, name="mycluster"
         ) as named_cluster:
@@ -1070,12 +1072,15 @@ async def test_local_cluster_redundant_kwarg(nanny):
     # Extra arguments are forwarded to the worker class. Depending on
     # whether we use the nanny or not, the error treatment is quite
     # different and we should assert that an exception is raised
-    async with await LocalCluster(
-        typo_kwarg="foo", processes=nanny, n_workers=1
+    async with LocalCluster(
+        typo_kwarg="foo",
+        processes=nanny,
+        n_workers=1,
+        asynchronous=True,
     ) as cluster:

         # This will never work but is a reliable way to block without hard
         # coding any sleep values
-        async with Client(cluster) as c:
+        async with Client(cluster, asynchronous=True) as c:
             f = c.submit(sleep, 0)
             await f

Review comment (on the `async with Client(cluster, asynchronous=True) as c:` line): Nice catch
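As background for the `asynchronous=True` fix above: when a cluster and client are used from a running event loop, both need `asynchronous=True` so that `async with` / `await` drive their startup. A minimal usage sketch (not part of the PR; the submitted work and values are arbitrary):

```python
import asyncio

from distributed import Client, LocalCluster


async def main():
    # Inside a running event loop, both objects need asynchronous=True so
    # that `async with` awaits their startup instead of blocking the loop.
    async with LocalCluster(
        n_workers=1, processes=False, asynchronous=True, dashboard_address=":0"
    ) as cluster:
        async with Client(cluster, asynchronous=True) as client:
            future = client.submit(sum, [1, 2, 3])
            print(await future)  # -> 6


asyncio.run(main())
```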
@@ -25,10 +25,14 @@ def test_submit(self):
         assert future.result() == 2

     def test_context_manager(self):
-        with self.Cluster(**self.kwargs) as c:
+        kwargs = self.kwargs.copy()
+        kwargs.pop("dashboard_address")
+        with self.Cluster(dashboard_address=":54321", **kwargs) as c:
             with Client(c) as e:
                 assert e.nthreads()

     def test_no_workers(self):
-        with self.Cluster(0, scheduler_port=0, **self.kwargs):
+        kwargs = self.kwargs.copy()
+        kwargs.pop("dashboard_address")
+        with self.Cluster(0, dashboard_address=":54321", scheduler_port=0, **kwargs):
             pass

Review comment (on lines +28 to +29): This would silently ignore any …
Reply: Yes, I think I changed this before I thought of the idea to simply use a random port, which avoids the warning. I'll fix that.

Review comment (on lines +35 to +36): Similar comment here
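The random-port idea from the thread above would look roughly like this, using `LocalCluster` directly instead of the test harness's `self.Cluster` (a sketch for illustration, not the fix that ended up in the PR):

```python
from distributed import Client, LocalCluster

# ":0" asks the OS for any free port, so the dashboard_address kwarg does not
# have to be dropped and no "port already in use" UserWarning fires when
# several clusters run at the same time.
with LocalCluster(n_workers=1, dashboard_address=":0") as cluster:
    with Client(cluster) as client:
        assert client.nthreads()
```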
@@ -83,6 +83,13 @@
 from .utils_perf import disable_gc_diagnosis, enable_gc_diagnosis
 from .variable import VariableExtension

+try:
+    import bokeh  # noqa: F401
+
+    HAS_BOKEH = True
+except ImportError:
+    HAS_BOKEH = False
+

 try:
     from cython import compiled
 except ImportError:

Review comment (on lines +86 to +91): Nitpick: this is subjective and not worth spending much time on, but more commonly throughout the codebase we will catch the … (see distributed/distributed/utils.py, lines 33 to 36 in 7d0f010). Then later on we would do …
Reply: If we do this in other places already, I'll change it. I think keeping a consistent style in a codebase is worth the effort.
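The convention the reviewer points to presumably looks like the following — a sketch of the module-or-None style rather than the code in this PR (`needs_missing_bokeh_route` is a made-up helper for illustration):

```python
# Catch the ImportError once and bind the module name to None instead of
# keeping a separate HAS_BOKEH flag; later checks test the name directly.
try:
    import bokeh
except ImportError:
    bokeh = None


def needs_missing_bokeh_route(show_dashboard: bool) -> bool:
    # Made-up helper: install the fallback route only when the dashboard
    # was requested but bokeh is not importable.
    return bool(show_dashboard) and bokeh is None
```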
@@ -3438,24 +3445,11 @@ def __init__(
             default_port=self.default_port,
         )

-        http_server_modules = dask.config.get("distributed.scheduler.http.routes")
-        show_dashboard = dashboard or (dashboard is None and dashboard_address)
-        missing_bokeh = False
-        # install vanilla route if show_dashboard but bokeh is not installed
-        if show_dashboard:
-            try:
-                import distributed.dashboard.scheduler
-            except ImportError:
-                missing_bokeh = True
-                http_server_modules.append("distributed.http.scheduler.missing_bokeh")
-        routes = get_handlers(
-            server=self, modules=http_server_modules, prefix=http_prefix
+        self._show_dashboard: bool = dashboard or (
+            dashboard is None and dashboard_address
         )
-        self.start_http_server(routes, dashboard_address, default_port=8787)
-        if show_dashboard and not missing_bokeh:
-            distributed.dashboard.scheduler.connect(
-                self.http_application, self.http_server, self, prefix=http_prefix
-            )
+        self._dashboard_address = dashboard_address
+        self._http_prefix = http_prefix

         # Communication state
         self.loop = loop or IOLoop.current()

@@ -3759,6 +3753,23 @@ async def start(self):

         self.clear_task_state()

+        http_server_modules = dask.config.get("distributed.scheduler.http.routes")
+        assert isinstance(http_server_modules, list)
+
+        # install vanilla route if show_dashboard but bokeh is not installed
+        if self._show_dashboard and not HAS_BOKEH:
+            http_server_modules.append("distributed.http.scheduler.missing_bokeh")
+        routes = get_handlers(
+            server=self, modules=http_server_modules, prefix=self._http_prefix
+        )
+        self.start_http_server(routes, self._dashboard_address, default_port=8787)
+        if self._show_dashboard and HAS_BOKEH:
+            import distributed.dashboard.scheduler
+
+            distributed.dashboard.scheduler.connect(
+                self.http_application, self.http_server, self, prefix=self._http_prefix
+            )
+
         with suppress(AttributeError):
             for c in self._worker_coroutines:
                 c.cancel()

@@ -5463,7 +5474,7 @@ async def gather(self, comm=None, keys=None, serializers=None):
     def clear_task_state(self):
         # XXX what about nested state such as ClientState.wants_what
         # (see also fire-and-forget...)
-        logger.info("Clear task state")
+        logger.debug("Clear task state")
         for collection in self._task_state_collections:
             collection.clear()

Review comment (on the `logger.debug("Clear task state")` line): I changed this log level since it felt kind of wrong. This prints a …
Reply: Sure, I don't have a strong opinion on this.
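Condensed, the scheduler changes above follow one pattern: `__init__` only records the HTTP/dashboard configuration, and the port is bound in `start()`. A minimal illustrative sketch of that pattern (`MiniScheduler` and its attributes are invented, not the real Scheduler API):

```python
import asyncio


class MiniScheduler:
    """Invented example: no network side effects in __init__."""

    def __init__(self, dashboard_address=None, http_prefix="/"):
        # Only store configuration here; nothing listens on a port yet.
        self._dashboard_address = dashboard_address
        self._http_prefix = http_prefix
        self.http_server = None

    async def start(self):
        # Port binding (and any dashboard wiring) happens here instead.
        self.http_server = await asyncio.start_server(
            self._handle, host="127.0.0.1", port=0  # port 0: pick a free port
        )
        return self

    async def _handle(self, reader, writer):
        writer.close()


async def main():
    scheduler = MiniScheduler(dashboard_address=":0")  # cheap, no sockets yet
    await scheduler.start()  # sockets are created only now
    print(scheduler.http_server.sockets[0].getsockname())
    scheduler.http_server.close()
    await scheduler.http_server.wait_closed()


asyncio.run(main())
```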
@@ -7,6 +7,7 @@

 import pytest
 from tornado import gen
+from tornado.ioloop import IOLoop

 from distributed import Client, Nanny, Scheduler, Worker, config, default_client
 from distributed.core import rpc

@@ -18,6 +19,7 @@
     gen_test,
     inc,
     new_config,
+    start_cluster,
     tls_only_security,
     wait_for_port,
 )

@@ -267,3 +269,27 @@ def test_tls_cluster(tls_client):
 async def test_tls_scheduler(security, cleanup):
     async with Scheduler(security=security, host="localhost") as s:
         assert s.address.startswith("tls")
+
+
+from distributed.core import Status
+
+
+@pytest.mark.asyncio
+@pytest.mark.parametrize("w_cls", [Worker, Nanny])
+async def test_start_cluster_closes_scheduler_worker_failure(w_cls):
+    nthreads = [("127.0.0.1", 0)]
+    scheduler = "127.0.0.1"
+    loop = IOLoop.current()
+    for _ in range(2):
+        with pytest.raises(TypeError, match="got an unexpected keyword argument"):
+            await start_cluster(
+                nthreads,
+                scheduler,
+                loop,
+                security=None,
+                Worker=w_cls,
+                scheduler_kwargs={},
+                worker_kwargs={"dont": "start"},
+            )
+    assert all([s.status == Status.closed for s in Scheduler._instances])
+    assert all([w.status == Status.closed for w in Worker._instances])

Review comment (on the `from distributed.core import Status` line): Nitpick: can you move this import to the top of the module with the existing …
Reply: Sure, I'm somehow used to either linters or formatters making me aware of this; for some reason our config considers this perfectly fine 🤷♂️
@@ -45,6 +45,9 @@ parentdir_prefix = distributed-
 addopts = -v -rsxfE --durations=20
 filterwarnings =
     error:Since distributed.*:PendingDeprecationWarning
+
+    # See https://github.com/dask/distributed/issues/4806
+    error:Port:UserWarning:distributed.node
 minversion = 4
 markers =
     slow: marks tests as slow (deselect with '-m "not slow"')

Review comment (on the `error:Port:UserWarning:distributed.node` line): Woo
Reply: This might be a bit too optimistic, but let's wait and see. It may definitely interfere if the code base is run using …
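For reference, the new `filterwarnings` entry uses pytest's `-W`-style `action:message:category:module` syntax; an assumed-equivalent programmatic filter (for illustration only, not part of the PR) would be:

```python
import warnings

# Turn UserWarnings whose message starts with "Port" and which are emitted by
# the distributed.node module into errors, so the test suite fails loudly
# instead of silently falling back to another dashboard port.
warnings.filterwarnings(
    "error",
    message="Port",
    category=UserWarning,
    module="distributed.node",
)
```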
Review discussion on the `await self` change:

Comment: This is a very interesting change. I encountered a race condition where this `await` would actually trigger the warning after I moved the HTTP server startup code to `Scheduler.start`. Apparently the start is not idempotent, and sometimes we'd restart / start a scheduler twice, but only under some strange timing-sensitive conditions (I believe the handler was already active while the scheduler was still starting...). I can dig deeper here, but I figured there shouldn't be a reason to `await self` here, since this handler should only be registered after/during the start anyhow.

Reply: My hope would be that `await self` would be idempotent. If in the future we find that there is an easy way to make this true, I would be in support.

Reply: I'd assume an `if status == running: return` would already do the trick. If we prefer this route, I can revert the `await self` change here.

Reply: That, combined with using an async lock to avoid many `await self` calls at the same time. I think that we do this in other situations, like Client and Cluster. I'm surprised that we don't do it in Server.
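A minimal sketch of the guard being discussed, combining the status check with an async lock (illustrative only; this is not how `distributed.core.Server` is actually written):

```python
import asyncio


class IdempotentServer:
    """Invented example of an idempotent, awaitable start."""

    def __init__(self):
        self.status = "init"
        self._startup_lock = asyncio.Lock()

    async def start(self):
        # The lock serializes concurrent `await server` calls; the status
        # check turns repeated calls into no-ops instead of a restart.
        async with self._startup_lock:
            if self.status == "running":
                return self
            await asyncio.sleep(0)  # placeholder for real startup work
            self.status = "running"
            return self

    def __await__(self):
        # `await server` delegates to start(), so awaiting twice is safe.
        return self.start().__await__()
```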
Reply: The idempotence you are referring to is implemented once for `Server`, see distributed/distributed/core.py, lines 264 to 269 in d9bc3c6. For a reason I haven't understood yet, this still caused issues for me. My suspicion is that the scheduler had already arrived in a non-running but not properly closed state when I see this message. This would then try to revive a kind-of-dead scheduler and cause this warning. I'll let CI run on this a few times and see if I can reproduce.

Reply: I remembered running into this await before. I had a similar problem over in #4734.

Reply: Indeed, it turns out a properly timed connection request to a dead or dying scheduler can revive it. I guess this is merely an edge case relevant to our async test suite. Regardless, imho this is much more cleanly fixed in #4734, and I suggest merging that one first.

Reply: Note after I became a bit smarter: this await is here to ensure that the server is actually up. The problem is that the start is not idempotent; I stumbled over this in #4734 as well.