This repository has been archived by the owner on Dec 8, 2022. It is now read-only.

Increase http timeout default values #35

Merged 1 commit into jupyter:master on May 8, 2019

Conversation

kevin-bates
Member

The current defaults are 20 seconds, matching the defaults imposed
by Tornado. Since increasing these values is harmless, and since many
kernel starts take around 20-30 seconds in YARN clusters and similar
environments, longer values are preferable - so this PR increases them
to 60 seconds. This also gives server-side kernel startup issues that
would have surfaced just after the previous default a chance to be
reported, rather than having the kernel start succeed only for the
request to have already timed out.

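For readers hitting these timeouts from the client side, a minimal sketch of overriding the gateway client timeouts before starting the Notebook server follows. The `KG_CONNECT_TIMEOUT` and `KG_REQUEST_TIMEOUT` names are assumed from the NB2KG-style gateway client settings discussed in this thread; verify them against your installed client version.

```shell
# Sketch: raise the gateway client's connect and request timeouts to 60s
# before launching the Notebook server. Variable names are assumptions
# based on NB2KG-style KG_* settings; confirm for your client version.
export KG_CONNECT_TIMEOUT=60   # seconds allowed for the initial HTTP connect
export KG_REQUEST_TIMEOUT=60   # seconds allowed for the full kernel-start request
echo "connect=${KG_CONNECT_TIMEOUT}s request=${KG_REQUEST_TIMEOUT}s"
```

The Notebook server process would then be started in the same shell so it inherits these values.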
@meeseeksmachine

This pull request has been mentioned on Jupyter Community Forum. There might be relevant details there:

https://discourse.jupyter.org/t/notebook-failed-to-connect-with-enterprice-gateway-kernel-using-websocket-exception-in-kernelgatewaywsclient-connection-done/1025/2

Member

@lresende lresende left a comment


LGTM

@lresende lresende merged commit 2d3095a into jupyter:master May 8, 2019
@ArindamHalder7

@kevin-bates
This is in the context of the above issue [https://discourse.jupyter.org/t/notebook-failed-to-connect-with-enterprice-gateway-kernel-using-websocket-exception-in-kernelgatewaywsclient-connection-done/1025/2]
I am not able to post a message on the Jupyter community forum, so I am providing the EG log here. I will also try increasing KG_REQUEST_TIMEOUT.
Here is the log.

cto@cto:~/jupyter_enterprise_gateway$ jupyter enterprisegateway --ip=0.0.0.0 --port_retries=0
[I 2019-05-09 12:21:06.172 EnterpriseGatewayApp] Jupyter Enterprise Gateway 1.2.0 is available at http://0.0.0.0:8888
[W 190509 12:21:22 web:1782] 404 DELETE /api/kernels/9fd2feb9-f9df-48a9-a775-f6c82044c1a6 (10.100.11.9): Kernel does not exist: 9fd2feb9-f9df-48a9-a775-f6c82044c1a6
[W 190509 12:21:22 web:2246] 404 DELETE /api/kernels/9fd2feb9-f9df-48a9-a775-f6c82044c1a6 (10.100.11.9) 1.88ms
[I 190509 12:25:53 web:2246] 200 GET /api/kernelspecs (10.100.11.9) 860.85ms
[I 190509 12:26:23 web:2246] 200 GET /api/kernelspecs (10.100.11.9) 3.40ms
[I 2019-05-09 12:26:25.054 EnterpriseGatewayApp] Local kernel launched on '10.100.10.55', pid: 31136, pgid: 31136, KernelID: b59bcef8-774f-4e4b-a93d-01420d089b02, cmd: '['/usr/local/share/jupyter/kernels/spark_python_yarn_client/bin/run.sh', '--RemoteProcessProxy.kernel-id', '{kernel_id}', '--RemoteProcessProxy.response-address', '{response_address}', '--RemoteProcessProxy.port-range', '0..0', '--RemoteProcessProxy.spark-context-initialization-mode', 'lazy']'

Starting IPython kernel for Spark in Yarn Client mode on behalf of user cto

  • eval exec '' /home/cto/spark-2.3.1-bin-hadoop2.7/bin/spark-submit '--master spark://10.100.11.9:7077 --name ${KERNEL_ID:-ERROR__NO__KERNEL_ID} --conf spark.yarn.submit.waitAppCompletion=false' /usr/local/share/jupyter/kernels/spark_python_yarn_client/scripts/launch_ipykernel.py '' --RemoteProcessProxy.kernel-id '{kernel_id}' --RemoteProcessProxy.response-address '{response_address}' --RemoteProcessProxy.port-range 0..0 --RemoteProcessProxy.spark-context-initialization-mode lazy
    ++ exec /home/cto/spark-2.3.1-bin-hadoop2.7/bin/spark-submit --master spark://10.100.11.9:7077 --name b59bcef8-774f-4e4b-a93d-01420d089b02 --conf spark.yarn.submit.waitAppCompletion=false /usr/local/share/jupyter/kernels/spark_python_yarn_client/scripts/launch_ipykernel.py --RemoteProcessProxy.kernel-id '{kernel_id}' --RemoteProcessProxy.response-address '{response_address}' --RemoteProcessProxy.port-range 0..0 --RemoteProcessProxy.spark-context-initialization-mode lazy
    [I 2019-05-09 12:26:25.058 EnterpriseGatewayApp] Kernel started: b59bcef8-774f-4e4b-a93d-01420d089b02
    [I 190509 12:26:25 web:2246] 201 POST /api/kernels (10.100.11.9) 25.03ms
    [W 190509 12:26:25 web:2246] 404 GET /kernelspecs/spark_python_yarn_client/logo-64x64.png (10.100.11.9) 1.56ms
    [I 190509 12:26:25 web:2246] 200 GET /api/kernels/b59bcef8-774f-4e4b-a93d-01420d089b02 (10.100.11.9) 0.68ms
    [W 2019-05-09 12:26:25.805 EnterpriseGatewayApp] No session ID specified
    2019-05-09 12:26:26 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    [D 2019-05-09 12:26:27,186.186 launch_ipykernel] Using connection file '/tmp/kernel-{kernel_id}_jujb491y.json'.
    [E 2019-05-09 12:26:27,187.187 launch_ipykernel] Invalid format for response address '{response_address}'. Assuming 'pull' mode...
    2019-05-09 12:26:27 INFO SparkContext:54 - Running Spark version 2.3.1
    2019-05-09 12:26:27 INFO SparkContext:54 - Submitted application: b59bcef8-774f-4e4b-a93d-01420d089b02
    2019-05-09 12:26:27 INFO SecurityManager:54 - Changing view acls to: cto
    2019-05-09 12:26:27 INFO SecurityManager:54 - Changing modify acls to: cto
    2019-05-09 12:26:27 INFO SecurityManager:54 - Changing view acls groups to:
    2019-05-09 12:26:27 INFO SecurityManager:54 - Changing modify acls groups to:
    2019-05-09 12:26:27 INFO SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(cto); groups with view permissions: Set(); users with modify permissions: Set(cto); groups with modify permissions: Set()
    2019-05-09 12:26:27 INFO Utils:54 - Successfully started service 'sparkDriver' on port 35453.
    2019-05-09 12:26:27 INFO SparkEnv:54 - Registering MapOutputTracker
    2019-05-09 12:26:27 INFO SparkEnv:54 - Registering BlockManagerMaster
    2019-05-09 12:26:27 INFO BlockManagerMasterEndpoint:54 - Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
    2019-05-09 12:26:27 INFO BlockManagerMasterEndpoint:54 - BlockManagerMasterEndpoint up
    2019-05-09 12:26:27 INFO DiskBlockManager:54 - Created local directory at /tmp/blockmgr-570e9b16-f17c-43ed-933f-1826bb3dc81e
    2019-05-09 12:26:27 INFO MemoryStore:54 - MemoryStore started with capacity 366.3 MB
    2019-05-09 12:26:27 INFO SparkEnv:54 - Registering OutputCommitCoordinator
    2019-05-09 12:26:27 INFO log:192 - Logging initialized @2619ms
    2019-05-09 12:26:27 INFO Server:346 - jetty-9.3.z-SNAPSHOT
    2019-05-09 12:26:27 INFO Server:414 - Started @2682ms
    2019-05-09 12:26:27 WARN Utils:66 - Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
    2019-05-09 12:26:27 INFO AbstractConnector:278 - Started ServerConnector@d50c159{HTTP/1.1,[http/1.1]}{0.0.0.0:4041}
    2019-05-09 12:26:27 INFO Utils:54 - Successfully started service 'SparkUI' on port 4041.
    2019-05-09 12:26:27 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@610d26df{/jobs,null,AVAILABLE,@spark}
    2019-05-09 12:26:27 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@29e480e1{/jobs/json,null,AVAILABLE,@spark}
    2019-05-09 12:26:27 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3b58372{/jobs/job,null,AVAILABLE,@spark}
    2019-05-09 12:26:27 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@7d5d58f0{/jobs/job/json,null,AVAILABLE,@spark}
    2019-05-09 12:26:27 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@70b32c32{/stages,null,AVAILABLE,@spark}
    2019-05-09 12:26:27 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@5f328f45{/stages/json,null,AVAILABLE,@spark}
    2019-05-09 12:26:27 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@2c140304{/stages/stage,null,AVAILABLE,@spark}
    2019-05-09 12:26:27 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1921c8f5{/stages/stage/json,null,AVAILABLE,@spark}
    2019-05-09 12:26:27 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@15a1e4f6{/stages/pool,null,AVAILABLE,@spark}
    2019-05-09 12:26:27 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@64ec7ed8{/stages/pool/json,null,AVAILABLE,@spark}
    2019-05-09 12:26:27 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@27e78fa3{/storage,null,AVAILABLE,@spark}
    2019-05-09 12:26:27 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@30968dca{/storage/json,null,AVAILABLE,@spark}
    2019-05-09 12:26:27 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@2a8a597a{/storage/rdd,null,AVAILABLE,@spark}
    2019-05-09 12:26:27 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@2d97e84d{/storage/rdd/json,null,AVAILABLE,@spark}
    2019-05-09 12:26:27 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3616f0cb{/environment,null,AVAILABLE,@spark}
    2019-05-09 12:26:27 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@326ac31e{/environment/json,null,AVAILABLE,@spark}
    2019-05-09 12:26:27 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4ad30da0{/executors,null,AVAILABLE,@spark}
    2019-05-09 12:26:27 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@58144137{/executors/json,null,AVAILABLE,@spark}
    2019-05-09 12:26:27 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@59856256{/executors/threadDump,null,AVAILABLE,@spark}
    2019-05-09 12:26:27 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3d93384b{/executors/threadDump/json,null,AVAILABLE,@spark}
    2019-05-09 12:26:27 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6f7f9c54{/static,null,AVAILABLE,@spark}
    2019-05-09 12:26:27 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3b1ffd2b{/,null,AVAILABLE,@spark}
    2019-05-09 12:26:27 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@497bd345{/api,null,AVAILABLE,@spark}
    2019-05-09 12:26:27 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@257f10c9{/jobs/job/kill,null,AVAILABLE,@spark}
    2019-05-09 12:26:27 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6d6f1501{/stages/stage/kill,null,AVAILABLE,@spark}
    2019-05-09 12:26:27 INFO SparkUI:54 - Bound SparkUI to 0.0.0.0, and started at http://cto:4041
    2019-05-09 12:26:28 INFO SparkContext:54 - Added file file:/usr/local/share/jupyter/kernels/spark_python_yarn_client/scripts/launch_ipykernel.py at spark://cto:35453/files/launch_ipykernel.py with timestamp 1557384988220
    2019-05-09 12:26:28 INFO Utils:54 - Copying /usr/local/share/jupyter/kernels/spark_python_yarn_client/scripts/launch_ipykernel.py to /tmp/spark-3bdad42c-8410-4139-95da-1212686e0b6e/userFiles-610eaa83-a6c2-4dbd-bc2a-7651a0fb0c5e/launch_ipykernel.py
    2019-05-09 12:26:28 INFO StandaloneAppClient$ClientEndpoint:54 - Connecting to master spark://10.100.11.9:7077...
    2019-05-09 12:26:28 INFO TransportClientFactory:267 - Successfully created connection to /10.100.11.9:7077 after 30 ms (0 ms spent in bootstraps)
    2019-05-09 12:26:29 INFO StandaloneSchedulerBackend:54 - Connected to Spark cluster with app ID app-20190509122628-0000
    2019-05-09 12:26:29 INFO Utils:54 - Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 34071.
    2019-05-09 12:26:29 INFO NettyBlockTransferService:54 - Server created on cto:34071
    2019-05-09 12:26:29 INFO BlockManager:54 - Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
    2019-05-09 12:26:29 INFO BlockManagerMaster:54 - Registering BlockManager BlockManagerId(driver, cto, 34071, None)
    2019-05-09 12:26:29 INFO BlockManagerMasterEndpoint:54 - Registering block manager cto:34071 with 366.3 MB RAM, BlockManagerId(driver, cto, 34071, None)
    2019-05-09 12:26:29 INFO BlockManagerMaster:54 - Registered BlockManager BlockManagerId(driver, cto, 34071, None)
    2019-05-09 12:26:29 INFO BlockManager:54 - Initialized BlockManager: BlockManagerId(driver, cto, 34071, None)
    2019-05-09 12:26:29 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3b373c2d{/metrics/json,null,AVAILABLE,@spark}
    2019-05-09 12:26:29 INFO StandaloneSchedulerBackend:54 - SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
    2019-05-09 12:26:29 INFO StandaloneAppClient$ClientEndpoint:54 - Executor added: app-20190509122628-0000/0 on worker-20190509121919-10.100.11.9-37547 (10.100.11.9:37547) with 1 core(s)
    2019-05-09 12:26:29 INFO StandaloneSchedulerBackend:54 - Granted executor ID app-20190509122628-0000/0 on hostPort 10.100.11.9:37547 with 1 core(s), 1024.0 MB RAM
    2019-05-09 12:26:30 INFO StandaloneAppClient$ClientEndpoint:54 - Executor updated: app-20190509122628-0000/0 is now RUNNING
    2019-05-09 12:26:53 INFO CoarseGrainedSchedulerBackend$DriverEndpoint:54 - Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.100.11.9:38660) with ID 0
    2019-05-09 12:26:55 INFO BlockManagerMasterEndpoint:54 - Registering block manager 10.100.11.9:45239 with 366.1 MB RAM, BlockManagerId(0, 10.100.11.9, 45239, None)
    [W 2019-05-09 12:27:25.807 EnterpriseGatewayApp] Timeout waiting for kernel_info reply from b59bcef8-774f-4e4b-a93d-01420d089b02
    [I 190509 12:27:25 web:2246] 101 GET /api/kernels/b59bcef8-774f-4e4b-a93d-01420d089b02/channels (10.100.11.9) 60005.94ms
    [E 190509 12:27:25 web:1788] Uncaught exception GET /api/kernels/b59bcef8-774f-4e4b-a93d-01420d089b02/channels (10.100.11.9)
    HTTPServerRequest(protocol='http', host='cto:8888', method='GET', uri='/api/kernels/b59bcef8-774f-4e4b-a93d-01420d089b02/channels', version='HTTP/1.1', remote_ip='10.100.11.9')
    Traceback (most recent call last):
    File "/opt/conda/lib/python3.6/site-packages/tornado/web.py", line 1699, in _execute
    result = await result
    File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 736, in run
    yielded = self.gen.throw(*exc_info) # type: ignore
    File "/opt/conda/lib/python3.6/site-packages/notebook/services/kernels/handlers.py", line 241, in get
    yield super(ZMQChannelsHandler, self).get(kernel_id=kernel_id)
    File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 729, in run
    value = future.result()
    File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 736, in run
    yielded = self.gen.throw(*exc_info) # type: ignore
    File "/opt/conda/lib/python3.6/site-packages/notebook/base/zmqhandlers.py", line 297, in get
    yield gen.maybe_future(res)
    File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 729, in run
    value = future.result()
    File "/opt/conda/lib/python3.6/site-packages/tornado/websocket.py", line 278, in get
    await self.ws_connection.accept_connection(self)
    File "/opt/conda/lib/python3.6/site-packages/tornado/websocket.py", line 881, in accept_connection
    await self._accept_connection(handler)
    File "/opt/conda/lib/python3.6/site-packages/tornado/websocket.py", line 964, in _accept_connection
    await self._receive_frame_loop()
    File "/opt/conda/lib/python3.6/site-packages/tornado/websocket.py", line 1118, in _receive_frame_loop
    await self._receive_frame()
    File "/opt/conda/lib/python3.6/site-packages/tornado/websocket.py", line 1130, in _receive_frame
    data = await self._read_bytes(2)
    File "/opt/conda/lib/python3.6/site-packages/tornado/websocket.py", line 1124, in _read_bytes
    data = await self.stream.read_bytes(n)
    File "/opt/conda/lib/python3.6/site-packages/tornado/iostream.py", line 436, in read_bytes
    future = self._start_read()
    File "/opt/conda/lib/python3.6/site-packages/tornado/iostream.py", line 797, in _start_read
    assert self._read_future is None, "Already reading"
    AssertionError: Already reading
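The `60005.94ms` entry above suggests the channels request timed out at roughly 60 seconds while the Spark driver was still registering executors. One hedged workaround, assuming the server-side `EG_KERNEL_LAUNCH_TIMEOUT` setting and the client-side `KG_REQUEST_TIMEOUT` setting behave as named here for this Enterprise Gateway version, is to raise both in lockstep so the client never gives up before the server does:

```shell
# Sketch, assuming these environment variable names apply to the versions
# in use; they are not confirmed by this log.
# Server side (where Enterprise Gateway runs): allow slow YARN/Spark
# kernels more time to start.
export EG_KERNEL_LAUNCH_TIMEOUT=120
# Client side (where the Notebook server runs): keep the kernel-start
# request alive at least as long as the server's launch timeout.
export KG_REQUEST_TIMEOUT=120
echo "launch=${EG_KERNEL_LAUNCH_TIMEOUT}s request=${KG_REQUEST_TIMEOUT}s"
```

Keeping the client timeout greater than or equal to the server's launch timeout avoids the failure mode in this traceback, where the request expires while the kernel is still coming up.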

@kevin-bates
Member Author

Moved this latest question to a new issue in Enterprise Gateway (jupyter-server/enterprise_gateway#660) where we can continue working on it.

@kevin-bates kevin-bates deleted the increase-timeout-defaults branch June 19, 2019 20:12
4 participants