Notebook failed to connect with enterprise gateway kernel using websocket. Exception in KernelGatewayWSClient._connection_done #660

Closed
kevin-bates opened this issue May 9, 2019 · 3 comments

@kevin-bates
Member

Creating this issue on behalf of @ArindamHalder7

@kevin-bates
This is in the context of the above issue: https://discourse.jupyter.org/t/notebook-failed-to-connect-with-enterprice-gateway-kernel-using-websocket-exception-in-kernelgatewaywsclient-connection-done/1025/2
I am not able to post messages in the Jupyter community, so I am providing the EG log here. I will also try increasing the KG_REQUEST_TIMEOUT.
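
Something like the following (the 120-second value is just an illustration; it is set on the Notebook/client side before restarting the server):

  export KG_REQUEST_TIMEOUT=120   # seconds the client waits on gateway/kernel requests
  jupyter notebook                # restart the Notebook server so the new value takes effect
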
Here is the log.

cto@cto:~/jupyter_enterprise_gateway$ jupyter enterprisegateway --ip=0.0.0.0 --port_retries=0

[I 2019-05-09 12:21:06.172 EnterpriseGatewayApp] Jupyter Enterprise Gateway 1.2.0 is available at http://0.0.0.0:8888
[W 190509 12:21:22 web:1782] 404 DELETE /api/kernels/9fd2feb9-f9df-48a9-a775-f6c82044c1a6 (10.100.11.9): Kernel does not exist: 9fd2feb9-f9df-48a9-a775-f6c82044c1a6
[W 190509 12:21:22 web:2246] 404 DELETE /api/kernels/9fd2feb9-f9df-48a9-a775-f6c82044c1a6 (10.100.11.9) 1.88ms
[I 190509 12:25:53 web:2246] 200 GET /api/kernelspecs (10.100.11.9) 860.85ms
[I 190509 12:26:23 web:2246] 200 GET /api/kernelspecs (10.100.11.9) 3.40ms
[I 2019-05-09 12:26:25.054 EnterpriseGatewayApp] Local kernel launched on '10.100.10.55', pid: 31136, pgid: 31136, KernelID: b59bcef8-774f-4e4b-a93d-01420d089b02, cmd: '['/usr/local/share/jupyter/kernels/spark_python_yarn_client/bin/run.sh', '--RemoteProcessProxy.kernel-id', '{kernel_id}', '--RemoteProcessProxy.response-address', '{response_address}', '--RemoteProcessProxy.port-range', '0..0', '--RemoteProcessProxy.spark-context-initialization-mode', 'lazy']'

Starting IPython kernel for Spark in Yarn Client mode on behalf of user cto

eval exec '' /home/cto/spark-2.3.1-bin-hadoop2.7/bin/spark-submit '--master spark://10.100.11.9:7077 --name ${KERNEL_ID:-ERROR__NO__KERNEL_ID} --conf spark.yarn.submit.waitAppCompletion=false' /usr/local/share/jupyter/kernels/spark_python_yarn_client/scripts/launch_ipykernel.py '' --RemoteProcessProxy.kernel-id '{kernel_id}' --RemoteProcessProxy.response-address '{response_address}' --RemoteProcessProxy.port-range 0..0 --RemoteProcessProxy.spark-context-initialization-mode lazy
++ exec /home/cto/spark-2.3.1-bin-hadoop2.7/bin/spark-submit --master spark://10.100.11.9:7077 --name b59bcef8-774f-4e4b-a93d-01420d089b02 --conf spark.yarn.submit.waitAppCompletion=false /usr/local/share/jupyter/kernels/spark_python_yarn_client/scripts/launch_ipykernel.py --RemoteProcessProxy.kernel-id '{kernel_id}' --RemoteProcessProxy.response-address '{response_address}' --RemoteProcessProxy.port-range 0..0 --RemoteProcessProxy.spark-context-initialization-mode lazy
[I 2019-05-09 12:26:25.058 EnterpriseGatewayApp] Kernel started: b59bcef8-774f-4e4b-a93d-01420d089b02
[I 190509 12:26:25 web:2246] 201 POST /api/kernels (10.100.11.9) 25.03ms
[W 190509 12:26:25 web:2246] 404 GET /kernelspecs/spark_python_yarn_client/logo-64x64.png (10.100.11.9) 1.56ms
[I 190509 12:26:25 web:2246] 200 GET /api/kernels/b59bcef8-774f-4e4b-a93d-01420d089b02 (10.100.11.9) 0.68ms
[W 2019-05-09 12:26:25.805 EnterpriseGatewayApp] No session ID specified
2019-05-09 12:26:26 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[D 2019-05-09 12:26:27,186.186 launch_ipykernel] Using connection file '/tmp/kernel-{kernel_id}_jujb491y.json'.
[E 2019-05-09 12:26:27,187.187 launch_ipykernel] Invalid format for response address '{response_address}'. Assuming 'pull' mode...
2019-05-09 12:26:27 INFO SparkContext:54 - Running Spark version 2.3.1
2019-05-09 12:26:27 INFO SparkContext:54 - Submitted application: b59bcef8-774f-4e4b-a93d-01420d089b02
2019-05-09 12:26:27 INFO SecurityManager:54 - Changing view acls to: cto
2019-05-09 12:26:27 INFO SecurityManager:54 - Changing modify acls to: cto
2019-05-09 12:26:27 INFO SecurityManager:54 - Changing view acls groups to:
2019-05-09 12:26:27 INFO SecurityManager:54 - Changing modify acls groups to:
2019-05-09 12:26:27 INFO SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(cto); groups with view permissions: Set(); users with modify permissions: Set(cto); groups with modify permissions: Set()
2019-05-09 12:26:27 INFO Utils:54 - Successfully started service 'sparkDriver' on port 35453.
2019-05-09 12:26:27 INFO SparkEnv:54 - Registering MapOutputTracker
2019-05-09 12:26:27 INFO SparkEnv:54 - Registering BlockManagerMaster
2019-05-09 12:26:27 INFO BlockManagerMasterEndpoint:54 - Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
2019-05-09 12:26:27 INFO BlockManagerMasterEndpoint:54 - BlockManagerMasterEndpoint up
2019-05-09 12:26:27 INFO DiskBlockManager:54 - Created local directory at /tmp/blockmgr-570e9b16-f17c-43ed-933f-1826bb3dc81e
2019-05-09 12:26:27 INFO MemoryStore:54 - MemoryStore started with capacity 366.3 MB
2019-05-09 12:26:27 INFO SparkEnv:54 - Registering OutputCommitCoordinator
2019-05-09 12:26:27 INFO log:192 - Logging initialized @2619ms
2019-05-09 12:26:27 INFO Server:346 - jetty-9.3.z-SNAPSHOT
2019-05-09 12:26:27 INFO Server:414 - Started @2682ms
2019-05-09 12:26:27 WARN Utils:66 - Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
2019-05-09 12:26:27 INFO AbstractConnector:278 - Started ServerConnector@d50c159{HTTP/1.1,[http/1.1]}{0.0.0.0:4041}
2019-05-09 12:26:27 INFO Utils:54 - Successfully started service 'SparkUI' on port 4041.
2019-05-09 12:26:27 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@610d26df{/jobs,null,AVAILABLE,@spark}
2019-05-09 12:26:27 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@29e480e1{/jobs/json,null,AVAILABLE,@spark}
2019-05-09 12:26:27 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3b58372{/jobs/job,null,AVAILABLE,@spark}
2019-05-09 12:26:27 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@7d5d58f0{/jobs/job/json,null,AVAILABLE,@spark}
2019-05-09 12:26:27 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@70b32c32{/stages,null,AVAILABLE,@spark}
2019-05-09 12:26:27 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@5f328f45{/stages/json,null,AVAILABLE,@spark}
2019-05-09 12:26:27 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@2c140304{/stages/stage,null,AVAILABLE,@spark}
2019-05-09 12:26:27 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1921c8f5{/stages/stage/json,null,AVAILABLE,@spark}
2019-05-09 12:26:27 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@15a1e4f6{/stages/pool,null,AVAILABLE,@spark}
2019-05-09 12:26:27 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@64ec7ed8{/stages/pool/json,null,AVAILABLE,@spark}
2019-05-09 12:26:27 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@27e78fa3{/storage,null,AVAILABLE,@spark}
2019-05-09 12:26:27 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@30968dca{/storage/json,null,AVAILABLE,@spark}
2019-05-09 12:26:27 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@2a8a597a{/storage/rdd,null,AVAILABLE,@spark}
2019-05-09 12:26:27 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@2d97e84d{/storage/rdd/json,null,AVAILABLE,@spark}
2019-05-09 12:26:27 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3616f0cb{/environment,null,AVAILABLE,@spark}
2019-05-09 12:26:27 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@326ac31e{/environment/json,null,AVAILABLE,@spark}
2019-05-09 12:26:27 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4ad30da0{/executors,null,AVAILABLE,@spark}
2019-05-09 12:26:27 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@58144137{/executors/json,null,AVAILABLE,@spark}
2019-05-09 12:26:27 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@59856256{/executors/threadDump,null,AVAILABLE,@spark}
2019-05-09 12:26:27 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3d93384b{/executors/threadDump/json,null,AVAILABLE,@spark}
2019-05-09 12:26:27 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6f7f9c54{/static,null,AVAILABLE,@spark}
2019-05-09 12:26:27 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3b1ffd2b{/,null,AVAILABLE,@spark}
2019-05-09 12:26:27 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@497bd345{/api,null,AVAILABLE,@spark}
2019-05-09 12:26:27 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@257f10c9{/jobs/job/kill,null,AVAILABLE,@spark}
2019-05-09 12:26:27 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6d6f1501{/stages/stage/kill,null,AVAILABLE,@spark}
2019-05-09 12:26:27 INFO SparkUI:54 - Bound SparkUI to 0.0.0.0, and started at http://cto:4041
2019-05-09 12:26:28 INFO SparkContext:54 - Added file file:/usr/local/share/jupyter/kernels/spark_python_yarn_client/scripts/launch_ipykernel.py at spark://cto:35453/files/launch_ipykernel.py with timestamp 1557384988220
2019-05-09 12:26:28 INFO Utils:54 - Copying /usr/local/share/jupyter/kernels/spark_python_yarn_client/scripts/launch_ipykernel.py to /tmp/spark-3bdad42c-8410-4139-95da-1212686e0b6e/userFiles-610eaa83-a6c2-4dbd-bc2a-7651a0fb0c5e/launch_ipykernel.py
2019-05-09 12:26:28 INFO StandaloneAppClient$ClientEndpoint:54 - Connecting to master spark://10.100.11.9:7077...
2019-05-09 12:26:28 INFO TransportClientFactory:267 - Successfully created connection to /10.100.11.9:7077 after 30 ms (0 ms spent in bootstraps)
2019-05-09 12:26:29 INFO StandaloneSchedulerBackend:54 - Connected to Spark cluster with app ID app-20190509122628-0000
2019-05-09 12:26:29 INFO Utils:54 - Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 34071.
2019-05-09 12:26:29 INFO NettyBlockTransferService:54 - Server created on cto:34071
2019-05-09 12:26:29 INFO BlockManager:54 - Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
2019-05-09 12:26:29 INFO BlockManagerMaster:54 - Registering BlockManager BlockManagerId(driver, cto, 34071, None)
2019-05-09 12:26:29 INFO BlockManagerMasterEndpoint:54 - Registering block manager cto:34071 with 366.3 MB RAM, BlockManagerId(driver, cto, 34071, None)
2019-05-09 12:26:29 INFO BlockManagerMaster:54 - Registered BlockManager BlockManagerId(driver, cto, 34071, None)
2019-05-09 12:26:29 INFO BlockManager:54 - Initialized BlockManager: BlockManagerId(driver, cto, 34071, None)
2019-05-09 12:26:29 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3b373c2d{/metrics/json,null,AVAILABLE,@spark}
2019-05-09 12:26:29 INFO StandaloneSchedulerBackend:54 - SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
2019-05-09 12:26:29 INFO StandaloneAppClient$ClientEndpoint:54 - Executor added: app-20190509122628-0000/0 on worker-20190509121919-10.100.11.9-37547 (10.100.11.9:37547) with 1 core(s)
2019-05-09 12:26:29 INFO StandaloneSchedulerBackend:54 - Granted executor ID app-20190509122628-0000/0 on hostPort 10.100.11.9:37547 with 1 core(s), 1024.0 MB RAM
2019-05-09 12:26:30 INFO StandaloneAppClient$ClientEndpoint:54 - Executor updated: app-20190509122628-0000/0 is now RUNNING
2019-05-09 12:26:53 INFO CoarseGrainedSchedulerBackend$DriverEndpoint:54 - Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.100.11.9:38660) with ID 0
2019-05-09 12:26:55 INFO BlockManagerMasterEndpoint:54 - Registering block manager 10.100.11.9:45239 with 366.1 MB RAM, BlockManagerId(0, 10.100.11.9, 45239, None)
[W 2019-05-09 12:27:25.807 EnterpriseGatewayApp] Timeout waiting for kernel_info reply from b59bcef8-774f-4e4b-a93d-01420d089b02
[I 190509 12:27:25 web:2246] 101 GET /api/kernels/b59bcef8-774f-4e4b-a93d-01420d089b02/channels (10.100.11.9) 60005.94ms
[E 190509 12:27:25 web:1788] Uncaught exception GET /api/kernels/b59bcef8-774f-4e4b-a93d-01420d089b02/channels (10.100.11.9)
HTTPServerRequest(protocol='http', host='cto:8888', method='GET', uri='/api/kernels/b59bcef8-774f-4e4b-a93d-01420d089b02/channels', version='HTTP/1.1', remote_ip='10.100.11.9')
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/site-packages/tornado/web.py", line 1699, in _execute
result = await result
File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 736, in run
yielded = self.gen.throw(*exc_info) # type: ignore
File "/opt/conda/lib/python3.6/site-packages/notebook/services/kernels/handlers.py", line 241, in get
yield super(ZMQChannelsHandler, self).get(kernel_id=kernel_id)
File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 729, in run
value = future.result()
File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 736, in run
yielded = self.gen.throw(*exc_info) # type: ignore
File "/opt/conda/lib/python3.6/site-packages/notebook/base/zmqhandlers.py", line 297, in get
yield gen.maybe_future(res)
File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 729, in run
value = future.result()
File "/opt/conda/lib/python3.6/site-packages/tornado/websocket.py", line 278, in get
await self.ws_connection.accept_connection(self)
File "/opt/conda/lib/python3.6/site-packages/tornado/websocket.py", line 881, in accept_connection
await self._accept_connection(handler)
File "/opt/conda/lib/python3.6/site-packages/tornado/websocket.py", line 964, in _accept_connection
await self._receive_frame_loop()
File "/opt/conda/lib/python3.6/site-packages/tornado/websocket.py", line 1118, in _receive_frame_loop
await self._receive_frame()
File "/opt/conda/lib/python3.6/site-packages/tornado/websocket.py", line 1130, in _receive_frame
data = await self._read_bytes(2)
File "/opt/conda/lib/python3.6/site-packages/tornado/websocket.py", line 1124, in _read_bytes
data = await self.stream.read_bytes(n)
File "/opt/conda/lib/python3.6/site-packages/tornado/iostream.py", line 436, in read_bytes
future = self._start_read()
File "/opt/conda/lib/python3.6/site-packages/tornado/iostream.py", line 797, in _start_read
assert self._read_future is None, "Already reading"
AssertionError: Already reading
@kevin-bates
Member Author

Hi @ArindamHalder7

First off, I see you're using 2.x kernelspec files with 1.2 EG. The kernelspec tar files must be within the same major release number (at a minimum). I was going to suggest you download the kernelspecs tar from the 1.2.0 release page, but noticed the kernelspecs tar file isn't present for that release! Sorry about that. Looks like the original upload must have failed. I have built and posted it to the 1.2.0 release page.
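
For reference, once the kernelspecs tar file has been downloaded from the 1.2.0 release page, installing it boils down to roughly the following (the file name below is an assumption based on the usual naming convention, so take the exact name from the release page; the target directory matches the paths in your log):

  # run on the EG host after downloading the tarball from the release page
  sudo tar -xvzf jupyter_enterprise_gateway_kernelspecs-1.2.0.tar.gz -C /usr/local/share/jupyter/kernels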

Second, I see you're using Spark standalone. It looks like you've made the necessary adjustments to the kernelspecs you are using. However, because those are using parameters that 1.2 EG doesn't know about, you'll need to make the same adjustments to the newly pulled kernelspecs.

Finally, I would strongly recommend starting EG with --debug and redirecting the output to a log file until you get your kernels working. We provide a fairly rich set of debug logging that is useful to both you and anyone attempting to troubleshoot issues. Since EG is more service-like, we typically start things from a script. Here's an ansible template that provides the basis for such a script. You'll need to remove the jinja-style attributes, but it gives a good idea of what the script should contain.
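
For illustration only (the log path below is a placeholder, and this is a bare-bones stand-in for the template rather than the template itself):

  #!/bin/bash
  # Minimal EG start script: enable debug logging and capture everything to a file.
  LOG_FILE=/var/log/enterprise_gateway.log
  jupyter enterprisegateway --ip=0.0.0.0 --port_retries=0 --debug > "$LOG_FILE" 2>&1 &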

Do you intend to spread your kernels across your Spark cluster? If so, you'll need to configure password-less ssh across the nodes and set either the EG_REMOTE_HOSTS environment variable or --EnterpriseGatewayApp.remote_hosts command-line option. EG will then perform a simple round-robin across that host set.
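
For example (host names below are placeholders), either of these is equivalent:

  export EG_REMOTE_HOSTS=host1,host2,host3
  jupyter enterprisegateway --ip=0.0.0.0 --port_retries=0 --debug

  # or via the command line:
  jupyter enterprisegateway --ip=0.0.0.0 --port_retries=0 --debug \
    --EnterpriseGatewayApp.remote_hosts="['host1','host2','host3']"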

We recommend moving to YARN and using cluster mode at some point. The YARN resource manager will then decide where within the cluster the kernel should run. In addition, no password-less ssh or remote-hosts configuration is required, so it's much easier to get configured.

Since you're using standalone mode, it would be great if you could contribute the necessary documentation changes for that, as people have asked about it previously.

Thank you for your interest in EG!

@ArindamHalder7

Hi @kevin-bates

Thanks for your suggestions and for sharing the EG 1.2 kernelspecs.
I already made the necessary changes, e.g., password-less ssh, etc. I am using Spark standalone for an initial POC, but I will move to YARN. EG is working now :). Debug mode helped me a lot.

I found some changes in the kernelspec as well.
Yes, I plan to spread kernels across the Spark cluster with Kerberos authentication, so I will be using a YARN cluster in a later part of my experiments.
My plan now is as follows:

  1. Deploy with EG version 2.0. I saw the documentation link has been updated; I will download it.
  2. Currently running with a single kernel; I will create a cluster of kernels.
  3. Deploy Hadoop and Spark with YARN, including Kerberos authentication.
  4. Connect the YARN cluster from EG 2.0.
  5. Scale the Notebook UI with Docker Swarm and connect it to EG 2.0. The kernels will then connect to YARN and Hadoop (see the sketch below).
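
A rough sketch of step 5's Notebook-to-EG connection (the gateway address is the one from the log above and is only a placeholder; this assumes Notebook >= 6.0, whose built-in gateway support pairs with EG 2.0):

  # each Notebook replica in the swarm simply points at the gateway
  jupyter notebook --gateway-url=http://10.100.11.9:8888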

Thanks for your help.

@kevin-bates
Member Author

Excellent! This sounds like an awesome approach based on your configuration.

I'm going to go ahead and close this issue.

Keep in touch!
