CHTC Docker

WIP PAGE

This page documents how to build a Docker image for DeepProfiler so that it can run on any OS and instance.

CHTC resources

https://chtc.cs.wisc.edu/helloworld.shtml
https://htcondor.readthedocs.io/en/latest/users-manual/quick-start-guide.html
https://www.youtube.com/watch?v=p2X6s_7e51k&list=PLO7gMRGDPNumCuo3pCdRk23GDLNKFVjHn&index=3&ab_channel=CenterforHighThroughputComputing

To get access to the server, add this line to the submit file:

requirements = (Machine == "coba2000.chtc.wisc.edu")
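
For context, a complete submit file that pins a job to this machine might look roughly like the output of the sketch below. Only the `requirements` line and the image name (the one used in the `docker run` command on this page) come from these notes. The executable name `run_deepprofiler.sh`, the resource requests, and the log file names are hypothetical placeholders. `universe = docker`, `docker_image`, `request_gpus`, and `queue` are standard HTCondor submit commands.

```python
# Sketch: render a minimal docker-universe HTCondor submit description as text.
# Only the `requirements` line and the image name come from this page;
# the executable and the resource numbers are hypothetical placeholders.

def render_submit_file(machine, docker_image, executable, gpus=1):
    """Return the text of a minimal docker-universe submit file."""
    lines = [
        "universe = docker",
        f"docker_image = {docker_image}",
        f"executable = {executable}",
        "request_cpus = 1",        # placeholder value
        "request_memory = 16GB",   # placeholder value
        "request_disk = 20GB",     # placeholder value
        f"request_gpus = {gpus}",
        # Pin the job to one specific execute node.
        f'requirements = (Machine == "{machine}")',
        "log = job.log",
        "output = job.out",
        "error = job.err",
        "queue",
    ]
    return "\n".join(lines) + "\n"


submit_text = render_submit_file(
    machine="coba2000.chtc.wisc.edu",
    docker_image="michaelbornholdt/deep_profiler:v1",
    executable="run_deepprofiler.sh",  # hypothetical wrapper script
)
print(submit_text)
```

Generating the file from a function keeps the machine name and image tag in one place if several jobs are submitted with different settings.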

Connect to server (VPN: uwmadison.vpn.wisc.edu):

username=XXXXXXXXX
ssh ${username}@submit1.chtc.wisc.edu

Introduction pages:
https://hub.docker.com/r/tensorflow/tensorflow/
https://www.tensorflow.org/install/docker
https://hub.docker.com/r/tensorflow/tensorflow/tags?page=1&ordering=last_updated&name=1.15
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/dockerfiles/dockerfiles/gpu.Dockerfile

docker run -it --rm -v ~/dp:/dp michaelbornholdt/deep_profiler:v1

Other helpful sites:
https://github.com/CellProfiler/distribution/blob/master/docker/Dockerfile
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/dockerfiles/dockerfiles/gpu.Dockerfile
https://github.com/NVIDIA/nvidia-docker

Location of files on the CHTC: /local_group_storage/broad_data/michael

AWS

mkdir -p /tmp/.keras/models/
cp /local_group_storage/broad_data/michael/DeepProfiler/efficientnet-b0_weights_tf_dim_ordering_tf_kernels_autoaugment_notop.h5 /tmp/.keras/models/
cd /local_group_storage/broad_data/michael/DeepProfiler
python3 deepprofiler --root=/local_group_storage/broad_data/michael/training --config=config_train.json --metadata=top20_moa.csv --exp=728_train --sample=top20 train 2>&1 | tee log_728_train.txt
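
The same steps can be scripted. The sketch below copies the pretrained EfficientNet weights into the Keras cache directory and assembles the training command. The paths and flags are the ones used above; the helper names `stage_weights` and `build_train_command` are made up for illustration and are not part of DeepProfiler.

```python
import shutil
from pathlib import Path

ROOT = Path("/local_group_storage/broad_data/michael")
WEIGHTS = "efficientnet-b0_weights_tf_dim_ordering_tf_kernels_autoaugment_notop.h5"


def stage_weights(src_dir, cache_dir="/tmp/.keras/models"):
    """Copy the weights file into the Keras model cache (mkdir -p, then cp)."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy(Path(src_dir) / WEIGHTS, cache_dir))


def build_train_command(root, config, metadata, exp, sample):
    """Assemble the deepprofiler training command as an argument list."""
    return [
        "python3", "deepprofiler",
        f"--root={root}",
        f"--config={config}",
        f"--metadata={metadata}",
        f"--exp={exp}",
        f"--sample={sample}",
        "train",
    ]


cmd = build_train_command(
    root=ROOT / "training",
    config="config_train.json",
    metadata="top20_moa.csv",
    exp="728_train",
    sample="top20",
)
print(" ".join(cmd))
# On the server, one would then run something like:
#   stage_weights(ROOT / "DeepProfiler")
#   subprocess.run(cmd, cwd=ROOT / "DeepProfiler", check=True)
```

The actual copy and launch are left as comments so the sketch can be read and tried without access to the CHTC file system.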

tf-docker /local_group_storage/broad_data/michael/DeepProfiler > ls /tmp/.keras/models/
efficientnet-b0_weights_tf_dim_ordering_tf_kernels_autoaugment_notop.h5

Conda problems:

tf-docker /local_group_storage/broad_data/michael > python3 test_tf.py 
WARNING:tensorflow:From test_tf.py:35: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING:tensorflow:From test_tf.py:45: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

WARNING:tensorflow:From test_tf.py:45: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

2021-07-15 18:37:58.715103: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2021-07-15 18:37:58.736694: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3200140000 Hz
2021-07-15 18:37:58.743214: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x50858a0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-07-15 18:37:58.743281: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2021-07-15 18:37:58.748203: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-07-15 18:38:00.263599: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x50f4480 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2021-07-15 18:38:00.263680: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): A100-SXM4-40GB, Compute Capability 8.0
2021-07-15 18:38:00.263701: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (1): A100-SXM4-40GB, Compute Capability 8.0
2021-07-15 18:38:00.263720: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (2): A100-SXM4-40GB, Compute Capability 8.0
2021-07-15 18:38:00.263739: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (3): A100-SXM4-40GB, Compute Capability 8.0
2021-07-15 18:38:00.282820: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties: 
name: A100-SXM4-40GB major: 8 minor: 0 memoryClockRate(GHz): 1.41
pciBusID: 0000:01:00.0
2021-07-15 18:38:00.284834: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 1 with properties: 
name: A100-SXM4-40GB major: 8 minor: 0 memoryClockRate(GHz): 1.41
pciBusID: 0000:41:00.0
2021-07-15 18:38:00.286770: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 2 with properties: 
name: A100-SXM4-40GB major: 8 minor: 0 memoryClockRate(GHz): 1.41
pciBusID: 0000:81:00.0
2021-07-15 18:38:00.288689: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 3 with properties: 
name: A100-SXM4-40GB major: 8 minor: 0 memoryClockRate(GHz): 1.41
pciBusID: 0000:c1:00.0
2021-07-15 18:38:00.289076: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-07-15 18:38:00.290952: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-07-15 18:38:00.294976: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-07-15 18:38:00.295684: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-07-15 18:38:00.297998: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-07-15 18:38:00.299560: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-07-15 18:38:00.307330: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-07-15 18:38:00.322826: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1767] Adding visible gpu devices: 0, 1, 2, 3
2021-07-15 18:38:00.322977: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-07-15 18:38:00.330808: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1180] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-07-15 18:38:00.330868: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1186]      0 1 2 3 
2021-07-15 18:38:00.330924: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 0:   N Y Y Y 
2021-07-15 18:38:00.330972: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 1:   Y N Y Y 
2021-07-15 18:38:00.331020: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 2:   Y Y N Y 
2021-07-15 18:38:00.331067: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 3:   Y Y Y N 
2021-07-15 18:38:00.340519: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 38113 MB memory) -> physical GPU (device: 0, name: A100-SXM4-40GB, pci bus id: 0000:01:00.0, compute capability: 8.0)
2021-07-15 18:38:00.344586: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 38113 MB memory) -> physical GPU (device: 1, name: A100-SXM4-40GB, pci bus id: 0000:41:00.0, compute capability: 8.0)
2021-07-15 18:38:00.347548: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 38113 MB memory) -> physical GPU (device: 2, name: A100-SXM4-40GB, pci bus id: 0000:81:00.0, compute capability: 8.0)
2021-07-15 18:38:00.350341: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 38113 MB memory) -> physical GPU (device: 3, name: A100-SXM4-40GB, pci bus id: 0000:c1:00.0, compute capability: 8.0)
Device mapping:
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
/job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device
/job:localhost/replica:0/task:0/device:XLA_GPU:1 -> device: XLA_GPU device
/job:localhost/replica:0/task:0/device:XLA_GPU:2 -> device: XLA_GPU device
/job:localhost/replica:0/task:0/device:XLA_GPU:3 -> device: XLA_GPU device
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: A100-SXM4-40GB, pci bus id: 0000:01:00.0, compute capability: 8.0
/job:localhost/replica:0/task:0/device:GPU:1 -> device: 1, name: A100-SXM4-40GB, pci bus id: 0000:41:00.0, compute capability: 8.0
/job:localhost/replica:0/task:0/device:GPU:2 -> device: 2, name: A100-SXM4-40GB, pci bus id: 0000:81:00.0, compute capability: 8.0
/job:localhost/replica:0/task:0/device:GPU:3 -> device: 3, name: A100-SXM4-40GB, pci bus id: 0000:c1:00.0, compute capability: 8.0
2021-07-15 18:38:00.366591: I tensorflow/core/common_runtime/direct_session.cc:359] Device mapping:
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
/job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device
/job:localhost/replica:0/task:0/device:XLA_GPU:1 -> device: XLA_GPU device
/job:localhost/replica:0/task:0/device:XLA_GPU:2 -> device: XLA_GPU device
/job:localhost/replica:0/task:0/device:XLA_GPU:3 -> device: XLA_GPU device
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: A100-SXM4-40GB, pci bus id: 0000:01:00.0, compute capability: 8.0
/job:localhost/replica:0/task:0/device:GPU:1 -> device: 1, name: A100-SXM4-40GB, pci bus id: 0000:41:00.0, compute capability: 8.0
/job:localhost/replica:0/task:0/device:GPU:2 -> device: 2, name: A100-SXM4-40GB, pci bus id: 0000:81:00.0, compute capability: 8.0
/job:localhost/replica:0/task:0/device:GPU:3 -> device: 3, name: A100-SXM4-40GB, pci bus id: 0000:c1:00.0, compute capability: 8.0

MatMul: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
2021-07-15 18:38:00.369597: I tensorflow/core/common_runtime/placer.cc:54] MatMul: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
MatMul_1: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
2021-07-15 18:38:00.369634: I tensorflow/core/common_runtime/placer.cc:54] MatMul_1: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
MatMul_2: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
2021-07-15 18:38:00.369663: I tensorflow/core/common_runtime/placer.cc:54] MatMul_2: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
MatMul_3: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
2021-07-15 18:38:00.369692: I tensorflow/core/common_runtime/placer.cc:54] MatMul_3: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
MatMul_4: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
2021-07-15 18:38:00.369720: I tensorflow/core/common_runtime/placer.cc:54] MatMul_4: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
MatMul_5: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
2021-07-15 18:38:00.369751: I tensorflow/core/common_runtime/placer.cc:54] MatMul_5: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
MatMul_6: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
2021-07-15 18:38:00.369792: I tensorflow/core/common_runtime/placer.cc:54] MatMul_6: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
MatMul_7: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
2021-07-15 18:38:00.369821: I tensorflow/core/common_runtime/placer.cc:54] MatMul_7: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
MatMul_8: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
2021-07-15 18:38:00.369849: I tensorflow/core/common_runtime/placer.cc:54] MatMul_8: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
MatMul_9: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
2021-07-15 18:38:00.369877: I tensorflow/core/common_runtime/placer.cc:54] MatMul_9: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
MatMul_10: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
2021-07-15 18:38:00.369906: I tensorflow/core/common_runtime/placer.cc:54] MatMul_10: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
MatMul_11: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
2021-07-15 18:38:00.369935: I tensorflow/core/common_runtime/placer.cc:54] MatMul_11: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
MatMul_12: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
2021-07-15 18:38:00.369965: I tensorflow/core/common_runtime/placer.cc:54] MatMul_12: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
MatMul_13: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
2021-07-15 18:38:00.369993: I tensorflow/core/common_runtime/placer.cc:54] MatMul_13: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
MatMul_14: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
2021-07-15 18:38:00.370021: I tensorflow/core/common_runtime/placer.cc:54] MatMul_14: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
MatMul_15: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
2021-07-15 18:38:00.370049: I tensorflow/core/common_runtime/placer.cc:54] MatMul_15: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
MatMul_16: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
2021-07-15 18:38:00.370078: I tensorflow/core/common_runtime/placer.cc:54] MatMul_16: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
MatMul_17: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
2021-07-15 18:38:00.370106: I tensorflow/core/common_runtime/placer.cc:54] MatMul_17: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
MatMul_18: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
2021-07-15 18:38:00.370144: I tensorflow/core/common_runtime/placer.cc:54] MatMul_18: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
MatMul_19: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
2021-07-15 18:38:00.370176: I tensorflow/core/common_runtime/placer.cc:54] MatMul_19: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
AddN: (AddN): /job:localhost/replica:0/task:0/device:CPU:0
2021-07-15 18:38:00.370208: I tensorflow/core/common_runtime/placer.cc:54] AddN: (AddN): /job:localhost/replica:0/task:0/device:CPU:0
Placeholder: (Placeholder): /job:localhost/replica:0/task:0/device:GPU:0
2021-07-15 18:38:00.370241: I tensorflow/core/common_runtime/placer.cc:54] Placeholder: (Placeholder): /job:localhost/replica:0/task:0/device:GPU:0
Placeholder_1: (Placeholder): /job:localhost/replica:0/task:0/device:GPU:0
2021-07-15 18:38:00.370269: I tensorflow/core/common_runtime/placer.cc:54] Placeholder_1: (Placeholder): /job:localhost/replica:0/task:0/device:GPU:0
2021-07-15 18:38:01.079451: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
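
A log this long is easier to check with a small script. The sketch below counts, per device, the op placements reported by the placer. The regular expression is written against the `placer.cc` lines shown above and is an assumption about their format, not something TensorFlow provides. The embedded sample is three lines copied from the log.

```python
import re
from collections import Counter

# Matches placer lines such as
#   "... placer.cc:54] MatMul: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0"
PLACER_RE = re.compile(r"placer\.cc:\d+\] (\S+): \((\w+)\): \S*/device:(\w+:\d+)")


def count_placements(log_text):
    """Count how many ops the placer assigned to each device."""
    counts = Counter()
    for line in log_text.splitlines():
        match = PLACER_RE.search(line)
        if match:
            counts[match.group(3)] += 1
    return counts


sample = """\
2021-07-15 18:38:00.369597: I tensorflow/core/common_runtime/placer.cc:54] MatMul: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
2021-07-15 18:38:00.370208: I tensorflow/core/common_runtime/placer.cc:54] AddN: (AddN): /job:localhost/replica:0/task:0/device:CPU:0
2021-07-15 18:38:00.370241: I tensorflow/core/common_runtime/placer.cc:54] Placeholder: (Placeholder): /job:localhost/replica:0/task:0/device:GPU:0
"""
counts = count_placements(sample)
print(counts)
```

Run over the full log above, this would report 22 ops on GPU:0 (20 MatMul and 2 Placeholder) and 1 op (AddN) on CPU:0, which confirms that the container sees and uses the GPU.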

Speeds

19:53 hours run: 1 714 988 cells sampled (≈ 86k cells / hour)
22:06 hours run: 1 952 557 cells sampled
1+17:48 run (1 day, 17:48 hours): 3 591 000 cells sampled

Sampling speed on CHTC is around 100 000 cells per hour, so a full set of 8 million cells takes about 80 hours.
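
The estimate is a simple division. As a sketch, using the round numbers from the sentence above and, for comparison, the rate measured in the 19:53 hours run (the function name is made up for illustration):

```python
def hours_to_sample(total_cells, cells_per_hour):
    """Estimated wall-clock hours to sample `total_cells` at a steady rate."""
    return total_cells / cells_per_hour


# 8 million cells at roughly 100 000 cells per hour.
print(hours_to_sample(8_000_000, 100_000))  # 80.0

# Rate measured in the first run: 1 714 988 cells in 19 h 53 min.
measured_rate = 1_714_988 / (19 + 53 / 60)
print(round(measured_rate))
print(round(hours_to_sample(8_000_000, measured_rate), 1))
```

The measured rate is somewhat below the round figure, so the full set would likely take a little longer than 80 hours.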

Limits of GPU use.

Using 811_index with 1100 test sites, I was able to run with a validation batch size of 40 and a learning batch size of 64.

A learning batch size of 80 was too big, as was a validation batch size of 64.
