Skip to content

Everything about how to deploy projects on the cloud, run ML workloads on the HPC cluster and on the cloud and the efficient configuration and management of related collaborative platforms [e.g. container orchestration].

Notifications You must be signed in to change notification settings

ashwinpn/Containers--ML-and-Cloud-Computing

Repository files navigation

Containers, ML and Cloud Computing

Ashwin Nalwade.

  • Containers have become very important to deep learning as it is critical to leverage them to ensure scalability, now that computer scientists are capable of developing ML applications which can run seamlessly on smartphones too.

cont

Technologies

  • Google Cloud Platform [GCP], Amazon Web Services [AWS], IBM Cloud.
  • Docker, Kubernetes, Vagrant, Slurm [For High Perfromance Cluster Computing - Workload Management].
  • Python, Gunicorn, Flask, PyTorch, Pandas, Tensorflow, Jupyter Notebook, CUDA.
  • CSS, HTML, JavaScript, Bash.

Docker

Troubleshooting

  • Docker Daemon Permission Denied :
    sudo usermod -aG docker ${USER}
    su -${USER}
  • Cannot connect to the docker daemon :
    systemctl start docker
  • Error processing Tar / No space left on device:
    docker rmi -f $(docker images | grep "<none>" | awk "{print \$3}")
    OR
    docker system prune, docker image prune
  • Conflict => already in use:
    Find CONTAINER_ID using docker ps -a
    
    Use docker start <CONTAINER_ID> instead of docker run
  • I initially used alpine for my dockerfile, but ran into a lot of issues when I started building the docker image. Specifically, I faced networking issues, memory errors and had to spend a lot of time with the pip install commands because they were not running smoothly. Thus, I decided against using alpine as I felt that the issues that came up due to its use defeated the purpose of using containers in the first place - which is to make our job easier while building environments and running code. Also, I felt that the time spent debugging and dealing with errors does not compensate for the small docker image size which alpine provides.

Cases

pytorch_mnist

  • Run
docker build -t asn/pyt .

docker run -i -t asn/pyt:latest
  • Check container status using
docker ps -a

Vagrant

  • Starting up a vagrant environment via a virtualbox. Run
vagrant up

vagrant ssh

[If it says "Machine already provisioned"]

vagrant provision
  • Configuring the vagrantfile [vagrantfiles are written in Ruby]. Here, we provision docker - which means that we can start working with docker as soon as we start up using vagrant up and vagrant ssh command.
Vagrant.configure("2") do |config|

  config.vm.box = "ubuntu/bionic64"

  config.vagrant.plugins = "vagrant-docker-compose"

  # install docker and docker-compose
  config.vm.provision :docker
  config.vm.provision :docker_compose

  config.vm.provider "virtualbox" do |vb|
    vb.customize ["modifyvm", :id, "--ioapic", "on"]
    vb.customize ["modifyvm", :id, "--memory", "8000"]
    vb.customize ["modifyvm", :id, "--cpus", "2"]
  end

end
  • Provisioning in vagrant using the shell.

Use the -y option with apt-get to enable non-interactive installation. That is, no more "X amount of data would be downloaded on installing Y. Do you still want to continue? [Y|N]"

# -*- mode: ruby -*-
# vi: set ft=ruby :

Vagrant.configure("2") do |config|

  config.vm.box = "ubuntu/bionic64"

  
   config.vm.provider "virtualbox" do |vb|
     # Display the VirtualBox GUI when booting the machine
     vb.gui = false

     # Customize the amount of memory on the VM:
     vb.memory = "2048"
   end
 
   config.vm.provision "shell", inline: <<-SHELL
     sudo apt-get -y upgrade apt-get update
     sudo apt-get -y install python3.8 python3.8-dev python3.8-distutils python3.8-venv
     sudo apt-get -y install git-all python-dev 
     curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
     python3.8 get-pip.py
     pip install future nose mock coverage numpy flake8  
     pip --no-cache-dir install torch torchvision
   SHELL
end

Troubleshooting

Kubernetes

kubectl create -f dply.yaml

kubectl get pods

kubectl get deployments

kubectl expose deployment sentiment-inference-service --type=LoadBalancer --port 80 --target-port 8080

For the training part I used a pod object, but I also experimented using a job object since in real-life production environments, we deal with batch jobs too. For inference, I used a deployment object because we can have ReplicaSets and can thus ensure that the service is reliable [multiple replicas] and can also be scaled [That is, our deployments can stay up, remain healthy, and keep running automatically without the requirement of some manual intervention].

Other advantages are:

  • We can rollout and rollback our services during deployment.
  • We can observe the status of each pod.
  • There is a facility for performing updates in a rolling manner.
  • Unlike a deployment, pods are not rescheduled in the case of termination of the pod or node failure.

We create a PersistentVolumeClaim and Kubernetes automatically provisions a persistent disk for us. We request a 2GiB disk and configure the access mode so that it can be mounted in a read-write-once manner by one node. When we create a PersistentVolumeClaim object, Kubernetes dynamically creates a corresponding PersistentVolume object.

This PersistentVolume is backed by an empty and new Compute Engine persistent disk. We thus use this disk in a pod by using the claim as a volume. Since we use PersistentVolumes, we can transfer data between pods [since they have access to the same storage space - depending on the access mode we have selected, either a single pod or multiple pods can have ReadWrite access at the same time], and along with git, docker, and Docker Hub, we were able to use the trained model file in the inference service

Running on GCP with Google Kubernetes Engine

k8s

Observations and Experimentations

  • I worked with kubernetes on my local machine (via minikube),

on the terminal present at the kubernetes website (Katacoda), and also on Google Kubernetes Engine. While working with Kubernetes on my local machine and also on the kubernetes website terminal (Katacoda), there were some issues with the LoadBalancer type. It was always showing the EXTERNAL-IP as <pending>

and the service cannot process any test image without the EXTERNAL-IP, so the progress had stalled. I tried the following to try and get the IP:

But, it did not work. Another method that I tried was to add an externalIPs specification in the .yaml file, but that did not work either. After trying some more commands and going through the kubernetes website, I learned that something related was mentioned here :

https://kubernetes.io/docs/tutorials/stateless-application/expose-external-ip-address/

And that’s why I decided to work on Google Kubernetes Engine - did not face any problems there (If the external IP address is still shown as <pending>, wait for a minute and enter the same command again, the issue is resolved).

  • Different types of services - ClusterIP, LoadBalancer, ExternalName, NodePort.

    1] ClusterIP is the default kubernetes service, and applications within the cluster can access it, but external applications cannot.

    2] With NodePort, we can direct any access requests to some particular port that has been opened on all nodes, but we can only have one service corresponding to one port.

    3] With LoadBalancer, we can directly expose the service. This is the one which I used.

    4] Also read about ingress, using which we can manage external access requests without the need for creating LoadBalancer / exposing all services present on a node.

  • Became familiar with secrets which are useful for dealing with containers (pulling containers) which are present in private Docker Hub repositories. Using secrets, we can also manage secure communication channels, keep and handle confidential data like ssh keys, passwords, and access tokens for OAuth.

Troubleshooting / Discussion

  • The difference betwen kubectl create and kubectl apply

kubectl create is an Imperative Management approach. Here we tell the Kubernetes API what we want to create/replace/delete, not how we want our K8s cluster to look like. On the other hand, kubectl apply is an Declarative Management approach, where changes that you may have applied to a live object (i.e. through scale) are "maintained" even if other changes are applied to the object. Also, When running in a CI script, we can potentially face trouble with the imperative approach as create will raise an error if the resource already exists. To put it simply,

apply makes incremental changes to an existing object.

create creates a whole new object (previously non-existing / deleted).

  • If pods are stuck in terminiating status

Delete pods forcefully using:

kubectl delete pod --grace-period=0 --force --namespace <NAMESPACE> <PODNAME>

Vagrant v/s Docker Comparison

Setting up the project to work in vagrant and docker for the first time took a while for both docker and vagrant, but for the subsequent times, starting up and running our project in docker was much more faster than vagrant - however, setting up the project workflow was better in vagrant.

Both vagrant and docker have dedicated communities if you require any help for troubleshooting any issues, but the docker documentation and forums are more user friendly and docker in general has many resources, tutorials, and guides available. For instance, one major disadvantage with vagrant is that there is a lesser scope for collaboration [even with Vagrant Cloud] - when working with docker, we can easily push our docker images to our Docker Hub repo, and your teammates/others can pull the container and run the application. Docker also has support for functions similar to git, where we can track different versions of a container, see the changes (diff) between the versions, and provides support for updating via commits.

While provisioning can potentially get messy in both vagrant and docker [e.g. pip install instructions - need to check conflicting dependencies and updates to eliminate probable errors], vagrant based provisioning can be a big headache because I personally had to spend about half an hour for debugging the provisioner shell in the vagrantfile, making sure that all packages had the suitable versions to enable working with each other.

Regarding vagrant, while a completely virtualized system provides more isolation, it also comes at the cost of more resources being allocated to it [it is heavier], and has minimum sharing capability. With docker we have less isolation [it has sufficient support to isolate processes from each other though], but the containers are lightweight (require fewer resources, and they share the same kernel).

We cannot run completely different operating systems in containers like we do in virtual machines, however it is possible to run different distros [distributions] of Linux since they do share the same kernel. Talking about booting times, when we load a container we don’t have to start a new copy of the operating system like in a virtual machine, so booting times are smaller. Unlike virtual machines, we also do not have to pre-allocate a considerable amount of memory and hardware resources when working with containers.

Summary

  • Docker containers are faster to start (and also to stop). One of the reasons behind this is probably the fact that docker leverages the existing host OS [which already has major system processes initialized], whereas with vagrant we have to load a whole vm image [and also initialize major system processes].

  • With regards to isolation, vagrant fares better. But with docker we can still get isolation at the user level since a docker container runs as an isolated process. Also, we can collaborate better by using Docker Hub.

  • Vagrant in general is heavy, as compared to docker which is lightweight because it includes only those libraries which are extremely necessary to the application as a part of the container image.

  • Docker will require fewer amount of resources than vagrant, since we only need to load the libraries which are necessary/essential for the application. Therefore, for a given value for computing capacity, we can have more applications which are running. On the other hand, vagrant has to load a whole OS in the memory, and thus it will always lead to more consumption of resources.

Case Study

We derive insights on performance by using the profile, cProfile, and pstats modules for profiling purposes.

  • With docker it took seconds to start up, whereas for vagrant it took minutes.

  • The total execution time on docker was 24m 38s 19ms, compared to the total execution time on vagrant, which was 27m 08s 34ms, and thus on docker it was smaller.

  • Average training time per epoch on docker was 2m 27s 78ms, while the average training time per epoch on vagrant was 2m 42s 79ms. Best of 3 average training time on docker was 2m 25s 41ms, and on vagrant it was 2m 41s 07ms.

  • Average memory usage per epoch on docker was peak memory: 450.03 MiB, whereas for vagrant it was peak memory: 448.63 MiB. The difference here was negligible.

  • Regarding line-by-line analysis for the application code: for docker we had 4616095 function calls (4612290 primitive calls) in 11.988 seconds, and for vagrant we had 4616095 function calls (4612290 primitive calls) in 16.632 seconds.

Machine Learning

Running GPU profilers on the cloud [Google Colab Pro]

  1. Change runtime type selecting GPU as hardware accelerator
  2. Git clone this repository:
!git clone https://github.com/ashwinpn/imgnet.git
  1. Change permissions:
!chmod 755 imgnet/INSTALL.sh
  1. Install cuda, nvcc, gcc, and g++:
!./imgnet/INSTALL.sh
  1. Add /usr/local/cuda/bin to PATH:
import os
os.environ['PATH'] += ':/usr/local/cuda/bin'
  1. Run [Example, imagenet]
!nvprof python main.py -a alexnet -b 8 --epochs 1 --lr 0.01 <Training and Validation folders parent directory>

Accessing GPU's on the NYU Cluster

  • A GPU request [Unless you have already obtained access to specifc ones]
[NYUNetID@log-0 ~]$ srun --gres=gpu:1 --pty /bin/bash

gpu

Activating environments, installing requirements, training on NYU Prince Cluster.

nyup

  • First, login to the bastion host, and then ssh into the cluster.
ssh NYUNetID@gw.hpc.nyu.edu
ssh NYUNetID@prince.hpc.nyu.edu
  • You can get acces to three filesystems: /home, /scratch, and /archive.

Scratch is a file system mounted on Prince that is connected to the compute nodes where we can upload files faster. Note that the content gets periodically flushed. /home and /scratch are separate filesystems in separate places, but you should use /scratch to store your files.

About

Everything about how to deploy projects on the cloud, run ML workloads on the HPC cluster and on the cloud and the efficient configuration and management of related collaborative platforms [e.g. container orchestration].

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published