This project studies the impact of distributed training on deep learning models, with the goal of determining whether a predictive model can be built to estimate epoch speed and time to accuracy. We selected image classification as the application and CNN models for our experiments.
We collected training logs for around 75 configurations, varying the model type, batch size, GPU type, number of GPUs, and number of data loaders. Once the predictive model (also referred to as the recommender model) is trained and its test error is sufficiently low, we aim to make it available to end users by hosting it on a Kubernetes cluster as a web application.
Finally, this can grow into a prescriptive solution that suggests the least-cost training configuration to the user before they invest in hardware, as sketched below.
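To make this concrete, below is a minimal sketch of how such a recommender could be trained from the collected logs and then used prescriptively. The log file name, the column names, and the choice of a random-forest regressor are illustrative assumptions, not the project's actual pipeline.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical log file: one row per training configuration.
logs = pd.read_csv("training_logs.csv")

categorical = ["model_type", "gpu_type"]             # assumed column names
numeric = ["batch_size", "num_gpus", "num_workers"]  # assumed column names
X = logs[categorical + numeric]
y = logs["epoch_time"]                               # assumed target column

# One-hot encode the categorical features, pass numeric ones through.
model = Pipeline([
    ("encode", ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
        remainder="passthrough")),
    ("regress", RandomForestRegressor(n_estimators=200, random_state=0)),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
model.fit(X_train, y_train)
pred = model.predict(X_test)
print("MAE: ", mean_absolute_error(y_test, pred))
print("RMSE:", mean_squared_error(y_test, pred) ** 0.5)

# Prescriptive use: score a grid of candidate configurations and surface
# the ones with the lowest predicted epoch time (grid values are made up).
candidates = pd.DataFrame([
    {"model_type": "resnet18", "gpu_type": "V100",
     "batch_size": b, "num_gpus": g, "num_workers": w}
    for b in (64, 128, 256) for g in (1, 2, 4) for w in (2, 4)
])
candidates["predicted_epoch_time"] = model.predict(candidates)
print(candidates.sort_values("predicted_epoch_time").head())
```

Hosted behind a simple web endpoint on the Kubernetes cluster, the same `predict` call would back the recommender web application.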
trainer_pytorch.py [-h] [-b BATCH_SIZE] [-c CONFIGURATIONS]
[--configuration-file CONFIGURATION_FILE] [-d DATA]
[--dataset DATASET] [-e EPOCHS]
[-lr LEARNING_RATE] [-m MODEL_NAME]
[-w NUM_WORKERS] [-s SAVE_LOCATION]
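An illustrative single-node invocation (the model name, dataset, and save path are examples, not defaults):

```
python trainer_pytorch.py -m resnet18 -b 128 -e 50 -lr 0.1 -w 4 --dataset cifar10 -s ./checkpoints
```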
distributed_trainer.py [-h] [-b BATCH_SIZE] [-c CONFIGURATIONS]
[--configuration-file CONFIGURATION_FILE]
[-d DATA] [--dataset DATASET]
[--distribute-data] [-da DISTRIBUTED_ADDRESS]
[-dp DISTRIBUTED_PORT]
[--distributed-backend DISTRIBUTED_BACKEND]
[-e EPOCHS] [-g GLOO_FILE] [-lr LEARNING_RATE]
[-m MODEL_NAME] [--num-nodes NUM_NODES]
[--num-gpus NUM_GPUS] [-w NUM_WORKERS]
[-s SAVE_LOCATION]
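An illustrative two-node run with four GPUs per node (the address, port, backend, and hyperparameters are examples, not defaults):

```
python distributed_trainer.py -m resnet18 -b 128 -e 50 -lr 0.1 -w 4 \
    --dataset cifar10 --num-nodes 2 --num-gpus 4 \
    -da 10.0.0.1 -dp 29500 --distributed-backend nccl \
    --distribute-data -s ./checkpoints
```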
The following graph shows the epoch timings for various configurations. In this experiment, each GPU trained on the entire dataset (the data was replicated rather than sharded across GPUs), which increased the epoch time but brought a larger decrease in the number of epochs required to reach a given accuracy.
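For context, this replicated-data setup contrasts with sharding the dataset across GPUs using PyTorch's `DistributedSampler`. The sketch below illustrates the two loader setups; the helper is hypothetical, and sharding assumes `torch.distributed.init_process_group` has already been called.

```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def make_loader(dataset, batch_size, num_workers, shard=False):
    # shard=False: every GPU iterates over the full dataset, as in this
    # experiment (longer epochs, but fewer epochs to a given accuracy).
    # shard=True: each rank sees ~1/world_size of the samples per epoch.
    sampler = DistributedSampler(dataset) if shard else None
    return DataLoader(dataset,
                      batch_size=batch_size,
                      shuffle=(sampler is None),
                      sampler=sampler,
                      num_workers=num_workers)
```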
| MAE | RMSE |
|---|---|
| 1.84 | 4.60 |

| MAE | RMSE |
|---|---|
| 0.047 | 0.10 |