Distributed training & Suggestive resource allocation


This project aims to study the impact of distributed training of deep learning models and to determine whether a predictive model can be designed to predict epoch speed and time-to-accuracy. We selected image classification as the application and CNN models for our experiments.

We collected training logs for around 75 configurations, varying the model type, batch size, GPU type, number of GPUs, and number of data loaders. Once the predictive model (also referred to as the recommender model) is trained and its test error is low, we aim to make it available to end users by hosting it on a Kubernetes cluster as a web application.

Finally, this can grow into a prescriptive solution that suggests the configuration with the lowest training cost before the user invests in hardware.
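
As a rough illustration of how such a recommender model could be fit to the collected logs, the sketch below trains a LightGBM regressor to predict time per epoch from the configuration features. The CSV file, column names, and hyperparameters are assumptions for illustration, not the project's actual pipeline.

# Illustrative sketch only: the CSV file, column names, and hyperparameters
# are hypothetical placeholders, not the repository's actual data format.
import lightgbm as lgb
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

logs = pd.read_csv("training_logs.csv")            # hypothetical log export
features = ["model_type", "batch_size", "gpu_type", "num_gpus", "num_workers"]
for col in ("model_type", "gpu_type"):             # categorical configuration fields
    logs[col] = logs[col].astype("category")

X_train, X_test, y_train, y_test = train_test_split(
    logs[features], logs["epoch_time_s"], test_size=0.2, random_state=42)

model = lgb.LGBMRegressor(n_estimators=200, learning_rate=0.05)
model.fit(X_train, y_train)                        # pandas category dtypes are handled natively

preds = model.predict(X_test)
print("MAE :", mean_absolute_error(y_test, preds))
print("RMSE:", mean_squared_error(y_test, preds) ** 0.5)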

Trainers

Single GPU Trainer

trainer_pytorch.py [-h] [-b BATCH_SIZE] [-c CONFIGURATIONS]
                        [--configuration-file CONFIGURATION_FILE] [-d DATA]
                        [--dataset DATASET] [-e EPOCHS] 
                        [-lr LEARNING_RATE] [-m MODEL_NAME]                          
                        [-w NUM_WORKERS] [-s SAVE_LOCATION]
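
For example, a single-GPU run might be launched as follows (the argument values and model name are illustrative only):

python trainer_pytorch.py -b 64 -e 10 -lr 0.01 -m resnet18 -w 4 -d ./data -s ./checkpoints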

Distributed Trainer (Multi-GPU)

distributed_trainer.py [-h] [-b BATCH_SIZE] [-c CONFIGURATIONS]
                            [--configuration-file CONFIGURATION_FILE]
                            [-d DATA] [--dataset DATASET]
                            [--distribute-data] [-da DISTRIBUTED_ADDRESS]
                            [-dp DISTRIBUTED_PORT]
                            [--distributed-backend DISTRIBUTED_BACKEND]
                            [-e EPOCHS] [-g GLOO_FILE] [-lr LEARNING_RATE]
                            [-m MODEL_NAME] [--num-nodes NUM_NODES]
                            [--num-gpus NUM_GPUS] [-w NUM_WORKERS]
                            [-s SAVE_LOCATION]
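
The distributed trainer follows the standard PyTorch DDP pattern: one process per GPU, a process group initialized from the distributed address and port, and gradients all-reduced during the backward pass. The following is a minimal, self-contained sketch of that pattern, not the repository's actual code; the model, dataset, and hyperparameters are illustrative assumptions, and the DistributedSampler corresponds roughly to the --distribute-data option (without it, each GPU iterates over the full dataset, as in the evaluation below).

# Minimal DDP sketch (illustrative only; model, dataset, and hyperparameters
# are assumptions, not the repository's actual training code).
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler
from torchvision import datasets, models, transforms

def train(rank, world_size):
    # One process per GPU; NCCL is the usual backend for GPU training.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = models.resnet18(num_classes=10).cuda(rank)
    model = DDP(model, device_ids=[rank])

    dataset = datasets.CIFAR10(root="./data", train=True, download=True,
                               transform=transforms.ToTensor())
    # The sampler shards the dataset so each GPU sees a distinct split.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=64, sampler=sampler, num_workers=2)

    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for epoch in range(10):
        sampler.set_epoch(epoch)          # reshuffle shards each epoch
        for images, labels in loader:
            images, labels = images.cuda(rank), labels.cuda(rank)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()               # gradients are all-reduced across GPUs here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(train, args=(world_size,), nprocs=world_size)

Spawning one process per GPU covers the single-node, multi-GPU case; for multiple nodes, the --num-nodes, -da, and -dp options supply the rendezvous information consumed by the process-group initialization.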

Evaluations

The following graph shows the epoch timings for various configurations. In this experiment, each GPU trained on the entire dataset; epoch time therefore increased, but the number of epochs required to reach a given accuracy decreased by a larger margin.

[Graph: epoch time per configuration]

Recommender model

Prediction target              MAE      RMSE
Time per epoch (in seconds)    1.84     4.60
Accuracy for an epoch          0.047    0.10

Frameworks & Libraries

  1. PyTorch
  2. LightGBM

Environments

  1. Jupyter Notebooks
  2. PyCharm
  3. Spyder
  4. Visual Studio Code
