This project studies the impact of distributed training on deep learning models, with the goal of determining whether a predictive model can be built to estimate epoch speed and time to accuracy. We selected image classification as the application and CNN models for our experiments.
We collected training logs for around 75 configurations, varying the model type, batch size, GPU type, number of GPUs, and number of data loaders. Once the predictive model (also referred to as the recommender model) is trained and its test error is sufficiently low, we aim to make it available to end users by hosting it on a Kubernetes cluster as a web application.
Finally, this can grow into a prescriptive solution that suggests the least-cost training configuration to the user before they invest in hardware, as sketched below.
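To make this concrete, below is a minimal sketch of how such a recommender could be trained from the collected logs and then used prescriptively. The log file name, the column names, and the choice of a random-forest regressor are illustrative assumptions, not the project's actual pipeline.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical log file: one row per training configuration.
logs = pd.read_csv("training_logs.csv")

categorical = ["model_type", "gpu_type"]             # assumed column names
numeric = ["batch_size", "num_gpus", "num_workers"]  # assumed column names
X = logs[categorical + numeric]
y = logs["epoch_time"]                               # assumed target column

# One-hot encode the categorical features, pass numeric ones through.
model = Pipeline([
    ("encode", ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
        remainder="passthrough")),
    ("regress", RandomForestRegressor(n_estimators=200, random_state=0)),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
model.fit(X_train, y_train)
pred = model.predict(X_test)
print("MAE: ", mean_absolute_error(y_test, pred))
print("RMSE:", mean_squared_error(y_test, pred) ** 0.5)

# Prescriptive use: score a grid of candidate configurations and surface
# the ones with the lowest predicted epoch time (grid values are made up).
candidates = pd.DataFrame([
    {"model_type": "resnet18", "gpu_type": "V100",
     "batch_size": b, "num_gpus": g, "num_workers": w}
    for b in (64, 128, 256) for g in (1, 2, 4) for w in (2, 4)
])
candidates["predicted_epoch_time"] = model.predict(candidates)
print(candidates.sort_values("predicted_epoch_time").head())
```

Hosted behind a simple web endpoint on the Kubernetes cluster, the same `predict` call would back the recommender web application.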
trainer_pytorch.py [-h] [-b BATCH_SIZE] [-c CONFIGURATIONS]
[--configuration-file CONFIGURATION_FILE] [-d DATA]
[--dataset DATASET] [-e EPOCHS]
[-lr LEARNING_RATE] [-m MODEL_NAME]
[-w NUM_WORKERS] [-s SAVE_LOCATION]
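An illustrative single-node invocation (the model name, dataset, and save path are examples, not defaults):

```
python trainer_pytorch.py -m resnet18 -b 128 -e 50 -lr 0.1 -w 4 --dataset cifar10 -s ./checkpoints
```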
distributed_trainer.py [-h] [-b BATCH_SIZE] [-c CONFIGURATIONS]
[--configuration-file CONFIGURATION_FILE]
[-d DATA] [--dataset DATASET]
[--distribute-data] [-da DISTRIBUTED_ADDRESS]
[-dp DISTRIBUTED_PORT]
[--distributed-backend DISTRIBUTED_BACKEND]
[-e EPOCHS] [-g GLOO_FILE] [-lr LEARNING_RATE]
[-m MODEL_NAME] [--num-nodes NUM_NODES]
[--num-gpus NUM_GPUS] [-w NUM_WORKERS]
[-s SAVE_LOCATION]
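An illustrative two-node run with four GPUs per node (the address, port, backend, and hyperparameters are examples, not defaults):

```
python distributed_trainer.py -m resnet18 -b 128 -e 50 -lr 0.1 -w 4 \
    --dataset cifar10 --num-nodes 2 --num-gpus 4 \
    -da 10.0.0.1 -dp 29500 --distributed-backend nccl \
    --distribute-data -s ./checkpoints
```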
The following graph shows the epoch timings for various configurations. In this experiment, each GPU trained on the entire dataset (the data was replicated rather than sharded across GPUs), which increased the epoch time but brought a larger decrease in the number of epochs required to reach a given accuracy.
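For context, this replicated-data setup contrasts with sharding the dataset across GPUs using PyTorch's `DistributedSampler`. The sketch below illustrates the two loader setups; the helper is hypothetical, and sharding assumes `torch.distributed.init_process_group` has already been called.

```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def make_loader(dataset, batch_size, num_workers, shard=False):
    # shard=False: every GPU iterates over the full dataset, as in this
    # experiment (longer epochs, but fewer epochs to a given accuracy).
    # shard=True: each rank sees ~1/world_size of the samples per epoch.
    sampler = DistributedSampler(dataset) if shard else None
    return DataLoader(dataset,
                      batch_size=batch_size,
                      shuffle=(sampler is None),
                      sampler=sampler,
                      num_workers=num_workers)
```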
| MAE | RMSE |
|---|---|
| 1.84 | 4.60 |

| MAE | RMSE |
|---|---|
| 0.047 | 0.10 |