An example of a training file supporting distributed training and curriculum learning with Catalyst
- Create and populate the environment

      MYNEWENV="" # write the name of your environment in quotes, like "torch"
      conda create --name ${MYNEWENV} python=3.9
      conda activate ${MYNEWENV}
      pip install -r requirements.txt
- Training Setup
  - The main training script is `curriculum_training.py`
    - Mainly make sure that the `get_model` method of the `CustomRunner` class initializes your model
  - Create or modify the config file (e.g. `conf/vanilla_3class_gn_11chan32.16.1_exp01.yaml`) as follows:
    - Set `wandb.team` to your team name for proper logging
- Configure `submit-job.sh` by changing the following:

      # Required changes:
      #SBATCH --job-name # Set meaningful job name
      #SBATCH --mail-user=your.email@domain.com
      #SBATCH -p your_partition
      #SBATCH -A your_account
      MYNEWENV="" # Set to your conda environment name
      CONFIG_NAME="your_config"
      CONFIG_PATH="path/to/config"
- Submit the job:

      sbatch submit-job.sh
- Wandb Connection Issues
  - Create a wandb account and obtain an API token
  - Run `wandb login` with your token
  - Update `wandb.team` and `wandb.project` in your config file
- MongoDB Connection
  - Development node: Uses "10.245.12.58"
  - Slurm nodes: Uses "arctrdcn018.rs.gsu.edu"
  - The connection is selected automatically based on the `SLURM_JOB_ID` environment variable, but you can also override it under the `mongo` section of the config file if you run into issues.
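
The MongoDB host selection described above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the function name `pick_mongo_host` and its `override` parameter are hypothetical, and it assumes that a non-empty `SLURM_JOB_ID` means the process is running on a Slurm node. Only the two host values come from this README.

```python
import os

# Host values are taken from the list above. Everything else in this
# sketch (function name, parameters) is hypothetical and for illustration.
DEV_NODE_HOST = "10.245.12.58"
SLURM_NODE_HOST = "arctrdcn018.rs.gsu.edu"


def pick_mongo_host(env=None, override=None):
    """Return the MongoDB host to use for the current process."""
    env = os.environ if env is None else env
    # An explicit value (e.g. read from the `mongo` section of the config
    # file) takes precedence over automatic detection.
    if override:
        return override
    # Slurm sets SLURM_JOB_ID inside a job; a non-empty value is assumed
    # here to mean the code is running on a Slurm node.
    if env.get("SLURM_JOB_ID"):
        return SLURM_NODE_HOST
    return DEV_NODE_HOST


if __name__ == "__main__":
    print(pick_mongo_host())
```

If the automatic choice fails on your machine, passing the host explicitly (the `override` argument in this sketch) mirrors what editing the `mongo` section of the config file is meant to achieve.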