
Setup Instructions

Table of Contents:

  • 1 - Initial Setup
  • 2 - Create a Docker Image
  • 3 - DVC Experiment Pipeline
  • 4 - TensorBoard Metrics
  • 5 - Test and Debug Locally
  • 6 - SLURM Job Configuration
  • 7 - HPC Cluster Setup
  • 8 - Test and Debug on the HPC Cluster

1 - Initial Setup

This section will guide you through your initial project setup. Use your local machine for development and debugging, and reserve the cluster primarily for training and minor configurations.

Create your Git Repository from the Template

  • Navigate to the template repository on GitHub.
  • Click Use this template → Create a new repository.
  • Configure the repository settings as needed.
  • Clone your new repository:
git clone git@github.com:<github_user>/<repository_name>.git

Note: Replace <github_user> and <repository_name> with your actual GitHub username and repository name, or copy the URL of your repository from GitHub.

Change the Project Name

In your Git repository open the file global.env and modify the following variable (the others can be changed later):

TUSTU_PROJECT_NAME: Your Project Name

Set up a Virtual Environment

Go to your repository, create a virtual environment with the Python version of your choice, and install the required dependencies:

cd <repository_name>
python3.xx -m venv venv
source venv/bin/activate
pip install dvc torch tensorboard 

Note: If you choose a different virtual environment name, update it in .gitignore.

Save the Python version of your virtual environment to the global environment file global.env (this is necessary for the Docker image build later):

Info: Check your current version with python --version.

TUSTU_PYTHON_VERSION: The Python version for your project

Configure your DVC Remote

Choose a supported storage type and install the required DVC plugin (e.g., for WebDAV):

WebDAV

pip install dvc_webdav

Quick configuration: Uses existing config file and overwrites only required parts.

dvc remote add -d myremote webdavs://example.com/path/to/storage --force
dvc remote modify --local myremote user 'yourusername'
dvc remote modify --local myremote password 'yourpassword'

Full configuration: Reinitializes DVC repository and adds all configurations from scratch.

rm -rf .dvc/
dvc init 
dvc remote add -d myremote webdavs://example.com/path/to/storage
dvc remote modify --local myremote user 'yourusername'
dvc remote modify --local myremote password 'yourpassword'
dvc remote modify myremote timeout 600
dvc config cache.shared group
dvc config cache.type symlink

Info: For detailed information regarding other storage types, refer to the DVC documentation.

SSH

pip install dvc_ssh
dvc remote add -d myremote ssh://<ssh-alias>:/path/to/storage --force
# optional:
dvc remote modify --local myremote keyfile /path/to/keyfile
# and (if needed)
dvc remote modify myremote ask_passphrase true
# or
dvc remote modify --local myremote passphrase mypassphrase

Note: If pushing to or pulling from the SSH remote fails with ERROR: unexpected error - SSHClientConfig.__init__() missing 2 required positional arguments: 'host' and 'port', downgrade the asyncssh package to version 2.18.0. This is a known issue that should be fixed soon.

Configure Docker Registry

  • Sign Up for Docker Hub: If you do not have an account, register at Docker Hub.
  • Configure GitHub Secrets: In your GitHub repository, go to Settings → Security → Secrets and variables → Actions → New repository secret, and add secrets for:
    • DOCKER_USERNAME: Your Docker Hub username
    • DOCKER_PASSWORD: Your Docker Hub password
  • Update Global Environment File: Edit global.env to set:
    • TUSTU_DOCKERHUB_USERNAME: Your Docker Hub username

Connect SSH Host for TensorBoard (Optional)

Open your SSH configuration (~/.ssh/config) and add your SSH host:

Host yourserveralias
   HostName yourserver.domain.com
   User yourusername
   IdentityFile ~/.ssh/your_identity_file

Log in to your server, entering your password when prompted. Once you have logged in successfully, log out again:

ssh yourserveralias
exit

Copy your public SSH key to the remote server:

ssh-copy-id -i ~/.ssh/your_identity_file yourserveralias

You should now be able to log in without password authentication:

ssh yourserveralias

Modify the following variable in your global.env:

TUSTU_TENSORBOARD_HOST: yourserveralias

2 - Create a Docker Image

Install and Freeze Dependencies

Install all the necessary dependencies in your local virtual environment:

source venv/bin/activate
pip install dependency1 dependency2 ... 

Update the requirements.txt file with fixed versions from your virtual environment:

pip freeze > requirements.txt

Build the Docker Image

To debug your Docker image locally, install Docker for your operating system / distro. For Windows and macOS, you can install Docker Desktop.

To build your Docker image, use the following command in your project directory. Substitute the placeholder <your_image_name> with a name for your image:

docker build -t <your_image_name> .

Info: The Dockerfile provided in the template will install the specified Python version (see Set Up a Virtual Environment) and all dependencies from the requirements.txt file on a minimal Debian image.

Test the Docker Image

Run the Docker image locally in an interactive shell to test that everything works as expected:

docker run -it --rm <your_image_name> /bin/bash

Automated Image Builds with GitHub Actions

After testing your initial Docker image locally, use the GitHub Actions workflow for automatic builds:

  • Make sure your dependency versions are fixed in requirements.txt.
  • Push your changes to GitHub and the provided workflow docker_image.yml builds the Docker image and pushes it to your configured Docker registry.
  • It is triggered whenever the Dockerfile, the requirements.txt, or the workflow itself is modified (see the trigger sketch below).

Note: GitHub runners for free public repositories have a 14GB storage limit (About GitHub runners), so the Docker image built with docker/build-push-action must not exceed this size.

Info: At the moment the images are only built on ubuntu-latest runners for the x86_64 architecture. Modify docker_image.yml if other architectures are required.
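
As an illustration of how such a trigger is usually expressed, a paths filter of roughly the following shape could appear in docker_image.yml (this is only a sketch; the workflow file shipped with the template is authoritative, and the listed paths are assumptions):

on:
  push:
    paths:
      - 'Dockerfile'
      - 'requirements.txt'
      - '.github/workflows/docker_image.yml'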

3 - DVC Experiment Pipeline

This section guides you through setting up the DVC experiment pipeline. The DVC experiment pipeline allows you to manage and version your machine learning workflows, making it easier to track, reproduce, and share your experiments. It also optimizes computational and storage costs by using an internal cache storage to avoid redundant computation of pipeline stages.

Info: For a deeper understanding of DVC, refer to the DVC Documentation.

Add Dataset to the DVC Repository / Remote

Add data to your experiment pipeline (e.g., raw data) and push it to your DVC remote:

dvc add data/raw
# follow the instructions and add the .dvc file to git
git add data/raw.dvc
dvc push

Info: Files added with DVC should be Git-ignored; adding them with DVC automatically creates the necessary .gitignore entries. Git only tracks reference files with a .dvc suffix (e.g. data/raw.dvc). Make sure you add and push these .dvc files to the Git remote at the end of this section.

Modularize your Codebase

If your project started as a Jupyter Notebook or a single Python script, split it into separate Python scripts that represent the different stages of your pipeline (e.g. preprocess.py, train.py, export.py, ...) and their dependencies (e.g. model.py, ...). This modular structure is necessary for integrating a DVC pipeline. You can find an example implementation in the source directory.
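
For example, a split along these lines might look roughly as follows (directory and file names are illustrative, apart from the utils modules that are imported in the examples below):

<project_root>/
   source/             # pipeline stage scripts
      preprocess.py
      train.py
      export.py
      model.py          # shared dependency
   utils/
      config.py         # Params object and random seed helpers
      logs.py           # TensorBoard logging helpers
   params.yaml
   dvc.yaml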

Integrate Hyperparameter Configuration

  • Identify the hyperparameters in your scripts that should be configurable.
  • Add the hyperparameters to the params.yaml file, organized by stage or module. Use a general: section for shared hyperparameters.
  • Access the required parameters in dict notation after instantiating a Params object:
# train.py
from utils import config

def main():
   params = config.Params()
   random_seed = params['general']['random_seed']
   batch_size = params['train']['batch_size']
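
The corresponding params.yaml for this snippet could be organized roughly like this (the grouping by stage follows the bullet points above; the concrete keys and values are only illustrative):

general:
  random_seed: 42
train:
  batch_size: 32
  epochs: 100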

Create a DVC Experiment Pipeline

Manually add your stages to the dvc.yaml file:

  • cmd: Specify the command to run the stage.
  • deps: List the dependencies whose changes should trigger re-execution of the stage.
  • params: Include the hyperparameters from params.yaml whose changes should trigger re-execution of the stage.
  • outs: Add the output directories.
  • Keep save_logs as the last stage; it copies the logs to the DVC experiment branch before the experiment ends and is pushed to the remote.

Note: The stage scripts should be able to recreate the output directories, as DVC deletes them at the beginning of each stage.
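
As a rough sketch, a single stage entry in dvc.yaml could look like the following (script, data, and output paths are illustrative and need to be adapted to your project):

stages:
  train:
    cmd: python source/train.py
    deps:
      - source/train.py
      - data/processed
    params:
      - general.random_seed
      - train.batch_size
    outs:
      - models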

For reproducibility, it's essential that the script outputs remain deterministic. To achieve this, ensure that all random number generators in your imported libraries use fixed random seeds. You can do this by using our utility function as follows:

from utils import config
config.set_random_seeds(random_seed)

Note: This function only sets seeds for the random, numpy, torch and scipy libraries. You can modify this function to include seed setting for any additional libraries or devices that your script relies on.

4 - TensorBoard Metrics

To log your machine learning metrics using TensorBoard, and to enable overview and comparison of DVC experiments, follow the steps below:

Initialize TensorBoard Logging

  • In your training script, import the logs module from the utils package.
  • Create a logs directory for TensorBoard logs by calling logs.return_tensorboard_path().

Info: This function generates a path under logs/tensorboard/<time_stamp>_<dvc_exp_name> within the main repository directory and returns the absolute path required to instantiate the logs.CustomSummaryWriter.

  • If you plan to use TensorBoard’s HParams plugin for hyperparameter tuning, initialize a dictionary with the names of the metrics you intend to log. This setup will allow you to easily monitor and compare hyperparameter performance.
  • Create an instance of logs.CustomSummaryWriter, which extends the standard TensorBoard SummaryWriter class to better support the workflow system of the template. When instantiating, pass the Params object (as defined in your training script; see Integrate Hyperparameter Configuration) to the params argument. This ensures that the hyperparameters are automatically logged along with other metrics in the same TensorBoard log file, making them available for visualization in TensorBoard.
# train.py
from utils import config
from utils import logs

def main():
   params = config.Params()
   # Create a CustomSummaryWriter object to write the TensorBoard logs
   tensorboard_path = logs.return_tensorboard_path()
   metrics = {'Epoch_Loss/train': None, 'Epoch_Loss/test': None, 'Batch_Loss/train': None} # optional
   writer = logs.CustomSummaryWriter(log_dir=tensorboard_path, params=params, metrics=metrics) # metrics optional

Log Metrics

For detailed information on how to write different types of log data, refer to the official PyTorch TensorBoard SummaryWriter Class Documentation.

The following example shows how to log scalar metrics and audio examples in the training loop. Make sure that the metric names used with add_scalar match those in the previously initialized metrics dictionary, especially if you want them to appear in the HParams tab of TensorBoard. If you want to log data within a function, pass the writer as an argument.

for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    epoch_loss_train = train_epoch(training_dataloader, model, loss_fn, optimizer, device, writer, epoch=t)
    epoch_loss_test = test_epoch(testing_dataloader, model, loss_fn, device, writer)
    epoch_audio_example = generate_audio_example(model, device, testing_dataloader)
    writer.add_scalar("Epoch_Loss/train", epoch_loss_train, t)
    writer.add_scalar("Epoch_Loss/test", epoch_loss_test, t)
    writer.add_audio("Audio_Pred/test", epoch_audio_example, t, sample_rate=44100)
    writer.step() # optional for remote syncing (next section)

Note: The writer.add_hparams function has been overridden to avoid writing the hyperparameters to a separate logfile. It is called automatically by the constructor when params are passed.

Enable Remote Syncing (Optional)

If you want to use the CustomSummaryWriter's ability to transfer data to a remote TensorBoard host via SSH at regular intervals, follow these steps:

  • Ensure your SSH host is configured as described in Connect SSH Host for TensorBoard (Optional).
  • In your global.env, set the TUSTU_SYNC_INTERVAL to a value greater than 0. This enables data transfer via rsync to your remote SSH TensorBoard host.
  • Add writer.step() in your epoch train loop to count epochs and trigger syncing at defined intervals.

This process creates a directory (including parent directories) under <tustu_tensorboard_logs_dir>/<tustu_project_name>/logs/tensorboard/ on your SSH server and synchronises the log file and its updates to this directory. You can change the base directory in the global.env file by setting TUSTU_TENSORBOARD_LOGS_DIR to a different location.
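
For example, the relevant lines in global.env might look like this (the interval value and the target directory are illustrative):

TUSTU_SYNC_INTERVAL=5
TUSTU_TENSORBOARD_LOGS_DIR=tensorboard-logs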

5 - Test and Debug Locally

We recommend that you test and debug your DVC experiment pipeline locally before running it on the HPC cluster. This process will help you identify and resolve any problems that may occur during pipeline execution.

Run the DVC Experiment Pipeline Natively

Execute the following command to run the pipeline:

./exp_workflow.sh

This shell script runs the experiment pipeline (dvc exp run) and performs some extra steps such as importing the global environment variables and duplicating the repository into a temporary directory to avoid conflicts with the DVC cache when running multiple experiments simultaneously.

Run the DVC Experiment Pipeline in a Docker Container

In order to run the DVC experiment pipeline in a Docker container, we first need to set up a Docker volume containing our local SSH setup, since the local SSH configuration is not visible to the Docker container; we will then mount this volume into the container. A simple bind mount will not always work, because it does not change the ownership of the .ssh folder.

To create the Docker volume, use the following command:

docker volume create --name ssh-config

To copy the local SSH setup into the Docker volume, we create a temporary container that mounts both the volume and the local .ssh directory.

docker run -it --rm -v ssh-config:/root/.ssh -v $HOME/.ssh:/local-ssh alpine:latest
# Inside the container
cp -r /local-ssh/* /root/.ssh/
# Copying the files will change the ownership to root
# Check the files
ls -la /root/.ssh/

# Optional - if you keep your SSH config in your dotfiles, something similar to this might be needed as well
docker run -it --rm -v ssh-config:/root/.dotfiles/ssh/.ssh -v $HOME/.ssh:/local-ssh alpine:latest
# Inside the container
cp -r /local-ssh/* /root/.ssh/

Info: This will not change the ownership of the files on your local machine.

Next, since DVC needs the Git username and email to be set, create a local.env file in the repository root directory with the following content:

TUSTU_GIT_USERNAME="Your Name"
TUSTU_GIT_EMAIL="name@domain.com"

Info: This file is git-ignored and is read by the exp_workflow.sh script. It will then configure git with the provided username and email every time the script is run. Your local git configuration will not be changed, as this happens only if the exp_workflow.sh script is run from within a Docker container.

We can now run the experiment within the Docker container, with the repository and the SSH volume mounted:

docker run --rm \
  --mount type=bind,source="$(pwd)",target=/home/app \
  --mount type=volume,source=ssh-config,target=/root/.ssh \
  <your_image_name> \
  /home/app/exp_workflow.sh

In case you want to interact with the container, you can run it in interactive mode. docker run --help shows you all available options.

docker run -it --rm \
  --mount type=bind,source="$(pwd)",target=/home/app \
  --mount type=volume,source=ssh-config,target=/root/.ssh \
  <your_image_name>

6 - SLURM Job Configuration

This section covers setting up SLURM jobs for the HPC cluster. SLURM manages resource allocation for your task, which we specify in a batch job script. Our goal is to run the DVC experiment pipeline on the nodes inside a Singularity container that has been pulled from your DockerHub image and converted. The batch job script template slurm_job.sh handles these processes and requires minimal configuration.

For single GPU nodes, modify the SBATCH directives in slurm_job.sh for your project name, memory usage, and time limit, as shown in the example below. Also add your email address to receive notifications about the job status:

#SBATCH -J <your_project_name>
#SBATCH --mem=100GB
#SBATCH --time=10:00:00
#SBATCH --mail-user=<your-email-address>

Tip: For initial testing, consider using lower time and memory settings to get higher priority in the queue.

Note: SBATCH directives are executed first and cannot be easily configured with environment variables.

Info: For detailed information, consult the official SLURM Documentation. See HPC Documentation for information regarding the HPC Cluster - ZECM, TU Berlin.

7 - HPC Cluster Setup

This section shows you how to set up your project on the HPC Cluster. It assumes that previous configurations have already been pushed to the Git remote, so it focuses on reconfiguring Git-ignored items and SSH keys. It also covers general filesystem and storage configurations that are not project specific.

SSH into the HPC Cluster

ssh hpc

Tip: We recommend using an SSH config for faster access. For general information on accessing the HPC Cluster - ZECM, TU Berlin, see the HPC Documentation.

Initial Setup

Create a personal subdirectory on /scratch, since space is limited on the user home directory:

cd /scratch
mkdir <username>

Update the global environment file global.env with the path to your HPC scratch directory:

TUSTU_HPC_DIR=/scratch/<username>

Info: See HPC Documentation for general information about the filesystem on HPC Cluster - ZECM, TU Berlin.

Set up a temporary directory on /scratch to get more space for temporary files. Then add the TMPDIR environment variable to your .bashrc so that Singularity and other applications use this directory for temporary files; these can get quite large, as Singularity uses them to extract the image and run the container. Also change Singularity's cache directory with the SINGULARITY_CACHEDIR environment variable.

mkdir <username>/tmp
echo 'export TMPDIR=/scratch/<username>/tmp' >> ~/.bashrc
echo 'export SINGULARITY_CACHEDIR=/scratch/<username>/.singularity' >> ~/.bashrc
source ~/.bashrc

Restrict permissions on your subdirectory (Optional):

chmod 700 <username>/

Assuming you have already configured Git on the HPC cluster, clone your Git repository to /scratch/<username>:

cd <username>
git clone git@github.com:<github_user>/<repository_name>.git

Info: On the HPC cluster, an SSH key (~/.ssh/id_rsa) is generated for you the first time you log in. You can use this key to access your Git repository.

Set up a virtual environment:

cd <repository_name>
module load python
python3 -m venv venv
module unload python
source venv/bin/activate
pip install dvc <dvc_storage_plugin> # e.g. dvc_webdav

Warning: If you don't unload the Python environment module, the libraries won't be pip-installed into your virtual environment but into your user site directory!

Configure DVC remote if local configuration is required:

dvc remote modify --local myremote user 'yourusername'
dvc remote modify --local myremote password 'yourpassword'

Connect TensorBoard Host (Optional): Repeat steps 1-4 of the section Connect SSH Host for TensorBoard (Optional).

8 - Test and Debug on the HPC Cluster

You can run the DVC experiment pipeline on the HPC Cluster by submitting a single SLURM job:

sbatch slurm_job.sh

The logs are stored in the logs directory of your repository. You can monitor the job status with squeue -u <username> and check the logs with cat logs/slurm/slurm-<job_id>.out or follow the tail with tail -f logs/slurm/slurm-<job_id>.out.

The first time you run a job, the Docker image is pulled from DockerHub and converted to a Singularity image. This process can take some time, but it is only done once. The image is then saved in the repository directory on the HPC cluster and reused for subsequent jobs. If you update the Docker image, you can force it to be pulled again by adding the flag --rebuild-image.

sbatch slurm_job.sh --rebuild-image

You can list all the available options of the slurm_job.sh script by running:

slurm_job.sh --help

To run multiple submissions with a parameter grid or predefined parameter sets, modify multi_submission.py and run:

python multi_submission.py

For more information on running and monitoring jobs, refer to the User Guide. Flags passed to the multi_submission.py script are forwarded to the slurm_job.sh script, so you could run the following command to submit multiple jobs that all force the image to be rebuilt:

python multi_submission.py --rebuild-image

Info: Singularity is used for containerization on the cluster. In slurm_job.sh the image is pulled from DockerHub and converted to a Singularity image. Unlike Docker, Singularity by default binds the complete home directory of the executing user into the container. Also, the user inside a Singularity container is the same as the user on the host system. Therefore, we do not get the same permission issues as with Docker.