-
Notifications
You must be signed in to change notification settings - Fork 31
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
1 parent
271963f
commit 3431b9e
Showing
1 changed file
with
259 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,259 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Using `f3dasm` on a High-Performance Cluster Computer\n", | ||
"\n", | ||
"Your `f3dasm` workflow can be seamlessly translated to a high-performance computing cluster. \n", | ||
"The advantage is that you can parallelize the total number of experiments among the nodes of the cluster. \n", | ||
"This is especially useful when you have a large number of experiments to run.\n", | ||
"\n", | ||
"> This example has been tested on the following high-performance computing cluster systems:\n", | ||
"> \n", | ||
"> - The [hpc06 cluster of Delft University of Technology](https://hpcwiki.tudelft.nl/index.php/Main_Page), using the [TORQUE resource manager](https://en.wikipedia.org/wiki/TORQUE).\n", | ||
"> - The [DelftBlue: TU Delft supercomputer](https://www.tudelft.nl/dhpc/system), using the [SLURM resource manager](https://slurm.schedmd.com/documentation.html).\n", | ||
"> - The [OSCAR compute cluster from Brown University](https://docs.ccv.brown.edu/oscar/getting-started), using the [SLURM resource manager](https://slurm.schedmd.com/documentation.html).\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"from time import sleep\n", | ||
"\n", | ||
"import numpy as np\n", | ||
"\n", | ||
"from f3dasm import HPC_JOBID, ExperimentData\n", | ||
"from f3dasm.design import make_nd_continuous_domain" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# We will create the following data-driven process:\n", | ||
"\n", | ||
"- Create a 20D continuous `Domain`.\n", | ||
"- Sample from the domain using a Latin-hypercube sampler.\n", | ||
"- With multiple nodes, use a data generation function, which will be the `\"Ackley\"` function from the benchmark functions.\n", | ||
"\n", | ||
"<div style=\"text-align: center;\">\n", | ||
" <img src=\"../../img/f3dasm-workflow-example-cluster.png\" alt=\"Block\" title=\"Block\" width=\"60%\" />\n", | ||
"</div>" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"We want to ensure that the sampling is done only once, and that the data generation is performed in parallel. \n", | ||
"Therefore, we can divide the different nodes into two categories:\n", | ||
"\n", | ||
"- The first node (`f3dasm.HPC_JOBID == 0`) will be the **master** node, responsible for creating the design-of-experiments and sampling (the `create_experimentdata` function).\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 2, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"def create_experimentdata():\n", | ||
" \"\"\"Design of Experiment\"\"\"\n", | ||
" # Create a domain object\n", | ||
" domain = make_nd_continuous_domain(\n", | ||
" bounds=np.tile([0.0, 1.0], (20, 1)), dimensionality=20)\n", | ||
"\n", | ||
" # Create the ExperimentData object\n", | ||
" data = ExperimentData(domain=domain)\n", | ||
"\n", | ||
" # Sampling from the domain\n", | ||
" data.sample(sampler='latin', n_samples=10)\n", | ||
"\n", | ||
" # Store the data to disk\n", | ||
" data.store()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"- All the other nodes (`f3dasm.HPC_JOBID > 0`) will be **process** nodes, which will retrieve the `ExperimentData` from disk and proceed directly to the data generation function.\n", | ||
"\n", | ||
"<div style=\"text-align: center;\">\n", | ||
" <img src=\"../../img/f3dasm-workflow-cluster-roles.png\" alt=\"Block\" title=\"Block\" width=\"60%\" />\n", | ||
"</div>" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 3, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"def worker_node():\n", | ||
" # Extract the experimentdata from disk\n", | ||
" data = ExperimentData.from_file(project_dir='.')\n", | ||
"\n", | ||
" \"\"\"Data Generation\"\"\"\n", | ||
" # Use the data-generator to evaluate the initial samples\n", | ||
" data.evaluate(data_generator='Ackley', mode='cluster')\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"The entrypoint of the script can now check the jobid of the current node and decide whether to create the experiment data or to run the data generation function:" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 4, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"if __name__ == '__main__':\n", | ||
" # Check the jobid of the current node\n", | ||
" if HPC_JOBID is None:\n", | ||
" # If the jobid is none, we are not running anything now\n", | ||
" pass\n", | ||
"\n", | ||
" elif HPC_JOBID == 0:\n", | ||
" create_experimentdata()\n", | ||
" worker_node()\n", | ||
" elif HPC_JOBID > 0:\n", | ||
" # Asynchronize the jobs in order to omit racing conditions\n", | ||
" sleep(HPC_JOBID)\n", | ||
" worker_node()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Running the Program\n", | ||
"\n", | ||
"You can run the workflow by submitting the bash script to the HPC queue. \n", | ||
"Make sure you have [miniconda3](https://docs.anaconda.com/free/miniconda/index.html) installed on the cluster and that you have created a conda environment (in this example named `f3dasm_env`) with the necessary packages.\n", | ||
"\n", | ||
"### TORQUE Example\n", | ||
"\n", | ||
"---\n", | ||
"\n", | ||
"```bash\n", | ||
"#!/bin/bash\n", | ||
"# Torque directives (#PBS) must always be at the start of a job script!\n", | ||
"#PBS -N ExampleScript\n", | ||
"#PBS -q mse\n", | ||
"#PBS -l nodes=1:ppn=12,walltime=12:00:00\n", | ||
"\n", | ||
"# Make sure I'm the only one that can read my output\n", | ||
"umask 0077\n", | ||
"\n", | ||
"# The PBS_JOBID looks like 1234566[0].\n", | ||
"# With the following line, we extract the PBS_ARRAYID, the part in the brackets []:\n", | ||
"PBS_ARRAYID=$(echo \"${PBS_JOBID}\" | sed 's/\\[[^][]*\\]//g')\n", | ||
"\n", | ||
"module load use.own\n", | ||
"module load miniconda3\n", | ||
"cd $PBS_O_WORKDIR\n", | ||
"\n", | ||
"# Activate my conda environment:\n", | ||
"source activate f3dasm_env\n", | ||
"\n", | ||
"# Limit the number of threads\n", | ||
"OMP_NUM_THREADS=12\n", | ||
"export OMP_NUM_THREADS=12\n", | ||
"\n", | ||
"# If the PBS_ARRAYID is not set, set it to None\n", | ||
"if ! [ -n \"${PBS_ARRAYID+1}\" ]; then\n", | ||
" PBS_ARRAYID=None\n", | ||
"fi\n", | ||
"\n", | ||
"# Execute my Python program with the jobid flag\n", | ||
"python main.py --jobid=${PBS_ARRAYID}\n", | ||
"```\n", | ||
"\n", | ||
"---" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### SLURM Example\n", | ||
"\n", | ||
"---\n", | ||
"\n", | ||
"```bash\n", | ||
"#!/bin/bash -l\n", | ||
"\n", | ||
"#SBATCH -J \"ExampleScript\" # Name of the job\n", | ||
"#SBATCH --get-user-env # Set environment variables\n", | ||
"\n", | ||
"#SBATCH --partition=compute\n", | ||
"#SBATCH --time=12:00:00\n", | ||
"#SBATCH --nodes=1\n", | ||
"#SBATCH --ntasks-per-node=12\n", | ||
"#SBATCH --cpus-per-task=1\n", | ||
"#SBATCH --mem=0\n", | ||
"#SBATCH --account=research-eemcs-me\n", | ||
"#SBATCH --array=0-2\n", | ||
"\n", | ||
"source activate f3dasm_env\n", | ||
"\n", | ||
"# Execute my Python program with the jobid flag\n", | ||
"python3 main.py --jobid=${SLURM_ARRAY_TASK_ID}\n", | ||
"```\n", | ||
"\n", | ||
"---" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"You can run the workflow by submitting the bash script to the HPC queue. \n", | ||
"The following command submits an array job with 3 jobs where `f3dasm.HPC_JOBID` takes values of 0, 1, and 2.\n", | ||
"\n", | ||
"### TORQUE Example\n", | ||
"\n", | ||
"```bash\n", | ||
"qsub pbsjob.sh -t 0-2\n", | ||
"```\n", | ||
"\n", | ||
"### SLURM Example\n", | ||
"\n", | ||
"```bash\n", | ||
"sbatch --array 0-2 pbsjob.sh\n", | ||
"```" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "f3dasm_env3", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.8.17" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |