Skip to content

Commit

Permalink
added cluster.ipynb
Browse files Browse the repository at this point in the history
  • Loading branch information
mpvanderschelling committed Jan 17, 2025
1 parent 271963f commit 3431b9e
Showing 1 changed file with 259 additions and 0 deletions.
259 changes: 259 additions & 0 deletions docs/source/notebooks/data-driven/cluster.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,259 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Using `f3dasm` on a High-Performance Cluster Computer\n",
"\n",
"Your `f3dasm` workflow can be seamlessly translated to a high-performance computing cluster. \n",
"The advantage is that you can parallelize the total number of experiments among the nodes of the cluster. \n",
"This is especially useful when you have a large number of experiments to run.\n",
"\n",
"> This example has been tested on the following high-performance computing cluster systems:\n",
"> \n",
"> - The [hpc06 cluster of Delft University of Technology](https://hpcwiki.tudelft.nl/index.php/Main_Page), using the [TORQUE resource manager](https://en.wikipedia.org/wiki/TORQUE).\n",
"> - The [DelftBlue: TU Delft supercomputer](https://www.tudelft.nl/dhpc/system), using the [SLURM resource manager](https://slurm.schedmd.com/documentation.html).\n",
"> - The [OSCAR compute cluster from Brown University](https://docs.ccv.brown.edu/oscar/getting-started), using the [SLURM resource manager](https://slurm.schedmd.com/documentation.html).\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from time import sleep\n",
"\n",
"import numpy as np\n",
"\n",
"from f3dasm import HPC_JOBID, ExperimentData\n",
"from f3dasm.design import make_nd_continuous_domain"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# We will create the following data-driven process:\n",
"\n",
"- Create a 20D continuous `Domain`.\n",
"- Sample from the domain using a Latin-hypercube sampler.\n",
"- With multiple nodes, use a data generation function, which will be the `\"Ackley\"` function from the benchmark functions.\n",
"\n",
"<div style=\"text-align: center;\">\n",
" <img src=\"../../img/f3dasm-workflow-example-cluster.png\" alt=\"Block\" title=\"Block\" width=\"60%\" />\n",
"</div>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We want to ensure that the sampling is done only once, and that the data generation is performed in parallel. \n",
"Therefore, we can divide the different nodes into two categories:\n",
"\n",
"- The first node (`f3dasm.HPC_JOBID == 0`) will be the **master** node, responsible for creating the design-of-experiments and sampling (the `create_experimentdata` function).\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"def create_experimentdata():\n",
" \"\"\"Design of Experiment\"\"\"\n",
" # Create a domain object\n",
" domain = make_nd_continuous_domain(\n",
" bounds=np.tile([0.0, 1.0], (20, 1)), dimensionality=20)\n",
"\n",
" # Create the ExperimentData object\n",
" data = ExperimentData(domain=domain)\n",
"\n",
" # Sampling from the domain\n",
" data.sample(sampler='latin', n_samples=10)\n",
"\n",
" # Store the data to disk\n",
" data.store()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- All the other nodes (`f3dasm.HPC_JOBID > 0`) will be **process** nodes, which will retrieve the `ExperimentData` from disk and proceed directly to the data generation function.\n",
"\n",
"<div style=\"text-align: center;\">\n",
" <img src=\"../../img/f3dasm-workflow-cluster-roles.png\" alt=\"Block\" title=\"Block\" width=\"60%\" />\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"def worker_node():\n",
" # Extract the experimentdata from disk\n",
" data = ExperimentData.from_file(project_dir='.')\n",
"\n",
" \"\"\"Data Generation\"\"\"\n",
" # Use the data-generator to evaluate the initial samples\n",
" data.evaluate(data_generator='Ackley', mode='cluster')\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The entrypoint of the script can now check the jobid of the current node and decide whether to create the experiment data or to run the data generation function:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"if __name__ == '__main__':\n",
" # Check the jobid of the current node\n",
" if HPC_JOBID is None:\n",
" # If the jobid is none, we are not running anything now\n",
" pass\n",
"\n",
" elif HPC_JOBID == 0:\n",
" create_experimentdata()\n",
" worker_node()\n",
" elif HPC_JOBID > 0:\n",
" # Asynchronize the jobs in order to omit racing conditions\n",
" sleep(HPC_JOBID)\n",
" worker_node()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Running the Program\n",
"\n",
"You can run the workflow by submitting the bash script to the HPC queue. \n",
"Make sure you have [miniconda3](https://docs.anaconda.com/free/miniconda/index.html) installed on the cluster and that you have created a conda environment (in this example named `f3dasm_env`) with the necessary packages.\n",
"\n",
"### TORQUE Example\n",
"\n",
"---\n",
"\n",
"```bash\n",
"#!/bin/bash\n",
"# Torque directives (#PBS) must always be at the start of a job script!\n",
"#PBS -N ExampleScript\n",
"#PBS -q mse\n",
"#PBS -l nodes=1:ppn=12,walltime=12:00:00\n",
"\n",
"# Make sure I'm the only one that can read my output\n",
"umask 0077\n",
"\n",
"# The PBS_JOBID looks like 1234566[0].\n",
"# With the following line, we extract the PBS_ARRAYID, the part in the brackets []:\n",
"PBS_ARRAYID=$(echo \"${PBS_JOBID}\" | sed 's/\\[[^][]*\\]//g')\n",
"\n",
"module load use.own\n",
"module load miniconda3\n",
"cd $PBS_O_WORKDIR\n",
"\n",
"# Activate my conda environment:\n",
"source activate f3dasm_env\n",
"\n",
"# Limit the number of threads\n",
"OMP_NUM_THREADS=12\n",
"export OMP_NUM_THREADS=12\n",
"\n",
"# If the PBS_ARRAYID is not set, set it to None\n",
"if ! [ -n \"${PBS_ARRAYID+1}\" ]; then\n",
" PBS_ARRAYID=None\n",
"fi\n",
"\n",
"# Execute my Python program with the jobid flag\n",
"python main.py --jobid=${PBS_ARRAYID}\n",
"```\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### SLURM Example\n",
"\n",
"---\n",
"\n",
"```bash\n",
"#!/bin/bash -l\n",
"\n",
"#SBATCH -J \"ExampleScript\" # Name of the job\n",
"#SBATCH --get-user-env # Set environment variables\n",
"\n",
"#SBATCH --partition=compute\n",
"#SBATCH --time=12:00:00\n",
"#SBATCH --nodes=1\n",
"#SBATCH --ntasks-per-node=12\n",
"#SBATCH --cpus-per-task=1\n",
"#SBATCH --mem=0\n",
"#SBATCH --account=research-eemcs-me\n",
"#SBATCH --array=0-2\n",
"\n",
"source activate f3dasm_env\n",
"\n",
"# Execute my Python program with the jobid flag\n",
"python3 main.py --jobid=${SLURM_ARRAY_TASK_ID}\n",
"```\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can run the workflow by submitting the bash script to the HPC queue. \n",
"The following command submits an array job with 3 jobs where `f3dasm.HPC_JOBID` takes values of 0, 1, and 2.\n",
"\n",
"### TORQUE Example\n",
"\n",
"```bash\n",
"qsub pbsjob.sh -t 0-2\n",
"```\n",
"\n",
"### SLURM Example\n",
"\n",
"```bash\n",
"sbatch --array 0-2 pbsjob.sh\n",
"```"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "f3dasm_env3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.17"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

0 comments on commit 3431b9e

Please sign in to comment.