Elliott Slaughter
February 2023
I'd like to propose the following division of responsibilities. The three main groups are:
- Stanford CS staff (Andrej, Brian, Jimmy):
  - Hardware installation and maintenance
  - Base OS installation
  - DNS
  - Time (NTP via time.stanford.edu)
  - Filesystems (including NFS)
  - CUDA drivers (if applicable), but NOT CUDA toolkit/software
  - Infiniband: OFED, nvidia-fm, p2p rdma, etc.
  - SSL certificates
- Legion project members responsible for administering the machine (Elliott, etc.):
  - CUDA toolkit/software
  - SLURM
  - MPI
  - Docker
  - HTTP server
  - A very small number of additional packages (e.g., CMake, compilers)
- Sapling users:
  - All other software packages (installed on a per-user basis)
Here are the steps I envision taking. I use some abbreviations to simplify the instructions.
Roles:

- CS: Stanford CS staff
- LP: Legion project members
- UR: Users

Machines:

- H1: old head node
- H2: new head node
- CN: one CPU compute node, for initial testing
- RC: remaining CPU compute nodes (`c000*` and `n000*`)
- RG: remaining GPU compute nodes (`g000*`)
- CS: Install Ubuntu 20.04 base OS on H2
- CS: Make H2 available via private IPMI as head2-ipmi
- CS: Make H2 available via public DNS as sapling2.stanford.edu
- CS: Make H2 available via public SSH
- CS: Set up DNS on H2 such that it can access H1 and compute nodes
- CS: Configure disks on H2:
  - Two 8 TB SSDs combined in a ZFS RAID as `/home`
  - Other SSDs should be set up as `/scratchN` where N starts at 1
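
As a rough illustration of the disk step above, here is a minimal sketch of the pool layout, assuming the "ZFS RAID" for `/home` means a two-disk mirror and that each remaining SSD becomes its own `/scratchN` pool; the device paths are placeholders, not the actual hardware inventory:

```python
#!/usr/bin/env python3
"""Sketch of the H2 disk layout. Device paths below are placeholders and
must be replaced with the real /dev/disk/by-id entries on the machine."""
import subprocess

# Hypothetical device IDs -- assumptions, not the actual hardware inventory.
HOME_DISKS = ["/dev/disk/by-id/nvme-home-0", "/dev/disk/by-id/nvme-home-1"]
SCRATCH_DISKS = ["/dev/disk/by-id/nvme-scratch-0", "/dev/disk/by-id/nvme-scratch-1"]

def run(cmd):
    """Echo and execute a command, failing loudly on error."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# /home: the two 8 TB SSDs in a ZFS mirror for redundancy.
run(["zpool", "create", "-m", "/home", "home", "mirror"] + HOME_DISKS)

# /scratchN: one single-disk pool per remaining SSD, numbered from 1.
for n, disk in enumerate(SCRATCH_DISKS, start=1):
    run(["zpool", "create", "-m", f"/scratch{n}", f"scratch{n}", disk])
```
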
- LP: Verify and confirm
- LP: Copy `/etc/passwd`, `/etc/shadow`, `/etc/group`, `/etc/gshadow`, `/etc/subuid`, `/etc/subgid` from H1 to H2
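
A minimal sketch of that account-database copy, assuming it is pulled onto H2 from H1 over SSH as root; the H1 address is an assumption, and in practice the files may need to be merged with H2's own system accounts rather than overwritten blindly:

```python
#!/usr/bin/env python3
"""Sketch: pull the user/group databases from the old head node (H1)."""
import subprocess

OLD_HEAD = "root@sapling.stanford.edu"  # assumed current address of H1

ACCOUNT_FILES = [
    "/etc/passwd", "/etc/shadow", "/etc/group",
    "/etc/gshadow", "/etc/subuid", "/etc/subgid",
]

for path in ACCOUNT_FILES:
    # rsync -a preserves ownership and permissions of the copied files.
    subprocess.run(["rsync", "-a", f"{OLD_HEAD}:{path}", path], check=True)
```
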
- LP: Verify that H2 can be rebooted through `sudo reboot` or similar without losing access or any critical services
- LP: Install SLURM
- LP: Install MPI
- LP: Install CUDA toolkit
- LP: Install Docker
- LP: Install HTTP server
- LP: Install module system
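
Once the installs above are done, a quick sanity check could look like the following sketch; the exact commands are assumptions about how each package ends up installed (for example, which MPI and which module system is chosen):

```python
#!/usr/bin/env python3
"""Sketch of a sanity check that the LP-installed stack responds on H2."""
import subprocess

CHECKS = {
    "SLURM":        ["sinfo", "--version"],
    "MPI":          ["mpirun", "--version"],
    "CUDA toolkit": ["nvcc", "--version"],
    "Docker":       ["docker", "--version"],
    "HTTP server":  ["curl", "-sSf", "http://localhost/"],
    "Modules":      ["bash", "-lc", "module --version"],
}

for name, cmd in CHECKS.items():
    try:
        subprocess.run(cmd, check=True, capture_output=True)
        print(f"{name}: OK")
    except (OSError, subprocess.CalledProcessError) as err:
        print(f"{name}: FAILED ({err})")
```
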
Choose one compute node (c0002) to move over to the new H2 configuration. Call this machine CN. We will test everything with this node before performing the rest of the migration.
- CS: Do NOT install a new base OS; we'll keep Ubuntu 20.04 on these nodes
- CS: Configure network (IPMI, DHCP, DNS, NTP) on CN
- CS: Configure NFS on CN to access H2's drives (and remove access to H1's drives)
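
For the NFS step, a sketch of what the compute-node side could look like, assuming H2 exports `/home` and `/scratch1` and is reachable as sapling2.stanford.edu; removing the old H1 entries from `/etc/fstab` is left out of the sketch:

```python
#!/usr/bin/env python3
"""Sketch: point a compute node's NFS mounts at the new head node (H2).
Export names and the head-node address are assumptions; confirm the real
exports first (e.g. with `showmount -e`)."""
import subprocess

NEW_HEAD = "sapling2.stanford.edu"
MOUNTS = {
    f"{NEW_HEAD}:/home":     "/home",
    f"{NEW_HEAD}:/scratch1": "/scratch1",
}

fstab_lines = [
    f"{src} {dst} nfs defaults,_netdev 0 0" for src, dst in MOUNTS.items()
]

# Append the new entries to /etc/fstab, then mount everything.
with open("/etc/fstab", "a") as fstab:
    fstab.write("\n".join(fstab_lines) + "\n")
subprocess.run(["mount", "-a"], check=True)
```
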
- LP: Configure SLURM/MPI/CUDA/Docker/CMake/modules on CN
- LP: Verify that jobs are able to be launched on CN
- UR: Verify H2 and CN access and software
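
A sketch of the job-launch check on the test node; the node name c0002 comes from the plan, while everything else (default partition, SLURM commands being on PATH) is assumed:

```python
#!/usr/bin/env python3
"""Sketch: ask SLURM to run a trivial command on the migrated node."""
import subprocess

result = subprocess.run(
    ["srun", "--nodelist=c0002", "--nodes=1", "hostname"],
    capture_output=True, text=True, check=True,
)
print("srun on c0002 returned:", result.stdout.strip())
```
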
- For each CPU node RC:
  - CS: Do NOT install a new base OS; we'll keep Ubuntu 20.04 on these nodes
  - CS: Configure network (IPMI, DHCP, DNS, NTP) on RC
  - CS: Configure NFS on RC to access H2's drives (and remove access to H1's drives)
  - LP: Configure SLURM/MPI/CUDA/Docker/CMake/modules on RC
  - LP: Verify that jobs are able to be launched on RC
  - LP: Re-enable CI jobs on RC
- UR: STOP USING H1 FOR ALL JOBS
- Repeat the preceding "For each CPU node" step, but for the GPU nodes RG
- LP: Copy the contents of H1's disks to H2:
  - Copy H1's `/home` into H2's `/scratch/sapling1/home`
  - Copy H1's `/scratch` into H2's `/scratch/sapling1/scratch`
  - Copy H1's `/scratch2` into H2's `/scratch/sapling1/scratch2`
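
A sketch of that bulk data copy, run on H2; the H1 address is again an assumption, and rsync makes the copy restartable if it is interrupted:

```python
#!/usr/bin/env python3
"""Sketch: archive H1's filesystems under /scratch/sapling1 on H2."""
import subprocess

OLD_HEAD = "root@sapling.stanford.edu"  # assumed address of H1
COPIES = {
    "/home/":     "/scratch/sapling1/home/",
    "/scratch/":  "/scratch/sapling1/scratch/",
    "/scratch2/": "/scratch/sapling1/scratch2/",
}

for src, dst in COPIES.items():
    # -aHAX preserves hard links, ACLs, and xattrs; --numeric-ids keeps
    # ownership stable since H2 reuses H1's passwd/group databases.
    subprocess.run(
        ["rsync", "-aHAX", "--numeric-ids", f"{OLD_HEAD}:{src}", dst],
        check=True,
    )
```
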
- UR: CAN BEGIN USE OF H2
- LP: Migrate the GitHub mirror script to H2
- UR: Verify and confirm final configuration
- CS: Make H2 available under sapling.stanford.edu
- LP/UR: Verify and confirm
- CS: H1 can be decommissioned