Elliott Slaughter
February 2023
I'd like to propose the following division of responsibilities. The three main groups are:
- Stanford CS staff (Andrej, Brian, Jimmy):
  - Hardware installation and maintenance
  - Base OS installation
  - DNS
  - Time (NTP via time.stanford.edu)
  - Filesystems (including NFS)
  - CUDA drivers (if applicable), but NOT CUDA toolkit/software
  - Infiniband: OFED, nvidia-fm, p2p rdma, etc.
  - SSL certificates
- Legion project members responsible for administering the machine (Elliott, etc.):
  - CUDA toolkit/software
  - SLURM
  - MPI
  - Docker
  - HTTP server
  - A very small number of additional packages (e.g., CMake, compilers)
- Sapling users:
  - All other software packages (installed on a per-user basis)
Here are the steps I envision taking. I use some abbreviations to simplify the instructions.
Roles:

- CS: Stanford CS staff
- LP: Legion project members
- UR: Users

Machines:

- H1: old head node
- H2: new head node
- CN: one CPU compute node, for initial testing
- RC: remaining CPU compute nodes (`c000*` and `n000*`)
- RG: remaining GPU compute nodes (`g000*`)
- CS: Install Ubuntu 20.04 base OS on H2
- CS: Make H2 available via private IPMI as head2-ipmi
- CS: Make H2 available via public DNS as sapling2.stanford.edu
- CS: Make H2 available via public SSH
- CS: Set up DNS on H2 such that it can access H1 and compute nodes
- CS: Configure disks on H2:
  - Two 8 TB SSDs combined in a ZFS RAID as `/home`
  - Other SSDs should be set up as `/scratchN` where N starts at 1
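
As a rough illustration of the disk step above, here is a minimal sketch of the pool layout, assuming the "ZFS RAID" for `/home` means a two-disk mirror and that each remaining SSD becomes its own `/scratchN` pool; the device paths are placeholders, not the actual hardware inventory:

```python
#!/usr/bin/env python3
"""Sketch of the H2 disk layout. Device paths below are placeholders and
must be replaced with the real /dev/disk/by-id entries on the machine."""
import subprocess

# Hypothetical device IDs -- assumptions, not the actual hardware inventory.
HOME_DISKS = ["/dev/disk/by-id/nvme-home-0", "/dev/disk/by-id/nvme-home-1"]
SCRATCH_DISKS = ["/dev/disk/by-id/nvme-scratch-0", "/dev/disk/by-id/nvme-scratch-1"]

def run(cmd):
    """Echo and execute a command, failing loudly on error."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# /home: the two 8 TB SSDs in a ZFS mirror for redundancy.
run(["zpool", "create", "-m", "/home", "home", "mirror"] + HOME_DISKS)

# /scratchN: one single-disk pool per remaining SSD, numbered from 1.
for n, disk in enumerate(SCRATCH_DISKS, start=1):
    run(["zpool", "create", "-m", f"/scratch{n}", f"scratch{n}", disk])
```
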
- LP: Verify and confirm
- LP: Copy `/etc/passwd`, `/etc/shadow`, `/etc/group`, `/etc/gshadow`, `/etc/subuid`, `/etc/subgid` from H1 to H2
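
A minimal sketch of that account-database copy, assuming it is pulled onto H2 from H1 over SSH as root; the H1 address is an assumption, and in practice the files may need to be merged with H2's own system accounts rather than overwritten blindly:

```python
#!/usr/bin/env python3
"""Sketch: pull the user/group databases from the old head node (H1)."""
import subprocess

OLD_HEAD = "root@sapling.stanford.edu"  # assumed current address of H1

ACCOUNT_FILES = [
    "/etc/passwd", "/etc/shadow", "/etc/group",
    "/etc/gshadow", "/etc/subuid", "/etc/subgid",
]

for path in ACCOUNT_FILES:
    # rsync -a preserves ownership and permissions of the copied files.
    subprocess.run(["rsync", "-a", f"{OLD_HEAD}:{path}", path], check=True)
```
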
- LP: Verify that H2 can be rebooted through `sudo reboot` or similar without losing access or any critical services
- LP: Install SLURM
- LP: Install MPI
- LP: Install CUDA toolkit
- LP: Install Docker
- LP: Install HTTP server
- LP: Install module system
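
Once the installs above are done, a quick sanity check could look like the following sketch; the exact commands are assumptions about how each package ends up installed (for example, which MPI and which module system is chosen):

```python
#!/usr/bin/env python3
"""Sketch of a sanity check that the LP-installed stack responds on H2."""
import subprocess

CHECKS = {
    "SLURM":        ["sinfo", "--version"],
    "MPI":          ["mpirun", "--version"],
    "CUDA toolkit": ["nvcc", "--version"],
    "Docker":       ["docker", "--version"],
    "HTTP server":  ["curl", "-sSf", "http://localhost/"],
    "Modules":      ["bash", "-lc", "module --version"],
}

for name, cmd in CHECKS.items():
    try:
        subprocess.run(cmd, check=True, capture_output=True)
        print(f"{name}: OK")
    except (OSError, subprocess.CalledProcessError) as err:
        print(f"{name}: FAILED ({err})")
```
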
Choose one compute node (c0002) to move over to the new H2 configuration. Call this machine CN. We will test everything with this node before performing the rest of the migration.
- CS: Do NOT install a new base OS; we'll keep Ubuntu 20.04 on these nodes
- CS: Configure network (IPMI, DHCP, DNS, NTP) on CN
- CS: Configure NFS on CN to access H2's drives (and remove access to H1's drives)
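
For the NFS step, a sketch of what the compute-node side could look like, assuming H2 exports `/home` and `/scratch1` and is reachable as sapling2.stanford.edu; removing the old H1 entries from `/etc/fstab` is left out of the sketch:

```python
#!/usr/bin/env python3
"""Sketch: point a compute node's NFS mounts at the new head node (H2).
Export names and the head-node address are assumptions; confirm the real
exports first (e.g. with `showmount -e`)."""
import subprocess

NEW_HEAD = "sapling2.stanford.edu"
MOUNTS = {
    f"{NEW_HEAD}:/home":     "/home",
    f"{NEW_HEAD}:/scratch1": "/scratch1",
}

fstab_lines = [
    f"{src} {dst} nfs defaults,_netdev 0 0" for src, dst in MOUNTS.items()
]

# Append the new entries to /etc/fstab, then mount everything.
with open("/etc/fstab", "a") as fstab:
    fstab.write("\n".join(fstab_lines) + "\n")
subprocess.run(["mount", "-a"], check=True)
```
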
- LP: Configure SLURM/MPI/CUDA/Docker/CMake/modules on CN
- LP: Verify that jobs are able to be launched on CN
- UR: Verify H2 and CN access and software
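
A sketch of the job-launch check on the test node; the node name c0002 comes from the plan, while everything else (default partition, SLURM commands being on PATH) is assumed:

```python
#!/usr/bin/env python3
"""Sketch: ask SLURM to run a trivial command on the migrated node."""
import subprocess

result = subprocess.run(
    ["srun", "--nodelist=c0002", "--nodes=1", "hostname"],
    capture_output=True, text=True, check=True,
)
print("srun on c0002 returned:", result.stdout.strip())
```
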
- For each CPU node RC:
  - CS: Do NOT install a new base OS; we'll keep Ubuntu 20.04 on these nodes
  - CS: Configure network (IPMI, DHCP, DNS, NTP) on RC
  - CS: Configure NFS on RC to access H2's drives (and remove access to H1's drives)
  - LP: Configure SLURM/MPI/CUDA/Docker/CMake/modules on RC
  - LP: Verify that jobs are able to be launched on RC
  - LP: Re-enable CI jobs on RC
- UR: STOP USING H1 FOR ALL JOBS
- Repeat the preceding "For each CPU node" step, but for the GPU nodes RG
- LP: Copy the contents of H1's disks to H2:
  - Copy H1's `/home` into H2's `/scratch/sapling1/home`
  - Copy H1's `/scratch` into H2's `/scratch/sapling1/scratch`
  - Copy H1's `/scratch2` into H2's `/scratch/sapling1/scratch2`
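
A sketch of that bulk data copy, run on H2; the H1 address is again an assumption, and rsync makes the copy restartable if it is interrupted:

```python
#!/usr/bin/env python3
"""Sketch: archive H1's filesystems under /scratch/sapling1 on H2."""
import subprocess

OLD_HEAD = "root@sapling.stanford.edu"  # assumed address of H1
COPIES = {
    "/home/":     "/scratch/sapling1/home/",
    "/scratch/":  "/scratch/sapling1/scratch/",
    "/scratch2/": "/scratch/sapling1/scratch2/",
}

for src, dst in COPIES.items():
    # -aHAX preserves hard links, ACLs, and xattrs; --numeric-ids keeps
    # ownership stable since H2 reuses H1's passwd/group databases.
    subprocess.run(
        ["rsync", "-aHAX", "--numeric-ids", f"{OLD_HEAD}:{src}", dst],
        check=True,
    )
```
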
- UR: CAN BEGIN USE OF H2
- LP: Migrate the GitHub mirror script to H2
- UR: Verify and confirm final configuration
- CS: Make H2 available under sapling.stanford.edu
- LP/UR: Verify and confirm
- CS: H1 can be decommissioned