PhoenixOS (PhOS) is an OS-level GPU checkpoint/restore (C/R) system. It can transparently C/R processes that use the GPU without requiring any cooperation from the application, a key feature for modern systems such as the cloud. Most importantly, PhOS is the first OS-level C/R system that can execute C/R concurrently, without stopping the execution of the application.
On the CUDA platform, we compared the C/R performance of PhOS with nvidia/cuda-checkpoint:
[Figure: Checkpointing Llama2-13b-chat]

[Figure: Restoring Llama2-13b-chat]
Note that PhOS aims to be a generic design that targets various hardware platforms from different vendors, by providing a set of interfaces to be implemented for each specific hardware platform. We currently provide the C/R implementation on the CUDA platform; support for ROCm and Ascend is under development.
- [Nov. 6, 2024] PhOS is open sourced [Repo] [Documentation]
  - PhOS currently fully supports single-GPU checkpoint and restore
  - We will soon release code for cross-node live migration and multi-GPU support :)
- [May 20, 2024] The PhOS paper is now released on arXiv [Paper]
PhOS is currently under heavy development. If you're interested in contributing to this project, please join our Slack workspace to follow the upcoming features of PhOS.
- [Clone Repository] First, clone this repository recursively:
git clone --recursive https://github.com/SJTU-IPADS/PhoenixOS.git
- [Start Container] PhOS can be built and installed on an official vendor image.
NOTE: PhOS requires libc6 >= 2.29 for compiling CRIU from source.
For example, to run PhOS with CUDA 11.3, one can build on an official CUDA image (e.g., nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04):
# enter repository
cd PhoenixOS
# start container
sudo docker run -dit --gpus all \
    -v .:/root \
    --privileged --network=host --ipc=host \
    --name phos nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04
# enter container
sudo docker exec -it phos /bin/bash
Note that it is important to run the Docker container with root privileges, as CRIU needs permission to C/R kernel-space memory pages.
- [Download Necessary Assets] PhOS relies on some assets to build and test. Download them by running the following commands:
# inside container
# install basic dependencies from OS pkg manager
apt-get update
apt-get install git wget
# download assets
cd /root/scripts/build_scripts
bash download_assets.sh
- [Build] Building PhOS is simple!
PhOS provides a convenient build system, which covers compiling, linking and installing all PhOS components:

| Component | Description |
|---|---|
| phos-autogen | Autogen Engine for generating most of the Parser and Worker code for a specific hardware platform, based on lightweight notation. |
| phosd | PhOS Daemon, which continuously runs in the background, taking over control of all GPU devices on the node. |
| libphos.so | PhOS Hijacker, which hijacks all GPU API calls on the client side and forwards them to the PhOS Daemon. |
| libpccl.so | PhOS Checkpoint Communication Library (PCCL), which provides highly-optimized device-to-device state migration. Note that this library is not included in the current release. |
| unit-testing | Unit Tests for PhOS, which are based on GoogleTest. |
| phos-cli | Command Line Interface (CLI) for interacting with PhOS. |
| phos-remoting | Remoting Framework, which provides highly-optimized GPU API remoting performance. See more details at SJTU-IPADS/PhoenixOS-Remoting. |

To build and install all of the above components and other dependencies, simply run the build script in the container:
# inside container
cd /root/scripts/build_scripts
# clear old build cache
#   -c: clear previous build
#   -3: the clean process involves all third-parties
bash build.sh -c -3
# start building
#   -3: the build process involves all third-parties
#   -i: install after successful building
bash build.sh -3 -i
To customize build options, refer to and modify the available options in scripts/build_scripts/build_config.yaml.
If you encounter any build issues, you can find the build logs under build_log. Please open a new issue if things are stuck :-|
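If the CRIU part of the build fails, one thing worth checking is the libc6 >= 2.29 requirement noted in the container step. The following is a minimal sketch of such a check, to be run inside the container. The helper name version_ge is hypothetical and not part of PhOS, and the sketch assumes a glibc-based image where `ldd --version` reports the glibc version and GNU `sort -V` is available.

```shell
# Hypothetical helper (not part of PhOS): succeeds if version $1 >= version $2.
version_ge() {
    # sort -V orders version strings; if the smallest of the two is $2, then $1 >= $2
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Take the last field of the first line of `ldd --version`,
# e.g. "ldd (Ubuntu GLIBC 2.31-0ubuntu9) 2.31" yields "2.31".
glibc_version="$(ldd --version 2>/dev/null | head -n1 | awk '{print $NF}')"

if [ -z "$glibc_version" ]; then
    echo "could not determine the glibc version" >&2
elif version_ge "$glibc_version" "2.29"; then
    echo "glibc $glibc_version satisfies the >= 2.29 requirement"
else
    echo "glibc $glibc_version is older than 2.29" >&2
fi
```

On an image that fails this check, switching to a newer base image is likely simpler than upgrading libc6 in place.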
Will soon be updated :)
Once PhOS is successfully installed, you can try running your program with PhOS support!
For more details, refer to the examples for step-by-step tutorials on running PhOS.
- Start the PhOS daemon (phosd), which takes over all GPU resources on the node:
pos_cli --start --target daemon
- To run your program with PhOS support, put a yaml configuration file under the directory that your program regards as $PWD. This file contains all the information PhOS needs to hijack your program. An example file looks like:
# [Field] name of the job
# [Note] job with same name would share some resources in posd, e.g., CUModule, etc.
job_name: "llama2-13b-chat-hf"
# [Field] remote address of posd, default is local
daemon_addr: "127.0.0.1"
- You are ready to launch! Run your program with the env $phos prefix, for example:
env $phos python3 train.py
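The configuration step above can be scripted when several jobs need their own working directories. The sketch below writes the two fields from the example file into a given directory. The function name write_phos_config and the file name phos.yaml are assumptions made for illustration only; check the repository examples for the file name that PhOS actually expects.

```shell
# Hypothetical helper (not part of PhOS): write a job config into a working directory.
# NOTE: the file name "phos.yaml" is an assumption, not taken from the PhOS docs.
write_phos_config() {
    workdir="$1"
    job_name="$2"
    daemon_addr="${3:-127.0.0.1}"   # default is the local daemon
    mkdir -p "$workdir"
    cat > "$workdir/phos.yaml" <<EOF
# [Field] name of the job
job_name: "$job_name"
# [Field] remote address of posd, default is local
daemon_addr: "$daemon_addr"
EOF
}

# Example: create a config in a scratch directory and print it
write_phos_config "${TMPDIR:-/tmp}/phos-demo" "llama2-13b-chat-hf"
cat "${TMPDIR:-/tmp}/phos-demo/phos.yaml"
```

The program would then be started from that directory so that it is the program's $PWD.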
To pre-dump your program, which saves the CPU & GPU state without stopping its execution, simply run:
# create directory to store checkpoint files
mkdir -p /root/ckpt
# pre-dump command
pos_cli --pre-dump --dir /root/ckpt --pid [your program's pid]
To dump your program, which saves the CPU & GPU state and stops its execution, simply run:
# create directory to store checkpoint files
mkdir -p /root/ckpt
# dump command
pos_cli --dump --dir /root/ckpt --pid [your program's pid]
To restore your program, simply run:
# restore command
pos_cli --restore --dir /root/ckpt
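The pre-dump and dump commands differ only in the mode flag, so they can be wrapped in one small function. The sketch below is a hypothetical wrapper, not part of PhOS: the name phos_ckpt and the DRY_RUN switch are assumptions, and it uses only the pos_cli flags shown above. With DRY_RUN=1 (the default here) it prints the command instead of running it, which makes it safe to try without a running daemon.

```shell
# Hypothetical wrapper (not part of PhOS) around the pos_cli commands shown above.
# DRY_RUN=1 only prints the command; set DRY_RUN=0 to actually invoke pos_cli.
DRY_RUN="${DRY_RUN:-1}"

phos_ckpt() {
    mode="$1"   # "pre-dump" (keep running) or "dump" (stop the program)
    dir="$2"    # directory that stores the checkpoint files
    pid="$3"    # pid of the program to checkpoint
    case "$mode" in
        pre-dump|dump) ;;
        *) echo "unknown mode: $mode" >&2; return 1 ;;
    esac
    mkdir -p "$dir"
    if [ "$DRY_RUN" = "1" ]; then
        echo "pos_cli --$mode --dir $dir --pid $pid"
    else
        pos_cli "--$mode" --dir "$dir" --pid "$pid"
    fi
}

# Example (dry run): 12345 is a placeholder pid
phos_ckpt pre-dump "${TMPDIR:-/tmp}/phos-ckpt-demo" 12345
phos_ckpt dump "${TMPDIR:-/tmp}/phos-ckpt-demo" 12345
```

In a real run, the pid could be looked up with something like `pgrep -n -f 'python3 train.py'`, which returns the newest process whose command line matches.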
For more details, please check our paper.
If you use PhOS in your research, please cite our paper:
@article{huang2024parallelgpuos,
title={PARALLELGPUOS: A Concurrent OS-level GPU Checkpoint and Restore System using Validated Speculation},
author={Huang, Zhuobin and Wei, Xingda and Hao, Yingyi and Chen, Rong and Han, Mingcong and Gu, Jinyu and Chen, Haibo},
journal={arXiv preprint arXiv:2405.12079},
year={2024}
}
Please check mailmap for all contributors.