This repository contains the evaluation artifacts of our SoCC'22 paper Efficient Federated Learning via Guided Asynchronous Training. It is implemented by extending an early version of Plato, a new scalable federated learning research framework.
All the evaluations are done in cluster deployment. In what follows, we assume you have an AWS account and can start a cluster of EC2 nodes. (Technically you can also run atop an existing cluster where you have sudo previliges with steps similar to what are shown below; however, this is not documented.)
- Getting Started
- To run and manipulate a cluster of nodes in the AWS public cloud.
- Running Experiments
- To replicate the experimental results presented in the paper.
- Repo Structure
- Notes
- Contact
In your host machine, you should be able to directly work in a Python 3 Anaconda environment with some dependencies installed. For ease of use, we provide a script for doing so:
# starting from [project folder]
cd experiments/dev
bash standalone_install.sh
# then you may need to exit your current shell and log in again for conda to take effect
One should have an AWS account. Also, at your host machine, one should have installed the latest aws-cli with credentials well configured (so that we can manipulate all the nodes in the cluster remotely via command line tools.).
Reference
- Install AWS CLI.
- Example command for installing into Linux x86 (64-bit):
# done anywhere, e.g., at your home directory curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" sudo apt install unzip unzip awscliv2.zip sudo ./aws/install
- Configure AWS CLI.
- Example command for configuring one's AWS CLI:
where one sequentially input AWS Access Key ID, AWS Secret Access Key, Default region name and default output format.# can be done anywhere, e.g., at your home directory aws configure
Think of a name for the working folder and then create it by copying the
basic cluster management tools to the folder
[project folder]/experiments
.
Here is the example code where we create a folder named socc22_ae
(where ae
means artifact evalutation):
# starting from [project folder]
cd experiments
cp -r cluster_folder_template socc22_ae
cd socc22_ae
We will use this example folder name to demonstrate the rest of steps.
After creating the folder, you need to make necessary modifications to the following cluster configuration files (see the comments for details):
# make some modifications that suit your need/budget
# 1. EC2-related
vim ec2_node_template.yml
# relevant key:
# BlockDeviceMappings/Ebs/VolumeSize: how large is the storage of each node
# KeyName: the path to the key file (relative to ~/.ssh/) you plan to use to
# log into each node of the cluster from your host machine
vim run.sh
# relevant variable:
# LOCAL_PRIVATE_KEY: the path to the key file (should be absolute) you plan to use to
# log into each node of the cluster from your host machine,
# i.e., the same as the one you configure in ec2_node_template.yml
# 2. cluster-related
vim ec2_cluster_config.yml
# relevant key:
# type and region for the server each client
# p.s. the provided images are specifically of Ubuntu 18.04
# 3. Github-related
vim setup.sh
# relevant variable:
# GITHUB_REPO: please change it to your forked repo address
# Otherwise you cannot apply your customization to the cluster
# conveniently using our provided commands (later introduced in 2.3)
After modifying those configuration files, you can launch a cluster using commands:
# starting from [project folder]/experiments/socc22_ae
bash manage_cluster.sh launch
The sample stdout result looks like:
Launching 21 nodes ...
All 21 nodes are launched! Waiting for ready ...
All 21 nodes are ready. Collecting public IP addresses ...
Remarks
-
We provide you with
manage_cluster.sh
for a wide range of cluster-related control:# start an already launch cluster bash manage_cluster.sh start # stop a running cluster bash manage_cluster.sh stop # restart a running cluster bash manage_cluster.sh reboot # terminate an already launch cluster bash manage_cluster.sh terminate # show the public IP addresses for each node bash manage_cluster.sh show
# starting from [project folder]/experiments/socc22_ae
bash setup.sh install
bash setup.sh deploy_cluster
The sample stdout results for the two commands look like:
Initialized node pisces-worker-1 (3.145.144.92).
...(similar lines)...
Updated the repo Pisces-private on pisces-worker-1 (3.145.144.92).
...(similar lines)...
Standalone installation finished on node pisces-worker-1 (3.145.144.92).
...(similar lines)...
and
Cluster server deployed on pisces-worker-1 (3.145.144.92).
...(similar lines)...
, respectively.
Remarks
- Make sure that your
~/.ssh/
has the correct key file that you specified in../dev/ec2_node_template.yml
. - We provide you with
setup.sh
for a wide-range of application-related control:# update all running nodes' Github repo bash setup.sh update # add pip package to the used conda environment for all running nodes bash setup.sh add_dependency [package name (w/ or w/o =version)]
In the folder [project folder]/experiments/exp_config_examples
,
you can find the configuration files for running the experiments
mentioned in the paper.
For example, to replicate the Fig. 2 in the paper,
you can run all the experiments configured in
[project folder]/experiments/exp_config_examples/fig2
.
Of course, you can also try your own experiments
by customizing the configuration files.
The above-mentioned examples can serve as templates for you.
Once you have prepared a legitimate configuration file for a task,
you can put it in your created experiment folder and then use
the script cluster_run.sh
to help you launch the task.
For example, if you want to run one of the experiment that is
configured by mnist-pisces.yml
to replicate Fig. 7,
you can use the command
# starting from [project folder]/experiments/socc22_ae
cp -r ../exp_config_examples/fig7to9/ .
bash run.sh start_a_task fig7to9/mnist-pisces.yml
The sample stdout result looks like this
Task 20221031-194258 started on pisces-worker-1 (3.145.144.92).
...(similar lines)...
Use the command at local host if you want to stop it:
bash run.sh kill_a_task fig7to9/20221031-194258
Use the command at local host if you want to retrieve the log:
bash run.sh conclude_a_task fig7to9/20221031-194258
Use the command at a computing node if you want to watch its state:
vim ~/Pisces/experiments/socc22_ae/fig7to9/20221031-194258/log.txt
We suggest you try this example first to see if you can get meaningful results.
Remark
- We provide you with
cluster_run.sh
for a wide range of task-related control (no need to remember, because the prompt will inform you of them whenever you start a task, as mentioned above):# for killing the task halfway bash run.sh kill_a_task [target folder]/[some timestamp] # for fetching logs from all running nodes to the host machine # (i.e., the machine where you type this command) bash run.sh conclude_a_task [target folder]/[some timestamp]
Repo Root
|---- plato # Core implementation
|---- experiments # Evaluation
|---- cluster_folder_template # Basic tools for managing a cluster
|---- dev # Cluster management backend
|---- exp_config_examples # Configuration files used in the paper
|---- packages # External Python Package (e.g., YOLOv5)
Please consider citing our paper if you use the code or data in your research project.
@inproceedings{pisces-socc22,
author={Jiang, Zhifeng and Wang, Wei and Li, Baochun and Li, Bo},
title={Pisces: Efficient Federated Learning via Guided Asynchronous Training},
year={2022},
isbn={9781450394147},
publisher={Association for Computing Machinery},
booktitle={Proceedings of the 13th Symposium on Cloud Computing},
url={https://doi.org/10.1145/3542929.3563463},
doi={10.1145/3542929.3563463},
pages={370–385},
}
Zhifeng Jiang (zjiangaj@cse.ust.hk).