Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models
Guanting Dong, Keming Lu, Chengpeng Li, Tingyu Xia, Bowen Yu, Chang Zhou, Jingren Zhou
Qwen, Alibaba Inc.
This repository contains the core implementation of AutoIF, proposed in Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models.
AutoIF is the first scalable and reliable method for automatically generating instruction-following data and verifying its quality using code execution feedback.
We divide AutoIF's data synthesis process into steps and provide 10-20 samples per step to facilitate reproduction. Please remember to replace them with your own input.
General Setup Environment:
- Python 3.9
- PyTorch (currently tested on version 2.1.2+cu121)
- Transformers (version 4.41.2; unlikely to work with lower versions)
cd ./AutoIF/
pip install -r requirements.txt
Firstly, we hand-write 36 seed instructions:
Step1: Self-instruct Seed Instructions
Concatenate the instruction with the RFT prompt.
python 1_RFT.py
Please run RFT K times with a supervision model (e.g., GPT-4, Qwen2-72B) and save the results in the same format as seed_instruction.txt.
Step2: Verification Funcs and Cases Generation
Use the seed and augmented instructions to generate verification functions and test cases.
python 2_verification_funcs_cases_generation.py
Please generate K verification functions and test cases for each sample, and save them in eval_func_rft.jsonl.
Step3: Quality Cross-validation
Cross-validate the pass rates of verification functions and cases to ensure high-quality instructions.
python 3_cross_validation.py
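The cross-validation idea can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the logic of 3_cross_validation.py: the function name `evaluate`, the case fields `input` and `output`, and the 0.5 thresholds are all hypothetical choices made for the example.

```python
from typing import Callable, Dict, List, Optional, Tuple


def load_func(src: str, name: str = "evaluate") -> Optional[Callable]:
    """Compile one candidate verification function; return None on failure."""
    namespace: dict = {}
    try:
        exec(src, namespace)  # only execute trusted or sandboxed code
    except Exception:
        return None
    func = namespace.get(name)
    return func if callable(func) else None


def passes(func: Callable, case: Dict) -> bool:
    """True if the function reproduces the expected label of a test case."""
    try:
        return bool(func(case["input"])) == bool(case["output"])
    except Exception:
        return False


def cross_validate(
    func_srcs: List[str], cases: List[Dict], min_rate: float = 0.5
) -> Tuple[List[str], List[Dict]]:
    """Keep cases that most functions agree with, then functions that pass most kept cases."""
    funcs = []
    for src in func_srcs:
        func = load_func(src)
        if func is not None:  # drop functions that do not compile
            funcs.append((src, func))
    if not funcs or not cases:
        return [], []

    kept_cases = [
        c for c in cases
        if sum(passes(f, c) for _, f in funcs) / len(funcs) >= min_rate
    ]
    if not kept_cases:
        return [], []

    kept_funcs = [
        src for src, f in funcs
        if sum(passes(f, c) for c in kept_cases) / len(kept_cases) >= min_rate
    ]
    return kept_funcs, kept_cases
```

An instruction whose `kept_funcs` or `kept_cases` list comes back empty would be discarded in this sketch.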
Step4 & 5: Back Translation
Please back-translate the verification functions into instructions, and then use mDeBERTa for consistency filtering.
python 4_eval_func_backtranslator.py
python 5_eval_func_backtranslator_filter.py
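The consistency filter can be pictured as below. This is a sketch only: the field names `instruction` and `back_translation`, the `nli_label` callable, and the rule of dropping only contradictions are assumptions for illustration. In practice `nli_label` would wrap an mDeBERTa NLI model; the repository script may apply a different rule.

```python
from typing import Callable, Dict, List


def consistency_filter(
    samples: List[Dict], nli_label: Callable[[str, str], str]
) -> List[Dict]:
    """Keep samples whose back-translated instruction does not contradict the original.

    nli_label(premise, hypothesis) is a hypothetical callable that returns one of
    "entailment", "neutral", or "contradiction".
    """
    kept = []
    for sample in samples:
        label = nli_label(sample["instruction"], sample["back_translation"])
        if label != "contradiction":
            kept.append(sample)
    return kept
```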
Step1: Query Reforming and Augmentation
We randomly concatenate each query with K queries from ShareGPT and reformat them using our response RFT template:
python 6_concat_sharegpt_query.py
Please use the supervision model to generate K responses for each query.
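A minimal sketch of the concatenation step, assuming the items being augmented are the verified instructions from the previous stage. The `TEMPLATE` string and the output field names are placeholders; the actual response RFT template is defined in 6_concat_sharegpt_query.py.

```python
import random
from typing import Dict, List

# Hypothetical template; replace with the repository's response RFT template.
TEMPLATE = "{query}\n{instruction}"


def build_prompts(
    instructions: List[str], sharegpt_queries: List[str], k: int, seed: int = 0
) -> List[Dict]:
    """Pair each instruction with k distinct, randomly sampled ShareGPT queries."""
    rng = random.Random(seed)  # fixed seed for reproducible sampling
    prompts = []
    for instruction in instructions:
        for query in rng.sample(sharegpt_queries, k):  # requires k <= len(sharegpt_queries)
            prompts.append({
                "instruction": instruction,
                "query": query,
                "prompt": TEMPLATE.format(query=query, instruction=instruction),
            })
    return prompts
```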
Step2: Instruction-following Verification
Cross-validate the pass rate of verification functions and augmented responses to obtain high-quality queries.
python 7_query_vertification.py
In this step, we also concatenate each sample with a consistency scoring prompt. Please score them using the supervision model.
Step3: Query Quality Verification
Finally, we select the samples with score > 8 and save them in LLaMA-Factory's SFT data format.
python 8_query_score_filiter.py
python 9_sft_data_construction.py
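The two scripts above can be approximated by the sketch below. The input field names `query`, `response`, and `score` are assumptions; the output uses the alpaca-style `instruction` / `input` / `output` layout that LLaMA-Factory accepts for SFT data.

```python
from typing import Dict, List


def to_sft_records(samples: List[Dict], threshold: float = 8) -> List[Dict]:
    """Keep samples scored strictly above the threshold and convert them to SFT records."""
    records = []
    for sample in samples:
        if sample["score"] > threshold:
            records.append({
                "instruction": sample["query"],
                "input": "",
                "output": sample["response"],
            })
    return records
```

The resulting list can be written with `json.dump` and registered as a dataset in LLaMA-Factory.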
✨Tips: In our paper, DPO has two settings, which differ as follows:
- Offline DPO: the responses come from your SFT data generated by the supervision model.
- Online DPO: the responses are generated by your base model during each training iteration.
Please process your SFT data using the eval functions generated in the previous step, and format the results as dpo_query_eval_score_results.jsonl.
Step1: Verification Funcs Scoring
We verify the pass rate of each response using its corresponding verification functions.
python 1_dpo_rft_wash.py
Step2: Data Selection
We construct DPO pairs from positive samples (Acc>=0.5) and negative samples (Acc=0).
python 2_dpo_data_query_construct.py
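The pairing rule can be illustrated as follows. This is a sketch under assumptions: the field names `query`, `response`, and `acc` are hypothetical, and pairing every positive with every negative for the same query is one possible strategy that may differ from what 2_dpo_data_query_construct.py does.

```python
from collections import defaultdict
from itertools import product
from typing import Dict, List


def build_dpo_pairs(scored: List[Dict], pos_min: float = 0.5) -> List[Dict]:
    """Pair responses with acc >= pos_min (chosen) against responses with acc == 0 (rejected)."""
    by_query = defaultdict(lambda: {"pos": [], "neg": []})
    for item in scored:
        if item["acc"] >= pos_min:
            by_query[item["query"]]["pos"].append(item["response"])
        elif item["acc"] == 0:
            by_query[item["query"]]["neg"].append(item["response"])
        # responses with 0 < acc < pos_min are used on neither side

    pairs = []
    for query, group in by_query.items():
        for chosen, rejected in product(group["pos"], group["neg"]):
            pairs.append({"prompt": query, "chosen": chosen, "rejected": rejected})
    return pairs
```

Queries that lack either a positive or a negative response produce no pairs in this sketch.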
After construction, you need to convert the pairs into LLaMA-Factory's DPO data format.
We use LLaMA-Factory v0.6.3. Thanks for their excellent work.
✨Tips: The differences between our two setups:
- Strong-to-Weak Distillation: we use a powerful model (e.g., GPT-4, Qwen2-72B, Llama3-70B) as the supervision model and a weaker model (e.g., Qwen2-7B, Llama3-8B) as the base model.
- Self-Alignment: we use the same model (e.g., Qwen2-72B, Llama3-70B) as both the supervision model and the base model.
(1) SFT Training:
deepspeed --num_gpus=8 train_bash.py \
--deepspeed $deepspeed_zero3_config_path \
--stage sft \
--do_train \
--use_fast_tokenizer \
--flash_attn \
--adam_beta1 0.9 \
--adam_beta2 0.95 \
--model_name_or_path $MODEL_PATH \
--dataset $dataset \
--template $Template \
--finetuning_type full \
--output_dir $OUTPUT_PATH \
--overwrite_cache \
--overwrite_output_dir \
--warmup_steps 20 \
--weight_decay 0.1 \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 4 \
--ddp_timeout 9000 \
--learning_rate 7e-6 \
--lr_scheduler_type "linear" \
--logging_steps 1 \
--cutoff_len 8192 \
--save_steps 200 \
--num_train_epochs 3.0 \
--plot_loss \
--bf16
(2) DPO Training:
deepspeed --num_gpus 8 train_bash.py \
--deepspeed $deepspeed_zero3_config_path \
--stage dpo \
--do_train \
--model_name_or_path $MODEL_PATH \
--dataset $dataset \
--dataset_dir $DATA_PATH \
--template $Template \
--finetuning_type full \
--output_dir $OUTPUT_PATH \
--overwrite_cache \
--overwrite_output_dir \
--cutoff_len 4096 \
--preprocessing_num_workers 1 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 2 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--warmup_ratio 0.1 \
--save_steps 1000 \
--learning_rate 5e-6 \
--num_train_epochs 2.0 \
--max_samples 200000 \
--ddp_timeout 180000000 \
--plot_loss \
--fp16
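For reference, the effective global batch sizes implied by the two commands above can be computed as follows. The helper name is illustrative, and the calculation assumes all 8 GPUs act as data-parallel workers.

```python
def effective_batch_size(num_gpus: int, per_device_batch_size: int, grad_accum_steps: int) -> int:
    """Number of samples contributing to one optimizer step under data parallelism."""
    return num_gpus * per_device_batch_size * grad_accum_steps


sft_batch = effective_batch_size(8, 4, 4)  # SFT command: 8 GPUs x 4 per device x 4 accumulation steps
dpo_batch = effective_batch_size(8, 1, 2)  # DPO command: 8 GPUs x 1 per device x 2 accumulation steps
print(sft_batch, dpo_batch)  # prints "128 16"
```

If you train on a different number of GPUs, adjust `--gradient_accumulation_steps` to keep these values unchanged.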
For the differences in implementation details between training 7B and 70B models, please refer to our paper.
If you find this work helpful for your research, please kindly cite it.
@article{dong2024self,
title={Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models},
author={Dong, Guanting and Lu, Keming and Li, Chengpeng and Xia, Tingyu and Yu, Bowen and Zhou, Chang and Zhou, Jingren},
journal={arXiv preprint arXiv:2406.13542},
year={2024}
}