llm support by @dinghao Zhou
add causal model

fix typo

rm ckpt

add topk topp sampler
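
For context, a minimal sketch of combined top-k / top-p (nucleus) filtering in PyTorch; it illustrates the general technique only, not necessarily the exact sampler added in this commit:

```python
import torch

def top_k_top_p_sample(logits: torch.Tensor, top_k: int = 50, top_p: float = 0.9) -> torch.Tensor:
    # logits: (batch, vocab). First keep only the top_k largest logits.
    if top_k > 0:
        top_k = min(top_k, logits.size(-1))
        kth = torch.topk(logits, top_k, dim=-1).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    # Nucleus filtering: keep the smallest prefix of sorted tokens whose mass exceeds top_p.
    if top_p < 1.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
        probs = torch.softmax(sorted_logits, dim=-1)
        cum = torch.cumsum(probs, dim=-1)
        remove = cum - probs > top_p  # never removes the most probable token
        sorted_logits = sorted_logits.masked_fill(remove, float("-inf"))
        logits = torch.full_like(logits, float("-inf")).scatter(-1, sorted_idx, sorted_logits)
    return torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1)  # (batch, 1) token ids
```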

fix position

[train_engine] support fsdp (wenet-e2e#2412)

* [train_engine] support fsdp

* [train_engine] support fsdp

* unify scaler and amp

* fp32 && fp16 work in fsdp env

* fix fsdp in cv auto cast

* try to fix wenet.join fsdp

* implementing zero1 under fsdp is almost equivalent to deepspeed's zero1

* fix clip_and_grad_

* fix train summary

* all wenet xxxformer models work (except paraformer and transducer)

* try to fix nan

* add barrier for cv

* add destroy group for end of all train

* refactor wrap methods and ckpt works

* fix ckpt

* fix cv in dtype != float32

* fix ckpt in model mode

* fix bf16 amp

* refactor scaler and autocast, fix fp32 fp16 bf16 for fsdp

* fix fp32 nullcontext to nullcontext()

* modify after review

* fix lint

* fix lint
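
A rough sketch of wrapping a model with PyTorch FSDP and bf16 mixed precision, assuming a process group and a GPU are already set up; the sharding strategy and dtypes here are illustrative, not the exact train_engine code:

```python
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision, ShardingStrategy

# assumes torch.distributed.init_process_group(...) has already been called
model = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 1000)).cuda()
mp = MixedPrecision(param_dtype=torch.bfloat16,
                    reduce_dtype=torch.bfloat16,
                    buffer_dtype=torch.bfloat16)
# SHARD_GRAD_OP shards gradients and optimizer state (roughly ZeRO-2);
# FULL_SHARD additionally shards parameters (roughly ZeRO-3).
model = FSDP(model,
             sharding_strategy=ShardingStrategy.SHARD_GRAD_OP,
             mixed_precision=mp,
             device_id=torch.cuda.current_device())
```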

LoRA support (wenet-e2e#2049)

* support lora for v3.0.1

* format code and update lora attention && encoder

* fix bug when lora_list is None

---------

Co-authored-by: Xingchen Song(宋星辰) <xingchensong1996@163.com>
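
As a reference for the idea, a minimal LoRA linear layer (frozen base weight plus a scaled low-rank update); the attention/encoder integration in wenet may differ:

```python
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = W x + (alpha / r) * B(A(x)), with the base weight W frozen."""

    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)   # freeze the pretrained weight
        self.lora_a = nn.Linear(in_features, r, bias=False)
        self.lora_b = nn.Linear(r, out_features, bias=False)
        nn.init.kaiming_uniform_(self.lora_a.weight, a=math.sqrt(5))
        nn.init.zeros_(self.lora_b.weight)        # delta starts at zero
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))
```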

[env] update python version and deepspeed version (wenet-e2e#2462)

* [env] update python version and deepspeed version

* [env] fix lint

fix rope pos embedding (wenet-e2e#2463)

* fix rope pos embedding

* fix dropout

* fix comment
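
For reference, one common way to apply rotary position embeddings to interleaved even/odd feature pairs; the layout and base value are assumptions, not necessarily what the fixed code uses:

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # x: (batch, head, time, dim) with dim even
    b, h, t, d = x.shape
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2, device=x.device, dtype=torch.float32) / d))
    pos = torch.arange(t, device=x.device, dtype=torch.float32)
    freqs = torch.outer(pos, inv_freq)            # (time, dim/2)
    cos, sin = freqs.cos(), freqs.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]           # interleaved pairs
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin          # rotate each pair by its position angle
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```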

[transformer] add multi warmup and learning rate for different modules (wenet-e2e#2449)

* [transformer] add multi warmup and learning rate for different modules

* fix typo

* it works in warmuplr

* fix lr in tensorboard in step mode

* fix cv log

* cv works

* refactor cv log

* add helper lrs_to_string

* fix lrstr

* fix ddp multiple lr

* fix initial step

* revert to -1

* fix sub params dup

* fix step

* fix step

* fix log

* add assert for scheduler

* add comment for log

---------

Co-authored-by: Xingchen Song(宋星辰) <xingchensong1996@163.com>
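
The per-module learning-rate idea can be sketched with plain PyTorch parameter groups; module names, learning rates and warmup horizons below are made up for illustration:

```python
import torch
import torch.nn as nn

# toy two-module model standing in for e.g. a pretrained encoder and a new projector
model = nn.ModuleDict({"encoder": nn.Linear(80, 256), "decoder": nn.Linear(256, 1000)})
groups = [
    {"params": model["encoder"].parameters(), "lr": 1e-5},
    {"params": model["decoder"].parameters(), "lr": 1e-4},
]
optimizer = torch.optim.AdamW(groups, weight_decay=0.01)

warmups = [25000, 5000]  # one warmup horizon per parameter group
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=[lambda step, w=w: min((step + 1) / w, 1.0) for w in warmups])
```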

add generate

add todo

support sft & pretrain training forward

gemma conversion works

support init causal model

[whisper] limit language to Chinese (wenet-e2e#2470)

[train] convert tensor to scalar (wenet-e2e#2471)

[workflow] upgrade python version to 3.10 (wenet-e2e#2472)

* [workflow] upgrade python version to 3.10

* [workflow] try to pass

refactor cache behaviour in training mode (reduce compute cost and memory) (wenet-e2e#2473)

all gemma model works

fix ut

fix ut (wenet-e2e#2477)

* fix ut

* fix py version

[transformer] Make MoE runnable (wenet-e2e#2474)

[transformer] fix mqa (wenet-e2e#2478)

enable mmap in torch.load (wenet-e2e#2479)
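
Usage is a one-liner: mmap=True (available in torch >= 2.1) maps the checkpoint lazily instead of reading the whole file into host memory; the path below is illustrative:

```python
import torch

state_dict = torch.load("exp/llm/checkpoint.pt", map_location="cpu", mmap=True)
```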

[example] Add deepspeed configs of different stages for illustration (wenet-e2e#2485)

[example] Fix prefetch and step_save (wenet-e2e#2486)

[ctl] simplified ctl (wenet-e2e#2483)

* [ctl] simplified ctl

* [ctl] unify

[branchformer] simplified branchformer (wenet-e2e#2482)

* [transformer] simplified branchformer

* fix yaml

* support mqa gradient ckpt sdpa

* fix gradient checkpoint

* add deepspeed comment in layer dropout

* fix comment

[e_branchformer] simplified e_branchformer (wenet-e2e#2484)

* [e_branchformer] simplified ctl

* try to fix ut

* try to fix ut

* fix activation

* fix att args

* e-branchformer works

[transformer] refactor cache (wenet-e2e#2481)

* [transformer] refactor cache

* fix ut

* unify cache type in branchformer and ebranchformer

fix cache

fix gradient ckpt in branchformer/ebranchformer (wenet-e2e#2488)

fix search after refactor cache (wenet-e2e#2490)

generate works!

unify chat pattern
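
A unified chat pattern can be as small as one template renderer shared across models; the role tags below are purely illustrative, since the real special tokens depend on each model's tokenizer:

```python
def render_chat(messages: list[dict]) -> str:
    # messages: [{"role": "system" | "user" | "assistant", "content": str}, ...]
    parts = [f"<|{m['role']}|>\n{m['content']}\n" for m in messages]
    parts.append("<|assistant|>\n")  # generation continues from the assistant turn
    return "".join(parts)

print(render_chat([{"role": "user", "content": "hello"}]))
```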

convert llama3 works

[transformer] set use_reentrant=False for gradient ckpt (wenet-e2e#2491)
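
Passing the flag explicitly looks like this; the non-reentrant path is the recommended implementation and silences the deprecation warning about the unset default:

```python
import torch
from torch.utils.checkpoint import checkpoint

layer = torch.nn.Linear(256, 256)
x = torch.randn(4, 256, requires_grad=True)
y = checkpoint(layer, x, use_reentrant=False)  # activations are recomputed in backward
```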

[transformer] fix warning: ignore(True) has been deprecated (wenet-e2e#2492)

* [transformer] fix warning: ignore(True) has been deprecated

* [transformer] fix warning: ignore(True) has been deprecated

[log] avoid redundant logging (wenet-e2e#2493)

fix w1 w2 w3 in feedforward

add 70b temporarily

mv LLM to wenet

support llm dataset

unify config

add dataset yaml in script

support llm dataset

dynamic static bucket works

[transformer] refactor mqa repeat (wenet-e2e#2497)

[transformer] fix mqa in cross att (wenet-e2e#2498)
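
The mqa repeat refactor presumably centers on expanding shared key/value heads up to the query-head count before attention; a generic sketch, not the exact wenet implementation:

```python
import torch

def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    """(batch, kv_heads, time, head_dim) -> (batch, kv_heads * n_rep, time, head_dim)."""
    if n_rep == 1:
        return x
    b, kv_heads, t, d = x.shape
    # same result as torch.repeat_interleave(x, n_rep, dim=1)
    x = x[:, :, None, :, :].expand(b, kv_heads, n_rep, t, d)
    return x.reshape(b, kv_heads * n_rep, t, d)
```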

[deepspeed] update json config (wenet-e2e#2499)

training works

pretrain works

refactor convert

fix flash att in generate

llama works

fix llama3

fix speed

try fix ut

support stop tokens in gen and support ppl
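
Two small sketches of what these features typically look like, with assumed token ids and shapes rather than the exact wenet code: stop generation when an end token is sampled, and compute perplexity from per-token cross entropy:

```python
import torch
import torch.nn.functional as F

stop_tokens = {2, 128001}  # assumed eos ids; the real ones come from the tokenizer/config

def should_stop(next_token_id: int) -> bool:
    return next_token_id in stop_tokens

def perplexity(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # logits: (batch, time, vocab); labels: (batch, time), padding marked with -100
    nll = F.cross_entropy(logits.transpose(1, 2), labels,
                          ignore_index=-100, reduction="mean")
    return torch.exp(nll)
```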

Mddct authored and Your Name committed Aug 7, 2024
1 parent 648fee8 commit 57a04ce
Showing 75 changed files with 3,276 additions and 1,146 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/doc.yml
@@ -19,8 +19,8 @@ jobs:
fail-fast: false
matrix:
os: [ubuntu-latest]
torch: ["1.13.1"]
python-version: ["3.8"]
torch: ["2.2.2"]
python-version: ["3.10"]
steps:
- uses: actions/checkout@v1

6 changes: 3 additions & 3 deletions .github/workflows/lint.yml
@@ -32,7 +32,7 @@ jobs:
- name: Setup Python
uses: actions/setup-python@v1
with:
-python-version: 3.9
+python-version: 3.10.14
architecture: x64
- name: Fetch Wenet
uses: actions/checkout@v1
@@ -60,7 +60,7 @@ jobs:
- name: Setup Python
uses: actions/setup-python@v1
with:
-python-version: 3.x
+python-version: 3.10.14
architecture: x64
- name: Fetch Wenet
uses: actions/checkout@v1
@@ -88,7 +88,7 @@ jobs:
- name: Setup Python
uses: actions/setup-python@v1
with:
-python-version: 3.x
+python-version: 3.10.14
architecture: x64
- name: Fetch Wenet
uses: actions/checkout@v1
2 changes: 1 addition & 1 deletion .github/workflows/unit_test.yml
@@ -12,7 +12,7 @@ jobs:
max-parallel: 20
matrix:
os: [ubuntu-latest]
-python-version: [3.8]
+python-version: [3.10.14]
steps:
- name: Cache Python Packages
uses: actions/cache@v1
4 changes: 2 additions & 2 deletions .github/workflows/wheels.yml
@@ -21,13 +21,13 @@ jobs:
# Used to host cibuildwheel
- uses: actions/setup-python@v3
with:
-python-version: '3.6'
+python-version: '3.10'

- name: Build wheels
uses: pypa/cibuildwheel@v2.11.2
env:
CIBW_BUILD_VERBOSITY: 1
CIBW_BUILD: "cp36-* cp37-* cp38-* cp39-*"
CIBW_BUILD: "cp36-* cp37-* cp38-* cp39-* cp310-*"
# Disable building PyPy wheels on all platforms
# Skip 32-bit builds
CIBW_SKIP: "pp* *-win32 *-manylinux_i686 *-musllinux_*"
2 changes: 1 addition & 1 deletion README.md
@@ -56,7 +56,7 @@ git clone https://github.com/wenet-e2e/wenet.git
- Create Conda env:

``` sh
-conda create -n wenet python=3.8
+conda create -n wenet python=3.10
conda activate wenet
conda install conda-forge::sox
pip install -r requirements.txt
2 changes: 1 addition & 1 deletion examples/aishell/s0/conf/train_ebranchformer.yaml
@@ -18,7 +18,7 @@ encoder_conf:
activation_type: 'swish'
causal: false
pos_enc_layer_type: 'rel_pos'
-attention_layer_type: 'rel_selfattn'
+selfattention_layer_type: 'rel_selfattn'

# decoder related
decoder: transformer
2 changes: 1 addition & 1 deletion examples/aishell/s0/conf/train_u2++_branchformer.yaml
@@ -5,7 +5,7 @@ encoder_conf:
output_size: 256
use_attn: true
attention_heads: 4
-attention_layer_type: rel_selfattn
+selfattention_layer_type: rel_selfattn
pos_enc_layer_type: rel_pos
use_cgmlp: true
cgmlp_linear_units: 2048
2 changes: 1 addition & 1 deletion examples/aishell/s0/run.sh
@@ -55,7 +55,7 @@ dir=exp/conformer
tensorboard_dir=tensorboard
checkpoint=
num_workers=8
-prefetch=500
+prefetch=10

# use average_checkpoint will get better result
average_checkpoint=true
31 changes: 1 addition & 30 deletions examples/aishell/whisper/conf/ds_stage1.json
@@ -23,40 +23,11 @@
"device": "none",
"pin_memory": true
},
"offload_param": {
"device": "none",
"pin_memory": true
},
"allgather_partitions": true,
"allgather_bucket_size": 5e8,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 5e8,
"contiguous_gradients" : true,
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_prefetch_bucket_size": 5e8,
"stage3_param_persistence_threshold": 1e6
},
"activation_checkpointing": {
"partition_activations": false,
"cpu_checkpointing": false,
"contiguous_memory_optimization": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
},
"flops_profiler": {
"enabled": false,
"profile_step": 100,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
},
"tensorboard": {
"enabled": false,
"output_path": "tensorboard/ds_logs/",
"job_name": "deepspeed"
"contiguous_gradients" : true
}
}
33 changes: 33 additions & 0 deletions examples/aishell/whisper/conf/ds_stage2.json
@@ -0,0 +1,33 @@
{
"train_micro_batch_size_per_gpu": 1,
"gradient_accumulation_steps": 1,
"steps_per_print": 100,
"gradient_clipping": 5,
"fp16": {
"enabled": false,
"auto_cast": false,
"loss_scale": 0,
"initial_scale_power": 16,
"loss_scale_window": 1000,
"hysteresis": 2,
"consecutive_hysteresis": false,
"min_loss_scale": 1
},
"bf16": {
"enabled": true
},
"zero_force_ds_cpu_optimizer": false,
"zero_optimization": {
"stage": 2,
"offload_optimizer": {
"device": "none",
"pin_memory": true
},
"allgather_partitions": true,
"allgather_bucket_size": 5e8,
"overlap_comm": false,
"reduce_scatter": true,
"reduce_bucket_size": 5e8,
"contiguous_gradients" : true
}
}
41 changes: 41 additions & 0 deletions examples/aishell/whisper/conf/ds_stage3.json
@@ -0,0 +1,41 @@
{
"train_micro_batch_size_per_gpu": 1,
"gradient_accumulation_steps": 1,
"steps_per_print": 100,
"gradient_clipping": 5,
"fp16": {
"enabled": false,
"auto_cast": false,
"loss_scale": 0,
"initial_scale_power": 16,
"loss_scale_window": 1000,
"hysteresis": 2,
"consecutive_hysteresis": false,
"min_loss_scale": 1
},
"bf16": {
"enabled": true
},
"zero_force_ds_cpu_optimizer": false,
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "none",
"pin_memory": true
},
"offload_param": {
"device": "none",
"pin_memory": true
},
"allgather_partitions": true,
"allgather_bucket_size": 5e8,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 5e8,
"contiguous_gradients" : true,
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_prefetch_bucket_size": 5e8,
"stage3_param_persistence_threshold": 1e5
}
}
2 changes: 1 addition & 1 deletion examples/aishell/whisper/run.sh
@@ -43,7 +43,7 @@ checkpoint=exp/whisper/large-v3/wenet_whisper.init-ctc.pt
dir=exp/finetune_whisper_largev3_conv1d2
tensorboard_dir=tensorboard
num_workers=8
-prefetch=500
+prefetch=10

# use average_checkpoint will get better result
average_checkpoint=true
2 changes: 1 addition & 1 deletion examples/librispeech/s0/conf/train_u2++_branchformer.yaml
@@ -5,7 +5,7 @@ encoder_conf:
output_size: 256
use_attn: true
attention_heads: 4
-attention_layer_type: rel_selfattn
+selfattention_layer_type: rel_selfattn
pos_enc_layer_type: rel_pos
use_cgmlp: true
cgmlp_linear_units: 2048
2 changes: 1 addition & 1 deletion examples/wenetspeech/s0/conf/train_u2++_conformer.yaml
@@ -104,7 +104,7 @@ dataset_conf:

grad_clip: 5
accum_grad: 4
-max_epoch: 1 # NOTE(xcsong): Configure the epoch in run.sh
+max_epoch: 100
log_interval: 100
save_interval: 1000 # NOTE(xcsong): we use step_save instead of epoch_save for large datasets

26 changes: 8 additions & 18 deletions examples/wenetspeech/s0/run.sh
@@ -47,12 +47,12 @@ train_set=train_`echo $set | tr 'A-Z' 'a-z'`
dev_set=dev
test_sets="test_net test_meeting"

-# NOTE(xcsong): we use step_save instead of epoch_save for large datasets
-epoch=100

train_config=conf/train_u2++_conformer.yaml
checkpoint=
dir=exp/u2pp_conformer
tensorboard_dir=tensorboard
+num_workers=8
+prefetch=10

cmvn_sampling_divisor=20 # 20 means 5% of the training data to estimate cmvn

@@ -157,33 +157,23 @@ if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
echo "$0: using torch ddp"
fi

-# repeat data.list, we use step_save instead of epoch_save for large datasets
-train_data=data/$train_set/data.list.repeat${epoch}
-if [ ! -f "${train_data}" ]; then
-echo "repeat data/$train_set/data.list ${epoch} times"
-for (( i=1; i<=$epoch; i++ ))
-do
-cat "data/$train_set/data.list" >> "${train_data}"
-done
-echo "save new data.list in ${train_data}, it will be used for training"
-else
-echo "${train_data} already exists."
-fi

echo "$0: num_nodes is $num_nodes, proc_per_node is $num_gpus"
torchrun --nnodes=$num_nodes --nproc_per_node=$num_gpus --rdzv_endpoint=$HOST_NODE_ADDR \
--rdzv_id=2023 --rdzv_backend="c10d" \
wenet/bin/train.py \
--train_engine ${train_engine} \
--config $train_config \
--data_type "shard" \
---train_data ${train_data} \
+--train_data data/$train_set/data.list \
--cv_data data/$dev_set/data.list \
${checkpoint:+--checkpoint $checkpoint} \
--model_dir $dir \
--tensorboard_dir ${tensorboard_dir} \
--ddp.dist_backend $dist_backend \
---num_workers 2 \
+--num_workers ${num_workers} \
+--prefetch ${prefetch} \
--pin_memory \
--timeout 1200 \
--deepspeed_config ${deepspeed_config} \
--deepspeed.save_states ${deepspeed_save_states}
fi
10 changes: 1 addition & 9 deletions examples/wenetspeech/whisper/conf/ds_stage1.json
@@ -23,19 +23,11 @@
"device": "none",
"pin_memory": true
},
"offload_param": {
"device": "none",
"pin_memory": true
},
"allgather_partitions": true,
"allgather_bucket_size": 5e8,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 5e8,
"contiguous_gradients" : true,
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_prefetch_bucket_size": 5e8,
"stage3_param_persistence_threshold": 1e6
"contiguous_gradients" : true
}
}
@@ -108,7 +108,7 @@ dataset_conf:

grad_clip: 5
accum_grad: 8
-max_epoch: 1 # NOTE(xcsong): Configure the epoch in run.sh
+max_epoch: 100
log_interval: 100
save_interval: 1000 # NOTE(xcsong): we use step_save instead of epoch_save for large datasets

20 changes: 3 additions & 17 deletions examples/wenetspeech/whisper/run.sh
@@ -44,13 +44,12 @@ train_set=train_l
dev_set=dev
test_sets="test_net test_meeting"

-epoch=100
train_config=conf/finetune_whisper_largev3.yaml
checkpoint=exp/whisper/large-v3/wenet_whisper.init-ctc.pt
dir=exp/finetune_whisper_largev3
tensorboard_dir=tensorboard
-num_workers=1
-prefetch=500
+num_workers=8
+prefetch=10

# use average_checkpoint will get better result
average_checkpoint=true
@@ -92,19 +91,6 @@ if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
echo "$0: using torch ddp"
fi

-# repeat data.list, we use step_save instead of epoch_save for large datasets
-train_data=data/$train_set/data.list.repeat${epoch}
-if [ ! -f "${train_data}" ]; then
-echo "repeat data/$train_set/data.list ${epoch} times"
-for (( i=1; i<=$epoch; i++ ))
-do
-cat "data/$train_set/data.list" >> "${train_data}"
-done
-echo "save new data.list in ${train_data}, it will be used for training"
-else
-echo "${train_data} already exists."
-fi

# NOTE(xcsong): Both ddp & deepspeed can be launched by torchrun
# NOTE(xcsong): To unify single-node & multi-node training, we add
# all related args. You should change `nnodes` &
@@ -128,7 +114,7 @@ if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
--train_engine ${train_engine} \
--config $train_config \
--data_type $data_type \
---train_data ${train_data} \
+--train_data data/$train_set/data.list \
--cv_data data/$dev_set/data.list \
${checkpoint:+--checkpoint $checkpoint} \
--model_dir $dir \
2 changes: 1 addition & 1 deletion requirements.txt
@@ -18,7 +18,7 @@ cpplint==1.6.1
torch>=2.1.2
torchaudio>=2.1.2
tqdm
deepspeed<0.13.0
deepspeed>=0.14.0
librosa
openai-whisper
pre-commit==3.5.0