llm support by @dinghao Zhou
add causal model

fix typo

rm ckpt

add topk topp sampler
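
For context, a minimal sketch of combined top-k / top-p (nucleus) filtering in PyTorch; it illustrates the general technique only, not necessarily the exact sampler added in this commit:

```python
import torch

def top_k_top_p_sample(logits: torch.Tensor, top_k: int = 50, top_p: float = 0.9) -> torch.Tensor:
    # logits: (batch, vocab). First keep only the top_k largest logits.
    if top_k > 0:
        top_k = min(top_k, logits.size(-1))
        kth = torch.topk(logits, top_k, dim=-1).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    # Nucleus filtering: keep the smallest prefix of sorted tokens whose mass exceeds top_p.
    if top_p < 1.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
        probs = torch.softmax(sorted_logits, dim=-1)
        cum = torch.cumsum(probs, dim=-1)
        remove = cum - probs > top_p  # never removes the most probable token
        sorted_logits = sorted_logits.masked_fill(remove, float("-inf"))
        logits = torch.full_like(logits, float("-inf")).scatter(-1, sorted_idx, sorted_logits)
    return torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1)  # (batch, 1) token ids
```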

fix position

[train_engine] support fsdp (wenet-e2e#2412)

* [train_engine] support fsdp

* [train_engine] support fsdp

* unify scaler and amp

* fp32 && fp16 work in fsdp env

* fix fsdp in cv auto cast

* try to fix wenet.join fsdp

* implementing zero1 under fsdp is almost equivalent to deepspeed's zero1

* fix clip_and_grad_

* fix train summary

* all wenet xxxformer models work (except paraformer and transducer)

* try to fix nan

* add barrier for cv

* add destroy group for end of all train

* refactor wrap methods and ckpt works

* fix ckpt

* fix cv in dtype != float32

* fix ckpt in model mode

* fix bf16 amp

* refactor scaler and autocast, fix fp32 fp16 bf16 for fsdp

* fix fp32 nullcontext to nullcontext()

* modify after review

* fix lint

* fix lint
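
A rough sketch of wrapping a model with PyTorch FSDP and bf16 mixed precision, assuming a process group and a GPU are already set up; the sharding strategy and dtypes here are illustrative, not the exact train_engine code:

```python
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision, ShardingStrategy

# assumes torch.distributed.init_process_group(...) has already been called
model = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 1000)).cuda()
mp = MixedPrecision(param_dtype=torch.bfloat16,
                    reduce_dtype=torch.bfloat16,
                    buffer_dtype=torch.bfloat16)
# SHARD_GRAD_OP shards gradients and optimizer state (roughly ZeRO-2);
# FULL_SHARD additionally shards parameters (roughly ZeRO-3).
model = FSDP(model,
             sharding_strategy=ShardingStrategy.SHARD_GRAD_OP,
             mixed_precision=mp,
             device_id=torch.cuda.current_device())
```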

LoRA support (wenet-e2e#2049)

* support lora for v3.0.1

* format code and update lora attention && encoder

* fix bug when lora_list is None

---------

Co-authored-by: Xingchen Song(宋星辰) <xingchensong1996@163.com>
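
As a reference for the idea, a minimal LoRA linear layer (frozen base weight plus a scaled low-rank update); the attention/encoder integration in wenet may differ:

```python
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = W x + (alpha / r) * B(A(x)), with the base weight W frozen."""

    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)   # freeze the pretrained weight
        self.lora_a = nn.Linear(in_features, r, bias=False)
        self.lora_b = nn.Linear(r, out_features, bias=False)
        nn.init.kaiming_uniform_(self.lora_a.weight, a=math.sqrt(5))
        nn.init.zeros_(self.lora_b.weight)        # delta starts at zero
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))
```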

[env] update python version and deepspeed version (wenet-e2e#2462)

* [env] update python version and deepspeed version

* [env] fix lint

fix rope pos embedding (wenet-e2e#2463)

* fix rope pos embedding

* fix dropout

* fix comment
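
For reference, one common way to apply rotary position embeddings to interleaved even/odd feature pairs; the layout and base value are assumptions, not necessarily what the fixed code uses:

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # x: (batch, head, time, dim) with dim even
    b, h, t, d = x.shape
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2, device=x.device, dtype=torch.float32) / d))
    pos = torch.arange(t, device=x.device, dtype=torch.float32)
    freqs = torch.outer(pos, inv_freq)            # (time, dim/2)
    cos, sin = freqs.cos(), freqs.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]           # interleaved pairs
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin          # rotate each pair by its position angle
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```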

[transformer] add multi warmup and learning rate for different modules (wenet-e2e#2449)

* [transformer] add multi warmup and learning rate for different modules

* fix typo

* it works in warmuplr

* fix lr in tensorboard in step mode

* fix cv log

* cv works

* refactor cv log

* add helper lrs_to_string

* fix lrstr

* fix ddp multiple lr

* fix initial step

* revert to -1

* fix sub params dup

* fix step

* fix step

* fix log

* add assert for scheduler

* add comment for log

---------

Co-authored-by: Xingchen Song(宋星辰) <xingchensong1996@163.com>
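
The per-module learning-rate idea can be sketched with plain PyTorch parameter groups; module names, learning rates and warmup horizons below are made up for illustration:

```python
import torch
import torch.nn as nn

# toy two-module model standing in for e.g. a pretrained encoder and a new projector
model = nn.ModuleDict({"encoder": nn.Linear(80, 256), "decoder": nn.Linear(256, 1000)})
groups = [
    {"params": model["encoder"].parameters(), "lr": 1e-5},
    {"params": model["decoder"].parameters(), "lr": 1e-4},
]
optimizer = torch.optim.AdamW(groups, weight_decay=0.01)

warmups = [25000, 5000]  # one warmup horizon per parameter group
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=[lambda step, w=w: min((step + 1) / w, 1.0) for w in warmups])
```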

add generate

add todo

support sft & pretrain training forward

gemma conversion works

support init causal model

[whisper] limit language to Chinese (wenet-e2e#2470)

[train] convert tensor to scalar (wenet-e2e#2471)

[workflow] upgrade python version to 3.10 (wenet-e2e#2472)

* [workflow] upgrade python version to 3.10

* [workflow] try to pass

refactor cache behaviour in training mode (reduce compute cost and memory) (wenet-e2e#2473)

all gemma model works

fix ut

fix ut (wenet-e2e#2477)

* fix ut

* fix py version

[transformer] Make MoE runnable (wenet-e2e#2474)

[transformer] fix mqa (wenet-e2e#2478)

enable mmap in torch.load (wenet-e2e#2479)
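
Usage is a one-liner: mmap=True (available in torch >= 2.1) maps the checkpoint lazily instead of reading the whole file into host memory; the path below is illustrative:

```python
import torch

state_dict = torch.load("exp/llm/checkpoint.pt", map_location="cpu", mmap=True)
```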

[example] Add deepspeed configs of different stages for illustration (wenet-e2e#2485)

[example] Fix prefetch and step_save (wenet-e2e#2486)

[ctl] simplified ctl (wenet-e2e#2483)

* [ctl] simplified ctl

* [ctl] unify

[branchformer] simplified branchformer (wenet-e2e#2482)

* [transformer] simplified branchformer

* fix yaml

* support mqa gradient ckpt sdpa

* fix gradient checkpoint

* add deepspeed comment in layer dropout

* fix comment

[e_branchformer] simplified e_branchformer (wenet-e2e#2484)

* [e_branchformer] simplified ctl

* try to fix ut

* try to fix ut

* fix activation

* fix att args

* e-branchformer works

[transformer] refactor cache (wenet-e2e#2481)

* [transformer] refactor cache

* fix ut

* unify cache type in branchformer and ebranchformer

fix cache

fix gradient ckpt in branchformer/ebranchformer (wenet-e2e#2488)

fix search after refactor cache (wenet-e2e#2490)

generate works!

unify chat pattern
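
A unified chat pattern can be as small as one template renderer shared across models; the role tags below are purely illustrative, since the real special tokens depend on each model's tokenizer:

```python
def render_chat(messages: list[dict]) -> str:
    # messages: [{"role": "system" | "user" | "assistant", "content": str}, ...]
    parts = [f"<|{m['role']}|>\n{m['content']}\n" for m in messages]
    parts.append("<|assistant|>\n")  # generation continues from the assistant turn
    return "".join(parts)

print(render_chat([{"role": "user", "content": "hello"}]))
```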

convert llama3 works

[transformer] set use_reentrant=False for gradient ckpt (wenet-e2e#2491)
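
Passing the flag explicitly looks like this; the non-reentrant path is the recommended implementation and silences the deprecation warning about the unset default:

```python
import torch
from torch.utils.checkpoint import checkpoint

layer = torch.nn.Linear(256, 256)
x = torch.randn(4, 256, requires_grad=True)
y = checkpoint(layer, x, use_reentrant=False)  # activations are recomputed in backward
```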

[transformer] fix warning: ignore(True) has been deprecated (wenet-e2e#2492)

* [transformer] fix warning: ignore(True) has been deprecated

* [transformer] fix warning: ignore(True) has been deprecated

[log] avoid redundant logging (wenet-e2e#2493)

fix w1 w2 w3 in feedforward

add 70b temporarily

mv LLM to wenet

support llm dataset

unify config

add dataset yaml in script

support llm dataset

dynamic static bucket works

[transformer] refactor mqa repeat (wenet-e2e#2497)

[transformer] fix mqa in cross att (wenet-e2e#2498)
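
The mqa repeat refactor presumably centers on expanding shared key/value heads up to the query-head count before attention; a generic sketch, not the exact wenet implementation:

```python
import torch

def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    """(batch, kv_heads, time, head_dim) -> (batch, kv_heads * n_rep, time, head_dim)."""
    if n_rep == 1:
        return x
    b, kv_heads, t, d = x.shape
    # same result as torch.repeat_interleave(x, n_rep, dim=1)
    x = x[:, :, None, :, :].expand(b, kv_heads, n_rep, t, d)
    return x.reshape(b, kv_heads * n_rep, t, d)
```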

[deepspeed] update json config (wenet-e2e#2499)

training works

pretrain works

refactor convert

fix flash att in generate

llama works

fix llama3

fix speed

try fix ut

support stop tokens in gen and support ppl
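
Two small sketches of what these features typically look like, with assumed token ids and shapes rather than the exact wenet code: stop generation when an end token is sampled, and compute perplexity from per-token cross entropy:

```python
import torch
import torch.nn.functional as F

stop_tokens = {2, 128001}  # assumed eos ids; the real ones come from the tokenizer/config

def should_stop(next_token_id: int) -> bool:
    return next_token_id in stop_tokens

def perplexity(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # logits: (batch, time, vocab); labels: (batch, time), padding marked with -100
    nll = F.cross_entropy(logits.transpose(1, 2), labels,
                          ignore_index=-100, reduction="mean")
    return torch.exp(nll)
```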

Mddct authored and Your Name committed Aug 7, 2024
1 parent 648fee8 commit 57a04ce
Showing 75 changed files with 3,276 additions and 1,146 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/doc.yml
@@ -19,8 +19,8 @@ jobs:
fail-fast: false
matrix:
os: [ubuntu-latest]
torch: ["1.13.1"]
python-version: ["3.8"]
torch: ["2.2.2"]
python-version: ["3.10"]
steps:
- uses: actions/checkout@v1

6 changes: 3 additions & 3 deletions .github/workflows/lint.yml
@@ -32,7 +32,7 @@ jobs:
- name: Setup Python
uses: actions/setup-python@v1
with:
-python-version: 3.9
+python-version: 3.10.14
architecture: x64
- name: Fetch Wenet
uses: actions/checkout@v1
@@ -60,7 +60,7 @@ jobs:
- name: Setup Python
uses: actions/setup-python@v1
with:
-python-version: 3.x
+python-version: 3.10.14
architecture: x64
- name: Fetch Wenet
uses: actions/checkout@v1
@@ -88,7 +88,7 @@ jobs:
- name: Setup Python
uses: actions/setup-python@v1
with:
-python-version: 3.x
+python-version: 3.10.14
architecture: x64
- name: Fetch Wenet
uses: actions/checkout@v1
2 changes: 1 addition & 1 deletion .github/workflows/unit_test.yml
@@ -12,7 +12,7 @@ jobs:
max-parallel: 20
matrix:
os: [ubuntu-latest]
-python-version: [3.8]
+python-version: [3.10.14]
steps:
- name: Cache Python Packages
uses: actions/cache@v1
4 changes: 2 additions & 2 deletions .github/workflows/wheels.yml
@@ -21,13 +21,13 @@ jobs:
# Used to host cibuildwheel
- uses: actions/setup-python@v3
with:
-python-version: '3.6'
+python-version: '3.10'

- name: Build wheels
uses: pypa/cibuildwheel@v2.11.2
env:
CIBW_BUILD_VERBOSITY: 1
CIBW_BUILD: "cp36-* cp37-* cp38-* cp39-*"
CIBW_BUILD: "cp36-* cp37-* cp38-* cp39-* cp310-*"
# Disable building PyPy wheels on all platforms
# Skip 32-bit builds
CIBW_SKIP: "pp* *-win32 *-manylinux_i686 *-musllinux_*"
2 changes: 1 addition & 1 deletion README.md
@@ -56,7 +56,7 @@ git clone https://github.com/wenet-e2e/wenet.git
- Create Conda env:

``` sh
-conda create -n wenet python=3.8
+conda create -n wenet python=3.10
conda activate wenet
conda install conda-forge::sox
pip install -r requirements.txt
2 changes: 1 addition & 1 deletion examples/aishell/s0/conf/train_ebranchformer.yaml
@@ -18,7 +18,7 @@ encoder_conf:
activation_type: 'swish'
causal: false
pos_enc_layer_type: 'rel_pos'
-attention_layer_type: 'rel_selfattn'
+selfattention_layer_type: 'rel_selfattn'

# decoder related
decoder: transformer
2 changes: 1 addition & 1 deletion examples/aishell/s0/conf/train_u2++_branchformer.yaml
@@ -5,7 +5,7 @@ encoder_conf:
output_size: 256
use_attn: true
attention_heads: 4
-attention_layer_type: rel_selfattn
+selfattention_layer_type: rel_selfattn
pos_enc_layer_type: rel_pos
use_cgmlp: true
cgmlp_linear_units: 2048
2 changes: 1 addition & 1 deletion examples/aishell/s0/run.sh
@@ -55,7 +55,7 @@ dir=exp/conformer
tensorboard_dir=tensorboard
checkpoint=
num_workers=8
-prefetch=500
+prefetch=10

# use average_checkpoint will get better result
average_checkpoint=true
31 changes: 1 addition & 30 deletions examples/aishell/whisper/conf/ds_stage1.json
@@ -23,40 +23,11 @@
"device": "none",
"pin_memory": true
},
"offload_param": {
"device": "none",
"pin_memory": true
},
"allgather_partitions": true,
"allgather_bucket_size": 5e8,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 5e8,
"contiguous_gradients" : true,
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_prefetch_bucket_size": 5e8,
"stage3_param_persistence_threshold": 1e6
},
"activation_checkpointing": {
"partition_activations": false,
"cpu_checkpointing": false,
"contiguous_memory_optimization": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
},
"flops_profiler": {
"enabled": false,
"profile_step": 100,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
},
"tensorboard": {
"enabled": false,
"output_path": "tensorboard/ds_logs/",
"job_name": "deepspeed"
"contiguous_gradients" : true
}
}
33 changes: 33 additions & 0 deletions examples/aishell/whisper/conf/ds_stage2.json
@@ -0,0 +1,33 @@
{
"train_micro_batch_size_per_gpu": 1,
"gradient_accumulation_steps": 1,
"steps_per_print": 100,
"gradient_clipping": 5,
"fp16": {
"enabled": false,
"auto_cast": false,
"loss_scale": 0,
"initial_scale_power": 16,
"loss_scale_window": 1000,
"hysteresis": 2,
"consecutive_hysteresis": false,
"min_loss_scale": 1
},
"bf16": {
"enabled": true
},
"zero_force_ds_cpu_optimizer": false,
"zero_optimization": {
"stage": 2,
"offload_optimizer": {
"device": "none",
"pin_memory": true
},
"allgather_partitions": true,
"allgather_bucket_size": 5e8,
"overlap_comm": false,
"reduce_scatter": true,
"reduce_bucket_size": 5e8,
"contiguous_gradients" : true
}
}
41 changes: 41 additions & 0 deletions examples/aishell/whisper/conf/ds_stage3.json
@@ -0,0 +1,41 @@
{
"train_micro_batch_size_per_gpu": 1,
"gradient_accumulation_steps": 1,
"steps_per_print": 100,
"gradient_clipping": 5,
"fp16": {
"enabled": false,
"auto_cast": false,
"loss_scale": 0,
"initial_scale_power": 16,
"loss_scale_window": 1000,
"hysteresis": 2,
"consecutive_hysteresis": false,
"min_loss_scale": 1
},
"bf16": {
"enabled": true
},
"zero_force_ds_cpu_optimizer": false,
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "none",
"pin_memory": true
},
"offload_param": {
"device": "none",
"pin_memory": true
},
"allgather_partitions": true,
"allgather_bucket_size": 5e8,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 5e8,
"contiguous_gradients" : true,
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_prefetch_bucket_size": 5e8,
"stage3_param_persistence_threshold": 1e5
}
}
2 changes: 1 addition & 1 deletion examples/aishell/whisper/run.sh
@@ -43,7 +43,7 @@ checkpoint=exp/whisper/large-v3/wenet_whisper.init-ctc.pt
dir=exp/finetune_whisper_largev3_conv1d2
tensorboard_dir=tensorboard
num_workers=8
-prefetch=500
+prefetch=10

# use average_checkpoint will get better result
average_checkpoint=true
2 changes: 1 addition & 1 deletion examples/librispeech/s0/conf/train_u2++_branchformer.yaml
@@ -5,7 +5,7 @@ encoder_conf:
output_size: 256
use_attn: true
attention_heads: 4
-attention_layer_type: rel_selfattn
+selfattention_layer_type: rel_selfattn
pos_enc_layer_type: rel_pos
use_cgmlp: true
cgmlp_linear_units: 2048
2 changes: 1 addition & 1 deletion examples/wenetspeech/s0/conf/train_u2++_conformer.yaml
@@ -104,7 +104,7 @@ dataset_conf:

grad_clip: 5
accum_grad: 4
-max_epoch: 1 # NOTE(xcsong): Configure the epoch in run.sh
+max_epoch: 100
log_interval: 100
save_interval: 1000 # NOTE(xcsong): we use step_save instead of epoch_save for large datasets

26 changes: 8 additions & 18 deletions examples/wenetspeech/s0/run.sh
@@ -47,12 +47,12 @@ train_set=train_`echo $set | tr 'A-Z' 'a-z'`
dev_set=dev
test_sets="test_net test_meeting"

-# NOTE(xcsong): we use step_save instead of epoch_save for large datasets
-epoch=100

train_config=conf/train_u2++_conformer.yaml
checkpoint=
dir=exp/u2pp_conformer
tensorboard_dir=tensorboard
+num_workers=8
+prefetch=10

cmvn_sampling_divisor=20 # 20 means 5% of the training data to estimate cmvn

@@ -157,33 +157,23 @@ if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
echo "$0: using torch ddp"
fi

-# repeat data.list, we use step_save instead of epoch_save for large datasets
-train_data=data/$train_set/data.list.repeat${epoch}
-if [ ! -f "${train_data}" ]; then
-echo "repeat data/$train_set/data.list ${epoch} times"
-for (( i=1; i<=$epoch; i++ ))
-do
-cat "data/$train_set/data.list" >> "${train_data}"
-done
-echo "save new data.list in ${train_data}, it will be used for training"
-else
-echo "${train_data} already exists."
-fi

echo "$0: num_nodes is $num_nodes, proc_per_node is $num_gpus"
torchrun --nnodes=$num_nodes --nproc_per_node=$num_gpus --rdzv_endpoint=$HOST_NODE_ADDR \
--rdzv_id=2023 --rdzv_backend="c10d" \
wenet/bin/train.py \
--train_engine ${train_engine} \
--config $train_config \
--data_type "shard" \
---train_data ${train_data} \
+--train_data data/$train_set/data.list \
--cv_data data/$dev_set/data.list \
${checkpoint:+--checkpoint $checkpoint} \
--model_dir $dir \
--tensorboard_dir ${tensorboard_dir} \
--ddp.dist_backend $dist_backend \
---num_workers 2 \
+--num_workers ${num_workers} \
+--prefetch ${prefetch} \
--pin_memory \
--timeout 1200 \
--deepspeed_config ${deepspeed_config} \
--deepspeed.save_states ${deepspeed_save_states}
fi
10 changes: 1 addition & 9 deletions examples/wenetspeech/whisper/conf/ds_stage1.json
@@ -23,19 +23,11 @@
"device": "none",
"pin_memory": true
},
"offload_param": {
"device": "none",
"pin_memory": true
},
"allgather_partitions": true,
"allgather_bucket_size": 5e8,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 5e8,
"contiguous_gradients" : true,
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_prefetch_bucket_size": 5e8,
"stage3_param_persistence_threshold": 1e6
"contiguous_gradients" : true
}
}
@@ -108,7 +108,7 @@ dataset_conf:

grad_clip: 5
accum_grad: 8
-max_epoch: 1 # NOTE(xcsong): Configure the epoch in run.sh
+max_epoch: 100
log_interval: 100
save_interval: 1000 # NOTE(xcsong): we use step_save instead of epoch_save for large datasets

20 changes: 3 additions & 17 deletions examples/wenetspeech/whisper/run.sh
@@ -44,13 +44,12 @@ train_set=train_l
dev_set=dev
test_sets="test_net test_meeting"

-epoch=100
train_config=conf/finetune_whisper_largev3.yaml
checkpoint=exp/whisper/large-v3/wenet_whisper.init-ctc.pt
dir=exp/finetune_whisper_largev3
tensorboard_dir=tensorboard
-num_workers=1
-prefetch=500
+num_workers=8
+prefetch=10

# use average_checkpoint will get better result
average_checkpoint=true
@@ -92,19 +91,6 @@ if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
echo "$0: using torch ddp"
fi

-# repeat data.list, we use step_save instead of epoch_save for large datasets
-train_data=data/$train_set/data.list.repeat${epoch}
-if [ ! -f "${train_data}" ]; then
-echo "repeat data/$train_set/data.list ${epoch} times"
-for (( i=1; i<=$epoch; i++ ))
-do
-cat "data/$train_set/data.list" >> "${train_data}"
-done
-echo "save new data.list in ${train_data}, it will be used for training"
-else
-echo "${train_data} already exists."
-fi

# NOTE(xcsong): Both ddp & deepspeed can be launched by torchrun
# NOTE(xcsong): To unify single-node & multi-node training, we add
# all related args. You should change `nnodes` &
@@ -128,7 +114,7 @@ if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
--train_engine ${train_engine} \
--config $train_config \
--data_type $data_type \
---train_data ${train_data} \
+--train_data data/$train_set/data.list \
--cv_data data/$dev_set/data.list \
${checkpoint:+--checkpoint $checkpoint} \
--model_dir $dir \
2 changes: 1 addition & 1 deletion requirements.txt
@@ -18,7 +18,7 @@ cpplint==1.6.1
torch>=2.1.2
torchaudio>=2.1.2
tqdm
deepspeed<0.13.0
deepspeed>=0.14.0
librosa
openai-whisper
pre-commit==3.5.0