# GPT-Pretrain

## Usage

### Make it simple

```shell
python lit_train.py --model_name gpt2 --use_tril_attention_mask
python lit_export.py --version 0
python generate.py --model_name_or_path exports/version_0 --tokenizer_name_or_path gpt2
```
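The last step presumably loads the exported model through the standard `transformers` API; a minimal sketch under that assumption (not the repo's exact `generate.py`):

```python
# Minimal generation sketch using the standard transformers API.
# Assumes exports/version_0 contains a model saved with save_pretrained();
# this mirrors the command above but is not the repo's exact code.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("exports/version_0")

inputs = tokenizer("Hello, my name is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```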

> 📝 **Note**: Training with `--use_tril_attention_mask` is recommended. However, Hugging Face model implementations might not support a 2D attention mask. You can write a custom model that supports one, just as I did in `custom_models/gpt2`.
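For reference, the 2D mask in question is just a per-sample lower-triangular (causal) matrix; a minimal sketch (the helper name is illustrative, not part of this repo's API):

```python
# Minimal sketch of a lower-triangular (causal) 2D attention mask.
# The helper name is illustrative; it is not part of this repo's API.
import torch

def tril_attention_mask(seq_len: int) -> torch.Tensor:
    # Position i may attend to positions 0..i (1 = attend, 0 = masked).
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.long))

print(tril_attention_mask(4))
# tensor([[1, 0, 0, 0],
#         [1, 1, 0, 0],
#         [1, 1, 1, 0],
#         [1, 1, 1, 1]])
```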

### Train on multiple GPUs

```shell
python lit_train.py --model_name gpt2 --use_tril_attention_mask --strategy fsdp  # default and recommended
python lit_train.py --model_name gpt2 --use_tril_attention_mask --strategy deepspeed
python lit_train.py --model_name gpt2 --use_tril_attention_mask --strategy ddp
```
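These values are PyTorch Lightning's registered strategy aliases; a minimal sketch of the equivalent explicit `Trainer` setup, assuming `lit_train.py` forwards the flag unchanged (the repo's actual wiring may differ):

```python
# Sketch: the --strategy flag maps onto Lightning's registered strategy
# aliases. This assumes lit_train.py forwards the value to the Trainer.
import lightning.pytorch as pl

trainer = pl.Trainer(
    accelerator="gpu",
    devices=-1,        # all visible GPUs
    strategy="fsdp",   # or "deepspeed", "ddp"
)
```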

### Reduce CUDA memory cost

- half precision

  ```shell
  python lit_train.py --model_name gpt2 --use_tril_attention_mask --bf16
  ```

  ```shell
  python lit_train.py --model_name gpt2 --use_tril_attention_mask --fp16
  ```
- smaller batch size & accumulate grad batches

  ```shell
  python lit_train.py --model_name gpt2 --use_tril_attention_mask --bf16 \
      --train_batch_size 2 --val_batch_size 4 --accumulate_grad_batches 128
  ```

  The effective batch size per device is `train_batch_size * accumulate_grad_batches` (2 × 128 = 256 here), so only two samples occupy GPU memory at a time while gradients behave as if the batch were larger.
- cpu_offload

  ```shell
  python lit_train.py --model_name gpt2 --use_tril_attention_mask --bf16 \
      --strategy fsdp_cpu_offload
  ```

  ```shell
  python lit_train.py --model_name gpt2 --use_tril_attention_mask --bf16 \
      --strategy deepspeed_stage_3_offload
  ```

  Offloading keeps parameters and optimizer state in CPU memory between uses, trading speed for a smaller GPU footprint; see the sketch after this list for the equivalent explicit strategy configuration.
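The offload aliases above correspond to explicitly configurable Lightning strategy objects; a hedged sketch of the DeepSpeed variant (an assumption about the mapping, not the repo's exact configuration):

```python
# Sketch: explicit strategy object roughly equivalent to the
# deepspeed_stage_3_offload alias used above. This is an assumption about
# how the flags map; lit_train.py may configure things differently.
import lightning.pytorch as pl
from lightning.pytorch.strategies import DeepSpeedStrategy

trainer = pl.Trainer(
    accelerator="gpu",
    devices=-1,
    precision="bf16-mixed",       # the --bf16 flag presumably maps to this
    strategy=DeepSpeedStrategy(
        stage=3,                  # ZeRO stage 3: shard params, grads, optimizer state
        offload_optimizer=True,   # keep optimizer state in CPU memory
        offload_parameters=True,  # keep parameters in CPU memory when idle
    ),
)
```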