# GPT-Pretrain

## Usage

### Make it simple

```sh
python lit_train.py --model_name gpt2 --use_tril_attention_mask
python lit_export.py --version 0
python generate.py --model_name_or_path exports/version_0 --tokenizer_name_or_path gpt2
```

📝 Note: Training with `--use_tril_attention_mask` is recommended. However, Hugging Face model implementations may not support a 2D attention mask. You can write a custom model that supports one, as I did in `custom_models/gpt2`.
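For intuition, here is a minimal sketch of the kind of lower-triangular (causal) 2D mask the flag refers to. This is illustrative only, not the repo's actual code; the function name and the 1 = attend / 0 = masked convention are assumptions:

```python
import torch

def build_tril_attention_mask(seq_len: int) -> torch.Tensor:
    """Lower-triangular (causal) 2D mask: token i may attend to tokens <= i.

    Returns shape (seq_len, seq_len); 1 = attend, 0 = masked.
    """
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.long))

print(build_tril_attention_mask(4))
# tensor([[1, 0, 0, 0],
#         [1, 1, 0, 0],
#         [1, 1, 1, 0],
#         [1, 1, 1, 1]])
```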

### Train on multiple GPUs

```sh
python lit_train.py --model_name gpt2 --use_tril_attention_mask --strategy fsdp  # default and recommended
python lit_train.py --model_name gpt2 --use_tril_attention_mask --strategy deepspeed
python lit_train.py --model_name gpt2 --use_tril_attention_mask --strategy ddp
```
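These `--strategy` values are standard PyTorch Lightning strategy names. A minimal sketch of how a script like `lit_train.py` plausibly forwards them to the `Trainer`; the exact wiring here is an assumption, not the repo's code:

```python
import pytorch_lightning as pl

# Illustrative wiring only; lit_train.py's actual argument handling may differ.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=-1,       # all visible GPUs
    strategy="fsdp",  # or "deepspeed", "ddp", ...
)
# trainer.fit(model, datamodule=dm)  # model/dm come from the training script
```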

### Reduce CUDA memory cost

- Half precision

  ```sh
  python lit_train.py --model_name gpt2 --use_tril_attention_mask --bf16
  python lit_train.py --model_name gpt2 --use_tril_attention_mask --fp16
  ```

- Smaller batch size & `accumulate_grad_batches` (see the effective-batch-size sketch after this list)

  ```sh
  python lit_train.py --model_name gpt2 --use_tril_attention_mask --bf16 \
      --train_batch_size 2 --val_batch_size 4 --accumulate_grad_batches 128
  ```

- CPU offload

  ```sh
  python lit_train.py --model_name gpt2 --use_tril_attention_mask --bf16 \
      --strategy fsdp_cpu_offload
  python lit_train.py --model_name gpt2 --use_tril_attention_mask --bf16 \
      --strategy deepspeed_stage_3_offload
  ```
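Why a tiny `--train_batch_size` with a large `--accumulate_grad_batches` still works: gradients are summed over micro-batches before each optimizer step, so only the micro-batch must fit in memory. A sketch of the arithmetic (variable names are illustrative, not repo code):

```python
# Effective batch per optimizer step under gradient accumulation.
train_batch_size = 2           # per-device micro-batch (--train_batch_size)
accumulate_grad_batches = 128  # micro-batches per step (--accumulate_grad_batches)
num_devices = 1                # GPUs participating in data parallelism

effective = train_batch_size * accumulate_grad_batches * num_devices
print(effective)  # 256 samples per optimizer step on one GPU
```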