yiqing-zhou b76d333f39 [code] formatter-caused changes		2023-05-28 20:02:56 +08:00
.vscode	[code] update .vscode launch.json	2023-05-14 22:55:01 +08:00
custom_models	[code] formatter-caused changes	2023-05-28 20:02:56 +08:00
.gitignore	[code] refactor	2023-05-07 13:01:02 +08:00
LICENSE	Create LICENSE	2023-05-08 00:26:38 +08:00
README.md	Update README.md	2023-05-14 23:08:44 +08:00
generate.py	[code] formatter-caused changes	2023-05-28 20:02:56 +08:00
lit_export.py	[code] formatter-caused changes	2023-05-28 20:02:56 +08:00
lit_module.py	[code] formatter-caused changes	2023-05-28 20:02:56 +08:00
lit_patches.py	[fix] add patch to fix DeepSpeedStrategy offload 'zero_force_ds_cpu_optimizer' issue	2023-05-09 23:00:28 +08:00
lit_train.py	[code] formatter-caused changes	2023-05-28 20:02:56 +08:00
requirements.txt	[code] update requirements	2023-05-06 21:05:53 +08:00
utils.py	[code] formatter-caused changes	2023-05-28 20:02:56 +08:00

README.md

GPT-Pretrain

Usage

Make it simple

python lit_train.py --model_name gpt2 --use_tril_attention_mask
python lit_export.py --version 0
python generate.py --model_name_or_path exports/version_0 --tokenizer_name_or_path gpt2

📝 Note: Training with `--use_tril_attention_mask` is recommended. However, Hugging Face model implementations might not support a 2D attention mask. In that case you can write a custom model that does, as I did in custom_models/gpt2.
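To illustrate what a 2D attention mask looks like, here is a minimal sketch in plain Python. It assumes that `--use_tril_attention_mask` refers to a lower-triangular (causal) mask with one row per query position, combined with the usual 1D padding mask. The function name `tril_attention_mask` is hypothetical and not taken from this repo; real training code would more likely build the mask as a tensor, for example with `torch.tril`.

```python
def tril_attention_mask(padding_mask):
    """Build a (seq_len x seq_len) causal mask from a 1D padding mask.

    padding_mask[j] is 1 for a real token and 0 for padding.
    mask[i][j] == 1 means query position i may attend to key position j.
    This is an illustrative sketch, not the repo's implementation.
    """
    n = len(padding_mask)
    return [
        # Position i sees position j only if j is not in the future
        # (j <= i) and j is a real token rather than padding.
        [1 if j <= i and padding_mask[j] else 0 for j in range(n)]
        for i in range(n)
    ]


# Three real tokens followed by one padding token.
mask = tril_attention_mask([1, 1, 1, 0])
for row in mask:
    print(row)
```

Each sample then carries a square mask instead of a single row, which is why a model whose forward pass only expects a 1D mask per sample may need to be adapted.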

Train on multiple GPUs

python lit_train.py --model_name gpt2 --use_tril_attention_mask --strategy fsdp # default and recommended

python lit_train.py --model_name gpt2 --use_tril_attention_mask --strategy deepspeed

python lit_train.py --model_name gpt2 --use_tril_attention_mask --strategy ddp

Reduce CUDA memory cost

Half precision

python lit_train.py --model_name gpt2 --use_tril_attention_mask --bf16

python lit_train.py --model_name gpt2 --use_tril_attention_mask --fp16
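The two flags select different 16-bit floating-point formats. As a rough guide when choosing between them, the sketch below computes the largest finite value of each format from its bit layout (fp16: 5 exponent bits and 10 mantissa bits; bf16: 8 exponent bits and 7 mantissa bits). The helper `max_finite` is hypothetical and only illustrates the formats; it says nothing about how `lit_train.py` configures precision internally.

```python
def max_finite(exponent_bits, mantissa_bits):
    """Largest finite value of an IEEE-style binary float format."""
    bias = 2 ** (exponent_bits - 1) - 1
    # Largest mantissa is just under 2; largest usable exponent equals the bias.
    return (2 - 2 ** -mantissa_bits) * 2.0 ** bias


fp16_max = max_finite(5, 10)  # half precision
bf16_max = max_finite(8, 7)   # bfloat16

print(f"fp16 max: {fp16_max}")
print(f"bf16 max: {bf16_max:.4e}")
```

bf16 covers roughly the same range as fp32 but with fewer mantissa bits, whereas fp16 has a much smaller range and commonly relies on loss scaling to avoid overflow and underflow. bf16 also needs hardware support, so `--fp16` may be the only option on older GPUs.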

Smaller batch size with gradient accumulation

python lit_train.py --model_name gpt2 --use_tril_attention_mask --bf16 \
    --train_batch_size 2 --val_batch_size 4 --accumulate_grad_batches 128
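Gradient accumulation trades memory for time: each forward/backward pass sees only a small batch, but the optimizer steps once per accumulated group. The sketch below works out the resulting effective batch size. It assumes `--train_batch_size` is the per-device batch size and that training is data-parallel; `effective_batch_size` is a hypothetical helper, not part of this repo.

```python
def effective_batch_size(train_batch_size, accumulate_grad_batches, num_devices=1):
    """Number of samples that contribute to a single optimizer step."""
    return train_batch_size * accumulate_grad_batches * num_devices


# The command above on a single GPU: 2 samples x 128 accumulation steps.
print(effective_batch_size(2, 128))

# The same flags on 8 GPUs (assuming one data-parallel replica per GPU).
print(effective_batch_size(2, 128, num_devices=8))
```

If you change the number of GPUs, adjusting `--accumulate_grad_batches` keeps the effective batch size, and therefore the optimization behavior, roughly constant.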

CPU offload

python lit_train.py --model_name gpt2 --use_tril_attention_mask --bf16 \
    --strategy fsdp_cpu_offload

python lit_train.py --model_name gpt2 --use_tril_attention_mask --bf16 \
    --strategy deepspeed_stage_3_offload