Colin 122cbd9ff8 Use local tokenizer.		2024-02-24 14:14:12 +08:00
.vscode	enable pretrain.	2024-02-22 15:03:32 +08:00
custom_models	Use local tokenizer.	2024-02-24 14:14:12 +08:00
dataset	init gpt train without download.	2024-02-24 13:40:39 +08:00
.gitignore	[code] refactor	2023-05-07 13:01:02 +08:00
LICENSE	Create LICENSE	2023-05-08 00:26:38 +08:00
README.md	Update README.md	2023-05-14 23:08:44 +08:00
generate.py	[code] formatter-caused changes	2023-05-28 20:02:56 +08:00
lit_export.py	[code] formatter-caused changes	2023-05-28 20:02:56 +08:00
lit_module.py	set use local dataset.	2024-02-24 13:44:22 +08:00
lit_patches.py	[fix] add patch to fix DeepSpeedStrategy offload 'zero_force_ds_cpu_optimizer' issue	2023-05-09 23:00:28 +08:00
lit_train.py	Use local tokenizer.	2024-02-24 14:14:12 +08:00
requirements.txt	[code] update requirements	2023-05-06 21:05:53 +08:00
utils.py	[fix] fix genarate with custom models does not go to custom_models	2023-05-28 22:57:51 +08:00

README.md

GPT-Pretrain

Usage

Make it simple

python lit_train.py --model_name gpt2 --use_tril_attention_mask
python lit_export.py --version 0
python generate.py --model_name_or_path exports/version_0 --tokenizer_name_or_path gpt2

📝 Note: Training with "--use_tril_attention_mask" is recommended. However, Hugging Face model implementations might not support a 2D attention mask. You can write a custom model that supports one, as I did in custom_models/gpt2.
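To illustrate what a 2D attention mask looks like, here is a minimal sketch. It assumes that "tril" refers to a lower-triangular (causal) mask built per sample and combined with a right-padded 1D padding mask. The function name `tril_attention_mask` is hypothetical and is not taken from this repository; the actual logic in custom_models/gpt2 may differ.

```python
def tril_attention_mask(padding_mask):
    """Build a (seq_len, seq_len) mask from a 1D padding mask.

    mask[i][j] == 1 means query position i may attend to key position j.
    Assumption: padding is on the right, so every row keeps at least one 1.
    """
    n = len(padding_mask)
    return [
        # Allow key j only if it is not in the future (j <= i) and is not padding.
        [1 if j <= i and padding_mask[j] else 0 for j in range(n)]
        for i in range(n)
    ]


# Three real tokens followed by one padding token.
for row in tril_attention_mask([1, 1, 1, 0]):
    print(row)
```

A standard Hugging Face `attention_mask` holds one row per sample (the 1D padding mask above), whereas this sketch produces one row per query position, which is why a model has to accept the extra dimension explicitly.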

Train on multiple GPUs

python lit_train.py --model_name gpt2 --use_tril_attention_mask --strategy fsdp # default and recommended

python lit_train.py --model_name gpt2 --use_tril_attention_mask --strategy deepspeed

python lit_train.py --model_name gpt2 --use_tril_attention_mask --strategy ddp

Reduce CUDA memory cost

half precision

python lit_train.py --model_name gpt2 --use_tril_attention_mask --bf16

python lit_train.py --model_name gpt2 --use_tril_attention_mask --fp16

smaller batch size & accumulate grad batches

python lit_train.py --model_name gpt2 --use_tril_attention_mask --bf16 \
    --train_batch_size 2 --val_batch_size 4 --accumulate_grad_batches 128
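
A smaller batch size combined with gradient accumulation keeps the effective batch size large while lowering peak memory. The arithmetic can be sketched as follows, assuming that `--train_batch_size` is the per-device batch size (the helper `effective_batch_size` and the `num_devices` parameter are illustrative, not part of this repository):

```python
def effective_batch_size(per_device_batch_size, accumulate_grad_batches, num_devices=1):
    """Number of samples that contribute to one optimizer step."""
    return per_device_batch_size * accumulate_grad_batches * num_devices


# Values from the command above, on a single GPU.
print(effective_batch_size(2, 128))

# Same flags on 4 GPUs (hypothetical device count).
print(effective_batch_size(2, 128, num_devices=4))
```

With the flags above, each optimizer step sees 256 samples per device even though only 2 samples are held in memory at a time.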

cpu_offload

python lit_train.py --model_name gpt2 --use_tril_attention_mask --bf16 \
    --strategy fsdp_cpu_offload

python lit_train.py --model_name gpt2 --use_tril_attention_mask --bf16 \
    --strategy deepspeed_stage_3_offload