@@ -107,8 +107,9 @@ Linear(intermediate_parallel) no bias -> [6, 1, 4096]
| expand expand -> [6, 1, 32, 128]
\ / |
┏---- dot |
-┃ softmax /
-attention┃ \ /
+┃ += attention_mask /
+attention┃ softmax /
+┃ \ /
┗---- dot -> [1, 32, 6, 128] -> [6, 1, 4096]
Linear -> [6, 1, 4096]