Dense Layers in Attention Don't Train

I am trying to build a transformer model from scratch using Flux.jl. I have an encoder struct that contains an attention struct, which in turn holds Q, K, V, and O Dense layers. My problem is that the Dense layers' weight and bias fields don't seem to change after training; they are the same even after many epochs.

I have added
Flux.@functor AttentionLayer
as well as Flux.@functor Encoder, etc.
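To make the structure concrete, here is a trimmed-down sketch of my setup (the forward pass is simplified to a single unmasked head, and the dimensions are elided; the struct layout and @functor calls match what I described):

```julia
using Flux

# Attention struct holding the four projection layers.
struct AttentionLayer
    Q::Dense
    K::Dense
    V::Dense
    O::Dense
end

Flux.@functor AttentionLayer  # register the fields so Flux can collect their parameters

# Simplified single-head forward pass on a (features, seqlen) matrix.
function (a::AttentionLayer)(x)
    q, k, v = a.Q(x), a.K(x), a.V(x)
    scores = Flux.softmax((k' * q) ./ sqrt(Float32(size(q, 1))); dims = 1)
    return a.O(v * scores)
end

struct Encoder
    attention::AttentionLayer
    ffn::Chain
end

Flux.@functor Encoder

(e::Encoder)(x) = e.ffn(e.attention(x))
```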

The overall model does seem to train, and the MSE loss decreases; other parameters change, but the Dense layers inside the attention struct are not affected.
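This is roughly how I am checking (`model` is a stand-in name for my encoder instance, and the training loop is elided):

```julia
# Snapshot one attention weight matrix before training.
W_before = copy(model.attention.Q.weight)

# ... run the training loop ...

# Compare after training: this stays ≈ 0 for the attention layers.
@show maximum(abs.(model.attention.Q.weight .- W_before))

# Sanity check that the optimiser can even see these parameters.
ps = Flux.params(model)
@show any(p -> p === model.attention.Q.weight, ps)
```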

Can anyone help me?

Welcome.

First, could you post a minimal working example, so we can help you?

Second, you might be interested in GitHub - chengchingwen/Transformers.jl: Julia Implementation of Transformer models. It has good documentation.