Dense Layers in Attention Don't Train

I am trying to build a transformer model from scratch using Flux.jl. I have an encoder struct that contains an attention struct, which in turn holds Q, K, V, and O Dense layers. My problem is that the Dense layers' weight and bias fields don't seem to change after training; they are the same even after many epochs.

I have added
Flux.@functor AttentionLayer
as well as Flux.@functor Encoder, etc.
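To make the structure concrete, here is a trimmed-down sketch of my setup (the forward pass is simplified to a single unmasked head, and the dimensions are elided; the struct layout and @functor calls match what I described):

```julia
using Flux

# Attention struct holding the four projection layers.
struct AttentionLayer
    Q::Dense
    K::Dense
    V::Dense
    O::Dense
end

Flux.@functor AttentionLayer  # register the fields so Flux can collect their parameters

# Simplified single-head forward pass on a (features, seqlen) matrix.
function (a::AttentionLayer)(x)
    q, k, v = a.Q(x), a.K(x), a.V(x)
    scores = Flux.softmax((k' * q) ./ sqrt(Float32(size(q, 1))); dims = 1)
    return a.O(v * scores)
end

struct Encoder
    attention::AttentionLayer
    ffn::Chain
end

Flux.@functor Encoder

(e::Encoder)(x) = e.ffn(e.attention(x))
```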

The overall model does seem to train, and the MSE loss decreases; other parameters change, but the Dense layers inside the attention struct are not affected.
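This is roughly how I am checking (`model` is a stand-in name for my encoder instance, and the training loop is elided):

```julia
# Snapshot one attention weight matrix before training.
W_before = copy(model.attention.Q.weight)

# ... run the training loop ...

# Compare after training: this stays ≈ 0 for the attention layers.
@show maximum(abs.(model.attention.Q.weight .- W_before))

# Sanity check that the optimiser can even see these parameters.
ps = Flux.params(model)
@show any(p -> p === model.attention.Q.weight, ps)
```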

Can anyone help me?

Welcome.

First, could you post a minimal working example, so we can help you?

Second, you might be interested in GitHub - chengchingwen/Transformers.jl: Julia Implementation of Transformer models. It has good documentation.