Params not getting updated during training

I started from a minimal example again and it gets stuck after the first training round.
I guess I have to go back to MXNet and TensorFlow for now.
I found it: when the DataLoader batchsize is more than 1, it gets like this.

This seems like a misunderstanding of Flux vs TF/MXNet APIs. Could you please post the actual minimal code example you tried (with dummy data generation if need be) and also the Python code you're trying to translate? I ask because the issue seems most likely to be on the data preprocessing or loading side rather than in the model definition or training.


I have posted the simple code above.
I'm still working on different ways to do this.

Here is the minimal code:

using Flux

model = Chain(Dense(10, 1))

# dummy data: integer features in 0..9, 10 features x 10000 samples
x = rand(0:9, 10, 10000)
y = rand(0:9, 1, 10000)

function losss(x, y)
    return Flux.mae(model(x), y)
end

optimiser = Flux.Descent(0.01)
train_loader = Flux.Data.DataLoader((x, y), batchsize = 1)

Flux.@epochs 10 Flux.train!(losss, params(model), train_loader, optimiser,
    cb = Flux.throttle(() -> println(losss(x, y)), 10))

The problem arises after the first epoch: all the weights and biases in the model become NaN, so the model's output becomes NaN, and the loss does too; hence training fails to continue…
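For reference, a quick way to confirm the blow-up (a sketch against the same model and losss defined above) is to check the parameters and the loss directly:

# true once any weight or bias has overflowed to NaN
any(p -> any(isnan, p), Flux.params(model))

# the loss is then NaN as well
isnan(losss(x, y))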


This could be due to an overflow; Descent() does not protect you from exploding gradients and the like. You could either clip your gradients with Flux.Optimise.Optimiser(ClipValue(10.0), Descent(0.01)) or use a momentum-based optimiser like ADAM().
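For concreteness, here is a minimal sketch of both suggestions applied to the training loop above, assuming the Flux.Optimise API (ClipValue clips each gradient element to the given threshold):

using Flux
using Flux.Optimise: Optimiser, ClipValue, Descent, ADAM

# clip each gradient element to [-10, 10] before the Descent update
optimiser = Optimiser(ClipValue(10.0), Descent(0.01))

# or swap in a momentum-based optimiser with adaptive step sizes
optimiser = ADAM()

Flux.train!(losss, params(model), train_loader, optimiser)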


I would also be surprised if this didn't result in NaNs if you wrote something similar in PyTorch, mostly because there's no activation function like sigmoid/tanh/etc. to bound the size of the outputs. If you reduce the magnitude of your inputs (either by dividing everything by, say, 100, or by using a rand variant that samples from [0, 1)), the network should be less susceptible to spitting out NaNs. You may also want to try a larger batch size: it will "smooth out" the gradient updates and thus also reduce the possibility of overflowing into NaNs.
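Something like this sketch of both suggestions, with inputs sampled from [0, 1) and a larger batch (the batchsize of 32 is just an illustrative choice, not a recommendation):

using Flux

# sample Float32 values from [0, 1) instead of integers in 0..9
x = rand(Float32, 10, 10000)
y = rand(Float32, 1, 10000)

# a larger batch averages per-sample gradients, smoothing the updates
train_loader = Flux.Data.DataLoader((x, y), batchsize = 32, shuffle = true)

Flux.@epochs 10 Flux.train!(losss, params(model), train_loader, optimiser,
    cb = Flux.throttle(() -> println(losss(x, y)), 10))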
