Params not getting updated during training

I started again from a minimal example, and training gets stuck after the first round.
I guess I have to go back to MXNet and TensorFlow for now.
I found it: this happens when the DataLoader batchsize is more than 1.

This seems like a misunderstanding of the Flux vs TF/MXNet APIs. Could you please post the actual minimal code example you tried (with dummy data generation if need be) and also the Python code you’re trying to translate? I ask because the issue is most likely on the data preprocessing or loading side rather than in the model definition or training.

I have posted the simple code above.
I'm still working on different ways to do this.

Here is the minimal code:

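using Flux

# A single linear layer: 10 inputs -> 1 output, no activation
model = Chain(Dense(10, 1))

# Dummy data: integers drawn from 0-9, 10000 samples (one per column)
x = rand([1,2,3,4,5,6,7,8,9,0], (10, 10000))
y = rand([1,2,3,4,5,6,7,8,9,0], (1, 10000))

# Mean absolute error between prediction and target
function losss(x, y)
    return Flux.mae(model(x), y)
end

optimiser = Flux.Descent(0.01)
train_loader = Flux.Data.DataLoader((x, y), batchsize=1)

# Train for 10 epochs, printing the full-data loss at most once every 10 seconds
Flux.@epochs 10 Flux.train!(losss, params(model), train_loader, optimiser,
    cb = Flux.throttle(() -> println(losss(x, y)), 10))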

The problem arises after the first epoch: all the weights and biases in the model become NaN, so the model’s output and the loss become NaN as well, and training cannot continue.
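A quick way to confirm this symptom is to check the parameter arrays for non-finite values after each epoch. The following is a minimal sketch in plain Julia; `all_finite` is a hypothetical helper name (not part of Flux), and `W` and `b` are stand-ins for a layer's weights and biases.

```julia
# Hypothetical helper: true only if every array in `ps` is free of NaN and Inf.
all_finite(ps) = all(p -> all(isfinite, p), ps)

# Stand-ins for a layer's weight matrix and bias vector.
W = [0.1 0.2; 0.3 0.4]
b = [0.0, NaN]          # one bias entry has already turned into NaN

println(all_finite([W]))      # true
println(all_finite([W, b]))   # false
```

In the Flux version used in this thread, passing `params(model)` in place of the array list should work the same way, assuming it iterates over the parameter arrays.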

This could be due to an overflow: Descent() does not protect you from exploding gradients and the like. You could either clip your gradients, e.g. Flux.Optimise.Optimiser(ClipValue(10.0), Descent(0.01)), or use a momentum-based optimiser like ADAM().
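To make the clipping idea concrete, here is a hand-rolled sketch in plain Julia of elementwise gradient clipping followed by a plain gradient-descent step. It is not Flux's implementation; `clip_value` and `sgd_step` are hypothetical names, and the threshold 10.0 and learning rate 0.01 are simply the values mentioned above.

```julia
# Elementwise clipping: every gradient entry is clamped to [-thresh, thresh].
clip_value(g, thresh) = clamp.(g, -thresh, thresh)

# One plain gradient-descent step, with clipping applied to the gradient first.
sgd_step(w, g; lr = 0.01, thresh = 10.0) = w .- lr .* clip_value(g, thresh)

w = [0.5, -0.5]
g = [250.0, -3.0]        # one exploding entry, one ordinary entry

w_new = sgd_step(w, g)
println(clip_value(g, 10.0))   # the large entry is capped, the small one is untouched
println(w_new)
```

Without clipping, the first weight would move by 2.5 in a single step; with clipping it moves by at most lr * thresh = 0.1.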

I would also be surprised if this didn’t result in NaNs if you wrote something similar in PyTorch, mostly because there’s no activation function like sigmoid/tanh/etc. to reduce the size of the outputs. If you try reducing the magnitude of your inputs (either by dividing everything by, say 100 or using a rand variant that samples from [0, 1)), the network should be less susceptible to spitting out NaNs. You may also want to try a larger batch size: it will “smooth out” the gradient updates and thus also reduce the possibility of overflowing into NaNs.
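Both suggestions can be illustrated with a small sketch in plain Julia (standard library only, no Flux). It assumes a bare linear model w'x with an absolute-error loss, which is a simplification of the model in this thread; `mae_grad` is a hypothetical helper, and the batch size of 64 and the scale factor of 10 are arbitrary choices.

```julia
using Random, Statistics, LinearAlgebra

Random.seed!(0)
x = Float64.(rand(0:9, 10, 1000))   # same value range as the data in the thread
y = Float64.(rand(0:9, 1, 1000))
w = fill(0.1, 10)                   # arbitrary starting weights

# Gradient of |w'x - y| with respect to w, for a single sample.
mae_grad(w, x, y) = sign(dot(w, x) - y) .* x

# Per-sample gradients for the first 64 samples, raw and with data divided by 10.
raw    = [mae_grad(w, x[:, i], y[1, i]) for i in 1:64]
scaled = [mae_grad(w, x[:, i] ./ 10, y[1, i] / 10) for i in 1:64]

# A batch update uses the mean of the per-sample gradients.
batch_grad = mean(raw)

println("largest single-sample gradient norm: ", maximum(norm.(raw)))
println("norm of the 64-sample batch gradient: ", norm(batch_grad))
println("largest gradient entry after scaling: ", maximum(maximum.(abs, scaled)))
```

The batch gradient can never have a larger norm than the largest per-sample gradient, since it is their average, and scaling the inputs into [0, 0.9] bounds every gradient entry by 0.9.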