I started from minimal code again and it is stuck training after the first round.
I guess I have to go back to MXNet and TF for now.
I found it: when the DataLoader batchsize is more than 1, it gets like this.
This seems like a misunderstanding of Flux vs TF/MXNet APIs. Could you please post the actual minimal code example you tried (with dummy data generation if need be) and also the Python code you're trying to translate? I ask because the issue most likely lies on the data preprocessing or loading side rather than in the model definition or training.
I have posted the simple code above.
I'm still working on different ways to do this.
Here is the minimal code:
using Flux

# Single linear layer: 10 inputs -> 1 output, no activation.
model = Chain(Dense(10, 1))

# Dummy integer data: 10000 samples with 10 features each, targets drawn from 0:9.
x = rand([1, 2, 3, 4, 5, 6, 7, 8, 9, 0], (10, 10000))
y = rand([1, 2, 3, 4, 5, 6, 7, 8, 9, 0], (1, 10000))

# Mean absolute error between the model output and the targets.
function losss(x, y)
    return Flux.mae(model(x), y)
end

optimiser = Flux.Descent(0.01)
train_loader = Flux.Data.DataLoader((x, y), batchsize=1)

Flux.@epochs 10 Flux.train!(losss, params(model), train_loader, optimiser,
    cb = Flux.throttle(() -> println(losss(x, y)), 10))
The problem arises after the first epoch: all the weights and biases in the model become NaN, so the model's output becomes NaN, as does the loss, and training fails to continue…
This could be due to an overflow: Descent() does not protect you from exploding gradients and such. You could either clip your gradients, e.g. Flux.Optimise.Optimiser(ClipValue(10.0), Descent(0.01)), or use a momentum-based optimiser like ADAM().
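For concreteness, here is a rough sketch of both options, reusing losss, model, and train_loader from your snippet above. It is untested and assumes a Flux version where Flux.Optimise.Optimiser, ClipValue, and ADAM are available (which matches the Flux.Data.DataLoader / @epochs API you are using):

using Flux

# Option 1: clip each gradient component to [-10, 10] before the plain
# gradient-descent step, so a single huge gradient cannot blow up the weights.
clipped_opt = Flux.Optimise.Optimiser(Flux.Optimise.ClipValue(10.0), Flux.Descent(0.01))
Flux.@epochs 10 Flux.train!(losss, Flux.params(model), train_loader, clipped_opt)

# Option 2: switch to an adaptive, momentum-based optimiser instead of plain Descent.
adam_opt = Flux.ADAM()   # default learning rate 0.001
Flux.@epochs 10 Flux.train!(losss, Flux.params(model), train_loader, adam_opt)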
I would also be surprised if this didn't result in NaNs if you wrote something similar in PyTorch, mostly because there's no activation function like sigmoid/tanh/etc. to reduce the size of the outputs. If you reduce the magnitude of your inputs (either by dividing everything by, say, 100, or by using a rand variant that samples from [0, 1)), the network should be less susceptible to spitting out NaNs. You may also want to try a larger batch size, as in the sketch below: it will "smooth out" the gradient updates and thus also reduce the possibility of overflowing into NaNs.
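Something like the following (again untested, with names reused from your snippet; sampling the dummy data from [0, 1) with rand(Float32, ...) is my assumption about what you want):

# Rescaled dummy data: rand without a collection samples uniformly from [0, 1),
# and Float32 matches the default element type of Dense's weights.
x = rand(Float32, 10, 10000)
y = rand(Float32, 1, 10000)

# A larger batch averages the per-sample gradients, smoothing the updates.
train_loader = Flux.Data.DataLoader((x, y), batchsize=64, shuffle=true)

Flux.@epochs 10 Flux.train!(losss, Flux.params(model), train_loader, Flux.ADAM(),
    cb = Flux.throttle(() -> println(losss(x, y)), 10))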