 # Flux.jl vanilla ANN loss goes to NaN with mini batch

I’m building a vanilla NN with 1 or 2 hidden layers for a regression problem that’s basically as follows:
I take N sets of known values (x_1, y_1), ..., (x_K, y_K) and feed them to two solvers: an approximation solver that returns N sets of (a_1, b_1), ..., (a_K, b_K), and an exact solver that returns N sets of true values (c_1, d_1), ..., (c_K, d_K). Here a_i \approx c_i and b_i \approx d_i for i = 1, 2, ..., K, but the approximations can have large inaccuracies. I’m building an ANN that takes an N x (K*4) input [x_1, y_1, a_1, b_1, x_2, y_2, a_2, b_2, ...] and outputs [c_1', d_1', c_2', d_2', ...], which should be better approximations to the true (c, d) values than the original (a, b) approximations.
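For concreteness, here is a plain-Julia sketch of the data layout and a toy two-layer forward pass with an MSE loss. All sizes and variable names here are made up for illustration; the real model uses Flux's `Dense` layers rather than this hand-rolled version.

```julia
using Statistics

# Toy sizes standing in for the real problem (K*4 inputs, K*2 outputs, N samples)
K, N = 3, 5
in_dim, out_dim, hidden = 4K, 2K, 8

# Input columns are [x_1, y_1, a_1, b_1, x_2, y_2, a_2, b_2, ...]
trainData   = randn(Float32, in_dim, N)   # shape K*4 x N, one sample per column
trainTarget = randn(Float32, out_dim, N)  # the true [c_1, d_1, c_2, d_2, ...]

# A hand-rolled dense layer: σ.(W*x .+ b), mimicking Flux's Dense
dense(W, b, σ) = x -> σ.(W * x .+ b)
relu(x) = max(x, zero(x))
W1, b1 = 0.1f0 * randn(Float32, hidden, in_dim), zeros(Float32, hidden)
W2, b2 = 0.1f0 * randn(Float32, out_dim, hidden), zeros(Float32, out_dim)
model = dense(W2, b2, identity) ∘ dense(W1, b1, relu)

mse(ŷ, y) = mean(abs2, ŷ .- y)            # mean squared error loss
lossval = mse(model(trainData), trainTarget)
```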

I’m using the ADAM optimizer, the loss function is mean squared error, and the model is simply two `Dense` layers chained together. All worked well, except with a large input size where K*4 \approx 6000 and N = 10000: training the full batch on the GPU (an Nvidia GTX 960M, pretty weak, but that’s all I have for now) gave me `OutOfMemoryError()`. The maximum N for which I can run full-batch training without the error is about 2500. So I decided to try mini-batching the input, feeding one mini batch to the GPU per iteration, as follows:

```julia
randIdx = collect(1:size(trainData, 2))        # trainData has shape K*4 x N
numBatches = floor(Int, size(trainData, 2) / batch_size)
epochTrainLoss, epochTrainAcc = 0.0, 0.0
for epoch = 1:epochs
    println("epoch: ", epoch)
    Random.shuffle!(randIdx)                   # shuffle the training set
    i = 1
    for j = 1:numBatches
        println(j)
        batchData = trainData[:, randIdx[i:batch_size]] |> gpu
        batchTarget = trainTarget[:, randIdx[i:batch_size]] |> gpu
        Flux.train!(loss, Flux.params(model), [(batchData, batchTarget)], opt)
        epochTrainLoss += Tracker.data(loss(batchData, batchTarget))
        epochTrainAcc += Tracker.data(accuracy(batchData, batchTarget))
        i += batch_size
    end
    push!(trainLoss, epochTrainLoss / numBatches)
    push!(trainAcc, epochTrainAcc / numBatches)
    push!(valLoss, Tracker.data(loss(valData, valTarget)))
    push!(valAcc, Tracker.data(accuracy(valData, valTarget)))
    epochTrainLoss, epochTrainAcc = 0.0, 0.0   # reset epoch accumulators
end
```

However, now in the second iteration of the first epoch, I get a `Loss is NaN` error:

```
julia> include("case_general.jl")
epoch: 1
1
2.11837100982666
2
Stacktrace:
 losscheck(::Tracker.TrackedReal{Float32}) at C:\Users\me\.julia\packages\Tracker\RRYy6\src\back.jl:155
 gradient_(::getfield(Flux.Optimise, Symbol("##14#20")){getfield(Main, Symbol("#loss#256"))}, ::Tracker.Params) at C:\Users\me\.julia\packages\Tracker\RRYy6\src\back.jl:98
 #gradient#24(::Bool, ::Function, ::Function, ::Tracker.Params) at C:\Users\me\.julia\packages\Tracker\RRYy6\src\back.jl:164
 macro expansion at C:\Users\me\.julia\packages\Flux\qXNjB\src\optimise\train.jl:71 [inlined]
 macro expansion at C:\Users\me\.julia\packages\Juno\TfNYn\src\progress.jl:124 [inlined]
 #train!#12(::getfield(Flux.Optimise, Symbol("##16#22")), ::Function, ::Function, ::Tracker.Params, ::Array{Tuple{CuArray{Float32,2},CuArray{Float32,2}},1}, ::ADAM) at C:\Users\me\.julia\package\train.jl:69
 train!(::Function, ::Tracker.Params, ::Array{Tuple{CuArray{Float32,2},CuArray{Float32,2}},1}, ::ADAM) at C:\Users\me\.julia\packages\Flux\qXNjB\src\optimise\train.jl:67
 #mlp#254(::Int64, ::Function, ::String, ::Array{Float64,2}, ::Array{Float64,2}, ::Float64, ::Int64, ::Int64, ::Int64) at F:\work\large cases\case_general.jl:156
 (::getfield(Main, Symbol("#kw##mlp")))(::NamedTuple{(:K2,),Tuple{Int64}}, ::typeof(mlp), ::String, ::Array{Float64,2}, ::Array{Float64,2}, ::Float64, ::Int64, ::Int64, ::Int64) at .\none:0
 macro expansion at .\util.jl:156 [inlined]
 main(::Int64) at F:\work\large cases\case_general.jl:221
 top-level scope at none:0
 include at .\boot.jl:326 [inlined]
```
I looked at this link on NaN loss and tried lowering my learning rate to something extremely small (1e-15) and increasing my batch size to 2500 (so that it should behave the same as full-batch training with 2500 samples), but neither helped; only the first-iteration loss changed slightly (around 2.1). I also checked my input and target arrays and made sure none of the elements is `NaN` or `Inf`. Any advice on how I can track down where the loss value is exploding?
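For the sanity check on the input and target arrays, one quick way to scan for non-finite entries is a tiny helper like this (the name `allfinite` is my own, not a Flux function):

```julia
# True exactly when every entry of the array is finite (no NaN, no Inf)
allfinite(A) = all(isfinite, A)

clean = Float32[1.0 2.0; 3.0 4.0]
dirty = Float32[1.0 NaN; Inf 4.0]

allfinite(clean)   # true
allfinite(dirty)   # false
```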

**Update:** Yep, I was an idiot and forgot to advance the index in `trainData[:, randIdx[i:batch_size]]` — the range should be `randIdx[i:i+batch_size-1]`. Now I just need to deal with the `OutOfMemoryError()`.
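For anyone hitting the same off-by-one, here is a sketch (with toy sizes) of the corrected manual windows, alongside `Iterators.partition` from Julia's standard library, which builds the same batches and also keeps a trailing partial batch:

```julia
N, batch_size = 10, 4
randIdx = collect(1:N)   # shuffle! this at the start of each epoch in the real loop

# Corrected manual windows: i:i+batch_size-1 instead of i:batch_size
# (full batches only, matching numBatches = floor(N / batch_size))
manual = [randIdx[i:i+batch_size-1] for i in 1:batch_size:N-batch_size+1]

# Equivalent via Iterators.partition, which also yields the partial last batch
auto = collect(Iterators.partition(randIdx, batch_size))

manual   # [[1, 2, 3, 4], [5, 6, 7, 8]]
auto     # [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10]]
```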