For loop having too many allocations

I am pretty new to Julia, and I am writing code to train a neural network. I have loaded my training dataset, but I find that the learning process takes too long. When I timed my loss function, it took 6 seconds:

function loss2()
    err = 0.0
    for ics = 1:n_cases
        for ipt = 1:n_pts
            err += sum((NN2([xyz[ipt,ics,:]..., uparams[ics,:]..., wa[ics]])
                        .- pp[ipt,ics,:]).^2)
        end
    end
    return err
end

@time loss2()
6.858366 seconds (42.79M allocations: 1.672 GiB, 4.22% gc time)

I believe this is the reason my training is very slow (it’s on the order of days!).

xyz is a 576x1296x3 array
uparams is a 1296x9 array
wa is a 1296 length vector
pp is a 576x1296x5 array.
n_pts = 576
n_cases = 1296

NN2 is defined as follows:

NN2 = Flux.Chain(
        Flux.Dense(13, 16, sigmoid),
        Flux.Dense(16, 5),
)

Could you please help me with suggestions to reduce the allocations within the loss functions, or general speed improvement tips?

You are making a lot of copies, both by slice indexing (xyz[ipt,ics,:]) and by splatting (...). If you can drop those, you can remove a lot of allocations. But that’s only the start of the optimizations you might make.
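A sketch of what dropping both might look like (the buffer and the function name fill_input! are mine, not from the post): fill one preallocated 13-element vector in place, using views instead of slices.

```julia
# Sketch: fill a preallocated 13-element input buffer in place.
# A plain slice like xyz[ipt, ics, :] copies; @views makes it a
# no-copy view, and .= writes into the existing buffer.
function fill_input!(input, xyz, uparams, wa, ipt, ics)
    @views input[1:3]  .= xyz[ipt, ics, :]
    @views input[4:12] .= uparams[ics, :]
    input[13] = wa[ics]
    return input
end
```

Reusing one buffer across iterations removes the per-iteration input-vector allocation, although the network call itself will still allocate its output.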


Check out the Performance Tips section of the Julia manual.

In particular, your sample code appears to be entirely based on accessing global variables in the inner loop, which is very slow. The very first performance tip is “Avoid global variables”, so I would certainly start there.
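For instance, a minimal sketch using the names from the post: passing everything in as arguments puts a function barrier between the hot loop and the untyped globals.

```julia
# Sketch: the same loop, but with all data passed as arguments
# instead of read from untyped globals, so the compiler can
# specialize the loop body on the concrete argument types.
function loss2(NN2, xyz, uparams, wa, pp, n_pts, n_cases)
    err = 0.0
    for ics = 1:n_cases, ipt = 1:n_pts
        err += sum(abs2, NN2([xyz[ipt, ics, :]; uparams[ics, :]; wa[ics]]) .- pp[ipt, ics, :])
    end
    return err
end
```

This still allocates per iteration, but type-stable arguments alone often give a large speedup on their own.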

Thanks for the suggestion. I removed the slices and splatting, and tried expanding the array indices as follows:

function loss2()
    err = 0.0
    for ics = 1:n_cases
        for ipt = 1:n_pts
            err += sum((NN2([xyz[ipt,ics,1], xyz[ipt,ics,2], xyz[ipt,ics,3],
                             uparams[ics,1], uparams[ics,2], uparams[ics,3],
                             uparams[ics,4], uparams[ics,5], uparams[ics,6],
                             uparams[ics,7], uparams[ics,8], uparams[ics,9], wa[ics]])
                        - [pp[ipt,ics,1], pp[ipt,ics,2], pp[ipt,ics,3],
                           pp[ipt,ics,4], pp[ipt,ics,5]]).^2)
        end
    end
    return err
end

It is around 1.5 times faster. Now I have:

@time loss2()
3.934574 seconds (36.53M allocations: 1.43 GiB, 6.02% gc time, 4.80% compilation time)

I still have a lot of allocations, albeit somewhat fewer than before. Is there any way to reduce this further?

Please, use triple backticks around your code, like this:
function loss2()
And indent your code. Then it has syntax coloring and is much more readable.


As @rdeits said: read the performance tips in the manual.

In particular: avoid global variables(!!), and remember that Julia arrays are column-major. Your code appears to be written in a row-major style.
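To illustrate the column-major point with a toy example (not the original data): the first index should vary fastest in the innermost loop, so that consecutive iterations touch adjacent memory.

```julia
# Column-major illustration: Julia stores A[1,1], A[2,1], A[3,1], ...
# contiguously, so looping i (the first index) innermost is cache-friendly;
# swapping the loop order strides through memory and is slower.
function colsum(A)
    s = 0.0
    for j in axes(A, 2), i in axes(A, 1)  # j outer, i inner
        s += A[i, j]
    end
    return s
end
```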

Also, even in your updated snippet you are allocating arrays on every iteration.

You can simplify the loss function if you reshape your input data in the following way

X1 = zeros(13, n_pts * n_cases)
X2 = zeros(5, n_pts * n_cases)
i = 0
for ics in 1:n_cases, ipt in 1:n_pts
    i += 1
    X1[:, i] = vcat(xyz[ipt, ics, :], uparams[ics, :], wa[ics])
    X2[:, i] = pp[ipt, ics, :]
end

loss_new(X1, X2, NN2) = sum(abs2, NN2(X1) - X2)

julia> @time loss_new(X1, X2, NN2);
  0.348413 seconds (15 allocations: 296.159 MiB, 4.08% gc time)

I’m not experienced with Flux or things like Chain, etc. But the last part here, y -> abs.(y), also creates a vector. Can this somehow be avoided, or is it fundamental to how these things work?

In general, there’s a ton of things that can be improved.

For the last part, I need a 5-element vector as output from the neural network. I then compare this with the value of pp within my loss function, and use the minimization of this loss function to train my neural network.

Could you please tell me how you defined the loss3(...) function? The issue I am facing with this implementation is that the NN2 function takes a 13-element vector as input, so I would again need to access the arrays X1 and X2 column-wise in my loss function. I defined the loss3 function as follows:

function loss3()
    ll = 0.0
    for i = 1:n_pts*n_cases
        ll += sum(abs2, NN2(X1[:,i]) - X2[:,i])
    end
    return ll
end

@time loss3()
  1.467642 seconds (22.39 M allocations: 1.146 GiB, 7.29% gc time)

This again gives me a lot of allocations. I am interested in knowing how you implemented it

Sorry, loss3 is the same as loss_new (I fixed it in the original answer).

The NN2 function takes a 13-element vector or a matrix of size 13×N as input. If the input is a vector of length 13, then the output of the first layer is a vector of length 16. However, if the input is a matrix of size 13×N, then the output of the first layer is a matrix of size 16×N. See the Dense function help for more details.
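A quick way to convince yourself of the sizes (assumes Flux is loaded; this mirrors the chain from the post):

```julia
using Flux

# Same architecture as in the thread: 13 inputs -> 16 hidden -> 5 outputs.
NN2 = Flux.Chain(Flux.Dense(13, 16, Flux.sigmoid), Flux.Dense(16, 5))

size(NN2(rand(Float32, 13)))       # a single 13-element input  -> (5,)
size(NN2(rand(Float32, 13, 100)))  # a batch of 100 columns     -> (5, 100)
```

The batched form is what makes loss_new fast: one matrix multiply per layer instead of n_pts * n_cases separate vector calls.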

Try this

  1. Use views
  2. Do not use global variables
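On point 1, a small illustration of the difference between a slice (which copies) and a view (which does not):

```julia
A = ones(4, 4)
s = A[:, 1]        # plain slice: allocates a new, independent vector
v = @view A[:, 1]  # view: references A's memory, no allocation of data

A[1, 1] = 99.0
s[1]  # 1.0  -- the copy is unaffected by the write
v[1]  # 99.0 -- the view sees the write to the parent array
```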

You’re doing it again. Please listen: don’t use global variables. (Several people have said this now, and it’s the number-one performance tip in the manual, which is mandatory reading.)


@VaclavMacha thanks for the clarification. I didn’t know that the Dense function could accept matrix inputs. Now I see that the number of allocations is low.

loss_new(X1, X2, NN2) = sum(abs2, NN2(X1) - X2)
@time loss_new(X1,X2,NN2)
  0.613954 seconds (15 allocations: 296.159 MiB, 41.58% gc time)

@DNF I understand now the problem with the global variables. Using the solution from @VaclavMacha, I am passing in all the relevant variables into the function.

However, I still have an error when running my training function. It would be great if you could help me fix this as well. The training of the neural network is as follows:

data = Iterators.repeated((), 1000)
opt = Flux.ADAM(0.01)
Flux.train!(loss_new(X1,X2,NN2), Flux.params(NN2), data, opt)

I get the following error:

ERROR: MethodError: objects of type Float64 are not callable
Maybe you forgot to use an operator such as *, ^, %, / etc. ?

I can’t seem to find the cause of this error.

Shouldn’t the first input to train! be a function?
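Concretely (a sketch using the names from above): loss_new(X1, X2, NN2) evaluates the loss immediately and hands train! a Float64, so wrap it in a zero-argument closure instead.

```julia
# loss_new(X1, X2, NN2) returns a Float64; train! then tries to *call*
# that number on each data item, which is the MethodError above.
# Passing a zero-argument closure lets train! evaluate the loss itself:
Flux.train!(() -> loss_new(X1, X2, NN2), Flux.params(NN2), data, opt)
```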


Ah yes, that works! I will now close this thread… Thanks everyone, the help has been really amazing! :slight_smile:


Just out of curiosity, what’s NN2? It seems to return an array as well, so there is also some room for accelerating that further if it can be fused with the sum.

NN2 is a Flux neural network. I need the output of the neural network as 5 different positive quantities, hence the 5-element vector output. These 5 values are then compared with the values in pp (or X2) to optimize the weights and biases of the network.


Another tip is to make all of your data Float32.
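Flux initializes layer weights as Float32 by default, so Float64 inputs get promoted on every call; converting the data once up front (a sketch, with a stand-in array) avoids that:

```julia
# One-time conversion: Flux layers default to Float32 parameters,
# so Float32 data avoids per-call promotion and extra copies.
X1 = rand(13, 10)    # stand-in for the real data (Float64 by default)
X1f = Float32.(X1)   # broadcast-convert once, up front

eltype(X1f)  # Float32
```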

If your network is sufficiently small and doesn’t use any super exotic operations, SimpleChains.jl (github.com/PumasAI/SimpleChains.jl) may be of interest as a low-allocation option.