Avoid allocation of a Flux model on the CPU

Hi everyone,
I have to execute a Flux model inside a Monte Carlo simulation. I am currently working on the CPU, and the problem is that calling `model(configuration)` at each Monte Carlo step allocates memory. Since this runs in a non-parallelizable loop (each iteration depends on the result of the previous one, so using batches is not a solution), I get tons of allocations that make the GC kick in continuously.
By profiling the allocations, I get the following flame graph

It seems that most of the allocation happens in the convolutional layers of the neural network.

The model is defined as

model = Chain(x -> 2x .- 1,
              x -> reshape(x, (8, 8, 1, 1)),
              Conv((3, 3), 1 => 4, tanh; pad=1),
              Conv((3, 3), 4 => 8, tanh; pad=1),
              x -> reshape(x, :),
              Dense(8 * 64, n_observables, tanh))

And the function `monte_carlo_run!` calls the model repeatedly inside a for loop.
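For context, the loop in question looks roughly like this. This is only a minimal sketch: `monte_carlo_run!` is the real function's name, but the stand-in `update` step and toy model are illustrative, not the actual code.

```julia
# Hypothetical sketch of the sequential Monte Carlo loop: each iteration
# depends on the previous one, so the model calls cannot be batched.
# `update` stands in for the real MC acceptance/update step.
function monte_carlo_run!(model, config, nsteps; update = (c, o) -> c .+ o)
    for _ in 1:nsteps
        obs = model(config)          # allocates fresh arrays on every call
        config = update(config, obs)
    end
    return config
end

# Stand-in model: any callable works, e.g. a Flux Chain in the real code.
toy_model = x -> 0.5 .* x
monte_carlo_run!(toy_model, ones(4), 3)
```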

How can I avoid the allocations inside the convolutional layers?


If it is only for inference, you can try AllocArrays.jl (GitHub - ericphanson/AllocArrays.jl: arrays that use a dynamically-scoped allocator). If it works, please report the results.


If you don’t have to train the network, you can define your own custom Conv layer that calls the non-allocating NNlib.conv!(y, x, w, cdims). See here:
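A sketch of such a layer, assuming the first conv of the model above. The `FixedConv` name and its fields are my own for illustration, not a Flux API; weights, bias, and the output buffer are allocated once, and `NNlib.conv!` writes into the buffer on every call.

```julia
using NNlib

struct FixedConv{W,B,Y,C}
    w::W        # (kh, kw, cin, cout) weights
    b::B        # bias reshaped to (1, 1, cout, 1) for broadcasting
    y::Y        # preallocated output buffer
    cdims::C    # precomputed convolution dimensions
end

function FixedConv(w::AbstractArray{T,4}, b::AbstractVector, xsize; pad=1) where {T}
    cdims = DenseConvDims(xsize, size(w); padding=pad)
    y = zeros(T, NNlib.output_size(cdims)..., size(w, 4), xsize[end])
    FixedConv(w, reshape(b, 1, 1, length(b), 1), y, cdims)
end

function (c::FixedConv)(x)
    NNlib.conv!(c.y, x, c.w, c.cdims)
    c.y .= tanh.(c.y .+ c.b)   # in-place bias + activation, no allocation
    return c.y                 # caution: the same buffer is reused each call
end

# Usage, mirroring the first Conv((3,3), 1=>4) of the model above:
w = randn(Float32, 3, 3, 1, 4)
layer = FixedConv(w, zeros(Float32, 4), (8, 8, 1, 1); pad=1)
x = rand(Float32, 8, 8, 1, 1)
layer(x)   # writes the (8, 8, 4, 1) result into the preallocated buffer
```

Since the output buffer is reused, this only works when the layer's result is consumed before the next call, which is the case in a sequential Monte Carlo loop.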


Wow, AllocArrays seems to dramatically reduce the memory allocated by Flux with almost zero effort. Some memory is still allocated, but it is reduced by a factor of about 50. I will implement it in the actual program and see how it goes.

using Bumper, Flux, AllocArrays

x = zeros(Float32, 32, 32, 1, 1)
x_new = AllocArray(x)   # wrapped input, so internal allocations go through the bump allocator

model = Chain(Conv((3, 3), 1 => 4, relu; pad=1),
              x -> reshape(x, :),
              Dense(4 * 32 * 32 => 4))


function simple_run(model, data, iterations)
    result = model(data)
    tmp_input = similar(data)

    for i in 2:iterations
        tmp_input .= data
        tmp_input .+= i
        result .+= model(tmp_input)
    end
    return result
end

function bumper_run(model, data, iterations)
    b = UncheckedBumperAllocator(2^20)  # 1 MiB bump allocator
    result = model(data)
    tmp_input = similar(data)
    with_allocator(b) do
        for i in 2:iterations
            tmp_input .= data
            tmp_input .+= i
            result .+= model(tmp_input)
            reset!(b)  # free everything bump-allocated this iteration
        end
    end
    return result
end

@time simple_run(model, x, 1)
#  0.000129 seconds (58 allocations: 76.188 KiB)

@time simple_run(model, x, 10000)
#  0.501996 seconds (560.00 k allocations: 703.434 MiB, 1.98% gc time)

@time bumper_run(model, x_new, 1)
#  0.000151 seconds (79 allocations: 1.075 MiB)

@time bumper_run(model, x_new, 10000)
#  0.554891 seconds (580.02 k allocations: 37.540 MiB, 1.21% gc time)