Controlling memory allocation using an external struct as a placeholder

question

#1

Hello Guys!

This question is about a function that I will post below, but it is also a more general question: how can we efficiently deal with a function f that is called many times, where every call involves allocating some arrays (needed in order to compute the function's result)? I am assuming that the dimensions of the arrays are fixed, so at every call of f the size of all the auxiliary arrays is known.

I decided to define a type (a mutable struct nowadays), instantiated as opt, that holds a placeholder for every array the function f needs.

Every time I call f, I modify all the arrays of opt “in place” instead of allocating and discarding them inside the function.
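The pattern can be sketched like this (the names Workspace and step! are hypothetical, not the actual RBM code; on Julia ≥ 0.7 the in-place multiply is LinearAlgebra.mul!, spelled A_mul_B! on the 0.5/0.6 versions used in this thread):

```julia
using LinearAlgebra  # for mul! (Julia >= 0.7; on 0.6 use A_mul_B!)

# Hypothetical workspace type: every buffer f needs, allocated once.
mutable struct Workspace
    H::Matrix{Float64}
end

Workspace(n_hid::Int, batch::Int) = Workspace(zeros(n_hid, batch))

# The hot function only writes into preexisting buffers.
function step!(ws::Workspace, W::Matrix{Float64}, X::Matrix{Float64})
    mul!(ws.H, W, X)                  # W * X straight into ws.H
    ws.H .= 1 ./ (1 .+ exp.(-ws.H))   # fused, in-place sigmoid
    return ws.H
end
```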

Nevertheless, the function seems to allocate a lot of memory every time it is called, and I can’t see why.

The function that I benchmark is

function partial_fit!(rbm::RBM, X::Matrix,  lr::Real, opt::CDK)
    compute_grad!(rbm, X, opt)
    update_params!(rbm, opt, lr)    
end

All the real computation (and allocation) comes from

function compute_grad!(rbm::RBM, X::Matrix,  opt::CDK)

    T = eltype(rbm.vis_bias)
    batch_size = size(X)[2]
    
    # Perform gibbs sampling to compute the negative phase
    for k in 1:opt.K
        if k==1       
            opt.H .= sigmoid.(rbm.W * X .+ rbm.hid_bias)
            opt.V_hat .= sigmoid.(rbm.W'* opt.H .+ rbm.vis_bias) .> rand(T,rbm.n_vis, batch_size)
            opt.H_hat .= sigmoid.(rbm.W * opt.V_hat .+ rbm.hid_bias) 
        else
            opt.V_hat .= sigmoid.(rbm.W'* opt.H_hat .+ rbm.vis_bias) .> rand(T,rbm.n_vis, batch_size)
            opt.H_hat .= sigmoid.(rbm.W * opt.V_hat .+ rbm.hid_bias) 
        end               
    end   
   
    opt.grad_W =  (opt.H * X' .-  opt.H_hat * opt.V_hat')./ batch_size; 
    opt.grad_vis_bias = vec(sum((X .- opt.V_hat), 2))./ batch_size;
    opt.grad_hid_bias = vec(sum((opt.H .- opt.H_hat), 2))./ batch_size;
    
    opt.rec_error = sqrt(sum((X.-opt.V_hat).^2))
end

The results of the benchmark are:

BenchmarkTools.Trial: 
  memory estimate:  10.81 MiB
  allocs estimate:  103
  --------------
  minimum time:     16.863 ms (0.00% GC)
  median time:      18.426 ms (0.00% GC)
  mean time:        18.836 ms (5.53% GC)
  maximum time:     23.076 ms (9.37% GC)
  --------------
  samples:          265
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

In order to try to improve the code, I decided to rewrite compute_grad!. I wrote compute_grad_with_dot!, where every time I had an assignment A = something I rewrote it as A .= something, with all the needed memory preallocated inside opt.
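The distinction between the two forms, sketched on throwaway arrays (note the gotcha: `.=` reuses the destination, but the right-hand side can still allocate temporaries such as a matrix product):

```julia
B = rand(3, 3); C = rand(3, 3)
A = zeros(3, 3)
A .= B .+ C  # fused broadcast: fills the existing A, no temporary
A = B .+ C   # plain `=` rebinds A to a freshly allocated array
A .= B * C   # `.=` reuses A, but `B * C` still allocates the product first
```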

function partial_fit_with_dot!(rbm::RBM, X::Matrix,  lr::Real, opt::CDK)
    compute_grad_with_dot!(rbm, X, opt)
    update_params!(rbm, opt, lr)    
end

function compute_grad_with_dot!(rbm::RBM, X::Matrix,  opt::CDK)

    T = eltype(rbm.vis_bias)
    batch_size = size(X)[2]
    
    # Perform gibbs sampling to compute the negative phase
    for k in 1:opt.K
        opt.V_sampling .= rand(T, rbm.n_vis, batch_size)
        
        if k==1       
            opt.H .= sigmoid.(rbm.W * X .+ rbm.hid_bias)
            opt.V_hat .= sigmoid.(rbm.W'* opt.H .+ rbm.vis_bias) .> opt.V_sampling
            opt.H_hat .= sigmoid.(rbm.W * opt.V_hat .+ rbm.hid_bias) 
        else
            opt.V_hat .= sigmoid.(rbm.W'* opt.H_hat .+ rbm.vis_bias) .> opt.V_sampling
            opt.H_hat .= sigmoid.(rbm.W * opt.V_hat .+ rbm.hid_bias) 
        end               
    end   
   
    opt.grad_W .=  (opt.H * X' .-  opt.H_hat * opt.V_hat')./ batch_size; 
    opt.grad_vis_bias .= vec(sum((X .- opt.V_hat), 2))./ batch_size;
    opt.grad_hid_bias .= vec(sum((opt.H .- opt.H_hat), 2))./ batch_size;
    opt.rec_error = sqrt(sum((X .- opt.V_hat).^2))
end

The results are pretty much the same as before in terms of memory.

@benchmark partial_fit_with_dot!(rbm, X_train[:,1:500], 0.1, cdk)

BenchmarkTools.Trial: 
  memory estimate:  10.75 MiB
  allocs estimate:  87
  --------------
  minimum time:     17.265 ms (0.00% GC)
  median time:      20.427 ms (0.00% GC)
  mean time:        20.815 ms (5.42% GC)
  maximum time:     37.003 ms (0.00% GC)
  --------------
  samples:          239
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
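One more allocation source in the gradient lines above is `sum(A, 2)`, which builds a new array even when the result feeds a `.=`. A minimal sketch of the in-place alternative, `sum!` (shown with the post-0.6 `dims` spelling in the comment; on 0.6 the allocating form is `sum(A, 2)`):

```julia
A = rand(3, 4)
row_sums = zeros(3)  # preallocated destination, one entry per row
sum!(row_sums, A)    # reduces A over its second dimension, in place
# allocating equivalents: vec(sum(A, 2)) on 0.6, vec(sum(A, dims=2)) later
```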

The full code can be found here:

What is the most effective way to deal with these situations?

Thank you for your time, guys.


#2

Matrix multiplication is not in place. You may want to use the in-place version. You can use the function A_mul_B!, though I like to use InPlaceOps.jl for this:

@into! u = X * v

You’ll need a cache array to store the result.

Transposition allocates on v0.5 but not on v0.6.
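A minimal sketch of the in-place multiply with a cache array (shown with the modern spelling `mul!`; on the 0.5/0.6 versions discussed here the equivalent is `A_mul_B!`):

```julia
using LinearAlgebra  # mul! on Julia >= 0.7; A_mul_B! on 0.6

X = rand(4, 3); v = rand(3, 2)
u = zeros(4, 2)  # preallocated cache for the result
mul!(u, X, v)    # computes X * v directly into u, no product array
# Julia 0.6 spelling: A_mul_B!(u, X, v); with InPlaceOps.jl: @into! u = X * v
```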


#3

Thank you for your response, Chris.

I just tried your approach, but the memory allocation seems to be about the same:

@benchmark partial_fit_with_dot2!(rbm, X_train[:,1:500], 0.1, cdk)

BenchmarkTools.Trial: 
  memory estimate:  10.63 MiB
  allocs estimate:  81
  --------------
  minimum time:     17.312 ms (0.00% GC)
  median time:      20.193 ms (0.00% GC)
  mean time:        20.519 ms (5.42% GC)
  maximum time:     33.541 ms (9.82% GC)
  --------------
  samples:          243
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

This is the function changed to use A_mul_B!:


function compute_grad_with_dot2!(rbm::RBM, X::Matrix,  opt::CDK)

    T = eltype(rbm.vis_bias)
    batch_size = size(X)[2]
    
    # Perform gibbs sampling to compute the negative phase
    for k in 1:opt.K
        opt.V_sampling .= rand(T, rbm.n_vis, batch_size)
        if k==1       
            opt.H .= sigmoid.(A_mul_B!(opt.H_aux, rbm.W, X) .+ rbm.hid_bias)
            opt.V_hat .= sigmoid.(rbm.W'* opt.H .+ rbm.vis_bias) .> opt.V_sampling
            opt.H_hat .= sigmoid.(A_mul_B!(opt.H_aux, rbm.W, opt.V_hat)  .+ rbm.hid_bias) 
        else
            opt.V_hat .= sigmoid.(rbm.W'* opt.H_hat .+ rbm.vis_bias) .> opt.V_sampling
            opt.H_hat .= sigmoid.(A_mul_B!(opt.H_aux, rbm.W, opt.V_hat)  .+ rbm.hid_bias) 
        end        
    end   
   
    opt.grad_W .=  (opt.H * X' .-  opt.H_hat * opt.V_hat')./ batch_size; 
    opt.grad_vis_bias .= vec(sum((X .- opt.V_hat), 2))./ batch_size;
    opt.grad_hid_bias .= vec(sum((opt.H .- opt.H_hat), 2))./ batch_size;
    opt.rec_error = sqrt(sum((X .- opt.V_hat).^2))
end

#4

Well, let’s find some more allocations.

The line `opt.V_sampling .= rand(T, rbm.n_vis, batch_size)` will allocate a new array before copying its values into opt.V_sampling. You’ll want to use rand!.
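A quick sketch of the difference (`rand!` fills an existing array instead of allocating a fresh one):

```julia
using Random  # on Julia 0.6, rand! lived in Base

V = zeros(2, 3)
rand!(V)  # fills the existing V with uniform [0, 1) samples; no new array
```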


#5

That certainly helped.

I still feel the function allocates too much memory, though. Is there any tool that could highlight the lines that might allocate memory?

It would be quite cool, something like what Cython has, which colors the lines that are not well optimized when you compile a function.

function compute_grad_with_dot2!(rbm::RBM, X::Matrix,  opt::CDK)

    T = eltype(rbm.vis_bias)
    batch_size = size(X)[2]
    
    # Perform gibbs sampling to compute the negative phase
    for k in 1:opt.K
        rand!(opt.V_sampling)
        
        if k==1       
            opt.H .= sigmoid.(A_mul_B!(opt.H_aux, rbm.W, X) .+ rbm.hid_bias)
            opt.V_hat .= sigmoid.(rbm.W'* opt.H .+ rbm.vis_bias) .> opt.V_sampling
            opt.H_hat .= sigmoid.(A_mul_B!(opt.H_aux, rbm.W, opt.V_hat)  .+ rbm.hid_bias) 
        else
            opt.V_hat .= sigmoid.(rbm.W'* opt.H_hat .+ rbm.vis_bias) .> opt.V_sampling
            opt.H_hat .= sigmoid.(A_mul_B!(opt.H_aux, rbm.W, opt.V_hat)  .+ rbm.hid_bias) 
        end        
    end   
   
    opt.grad_W .=  (opt.H * X' .-  opt.H_hat * opt.V_hat')./ batch_size; 
    opt.grad_vis_bias .= vec(sum((X .- opt.V_hat), 2))./ batch_size;
    opt.grad_hid_bias .= vec(sum((opt.H .- opt.H_hat), 2))./ batch_size;
    opt.rec_error = sqrt(sum((X .- opt.V_hat).^2))
end

Benchmark:

BenchmarkTools.Trial: 
  memory estimate:  7.64 MiB
  allocs estimate:  82
  --------------
  minimum time:     16.258 ms (0.00% GC)
  median time:      17.947 ms (0.00% GC)
  mean time:        18.670 ms (3.58% GC)
  maximum time:     29.965 ms (0.00% GC)
  --------------
  samples:          267
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

#6

The `rbm.W' * opt.H` products can be an At_mul_B!, and

the `opt.H * X'` and `opt.H_hat * opt.V_hat'` products can be an A_mul_Bt!.
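For reference, a sketch of the in-place transposed products (shown with the modern spelling, where `W'` is a lazy, non-copying adjoint passed to `mul!`; the 0.6 spellings are noted in the comments):

```julia
using LinearAlgebra  # mul! with a lazy adjoint on Julia >= 0.7

W = rand(3, 4); H = rand(3, 5)
out = zeros(4, 5)
mul!(out, W', H)  # W' * H into out; 0.6 spelling: At_mul_B!(out, W, H)
# For A * B' the 0.6 spelling is A_mul_Bt!(C, A, B).
```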

Yes, see this part of the docs.

http://docs.julialang.org/en/stable/manual/profile/#man-track-allocation
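In short, the workflow from those docs looks like this (the script name here is a placeholder for your own file):

```shell
# Run with per-line allocation tracking of user code only:
julia --track-allocation=user script.jl
# Julia writes a *.mem file next to each source file, listing the bytes
# allocated on every line. To exclude compilation noise, exercise the code
# once, call Profile.clear_malloc_data(), then run the workload again.
```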