Reducing allocations and run time

Hi,
I’m running a program and trying to speed it up. I don’t get the logic of views, or any other way to reduce the number of allocations. A simplified example is

aa = [4 3; 4 5]
bb = [1 2;6 3]
cc = zeros(2,2,2)
@time @views for n=1:2 for m=1:2
    cc[:,n,m] = aa[n,:].*bb[:,m]
end end

which gives

0.000057 seconds (36 allocations: 1.812 KiB)

in a bigger example the number of allocations is much larger. How can views bring allocations to 0?

Thanks

You need to do the assignment in-place with .= because otherwise the block of memory resulting from the multiplication is allocated before it’s written to the specified range in your result array.

2 Likes

Not sure, if I use

cc[:,n,m] .=

I get

0.000059 seconds (40 allocations: 1.812 KiB)

Ok I assumed you were not running this at global scope, but it seems you are? Put your code in a function so that the types of all variables are known and constant.

This is one of the most common reasons for superfluous allocations and bad performance, if you reference variables that are just defined at the top level. You could always reassign their values if they are not marked const, so the compiled code will assume they can have any type. Putting stuff into functions, marking variables const or using let blocks are ways to mitigate this

3 Likes

An additional speed up by using also @inbounds

I had created a bad example, outside a function. This is a similar example where I put the code into a function

Tk=10
Tp=16
S=104
N=336
par1 = 2
mat1 = rand(Tk,S,4)
mat2 = rand(Tp,N,4)
mat3 = rand(S,N,4)
include("Untitled.jl")

function myfun(Tk,Tp,S,N,i,mat1,mat2,mat3,par1)
    reset_timer!()

    mat4 = mat3.^par1
    y = zeros(Tk,Tp,S,N)

    @timeit "mytime" for n=1:N for s in 1:S#@timeit "create indV"
              y[:,:,s,n] = mat1[:,s,i]*mat2[:,n,i]'/mat4[s,n,i]

    end end
    print_timer()
    return y
end

myfun(Tk,Tp,S,N,1,mat1,mat2,mat3,par1)

The timer still gives

                          Time                   Allocations      
                   ──────────────────────   ───────────────────────
 Tot / % measured:     40.4ms / 86.4%            148MiB / 70.5%    

 Section   ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────
 mytime         1   34.9ms   100%  34.9ms    105MiB  100%    105MiB

with and without .=

I guess a more general question is how to generate 0 allocations (or close to) in this example.

Now you’re not using views it seems. Also your example seems to differ a bit in that you’re not doing broadcasted multiplication anymore but some sort of matrix multiplication, which will allocate unless you use preexisting memory for the result. I think that’s what mul! does but I’m not much of a linalg person. Basically think about whether your expressions create intermediary arrays. If yes, try to write those results into already existing memory with mutating functions or the previously mentioned broadcasted assignment. In your example you’re creating copies with your indexing syntax, then multiplying and dividing and therefore creating a bunch of intermediary arrays.

1 Like

Some suggestions so that it will be easier to help you:

  1. provide an actual minimal working example. Your code does not run (clearly, because of the “Untitled.jl” file).

  2. Put these two lines outside the function and pass the matrices as parameters as well in the tests. Thes lines obviously allocate, and it will be better to have them outside the tests:

  1. I am not sure what @timeit actually does (have not seen it here before). The most usual package for benchmarking is BenchmarkTools, and you could use @btime on the call to myfun with the new y and mat4 parameters, to work on those allocations.
1 Like

It’s not identical, but a MWE with 3 options:

using BenchmarkTools, LinearAlgebra, Random

function myfun!(y, mat1, mat2, iCase)
    Tk, S = size(mat1);
    Tp, N = size(mat2);

    if iCase == 1
        yPart = zeros(Tk, Tp);
    end
    @inbounds @views for n=1:N 
        for s in 1:S
            if iCase == 1
                mul!(yPart, mat1[:,s], mat2[:,n]');
                y[:,:,s,n] .= yPart;
            elseif iCase == 2
                y[:,:,s,n] .= mat1[:,s]*mat2[:,n]';
            elseif iCase == 3
                for i1 = 1 : Tk
                    for i2 = 1 : Tp
                        y[i1,i2,s,n] = mat1[i1,s]*mat2[i2,n];
                    end
                end
            end
        end
    end
    return nothing
end

function bmark(iCase)
    rng = MersenneTwister(123)
    Tk=10
    Tp=16
    S=104
    N=336
    mat1 = rand(rng, Tk,S)
    mat2 = rand(rng, Tp,N)
    y = zeros(Tk,Tp,S,N)
    b = @benchmarkable myfun!($y, $mat1, $mat2, $iCase);
    run(b)
end

Unrolling the matrix multiplication avoids allocations and wins in terms of speed. But mul! helps some.

julia> bmark(1)
BenchmarkTools.Trial:
  memory estimate:  1.33 KiB
  allocs estimate:  1
  --------------
  minimum time:     8.877 ms (0.00% GC)
  median time:      9.118 ms (0.00% GC)
  mean time:        9.320 ms (0.00% GC)
  maximum time:     13.458 ms (0.00% GC)
  --------------
  samples:          537
  evals/sample:     1

julia> bmark(2)
BenchmarkTools.Trial:
  memory estimate:  45.32 MiB
  allocs estimate:  34945
  --------------
  minimum time:     11.003 ms (0.00% GC)
  median time:      12.034 ms (0.00% GC)
  mean time:        18.336 ms (32.68% GC)
  maximum time:     76.342 ms (84.29% GC)
  --------------
  samples:          273
  evals/sample:     1

julia> bmark(3)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.754 ms (0.00% GC)
  median time:      4.003 ms (0.00% GC)
  mean time:        4.180 ms (0.00% GC)
  maximum time:     11.018 ms (0.00% GC)
  --------------
  samples:          1196
  evals/sample:     1

I’m sure there must be a function that implements A B' without allocating, but I don’t know what it is.

1 Like