Memory allocation in matrix multiplication

I am using matrix multiplication inside large loops and I wonder how I can reduce memory allocation. For example, the following (useless) code

function run_cycle!(R, A, B)
    for i in 1:1000
        R = A*B
    end
end

function test_timing()
    A = rand(5, 5)
    B = rand(5, 5)
    R = Array{Float64,2}(undef, 5, 5)
    @time run_cycle!(R, A, B)
end

test_timing()

returns

0.000682 seconds (1000 allocations: 281.250 KiB)

Is there any way to eliminate this allocation? Before, in element-wise operations with matrices, I used to use macro @. to eliminate memory allocation. However, although allocation is eliminated, in this case @. R = A*B does something totally different.

1 Like

This works:

using LinearAlgebra
mul!(R, A, B)

If your arrays are small (I’ll use 5x5 as an example), then odds are a different library will give you better performance.
Two options include StaticArrays.jl, a popular choice that uses stack allocated arrays:

using StaticArrays, LinearAlgebra, BenchmarkTools
As = @SMatrix rand(5,5);
Bs = @SMatrix rand(5,5);
Aa = Array(As); Ba = Array(Bs); Ca = similar(Aa, size(Aa,1), size(Ba,2));

This yields:

julia> @btime As2 * $Bs setup = (As2 = $(Ref(As))[])
  21.886 ns (0 allocations: 0 bytes)
5×5 SArray{Tuple{5,5},Float64,2,25} with indices SOneTo(5)×SOneTo(5):
 1.15682   0.961188  1.08718   0.848822  1.10338
 0.584637  1.26248   0.896451  1.05197   0.892125
 1.22157   1.71552   1.4453    1.5288    1.28231
 1.40697   2.4277    1.80974   1.78083   2.07344
 0.963769  1.29204   1.19697   1.30232   1.07738

julia> @btime mul!($Ca, $Aa, $Ba)
  136.604 ns (0 allocations: 0 bytes)
5×5 Array{Float64,2}:
 1.15682   0.961188  1.08718   0.848822  1.10338
 0.584637  1.26248   0.896451  1.05197   0.892125
 1.22157   1.71552   1.4453    1.5288    1.28231
 1.40697   2.4277    1.80974   1.78083   2.07344
 0.963769  1.29204   1.19697   1.30232   1.07738

Because static arrays are stack allocated, the Julia compiler cheats the benchmark if you just do @btime $As * $Bs.

My library PaddedMatrices.jl isn’t well documented and is still missing a lot of features, but aims to be much faster than StaticArrays.jl:

using PaddedMatrices
Ap = FixedSizeMatrix{5,5,Float64}(undef) .= Aa;
Bp = FixedSizeMatrix{5,5,Float64}(undef) .= Ba;
Cp = FixedSizeMatrix{5,5,Float64}(undef);

yields

julia> @btime mul!($Cp, $Ap, $Bp)
  5.734 ns (0 allocations: 0 bytes)
5×5 FixedSizeArray{Tuple{5,5},Float64,2,Tuple{1,5},32,0,0,false} with indices 1:5×1:5:
 1.15682   0.961188  1.08718   0.848822  1.10338
 0.584637  1.26248   0.896451  1.05197   0.892125
 1.22157   1.71552   1.4453    1.5288    1.28231
 1.40697   2.4277    1.80974   1.78083   2.07344
 0.963769  1.29204   1.19697   1.30232   1.07738
8 Likes

thank you very much :slight_smile: