Why fewer memory allocations do not necessarily mean higher speed

Hi all

I’ve always been told to reduce redundant memory allocations, especially when the operation in question is executed many times; and since loops in Julia are fast, vectorised code is not always necessary.

However, in practice I’ve found this is not always the case. For example, I calculated the matrix product transpose(X) * Diagonal(y) * X, where y is a vector and X a matrix, and stored the result in a pre-allocated matrix D:

function func1(X, y)
    transpose(X) * Diagonal(y) * X   # allocates intermediates and the result
end

function func2!(X, y, D)
    Xt = transpose(X)
    fill!(D, 0)
    for n = 1:size(X, 1)
        # accumulate the rank-1 update y[n] * x_n * x_n' in place
        @inbounds @views D .+= Xt[:, n] .* transpose(Xt[:, n]) .* y[n]
    end
    D
end

And the results are

julia> D = Matrix{Float64}(undef, 500, 500);
julia> @benchmark D .= func1(randn(1000, 500), randn(1000))
BenchmarkTools.Trial: 
  memory estimate:  9.54 MiB
  allocs estimate:  10
  --------------
  minimum time:     11.245 ms (0.00% GC)
  median time:      15.246 ms (0.00% GC)
  mean time:        21.155 ms (6.78% GC)
  maximum time:     163.981 ms (9.29% GC)
  --------------
  samples:          236
  evals/sample:     1

julia> @benchmark func2!(randn(1000, 500), randn(1000), $D)
BenchmarkTools.Trial: 
  memory estimate:  3.82 MiB
  allocs estimate:  3
  --------------
  minimum time:     353.599 ms (0.00% GC)
  median time:      383.437 ms (0.00% GC)
  mean time:        506.716 ms (0.00% GC)
  maximum time:     1.159 s (0.00% GC)
  --------------
  samples:          11
  evals/sample:     1

I was wondering what made func2! slower than func1, even though the former yielded fewer memory allocations. And more generally, is there a rule of thumb for deciding when to use loops and when to use built-in matrix operations?

Any help would be more than appreciated.

Beating BLAS routines is always a tough ask because they are already heavily optimized.
Reducing allocations matters most when your code is allocating a lot. Since func1 doesn’t allocate that much, the impact of those allocations isn’t that large. You can see that by making a version of func1 that doesn’t allocate:

function func3!(X, y, D, E)
    mul!(E, Diagonal(y), X)      # E = Diagonal(y) * X, written in place
    mul!(D, transpose(X), E)     # D = X' * E, written in place
    nothing
end
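
(For the benchmarks below I’m assuming a setup along these lines; a and b are the interpolated inputs, and E is preallocated to the size of X:)

julia> using LinearAlgebra, BenchmarkTools

julia> a = randn(1000, 500); b = randn(1000);

julia> D = Matrix{Float64}(undef, 500, 500); E = similar(a);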

julia> @benchmark func1($a,$b)
BenchmarkTools.Trial: 
  memory estimate:  5.72 MiB
  allocs estimate:  6
  --------------
  minimum time:     25.907 ms (0.00% GC)
  median time:      26.195 ms (0.00% GC)
  mean time:        26.289 ms (0.28% GC)
  maximum time:     27.058 ms (2.35% GC)
  --------------
  samples:          191
  evals/sample:     1

julia> @benchmark func3!($a,$b, $D, $E)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     25.388 ms (0.00% GC)
  median time:      25.914 ms (0.00% GC)
  mean time:        25.884 ms (0.00% GC)
  maximum time:     26.236 ms (0.00% GC)
  --------------
  samples:          194
  evals/sample:     1

Another thing to take a look at: it’s very likely that the BLAS operation is using multiple threads, while your version isn’t.
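
You can check this by querying the BLAS thread count, and pin it to a single thread for a like-for-like comparison; for instance (on Julia 1.6+):

using LinearAlgebra
BLAS.get_num_threads()    # how many threads BLAS is currently using
BLAS.set_num_threads(1)   # pin BLAS to a single thread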

If you want to see pure Julia competing with BLAS operations, check https://github.com/JuliaSIMD/LoopVectorization.jl for some very fast loops.

3 Likes

Thank you for your answer - it’s very helpful. I was wondering whether macros like @turbo incur multi-threading overhead, especially when I have many sequential matrix operations?

@turbo is single-threaded by default. When you call @tturbo, however, the package doesn’t use normal Julia threading and instead uses https://github.com/JuliaSIMD/Polyester.jl, which has lower overhead. I wouldn’t worry about the threading overhead though; it’s on the order of microseconds. Check the package for more information.
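
For example, a minimal sketch of the two variants (wsum_turbo/wsum_tturbo are made-up names; the loop body is identical, only the macro differs):

using LoopVectorization

function wsum_turbo(x, y)    # single-threaded SIMD reduction
    s = zero(eltype(x))
    @turbo for i in axes(x, 1)
        s += x[i] * y[i]
    end
    s
end

function wsum_tturbo(x, y)   # same loop, threaded via Polyester
    s = zero(eltype(x))
    @tturbo for i in axes(x, 1)
        s += x[i] * y[i]
    end
    s
end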

3 Likes

julia> using LoopVectorization, BenchmarkHistograms, LinearAlgebra

julia> D = Matrix{Float64}(undef, 500, 500); D2 = similar(D); D3 = similar(D);

julia> X = randn(1000,500); y = randn(1000); E = similar(X);  # E is needed for func3!/func4! below

julia> function func5!(D, X, y)
           @tturbo for m in axes(D,1), k in axes(D,2)
               Dmk = zero(eltype(D))
               for n = axes(X, 1)
                    Dmk += X[n, m] * X[n, k] * y[n]
               end
               D[m, k] = Dmk
           end
           D
       end
func5! (generic function with 1 method)

julia> function func2!(D, X, y)
              Xt = transpose(X)
              fill!(D, 0)
              for n = 1:size(X, 1)
                   @inbounds @views D .+= Xt[:, n] .* transpose(Xt[:, n]) .* y[n]
              end
              D
       end
func2! (generic function with 1 method)

julia> func2!(D2,X,y);

julia> func5!(D,X,y) ≈ D2
true

julia> @benchmark func5!($D,$X,$y)
samples: 4270; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns

 (1.162e6 - 1.17e6 ]  ██████████████████████████████3551
 (1.17e6  - 1.178e6]  █████▊680
 (1.178e6 - 1.186e6]  ▎15
 (1.186e6 - 1.194e6]  ▏3
 (1.194e6 - 1.202e6]  ▏3
 (1.202e6 - 1.21e6 ]   0
 (1.21e6  - 1.218e6]   0
 (1.218e6 - 1.225e6]  ▏2
 (1.225e6 - 1.233e6]  ▏2
 (1.233e6 - 1.241e6]  ▏1
 (1.241e6 - 1.249e6]  ▏1
 (1.249e6 - 1.257e6]  ▏5
 (1.257e6 - 1.265e6]  ▏2
 (1.265e6 - 1.334e6]  ▏5

                  Counts

min: 1.162 ms (0.00% GC); mean: 1.168 ms (0.00% GC); median: 1.167 ms (0.00% GC); max: 1.334 ms (0.00% GC).

julia> @benchmark func2!($D3,$X,$y)
samples: 30; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns

 (1.703e8 - 1.705e8]  █████████████████████▌5
 (1.705e8 - 1.708e8]  █████████████████████▌5
 (1.708e8 - 1.71e8 ]  █████████████████████████▊6
 (1.71e8  - 1.713e8]  ██████████████████████████████ 7
 (1.713e8 - 1.716e8]  █████████████████████████▊6
 (1.716e8 - 1.718e8]  ████▍1

                  Counts

min: 170.258 ms (0.00% GC); mean: 170.990 ms (0.00% GC); median: 171.029 ms (0.00% GC); max: 171.823 ms (0.00% GC).

Note that I benchmarked on a computer with 18 cores and 1 MiB of L2 cache per core, so LoopVectorization can handle much larger arrays on this computer before performance falls off a cliff (e.g., see the README plot). I will fix that eventually, but for now it’d be worth manually blocking the arrays or using Tullio.
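
For reference, a Tullio version of the same contraction might look like this (a sketch; func6! is a made-up name, and the index n, which appears only on the right-hand side, is summed over automatically):

using Tullio

function func6!(D, X, y)
    @tullio D[m, k] = X[n, m] * X[n, k] * y[n]
    D
end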

func3! is faster for me:

julia> @benchmark func3!($X, $y, $D3, $E)
samples: 4103; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns

 (936000.0 - 959000.0]  ▌31
 (959000.0 - 983000.0]  ▎12
 (983000.0 - 1.006e6 ]  ▎11
 (1.006e6  - 1.03e6  ]  ▏6
 (1.03e6   - 1.053e6 ]  ▎12
 (1.053e6  - 1.076e6 ]  ▏6
 (1.076e6  - 1.1e6   ]  ▏2
 (1.1e6    - 1.123e6 ]  ▏5
 (1.123e6  - 1.147e6 ]  ▏3
 (1.147e6  - 1.17e6  ]  ▏1
 (1.17e6   - 1.194e6 ]  ▏8
 (1.194e6  - 1.217e6 ]  ██████████████████▊1534
 (1.217e6  - 1.24e6  ]  ██████████████████████████████2467
 (1.24e6   - 1.396e6 ]  ▏5

                  Counts

min: 935.817 μs (0.00% GC); mean: 1.214 ms (0.00% GC); median: 1.218 ms (0.00% GC); max: 1.396 ms (0.00% GC).

julia> D3 ≈ D2
true

5 Likes

Faster: transposing the data layout of X, so Xtt holds the same values but with row-major storage:

julia> Xtt = transpose(copy(transpose(X)));

julia> @benchmark func5!($D3,$Xtt,$y)
samples: 6905; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns

 (686000.0 - 701000.0]  ▎30
 (701000.0 - 716000.0]  █▌322
 (716000.0 - 731000.0]  ██████████████████████████████6443
 (731000.0 - 746000.0]  ▌86
 (746000.0 - 760000.0]  ▏5
 (760000.0 - 775000.0]  ▏4
 (775000.0 - 790000.0]  ▏2
 (790000.0 - 805000.0]  ▏2
 (805000.0 - 820000.0]  ▏1
 (820000.0 - 835000.0]   0
 (835000.0 - 849000.0]   0
 (849000.0 - 864000.0]  ▏1
 (864000.0 - 879000.0]  ▏2
 (879000.0 - 1.036e6 ]  ▏7

                  Counts

min: 686.128 μs (0.00% GC); mean: 720.280 μs (0.00% GC); median: 719.413 μs (0.00% GC); max: 1.036 ms (0.00% GC).

julia> D3 ≈ D2
true

Fastest (on my computer): using Octavian and LoopVectorization for func4!:

julia> using Octavian, LoopVectorization

julia> function func4!(D, E, X, y)
           @tturbo for m ∈ indices((X,E),1), n ∈ indices((X,E),2)
               E[m,n] = y[m] * X[m,n]       # E = Diagonal(y) * X
           end
           matmul!(D, transpose(X), E)      # D = X' * E, via Octavian
           nothing
       end
func4! (generic function with 2 methods)

julia> @benchmark func4!($D3,$E,$X,$y)
samples: 8852; evals/sample: 1; memory estimate: 0 bytes; allocs estimate: 0
ns

 (369000.0 - 391000.0]  █101
 (391000.0 - 412000.0]  ▋61
 (412000.0 - 434000.0]  ▌56
 (434000.0 - 455000.0]  █▋174
 (455000.0 - 476000.0]  █▋174
 (476000.0 - 498000.0]  ██▎250
 (498000.0 - 519000.0]  █▎133
 (519000.0 - 540000.0]  ▏5
 (540000.0 - 562000.0]  ███████████████████████▉2701
 (562000.0 - 583000.0]  ██████████████████████████████3406
 (583000.0 - 604000.0]  ███████████▋1307
 (604000.0 - 626000.0]  ████449
 (626000.0 - 647000.0]  ▎21
 (647000.0 - 668000.0]  ▏5
 (668000.0 - 742000.0]  ▏9

                  Counts

min: 369.469 μs (0.00% GC); mean: 561.015 μs (0.00% GC); median: 574.398 μs (0.00% GC); max: 742.266 μs (0.00% GC).

julia> D3 ≈ D2
true

At 370 (575) microseconds min (median), it is around 300x faster than func2! (which took 170 milliseconds).

10 Likes