Reduce allocations in row-by-row dotproduct

Hi, I was wondering if there is a better way of writing the function `add_and_mul` below, which effectively (up to some scaling and addition) computes the dot product between corresponding row vectors of two matrices. There must be a more elegant way of doing this, but I’m not yet seeing it…

using BenchmarkTools, Random, LinearAlgebra
Random.seed!(42)

function add_and_mul(X1, A, r, s, d)
    X2 = Matrix{Float64}(undef, size(X1))
    X2[:, 1:d] = view(X1, :, 1:d) * s
    X2[:, d+1] = view(X1, :, d+1) * r + sum(A .* (view(X2, :, 1:d) - view(X1, :, 1:d)*r), dims = 2)
    return X2
end

function RunBench(N, d)
    r = rand()
    s = rand()
    X1 = rand(N, d+1)
    A = rand(N, d)
    display(add_and_mul(X1, A, r, s, d))
    display(@benchmark add_and_mul($X1, $A, $r, $s, $d))
end

RunBench(10_000, 5)

which gives

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   77.917 ΞΌs …  19.387 ms  β”Š GC (min … max):  0.00% … 98.92%
 Time  (median):     191.792 ΞΌs               β”Š GC (median):     0.00%
 Time  (mean Β± Οƒ):   293.755 ΞΌs Β± 542.751 ΞΌs  β”Š GC (mean Β± Οƒ):  32.63% Β± 19.26%

  β–β–ƒβ–ˆβ–‡β–„β–‚ ▁                                                      β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–„β–ƒβ–β–β–β–β–β–„β–ƒβ–ƒβ–β–ƒβ–β–β–„β–…β–…β–„β–†β–…β–†β–…β–†β–†β–†β–‡β–†β–‡β–†β–‡β–‡β–‡β–‡β–†β–†β–‡β–†β–†β–‡β–†β–†β–…β–†β–†β–…β–†β–†β–†β–…β–…β–… β–ˆ
  77.9 ΞΌs       Histogram: log(frequency) by time       2.46 ms <

 Memory estimate: 2.21 MiB, allocs estimate: 24.

This will be a lot easier to write with for loops.

Something like this?

function add_and_mul_with_for(X1, A, r, s, d)
    X2 = Matrix{Float64}(undef, size(X1))
    X2[:, 1:d] = view(X1, :, 1:d) * s
    @views for i in axes(X2, 1)
        X2[i, d+1] = X1[i, d+1] * r + A[i, :] β‹… (X2[i, 1:d] - X1[i, 1:d]*r)
    end
    return X2
end

That would give me

BenchmarkTools.Trial: 8055 samples with 1 evaluation.
 Range (min … max):  445.625 ΞΌs …  21.843 ms  β”Š GC (min … max):  0.00% … 97.61%
 Time  (median):     509.834 ΞΌs               β”Š GC (median):     0.00%
 Time  (mean Β± Οƒ):   619.508 ΞΌs Β± 590.469 ΞΌs  β”Š GC (mean Β± Οƒ):  17.29% Β± 16.64%

  β–…β–ˆβ–…β–ƒ                                                          ▁
  β–ˆβ–ˆβ–ˆβ–ˆβ–‡β–†β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ƒβ–…β–…β–†β–‡β–‡β–‡β–‡β–†β–‡β–‡β–†β–†β–…β–†β–†β–…β–†β–‡β–‡β–† β–ˆ
  446 ΞΌs        Histogram: log(frequency) by time       3.09 ms <

 Memory estimate: 2.67 MiB, allocs estimate: 40006.

Like this:

function add_and_mul2(X1, A, r, s, d)
    X2 = Matrix{Float64}(undef, size(X1))
    for j in 1:d
        for i in axes(X2, 1)
            X2[i, j] = X1[i, j] * s
        end
    end
    for i in axes(X2, 1)
        X2[i, d+1] = X1[i, d+1] * r
    end
    for j in 1:d
        for i in axes(X2, 1)
            X2[i, d+1] += A[i,j] * (X2[i, j] - X1[i, j]*r)
        end
    end
    return X2
end

Preallocating `X2` will yield more consistent results, but whether that’s feasible depends on how you use this function. You can get some more speedup by decorating the for loops with `@inbounds`.
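For example, an in-place variant might look like the sketch below (the name `add_and_mul2!` and the caller-supplied `X2` argument are my own suggestion, not from the code above):

```julia
# Sketch of an in-place variant: the caller preallocates X2 once and reuses it
# across calls, so the function itself allocates nothing.
function add_and_mul2!(X2, X1, A, r, s, d)
    # Scale the first d columns of X1 into X2.
    @inbounds for j in 1:d
        for i in axes(X2, 1)
            X2[i, j] = X1[i, j] * s
        end
    end
    # Initialize the last column with the r-scaled last column of X1.
    @inbounds for i in axes(X2, 1)
        X2[i, d+1] = X1[i, d+1] * r
    end
    # Accumulate the row-wise dot product into the last column.
    @inbounds for j in 1:d
        for i in axes(X2, 1)
            X2[i, d+1] += A[i, j] * (X2[i, j] - X1[i, j] * r)
        end
    end
    return X2
end

# Usage: allocate once, then reuse.
# X2 = Matrix{Float64}(undef, size(X1))
# add_and_mul2!(X2, X1, A, r, s, d)
```

With the buffer hoisted out, repeated calls should show zero allocations in `@benchmark`, which also removes the GC noise from the timing histogram.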


Note that you can fuse two of the loops there, and hoist some bounds checks to get another big speed-up:

function add_and_mul3(X1, A, r, s, d)
    X2 = Matrix{Float64}(undef, size(X1))
    @boundscheck checkbounds(A, axes(X2, 1), 1:d)
    @boundscheck checkbounds(X1, :, 1:d+1)
    
    @views X2[:, d+1] .= X1[:, d+1] .* r
    for j in 1:d
        @inbounds @simd for i in axes(X2, 1)
            X1ij = X1[i, j]
            X2ij = X1ij * s
            X2[i, j] = X2ij
            X2[i, d+1] += A[i,j] * (X2ij - X1ij * r)
        end
    end
    return X2
end

function RunBench(;N=10_000, d=5)
    r = rand()
    s = rand()
    X1 = rand(N, d+1)
    A = rand(N, d)
    @info "" add_and_mul(X1, A, r, s, d) β‰ˆ add_and_mul2(X1, A, r, s, d) β‰ˆ add_and_mul3(X1, A, r, s, d)
    display(@benchmark add_and_mul($X1, $A, $r, $s, $d))
    sleep(1)
    display(@benchmark add_and_mul2($X1, $A, $r, $s, $d))
    sleep(1)
    display(@benchmark add_and_mul3($X1, $A, $r, $s, $d))
end

gives me

julia> RunBench()
β”Œ Info: 
β””   add_and_mul(X1, A, r, s, d) β‰ˆ add_and_mul2(X1, A, r, s, d) β‰ˆ add_and_mul3(X1, A, r, s, d) = true
β”Œ Info: 
β”‚   add_and_mul =
β”‚    BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
β”‚     Range (min … max):   72.898 ΞΌs …   1.970 ms  β”Š GC (min … max):  0.00% … 87.09%
β”‚     Time  (median):     113.355 ΞΌs               β”Š GC (median):     0.00%
β”‚     Time  (mean Β± Οƒ):   199.154 ΞΌs Β± 214.478 ΞΌs  β”Š GC (mean Β± Οƒ):  16.48% Β± 16.41%
β”‚    
β”‚      β–†β–ˆβ–‡β–†β–…β–…β–…β–„β–ƒβ–‚β–β–                             ▁▁ ▁                 β–‚
β”‚      β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–‡β–‡β–‡β–†β–…β–…β–„β–…β–†β–‡β–‡β–ˆβ–ˆβ–ˆβ–†β–†β–†β–‡β–ˆβ–ˆβ–ˆβ–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–†β–‡β–‡β–‡β–‡β–†β–…β–†β–†β–†β–‡β–†β–† β–ˆ
β”‚      72.9 ΞΌs       Histogram: log(frequency) by time       1.02 ms <
β”‚    
β””     Memory estimate: 2.21 MiB, allocs estimate: 24.
β”Œ Info: 
β”‚   add_and_mul2 =
β”‚    BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
β”‚     Range (min … max):  41.067 ΞΌs …  1.736 ms  β”Š GC (min … max):  0.00% … 94.50%
β”‚     Time  (median):     49.454 ΞΌs              β”Š GC (median):     0.00%
β”‚     Time  (mean Β± Οƒ):   75.996 ΞΌs Β± 82.370 ΞΌs  β”Š GC (mean Β± Οƒ):  10.38% Β± 10.69%
β”‚    
β”‚      β–ˆβ–‡β–…β–…β–ƒ             ▂▃▃▃▁                                     β–‚
β”‚      β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–†β–…β–†β–…β–„β–„β–‡β–‡β–‡β–†β–†β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–†β–…β–„β–…β–…β–…β–…β–„β–β–„β–β–…β–„β–…β–ƒβ–…β–…β–…β–…β–…β–„β–…β–†β–ƒβ–…β–„β–†β–…β–…β–„β–ƒβ–…β–…β–…β–… β–ˆ
β”‚      41.1 ΞΌs      Histogram: log(frequency) by time       483 ΞΌs <
β”‚    
β””     Memory estimate: 468.83 KiB, allocs estimate: 3.
β”Œ Info: 
β”‚   add_and_mul3 =
β”‚    BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
β”‚     Range (min … max):  14.006 ΞΌs …  1.580 ms  β”Š GC (min … max):  0.00% … 97.28%
β”‚     Time  (median):     23.685 ΞΌs              β”Š GC (median):     0.00%
β”‚     Time  (mean Β± Οƒ):   46.578 ΞΌs Β± 76.171 ΞΌs  β”Š GC (mean Β± Οƒ):  14.22% Β± 10.53%
β”‚    
β”‚      β–ˆβ–‡β–†β–†β–‚                 β–ƒβ–ƒβ–ƒβ–ƒβ–‚                                 β–‚
β”‚      β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–„β–ƒβ–„β–β–ƒβ–„β–…β–„β–β–ƒβ–‡β–ˆβ–‡β–…β–„β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–†β–†β–†β–†β–†β–†β–…β–†β–„β–ƒβ–„β–„β–„β–„β–ƒβ–ƒβ–†β–‡β–†β–…β–…β–…β–…β–„β–†β–…β–…β–…β–…β–† β–ˆ
β”‚      14 ΞΌs        Histogram: log(frequency) by time       380 ΞΌs <
β”‚    
β””     Memory estimate: 468.83 KiB, allocs estimate: 3.

Thank you both, this is very insightful; it cuts my allocation count considerably and gives a nice speed-up!