Reduce allocations in row-by-row dotproduct

Hi, I was wondering if there is a better way of writing the function `add_and_mul` below, which effectively (up to some scaling and addition) computes the dot product between corresponding row vectors of two matrices. There must be a more elegant way of doing this, but I’m not yet seeing it…

using BenchmarkTools, Random, LinearAlgebra
Random.seed!(42)

function add_and_mul(X1, A, r, s, d)
    X2 = Matrix{Float64}(undef, size(X1))
    X2[:, 1:d] = view(X1, :, 1:d) * s
    X2[:, d+1] = view(X1, :, d+1) * r + sum(A .* (view(X2, :, 1:d) - view(X1, :, 1:d)*r), dims = 2)
    return X2
end

function RunBench(N, d)
    r = rand()
    s = rand()
    X1 = rand(N, d+1)
    A = rand(N, d)
    display(add_and_mul(X1, A, r, s, d))
    display(@benchmark add_and_mul($X1, $A, $r, $s, $d))
end

RunBench(10_000, 5)

which gives

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   77.917 ΞΌs …  19.387 ms  β”Š GC (min … max):  0.00% … 98.92%
 Time  (median):     191.792 ΞΌs               β”Š GC (median):     0.00%
 Time  (mean Β± Οƒ):   293.755 ΞΌs Β± 542.751 ΞΌs  β”Š GC (mean Β± Οƒ):  32.63% Β± 19.26%

  β–β–ƒβ–ˆβ–‡β–„β–‚ ▁                                                      β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–„β–ƒβ–β–β–β–β–β–„β–ƒβ–ƒβ–β–ƒβ–β–β–„β–…β–…β–„β–†β–…β–†β–…β–†β–†β–†β–‡β–†β–‡β–†β–‡β–‡β–‡β–‡β–†β–†β–‡β–†β–†β–‡β–†β–†β–…β–†β–†β–…β–†β–†β–†β–…β–…β–… β–ˆ
  77.9 ΞΌs       Histogram: log(frequency) by time       2.46 ms <

 Memory estimate: 2.21 MiB, allocs estimate: 24.

This will be a lot easier to write with for loops.

Something like this?

function add_and_mul_with_for(X1, A, r, s, d)
    X2 = Matrix{Float64}(undef, size(X1))
    X2[:, 1:d] = view(X1, :, 1:d) * s
    @views for i in axes(X2, 1)
        X2[i, d+1] = X1[i, d+1] * r + A[i, :] β‹… (X2[i, 1:d] - X1[i, 1:d]*r)
    end
    return X2
end

That would give me

BenchmarkTools.Trial: 8055 samples with 1 evaluation.
 Range (min … max):  445.625 ΞΌs …  21.843 ms  β”Š GC (min … max):  0.00% … 97.61%
 Time  (median):     509.834 ΞΌs               β”Š GC (median):     0.00%
 Time  (mean Β± Οƒ):   619.508 ΞΌs Β± 590.469 ΞΌs  β”Š GC (mean Β± Οƒ):  17.29% Β± 16.64%

  β–…β–ˆβ–…β–ƒ                                                          ▁
  β–ˆβ–ˆβ–ˆβ–ˆβ–‡β–†β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ƒβ–…β–…β–†β–‡β–‡β–‡β–‡β–†β–‡β–‡β–†β–†β–…β–†β–†β–…β–†β–‡β–‡β–† β–ˆ
  446 ΞΌs        Histogram: log(frequency) by time       3.09 ms <

 Memory estimate: 2.67 MiB, allocs estimate: 40006.

Like this:

function add_and_mul2(X1, A, r, s, d)
    X2 = Matrix{Float64}(undef, size(X1))
    for j in 1:d
        for i in axes(X2, 1)
            X2[i, j] = X1[i, j] * s
        end
    end
    for i in axes(X2, 1)
        X2[i, d+1] = X1[i, d+1] * r
    end
    for j in 1:d
        for i in axes(X2, 1)
            X2[i, d+1] += A[i,j] * (X2[i, j] - X1[i, j]*r)
        end
    end
    return X2
end

Preallocating `X2` will yield more consistent results, but whether that’s feasible depends on how you use this function. You can get some more speedup by decorating the for loops with `@inbounds`.
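For example, an in-place variant might look like the sketch below (the name `add_and_mul2!` and the caller-supplied `X2` argument are my own suggestion, not from the code above):

```julia
# Sketch of an in-place variant: the caller preallocates X2 once and reuses it
# across calls, so the function itself allocates nothing.
function add_and_mul2!(X2, X1, A, r, s, d)
    # Scale the first d columns of X1 into X2.
    @inbounds for j in 1:d
        for i in axes(X2, 1)
            X2[i, j] = X1[i, j] * s
        end
    end
    # Initialize the last column with the r-scaled last column of X1.
    @inbounds for i in axes(X2, 1)
        X2[i, d+1] = X1[i, d+1] * r
    end
    # Accumulate the row-wise dot product into the last column.
    @inbounds for j in 1:d
        for i in axes(X2, 1)
            X2[i, d+1] += A[i, j] * (X2[i, j] - X1[i, j] * r)
        end
    end
    return X2
end

# Usage: allocate once, then reuse.
# X2 = Matrix{Float64}(undef, size(X1))
# add_and_mul2!(X2, X1, A, r, s, d)
```

With the buffer hoisted out, repeated calls should show zero allocations in `@benchmark`, which also removes the GC noise from the timing histogram.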


Note that you can fuse two of the loops there, and hoist some bounds checks to get another big speed-up:

function add_and_mul3(X1, A, r, s, d)
    X2 = Matrix{Float64}(undef, size(X1))
    @boundscheck checkbounds(A, axes(X2, 1), 1:d)
    @boundscheck checkbounds(X1, :, 1:d+1)
    
    @views X2[:, d+1] .= X1[:, d+1] .* r
    for j in 1:d
        @inbounds @simd for i in axes(X2, 1)
            X1ij = X1[i, j]
            X2ij = X1ij * s
            X2[i, j] = X2ij
            X2[i, d+1] += A[i,j] * (X2ij - X1ij * r)
        end
    end
    return X2
end

function RunBench(;N=10_000, d=5)
    r = rand()
    s = rand()
    X1 = rand(N, d+1)
    A = rand(N, d)
    @info "" add_and_mul(X1, A, r, s, d) β‰ˆ add_and_mul2(X1, A, r, s, d) β‰ˆ add_and_mul3(X1, A, r, s, d)
    display(@benchmark add_and_mul($X1, $A, $r, $s, $d))
    sleep(1)
    display(@benchmark add_and_mul2($X1, $A, $r, $s, $d))
    sleep(1)
    display(@benchmark add_and_mul3($X1, $A, $r, $s, $d))
end

gives me

julia> RunBench()
β”Œ Info: 
β””   add_and_mul(X1, A, r, s, d) β‰ˆ add_and_mul2(X1, A, r, s, d) β‰ˆ add_and_mul3(X1, A, r, s, d) = true
β”Œ Info: 
β”‚   add_and_mul =
β”‚    BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
β”‚     Range (min … max):   72.898 ΞΌs …   1.970 ms  β”Š GC (min … max):  0.00% … 87.09%
β”‚     Time  (median):     113.355 ΞΌs               β”Š GC (median):     0.00%
β”‚     Time  (mean Β± Οƒ):   199.154 ΞΌs Β± 214.478 ΞΌs  β”Š GC (mean Β± Οƒ):  16.48% Β± 16.41%
β”‚    
β”‚      β–†β–ˆβ–‡β–†β–…β–…β–…β–„β–ƒβ–‚β–β–                             ▁▁ ▁                 β–‚
β”‚      β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–‡β–‡β–‡β–†β–…β–…β–„β–…β–†β–‡β–‡β–ˆβ–ˆβ–ˆβ–†β–†β–†β–‡β–ˆβ–ˆβ–ˆβ–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–†β–‡β–‡β–‡β–‡β–†β–…β–†β–†β–†β–‡β–†β–† β–ˆ
β”‚      72.9 ΞΌs       Histogram: log(frequency) by time       1.02 ms <
β”‚    
β””     Memory estimate: 2.21 MiB, allocs estimate: 24.
β”Œ Info: 
β”‚   add_and_mul2 =
β”‚    BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
β”‚     Range (min … max):  41.067 ΞΌs …  1.736 ms  β”Š GC (min … max):  0.00% … 94.50%
β”‚     Time  (median):     49.454 ΞΌs              β”Š GC (median):     0.00%
β”‚     Time  (mean Β± Οƒ):   75.996 ΞΌs Β± 82.370 ΞΌs  β”Š GC (mean Β± Οƒ):  10.38% Β± 10.69%
β”‚    
β”‚      β–ˆβ–‡β–…β–…β–ƒ             ▂▃▃▃▁                                     β–‚
β”‚      β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–†β–…β–†β–…β–„β–„β–‡β–‡β–‡β–†β–†β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–†β–…β–„β–…β–…β–…β–…β–„β–β–„β–β–…β–„β–…β–ƒβ–…β–…β–…β–…β–…β–„β–…β–†β–ƒβ–…β–„β–†β–…β–…β–„β–ƒβ–…β–…β–…β–… β–ˆ
β”‚      41.1 ΞΌs      Histogram: log(frequency) by time       483 ΞΌs <
β”‚    
β””     Memory estimate: 468.83 KiB, allocs estimate: 3.
β”Œ Info: 
β”‚   add_and_mul3 =
β”‚    BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
β”‚     Range (min … max):  14.006 ΞΌs …  1.580 ms  β”Š GC (min … max):  0.00% … 97.28%
β”‚     Time  (median):     23.685 ΞΌs              β”Š GC (median):     0.00%
β”‚     Time  (mean Β± Οƒ):   46.578 ΞΌs Β± 76.171 ΞΌs  β”Š GC (mean Β± Οƒ):  14.22% Β± 10.53%
β”‚    
β”‚      β–ˆβ–‡β–†β–†β–‚                 β–ƒβ–ƒβ–ƒβ–ƒβ–‚                                 β–‚
β”‚      β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–„β–ƒβ–„β–β–ƒβ–„β–…β–„β–β–ƒβ–‡β–ˆβ–‡β–…β–„β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–†β–†β–†β–†β–†β–†β–…β–†β–„β–ƒβ–„β–„β–„β–„β–ƒβ–ƒβ–†β–‡β–†β–…β–…β–…β–…β–„β–†β–…β–…β–…β–…β–† β–ˆ
β”‚      14 ΞΌs        Histogram: log(frequency) by time       380 ΞΌs <
β”‚    
β””     Memory estimate: 468.83 KiB, allocs estimate: 3.

Thank you both, this is very insightful; it cuts my allocation count considerably and gives a nice speed-up!