Optimize combination of element-wise products and matrix multiplication

Hello everyone,

I heavily use the following operation in my code:

d = (A .* B)*c where A, B \in R^{n_x,n_y}, c \in R^{n_y} and d \in R^{n_x}.

I couldn’t find a faster way to compute this than the vectorized form d = (A .* B)*c

Here is my benchmark:

  function f1!(d, A, B, c)
    d .= (A .* B)*c
end

function f2!(E, d, A, B, c)
    nx, ny = size(A)
    for j=1:ny
        for i=1:nx
            E[i,j] = A[i,j]*B[i,j]
        end
    end
    d .= E*c
end

function timing()
@btime  begin
    nx = 500
    ny = 20
    A = randn(nx, ny)
    B = randn(nx, ny)
    c = randn(ny);
end

@btime begin 
    nx = 500
    ny = 20
    A = randn(nx, ny)
    B = randn(nx, ny)
    c = randn(ny);
    d = zeros(nx)
    f1!(d, A, B, c)
end
    
@btime begin
    nx = 500
    ny = 20
    A = randn(nx, ny)
    B = randn(nx, ny)
    c = randn(ny);
    d = zeros(nx)
    E = zeros(nx,ny)
    f2!(E, d, A, B, c)
end      
end

timing()

89.051 μs (5 allocations: 156.64 KiB)
105.161 μs (9 allocations: 242.97 KiB)
117.541 μs (9 allocations: 242.97 KiB)

1 Like

I notice that you include the allocations in your timing, which you may not want to do if you just want to measure the performance of the operation.

The function f2! would benefit from @inbounds and possibly @simd on the inner loop.

Inbounds was effective, and @avx from LoopVectorization was even more so. Using inplace multiplication reduced the allocations, which will pay off if the function is called multiple times from within another function. You may also consider doing inplace multiplication in f1!, as well as updating d in the loop in f2!.

using LoopVectorization

function f1!(d, A, B, c)
    d .= (A .* B) * c
end

function f2!(E, d, A, B, c)
    nx, ny = size(A)
    @avx for j = 1:ny
        for i = 1:nx
            E[i, j] = A[i, j] * B[i, j]
        end
    end
    mul!(d, E, c)
end

nx = 500
ny = 20
A = randn(nx, ny)
B = randn(nx, ny)
c = randn(ny)
d = zeros(nx)
@btime f1!($d, $A, $B, $c)

nx = 500
ny = 20
A = randn(nx, ny)
B = randn(nx, ny)
c = randn(ny)
d = zeros(nx)
E = zeros(nx, ny)
@btime f2!($E, $d, $A, $B, $c)

# baseline
13.487 μs (3 allocations: 82.27 KiB)
23.558 μs (1 allocation: 4.06 KiB)

# inbounds  f2
10.446 μs (1 allocation: 4.06 KiB)

# @avx
13.170 μs (3 allocations: 82.27 KiB)
9.716 μs (1 allocation: 4.06 KiB)

# mul! f2
9.832 μs (0 allocations: 0 bytes)
2 Likes

A couple more options with LoopVectorization:

using LoopVectorization, Test
function f3!(d, A, B, c)
    nx, ny = size(A)
    @avx for i = 1:nx
        di = zero(eltype(d))
        for j = 1:ny
            di += (A[i, j] * B[i, j]) * c[j]
        end
        d[i] = di
    end
end
function f4!(d, A, B, c)
    @avx @. d = (A * B) *ˡ c # note super script l; denotes lazy multiplication that fuses with broadcasts
end
nx = 500; ny = 20;
A = randn(nx, ny);
B = randn(nx, ny);
c = randn(ny);
d1 = zeros(nx); d2 = similar(d1); d3 = similar(d1); d4 = similar(d1);
E = zeros(nx, ny);
@btime f1!($d1, $A, $B, $c);
@btime f2!($E, $d2, $A, $B, $c);
@btime f3!($d3, $A, $B, $c);
@btime f4!($d4, $A, $B, $c);
@test d1 ≈ d2 ≈ d3 ≈ d4

On my computer, this yields:

julia> @btime f1!($d1, $A, $B, $c);
  11.135 μs (3 allocations: 82.27 KiB)

julia> @btime f2!($E, $d2, $A, $B, $c);
  10.135 μs (0 allocations: 0 bytes)

julia> @btime f3!($d3, $A, $B, $c);
  1.285 μs (0 allocations: 0 bytes)

julia> @btime f4!($d4, $A, $B, $c);
  1.350 μs (0 allocations: 0 bytes)

julia> @test d1 ≈ d2 ≈ d3 ≈ d4
Test Passed

The @avx broadcasted version with the lazy multiplication will require LoopVectorization v0.8.17.
If A and B have a lot more columns than in this example, you should split that loop into pieces, using views of A and B for better memory performance.

3 Likes

@baggepinnen Thank you for the example.

I am a little curious why inbounds f2 is faster than baseline f1, i.e., 10.446 vs. 13.487. What is the reason for such difference? Does f1 do bounds checking internally when calling the built-in operators? If so, how can we disable the checking? Or is the advantage of inbounds f2 actually attributed to the pre-allocation of matrix E?

It does fewer allocations

this code first allocates A.*B, then allocates AB * c.

@baggepinnen Yes, I fully agree with you. That pre-allocation is indeed useful if the function is called inside a loop. (By the way, I just noticed that you worked in NUS. National University of Singapore, right? I am also here. :smiley:)

Is there any non-allocating function for element-wise matrix product?

Thank you all for your help!

@Elrod I got the following error when I run your code for f4!. Is there missing package in my Julia environment for Lazy operations

MethodError: no method matching getindex(::LoopVectorization.Product{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{2},Nothing,typeof(*),Tuple{Array{Float64,2},Array{Float64,2}}},Array{Float64,1}}, ::CartesianIndex{1})

Stacktrace:
[1] _broadcast_getindex at ./broadcast.jl:584 [inlined]
[2] _getindex at ./broadcast.jl:628 [inlined]
[3] _broadcast_getindex at ./broadcast.jl:603 [inlined]
[4] getindex at ./broadcast.jl:564 [inlined]
[5] macro expansion at ./broadcast.jl:910 [inlined]
[6] macro expansion at ./simdloop.jl:77 [inlined]
[7] copyto! at ./broadcast.jl:909 [inlined]
[8] copyto! at ./broadcast.jl:864 [inlined]
[9] materialize! at ./broadcast.jl:826 [inlined]
[10] vmaterialize!(::Array{Float64,1}, ::LoopVectorization.Product{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{2},Nothing,typeof(*),Tuple{Array{Float64,2},Array{Float64,2}}},Array{Float64,1}}, ::Val{:Main}) at /home/mat/.julia/packages/LoopVectorization/nzoRk/src/broadcast.jl:331
[11] f4! at ./In[6]:26 [inlined]
[12] ##core#442(::Array{Float64,1}, ::Array{Float64,2}, ::Array{Float64,2}, ::Array{Float64,1}) at /home/mat/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:371
[13] ##sample#443(::BenchmarkTools.Parameters) at /home/mat/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:377
[14] _run(::BenchmarkTools.Benchmark{Symbol("##benchmark#441")}, ::BenchmarkTools.Parameters; verbose::Bool, pad::String, kwargs::Base.Iterators.Pairs{Symbol,Integer,NTuple{4,Symbol},NamedTuple{(:samples, :evals, :gctrial, :gcsample),Tuple{Int64,Int64,Bool,Bool}}}) at /home/mat/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:405
[15] (::Base.var"#inner#2"{Base.Iterators.Pairs{Symbol,Integer,NTuple{5,Symbol},NamedTuple{(:verbose, :samples, :evals, :gctrial, :gcsample),Tuple{Bool,Int64,Int64,Bool,Bool}}},typeof(BenchmarkTools._run),Tuple{BenchmarkTools.Benchmark{Symbol("##benchmark#441")},BenchmarkTools.Parameters}})() at ./essentials.jl:715
[16] #invokelatest#1 at ./essentials.jl:716 [inlined]
[17] #run_result#37 at /home/mat/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:32 [inlined]
[18] run(::BenchmarkTools.Benchmark{Symbol("##benchmark#441")}, ::BenchmarkTools.Parameters; progressid::Nothing, nleaves::Float64, ndone::Float64, kwargs::Base.Iterators.Pairs{Symbol,Integer,NTuple{5,Symbol},NamedTuple{(:verbose, :samples, :evals, :gctrial, :gcsample),Tuple{Bool,Int64,Int64,Bool,Bool}}}) at /home/mat/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:94
[19] #warmup#45 at /home/mat/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:141 [inlined]
[20] warmup(::BenchmarkTools.Benchmark{Symbol("##benchmark#441")}) at /home/mat/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:141
[21] top-level scope at /home/mat/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:481
[22] top-level scope at In[6]:37

That method was added in LoopVectorization v0.8.17, so this should work if you just updating your packages. I look forward to seeing your benchmarks!

2 Likes

Thank you, that fixes the problem.