Optimize combination of element-wise products and matrix multiplication

mleprovost · July 14, 2020, 11:51pm

Hello everyone,

I heavily use the following operation in my code:

d = (A .* B)*c where A, B \in R^{n_x,n_y}, c \in R^{n_y} and d \in R^{n_x}.

I couldn’t find a faster way to compute this than the vectorized form d = (A .* B)*c

Here is my benchmark:

  function f1!(d, A, B, c)
    d .= (A .* B)*c
end

function f2!(E, d, A, B, c)
    nx, ny = size(A)
    for j=1:ny
        for i=1:nx
            E[i,j] = A[i,j]*B[i,j]
        end
    end
    d .= E*c
end

function timing()
@btime  begin
    nx = 500
    ny = 20
    A = randn(nx, ny)
    B = randn(nx, ny)
    c = randn(ny);
end

@btime begin 
    nx = 500
    ny = 20
    A = randn(nx, ny)
    B = randn(nx, ny)
    c = randn(ny);
    d = zeros(nx)
    f1!(d, A, B, c)
end
    
@btime begin
    nx = 500
    ny = 20
    A = randn(nx, ny)
    B = randn(nx, ny)
    c = randn(ny);
    d = zeros(nx)
    E = zeros(nx,ny)
    f2!(E, d, A, B, c)
end      
end

timing()

89.051 μs (5 allocations: 156.64 KiB)
105.161 μs (9 allocations: 242.97 KiB)
117.541 μs (9 allocations: 242.97 KiB)

baggepinnen · July 15, 2020, 5:31am

I notice that you include the allocations in your timing, which you may not want to do if you just want to measure the performance of the operation.

The function f2! would benefit from @inbounds and possibly @simd on the inner loop.

baggepinnen · July 15, 2020, 5:35am

Inbounds was effective, and @avx from LoopVectorization was even more so. Using inplace multiplication reduced the allocations, which will pay off if the function is called multiple times from within another function. You may also consider doing inplace multiplication in f1!, as well as updating d in the loop in f2!.

using LoopVectorization

function f1!(d, A, B, c)
    d .= (A .* B) * c
end

function f2!(E, d, A, B, c)
    nx, ny = size(A)
    @avx for j = 1:ny
        for i = 1:nx
            E[i, j] = A[i, j] * B[i, j]
        end
    end
    mul!(d, E, c)
end

nx = 500
ny = 20
A = randn(nx, ny)
B = randn(nx, ny)
c = randn(ny)
d = zeros(nx)
@btime f1!($d, $A, $B, $c)

nx = 500
ny = 20
A = randn(nx, ny)
B = randn(nx, ny)
c = randn(ny)
d = zeros(nx)
E = zeros(nx, ny)
@btime f2!($E, $d, $A, $B, $c)

# baseline
13.487 μs (3 allocations: 82.27 KiB)
23.558 μs (1 allocation: 4.06 KiB)

# inbounds  f2
10.446 μs (1 allocation: 4.06 KiB)

# @avx
13.170 μs (3 allocations: 82.27 KiB)
9.716 μs (1 allocation: 4.06 KiB)

# mul! f2
9.832 μs (0 allocations: 0 bytes)

Elrod · July 15, 2020, 6:56am

A couple more options with LoopVectorization:

using LoopVectorization, Test
function f3!(d, A, B, c)
    nx, ny = size(A)
    @avx for i = 1:nx
        di = zero(eltype(d))
        for j = 1:ny
            di += (A[i, j] * B[i, j]) * c[j]
        end
        d[i] = di
    end
end
function f4!(d, A, B, c)
    @avx @. d = (A * B) *ˡ c # note super script l; denotes lazy multiplication that fuses with broadcasts
end
nx = 500; ny = 20;
A = randn(nx, ny);
B = randn(nx, ny);
c = randn(ny);
d1 = zeros(nx); d2 = similar(d1); d3 = similar(d1); d4 = similar(d1);
E = zeros(nx, ny);
@btime f1!($d1, $A, $B, $c);
@btime f2!($E, $d2, $A, $B, $c);
@btime f3!($d3, $A, $B, $c);
@btime f4!($d4, $A, $B, $c);
@test d1 ≈ d2 ≈ d3 ≈ d4

On my computer, this yields:

julia> @btime f1!($d1, $A, $B, $c);
  11.135 μs (3 allocations: 82.27 KiB)

julia> @btime f2!($E, $d2, $A, $B, $c);
  10.135 μs (0 allocations: 0 bytes)

julia> @btime f3!($d3, $A, $B, $c);
  1.285 μs (0 allocations: 0 bytes)

julia> @btime f4!($d4, $A, $B, $c);
  1.350 μs (0 allocations: 0 bytes)

julia> @test d1 ≈ d2 ≈ d3 ≈ d4
Test Passed

The @avx broadcasted version with the lazy multiplication will require LoopVectorization v0.8.17.
If A and B have a lot more columns than in this example, you should split that loop into pieces, using views of A and B for better memory performance.

Shuhua · July 15, 2020, 7:26am

@baggepinnen Thank you for the example.

I am a little curious why inbounds f2 is faster than baseline f1, i.e., 10.446 vs. 13.487. What is the reason for such difference? Does f1 do bounds checking internally when calling the built-in operators? If so, how can we disable the checking? Or is the advantage of inbounds f2 actually attributed to the pre-allocation of matrix E?

baggepinnen · July 15, 2020, 7:30am

It does fewer allocations

this code first allocates A.*B, then allocates AB * c.

Shuhua · July 15, 2020, 7:35am

@baggepinnen Yes, I fully agree with you. That pre-allocation is indeed useful if the function is called inside a loop. (By the way, I just noticed that you worked in NUS. National University of Singapore, right? I am also here. )

Juan · July 15, 2020, 11:06am

Is there any non-allocating function for element-wise matrix product?

mleprovost · July 15, 2020, 3:30pm

Thank you all for your help!

@Elrod I got the following error when I run your code for f4!. Is there missing package in my Julia environment for Lazy operations

MethodError: no method matching getindex(::LoopVectorization.Product{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{2},Nothing,typeof(*),Tuple{Array{Float64,2},Array{Float64,2}}},Array{Float64,1}}, ::CartesianIndex{1})

Stacktrace:
[1] _broadcast_getindex at ./broadcast.jl:584 [inlined]
[2] _getindex at ./broadcast.jl:628 [inlined]
[3] _broadcast_getindex at ./broadcast.jl:603 [inlined]
[4] getindex at ./broadcast.jl:564 [inlined]
[5] macro expansion at ./broadcast.jl:910 [inlined]
[6] macro expansion at ./simdloop.jl:77 [inlined]
[7] copyto! at ./broadcast.jl:909 [inlined]
[8] copyto! at ./broadcast.jl:864 [inlined]
[9] materialize! at ./broadcast.jl:826 [inlined]
[10] vmaterialize!(::Array{Float64,1}, ::LoopVectorization.Product{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{2},Nothing,typeof(*),Tuple{Array{Float64,2},Array{Float64,2}}},Array{Float64,1}}, ::Val{:Main}) at /home/mat/.julia/packages/LoopVectorization/nzoRk/src/broadcast.jl:331
[11] f4! at ./In[6]:26 [inlined]
[12] ##core#442(::Array{Float64,1}, ::Array{Float64,2}, ::Array{Float64,2}, ::Array{Float64,1}) at /home/mat/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:371
[13] ##sample#443(::BenchmarkTools.Parameters) at /home/mat/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:377
[14] _run(::BenchmarkTools.Benchmark{Symbol(“##benchmark#441”)}, ::BenchmarkTools.Parameters; verbose::Bool, pad::String, kwargs::Base.Iterators.Pairs{Symbol,Integer,NTuple{4,Symbol},NamedTuple{(:samples, :evals, :gctrial, :gcsample),Tuple{Int64,Int64,Bool,Bool}}}) at /home/mat/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:405
[15] (::Base.var"#inner#2"{Base.Iterators.Pairs{Symbol,Integer,NTuple{5,Symbol},NamedTuple{(:verbose, :samples, :evals, :gctrial, :gcsample),Tuple{Bool,Int64,Int64,Bool,Bool}}},typeof(BenchmarkTools._run),Tuple{BenchmarkTools.Benchmark{Symbol(“##benchmark#441”)},BenchmarkTools.Parameters}})() at ./essentials.jl:715
[16] #invokelatest#1 at ./essentials.jl:716 [inlined]
[17] #run_result#37 at /home/mat/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:32 [inlined]
[18] run(::BenchmarkTools.Benchmark{Symbol(“##benchmark#441”)}, ::BenchmarkTools.Parameters; progressid::Nothing, nleaves::Float64, ndone::Float64, kwargs::Base.Iterators.Pairs{Symbol,Integer,NTuple{5,Symbol},NamedTuple{(:verbose, :samples, :evals, :gctrial, :gcsample),Tuple{Bool,Int64,Int64,Bool,Bool}}}) at /home/mat/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:94
[19] #warmup#45 at /home/mat/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:141 [inlined]
[20] warmup(::BenchmarkTools.Benchmark{Symbol(“##benchmark#441”)}) at /home/mat/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:141
[21] top-level scope at /home/mat/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:481
[22] top-level scope at In[6]:37

Elrod · July 15, 2020, 4:05pm

That method was added in LoopVectorization v0.8.17, so this should work if you just updating your packages. I look forward to seeing your benchmarks!

mleprovost · July 15, 2020, 4:38pm

Thank you, that fixes the problem.

Topic		Replies	Views
Vector - Matrix - Vector multiplication Performance	19	4228	March 14, 2021
Why mul! is so fast? General Usage question , linearalgebra	7	6871	November 26, 2019
@inbounds: is the compiler now so smart that this is no longer necessary? Performance	33	2906	July 16, 2018
Why for-loop of vector multiplication is slower that dot product? General Usage question	11	455	August 11, 2022
Julian way to write this code Performance question	18	913	September 22, 2024

Optimize combination of element-wise products and matrix multiplication

Related topics