Efficient way to compute Wx+b?

What is the most efficient way in Julia to compute Wx + b on the CPU, where W is a (J×I) matrix, x is an I-dimensional vector, and b is a J-dimensional vector?
(context: NN layers)

I have seen this thread, but I can't understand the answer 🙂

In general, would it make sense to first check the size of W, or the number of available threads, and then select a different computation method accordingly? (A sketch of this idea follows the benchmarks below.)

On my laptop with 12 threads I have:

julia> using LoopVectorization, BenchmarkTools, Test, Base.Threads, LinearAlgebra

julia> function _zComp!(z,w,x,b)
         z .= 0.0
         @inbounds @threads for n in axes(w,1)
            zn = 0.0
            @inbounds for nl in axes(x,1)
               zn += w[n,nl] * x[nl]
            end
            z[n] += zn
            z[n] += b[n]
         end
         return nothing
       end
_zComp! (generic function with 1 method)

julia> function _zComp1!(z,w,x,b)
         z .= 0.0
         @turbo for n in axes(w,1)
            zn = 0.0
            for nl in axes(x,1)
               zn += w[n,nl] * x[nl]
            end
            z[n] += zn
            z[n] += b[n]
         end
         return nothing
       end
_zComp1! (generic function with 1 method)

julia> function _zComp2!(z,w,x,b)
         z .= w*x .+ b
         return nothing
       end
_zComp2! (generic function with 1 method)

julia> function _zComp3!(z,w,x,b)
         z .= 0.0
         mul!(z, w, x) .+= b
         return nothing
       end
_zComp3! (generic function with 1 method)
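
(Note the difference between the last two: _zComp2! allocates a temporary vector for w*x and then copies it into z, whereas mul! in _zComp3! writes the product into z in place, with no allocation.)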

julia> (I,J) = (800,500);

julia> b     = fill(0.1,J);

julia> w = rand(J,I);

julia> x = rand(I);

julia> y = zeros(J);

julia> ytrue = w*x .+ b;

julia> _zComp!(y,w,x,b)

julia> @test y ≈ ytrue
Test Passed

julia> _zComp1!(y,w,x,b)

julia> @test y ≈ ytrue
Test Passed

julia> _zComp2!(y,w,x,b)

julia> @test y ≈ ytrue
Test Passed

julia> _zComp3!(y,w,x,b)

julia> @test y ≈ ytrue
Test Passed

julia> @btime _zComp!(y,w,x,b)
  77.164 μs (62 allocations: 6.53 KiB)

julia> @btime _zComp1!(y,w,x,b)
  79.880 μs (0 allocations: 0 bytes)

julia> @btime _zComp2!(y,w,x,b)
  75.305 μs (3 allocations: 4.00 KiB)

julia> @btime _zComp3!(y,w,x,b)
  74.479 μs (0 allocations: 0 bytes)

So with relatively large matrices the variants are more or less the same. However, with smaller matrices the threaded version is really bad:


julia> (I,J) = (80,50);

julia> b     = fill(0.1,J);

julia> w = rand(J,I);

julia> x = rand(I);

julia> y = zeros(J);

julia> @btime _zComp!(y,w,x,b)
  7.742 μs (62 allocations: 6.53 KiB)

julia> @btime _zComp1!(y,w,x,b)
  438.051 ns (0 allocations: 0 bytes)

julia> @btime _zComp2!(y,w,x,b)
  613.680 ns (2 allocations: 480 bytes)

julia> @btime _zComp3!(y,w,x,b)
  593.643 ns (0 allocations: 0 bytes)
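
To make the question above concrete, here is a hypothetical dispatcher over the functions defined earlier; the name zcomp_auto! and the cutoff are invented and would need benchmarking on each machine:

function zcomp_auto!(z, w, x, b; cutoff = 10_000)  # hypothetical; cutoff needs tuning
    if length(w) < cutoff
        _zComp1!(z, w, x, b)  # small problems: single-threaded @turbo loop
    else
        _zComp3!(z, w, x, b)  # large problems: BLAS mul!
    end
    return nothing
end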

Try

function _zComp4!(z,w,x,b)
    z .= b
    mul!(z, w, x, true, true)  # 5-arg mul!: z = w*x*α + z*β with α = β = true
    return nothing
end

More or less the same. I have added MKL.jl and was able to halve the timing of versions 2, 3 and 4, but only for the large matrices.

What are the two Boolean parameters? I can't find them in the documentation of mul!; there is only a 3-argument version, and a 5-argument version where the last two arguments (alpha and beta) seem to be scalar weights…

Yes, this is the one. true == 1 and false == 0. This would also work: mul!(z, w, x, 1, 1). The reason true and false are used in this and some other contexts is that they are the lowest level in the number type promotion hierarchy, so they never lead to accidental type promotion.


In this case the Booleans will be treated as ones. The advantage of using Booleans is that they are “universal” zeros and ones in the sense that they are always promoted to the type of the other addend. You can be confident that a + true will have the same type as a.
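
For instance, a quick check at the REPL (on 64-bit Julia, where Int is Int64):

julia> typeof(Int8(1) + true)  # Bool defers to the other addend's type
Int8

julia> typeof(Int8(1) + 1)     # whereas an Int literal promotes Int8 to Int64
Int64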


That’s super interesting. I guess it saves one the trouble of getting one(promote_type(T, U)) where w::Matrix{T} and x::Matrix{U}?

Is there a self-documenting way to use that pattern, so that casual readers can understand the purpose? I guess it could be

just_1 = true # a 1 of lowest possible type
just_0 = false
mul!(z, w, x, just_1, just_1)

But that seems kind of ugly.

You can try @tturbo for multithreaded @turbo. It may do better for larger matrices.
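
For instance, a sketch of the same kernel with @tturbo (the name _zComp5! is just for illustration; untested here, timings will vary):

function _zComp5!(z, w, x, b)
    @tturbo for n in axes(w, 1)  # @tturbo = multithreaded @turbo
        zn = 0.0
        for nl in axes(x, 1)
            zn += w[n, nl] * x[nl]
        end
        z[n] = zn + b[n]
    end
    return nothing
end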

But I’d guess it’s fairly memory bound, and the only improvement you could get would be if it takes multiple cores to saturate your memory bus.
Apple silicon only needs one, but most other architectures need several.
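
Back-of-the-envelope with the numbers above: the 500×800 Float64 matrix is 500·800·8 B ≈ 3.2 MB, which has to be streamed from memory on every call, and 3.2 MB in ~75 μs is roughly 43 GB/s, already in the ballpark of a typical laptop's memory bandwidth.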
