Which is in Julia the most efficient CPU-way to compute Wx+b where W is a (JxI) matrix, X is a I-dimensional vector and b is a J dimensional vector ?
(context: NN layers)
I have saw this thread but I can’t understand the answer
In general, would make sense to first check for the size of W or the number of threads in the computation and select a different computation method?
On my laptop with 12 threads I have:
julia> using LoopVectorization, BenchmarkTools, Test, Base.Threads, LinearAlgebra
julia> function _zComp!(z,w,x,b)
z .= 0.0
@inbounds @threads for n in axes(w,1)
zn = 0.0
@inbounds for nl in axes(x,1)
zn += w[n,nl] * x[nl]
end
z[n] += zn
z[n] += b[n]
end
return nothing
end
_zComp! (generic function with 1 method)
julia> function _zComp1!(z,w,x,b)
z .= 0.0
@turbo for n in axes(w,1)
zn = 0.0
for nl in axes(x,1)
zn += w[n,nl] * x[nl]
end
z[n] += zn
z[n] += b[n]
end
return nothing
end
_zComp1! (generic function with 1 method)
julia> function _zComp2!(z,w,x,b)
z .= w*x .+ b
return nothing
end
_zComp2! (generic function with 1 method)
julia> function _zComp3!(z,w,x,b)
z .= 0.0
mul!(z, w, x) .+= b
return nothing
end
_zComp3! (generic function with 1 method)
julia> (I,J) = (800,500);
julia> b = fill(0.1,J);
julia> w = rand(J,I);
julia> x = rand(I);
julia> y = zeros(J);
julia> ytrue = w*x .+ b;
julia> _zComp!(y,w,x,b)
julia> @test y ≈ ytrue
Test Passed
julia> _zComp1!(y,w,x,b)
julia> @test y ≈ ytrue
Test Passed
julia> _zComp2!(y,w,x,b)
julia> @test y ≈ ytrue
Test Passed
julia> _zComp3!(y,w,x,b)
julia> @test y ≈ ytrue
Test Passed
julia> @btime _zComp!(y,w,x,b)
77.164 μs (62 allocations: 6.53 KiB)
julia> @btime _zComp1!(y,w,x,b)
79.880 μs (0 allocations: 0 bytes)
julia> @btime _zComp2!(y,w,x,b)
75.305 μs (3 allocations: 4.00 KiB)
julia> @btime _zComp3!(y,w,x,b)
74.479 μs (0 allocations: 0 bytes)
So with relatively large matrices it is more or less tha same. However with smaller matrices the threaded version is really bad:
julia> (I,J) = (80,50);
julia> b = fill(0.1,J);
julia> w = rand(J,I);
julia> x = rand(I);
julia> y = zeros(J);
julia> @btime _zComp!(y,w,x,b)
7.742 μs (62 allocations: 6.53 KiB)
julia> @btime _zComp1!(y,w,x,b)
438.051 ns (0 allocations: 0 bytes)
julia> @btime _zComp2!(y,w,x,b)
613.680 ns (2 allocations: 480 bytes)
julia> @btime _zComp3!(y,w,x,b)
593.643 ns (0 allocations: 0 bytes)