Matrix Addition with multithreads!


#1

Hi, there!

I’d like to do matrix addition with multithreads.
I can do matrix multiplication with multithreads via BLAS.set_num_threads(t).
But I do not know how to do this with matrix addition.
Can Anyone help me?


#2

Why wouldn’t you let the BLAS library decide what’s best to do?

Is it experiment or you want the best performance?

I’m not an expert but I’d assume Matrix Addition is a memory bounded operation (When vectorized) hence Multi Threading isn’t going to help.


#3

I’m not sure how much speed you will get, but you can use @threads and write your own addition function like the one below. Be sure to set the number of threads before starting julia with the environment variable JULIA_NUM_THREADS. See the Multi-Threading section in the docs for more info.

function matrixadd(A::Matrix{T}, B::Matrix{T}) where T
    C = similar(A)
    Threads.@threads for i = 1:length(A)
        @inbounds C[i] = A[i] + B[i]
    end
    C
end

Of course preallocating the output always helps. Just using @. C = A + B is probably going to give you pretty good performance.


#4

It’s been mentioned before that native Julia matches BLAS at least for level 1 BLAS, so I would assume @ksmcreynolds’s solution gets close or at least very close to using axpy! when allocating. But I would make a few changes:

function matrixadd!(C::AbstractArray{T},A::AbstractArray{T}, B::AbstractArray{T}) where T
   @boundscheck checkbounds(C,eachindex(B))
   @boundscheck checkbounds(C,eachindex(A))
    Threads.@threads for i = 1:eachindex(C)
        @inbounds C[i] = A[i] + B[i]
    end
end
matrixadd(A::AbstractArray{T}, B::AbstractArray{T}) where T = (C = similar(A); matrixadd!(C,A,B); C)

#5

Small correction: 1:eachindex(C) should just be eachindex(C), right?


#6

you can also use GPUArrays.jl which has a threaded backend!


#7

Thank you for your answering.
I tried your example and I learned @inbounds is able to make my code faster.


#8

Thank you for your answer!
It’s amazing to use the where keyword.


#9

Thanks to your answer that properly correct the above code.


#10

Thank you for your answer!


#11
julia> Sys.CPU_CORES
4
julia> Threads.nthreads()
4
julia> A = randn(4000,4000); B = randn(4000,4000); C = similar(A);
julia> function matrixadd!(C::AbstractArray{T},A::AbstractArray{T}, B::AbstractArray{T}) where T
                 @boundscheck checkbounds(C,eachindex(B))
                 @boundscheck checkbounds(C,eachindex(A))
                 @inbounds Threads.@threads for i = eachindex(C)
                      C[i] = A[i] + B[i]
                  end
              end
julia> @benchmark matrixadd!($C, $A, $B)
BenchmarkTools.Trial: 
  memory estimate:  48 bytes
  allocs estimate:  1
  --------------
  minimum time:     28.962 ms (0.00% GC)
  median time:      29.081 ms (0.00% GC)
  mean time:        29.289 ms (0.00% GC)
  maximum time:     32.234 ms (0.00% GC)
  --------------
  samples:          171
  evals/sample:     1

julia> @benchmark $C .= $A .+ $B
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     30.073 ms (0.00% GC)
  median time:      31.142 ms (0.00% GC)
  mean time:        31.496 ms (0.00% GC)
  maximum time:     53.334 ms (0.00% GC)
  --------------
  samples:          159
  evals/sample:     1

My CPU% was almost 4x higher with the former.
EDIT:
I do see better performance when doing something more expensive than addition. But, if that is all you do, I would stick with single threaded. No need to cook your processor (possibly slowing other applications down until it cools down) for barely any gain.