Hi, there!

But I do not know how to do this with matrix addition.
Can Anyone help me?

Why wouldn’t you let the BLAS library decide what’s best to do?

Is it experiment or you want the best performance?

I’m not an expert but I’d assume Matrix Addition is a memory bounded operation (When vectorized) hence Multi Threading isn’t going to help.

I’m not sure how much speed you will get, but you can use `@threads` and write your own addition function like the one below. Be sure to set the number of threads before starting `julia` with the environment variable `JULIA_NUM_THREADS`. See the Multi-Threading section in the docs for more info.

``````function matrixadd(A::Matrix{T}, B::Matrix{T}) where T
C = similar(A)
@inbounds C[i] = A[i] + B[i]
end
C
end
``````

Of course preallocating the output always helps. Just using `@. C = A + B` is probably going to give you pretty good performance.

It’s been mentioned before that native Julia matches BLAS at least for level 1 BLAS, so I would assume @ksmcreynolds’s solution gets close or at least very close to using `axpy!` when allocating. But I would make a few changes:

``````function matrixadd!(C::AbstractArray{T},A::AbstractArray{T}, B::AbstractArray{T}) where T
@boundscheck checkbounds(C,eachindex(B))
@boundscheck checkbounds(C,eachindex(A))
@inbounds C[i] = A[i] + B[i]
end
end
``````
2 Likes

Small correction: `1:eachindex(C)` should just be `eachindex(C)`, right?

1 Like

you can also use GPUArrays.jl which has a threaded backend!

I tried your example and I learned `@inbounds` is able to make my code faster.

It’s amazing to use the where keyword.

``````julia> Sys.CPU_CORES
4
4
julia> A = randn(4000,4000); B = randn(4000,4000); C = similar(A);
julia> function matrixadd!(C::AbstractArray{T},A::AbstractArray{T}, B::AbstractArray{T}) where T
@boundscheck checkbounds(C,eachindex(B))
@boundscheck checkbounds(C,eachindex(A))
C[i] = A[i] + B[i]
end
end
BenchmarkTools.Trial:
memory estimate:  48 bytes
allocs estimate:  1
--------------
minimum time:     28.962 ms (0.00% GC)
median time:      29.081 ms (0.00% GC)
mean time:        29.289 ms (0.00% GC)
maximum time:     32.234 ms (0.00% GC)
--------------
samples:          171
evals/sample:     1

julia> @benchmark \$C .= \$A .+ \$B
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     30.073 ms (0.00% GC)
median time:      31.142 ms (0.00% GC)
mean time:        31.496 ms (0.00% GC)
maximum time:     53.334 ms (0.00% GC)
--------------
samples:          159
evals/sample:     1
``````

My CPU% was almost 4x higher with the former.
EDIT:
I do see better performance when doing something more expensive than addition. But, if that is all you do, I would stick with single threaded. No need to cook your processor (possibly slowing other applications down until it cools down) for barely any gain.

1 Like