How to do chunk matrix/vector multiplication with @threads

biona001 · August 28, 2018, 6:14am

I implemented c = Ab using multithreading naively by computing row-wise dot products. I did so on BitArrays because it is the closest thing to the actual data structure in my problem.

However, when I tried to switch the row-wise inner products to submatrix/vector operations, my updates to the final vector stops happening, and I’m not sure why. I suspect it has to do with atomic operations, but there is so little documentation on this I’m still not sure what to do.

Below I paste my code:

function naive_mul!( 
	c :: AbstractVector, 
	A :: AbstractMatrix, 
	b :: AbstractVector
)
	for j in eachindex(b)
		c[j] = dot(A[j, :], b)
	end
end

function threaded_mul!( 
	c :: AbstractVector, 
	A :: AbstractMatrix, 
	b :: AbstractVector
)
	Threads.@threads for j in eachindex(b)
		c[j] = dot(A[j, :], b)
	end
end

On 8 threads, the multiplication is much faster:

using BenchmarkTools
a = bitrand(1000, 1000)
b = rand(1000)
c = zeros(1000)

julia> @btime naive_mul!(c, a, b)
  2.231 ms (1000 allocations: 46.88 KiB)

julia> @btime threaded_mul!(c, a, b)
  528.598 μs (515 allocations: 22.83 KiB)

But when I tried to do the same using “chunks” of the original matrix instead of row-by-row dot product, somehow I don’t get anything:

#function to assign the chunk size
function delegate(X::AbstractArray)
    nchunks = Threads.nthreads()
    splits = [round(Int, s) for s in linspace(0.0, size(X,1), nchunks+1)]
    return splits
end

function threaded_mul!(
    c::AbstractVector, 
    a::AbstractMatrix, 
    b::AbstractVector, 
    splits::Vector{Int64}
)
    Threads.@threads for i = 1:Threads.nthreads()
        region = splits[i]+1:splits[i+1]
        A_mul_B!(c[region], view(a, region, :), b)
    end
end

Split up the 1000x1000 matrix into flat evenly spaced chunks that begins on:

julia> splits = delegate(a)
9-element Array{Int64,1}:
    0
  125
  250
  375
  500
  625
  750
  875
 1000

Why does this not correctly assign c??

a = bitrand(1000, 1000)
b = rand(1000)
c = zeros(1000)
julia> threaded_mul!(c, a, b, delegate(a))

julia> c
1000-element Array{Float64,1}:
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
⋮

biona001 · August 28, 2018, 8:56pm

It was because c[region] actually creates a new vector that will store the result. Since this vector is not bound to the original vector c, it gets lost and hence the original vector does not get updated. To fix this, simply use view() on c and everything should work:

using BenchmarkTools
a = bitrand(1000, 1000)
b = rand(1000)
c = zeros(1000)

function delegate(X::AbstractArray)
    nchunks = Threads.nthreads()
    splits = [round(Int, s) for s in linspace(0.0, size(X,1), nchunks+1)]
    return splits
end

function threaded_mul!(
    c::AbstractVector, 
    a::AbstractMatrix, 
    b::AbstractVector, 
    splits::Vector{Int64}
)
    Threads.@threads for i = 1:Threads.nthreads()
        region = splits[i]+1:splits[i+1]
        A_mul_B!(view(c, region), view(a, region, :), b)
    end
end

threaded_mul!(c,a,b,delegate(a)) # this works!

Topic		Replies	Views
Multithreaded MatVec Numerics multithreading , matrices	10	1964	February 4, 2022
Poor performance multiplying many (large) matrices multithreaded Performance question , linearalgebra	11	2478	July 13, 2020
Strategy for using multiple threads in an in-place matrix operation General Usage question	3	652	January 9, 2020
Matrix vector multiplication Performance question	4	909	September 27, 2020
Julia multithreading New to Julia question	8	543	January 15, 2023

How to do chunk matrix/vector multiplication with @threads

Related topics