How to do chunk matrix/vector multiplication with @threads

I implemented c = Ab using multithreading naively by computing row-wise dot products. I did so on BitArrays because it is the closest thing to the actual data structure in my problem.

However, when I tried to switch the row-wise inner products to submatrix/vector operations, my updates to the final vector stops happening, and I’m not sure why. I suspect it has to do with atomic operations, but there is so little documentation on this I’m still not sure what to do.

Below I paste my code:

function naive_mul!( 
	c :: AbstractVector, 
	A :: AbstractMatrix, 
	b :: AbstractVector
)
	for j in eachindex(b)
		c[j] = dot(A[j, :], b)
	end
end

function threaded_mul!( 
	c :: AbstractVector, 
	A :: AbstractMatrix, 
	b :: AbstractVector
)
	Threads.@threads for j in eachindex(b)
		c[j] = dot(A[j, :], b)
	end
end

On 8 threads, the multiplication is much faster:

using BenchmarkTools
a = bitrand(1000, 1000)
b = rand(1000)
c = zeros(1000)

julia> @btime naive_mul!(c, a, b)
  2.231 ms (1000 allocations: 46.88 KiB)

julia> @btime threaded_mul!(c, a, b)
  528.598 μs (515 allocations: 22.83 KiB)

But when I tried to do the same using “chunks” of the original matrix instead of row-by-row dot product, somehow I don’t get anything:

#function to assign the chunk size
function delegate(X::AbstractArray)
    nchunks = Threads.nthreads()
    splits = [round(Int, s) for s in linspace(0.0, size(X,1), nchunks+1)]
    return splits
end

function threaded_mul!(
    c::AbstractVector, 
    a::AbstractMatrix, 
    b::AbstractVector, 
    splits::Vector{Int64}
)
    Threads.@threads for i = 1:Threads.nthreads()
        region = splits[i]+1:splits[i+1]
        A_mul_B!(c[region], view(a, region, :), b)
    end
end

Split up the 1000x1000 matrix into flat evenly spaced chunks that begins on:

julia> splits = delegate(a)
9-element Array{Int64,1}:
    0
  125
  250
  375
  500
  625
  750
  875
 1000

Why does this not correctly assign c??

a = bitrand(1000, 1000)
b = rand(1000)
c = zeros(1000)
julia> threaded_mul!(c, a, b, delegate(a))

julia> c
1000-element Array{Float64,1}:
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
⋮

It was because c[region] actually creates a new vector that will store the result. Since this vector is not bound to the original vector c, it gets lost and hence the original vector does not get updated. To fix this, simply use view() on c and everything should work:

using BenchmarkTools
a = bitrand(1000, 1000)
b = rand(1000)
c = zeros(1000)

function delegate(X::AbstractArray)
    nchunks = Threads.nthreads()
    splits = [round(Int, s) for s in linspace(0.0, size(X,1), nchunks+1)]
    return splits
end

function threaded_mul!(
    c::AbstractVector, 
    a::AbstractMatrix, 
    b::AbstractVector, 
    splits::Vector{Int64}
)
    Threads.@threads for i = 1:Threads.nthreads()
        region = splits[i]+1:splits[i+1]
        A_mul_B!(view(c, region), view(a, region, :), b)
    end
end

threaded_mul!(c,a,b,delegate(a)) # this works! 
2 Likes