I implemented c = Ab naively with multithreading by computing row-wise dot products. I did so on BitArrays because they are the closest thing to the actual data structure in my problem.
However, when I switched the row-wise inner products to submatrix/vector operations, my updates to the final vector stopped happening, and I'm not sure why. I suspect it has to do with atomic operations, but there is so little documentation on this that I'm still not sure what to do.
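(A side note on my atomics suspicion: as far as I understand, writes to disjoint indices from Threads.@threads should not need atomics at all. A minimal self-contained check, separate from my actual code:)

```julia
# Each thread writes only its own slots; with disjoint indices
# no atomic operations should be required.
x = zeros(Int, 100)
Threads.@threads for i in 1:100
    x[i] = i
end
@assert x == collect(1:100)  # every slot was written exactly once
```
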
Below I paste my code:
function naive_mul!(
    c :: AbstractVector,
    A :: AbstractMatrix,
    b :: AbstractVector
)
    for j in eachindex(b)
        c[j] = dot(A[j, :], b)
    end
end
function threaded_mul!(
    c :: AbstractVector,
    A :: AbstractMatrix,
    b :: AbstractVector
)
    Threads.@threads for j in eachindex(b)
        c[j] = dot(A[j, :], b)
    end
end
On 8 threads, the multiplication is much faster:
using BenchmarkTools
a = bitrand(1000, 1000)
b = rand(1000)
c = zeros(1000)
julia> @btime naive_mul!(c, a, b)
2.231 ms (1000 allocations: 46.88 KiB)
julia> @btime threaded_mul!(c, a, b)
528.598 μs (515 allocations: 22.83 KiB)
But when I tried to do the same thing using "chunks" of the original matrix instead of row-by-row dot products, c never gets updated:
# function to compute the chunk boundaries
function delegate(X::AbstractArray)
    nchunks = Threads.nthreads()
    splits = [round(Int, s) for s in linspace(0.0, size(X, 1), nchunks + 1)]
    return splits
end
function threaded_mul!(
    c::AbstractVector,
    a::AbstractMatrix,
    b::AbstractVector,
    splits::Vector{Int64}
)
    Threads.@threads for i = 1:Threads.nthreads()
        region = splits[i]+1:splits[i+1]
        A_mul_B!(c[region], view(a, region, :), b)
    end
end
delegate splits the 1000x1000 matrix into evenly sized row chunks; the chunk boundaries are:
julia> splits = delegate(a)
9-element Array{Int64,1}:
0
125
250
375
500
625
750
875
1000
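So thread i is meant to handle rows splits[i]+1 through splits[i+1]. Spelled out (using the 8-thread boundaries above):

```julia
# The 8 disjoint row ranges derived from the `splits` vector above
splits = [0, 125, 250, 375, 500, 625, 750, 875, 1000]
regions = [splits[i]+1:splits[i+1] for i in 1:length(splits)-1]
# regions == [1:125, 126:250, ..., 876:1000]
@assert sum(map(length, regions)) == 1000  # every row covered exactly once
```
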
Why does this not correctly assign c?
a = bitrand(1000, 1000)
b = rand(1000)
c = zeros(1000)
julia> threaded_mul!(c, a, b, delegate(a))
julia> c
1000-element Array{Float64,1}:
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
⋮