Indeed, if I write `collect(chunks(1:length(a), nchunks))`, then the `@batch` version of the code now works! Thank you for the answer!
Regarding your comments:
Do I have a good reason to use Polyester?
Maybe not. I was just trying to implement a multithreaded code and got no speedup compared to the single-threaded case, so I wondered whether the problem was multithreading overhead (there are probably other issues too) and whether it could be mitigated by using Polyester threads.
Regarding false sharing: I am not familiar with the concept. Could you please elaborate? Curiously, false sharing (if I understand it correctly) seems to be a problem when using `@threads` but not when using `@batch`. I implemented the following functions:
```julia
using ChunkSplitters
using Polyester

# v1: every thread accumulates directly into the shared `accu` vector,
# so neighbouring slots can share a cache line (potential false sharing).
function threaded_sum_v1(a, nchunks)
    accu = zeros(eltype(a), nchunks)
    @inbounds begin
        Threads.@threads for (irange, idxs) in chunks(1:length(a), nchunks)
            for i in irange
                accu[idxs] += a[i]
            end
        end
    end # @inbounds
    return sum(accu)
end

# v2: each task accumulates into a local variable and writes to `accu` once.
function threaded_sum_v2(a, nchunks)
    accu = zeros(eltype(a), nchunks)
    @inbounds begin
        Threads.@threads for (irange, idxs) in chunks(1:length(a), nchunks)
            local_accu = zero(eltype(a))
            for i in irange
                local_accu += a[i]
            end
            accu[idxs] = local_accu
        end
    end # @inbounds
    return sum(accu)
end

# Same two versions with Polyester's @batch
# (chunks(...) is collected so that @batch can iterate it, as discussed above).
function batch_sum_v1(a, nchunks)
    accu = zeros(eltype(a), nchunks)
    @inbounds begin
        @batch for (irange, idxs) in collect(chunks(1:length(a), nchunks))
            for i in irange
                accu[idxs] += a[i]
            end
        end
    end # @inbounds
    return sum(accu)
end

function batch_sum_v2(a, nchunks)
    accu = zeros(eltype(a), nchunks)
    @inbounds begin
        @batch for (irange, idxs) in collect(chunks(1:length(a), nchunks))
            local_accu = zero(eltype(a))
            for i in irange
                local_accu += a[i]
            end
            accu[idxs] = local_accu
        end
    end # @inbounds
    return sum(accu)
end
```
The `*_v1` versions should suffer from false sharing and the `*_v2` versions should not. Then I timed them:
```julia
using BenchmarkTools

a = rand(100_000_000)

@btime sum($a)                  #  30.366 ms (0 allocations: 0 bytes)
@btime threaded_sum_v1($a, 4)   #  55.822 ms (22 allocations: 2.30 KiB)
@btime threaded_sum_v2($a, 4)   #  16.716 ms (22 allocations: 2.30 KiB)
@btime batch_sum_v1($a, 4)      #  16.914 ms (2 allocations: 288 bytes)
@btime batch_sum_v2($a, 4)      #  15.921 ms (2 allocations: 288 bytes)
```
Some comments:

- For the version using `@threads`, false sharing seems to lead to a significant slowdown (it takes longer than the single-threaded version!), but when using `@batch` there is hardly any difference between the versions with and without false sharing. Any idea what might be going on here? (A padded variant to test this is sketched after this list.)
- When comparing the versions without false sharing, the timings using `@batch` and `@threads` are almost the same, which perhaps hints that there is not much benefit in general to using `@batch`.
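To probe the false-sharing explanation further, here is a hypothetical variant I could try (my own sketch, not something from the discussion above): it keeps the `threaded_sum_v1` accumulation pattern but spaces the slots of `accu` one cache line apart, assuming 64-byte cache lines. If `threaded_sum_v1` is slow because of false sharing, this padded version should recover the `threaded_sum_v2` timings. The name `threaded_sum_v1_padded` and the padding constant are assumptions for illustration.

```julia
# Hypothetical test for false sharing: same accumulation pattern as
# threaded_sum_v1, but each chunk's slot is spaced one (assumed 64-byte)
# cache line apart, so no two threads write to the same cache line.
function threaded_sum_v1_padded(a, nchunks)
    pad  = cld(64, sizeof(eltype(a)))        # elements per cache line
    accu = zeros(eltype(a), nchunks * pad)   # one cache line per chunk
    Threads.@threads for (irange, idxs) in chunks(1:length(a), nchunks)
        slot = (idxs - 1) * pad + 1          # first element of this chunk's line
        @inbounds for i in irange
            accu[slot] += a[i]
        end
    end
    return sum(accu)                         # untouched padding entries are zero
end
```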
Now, I am not sure whether false sharing is an issue for the problem I am actually considering (the sum function is just a minimal working example that illustrates the problem of using `@batch` with ChunkSplitters).

Some more context: I have implemented a multithreaded version of sparse matrix - sparse matrix multiplication, `C = A*B`, where all matrices are sparse. The way I tackled the problem was to split the matrix `B` into chunks of columns and to multithread over the chunks of `B`. Each thread computes `C_chunk = A*B_chunk`, and at the end the full matrix `C` is assembled from the different `C_chunk`s. So here the accumulator in each thread is a `SparseMatrixCSC` (and not a simple `Number`). Would false sharing be a problem for this kind of accumulator? I would like to be able to pre-allocate the accumulators, so that they can be recycled when doing repeated matrix multiplications (although this might not bring much benefit, since the multiplication always has to `resize!` some vectors because the sparsity structure of `C` is not known beforehand).
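For concreteness, here is a minimal sketch of the structure I described (my own illustration, not the actual implementation; the function name `threaded_spmatmul` and the `nchunks` keyword are made up for the example). Each task writes into its own freshly allocated sparse matrix, so only the small vector of references is shared between threads.

```julia
using SparseArrays, ChunkSplitters

# Hypothetical sketch of the chunked sparse multiplication described above:
# each thread multiplies A by a block of columns of B and stores the result
# in its own slot; the blocks are concatenated at the end.
function threaded_spmatmul(A::SparseMatrixCSC, B::SparseMatrixCSC; nchunks = Threads.nthreads())
    C_chunks = Vector{SparseMatrixCSC}(undef, nchunks)
    Threads.@threads for (jrange, ichunk) in chunks(1:size(B, 2), nchunks)
        # The per-thread accumulator is a freshly allocated sparse matrix,
        # so the actual data written by each thread lives in separate arrays.
        C_chunks[ichunk] = A * B[:, jrange]
    end
    return hcat(C_chunks...)
end
```

This sketch allocates fresh `C_chunk`s on every call; pre-allocating and recycling them (with the `resize!` caveat mentioned above) would be the next step.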