Using Polyester.jl with ChunkSplitters.jl

I am trying to parallelize a function where I have some quantity a that I split into chunks, parallelize over the different chunks, and afterwards combine the results obtained from all chunks. An example of this would be implementing a multithreaded sum.

ChunkSplitters.jl provides a convenient way to do this. We would write:

using ChunkSplitters

function threaded_sum(a, nchunks)
    
    accu = zeros(eltype(a), nchunks)
    
    Threads.@threads for (irange, idxs) in chunks(1:length(a), nchunks)
        for i in irange
            accu[idxs] += a[i]
        end
    end
    
    return sum(accu)
    
end

This works just fine. However, I would like to use Polyester.jl instead of Base.Threads. I would expect this to work:

using Polyester

function batch_sum(a, nchunks)
    
    accu = zeros(eltype(a), nchunks)
    
    @batch for (irange, idxs) in chunks(1:length(a), nchunks)
        for i in irange
            accu[idxs] += a[i]
        end
    end
    
    return sum(accu)
    
end

However, when I try

a = rand(100)
batch_sum(a, 4)

I get the following error:

MethodError: no method matching step(::ChunkSplitters.Chunk{Static.OptionallyStaticUnitRange{Static.StaticInt{1}, Int64}})

Does anyone know why this is the case and how one can work around this?

You can try collect(chunks(1:length(a), nchunks)).
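
Something like this should then work (a sketch of your batch_sum, the only change being that the chunk iterator is materialized with collect so that @batch can iterate over it):

using ChunkSplitters, Polyester

function batch_sum(a, nchunks)

    accu = zeros(eltype(a), nchunks)

    # collect turns the lazy chunk iterator into a plain Vector of
    # (index range, chunk index) tuples, which @batch can handle
    @batch for (irange, idxs) in collect(chunks(1:length(a), nchunks))
        for i in irange
            accu[idxs] += a[i]
        end
    end

    return sum(accu)

end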

Two more comments:

  • Why do you want to use Polyester? Do you have a good reason? Among other things, the package hasn’t gotten much love for a while now.
  • Regardless of the @batch vs @threads question, your reduction pattern is likely to be very inefficient due to false sharing. You should at least perform the per-task reduction locally before writing into the shared array accu (see e.g. the ChunkSplitters.jl documentation). This way, you would run into false sharing only once instead of O(length(irange)) times.

Indeed, if I write collect(chunks(1:length(a), nchunks)) then the @batch version of the code now works! Thank you for the answer!

Regarding your comments:

Do I have a good reason to use Polyester?
Maybe not. I was just trying to implement some multithreaded code and obtained no speedup compared to the single-threaded case. So I wondered if the problem was multithreading overhead (there are probably other issues as well) and whether it could be mitigated by using Polyester threads.

Regarding false sharing:
I am not familiar with the concept. Could you please elaborate? Curiously, false sharing (if I understand it correctly) seems to be a problem when using @threads but not when using @batch. I implemented the following functions:

function threaded_sum_v1(a, nchunks)
    
    accu = zeros(eltype(a), nchunks)
    
    @inbounds begin
    Threads.@threads for (irange, idxs) in chunks(1:length(a), nchunks)
        for i in irange
            accu[idxs] += a[i]
        end
    end
    end # @inbounds
    
    return sum(accu)
    
end

function threaded_sum_v2(a, nchunks)
    
    accu = zeros(eltype(a), nchunks)
    
    @inbounds begin
    Threads.@threads for (irange, idxs) in chunks(1:length(a), nchunks)
        local_accu = zero(eltype(a))
        for i in irange
            local_accu += a[i]
        end
        accu[idxs] = local_accu
    end
    end # @inbounds  
    
    return sum(accu)
    
end

function batch_sum_v1(a, nchunks)
    
    accu = zeros(eltype(a), nchunks)
    
    @inbounds begin
    @batch for (irange, idxs) in collect(chunks(1:length(a), nchunks))
        for i in irange
            accu[idxs] += a[i]
        end
    end
    end # @inbounds
    
    return sum(accu)
    
end

function batch_sum_v2(a, nchunks)
    
    accu = zeros(eltype(a), nchunks)
    
    @inbounds begin
    @batch for (irange, idxs) in collect(chunks(1:length(a), nchunks))
        local_accu = zero(eltype(a))
        for i in irange
            local_accu += a[i]
        end
        accu[idxs] = local_accu
    end
    end # @inbounds
    
    return sum(accu)
    
end

The *_v1 versions should suffer from false sharing and the *_v2 versions should not. I then timed them:

using BenchmarkTools

a = rand(100_000_000)

@btime sum($a) 
  30.366 ms (0 allocations: 0 bytes)

@btime threaded_sum_v1($a, 4)
  55.822 ms (22 allocations: 2.30 KiB)

@btime threaded_sum_v2($a, 4)
  16.716 ms (22 allocations: 2.30 KiB)

@btime batch_sum_v1($a, 4)
  16.914 ms (2 allocations: 288 bytes)

@btime batch_sum_v2($a, 4)
  15.921 ms (2 allocations: 288 bytes)

Some comments:

  • for the version using @threads, false sharing seems to lead to a significant slowdown (it takes longer than the single-threaded version!), but when using @batch there is hardly any difference between the versions with and without false sharing. Any idea what might be going on here?
  • when comparing the versions without false sharing, the timings using @batch and @threads are almost the same, which perhaps hints that there is not much benefit in general to using @batch.

Now, I am not sure if false sharing is an issue for the problem I am actually considering (the sum function is just a minimal working example that illustrates the problem of using @batch with ChunkSplitters).

Some more context: I have implemented a multithreaded version of sparse matrix - sparse matrix multiplication, C = A*B, where all matrices are sparse. The way I tackled the problem was to split the matrix B into chunks of columns and to multithread over the chunks of B. So each thread computes C_chunk = A*B_chunk, and at the end we assemble the full matrix C from the different C_chunks. So here the accumulator in each thread is a SparseMatrixCSC (and not a simple Number). Would false sharing be a problem for this kind of accumulator? I would like to be able to pre-allocate the accumulators, so that they can be recycled when doing repeated matrix multiplications (although this might not bring much benefit, since the multiplication always has to resize! some vectors anyway because we do not know the sparsity structure of C beforehand).
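
To give a rough idea, the strategy looks something like the following sketch (simplified and untested, not my actual implementation; the name chunked_spmatmul and the hcat-based assembly are just for illustration):

using SparseArrays, ChunkSplitters

function chunked_spmatmul(A::SparseMatrixCSC, B::SparseMatrixCSC; nchunks = Threads.nthreads())
    # one output block per chunk; each thread writes only into its own slot
    Cblocks = Vector{SparseMatrixCSC{promote_type(eltype(A), eltype(B)), Int}}(undef, nchunks)
    Threads.@threads for (jrange, ichunk) in chunks(1:size(B, 2), nchunks)
        # each task multiplies A by its block of columns of B
        Cblocks[ichunk] = A * B[:, jrange]
    end
    # the reduction step: assemble C from the per-chunk results
    return reduce(hcat, Cblocks)
end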

Great, happy to help.

Generally, it’s much more likely that your parallelisation strategy is suboptimal. That said, overhead can of course matter in some cases.

CPU cores generally don’t operate on individual elements of an array but instead grab an entire contiguous block from memory (a cache line/block is typically 64 bytes). Hence, even if you only read/write a single element of an array, under the hood an entire 64 byte block (say 8 Float64 numbers) is loaded into cache. In the context of multithreading this implies that different CPU cores (of different threads), when accessing accu, load cache blocks that are (partially) shared (i.e. refer to the same elements). If one thread modifies its corresponding element in accu and thus modifies the cache line, the other CPU cores (of the other threads) must restore “cache coherence”. That is, their cache block isn’t valid anymore and must be reloaded, which is costly. The fact that in the code they actually don’t care about these neighboring elements in the cache line but only “their own element” doesn’t matter. Your code is free of race conditions (because there is no “logical sharing”) but it will perform badly (because cache lines are “falsely shared”).

To remedy this, you could pad the accu array with artificial zeros between the slots used by different threads. However, it is better to mitigate the issue more directly. The newly suggested code (with a thread-local accumulator) still has false sharing, but only once, because each thread only accesses accu once at the end. You could avoid false sharing altogether, e.g. by using an atomic accumulator (and atomic addition) instead of accu.
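
For instance, something along these lines (an untested sketch; the name threaded_sum_atomic is made up, and it assumes eltype(a) is one of the numeric types supported by Threads.Atomic):

using ChunkSplitters

function threaded_sum_atomic(a, nchunks)
    total = Threads.Atomic{eltype(a)}(zero(eltype(a)))
    Threads.@threads for (irange, _) in chunks(1:length(a), nchunks)
        # reduce locally first ...
        local_accu = zero(eltype(a))
        for i in irange
            local_accu += a[i]
        end
        # ... then perform a single atomic update per chunk; there is no
        # shared accumulator array, hence no false sharing at all
        Threads.atomic_add!(total, local_accu)
    end
    return total[]
end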

Note that Polyester does a few (sometimes too) smart things in a @batch block. For example, in contrast to general language semantics, it defaults to @views for array slices, automatically implies @inbounds, and more. I don’t know how it mitigates false sharing here, but I’m not surprised that it does. Btw, speaking of @inbounds: you need to put @inbounds inside the @threads block for it to have an effect, because the latter creates a closure.
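
Concretely, that means writing it like this (your threaded_sum_v2, with the only change being where @inbounds is placed):

using ChunkSplitters

function threaded_sum_v2(a, nchunks)
    accu = zeros(eltype(a), nchunks)
    Threads.@threads for (irange, idxs) in chunks(1:length(a), nchunks)
        local_accu = zero(eltype(a))
        # @inbounds sits inside the loop body, i.e. inside the closure that
        # @threads creates, so it actually applies to the a[i] accesses
        @inbounds for i in irange
            local_accu += a[i]
        end
        accu[idxs] = local_accu
    end
    return sum(accu)
end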

Like I said above, it’s much more likely that the multithreading pattern is/was inefficient 🙂

As for your application, it’s hard to say without seeing any code. But from what you wrote I would assume that false sharing is less likely to be an issue.


Thank you for the excellent in-depth explanation of false sharing and of how CPUs access arrays! I really appreciate it!

Also, thank you for pointing out the need to use @inbounds inside of @threads. That might speed things up a bit.

So from your explanation I understood that false sharing is an issue if the vector of accumulators accu is stored contiguously in memory, as is the case for a Vector{Float64}. But a Vector{SparseMatrixCSC{Float64, Int}} is not stored contiguously in memory; it should be just a vector of pointers, correct? Therefore, false sharing shouldn’t be an issue here?

I believe the slowness of my code is in the reduction step, where the output matrix C is assembled from the individual chunks. I haven’t multithreaded that step, but it should be possible. I will play a bit more with the code.

If I end up with something useful, I might try to make it into a package.