Multithreaded parallel prefix scan not faster

I am trying to write the parallel prefix scan as a multithreaded CPU function. Since the number of processors is small relative to the length of the array It wouldn’t be fruitful to use Hillis-Steele or Belloch scan.

Instead for n processors my algorithm divides the array into n parts and then spawns a task to each thread. Storing each part’s sum to a different array and then performing a prefix scan on this array. Finally broadcasting back these sums to n - 1 parts which need the sum of previous parts.

So, in theory for 2 processors the new algorithm should take \frac{t}{2} + \frac{t}{4} =\frac{3t}{4}.
And for 6 processors (my case), \frac{t}{6} + \frac{5t}{36} = \frac{11t}{36}

function scan_naive!(op, output::Vector{T}, input::Vector{T}) where T <: Number
    output[1] = input[1]
    for i = 2:length(input)
        @inbounds output[i] = op(output[i - 1], input[i])
    end
    output
end
using Plots
sizes = 1 .<< collect(10:29)

function timer(f, op, len)
    A = ones(Int32, len)
    B = similar(A)
    x = Base.@elapsed f(op, B, A)
    y = Base.@elapsed f(op, B, A)
    z = Base.@elapsed f(op, B, A)
    min(x, y, z) # Unstable results without multiple trials for multithreaded version
end
t = [timer(scan_naive!, +, i) for i in sizes]
plot(sizes, t, title="Benchmark for prefix scan", scale=:log10, marker = :circle, legend=:bottomright)

image

function threaded_scan!(op, output::Vector{T}, input::Vector{T}) where T <: Number
    segment_length = length(input) ÷ nthreads()
    sums = Array{T}(undef, nthreads())
    @threads for i = 1:nthreads()
        low = 1 + (threadid() - 1)*segment_length
        high = threadid()*segment_length
        threadid() == nthreads() && (high = length(input))
        
        @inbounds output[low] = input[low]
        for j in (low + 1):high
            @inbounds output[j] = op(input[j], output[j - 1])
        end
        @inbounds sums[threadid()] = output[high]
    end
    scan_naive!(op, sums, sums)
    
    @threads for i in (segment_length + 1):length(input)
        segment = (i - 1) ÷ segment_length 
        i >= nthreads()*segment_length && (segment = nthreads() - 1)
        @inbounds output[i] = op(sums[segment], output[i])
    end
    output
end
t2 = [timer(threaded_scan!, +, i) for i in sizes]
plot(sizes, [t, t2], title="Benchmark for prefix scan", scale=:log10, marker = :circle, legend = :bottomright)

image

This result was surprising to me as I expected the multithreaded version to overtake the single threaded version by the end which is an array of size 2^{28} \approx 5\cdot10^8.

Then I made a small optimization which is that I use only n - 1 processors in the latter stage to broadcast partial sums which will remove the overhead of computing the segment and kick in SIMD instructions.

function threaded_scan2!(op, output::Vector{T}, input::Vector{T}) where T <: Number
    segment_length = length(input) ÷ nthreads()
    sums = Array{T}(undef, nthreads())
    @threads for i = 1:nthreads()
        low = 1 + (threadid() - 1)*segment_length
        high = threadid()*segment_length
        threadid() == nthreads() && (high = length(input))
        
        @inbounds output[low] = input[low]
        for j in (low + 1):high
            @inbounds output[j] = input[j] + output[j - 1]
        end
        @inbounds sums[threadid()] = output[high]
    end
    scan_naive!(op, sums, sums)
    
    tasks = []
    
    for i in 2:nthreads()
        low = 1 + (i- 1)*segment_length
        high = i*segment_length
        v = view(output, low:high)
        tsk = Threads.@spawn @inbounds v .+= sums[i - 1]  # Only does addition, `op` is useless here 
        push!(tasks, tsk)
    end
    wait.(tasks)
    output
end

image

This appears to be better but still not beating the single threaded version. Since this is a memory intensive kernel maybe that’s the bottleneck that’s slowing this down, is there a way to confirm this suspicion ?

You can take a look at the two macros @btime and @benchmark. From this package https://github.com/JuliaCI/BenchmarkTools.jl
The allocations (MB and count) and gc time should give you an idea about memory consumption (and time for garbage collection)

I did that before posting, just didn’t mention it here. (2^{29} elements)

# threaded_scan2!
BenchmarkTools.Trial: 
  memory estimate:  8.83 KiB
  allocs estimate:  82
  --------------
  minimum time:     298.860 ms (0.00% GC)
  median time:      299.207 ms (0.00% GC)
  mean time:        299.275 ms (0.00% GC)
  maximum time:     300.185 ms (0.00% GC)
  --------------
  samples:          17
  evals/sample:     1
# naive_scan
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     192.783 ms (0.00% GC)
  median time:      193.240 ms (0.00% GC)
  mean time:        193.359 ms (0.00% GC)
  maximum time:     195.136 ms (0.00% GC)
  --------------
  samples:          26
  evals/sample:     1

I assumed that the threading is doing the allocations as internally I only declare 2 small arrays which shouldn’t be in the order of kilobytes. I even went ahead and changed tasks = [] to an Array of undef Task with required size. This didn’t have much effect on the benchmark, only the allocs dropped to 76, 8.33 kB which isn’t significant.

What are you running on? On a 24-core Xeon Platinum 8260M I get the expected speedup for your first version with 16 threads:
:
scanspeedupn16

i5 8600k link. My benchmark was a bit noisy altering a bit from run to run. I did be a little dishonest and post on of the better runs here. I changed using the @time to using @benchmark along with a setup and things turned out much better. Visually I couldn’t notice any difference between different runs which was apparent earlier when I took the best time out of three runs.

But, even after that the multithreaded version was slower. I pulled out my system monitor and it appeared as if CPU cores would alternate between each other’s 100% usage. Since this is a very memory bound benchmark the OS’s policies on scheduling and swapping should play a significant part in the benchmark. I switched my OS to windows and it appears that in Windows multithreaded performance is better (sigh) here. Of course, the linux kernel version and the distro’s policies will com into play but this is interesting nonetheless.

On windows I get (with very few changes: @benchmark + using channel instead of Tasks)
image

Oops, I see that I posted results with 16 threads above! Here are my results with 6:
scanspeedupn6
So I only get a modest speedup for your first version, but the improved version gets the expected speedup. As you mention, the first version suffers from extra overhead in the second loop. In your case it’s likely memory pressure since you’re using all 6 cores, whereas I’m only using 6 out of 24.