Are "optimized" single thread algorithms typically different than "optimized" multithread implementations?

Thanks for your help everyone!

I’m relatively new to multithreading and I’m probably wrong, but it seems like the fastest single-threaded function for the desired output doesn’t parallelize well, while a for-loop implementation of the same output is faster when multithreaded. I’m using a 16-inch MacBook Pro with an M1 Pro.

  1. Does this generalize, or is it unique to the functions I’m using?
  2. If that’s true, should you build a package optimized for single-thread or multithread performance?

Minimum working example - a batched matrix-vector multiplication

#Single Thread Optimized Version

using BenchmarkTools

x = collect(1:1:10)
y = rand(length(x),length(x), length(x))

@benchmark dropdims(sum(y .* x', dims=2), dims=2)

BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.792 μs … 498.375 μs  ┊ GC (min … max):  0.00% … 98.50%
 Time  (median):     1.879 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   2.200 μs ±  10.861 μs  ┊ GC (mean ± σ):  11.95% ±  2.41%
 Memory estimate: 9.12 KiB, allocs estimate: 14.

Multi Thread Optimized Version

x = collect(1:1:10)
y = rand(length(x),length(x), length(x))


using Base.Threads  # needed for @spawn

function Method2(y::Array{Float64,3}, x::Vector{Int64})
    #Preallocate Output
    output = Matrix{Float64}(undef, length(x), size(y,3))
    
    @fastmath @inbounds @spawn for i=1:size(y,3)
        output[:,i] = @view(y[:,:,i]) * x
    end
    return output
end

@benchmark Method2(y, x)

BenchmarkTools.Trial: 10000 samples with 320 evaluations.
 Range (min … max):  348.697 ns … 41.262 μs  ┊ GC (min … max):  0.00% … 95.21%
 Time  (median):     626.106 ns              ┊ GC (median):     0.00%
 Time  (mean ± σ):   800.846 ns ±  2.444 μs  ┊ GC (mean ± σ):  18.31% ±  5.89%
Memory estimate: 2.59 KiB, allocs estimate: 15.

Thanks in advance for perspectives much better educated than my own!

1 Like

I’m not very experienced with parallelism, but I have normally made the opposite observation: if you want good parallel performance, you first optimize single-threaded performance, and then add parallelism. This is at least what I’ve seen for shared-memory parallelism (multi-threading).

The exception is when the algorithm isn’t inherently easily parallelizable, and you need to change the approach completely to achieve it.

In your examples there are, I think, some confounding factors. Your first example isn’t really optimized: it needlessly allocates an intermediate array. The benchmarking approach is also not optimal: you should wrap that code in a function, like you do with the second, and interpolate the global variables into @benchmark with $.
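For example, a minimal sketch (the wrapper name method1 is just one I’m picking here; it assumes BenchmarkTools is loaded):

using BenchmarkTools

# Hypothetical wrapper for the single-threaded version, so the benchmark
# measures compiled code instead of global-scope evaluation.
method1(y, x) = dropdims(sum(y .* x', dims=2), dims=2)

# Interpolating with $ keeps the global variables from distorting the timing.
@benchmark method1($y, $x)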

The other function uses @spawn, which I’m not sure is correct; I think it should be @threads. Currently, I think you are just spawning tasks and not waiting for them to finish.

The second function also calls BLAS, which is multi-threaded by default unless you explicitly disabled it. And by using * you are creating an unnecessary temporary array on every iteration. Try

@views mul!(output[:,i], y[:,:,i], x) 

instead, to work in-place.
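Putting the two suggestions together, a threaded in-place version could look roughly like this (just a sketch, not benchmarked; it assumes using LinearAlgebra for mul! and Base.Threads for @threads, and the name method2_inplace is my own):

using LinearAlgebra
using Base.Threads

function method2_inplace(y::Array{Float64,3}, x::Vector{Int64})
    output = Matrix{Float64}(undef, length(x), size(y, 3))
    @threads for i in 1:size(y, 3)
        # In-place matrix-vector product; @views avoids copying the slices.
        @views mul!(output[:, i], y[:, :, i], x)
    end
    return output
end

Each iteration writes only to its own column of output, so there are no races between the loop iterations.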

6 Likes

In a very, very general sense: the runtime of a parallelized program is dominated by the portion of the work that can’t be parallelized. Why is this?

Suppose the total runtime T can be split into two parts: the parallelizable part T_{parallel} (abbreviated T_p) and the non-parallelizable (or serial) part T_{serial} (abbreviated T_s), such that T = T_s + T_p. As you increase the number of threads/processes working on your program, T_p goes to 0 (that work is divided between all of them), leaving T = T_s. To see this, let p = T_p / T be the proportion of the total runtime that we can parallelize, and write the above as:

T = T_s + T_p = (1-p)T + pT

If we now throw n parallel processors at the problem, the parallel part is split n ways and the runtime becomes

T(n) = (1-p)T + \frac{p}{n}T

As n \to \infty, the second term tends to 0, so T(n) tends to (1-p)T = T_s: in the limit, the non-parallelizable portion dominates the runtime.

This is called Amdahl’s Law.

The trouble, of course, is that p depends heavily on the algorithm you want to compute: some algorithms are easily parallelizable (i.e. have a p close to 1), while others are very serial in nature (i.e. have a p close to 0). Regardless of which it is, it’s pretty much always beneficial to start by optimizing the serial performance of the parallelizable work itself, since that’s the part that will benefit from parallelization.
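As a concrete illustration (my own back-of-the-envelope sketch, not something from this thread): with p = 0.9, even infinitely many workers can only shrink the runtime to (1 - 0.9)T = 0.1T, i.e. a maximum speedup of 10x.

# Amdahl's law: runtime and speedup for serial fraction (1 - p) and n workers.
amdahl_time(T, p, n) = (1 - p) * T + (p / n) * T
amdahl_speedup(p, n) = 1 / ((1 - p) + p / n)

amdahl_speedup(0.9, 8)      # ≈ 4.7x with 8 workers
amdahl_speedup(0.9, 10^6)   # ≈ 10x, approaching the 1 / (1 - p) limit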

5 Likes

Parallel and sequential performance do pull in different directions.

  • In sequential code, delay is strongly correlated with the number of operations, so that’s what you minimize. In parallel code, we perform redundant operations to reduce communication/synchronization.
  • In sequential code, you minimize space usage. In parallel code, you use extra space to reduce synchronization. For example, fast sequential code reuses buffers; parallel code doesn’t.
  • In sequential code, it’s common to have a single accumulator modified from a single thread. In parallel code, there is no single accumulator - values are accumulated in trees (see the sketch after this list).
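To make the last point concrete, here is a rough sketch (my own illustration, not taken from the talks linked below) of giving each task its own accumulator and combining the partial results only at the end:

using Base.Threads

# Sum a vector by giving each chunk of the data its own accumulator,
# then combining the partial results, instead of one shared counter.
function chunked_sum(v::Vector{Float64})
    nchunks = nthreads()
    partials = zeros(Float64, nchunks)   # extra space instead of synchronization
    @threads for c in 1:nchunks
        lo = div((c - 1) * length(v), nchunks) + 1
        hi = div(c * length(v), nchunks)
        acc = 0.0                        # task-local accumulator
        for i in lo:hi
            acc += v[i]
        end
        partials[c] = acc
    end
    return sum(partials)                 # final (serial) combine step
end

A proper tree reduction would combine the partials pairwise, but the trade-off is the same: spend a little extra space to avoid synchronizing on a single shared accumulator.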

For more info, see the talks “Organizing Functional Code for Parallel Execution; or, foldl and foldr Considered Slightly Harmful” and “JuliaCon 2016 | A functional approach to High Performance Computing | Erik Schnetter” on YouTube.

2 Likes

Thanks for your feedback!
Wrapping it in a function made a negligible difference, so I didn’t bother, but I will in the future!
I’d read that about @spawn, but it returns the correct answer consistently…
Love your specific solution for the MWE regardless!

I would prefer Threads.@threads for a for loop; it will always be correct unless you have race conditions, and it will likely be cheaper than using @spawn in most cases.

Since @threads is using the same functionality as @spawn internally, I’m not sure I follow how @threads can be faster. Both are using the julia native task machinery.

Also, correctness depends on the code being wrapped - if you share data between iterations, @threads will not save you from race conditions. Both @threads and @spawn have the same core limitations & difficulties, one is just more convenient for throwing on a loop than manually dividing the work.
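For example, this kind of shared state is a race with either macro (a toy sketch of my own, assuming more than one thread is available):

using Base.Threads

# A classic data race: every iteration mutates the same shared variable.
function racy_sum(n)
    total = 0
    @threads for i in 1:n
        total += i   # read-modify-write on shared state, not atomic
    end
    return total     # usually wrong for large n when run with >1 thread
end

racy_sum(1_000_000)  # compare with sum(1:1_000_000) == 500000500000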

I may be wrong but @threads should only spawn (at most) as many threads as you have cores, and reuse them for later iterations? Is this right?

EDIT:

I’m beginning to think the example was wrong anyway:

The @spawn in the original example will just spawn a thread to run the for loop sequentially, right? And worse, it won’t wait for the result, so the benchmark looks fast. I don’t think @threads would be faster, but it would be correct. I was under the impression that a thread would be spawned for each part of the loop, which I thought would be slower for a large number of iterations than @threads, but with @spawn we also need to manually wait for each thread.

Minor nitpick: neither @spawn nor @threads is spawning new threads (in the OS sense of the word). Both are creating julia native Task objects that are scheduled to run on OS threads by the julia runtime. It has been said that calling it @threads instead of @tasks was a mistake, but there’s nothing we can do about it now, since it’s supported API and changing it would be breaking.

Generally yes, but that’s something you can do with @spawn as well if you want, e.g. by querying Threads.nthreads() as the upper limit for the loop. Like I said, both use the same mechanism under the hood. One is just more manual, but may be more beneficial for nontrivial parallelization tasks that don’t neatly fit into a loop.

In the OP’s case, it looks like they just spawned a single task in which the whole for loop runs, without actually waiting on the result, which is why the function “just” takes nanoseconds to run in the first place. It should have been written like

function Method2(y::Array{Float64,3}, x::Vector{Int64})
    #Preallocate Output
    output = Matrix{Float64}(undef, length(x), size(y,3))
    
    # A single task still runs the whole loop - not parallel, but at least we wait for it.
    task = @spawn @fastmath @inbounds for i=1:size(y,3)
        output[:,i] = @view(y[:,:,i]) * x
    end
    wait(task)  # block until the task has finished before returning
    return output
end

which should have about the same runtime as the first version, since it isn’t parallelized at all (the @fastmath probably doesn’t do a whole lot either - its usefulness is greatly overstated). You can of course wrap the for i=1:size(y,3) in a task-spawning loop to divide the work up and parallelize that way, but that’s more or less exactly what @threads is doing internally; one is not going to be faster than the other, and both are equal in terms of possible race conditions. A rough sketch of that manual approach follows below.
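For completeness, the manual task-spawning version could look roughly like this (a sketch along the lines of what @threads does internally; the chunking scheme and the name method2_spawned are my own):

using Base.Threads
using LinearAlgebra

function method2_spawned(y::Array{Float64,3}, x::Vector{Int64})
    output = Matrix{Float64}(undef, length(x), size(y, 3))
    n = size(y, 3)
    nchunks = min(nthreads(), n)
    tasks = map(1:nchunks) do c
        lo = div((c - 1) * n, nchunks) + 1
        hi = div(c * n, nchunks)
        # One task per chunk; every task writes to a disjoint set of columns.
        @spawn for i in lo:hi
            @views mul!(output[:, i], y[:, :, i], x)
        end
    end
    foreach(wait, tasks)   # unlike the original example, wait for all tasks
    return output
end

For only 10 small slices the scheduling overhead will likely dominate, so don’t expect this to beat the serial version at these array sizes.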

3 Likes