I cannot get close to full speed up in parallel

I have a code that is embarrassingly parallel, but I cannot get full its speed up.
The dummy code is here:

function points_on_sphere_surface(num_pts=100, R=2) 
    indices = 1:num_pts .+ 0.5

    ϕ = acos.(1 .- 2*indices/num_pts)
    θ = π*(1 + sqrt(5)).*indices

    x, y, z = R.*cos.(θ).*sin.(ϕ), R.*sin.(θ).*sin.(ϕ), R.*cos.(ϕ)
    return [x y z]

function original_foo(sphere, r, β)
    N = length(β)
    nx = sphere[1]
    ny = sphere[2]
    nz = sphere[3]
    intensity = zeros(N, N)
    for n=1:N
         for m=1:N 
            rx = r[1,n] - r[1, m]
            ry = r[2,n] - r[2, m]
            rz = r[3,n] - r[3, m]
            dot_n_r = nx*rx + ny*ry + nz*rz
            intensity[n,m] = real(conj(β[n])*β[m]*exp(im*dot_n_r)) # I need to define a matrix 
    return sum(intensity)

N = 500
r = rand(3, N)
β = rand(ComplexF64, N)
sphere = points_on_sphere_surface(200, 10)

time_sequencial = @elapsed begin
    x = zeros(200)
    for i = 1:200
        point_in_sphere = sphere[i,:]
        x[i] = original_foo(point_in_sphere, r, β)

time_parallel = @elapsed begin
    x = []
    for i = 1:200
        point_in_sphere = sphere[i,:]
        push!(x, Threads.@spawn original_foo(point_in_sphere, r, β))
    y = fetch.(x);

speedup = time_sequencial/time_parallel

My notebook has 4cores/8threads. and I initiate Julia with 8 threads.
However, my best case scenario I have around 4 times speed up.

Is there any limitation in my code that sets this upper boundary ?

The main problem here is that with 4 cores, 4x speedup is all you should expect.


So what’s happening here is that sleep suspends the thread if you are @spawning things, so when you have the version with sleep, the only thing that is happening that’s more “parallel” is that when 1 thread is sleeping, another is running the computation. The way computers have more threads than cores is a technique called hyper-threading (with AMD processors it’s called SMT). What this does is allows 1 core to operate 2 programs at once, but it doesn’t actually add extra capability per core. That means that if your program is making full use of the CPU, hyperthreading provides essentially no speedup in most cases.


Indeed, the sleep function led me to misleading conclusiosn. I just had to change sleep(0.4) for inv(rand(1800,1800)) to see that the real speed up was around 2 times.

Note that you are allocating an array just to sum it. Much better to simply accumulate the sum directly and never allocate the array. (If in your real application you need the intensity matrix, note that your for n=1:N; for m=1:N loops are in the wrong order for spatial locality.)

(The code would probably be somewhat cleaner, and at least as fast, if you used StaticArrays.jl for storing and computing with 3-component coordinate vectors — this is what StaticArrays are perfect for.)

PS. There is a function cis(φ) that more efficiently computes exp(im*φ).