# I cannot get close to full speed up in parallel

I have a code that is embarrassingly parallel, but I cannot get full its speed up.
The dummy code is here:

``````function points_on_sphere_surface(num_pts=100, R=2)
indices = 1:num_pts .+ 0.5

ϕ = acos.(1 .- 2*indices/num_pts)
θ = π*(1 + sqrt(5)).*indices

x, y, z = R.*cos.(θ).*sin.(ϕ), R.*sin.(θ).*sin.(ϕ), R.*cos.(ϕ)
return [x y z]
end

function original_foo(sphere, r, β)
N = length(β)
nx = sphere
ny = sphere
nz = sphere
intensity = zeros(N, N)
for n=1:N
for m=1:N
rx = r[1,n] - r[1, m]
ry = r[2,n] - r[2, m]
rz = r[3,n] - r[3, m]

dot_n_r = nx*rx + ny*ry + nz*rz
intensity[n,m] = real(conj(β[n])*β[m]*exp(im*dot_n_r)) # I need to define a matrix
end
end
return sum(intensity)
end

N = 500
r = rand(3, N)
β = rand(ComplexF64, N)
sphere = points_on_sphere_surface(200, 10)

time_sequencial = @elapsed begin
x = zeros(200)
for i = 1:200
point_in_sphere = sphere[i,:]
x[i] = original_foo(point_in_sphere, r, β)
end
end

time_parallel = @elapsed begin
x = []
for i = 1:200
point_in_sphere = sphere[i,:]
end
y = fetch.(x);
end

speedup = time_sequencial/time_parallel
``````

However, my best case scenario I have around 4 times speed up.

Is there any limitation in my code that sets this upper boundary ?

The main problem here is that with 4 cores, 4x speedup is all you should expect.

3 Likes

So what’s happening here is that `sleep` suspends the thread if you are `@spawn`ing things, so when you have the version with `sleep`, the only thing that is happening that’s more “parallel” is that when 1 thread is sleeping, another is running the computation. The way computers have more threads than cores is a technique called hyper-threading (with AMD processors it’s called SMT). What this does is allows 1 core to operate 2 programs at once, but it doesn’t actually add extra capability per core. That means that if your program is making full use of the CPU, hyperthreading provides essentially no speedup in most cases.

3 Likes

Indeed, the` sleep` function led me to misleading conclusiosn. I just had to change `sleep(0.4)` for `inv(rand(1800,1800))` to see that the real speed up was around 2 times.

Note that you are allocating an array just to sum it. Much better to simply accumulate the sum directly and never allocate the array. (If in your real application you need the `intensity` matrix, note that your `for n=1:N; for m=1:N` loops are in the wrong order for spatial locality.)

(The code would probably be somewhat cleaner, and at least as fast, if you used StaticArrays.jl for storing and computing with 3-component coordinate vectors — this is what StaticArrays are perfect for.)

PS. There is a function `cis(φ)` that more efficiently computes `exp(im*φ)`.

5 Likes