I cannot get close to full speed up in parallel

Noel_Araujo · January 21, 2021, 11:34pm

I have a code that is embarrassingly parallel, but I cannot get full its speed up.
The dummy code is here:

function points_on_sphere_surface(num_pts=100, R=2) 
    indices = 1:num_pts .+ 0.5

    ϕ = acos.(1 .- 2*indices/num_pts)
    θ = π*(1 + sqrt(5)).*indices

    x, y, z = R.*cos.(θ).*sin.(ϕ), R.*sin.(θ).*sin.(ϕ), R.*cos.(ϕ)
    return [x y z]
end


function original_foo(sphere, r, β)
    N = length(β)
    nx = sphere[1]
    ny = sphere[2]
    nz = sphere[3]
    intensity = zeros(N, N)
    for n=1:N
         for m=1:N 
            rx = r[1,n] - r[1, m]
            ry = r[2,n] - r[2, m]
            rz = r[3,n] - r[3, m]
    
            dot_n_r = nx*rx + ny*ry + nz*rz
            intensity[n,m] = real(conj(β[n])*β[m]*exp(im*dot_n_r)) # I need to define a matrix 
        end
    end
    return sum(intensity)
end

N = 500
r = rand(3, N)
β = rand(ComplexF64, N)
sphere = points_on_sphere_surface(200, 10)

time_sequencial = @elapsed begin
    x = zeros(200)
    for i = 1:200
        point_in_sphere = sphere[i,:]
        x[i] = original_foo(point_in_sphere, r, β)
    end
end


time_parallel = @elapsed begin
    x = []
    for i = 1:200
        point_in_sphere = sphere[i,:]
        push!(x, Threads.@spawn original_foo(point_in_sphere, r, β))
    end
    y = fetch.(x);
end


speedup = time_sequencial/time_parallel

My notebook has 4cores/8threads. and I initiate Julia with 8 threads.
However, my best case scenario I have around 4 times speed up.

Is there any limitation in my code that sets this upper boundary ?

Oscar_Smith · January 22, 2021, 12:47am

The main problem here is that with 4 cores, 4x speedup is all you should expect.

Oscar_Smith · January 22, 2021, 2:11am

So what’s happening here is that sleep suspends the thread if you are @spawning things, so when you have the version with sleep, the only thing that is happening that’s more “parallel” is that when 1 thread is sleeping, another is running the computation. The way computers have more threads than cores is a technique called hyper-threading (with AMD processors it’s called SMT). What this does is allows 1 core to operate 2 programs at once, but it doesn’t actually add extra capability per core. That means that if your program is making full use of the CPU, hyperthreading provides essentially no speedup in most cases.

Noel_Araujo · January 22, 2021, 2:49am

Indeed, the sleep function led me to misleading conclusiosn. I just had to change sleep(0.4) for inv(rand(1800,1800)) to see that the real speed up was around 2 times.

stevengj · January 22, 2021, 3:37am

Note that you are allocating an array just to sum it. Much better to simply accumulate the sum directly and never allocate the array. (If in your real application you need the intensity matrix, note that your for n=1:N; for m=1:N loops are in the wrong order for spatial locality.)

(The code would probably be somewhat cleaner, and at least as fast, if you used StaticArrays.jl for storing and computing with 3-component coordinate vectors — this is what StaticArrays are perfect for.)

PS. There is a function cis(φ) that more efficiently computes exp(im*φ).

Topic		Replies	Views
Embarrassingly parallel multi-threading doesn't scale Performance multithreading	17	1619	October 16, 2021
Trouble Speeding up Vectorized Functions with Parallelism Performance	12	1362	May 28, 2018
Decrease in performance using Threads.@threads in Linux Julia at Scale	16	1992	July 23, 2019
Why doesn't multithreading help here? Performance	12	1417	August 22, 2020
Poor parallel speedup Performance multithreading	2	227	January 17, 2025

I cannot get close to full speed up in parallel

Related topics