Poor parallel speedup

I want to understand why some code exhibits fairly poor parallel speedup.
In simplified form, I am just computing a discrete Laplacian over a 3D grid:

# U is expected to carry one layer of ghost cells on each face
# (indices 0:N in every dimension); AU holds only the interior points.
function discrete_laplacian!(AU::Array{T,3}, U::OffsetArray{T,3}) where T
    Nx, Ny, Nz = lastindex(U, 1), lastindex(U, 2), lastindex(U, 3)
    cx, cy, cz = Nx^2, Ny^2, Nz^2    # 1/h^2 with h = 1/N in each direction
    Threads.@threads for k = 1:Nz-1
        for j = 1:Ny-1
            for i = 1:Nx-1
                @inbounds AU[i,j,k] = (
                          cx * ( -U[i+1,j,k] + 2U[i,j,k] - U[i-1,j,k] )
                        + cy * ( -U[i,j+1,k] + 2U[i,j,k] - U[i,j-1,k] )
                        + cz * ( -U[i,j,k+1] + 2U[i,j,k] - U[i,j,k-1] ) )
            end
        end
    end
end
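
For reference, a minimal harness for calling and timing this looks roughly like the following (assuming Float64 data and one ghost cell on every face of U):

using OffsetArrays

Nx = Ny = Nz = 800
U  = OffsetArray(rand(Nx + 1, Ny + 1, Nz + 1), 0:Nx, 0:Ny, 0:Nz)  # ghost layer: indices 0:N
AU = zeros(Nx - 1, Ny - 1, Nz - 1)                                 # interior points only

discrete_laplacian!(AU, U)          # warm-up (compilation)
@time discrete_laplacian!(AU, U)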

For comparison, I wrote a Fortran OpenMP version of the same function, and
another Julia version that uses ParallelStencil.FiniteDifferences3D (PS in the tables below).

@parallel function PS_discrete_laplacian!(AU, U, cx, cy, cz)
    @inn(AU) = cx * @d2_xi(U) + cy * @d2_yi(U) + cz * @d2_zi(U)
    return
end
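
The kernel needs the usual ParallelStencil initialization before it is defined and is launched with @parallel; a minimal setup looks roughly like this (Threads backend for the CPU runs, AMDGPU for the GPU run):

using ParallelStencil
using ParallelStencil.FiniteDifferences3D
@init_parallel_stencil(Threads, Float64, 3)   # must run before the kernel definition

Nx = Ny = Nz = 800
U  = @rand(Nx, Ny, Nz)
AU = @zeros(Nx, Ny, Nz)
@parallel PS_discrete_laplacian!(AU, U, Nx^2, Ny^2, Nz^2)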

In all cases below, Nx = Ny = Nz = 800.

On a Core i9-12900, I obtained the following runtimes (in seconds).

threads    Julia       Fortran    PS
   1       0.457729    1.111      2.040
   2       0.351813    0.651      1.080
   4       0.324158    0.387      0.559
   8       0.286392    0.284      0.349

(The Fortran version was compiled with gfortran -Wall -O3 -march=native.)
Thus, with a single thread Julia is significantly faster than Fortran, which
in turn is significantly faster than PS. However, the Fortran and PS versions
scale much better: going from 1 to 8 threads gives roughly 3.9x and 5.8x speedups
respectively, versus only about 1.6x for the plain Julia version.

Here are corresponding results on a Ryzen 7 9700X.

threads    Julia       Fortran    PS
   1       0.392443    1.011      1.507
   2       0.335488    0.590      0.785
   4       0.350831    0.418      0.422
   8       0.380565    0.388      0.365

Julia is still faster when using 1 thread, but the parallel speedup is
worse for all versions.

On my GPU (Radeon RX 7700 XT), the PS version took only 0.044 seconds, so
that is the clear winner. Still, what explains the poor speedup on the CPU?

Given that both Julia and Fortran are asymptoting to pretty much the same runtime, my lazy guess is that you're hitting your system's memory bandwidth. It's a relatively cheap computation over a lot of data (800 × 800 × 800 elements at 4 or 8 bytes each is roughly 2 or 4 GB per array); the cost of shuffling that data around is likely to be the limiting factor.
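
A quick back-of-the-envelope check (assuming Float64 and counting only one full read of U plus one full write of AU per call, so the real traffic is somewhat higher):

N = 800
bytes = 2 * N^3 * sizeof(Float64)   # ≈ 8.2 GB moved per call (one read + one write)
t = 0.286                           # best 8-thread time on the i9-12900, in seconds
bytes / t / 1e9                     # ≈ 29 GB/s effective bandwidth

That is already a sizeable fraction of what a dual-channel desktop memory system can sustain, and the true traffic is larger once you count the write-allocate reads of AU and the cache misses on the k±1 planes, so adding more threads stops helping fairly quickly.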

This is also where GPUs excel.

Thanks. That makes sense.