Poor parallel speedup

I want to understand why some code exhibits fairly poor parallel speedup.
In simplified form, I am just computing a discrete Laplacian over a 3D grid:

# U is expected to carry one layer of ghost cells on each face
# (indices 0:N in every dimension); AU holds only the interior points.
function discrete_laplacian!(AU::Array{T,3}, U::OffsetArray{T,3}) where T
    Nx, Ny, Nz = lastindex(U, 1), lastindex(U, 2), lastindex(U, 3)
    cx, cy, cz = Nx^2, Ny^2, Nz^2    # 1/h^2 with h = 1/N in each direction
    Threads.@threads for k = 1:Nz-1
        for j = 1:Ny-1
            for i = 1:Nx-1
                @inbounds AU[i,j,k] = (
                          cx * ( -U[i+1,j,k] + 2U[i,j,k] - U[i-1,j,k] )
                        + cy * ( -U[i,j+1,k] + 2U[i,j,k] - U[i,j-1,k] )
                        + cz * ( -U[i,j,k+1] + 2U[i,j,k] - U[i,j,k-1] ) )
            end
        end
    end
end
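
For reference, a minimal harness for calling and timing this looks roughly like the following (assuming Float64 data and one ghost cell on every face of U):

using OffsetArrays

Nx = Ny = Nz = 800
U  = OffsetArray(rand(Nx + 1, Ny + 1, Nz + 1), 0:Nx, 0:Ny, 0:Nz)  # ghost layer: indices 0:N
AU = zeros(Nx - 1, Ny - 1, Nz - 1)                                 # interior points only

discrete_laplacian!(AU, U)          # warm-up (compilation)
@time discrete_laplacian!(AU, U)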

For comparison, I wrote a Fortran OpenMP version of the same function, and
another Julia version that uses ParallelStencil.FiniteDifferences3D (PS in the tables below).

@parallel function PS_discrete_laplacian!(AU, U, cx, cy, cz)
    @inn(AU) = cx * @d2_xi(U) + cy * @d2_yi(U) + cz * @d2_zi(U)
    return
end
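
The kernel needs the usual ParallelStencil initialization before it is defined and is launched with @parallel; a minimal setup looks roughly like this (Threads backend for the CPU runs, AMDGPU for the GPU run):

using ParallelStencil
using ParallelStencil.FiniteDifferences3D
@init_parallel_stencil(Threads, Float64, 3)   # must run before the kernel definition

Nx = Ny = Nz = 800
U  = @rand(Nx, Ny, Nz)
AU = @zeros(Nx, Ny, Nz)
@parallel PS_discrete_laplacian!(AU, U, Nx^2, Ny^2, Nz^2)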

In all cases below, Nx = Ny = Nz = 800.

On a Core i9-12900, I obtained the following runtimes (in seconds).

threads    Julia       Fortran    PS
   1       0.457729    1.111      2.040
   2       0.351813    0.651      1.080
   4       0.324158    0.387      0.559
   8       0.286392    0.284      0.349

(The Fortran version was compiled with gfortran -Wall -O3 -march=native.)
Thus, with a single thread Julia is significantly faster than Fortran, which
in turn is significantly faster than PS. However, the Fortran and PS versions
scale much better: going from 1 to 8 threads gives roughly 3.9x and 5.8x speedups
respectively, versus only about 1.6x for the plain Julia version.

Here are corresponding results on a Ryzen 7 9700X.

threads    Julia       Fortran    PS
   1       0.392443    1.011      1.507
   2       0.335488    0.590      0.785
   4       0.350831    0.418      0.422
   8       0.380565    0.388      0.365

Julia is still faster when using 1 thread, but the parallel speedup is
worse for all versions.

On my GPU (Radeon RX 7700 XT), the PS version took only 0.044 seconds, so
that is the clear winner. Still, what explains the poor speedup on the CPU?

Given that both Julia and Fortran are asymptoting to pretty much the same runtime, my lazy guess is that you're hitting your system's memory bandwidth. It's a relatively cheap computation over a lot of data (800 × 800 × 800 elements at 4 or 8 bytes each is roughly 2 or 4 GB per array); the cost of shuffling that data around is likely to be the limiting factor.
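
A quick back-of-the-envelope check (assuming Float64 and counting only one full read of U plus one full write of AU per call, so the real traffic is somewhat higher):

N = 800
bytes = 2 * N^3 * sizeof(Float64)   # ≈ 8.2 GB moved per call (one read + one write)
t = 0.286                           # best 8-thread time on the i9-12900, in seconds
bytes / t / 1e9                     # ≈ 29 GB/s effective bandwidth

That is already a sizeable fraction of what a dual-channel desktop memory system can sustain, and the true traffic is larger once you count the write-allocate reads of AU and the cache misses on the k±1 planes, so adding more threads stops helping fairly quickly.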

This is also where GPUs excel.

Thanks. That makes sense.