Help Optimizing Slow Distributed Array Computation

I’m in the process of converting some old C code to Julia, but I’m having trouble getting the performance anywhere close. The specific function that takes most of the time, in the Julia code at least, is the following:

@everywhere const np = nprocs()
@everywhere using DistributedArrays
@everywhere using DistributedArrays.SPMD
@everywhere const nkmax = 21
@everywhere const N = 150

@everywhere function slow_computation(b,dk,dr)
    pids = procs(dr)[:,1]   # pids across which dr is distributed
    pstart = minimum(pids)  # lowest pid writes the results into b

    for j1 = 1:nkmax, j2 = 1:j1, j3 = (j1-j2):j2
        if j3 > 0
            # d[:L] indexes the process-local part of a DArray
            @inbounds dr[:L] = dk[:L][:,:,:,j1] .*
                               dk[:L][:,:,:,j2] .*
                               dk[:L][:,:,:,j3]
            barrier(;pids=pids)
            if myid() == pstart
                b[j1,j2,j3] = sum(dr)/N^3
            end
            barrier(;pids=pids)
        end
    end
end

function overall_computation()

    pids = procs()  # all process ids; the arrays are distributed over these
    b = Array{Float64,3}(undef,nkmax,nkmax,nkmax)

    dk = dzeros((N,N,N,nkmax), pids, [np,1,1,1])
    dr = dzeros((N,N,N), pids, [np,1,1])

    # Do some stuff to compute values of dk and dr
    
    spmd(slow_computation, b, dk, dr; pids=pids)

end

When I profile the code, it seems to spend the overwhelming majority of its time in either getindex or setindex!.
Nothing I’ve tried has been faster than this version. Below is a graph of N vs. time for the whole computation, which is dominated by this function:

[plot: runtime vs. N for the Julia and C versions]
I think this is the kind of function where Julia should be comparable to C, so I’m seeking help from people more knowledgeable about Julia to find whatever poor performance choice I made.
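
For anyone who wants to reproduce the profile, the standard Profile workflow is all it takes. A sketch (not my exact invocation; overall_computation as above):

using Profile

overall_computation()           # warm up so compilation isn't profiled
Profile.clear()
@profile overall_computation()
Profile.print(format=:flat, sortedby=:count)   # hottest frames appear last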

You can set up https://github.com/JuliaParallel/ClusterManagers.jl to use MPI as its backend:

https://github.com/JuliaParallel/MPI.jl#usage--mpi-and-julia-parallel-constructs-together

I am curious how that changes the timings.
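
The basic pattern from that README looks roughly like this (an untested sketch; np=4 is just an example worker count):

using Distributed
using MPI

# Launch 4 Julia workers under mpiexec; they join the session as
# ordinary Distributed workers that can also talk MPI among themselves.
manager = MPI.MPIManager(np=4)
addprocs(manager)

# From here the DistributedArrays/SPMD code should run unchanged.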

The most alarming thing here is not the performance at large domain lengths but at small ones: by eye, the Julia-to-C ratio is nearly infinite. You should optimize that first, even single-threaded, before worrying about parallelism. There should not be any reason for C to have such an advantage.

It’s unclear from your example what kind of objects most of your arguments are, so it’s not really possible to offer much further advice. (Sorry, I was misled by the scrollbars, I did not notice overall_computation. Apologies!)

The objects are relatively simple for the purposes of this computation. They are simply arrays of Float64 with relatively large sizes; real computations generally have N>1024 and nkmax around 100.

I agree that the issue is likely not the parallelism, as other parts of the problem that are parallel have roughly comparable performance to C. I suspect it is the structure of the computation each process performs that is not well suited to Julia.

Are there other aspects of the input data that would be helpful to know?

My DistributedArrays is a bit rusty in general, and I haven’t tried DistributedArrays.SPMD, so I don’t know how much of the following advice applies. FWIW, I made three serial versions of your code, and the differences in the timings between them (shown below) are quite stark. Assuming the below applies to your code (i.e. assuming DistributedArrays isn’t somehow doing something smarter than what would happen with regular Julia arrays), the lesson is that array slices create copies, so use views, or write even more nested loops to stick to scalar indexing. I ran the following in Julia 1.0.1:

using BenchmarkTools
nkmax = 21
N = 150

# Version 1: the RHS slices allocate copies, and `dr = ...` rebinds the
# local variable to a freshly allocated array every iteration.
function computation1!(b::Array{T,3},dk::Array{T,4},dr::Array{T,3}) where T
    nkmax = size(dk,4)
    for j1 = 1:nkmax, j2 = 1:j1, j3 = (j1-j2):j2
        if j3 > 0
            @inbounds dr = dk[:,:,:,j1] .*
                           dk[:,:,:,j2] .*
                           dk[:,:,:,j3]
            b[j1,j2,j3] = sum(dr)/N^3
        end
    end
    return b
end

# Version 2: `.=` writes into dr in place, but the RHS slices still copy.
function computation2!(b::Array{T,3},dk::Array{T,4},dr::Array{T,3}) where T
    nkmax = size(dk,4)
    for j1 = 1:nkmax, j2 = 1:j1, j3 = (j1-j2):j2
        if j3 > 0
            @inbounds dr .= dk[:,:,:,j1] .*
                            dk[:,:,:,j2] .*
                            dk[:,:,:,j3]
            b[j1,j2,j3] = sum(dr)/N^3
        end
    end
    return b
end

# Version 3: @views turns the slices into non-allocating views.
function computation3!(b::Array{T,3},dk::Array{T,4},dr::Array{T,3}) where T
    nkmax = size(dk,4)
    for j1 = 1:nkmax, j2 = 1:j1, j3 = (j1-j2):j2
        if j3 > 0
            @inbounds dr .= @views dk[:,:,:,j1] .*
                                   dk[:,:,:,j2] .*
                                   dk[:,:,:,j3]
            b[j1,j2,j3] = sum(dr)/N^3
        end
    end
    return b
end

dk = randn(N,N,N,nkmax)
dr = Array{Float64}(undef, N,N,N)
b  = Array{Float64,3}(undef,nkmax,nkmax,nkmax)

@btime computation1!($b,$dk,$dr);
@btime computation2!($b,$dk,$dr);
@btime computation3!($b,$dk,$dr);

and got the following results

julia> include("distArrays-noDist.jl");
  82.032 s (12672 allocations: 106.22 GiB)
  23.959 s (10560 allocations: 79.66 GiB)
  8.991 s (7392 allocations: 264.00 KiB)

Version 2 uses in-place assignment (.=) into dr, which avoids allocating a new array and rebinding dr every time through the loop. However, the array slicing still creates copies in this version. In version 3, views are created for the slices instead of copies. There’s still some overhead involved in creating the views, which is what causes the remaining allocations, but you can see the total memory allocated is waaay smaller than in the other two versions. Check out the views section of the Performance Tips in the manual for more info, and also this recent discourse thread, which is good reading on the subtleties involved. I’m not sure you can get down to zero allocations without just manually coding more nested loops; see the sketch below.
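
If you do want zero allocations, here is what a hand-rolled version might look like (call it computation4!; a sketch I haven’t benchmarked, so take the idea rather than the exact code). It fuses the product and the sum into a single pass using linear indexing, since the first three dimensions of dk are contiguous in memory:

# Version 4 (sketch): manual loop with scalar indexing, zero allocations.
function computation4!(b::Array{T,3},dk::Array{T,4},dr::Array{T,3}) where T
    nkmax = size(dk,4)
    n = length(dr)                    # = N^3, elements per dk slice
    for j1 = 1:nkmax, j2 = 1:j1, j3 = (j1-j2):j2
        if j3 > 0
            s = zero(T)
            # slice j occupies the linear index range (j-1)*n+1 : j*n
            @inbounds @simd for i = 1:n
                v = dk[(j1-1)*n + i] * dk[(j2-1)*n + i] * dk[(j3-1)*n + i]
                dr[i] = v
                s += v
            end
            b[j1,j2,j3] = s / n
        end
    end
    return b
end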

I apologize if you already know all this and this is just a DistributedArrays thing. I didn’t have a chance to test if DistributedArray slices create copies like normal arrays.

I appreciate the assistance.
I originally tried these same things, but once I discovered that they did not play nice with the distributed arrays’ special indexing (i.e. @views causes an error when used on an array indexed like dk[:L], as does .=), I focused my time on other aspects. Seeing how dramatic the performance gain was, though, I spent some time looking into it, and the following is significantly faster.
For future readers, I replaced

@inbounds dr[:L] = dk[:L][:,:,:,j1] .*
                   dk[:L][:,:,:,j2] .*
                   dk[:L][:,:,:,j3]

with

@inbounds broadcast!(*, dr[:L], view(dk[:L], :,:,:,j1), 
    view(dk[:L], :,:,:,j2), view(dk[:L], :,:,:,j3))

and the quickest benchmark I could spin up went from

BenchmarkTools.Trial: 
  memory estimate:  55.61 GiB
  allocs estimate:  686869
  --------------
  minimum time:     53.092 s (16.74% GC)
  median time:      53.584 s (16.99% GC)
  mean time:        53.761 s (16.91% GC)
  maximum time:     54.783 s (16.94% GC)

to

BenchmarkTools.Trial: 
  memory estimate:  2.50 GiB
  allocs estimate:  678629
  --------------
  minimum time:     14.487 s (5.03% GC)
  median time:      15.042 s (4.82% GC)
  mean time:        15.067 s (4.85% GC)
  maximum time:     16.141 s (5.09% GC)

Note that this is for the entire computation and not just the function we optimized, but it’s still a significant gain.
If I go ahead and apply these same optimizations to the other parts of the code (which might now be dominant; I’d have to check), I get

BenchmarkTools.Trial: 
  memory estimate:  2.24 GiB
  allocs estimate:  682227
  --------------
  minimum time:     13.484 s (2.82% GC)
  median time:      13.624 s (2.84% GC)
  mean time:        13.674 s (2.84% GC)
  maximum time:     14.007 s (2.67% GC)

I really appreciate all the help, even though it ended up being standard optimizations that just had to be adapted for DistributedArrays.
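
One further tweak I might still try (untested; slow_computation2 is just a hypothetical name): hoisting the repeated dr[:L] and dk[:L] localpart lookups out of the triple loop, since each [:L] call goes through the DArray indexing machinery:

@everywhere function slow_computation2(b,dk,dr)
    pids = procs(dr)[:,1]
    pstart = minimum(pids)
    drl = dr[:L]   # look up the localparts once, outside the loop
    dkl = dk[:L]
    for j1 = 1:nkmax, j2 = 1:j1, j3 = (j1-j2):j2
        if j3 > 0
            @inbounds broadcast!(*, drl, view(dkl, :,:,:,j1),
                view(dkl, :,:,:,j2), view(dkl, :,:,:,j3))
            barrier(;pids=pids)
            if myid() == pstart
                b[j1,j2,j3] = sum(dr)/N^3
            end
            barrier(;pids=pids)
        end
    end
end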

@ChrisRackauckas
I’m still curious about replacing the backend with MPI to see the performance difference, but I don’t really understand the MPI.jl documentation on MPIManager.

Is there some straightforward way to replace the backend that I’m not seeing?

@vchuravy would know more.