Help Optimizing Slow Distributed Array Computation

My DistributedArrays is a bit rusty in general and I haven’t tried DistributedArrays.SPMD, so I don’t know how much of the following advice applies, but FWIW, I made three serial versions of your code and the differences in their timings (shown below) are quite stark. Assuming the same thing is happening in your code (i.e. assuming DistributedArrays isn’t somehow doing something smarter than regular Julia arrays would), the lesson is that array slices create copies, so use views or write even more deeply nested loops and stick to scalar indexing. A quick way to see the copy-versus-view difference is the check right below; the three versions follow after it. I ran everything in Julia 1.0.1.
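
Here is a tiny standalone check of the core claim (my own addition, not part of the timings further down): indexing with a slice materializes a copy, while a view only wraps the parent array.

using BenchmarkTools

a = randn(150, 150);

@btime $a[:, 1];        # the slice allocates a fresh 150-element Array (a copy)
@btime view($a, :, 1);  # the view only allocates a small, size-independent wrapper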

using BenchmarkTools
nkmax = 21
N = 150

# Version 1: each slice dk[:,:,:,j] allocates a full copy, and dr = ... rebinds the
# local name to yet another freshly allocated array, so the dr argument is never reused.
function computation1!(b::Array{T,3},dk::Array{T,4},dr::Array{T,3}) where T
    nkmax = size(dk,4)
    for j1 = 1:nkmax, j2 = 1:j1, j3 = (j1-j2):j2
        if j3 > 0
            @inbounds dr = dk[:,:,:,j1] .*
                           dk[:,:,:,j2] .*
                           dk[:,:,:,j3]
            b[j1,j2,j3] = sum(dr)/N^3
        end
    end
    return b
end

# Version 2: .= broadcasts in place into dr, so dr itself is reused, but the three
# slices on the right-hand side still allocate copies every iteration.
function computation2!(b::Array{T,3},dk::Array{T,4},dr::Array{T,3}) where T
    nkmax = size(dk,4)
    for j1 = 1:nkmax, j2 = 1:j1, j3 = (j1-j2):j2
        if j3 > 0
            @inbounds dr .= dk[:,:,:,j1] .*
                            dk[:,:,:,j2] .*
                            dk[:,:,:,j3]
            b[j1,j2,j3] = sum(dr)/N^3
        end
    end
    return b
end

# Version 3: @views turns the slices into views, so only the small view wrappers
# are allocated and no data is copied.
function computation3!(b::Array{T,3},dk::Array{T,4},dr::Array{T,3}) where T
    nkmax = size(dk,4)
    for j1 = 1:nkmax, j2 = 1:j1, j3 = (j1-j2):j2
        if j3 > 0
            @inbounds dr .= @views dk[:,:,:,j1] .*
                                   dk[:,:,:,j2] .*
                                   dk[:,:,:,j3]
            b[j1,j2,j3] = sum(dr)/N^3
        end
    end
    return b
end

dk = randn(N,N,N,nkmax)
dr = Array{Float64}(undef, N,N,N)
b  = Array{Float64,3}(undef,nkmax,nkmax,nkmax)

@btime computation1!($b,$dk,$dr);
@btime computation2!($b,$dk,$dr);
@btime computation3!($b,$dk,$dr);

and got the following results

julia> include("distArrays-noDist.jl");
  82.032 s (12672 allocations: 106.22 GiB)
  23.959 s (10560 allocations: 79.66 GiB)
  8.991 s (7392 allocations: 264.00 KiB)

Version 2 uses in-place assignment (.=) into dr to avoid allocating a new array and rebinding dr every time through the loop, but the array slicing on the right-hand side is still creating copies. In version 3 views are created for the slices instead of copies. There is still some overhead in constructing the views, which is where the remaining allocations come from, but the total memory allocated (264 KiB versus roughly 80–106 GiB) is way smaller than in the other two versions. Check out the views section of the performance tips in the manual for more info, and also this recent discourse thread. I’m not sure you can get all the way to zero allocations without manually coding more nested loops (a sketch of what that could look like is just below); that thread is good reading on the subtleties involved there.
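
For what it’s worth, here is a minimal sketch (my own, untested and not part of the timings above) of that fully nested-loop variant. Accumulating the triple product into a scalar means dr isn’t needed at all, so apart from loop bookkeeping the hot path shouldn’t allocate. Note the signature drops dr, and the scalar accumulation sums in a different order than sum(dr), so the floating-point result can differ in the last digits.

# Version 4 (sketch): pure scalar indexing, no temporary dr at all.
function computation4!(b::Array{T,3}, dk::Array{T,4}) where T
    nkmax = size(dk,4)
    ncube = size(dk,1) * size(dk,2) * size(dk,3)   # same as N^3 above
    for j1 = 1:nkmax, j2 = 1:j1, j3 = (j1-j2):j2
        if j3 > 0
            s = zero(T)
            # innermost loop runs over the first dimension for column-major locality
            @inbounds for l = 1:size(dk,3), k = 1:size(dk,2), i = 1:size(dk,1)
                s += dk[i,k,l,j1] * dk[i,k,l,j2] * dk[i,k,l,j3]
            end
            b[j1,j2,j3] = s / ncube
        end
    end
    return b
end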

I apologize if you already know all this and it really is just a DistributedArrays thing. I didn’t have a chance to test whether DistributedArray slices create copies the way regular Array slices do, but below is a rough sketch of how one might check.
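
This is untested and assumes the DistributedArrays.jl API (distribute, plus ordinary indexing and view on the resulting DArray) behaves as documented, so treat it as a starting point rather than something I have verified.

using Distributed
addprocs(2)
@everywhere using DistributedArrays
using BenchmarkTools

d = distribute(randn(50, 50, 50, 4))   # small 4-D test array with the same layout as dk

# If this reports allocations on the order of 50^3 * 8 bytes, the slice is being
# copied into a local Array, just like with regular Arrays.
@btime $d[:, :, :, 1];

# A view should only allocate a small wrapper, though reading through it may
# still involve fetching remote data.
@btime view($d, :, :, :, 1);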
