Improving perfomance on nested loops

I am trying to improve the performance of script that I am converting from python. To my knowledge, I have: written a kernel function, preallocated my output and accessed memory in order.

While I have improved the scripts performance, the overall it is still noticeably* slower (by 2x ) than the python script.

Here is the MWE,

using ImageFiltering, OffsetArrays,StatsBase
function PA13!(A, Aout, winarray, ndays, latsize, lonsize)
    for d in 1:ndays
        for lat in 1:latsize, lon in 1:lonsize
            t = A[lon,lat,winarray]
            if all(.!ismissing.(t))
                Aout[lon,lat,d] = percentile(skipmissing(t), 90)
            else
                Aout[lon,lat,d] = missing
            end
        end
        #imfilter used to roll window. similir to np.roll
        winarray = imfilter(winarray, OffsetArray([1], -1 - d), "circular")
        winarray = convert(BitArray, winarray)
    end
    Aout
end

function PA13(A, window=15)
    ndays = 365
    latsize = size(A,2)
    lonsize = size(A,1)
    Aout = Array{eltype(A)}(undef, lonsize, latsize, ndays)
    nyrs = 1 
    #
    winarray = BitArray(undef, ndays)
    winarray .= 0
    winarray[[1:window ÷ 2 + 1; (ndays + 1 - window ÷2):end]] .= true
    winarray = repeat(winarray, nyrs)
    PA13!(A, Aout, winarray, ndays, latsize, lonsize)
    Aout
end
const x = rand(223,152,365);
PA13(x);

with the corresponding benchmarking results

BenchmarkTools.Trial: 
  memory estimate:  8.39 GiB
  allocs estimate:  111350557
  --------------
  minimum time:     12.593 s (7.96% GC)
  median time:      12.593 s (7.96% GC)
  mean time:        12.593 s (7.96% GC)
  maximum time:     12.593 s (7.96% GC)
  --------------
  samples:          1
  evals/sample:     1

and code_warntype results

Variables
  #self#::Core.Compiler.Const(PA13, false)
  A::Array{Float64,3}

Body::Array{Float64,3}
1 ─ %1 = (#self#)(A, 15)::Array{Float64,3}
└──      return %1

Are there any gains to be made?

I think you can do this, which cuts the time in half:

if !any(ismissing, t)
   Aout[lon,lat,d] = quantile!(t, 0.9) # from Statistics
else
   Aout[lon,lat,d] = missing
end

This might just be circshift!(winarray, -d).

I suspect that avoiding this would be the next big step. winarray is quite a simple mask, it’s false on some range lo:hi. Could you just keep track of these bounds, and copy the right bit of A into the same t each time?

3 Likes

Thanks. I have implemented your suggestions. They helped shave of a couple seconds and half memory allocations.

In addtion, I changed

t = A[lon,lat,winarray]

to

t = view(A, lon, lat, winarray)

This is an interesting idea. I will try and implement this.

Julia Arrays are column major, and you are slicing along the 3rd dimension, which is very inefficient. Can you re-arrange your arrays to slice along the first dimension?

1 Like

It also looks like you intend to re-use the memory for winarray, but in the quoted code you are allocating new memory for it, twice in each iteration. Both of the above assignments create new arrays.

I have rewritten PA13! based on @mcabbott and @DNF suggestions. I should note that the permuted dims version was not any more performant.

function PA13!(A, Aout, winarray, ndays, latsize, lonsize)
    # Asubset = Array{eltype(A)}(undef, sum(winarray), latsize, lonsize)
    Asubset = Array{eltype(A)}(undef, lonsize, latsize, sum(winarray))
    t = Array{eltype(A)}(undef, sum(winarray))
    for d in 1:ndays
        # permutedims!(Asubset, view(A, :,:, winarray), (3,2,1))
        copy!(Asubset, view(A,:,:,winarray))
        for lat in 1:latsize, lon in 1:lonsize
            copy!(t, view(Asubset, lon, lat, :))
            if !any(ismissing, t)
                Aout[lon, lat,d] = try
                    quantile!(t, 0.9)
                catch
                    quantile!(collect(skipmissing(t)), 0.9)
                end
            else
                Aout[lon, lat, d] = missing
            end
        end
        winarray = circshift(winarray,1)
    end
    Aout
end

and got the following improvements: First, when x is type Array{Float32}

BenchmarkTools.Trial: 
  memory estimate:  102.81 MiB
  allocs estimate:  1105
  --------------
  minimum time:     5.515 s (0.00% GC)
  median time:      5.515 s (0.00% GC)
  mean time:        5.515 s (0.00% GC)
  maximum time:     5.515 s (0.00% GC)
  --------------
  samples:          1
  evals/sample:     1

and then when x is type Array{Union{Missing,Float32}}

BenchmarkTools.Trial: 
  memory estimate:  115.64 MiB
  allocs estimate:  1105
  --------------
  minimum time:     8.652 s (0.00% GC)
  median time:      8.652 s (0.00% GC)
  mean time:        8.652 s (0.00% GC)
  maximum time:     8.652 s (0.00% GC)
  --------------
  samples:          1
  evals/sample:     1

Thanks to you both for the help.