Allocation help - preforming computations over loops

I am trying to optimize my code and I’ve tried to follow all performance tips in the manual and browsing the forum. I’ve avoided slicing arrays with @views, made sure to pre-allocate arrays, etc. However, I don’t really know what the “lower bound” of memory allocation my function should get to. My function basically gets a large array 3D array and loops over the 3rd dimension to perform a calculation (correlation matrix) and store it in a result array. I need to perform this many many times (inside an optimization). Here is the gist and the many attempts to maximize performance:

function testfun(a,b) #my naive implementation
    temp = [[a[:,:,t] b[:,:,t]] for t = 1:10]
    res = Vector{Float64}(undef, Int(10*6*5/2))
    for t = 1:10
        range2 = ((t-1)*15 + 1):15*t
        res[range2] .= @views [cor(temp[t])[i,j] for j = 1:5 for i = (j+1):6][:]
    end
    res[isnan.(res)] .= -10.0
    return res
end
function testfun2(a,b) #pre-allocate intermediate and output arrays
    temp = Matrix{Float64}(undef, size(a,1), size(a,2)+size(b,2))
    res = Vector{Float64}(undef, Int(10*6*5/2))
    for t = 1:10
        temp .= @views [a[:,:,t] b[:,:,t]]
        range2 = ((t-1)*15 + 1):15*t
        res[range2] .= @views [cor(temp)[i,j] for j = 1:5 for i = (j+1):6][:]
    end
    res[isnan.(res)] .= -10.0
    return res
end
function testfun3!(res,a,b,temp) # use arrays as inputs
    for t = 1:10
        temp .= @views [a[:,:,t] b[:,:,t]]
        range2 = ((t-1)*15 + 1):15*t
        res[range2] .= @views [cor(temp)[i,j] for j = 1:5 for i = (j+1):6][:]
    end
    res[isnan.(res)] .= -10.0
    return res
end
function testfun4!(res,c,temp) # instead of concatenating, use as input
    for t = 1:10
        temp .= @views c[t]
        range2 = ((t-1)*15 + 1):15*t
        res[range2] .= @views [cor(temp)[i,j] for j = 1:5 for i = (j+1):6][:]
    end
    res[isnan.(res)] .= -10.0
    return res
end
a = rand(300_000, 2, 10)
b = rand(300_000, 4, 10)
temp = Matrix{Float64}(undef, size(a,1), size(a,2)+size(b,2))
res = Vector{Float64}(undef, Int(10*6*5/2))
c = [[a[:,:,t] b[:,:,t]] for t = 1:10]

@btime testfun($a,$b) #2.456s (1755 allocations: 2.28 GiB)
@btime testfun2($a,$b) #1.690s (1716 allocations: 2.16 GiB)
@btime testfun3!($res,$a,$b,$temp) #1.511s (1713 allocations: 2.15 GiB)
@btime testfun4!($res,$c,$temp) #1.597s (1693 allocations: 2.01GiB)

Is there something else I am missing?

This is inefficient. c is a vector, and this copies over all the data over and over. Better to just do temp = c[t] which does basically no work.

You’re calculating cor(temp) for each iteration. Move that calculation outside the comprehension. Also, the @views and [:] make no sense. Why are they there?

Yes, this is all quite inefficient. You are creating a lot of arrays, needlessly. Don’t create the comprehension, just do the cor once and then loop over its output, writing it directly into res.

Also this:

Creates an unnecessary array. Instead do something like

replace!(x->isnan(x) ? -10 : x, res)
1 Like

This line is probably allocating.

res[isnan.(res)] .= -10.0

You could replace it with a simple non-allocating loop.

julia> h(v) = @inbounds for i in eachindex(v) if isnan(v[i]) v[i] = -10.0 end end
h (generic function with 1 method)

julia> g(v) = v[isnan.(v)] .= -10.0
g (generic function with 1 method)

julia> @benchmark g($tmp1)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.790 μs … 19.300 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.190 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.603 μs ±  1.417 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▃▇█▆▅▄▄▂▃▁                                   ▁▁▁          ▂
  █████████████▇▆██▇▆▅▅▇███▇▇▆███▇▇▇▆▅▅▄▄▃▄▄▁▃▅▇████▇▇▆▇▄▄▅▆ █
  1.79 μs      Histogram: log(frequency) by time     8.29 μs <

 Memory estimate: 5.61 KiB, allocs estimate: 4.

julia> @benchmark h($tmp1)
BenchmarkTools.Trial: 10000 samples with 178 evaluations.
 Range (min … max):  609.551 ns …  1.052 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     705.618 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   705.336 ns ± 17.602 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                    ▂▁     ▁█▇▁      ▁
  ▂▂▂▂▁▁▁▁▂▂▂▁▂▂▂▂▁▁▁▁▁▂▂▂▂▂▂▂▂▂▁▁▃███▅▃▂▂▄████▄▃▂▂▂▆█▆▄▂▂▂▂▂▂ ▃
  610 ns          Histogram: frequency by time          744 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

I tried a variety of more idiomatic Julia alternatives but the simple loop was by far the fastest.

1 Like

Did you compare it to replace!()?

Yes, the simple loop was a little faster than replace!. Not sure why the compiler isn’t generating optimal code in this case, or for the other alternatives I tried that all seemed to me should generate efficient code.

julia> ii(res) = replace!(x->isnan(x) ? -10 : x, res)
ii (generic function with 1 method)

julia> @benchmark ii($tmp1)
BenchmarkTools.Trial: 10000 samples with 21 evaluations.
 Range (min … max):  957.143 ns …  2.052 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):       1.000 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):     1.013 μs ± 49.128 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▂ ▆▄▃▃ ▇█▂ ▂▁▆▆                           ▂      ▃▃     ▂ ▂
  ▃███▁████▁███▁████▁▆▅▄▅▁▇▅▅▁▅▆▆▇▁▆▅▄▁▄▁▁▁▁▁▁▇█▁█▅▅▁▆██▆▁▆▆▇█ █
  957 ns        Histogram: log(frequency) by time      1.18 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

:woozy_face: I did not notice I was computing correlations so many times, obviously that is the biggest time consumption in this code. The @views and [:] definitely don’t make sense either.

Regarding the loop vs replace for the isnan, they end up being the same:

function testfun6!(res,c)
    for t = 1:10
        temp = cor(c[t])
        range2 = ((t-1)*15 + 1):15*t
        res[range2] .= [temp[i,j] for j = 1:5 for i = (j+1):6]
    end
    for i in eachindex(res)
        if isnan(res[i])
            res[i] = -10.0
        end
    end
    #res[isnan.(res)] .= -10.0
    return res
end

function testfun7!(res,c)
    for t = 1:10
        temp = cor(c[t])
        range2 = ((t-1)*15 + 1):15*t
        res[range2] .= [temp[i,j] for j = 1:5 for i = (j+1):6]
    end
    replace!(x->isnan(x) ? -10 : x, res)
    return res
end

@btime testfun6!($res, $c) #84.769 ms (150 allocations: 137.35 MiB)
@btime testfun7!($res, $c) #82.930 ms (150 allocations: 137.35 MiB)

That’s a nice improvement :+1: But there’s still a classic mistake:

Here, you create an array on the right hand side, and then copy the elements over. It’s better to never create that rhs array. Just loop over i and j and directly write the values into res.

1 Like