# Allocation help - preforming computations over loops

I am trying to optimize my code and I’ve tried to follow all performance tips in the manual and browsing the forum. I’ve avoided slicing arrays with `@views`, made sure to pre-allocate arrays, etc. However, I don’t really know what the “lower bound” of memory allocation my function should get to. My function basically gets a large array 3D array and loops over the 3rd dimension to perform a calculation (correlation matrix) and store it in a result array. I need to perform this many many times (inside an optimization). Here is the gist and the many attempts to maximize performance:

``````function testfun(a,b) #my naive implementation
temp = [[a[:,:,t] b[:,:,t]] for t = 1:10]
res = Vector{Float64}(undef, Int(10*6*5/2))
for t = 1:10
range2 = ((t-1)*15 + 1):15*t
res[range2] .= @views [cor(temp[t])[i,j] for j = 1:5 for i = (j+1):6][:]
end
res[isnan.(res)] .= -10.0
return res
end
function testfun2(a,b) #pre-allocate intermediate and output arrays
temp = Matrix{Float64}(undef, size(a,1), size(a,2)+size(b,2))
res = Vector{Float64}(undef, Int(10*6*5/2))
for t = 1:10
temp .= @views [a[:,:,t] b[:,:,t]]
range2 = ((t-1)*15 + 1):15*t
res[range2] .= @views [cor(temp)[i,j] for j = 1:5 for i = (j+1):6][:]
end
res[isnan.(res)] .= -10.0
return res
end
function testfun3!(res,a,b,temp) # use arrays as inputs
for t = 1:10
temp .= @views [a[:,:,t] b[:,:,t]]
range2 = ((t-1)*15 + 1):15*t
res[range2] .= @views [cor(temp)[i,j] for j = 1:5 for i = (j+1):6][:]
end
res[isnan.(res)] .= -10.0
return res
end
function testfun4!(res,c,temp) # instead of concatenating, use as input
for t = 1:10
temp .= @views c[t]
range2 = ((t-1)*15 + 1):15*t
res[range2] .= @views [cor(temp)[i,j] for j = 1:5 for i = (j+1):6][:]
end
res[isnan.(res)] .= -10.0
return res
end
a = rand(300_000, 2, 10)
b = rand(300_000, 4, 10)
temp = Matrix{Float64}(undef, size(a,1), size(a,2)+size(b,2))
res = Vector{Float64}(undef, Int(10*6*5/2))
c = [[a[:,:,t] b[:,:,t]] for t = 1:10]

@btime testfun(\$a,\$b) #2.456s (1755 allocations: 2.28 GiB)
@btime testfun2(\$a,\$b) #1.690s (1716 allocations: 2.16 GiB)
@btime testfun3!(\$res,\$a,\$b,\$temp) #1.511s (1713 allocations: 2.15 GiB)
@btime testfun4!(\$res,\$c,\$temp) #1.597s (1693 allocations: 2.01GiB)
``````

Is there something else I am missing?

This is inefficient. `c` is a vector, and this copies over all the data over and over. Better to just do `temp = c[t]` which does basically no work.

You’re calculating `cor(temp)` for each iteration. Move that calculation outside the comprehension. Also, the `@views` and `[:]` make no sense. Why are they there?

Yes, this is all quite inefficient. You are creating a lot of arrays, needlessly. Don’t create the comprehension, just do the `cor` once and then loop over its output, writing it directly into `res`.

Also this:

Creates an unnecessary array. Instead do something like

``````replace!(x->isnan(x) ? -10 : x, res)
``````
1 Like

This line is probably allocating.

``````res[isnan.(res)] .= -10.0
``````

You could replace it with a simple non-allocating loop.

``````julia> h(v) = @inbounds for i in eachindex(v) if isnan(v[i]) v[i] = -10.0 end end
h (generic function with 1 method)

julia> g(v) = v[isnan.(v)] .= -10.0
g (generic function with 1 method)

julia> @benchmark g(\$tmp1)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
Range (min … max):  1.790 μs … 19.300 μs  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     2.190 μs              ┊ GC (median):    0.00%
Time  (mean ± σ):   2.603 μs ±  1.417 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

▃▇█▆▅▄▄▂▃▁                                   ▁▁▁          ▂
█████████████▇▆██▇▆▅▅▇███▇▇▆███▇▇▇▆▅▅▄▄▃▄▄▁▃▅▇████▇▇▆▇▄▄▅▆ █
1.79 μs      Histogram: log(frequency) by time     8.29 μs <

Memory estimate: 5.61 KiB, allocs estimate: 4.

julia> @benchmark h(\$tmp1)
BenchmarkTools.Trial: 10000 samples with 178 evaluations.
Range (min … max):  609.551 ns …  1.052 μs  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     705.618 ns              ┊ GC (median):    0.00%
Time  (mean ± σ):   705.336 ns ± 17.602 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

▂▁     ▁█▇▁      ▁
▂▂▂▂▁▁▁▁▂▂▂▁▂▂▂▂▁▁▁▁▁▂▂▂▂▂▂▂▂▂▁▁▃███▅▃▂▂▄████▄▃▂▂▂▆█▆▄▂▂▂▂▂▂ ▃
610 ns          Histogram: frequency by time          744 ns <

Memory estimate: 0 bytes, allocs estimate: 0.
``````

I tried a variety of more idiomatic Julia alternatives but the simple loop was by far the fastest.

1 Like

Did you compare it to `replace!()`?

Yes, the simple loop was a little faster than `replace!`. Not sure why the compiler isn’t generating optimal code in this case, or for the other alternatives I tried that all seemed to me should generate efficient code.

``````julia> ii(res) = replace!(x->isnan(x) ? -10 : x, res)
ii (generic function with 1 method)

julia> @benchmark ii(\$tmp1)
BenchmarkTools.Trial: 10000 samples with 21 evaluations.
Range (min … max):  957.143 ns …  2.052 μs  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):       1.000 μs              ┊ GC (median):    0.00%
Time  (mean ± σ):     1.013 μs ± 49.128 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

▂ ▆▄▃▃ ▇█▂ ▂▁▆▆                           ▂      ▃▃     ▂ ▂
▃███▁████▁███▁████▁▆▅▄▅▁▇▅▅▁▅▆▆▇▁▆▅▄▁▄▁▁▁▁▁▁▇█▁█▅▅▁▆██▆▁▆▆▇█ █
957 ns        Histogram: log(frequency) by time      1.18 μs <

Memory estimate: 0 bytes, allocs estimate: 0.
`````` I did not notice I was computing correlations so many times, obviously that is the biggest time consumption in this code. The `@views` and `[:]` definitely don’t make sense either.

Regarding the loop vs replace for the `isnan`, they end up being the same:

``````function testfun6!(res,c)
for t = 1:10
temp = cor(c[t])
range2 = ((t-1)*15 + 1):15*t
res[range2] .= [temp[i,j] for j = 1:5 for i = (j+1):6]
end
for i in eachindex(res)
if isnan(res[i])
res[i] = -10.0
end
end
#res[isnan.(res)] .= -10.0
return res
end

function testfun7!(res,c)
for t = 1:10
temp = cor(c[t])
range2 = ((t-1)*15 + 1):15*t
res[range2] .= [temp[i,j] for j = 1:5 for i = (j+1):6]
end
replace!(x->isnan(x) ? -10 : x, res)
return res
end

@btime testfun6!(\$res, \$c) #84.769 ms (150 allocations: 137.35 MiB)
@btime testfun7!(\$res, \$c) #82.930 ms (150 allocations: 137.35 MiB)
``````

That’s a nice improvement But there’s still a classic mistake:

Here, you create an array on the right hand side, and then copy the elements over. It’s better to never create that rhs array. Just loop over `i` and `j` and directly write the values into `res`.

1 Like