Can someone please explain to me why the following happens. I know that I am accessing the array in the wrong order, it is necessary to reproduce the allocation and slowdown. This came up in an actual application and I could not figure out why:

function f1(A)
n1=size(A,1)
n2=size(A,2)
s=0.
for i=1:n1,j=1:n2-1:n2
s+=A[i,j]
end
return s
end
function f2(A)
n1=size(A,1)
n2=size(A,2)
s=0.
for i=1:n1, j in [1,n2]
s+=A[i,j]
end
return s
end
n=10;A=rand(n,n)
f1(A);f2(A);
n=10000;A=rand(n,n)
@time s1=f1(A);
@time s2=f2(A);
s1-s2
0.000608 seconds (5 allocations: 176 bytes)
0.001324 seconds (20.00 k allocations: 1.068 MB)
0.0

The functions were compiled before, I didn’t post it.

I think you are right about the array issue if you change the function to:

function f1(A)
n1=size(A,1)
n2=size(A,2)
s=0.
for i=1:n1,j=1:n2-1:n2
s+=A[i,j]
end
return s
end
function f2(A)
n1=size(A,1)
n2=size(A,2)
s=0.
inds=[1,n2]
for i=1:n1, j in inds
s+=A[i,j]
end
return s
end
n=10;A=rand(n,n)
f1(A);f2(A);
n=10000;A=rand(n,n)
@time s1=f1(A);
@time s2=f2(A);
@time s1=f1(A);
@time s2=f2(A);
s1-s2
0.000602 seconds (5 allocations: 176 bytes)
0.000083 seconds (7 allocations: 288 bytes)
0.000581 seconds (5 allocations: 176 bytes)
0.000080 seconds (7 allocations: 288 bytes)

It becomes even faster! Which I don’t understand either.

I guess in the first version a 2 vector keeps getting created and garbage collected; I thought the compiler could figure out that it was a constant and pre-compute it but it didn’t.

Or I am just lost, which is a perfectly good explanation

Can someone please explain to me why the following happens. I know
that I am accessing the array in the wrong order, it is necessary to
reproduce the allocation and slowdown. This came up in an actual
application and I could not figure out why:

Cannot answer your question as such (I’m definitely a n00b in Julia) but
do note that the two ranges for j are different?

The j ranges are the same. The issue was that the vector [1,n2] was being created and garbage collected inside the loop at every iteration, which is slow. With the step range 1:n2-1:n2 that does not happen.

I am not sure if you have intended to use the code for matrices of dimensions min(m,n) > 1, but in f1, there is a bug: when n2 == 1, you get an error. Plus, why would you iterate if you know ahead of time that you will use the first and last columns of the matrix?

Below code excerpt will save you one loop, and hence, is more efficient:

function f1(A)
n1=size(A,1)
n2=size(A,2)
s=0.
for i=1:n1,j=1:n2-1:n2
s+=A[i,j]
end
return s
end
function f2(A)
n1=size(A,1)
n2=size(A,2)
s=0.
inds=[1,n2]
for i=1:n1, j in inds
s+=A[i,j]
end
return s
end
function f3(A::AbstractMatrix)
n1, n2 = size(A)
s = zero(eltype(A))
for row in 1:n1
s += A[row, 1] + A[row, n2]
end
s
end
function f4{T<:AbstractFloat}(A::AbstractMatrix{T})
n1, n2 = size(A)
s = zero(T)
for row in 1:n1
s += A[row, 1] + A[row, n2]
end
s
end
function f4{T<:Integer}(A::AbstractMatrix{T})
n1, n2 = size(A)
s = zero(widen(T))
for row in 1:nrows
s += A[row, 1] + A[row, n2]
end
s
end
using BenchmarkTools
n=10000; A = rand(n, n);
@benchmark f1(A) # 181.749 us
@benchmark f2(A) # 24.275 us
@benchmark f3(A) # 11.238 us
@benchmark f4(A) # 11.240 us

If you still want to use iterations, you should put an assertion for n2 > 1. And in f3 above, for matrices having Integer elements, s might overflow. Overloading {T<:Integer}(A::AbstractMatrix{T}) with s = zero(widen(eltype(A))) would solve this problem. For AbstractFloats, you don’t have this problem (cf. f4 above).

This code is just meant to reproduce the behavior I was seeing in another application, n>>>1 and I am not all that interested in summing matrices .

This typically comes up when you iterate on an array, and then you want to do something different at the ends (think boundary conditions). Keeping the same code structure makes for a bit more elegant code. I know that it will always be faster if I rewrote the end cases as two explicit loops, but as dimensions grow, it gets tedious. The end conditions should be much less costly than the body of the loop, until something like my original f2() happens.

For boundary conditions at the ends, the most elegant, flexible, and efficient technique is often to use ghost elements. See my comment about this in another thread on discourse.

as a tule of thumb, instead of a constant array, you should always try to use a constant tuple. here is my version, which is faster than the Range version according to my measurements:

function f3(A)
n1=size(A,1)
n2=size(A,2)
s=0.0
for i=1:n1,j=(1,n2)
s+=A[i,j]
end
return s
end