# Why does substituting a "for loop" with direct assignment of a Matrix worsen the performance?

Why does `testS()` execute faster than `testP()`, even though the latter replaces a "for loop" (which is supposed to consume time) with a direct assignment to the matrix, as shown below?

```julia
using BenchmarkTools

function testS(S,r)
    for i in size(S,1)
        if r==1
            for j in size(S,2)
                S[i,j] = S[i,j] + 1;
            end
        elseif r==2
            for j in size(S,2)
                S[i,j] = S[i,j] + 2;
            end
        end
    end
end

function testP(S,r)
    for i in size(S,1)
        if r==1
            S[i,:] = [S[i,1]+1 S[i,2]+1 S[i,3]+1 S[i,4]+1];
        elseif r==2
            S[i,:] = [S[i,1]+2 S[i,2]+2 S[i,3]+2 S[i,4]+2];
        end
    end
end

dt = 0.0001;
tmin = 0;
tmax = 1;
timeSim = tmin:dt:tmax;
S = [0 0 0 0; 0 0 0 0; 0 0 0 0]
r = 1;

@btime begin for t in tmin:dt:tmax testS(S,r) end end  # It gives 595.700 μs
@btime begin for t in tmin:dt:tmax testP(S,r) end end  # It gives 949.500 μs
```

`[S[i,1]+1 S[i,2]+1 S[i,3]+1 S[i,4]+1]` allocates an array. If you instead wrote `S[i,:] .+= 1`, you would recover the performance.
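A minimal sketch of that broadcast version, assuming the same 3×4 integer matrix as above (the name `testB` is my own):

```julia
# Variant of testP that broadcasts in place instead of building a row literal.
function testB(S, r)
    for i in 1:size(S, 1)
        if r == 1
            @views S[i, :] .+= 1   # @views makes both sides views: no temporary array
        elseif r == 2
            @views S[i, :] .+= 2
        end
    end
end

S = zeros(Int, 3, 4)
testB(S, 1)   # every entry of S is now 1
```

Without `@views`, the `S[i,:]` on the right-hand side of the expanded `.+=` would still copy the row; with it, the whole update is allocation-free.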

The problem is that in my real code the assignment is not the same for all the matrix elements, i.e.,

```julia
# ....
S[i,:] = [S[i,1]+4 S[i,2]+3 S[i,3]+6 S[i,4]+2];
# ...
```

Performance aside, your code is actually doing nothing other than `S[end] += r`, and the matrix version is doing a different thing. Note: you might be used to Python's `for i in range(n)`, which loops from 0 to n-1; in Julia, you must state the range explicitly, as in `for i in 0:n-1`. In your code, `for i in size(S,1)` means `i = 3` and `for j in size(S,2)` means `j = 4`, a single iteration each, so what you do is actually `S[3,4] += r`, while the matrix form touches whole rows.
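To see the single-iteration behavior concretely, a quick sketch:

```julia
S = zeros(Int, 3, 4)

# In Julia an integer is iterable and yields just itself,
# so this loop body runs exactly once, with i == 3:
for i in size(S, 1)
    println(i)          # prints 3
end

# To visit every row you need an explicit range:
for i in 1:size(S, 1)
    println(i)          # prints 1, 2, 3
end
```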


Yes, but let's say the following:

```julia
using BenchmarkTools

function testS(S,r)
    for i in 1:size(S,1)
        if r==1
            for j in 1:size(S,2)
                S[i,j] = S[i,j] + i;
            end
        elseif r==2
            for j in 1:size(S,2)
                S[i,j] = S[i,j] + j;
            end
        end
    end
end

function testP(S,r)
    for i in 1:size(S,1)
        if r==1
            S[i,:] = [S[i,1]+1 S[i,2]+4 S[i,3]+5 S[i,4]+6];
        elseif r==2
            S[i,:] = [S[i,1]+4 S[i,2]+2 S[i,3]+5 S[i,4]+2];
        end
    end
end

dt = 0.0001;
tmin = 0;
tmax = 1;
timeSim = tmin:dt:tmax;
S = [0 0 0 0; 0 0 0 0; 0 0 0 0]
r = 1;
@btime begin for t in tmin:dt:tmax testS(S,r) end end  # It gives 660.100 μs
@btime begin for t in tmin:dt:tmax testP(S,r) end end  # It gives 1.550 ms
```

@Seif_Shebl @Oscar_Smith please have a look here!

Again, what your loop does is this:

```julia
if r == 1
    S .+= 1:size(S,1)
elseif r == 2
    S .+= (1:size(S,2))'
end
```

You should take care of the dot in .+= as @Oscar_Smith said.
And the matrix form does:

```julia
if r == 1
    S .+= [4 3 6 2]
elseif r == 2
    S .+= [4 2 5 2]
end
```

In my real code, the situation is exactly as in my previous post (the same `testS`/`testP` definitions and the same timings).

As I said, you’re comparing two different things, here is the same code with the matrix form actually faster:

```julia
function testS(S,r)
    s1 = [1, 4, 5, 6]
    s2 = [4, 2, 5, 2]
    for i in 1:size(S,1)
        if r==1
            for j in 1:size(S,2)
                S[i,j] = S[i,j] + s1[j];
            end
        elseif r==2
            for j in 1:size(S,2)
                S[i,j] = S[i,j] + s2[j];
            end
        end
    end
end

function testP(S,r)
    if r==1
        S .+= [1 4 5 6]
    elseif r==2
        S .+= [4 2 5 2]
    end
end

@btime begin for t in tmin:dt:tmax testS($S,$r) end end
@btime begin for t in tmin:dt:tmax testP($S,$r) end end
```

```
1.320 ms (49495 allocations: 2.43 MiB)
1.191 ms (39494 allocations: 1.52 MiB)
```
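The remaining allocations in `testP` come from rebuilding the row-vector literals on every call; a sketch (the hoisted constants are my own addition) that constructs them once:

```julia
const ROW1 = [1 4 5 6]   # built once instead of on every call
const ROW2 = [4 2 5 2]

function testP2(S, r)
    if r == 1
        S .+= ROW1   # broadcasts the 1×4 row over every row of S
    elseif r == 2
        S .+= ROW2
    end
end

S = zeros(Int, 3, 4)
testP2(S, 1)   # S is now [1 4 5 6; 1 4 5 6; 1 4 5 6]
```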

@Seif_Shebl @Oscar_Smith thank you for your replies. Below is my real code, in which I believe I am comparing two similar things, but again the "for loop" gives faster performance:

```julia
using BenchmarkTools
using Parameters

Base.@kwdef mutable struct SSs
    PP::Vector{Float64} = [0,0,0]
    LL::Vector{Float64} = [0,0,0]
end # mutable struct SSs

function testS(S,r)
    for i in 1:size(S,1)
        if r==1
            for j in 1:3
                S[i].LL[j] = S[i].PP[j]*S[i].PP[j] + 1;
            end
        elseif r==2
            for j in 1:3
                S[i].LL[j] = S[i].PP[j]*S[i].PP[j] + 2;
            end
        end
    end
end

function testP(S,r)
    for i in 1:size(S,1)
        if r==1
            S[i].LL = S[i].PP .* S[i].PP .+ 1;
        elseif r==2
            S[i].LL = S[i].PP .* S[i].PP .+ 2;
        end
    end
end

dt = 0.0001;
tmin = 0;
tmax = 1;
timeSim = tmin:dt:tmax;
S = SSs[];
push!(S, SSs());
S[1].PP = [2,3,1];
push!(S, SSs());
S[2].PP = [4,5,6];
r = 1;
@btime begin for t in tmin:dt:tmax testS(S,r) end end  # 640.000 μs
@btime begin for t in tmin:dt:tmax testP(S,r) end end  # 1.166 ms
```

You’re defining a new array here (which allocates more memory) and assigning it to the mutable field `LL`:

```julia
S[i].LL = S[i].PP .* S[i].PP .+ 1
```

You need to instead use dot-assignment:

```julia
S[i].LL .= S[i].PP .* S[i].PP .+ 1
```

An abbreviated way to insert all those dots is the `@.` broadcast macro:

```julia
@. S[i].LL = S[i].PP^2 + 1
```

For such small arrays, you’re likely to get an additional speedup by using static/mutable arrays from StaticArrays.jl.


Actually, I didn't get why, in my way, I am defining a new array. In other words, what is the dot doing here?

Can you give me an example here?

Take a look at this example to understand what is going on:

```julia
julia> Base.@kwdef mutable struct SSs
           PP::Vector{Float64} = [0,0,0]
           LL::Vector{Float64} = [0,0,0]
       end
SSs

julia> pp1 = [1.,1.,1.]; ll1 = [2.,2.,2.];

julia> s1 = SSs(pp1,ll1)
SSs([1.0, 1.0, 1.0], [2.0, 2.0, 2.0])

julia> s1.PP = [0.,0.,0.] # this replaces the array PP
3-element Vector{Float64}:
 0.0
 0.0
 0.0

julia> pp1 # the original pp1 was not modified
3-element Vector{Float64}:
 1.0
 1.0
 1.0

julia> s1.LL .= [0.,0.,0.]; # note the dot: this mutates s1.LL

julia> ll1 # the original array was modified
3-element Vector{Float64}:
 0.0
 0.0
 0.0
```

That said, merely by writing `s1.LL = [0.,0.,0.]` (for example), you create the array `[0.,0.,0.]`, which is a mutable object and allocates memory. For this specific case you could use `s1.LL .= (0,0,0)`, which creates a tuple on the right-hand side; being immutable, tuples do not allocate memory (the compiler can optimize them out, since it knows their values and length won't change).
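The rebinding-versus-mutation distinction can be checked with `===` (object identity); a small sketch with a hypothetical struct `T`:

```julia
mutable struct T
    LL::Vector{Float64}
end

t = T([1.0, 2.0, 3.0])
old = t.LL

t.LL = [0.0, 0.0, 0.0]   # rebinds the field to a brand-new array
@assert t.LL !== old      # `old` is a different array, left untouched

t.LL = old
t.LL .= (0, 0, 0)         # fills the existing array in place
@assert t.LL === old      # same array object, contents overwritten
```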

The StaticArrays version of that would be:

```julia
julia> using StaticArrays

julia> Base.@kwdef mutable struct S
           PP::SVector{3,Float64} = @SVector [0,0,0]
           LL::SVector{3,Float64} = @SVector [0,0,0]
       end
S

julia> s = S()
S([0.0, 0.0, 0.0], [0.0, 0.0, 0.0])

julia> s.PP = @SVector [1.,1.,1.];  # one way

julia> s.LL = SVector{3,Float64}(2,2,2);  # another way

julia> s
S([1.0, 1.0, 1.0], [2.0, 2.0, 2.0])
```


Or just `s1.LL .= 0`.


To the OP: unless you are performing matrix algebra, you can normally assume that ordinary loops will give the best performance.

Also, remember that Julia arrays are column-major, so try to let your inner loop iterate over the first index, not the second as in your MWE.

Of course

@lmiq Thank you very much for your explanation.
Is the below correct for a 3×3 `Matrix{Float64}` in the StaticArrays version?

```julia
Base.@kwdef mutable struct AAs
    Lss::SArray{Float64, 2} = @SArray [0 0 0; 0 0 0; 0 0 0]
end
```

`@SMatrix`, I think.


Since you’re mutating your matrix, you’ll want to use `MMatrix` instead of `SMatrix`, with this type signature:

```julia
MMatrix{3, 3, Float64, 9}
```

With a mutable matrix, you don’t need the container struct to be mutable itself.
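A sketch of what that could look like (assuming StaticArrays is installed; the struct name is taken from the earlier post):

```julia
using StaticArrays

# The field is an MMatrix, so the struct itself can stay immutable.
Base.@kwdef struct AAs
    Lss::MMatrix{3, 3, Float64, 9} = @MMatrix zeros(3, 3)
end

a = AAs()
a.Lss[1, 1] = 5.0   # mutates the matrix without replacing the field
```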


@DNF thank you for your explanation.
What does “Julia arrays are column-major” mean? Can you give me an example of “inner loop iterates over the first index, not the second”?

Arrays are stored in contiguous, linear blocks of memory:

```julia
julia> v = [1:6...]
6-element Vector{Int64}:
 1
 2
 3
 4
 5
 6
```

Column-major matrices are contiguous along the first dimension, while row-major matrices are contiguous along the second:

```julia
julia> reshape(v, 2, 3) # column-major
2×3 Matrix{Int64}:
 1  3  5
 2  4  6

julia> reshape(v, 2, 3)' # transposed to row-major
3×2 adjoint(::Matrix{Int64}) with eltype Int64:
 1  2
 3  4
 5  6
```

The CPU is faster when it can operate on contiguous blocks of memory, so it’s better for the inner loop (with the fastest-changing index) to correspond to those contiguous blocks. For column-major languages like Julia, that means `w[1,1], w[2,1], w[3,1]` rather than `w[1,1], w[1,2], w[1,3]`. For example,

```julia
julia> function alongcolumns(w)
           for j in axes(w, 2)
               for i in axes(w, 1)
                   w[i, j] += 1
               end
           end
       end
alongcolumns (generic function with 1 method)

julia> function alongrows(w)
           for i in axes(w, 2)
               for j in axes(w, 1)
                   w[i, j] += 1
               end
           end
       end
alongrows (generic function with 1 method)

julia> w = rand(100, 100);

julia> @btime alongcolumns($w)
  710.000 ns (0 allocations: 0 bytes)

julia> @btime alongrows($w)
  2.889 μs (0 allocations: 0 bytes)
```

That’s a 4x difference, just by switching the loop order!
