Why substituting "for loop" with direct assignment of Matrix worsen the performance time?

Amro · September 22, 2021, 8:01pm

Why the execution time of “testS()” is faster than “testP()”, although the later has reduced a “for loop” (which is supposed to consume time) by a direct assignment of the Matrix, as shown below:

using BenchmarkTools

function testS(S,r) 

  for i in size(S,1)

    if r==1

      for j in size(S,2)

        S[i,j] = S[i,j] +1;

      end

    elseif r==2

      for j in size(S,2)

        S[i,j] = S[i,j] +2;

      end

    end 

  end 

end

function testP(S,r) 

  for i in size(S,1)

    if r==1

      S[i,:] = [S[i,1]+1 S[i,2]+1 S[i,3]+1 S[i,4]+1];

    elseif r==2

      S[i,:] = [S[i,1]+2 S[i,2]+2 S[i,3]+2 S[i,4]+2];

    end 

  end 

end

dt = 0.0001; 

tmin = 0;  

tmax = 1;  

timeSim = tmin:dt:tmax;

S = [0 0 0 0; 0 0 0 0; 0 0 0 0]

r = 1;

@btime begin for t in tmin:dt:tmax testS(S,r) end  end 
# It gives 595.700 μs

@btime begin for t in tmin:dt:tmax testP(S,r) end  end
# It gives 949.500 μs

Oscar_Smith · September 22, 2021, 8:05pm

[S[i,1]+1 S[i,2]+1 S[i,3]+1 S[i,4]+1]; allocates an array. If you instead wrote S[i,:] .+= 1, you would recover the performance.

Amro · September 22, 2021, 8:26pm

The problem is that in my real code the assignment is not the same for the matrix elements i.e.,

....
S[i,:] = [S[i,1]+4 S[i,2]+3 S[i,3]+6 S[i,4]+2];
...

Seif_Shebl · September 22, 2021, 8:48pm

Performance aside, your code is actually doing nothing other than S[end] += r. Also the matrix version is doing a different thing. Note: you might be used to Python’s for i in range(n) which will loop from 0 to n-1, in Julia, you must state explicitly for i in 0:n-1. In your code, for i in size(S,1) means i = 3 and for J in size(S,2) means j = 4, so what you do is actually S[3,4] += r and the matrix form is different.

Amro · September 22, 2021, 8:54pm

Yes, but lets say the following:

using BenchmarkTools

function testS(S,r) 
  for i in 1:size(S,1)
    if r==1
      for j in 1:size(S,2)
        S[i,j] = S[i,j] +i;
      end
    elseif r==2
      for j in 1:size(S,2)
        S[i,j] = S[i,j] +j;
      end
    end 
  end 
end

function testP(S,r) 
  for i in 1:size(S,1)
    if r==1
      S[i,:] = [S[i,1]+1 S[i,2]+4 S[i,3]+5 S[i,4]+6];
    elseif r==2
      S[i,:] = [S[i,1]+4 S[i,2]+2 S[i,3]+5 S[i,4]+2];
    end 
  end 
end

dt = 0.0001; 
tmin = 0;  
tmax = 1;  
timeSim = tmin:dt:tmax;
S = [0 0 0 0; 0 0 0 0; 0 0 0 0]
r = 1;
@btime begin for t in tmin:dt:tmax testS(S,r) end  end    # It gives 660.100 μs
@btime begin for t in tmin:dt:tmax testP(S,r) end  end     # It gives 1.550 ms

@Seif_Shebl @Oscar_Smith please have a look here!

Seif_Shebl · September 22, 2021, 9:03pm

Again, what your loop does is this:

if r == 1
    S .+= 1:size(S,1)
elseif r == 2
    S .+= (1:size(S,2))' 
end

You should take care of the dot in .+= as @Oscar_Smith said.
And the matrix form does:

if r == 1
    S .+= [4 3 6 2]
elseif r == 2
    S .+= [4 2 5 2] 
end

Amro · September 22, 2021, 9:52pm

Amro:

using BenchmarkTools

function testS(S,r) 
  for i in 1:size(S,1)
    if r==1
      for j in 1:size(S,2)
        S[i,j] = S[i,j] +i;
      end
    elseif r==2
      for j in 1:size(S,2)
        S[i,j] = S[i,j] +j;
      end
    end 
  end 
end

function testP(S,r) 
  for i in 1:size(S,1)
    if r==1
      S[i,:] = [S[i,1]+1 S[i,2]+4 S[i,3]+5 S[i,4]+6];
    elseif r==2
      S[i,:] = [S[i,1]+4 S[i,2]+2 S[i,3]+5 S[i,4]+2];
    end 
  end 
end

dt = 0.0001; 
tmin = 0;  
tmax = 1;  
timeSim = tmin:dt:tmax;
S = [0 0 0 0; 0 0 0 0; 0 0 0 0]
r = 1;
@btime begin for t in tmin:dt:tmax testS(S,r) end  end    # It gives 660.100 μs
@btime begin for t in tmin:dt:tmax testP(S,r) end  end     # It gives 1.550 ms

In my real code, the situation is as below:

using BenchmarkTools

function testS(S,r) 
  for i in 1:size(S,1)
    if r==1
      for j in 1:size(S,2)
        S[i,j] = S[i,j] +i;
      end
    elseif r==2
      for j in 1:size(S,2)
        S[i,j] = S[i,j] +j;
      end
    end 
  end 
end

function testP(S,r) 
  for i in 1:size(S,1)
    if r==1
      S[i,:] = [S[i,1]+1 S[i,2]+4 S[i,3]+5 S[i,4]+6];
    elseif r==2
      S[i,:] = [S[i,1]+4 S[i,2]+2 S[i,3]+5 S[i,4]+2];
    end 
  end 
end

dt = 0.0001; 
tmin = 0;  
tmax = 1;  
timeSim = tmin:dt:tmax;
S = [0 0 0 0; 0 0 0 0; 0 0 0 0]
r = 1;
@btime begin for t in tmin:dt:tmax testS(S,r) end  end    # It gives 660.100 μs
@btime begin for t in tmin:dt:tmax testP(S,r) end  end     # It gives 1.550 ms

@Seif_Shebl @Oscar_Smith

Seif_Shebl · September 22, 2021, 9:58pm

As I said, you’re comparing two different things, here is the same code with the matrix form actually faster:

function testS(S,r) 
  s1 = [1, 4, 5, 6]
  s2 = [4, 2, 5, 2]
  for i in 1:size(S,1)
    if r==1
      for j in 1:size(S,2)
        S[i,j] = S[i,j] + s1[j];
      end
    elseif r==2
      for j in 1:size(S,2)
        S[i,j] = S[i,j] + s2[j];
      end
    end 
  end 
end

function testP(S,r) 
  if r==1
    S .+= [1 4 5 6]
  elseif r==2
    S .+= [4 2 5 2]
  end 
end

@btime begin for t in tmin:dt:tmax testS($S,$r) end  end   
@btime begin for t in tmin:dt:tmax testP($S,$r) end  end     
  1.320 ms (49495 allocations: 2.43 MiB)
  1.191 ms (39494 allocations: 1.52 MiB)

Amro · September 22, 2021, 11:16pm

@Seif_Shebl @Oscar_Smith thank you for your reply. Below, is my real code, in which I believe I am comparing between two similar things, but again the “for loop” gives faster performance:

using BenchmarkTools
using Parameters, Base;

Base.@kwdef mutable struct SSs
  PP::Vector{Float64} = [0,0,0]
  LL::Vector{Float64} = [0,0,0]
end #mutable struct RLC

function testS(S,r) 
  for i in 1:size(S,1)
    if r==1
      for j in 1:3
        S[i].LL[j] = S[i].PP[j]*S[i].PP[j]+1;
      end
    elseif r==2
      for j in 1:3
        S[i].LL[j] = S[i].PP[j]*S[i].PP[j]+2;
      end
    end 
  end 
end

function testP(S,r) 
  for i in 1:size(S,1)
    if r==1
      S[i].LL = S[i].PP.*S[i].PP.+1;
    elseif r==2
      S[i].LL = S[i].PP.*S[i].PP.+2;
    end 
  end 
end

dt = 0.0001; 
tmin = 0;  
tmax = 1;  
timeSim = tmin:dt:tmax;
S = SSs[];
push!(S, SSs());
S[1].PP = [2,3,1];
push!(S, SSs());
S[2].PP = [4,5,6];
r = 1;
@btime begin for t in tmin:dt:tmax testS(S,r) end  end  
    640.000 μs
@btime begin for t in tmin:dt:tmax testP(S,r) end  end
     1.166 ms

stillyslalom · September 23, 2021, 2:38am

You’re defining a new matrix here (which allocates more memory) and assigning it to the mutable slot LL:

S[i].LL = S[i].PP.*S[i].PP.+1

You need to instead use dot-assignment:

S[i].LL .= S[i].PP.*S[i].PP.+1

An abbreviated way to insert all those dots is the @. broadcast macro:

@. S[i].LL = S[i].PP^2 + 1

For such small arrays, you’re likely to get an additional speedup by substituting static/mutable arrays from StaticArrays.jl.

Amro · September 23, 2021, 7:06pm

@stillyslalom Thanks for your reply.
Actually, I didtn get why in my way, I am defining a new matrix. In other words, what does the dot is doing here?

Can you give me an example here?

lmiq · September 23, 2021, 7:18pm

Take a look at this example, to understand what is going on:

julia> Base.@kwdef mutable struct SSs
         PP::Vector{Float64} = [0,0,0]
         LL::Vector{Float64} = [0,0,0]
       end
SSs

julia> pp1 = [1.,1.,1.]; ll1 = [2.,2.,2.];

julia> s1 = SSs(pp1,ll1)
SSs([1.0, 1.0, 1.0], [2.0, 2.0, 2.0])

julia> s1.PP = [0.,0.,0.] # this replaces the array PP
3-element Vector{Float64}:
 0.0
 0.0
 0.0

julia> pp1 # the original pp1 was not modified
3-element Vector{Float64}:
 1.0
 1.0
 1.0

julia> s1.LL .= [0.,0.,0.]; # note the . : this mutates s1.LL

julia> ll1 # the original array was modified
3-element Vector{Float64}:
 0.0
 0.0
 0.0

That said, only by writting s1.LL = [0.,0.,0.] (for example), you create the array [0.,0.,0.] which is a mutable object, and allocates memory. For this specific case you could use s1.LL .= (0,0,0), which creates a tuple on the right side, but being immutable, tuples do not allocate memory (the compiler can optimize them out, it knows that their values or length won’t change).

The StaticArrays version of that would be:

julia> using StaticArrays

julia> Base.@kwdef mutable struct S
         PP::SVector{3,Float64} = @SVector [0,0,0]
         LL::SVector{3,Float64} = @SVector [0,0,0]
       end
S

julia> s = S()
S([0.0, 0.0, 0.0], [0.0, 0.0, 0.0])

julia> s.PP = @SVector [1.,1.,1.];  # one way

julia> s.LL = SVector{3,Float64}(2,2,2);  # another way

julia> s
S([1.0, 1.0, 1.0], [2.0, 2.0, 2.0])

DNF · September 23, 2021, 7:29pm

Or just s1.LL .= 0

DNF · September 23, 2021, 7:33pm

To the OP: unless you are performing matrix algebra, you can normally assume that ordinary loops will give the best performance.

Also, remember that Julia Arrays are column major, so try to let your inner loop iterate over the first index, not the second, as shown in your MWE.

lmiq · September 23, 2021, 7:34pm

Of course

Amro · September 23, 2021, 7:51pm

@lmiq Thank you very much for your explanation.
Is the below correct for (3x3 Matrix{Float64, 2} )in the StaticArrays version?

Base.@kwdef mutable struct AAs
       Lss::SArray{Float64, 2} =@SArray [0 0 0; 0 0 0; 0 0 0]
       end

lmiq · September 23, 2021, 7:53pm

@SMatrix I think

stillyslalom · September 23, 2021, 7:59pm

Since you’re mutating your matrix, you’ll want to use MMatrix instead of SMatrix, with this type signature:

MMatrix{3, 3, Float64, 9}

With a mutable matrix, you don’t need the container struct to be mutable itself.

Amro · September 23, 2021, 8:22pm

@DNF thank you for your explanation.
What does “Julia Arrays are column major,” mean. Can you give me an example for " inner loop iterate over the first index, not the second"?

stillyslalom · September 23, 2021, 8:42pm

Arrays are stored in contiguous, linear blocks of memory:

julia> v = [1:6...]
6-element Vector{Int64}:
 1
 2
 3
 4
 5
 6

Column-major matrices are contiguous along the first dimension, while row-major matrices are contiguous along the second:

julia> reshape(v, 2, 3) # column-major
2×3 Matrix{Int64}:
 1  3  5
 2  4  6

julia> reshape(v, 2, 3)' # transposed to row-major
3×2 adjoint(::Matrix{Int64}) with eltype Int64:
 1  2
 3  4
 5  6

The CPU is faster when it can operate on contiguous blocks of memory, so it’s better for the inner loop (with the fastest-changing index) to correspond to those contiguous blocks. For column-major languages like Julia, that means w[1,1], w[2,1], w[3,1] rather than w[1,1], w[1,2], w[1,3]. For example,

julia> function alongcolumns(w)
           for j in axes(w, 2)
               for i in axes(w, 1)
                   w[i, j] += 1
               end
           end
       end
alongcolumns (generic function with 1 method)

julia> function alongrows(w)
           for i in axes(w, 2)
               for j in axes(w, 1)
                   w[i, j] += 1
               end
           end
       end
alongrows (generic function with 1 method)

julia> w = rand(100, 100);

julia> @btime alongcolumns($w)
  710.000 ns (0 allocations: 0 bytes)

julia> @btime alongrows($w)
  2.889 μs (0 allocations: 0 bytes)

That’s a 4x difference, just by switching the loop order!

Topic		Replies	Views
help understanding performance with assignment to allocated matrix Performance question	6	1687	January 4, 2018
Performance issue with Vectors vs. 1 Column Matrices Performance question , performance , benchmarktools	8	226	October 13, 2024
Speed up column looping in matrix Performance	6	756	July 26, 2021
Vector of indices vs. for loop to assign values in an array Performance question , arrays	4	1381	August 8, 2021
Why so large memory allocations? Performance memory-allocation	6	841	February 3, 2022

Why substituting "for loop" with direct assignment of Matrix worsen the performance time?

Related topics