Loop vs vectorization

Hi all,
I have a partial differential equation to solve using finite difference. I am trying to understand if it is better to use a vectorized version or a loop version to calculate the right hand side of the equation, which has the spatial discretization. What I am struggling to understand is that for a given grid size (e.g. number of points at which the RHS is evaluated) the vector version performs more poorly than the loop version.
However if I increase the grid size the difference in performance between the two seems to become less.
In order to improve my understanding if Julia performs better with loop or vectorized code I wrote the following code

function vect_sum(A,B)                                                                                                                                                                                                                                         
    C .= A + B;                                                                                                                                                                                                                                                
    return C                                                                                                                                                                                                                                                   
end                                                                                                                                                                                                                                                            
                                                                                                                                                                                                                                                               
function loop_sum(A,B,C,m,n)                                                                                                                                                                                                                                   
    for i=1:m                                                                                                                                                                                                                                                  
        for j=1:n                                                                                                                                                                                                                                              
            C[i,j] = A[i,j] + B[i,j]                                                                                                                                                                                                                           
        end                                                                                                                                                                                                                                                    
    end                                                                                                                                                                                                                                                        
    return C                                                                                                                                                                                                                                                   
end                                                                                                                                                                                                                                                            
n= 1000                                                                                                                                                                                                                                                        
m = 1000                                                                                                                                                                                                                                                       
A = rand(m,n)                                                                                                                                                                                                                                                  
B = rand(m,n)                                                                                                                                                                                                                                                  
C= zeros(m,n);                                                                                                                                                                                                                                                 
@btime vect_sum(A,B);                                                                                                                                                                                                                                          
C= zeros(m,n);                                                                                                                                                                                                                                                 
@btime loop_sum(A,B,C,m,n);            

Here, despite the common sense that loop in Julia are faster, I am getting that the vector version of the code perform better than the loop.
Does anybody have an hint on why this is happening?
Thanks

1 Like

Can you try to fix your backticks that surround your code?
They should look like this
```
On an English keyboard that is often the top left key. On some keyboards you have to press it twice.
Or copy-and-paste mine…

In theory, both should be equally fast. In julia it shouldn’t matter whether you write loops or numpy-style “vectorized” functions, because the loops are compiled. After that, they should do pretty much the same thing as the “vectorized” code, which at some point is also just a bunch of loops. Nevertheless, you can learn some things even on the your simple use case - here are some remarks:

  1. The benchmark is not fair, because vect_sum allocates the output vector C, which takes time. loop_sum only writes to the already allocated output C that you pass in. A good rule is to avoid allocations whenever you can, because this can make your code much faster, depending on the exact problem.
  2. The reason why loop_sum probably is slower is because it performs bounds checks. So it checks whether the index you use to access the array is within allowed limits. This involves comparison operations that take time. You can tell the compiler that all array accesses are within bound by using the @inbounds macro either in front of an expression (@inbounds C[i,j] = ...) or in front of a loop / scope (@inbounds for ... or @inbounds begin ...). That speeds things up a lot, but if you get things wrong, you get a segmentation fault and your program crashes.
  3. Here, A and B have the same size. Therefore you can simply index into them with a linear index like so:
for i in eachindex(A)
  C[i] = A[i] + B[i]
end
  1. This has the benefit of traversing your computer’s memory in a linear fashion, which speeds up things even more.
  2. Speaking about memory access order: julia uses a column major memory layout. So when you perform linear indexing as just described, you traverse the first column of your matrix from top to bottom, row by row, followed by the second column and so on. The rows in the matrix are indexed by the first index in the brackets (here i) and the columns by the second. So to achieve linear memory access with nested loops, you either have to switch the order of the loops or the i and j in the brackets. A good rule is that the index of the innermost loop comes first in the array access, followed by the outer ones, like so:
for i ...
  for j ...
    for k ...
      C[k,j,i] = ...
  1. You don’t have to explicitly tell loop_sum how big your arrays are by passing m and n. This is not C (: So you can either do something like for i in 1:size(A,1) ... to traverse the first dimension of A. But it is much better to use other functions, such as eachindex, axes or CartesianIndices, as they can make your life easier and safer once you understood how to use them. So I highly recommend reading their docs.
  2. You already used the BenchmarkTools package, which is good. It is good practice to “interpolate” the variables using $ to “pre-compute” them before benchmarking, which keeps your results clean from allocations that otherwise could occur. So write @btime vect_sum($A, $B) - however, here it shouldn’t make much of a difference.

Try to implement these hints and then see how it turns out - I’m curious :slight_smile:

Edit: vect_sum already implements the right array access order and inbounds accesses, which is probably why it is faster but seemingly this does not outweigh the allocation, which is why it’s slower.

5 Likes

As said above, the reason is absolutely the order your are running the loops:

julia> @btime vect_sum(A,B);
  1.799 ms (3 allocations: 7.63 MiB)

julia> C= zeros(m,n);

julia> @btime loop_sum(A,B,C,m,n);
  13.313 ms (0 allocations: 0 bytes)

julia> function loop_sum2(A,B,C,m,n)                                                                                                                                                                                                                                   
           for j=1:n                                                                                                                                                                                                                                                  
               for i=1:m                                                                                                                                                                                                                                              
                   C[i,j] = A[i,j] + B[i,j]                                                                                                                                                                                                                           
               end                                                                                                                                                                                                                                                    
           end                                                                                                                                                                                                                                                        
           return C                                                                                                                                                                                                                                                   
       end
loop_sum2 (generic function with 1 method)

julia> @btime loop_sum2(A,B,C,m,n);
  1.160 ms (0 allocations: 0 bytes)

But also note more Julian implementations could be:

julia> function loop3!(C,A,B)
           for i in eachindex(C,A,B)
               C[i] = A[i] + B[i]
           end
           return C
       end
loop3! (generic function with 1 method)

julia> @btime loop3!($C,$A,$B);
  1.172 ms (0 allocations: 0 bytes)

julia> function loop4!(C,A,B)
           for j in axes(C,2), i in axes(C,1)
               C[i,j] = A[i,j] + B[i,j]
           end
           return C
       end
loop4! (generic function with 1 method)

julia> @btime loop4!($C,$A,$B);
  1.185 ms (0 allocations: 0 bytes)

But also note that your vectorized version is not exactly comparable, because there you were allocating some intermediates (and the result). You get the same performance with:

julia> function vect_sum!(C,A,B)
           C .= A .+ B
           return C
       end
vect_sum! (generic function with 1 method)

julia> @btime vect_sum!($C,$A,$B);
  1.192 ms (0 allocations: 0 bytes)
5 Likes

For finite differences, I would tend to prefer loops (properly written, as noted by @lmiq and others above). Loops just give you more flexibility for this kind of thing (as well as excellent performance).

For example, see this post for example code of writing a scalar-wave equation finite-difference routine using loops that, combined with Julia’s CartesianIndex facilities, support arbitrary dimensionality with a single loop: Seemingly unnecessary allocation within for loop - #8 by stevengj