# Loop vs vectorization

Hi all,
I have a partial differential equation to solve using finite difference. I am trying to understand if it is better to use a vectorized version or a loop version to calculate the right hand side of the equation, which has the spatial discretization. What I am struggling to understand is that for a given grid size (e.g. number of points at which the RHS is evaluated) the vector version performs more poorly than the loop version.
However if I increase the grid size the difference in performance between the two seems to become less.
In order to improve my understanding of whether Julia performs better with loop-based or vectorized code, I wrote the following:

```julia
using BenchmarkTools

function vect_sum(A, B)
    C .= A + B   # note: this C is the global C defined below
    return C
end

function loop_sum(A, B, C, m, n)
    for i = 1:m
        for j = 1:n
            C[i, j] = A[i, j] + B[i, j]
        end
    end
    return C
end

n = 1000
m = 1000
A = rand(m, n)
B = rand(m, n)
C = zeros(m, n);
@btime vect_sum(A, B);
C = zeros(m, n);
@btime loop_sum(A, B, C, m, n);
```

Here, despite the common wisdom that loops in Julia are fast, I am finding that the vectorized version of the code performs better than the loop version.
Does anybody have a hint as to why this is happening?
Thanks


They should look like this:
````
```
````
On an English keyboard that is often the top left key. On some keyboards you have to press it twice. Or copy-and-paste mine…

In theory, both should be equally fast. In Julia it shouldn't matter whether you write loops or numpy-style "vectorized" functions, because the loops are compiled. After that, they should do pretty much the same thing as the "vectorized" code, which at some point is also just a bunch of loops. Nevertheless, you can learn some things even from your simple use case - here are some remarks:

1. The benchmark is not fair, because `vect_sum` allocates a temporary array for `A + B` before copying it into `C`, which takes time. `loop_sum` only writes into the already allocated output `C` that you pass in. A good rule is to avoid allocations whenever you can, because this can make your code much faster, depending on the exact problem.
2. The reason why `loop_sum` is probably slower is that it performs bounds checks: it checks whether each index you use to access the array is within the allowed limits. This involves comparison operations that take time. You can tell the compiler that all array accesses are within bounds by using the `@inbounds` macro, either in front of an expression (`@inbounds C[i,j] = ...`) or in front of a loop or scope (`@inbounds for ...` or `@inbounds begin ...`). That can speed things up a lot, but if you get it wrong, you get a segmentation fault and your program crashes.
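As a sketch of this point (the function name `loop_sum_fast` is made up), the same loop with bounds checks disabled could look like:

```julia
# Minimal sketch: the original loop with bounds checks disabled via @inbounds.
# This is only safe because the index ranges 1:m, 1:n match the array sizes.
function loop_sum_fast(A, B, C, m, n)
    @inbounds for i = 1:m
        for j = 1:n
            C[i, j] = A[i, j] + B[i, j]
        end
    end
    return C
end
```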
3. Here, A and B have the same size. Therefore you can simply index into them with a linear index like so:
```julia
for i in eachindex(A)
    C[i] = A[i] + B[i]
end
```
4. This has the benefit of traversing your computer's memory in a linear fashion, which speeds things up even more.
5. Speaking of memory access order: Julia uses a column-major memory layout. So when you perform linear indexing as just described, you traverse the first column of your matrix from top to bottom, row by row, followed by the second column, and so on. The rows of the matrix are indexed by the first index in the brackets (here `i`) and the columns by the second. So to achieve linear memory access with nested loops, you either have to switch the order of the loops or swap `i` and `j` in the brackets. A good rule is that the index of the innermost loop comes first in the array access, followed by the outer ones, like so:
```julia
for i in ...
    for j in ...
        for k in ...
            C[k, j, i] = ...
        end
    end
end
```
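To see the column-major layout directly, `vec` returns the elements of a matrix in memory order:

```julia
M = [1 3;
     2 4]
vec(M)  # -> [1, 2, 3, 4]: first column (1, 2), then second column (3, 4)
```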
6. You don't have to explicitly tell `loop_sum` how big your arrays are by passing `m` and `n`. This is not C (: You can do something like `for i in 1:size(A,1)` to traverse the first dimension of A, but it is much better to use functions such as `eachindex`, `axes` or `CartesianIndices`, as they can make your life easier and safer once you understand how to use them. I highly recommend reading their docs.
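For illustration, here is a small sketch of the three access patterns just mentioned (the function name is made up; each loop simply sums the elements):

```julia
# Three safe ways to traverse an array without passing its size around.
function sum_three_ways(A)
    s1 = 0.0
    for i in eachindex(A)                  # linear indices, in memory order
        s1 += A[i]
    end
    s2 = 0.0
    for j in axes(A, 2), i in axes(A, 1)   # column-major nested loops
        s2 += A[i, j]
    end
    s3 = 0.0
    for I in CartesianIndices(A)           # works for any dimensionality
        s3 += A[I]
    end
    return (s1, s2, s3)
end
```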
7. You already use the `BenchmarkTools` package, which is good. It is good practice to "interpolate" the variables using `$`, so that they are treated as local values during benchmarking; this keeps your results clean from overhead that accessing non-constant globals could otherwise introduce. So write `@btime vect_sum($A, $B)` - however, here it shouldn't make much of a difference.
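A small sketch of the difference (the exact timings will vary by machine):

```julia
using BenchmarkTools

A = rand(1000)
@btime sum(A)    # A is a non-const global: extra overhead may appear
@btime sum($A)   # interpolated: A is treated as a local variable
```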

Try to implement these hints and then see how it turns out - I'm curious

Edit: `vect_sum` already uses the right array access order and in-bounds accesses, which is probably why it is faster than your loop version here; in your PDE code, apparently, this does not outweigh the cost of the allocations, which is why it is slower there.


As said above, the reason is absolutely the order in which you are running the loops:

```julia
julia> @btime vect_sum(A,B);
  1.799 ms (3 allocations: 7.63 MiB)

julia> C = zeros(m,n);

julia> @btime loop_sum(A,B,C,m,n);
  13.313 ms (0 allocations: 0 bytes)

julia> function loop_sum2(A,B,C,m,n)
           for j = 1:n
               for i = 1:m
                   C[i,j] = A[i,j] + B[i,j]
               end
           end
           return C
       end
loop_sum2 (generic function with 1 method)

julia> @btime loop_sum2(A,B,C,m,n);
  1.160 ms (0 allocations: 0 bytes)
```

But also note more Julian implementations could be:

```julia
julia> function loop3!(C,A,B)
           for i in eachindex(C,A,B)
               C[i] = A[i] + B[i]
           end
           return C
       end
loop3! (generic function with 1 method)

julia> @btime loop3!($C,$A,$B);
  1.172 ms (0 allocations: 0 bytes)

julia> function loop4!(C,A,B)
           for j in axes(C,2), i in axes(C,1)
               C[i,j] = A[i,j] + B[i,j]
           end
           return C
       end
loop4! (generic function with 1 method)

julia> @btime loop4!($C,$A,$B);
  1.185 ms (0 allocations: 0 bytes)
```

But also note that your vectorized version is not exactly comparable, because there you allocate an intermediate array for the result of `A + B`. You get the same performance with:

```julia
julia> function vect_sum!(C,A,B)
           C .= A .+ B
           return C
       end
vect_sum! (generic function with 1 method)

julia> @btime vect_sum!($C,$A,$B);
  1.192 ms (0 allocations: 0 bytes)
```
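As an aside, the same in-place, allocation-free broadcast can be written with the `@.` macro, which adds the dots to every operation for you (the function name `vect_sum2!` is made up):

```julia
# Equivalent to C .= A .+ B: @. turns every call in the expression
# into its dotted (broadcast) form, so the whole thing fuses into one loop.
vect_sum2!(C, A, B) = (@. C = A + B)
```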

For finite differences, I would tend to prefer loops (properly written, as noted by @lmiq and others above). Loops just give you more flexibility for this kind of thing (as well as excellent performance).

For example, see this post for example code of a scalar-wave-equation finite-difference routine written with loops that, combined with Julia's `CartesianIndex` facilities, supports arbitrary dimensionality with a single loop: Seemingly unnecessary allocation within for loop - #8 by stevengj
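As a concrete illustration of that idea (the function name and stencil details here are my own sketch, not taken from the linked post), a dimension-agnostic second-difference Laplacian over the interior points might look like:

```julia
# Hypothetical sketch: unscaled discrete Laplacian of u on interior points,
# for an array of any dimensionality, using CartesianIndices.
function laplacian!(out, u)
    R  = CartesianIndices(u)
    I1 = oneunit(first(R))                    # CartesianIndex of all ones
    for I in first(R)+I1 : last(R)-I1         # skip the boundary layer
        s = -2 * ndims(u) * u[I]
        for d in 1:ndims(u)
            # unit offset along dimension d
            e = CartesianIndex(ntuple(k -> k == d ? 1 : 0, ndims(u)))
            s += u[I+e] + u[I-e]
        end
        out[I] = s
    end
    return out
end
```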