# Performance-wise, what is the best way to define (many) local arrays?

Hi everyone!

I am new to Julia and so far I have mostly experimented with basic things. However, I realised that the way arrays are defined may have a strong influence on performance.

Here is the situation: let's say that I have a loop and, at each iteration, I define a local array (which may depend on global variables) and then perform some elementary array operations. Since the loop is typically big, I would like to know the fastest way to define the local array. For the application I have in mind there are two nested loops, but I don't think that has an influence.

Reading the Julia doc, I realised that there are many possible ways to define the same array, using semicolons, commas, spaces, etc. I initially thought that they were all equivalent, but it turns out they are not, and I do not fully understand why (I did not find anything about performance in the doc). Below is a minimal working example.
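To make the question concrete, here is a toy sketch (my own example, not from the benchmarks below) of equivalent literal syntaxes; the `;;` form needs Julia ≥ 1.7:

```julia
# Three equivalent ways to write the same 2×2 matrix literal:
A1 = [1.0 2.0; 3.0 4.0]     # spaces separate columns, `;` separates rows
A2 = [1.0 2.0
      3.0 4.0]              # a newline also separates rows
A3 = [1.0; 3.0;; 2.0; 4.0]  # `;` stacks along dim 1 (rows), `;;` along dim 2 (columns)
A1 == A2 == A3              # true
```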

First I import some packages and define some global variables.

```julia
using StaticArrays
using BenchmarkTools

a = 1.
b = 2.
c = 3.
d = 4.
x = 5.
y = 6.
z = 7.

N = 3000  # Loop size
```

Then, I tested seven ways to do the same thing.

1. First, the simplest one, using only spaces and newlines, would be
```julia
function standard_array()
    for i in 1:N
        for j in 1:N
            A = [a       b       0.5*i
                 c       d       0.
                 (x*y-z) (x*y+z) (j+1.)]  # The local array
            u = [b, c, z*y-x^2]           # Some local vector
            v = A*u                       # A simple operation
        end
    end
end
```
2. The same function can be implemented using semicolons as follows.
```julia
function semicolon_concat_array()
    for i in 1:N
        for j in 1:N
            A = [[a     ;; b     ;; 0.5*i];
                 [c     ;; d     ;; 0.];
                 [x*y-z ;; x*y+z ;; j+1.]]
            u = [b, c, z*y-x^2]
            v = A*u
        end
    end
end
```
3. Since the elements of the array have cumbersome expressions, I may prefer to define the columns first and then concatenate the result. Again, it can be done using spaces…
```julia
function column_simple_concat_array()
    for i in 1:N
        for j in 1:N
            e1 = [a, c, x*y-z]
            e2 = [b, d, x*y+z]
            e3 = [0.5*i, 0., j+1.]
            A = [e1 e2 e3]
            u = [b, c, z*y-x^2]
            v = A*u
        end
    end
end
```
4. …or semicolons.
```julia
function column_semicolon_concat_array()
    for i in 1:N
        for j in 1:N
            e1 = [a, c, x*y-z]
            e2 = [b, d, x*y+z]
            e3 = [0.5*i, 0., j+1.]
            A = [e1 ;; e2 ;; e3]
            u = [b, c, z*y-x^2]
            v = A*u
        end
    end
end
```
5. If the elements are very cumbersome, then maybe even an elementwise definition would be easier to handle.
```julia
function element_array()
    for i in 1:N
        for j in 1:N
            A = zeros(3,3)
            A[1,1] = a
            A[2,1] = c
            A[3,1] = x*y-z
            A[1,2] = b
            A[2,2] = d
            A[3,2] = x*y+z
            A[1,3] = 0.5*i
            A[2,3] = 0.
            A[3,3] = j+1.
            u = [b, c, z*y-x^2]
            v = A*u
        end
    end
end
```
6. Finally, there are also two more exotic definitions: using static arrays…
```julia
function static_smatrix()
    for i in 1:N
        for j in 1:N
            A = @SMatrix [a       b       0.5*i
                          c       d       0.
                          (x*y-z) (x*y+z) (j+1.)]
            u = SVector(b, c, z*y-x^2)
            v = A*u
        end
    end
end
```
7. …or tuples (with a by-hand multiplication):
```julia
function byhand_tuple()
    for i in 1:N
        for j in 1:N
            a11,a12,a13,a21,a22,a23,a31,a32,a33 = a,b,0.5*i,c,d,0.,x*y-z,x*y+z,j+1.
            u1,u2,u3 = b, c, z*y-x^2
            v1 = a11*u1 + a12*u2 + a13*u3
            v2 = a21*u1 + a22*u2 + a23*u3
            v3 = a31*u1 + a32*u2 + a33*u3
            v = v1, v2, v3
        end
    end
end
```

I used `@btime` to benchmark these seven functions and here is what I obtained:

```
  Standard array ("standard_array")
7.252 s (140944979 allocations: 4.78 GiB)

 Semicolon concatenation ("semicolon_concat_array")
86.949 s (725944979 allocations: 23.42 GiB)

 Column concatenation ("column_simple_concat_array")
10.158 s (167944979 allocations: 6.39 GiB)

 Column concatenation with semicolon ("column_semicolon_concat_array")
30.007 s (266944979 allocations: 9.88 GiB)

 Elementwise definition ("element_array")
14.359 s (131944979 allocations: 4.11 GiB)

 Static SMatrix ("static_smatrix")
7.273 s (149944979 allocations: 3.84 GiB)

 Tuples ("byhand_tuple")
4.333 s (212944979 allocations: 3.31 GiB)
```

There are a few things I think I understand: the tuple version (`byhand_tuple`) is the fastest one, as I expected, since tuples are immutable. Also, `standard_array` and `column_simple_concat_array` look similar, but `standard_array` is faster because it has fewer allocations, OK. But I still do not get the following points.

1. The definitions involving semicolons (`semicolon_concat_array` and `column_semicolon_concat_array`) are very bad, 7 to 20 times slower than the fastest definition (`byhand_tuple`). Obviously there is something I do not understand about the semicolon operator.

2. What slows down the elementwise definition (`element_array`)? I would expect it to be similar to `standard_array` (with a comparable number of allocations), but it is 2 times slower.

3. Static arrays (`static_smatrix`) do not seem to perform better than standard arrays (`standard_array`). I would expect them to be similar to tuples (`byhand_tuple`), both being immutable objects with comparable numbers of allocations.

Sorry for the long post and thank you in advance for your help!

Using a static matrix, a tuple, or manually defining the indices should all be fine. However, your benchmarks are not very helpful, because you are using global variables (`N`, `b`, `a`, etc.). This is extremely detrimental to performance and will dominate your benchmark. Pass all these variables as parameters to your functions. Also, your test functions aren't returning anything from the computation as far as I can see, so the compiler may well perform optimizations that remove the point of the test completely.

Just an illustration:

```julia
julia> function static_smatrix(N,a,b,c,d,x,y,z)
           for i in 1:N
               for j in 1:N
                   A = @SMatrix [a       b       0.5*i
                                 c       d       0.
                                 (x*y-z) (x*y+z) (j+1.)]
                   u = SVector(b, c, z*y-x^2)
                   v = A*u
               end
           end
       end
static_smatrix (generic function with 2 methods)

julia> static_smatrix(1000,rand(7)...)

julia> @btime static_smatrix(1000,rand(7)...)
  192.284 ns (9 allocations: 240 bytes)

julia> @btime static_smatrix(10^6,rand(7)...)
  203.189 ns (10 allocations: 256 bytes)

julia> @btime static_smatrix(10^8,rand(7)...)
  200.542 ns (10 allocations: 256 bytes)
```

I did as you said, for instance redefining the first function as

```julia
function standard_array(a,b,c,d,x,y,z,N)
    v = []
    for i in 1:N
        for j in 1:N
            A = [a       b       0.5*i
                 c       d       0.
                 (x*y-z) (x*y+z) (j+1.)]
            u = [b, c, z*y-x^2]
            v = A*u
        end
    end
    return v
end
```

and similarly for the others. I obtain the following results:

```
  Standard array
4.076 s (108000001 allocations: 4.16 GiB)

 Semicolon concatenation
49.603 s (369000001 allocations: 13.14 GiB)

 Column concatenation
2.757 s (54000001 allocations: 4.43 GiB)

 Column concatenation with semicolon
12.596 s (108000001 allocations: 5.77 GiB)

 Elementwise definition
1.600 s (27000001 allocations: 2.41 GiB)

 Static SMatrix
802.793 ns (0 allocations: 0 bytes)

 Tuples
802.565 ns (0 allocations: 0 bytes)
```

Just as you said, the elementwise definition, the static matrix, and the tuple versions now all perform fine. This solves my second and third points. The first point remains: why do semicolons lead to such poor results? Also, I am a bit surprised that `column_simple_concat_array` is now slightly faster than `standard_array`. In the end, maybe all these questions could be summarised as "are there general rules for concatenating arrays/vectors in order to achieve the best performance?"

`v = []` creates a `Vector{Any}`; give it an element type instead (e.g. `v = Float64[]`).
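A quick sketch of the difference (toy example of mine, not from the thread's benchmarks):

```julia
v_any = []          # Vector{Any}: every element is boxed, type-unstable downstream
v_f64 = Float64[]   # Vector{Float64}: concrete element type known to the compiler
push!(v_any, 1.0)
push!(v_f64, 1.0)
(eltype(v_any), eltype(v_f64))  # (Any, Float64)
```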

Ok, thanks for the tip. I did that and the results are similar.

```
 Standard array
3.938 s (108000001 allocations: 4.16 GiB)

 Semicolon concatenation
48.609 s (369000001 allocations: 13.14 GiB)

 Column concatenation
2.782 s (54000001 allocations: 4.43 GiB)

 Column concatenation with semicolon
12.749 s (108000001 allocations: 5.77 GiB)

 Elementwise definition
1.620 s (27000001 allocations: 2.41 GiB)

 Static SMatrix
802.554 ns (0 allocations: 0 bytes)

 Tuples
802.301 ns (0 allocations: 0 bytes)
```

Could you post the current versions of those functions? You should by no means be seeing that many allocations per loop iteration in a type-stable function.

Here they are.

```julia
function standard_array(a,b,c,d,x,y,z,N)
    v = Float64[]
    for i in 1:N
        for j in 1:N
            A = [a       b       0.5*i
                 c       d       0.
                 (x*y-z) (x*y+z) (j+1.)]
            u = [b, c, z*y-x^2]
            v = A*u
        end
    end
    return v
end
```
```julia
function semicolon_concat_array(a,b,c,d,x,y,z,N)
    v = Float64[]
    for i in 1:N
        for j in 1:N
            A = [[a     ;; b     ;; 0.5*i];
                 [c     ;; d     ;; 0.];
                 [x*y-z ;; x*y+z ;; j+1.]]
            u = [b, c, z*y-x^2]
            v = A*u
        end
    end
    return v
end
```
```julia
function column_simple_concat_array(a,b,c,d,x,y,z,N)
    v = Float64[]
    for i in 1:N
        for j in 1:N
            e1 = [a, c, x*y-z]
            e2 = [b, d, x*y+z]
            e3 = [0.5*i, 0., j+1.]
            A = [e1 e2 e3]
            u = [b, c, z*y-x^2]
            v = A*u
        end
    end
    return v
end
```
```julia
function column_semicolon_concat_array(a,b,c,d,x,y,z,N)
    v = Float64[]
    for i in 1:N
        for j in 1:N
            e1 = [a, c, x*y-z]
            e2 = [b, d, x*y+z]
            e3 = [0.5*i, 0., j+1.]
            A = [e1 ;; e2 ;; e3]
            u = [b, c, z*y-x^2]
            v = A*u
        end
    end
    return v
end
```
```julia
function element_array(a,b,c,d,x,y,z,N)
    v = Float64[]
    for i in 1:N
        for j in 1:N
            A = zeros(3,3)
            A[1,1] = a
            A[2,1] = c
            A[3,1] = x*y-z
            A[1,2] = b
            A[2,2] = d
            A[3,2] = x*y+z
            A[1,3] = 0.5*i
            A[2,3] = 0.
            A[3,3] = j+1.
            u = [b, c, z*y-x^2]
            v = A*u
        end
    end
    return v
end
```
```julia
function static_smatrix(a,b,c,d,x,y,z,N)
    v = SVector(0., 0., 0.)
    for i in 1:N
        for j in 1:N
            A = @SMatrix [a       b       0.5*i
                          c       d       0.
                          (x*y-z) (x*y+z) (j+1.)]
            u = SVector(b, c, z*y-x^2)
            v = A*u
        end
    end
    return v
end
```
```julia
function byhand_tuple(a,b,c,d,x,y,z,N)
    v = ()
    for i in 1:N
        for j in 1:N
            a11,a12,a13,a21,a22,a23,a31,a32,a33 = a,b,0.5*i,c,d,0.,x*y-z,x*y+z,j+1.
            u1,u2,u3 = b, c, z*y-x^2
            v1 = a11*u1 + a12*u2 + a13*u3
            v2 = a21*u1 + a22*u2 + a23*u3
            v3 = a31*u1 + a32*u2 + a33*u3
            v = v1, v2, v3
        end
    end
    return v
end
```

I call the first function with

```julia
@btime v1 = standard_array(1.,2.,3.,4.,5.,6.,7.,3000)
```

and similarly for the others.

Each `X = [ ... ]` or `X = zeros(3)` expression (matrix or vector) allocates a new array; these are the allocations you see in your loops with non-static arrays (as opposed to static arrays or tuples).
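You can see this directly with `@allocated` (a rough sketch; exact byte counts vary across Julia versions):

```julia
make_vector() = [1.0, 2.0, 3.0]   # builds a fresh heap-allocated Vector each call
make_tuple()  = (1.0, 2.0, 3.0)   # a tuple of floats needs no heap allocation
# Warm up first, so compilation is not included in the measurement:
@allocated make_vector(); @allocated make_tuple()
@allocated(make_vector())   # > 0: each call allocates a new array
@allocated(make_tuple())    # 0: no garbage-collected allocation at all
```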

The other alternative (for larger arrays) is to preallocate them outside the loop and update the values, such as:

```julia
using LinearAlgebra  # for mul!

function element_array(a,b,c,d,x,y,z,N)
    v = zeros(3)   # allocate here
    u = zeros(3)   # allocate here
    A = zeros(3,3) # allocate here
    for i in 1:N
        for j in 1:N
            A[1,1] = a
            A[2,1] = c
            A[3,1] = x*y-z
            A[1,2] = b
            A[2,2] = d
            A[3,2] = x*y+z
            A[1,3] = 0.5*i
            A[2,3] = 0.
            A[3,3] = j+1.
            u .= (b, c, z*y-x^2)      # or element-by-element, as with A
            LinearAlgebra.mul!(v, A, u)  # in-place matrix multiplication
        end
    end
    return v
end
```

(I didn't test this.) Something like this will not be better than the static-array alternatives if the arrays are small, but if you move to larger vectors or matrices, then it will be better.


Interesting! Thanks for the explanation.

However, for a dense matrix in higher dimensions, an elementwise definition may be complicated, and concatenating rows or columns may be preferred. In that case, this brings me back to my original question about the semicolon operator `;`.

What is bad about this function

```julia
function column_semicolon_concat_array(a,b,c,d,x,y,z,N)
    v = Float64[]
    for i in 1:N
        for j in 1:N
            e1 = [a, c, x*y-z]
            e2 = [b, d, x*y+z]
            e3 = [0.5*i, 0., j+1.]
            A = [e1 ;; e2 ;; e3]
            u = [b, c, z*y-x^2]
            v = A*u
        end
    end
    return v
end
```

and very bad about this one?

```julia
function semicolon_concat_array(a,b,c,d,x,y,z,N)
    v = Float64[]
    for i in 1:N
        for j in 1:N
            A = [[a     ;; b     ;; 0.5*i];
                 [c     ;; d     ;; 0.];
                 [x*y-z ;; x*y+z ;; j+1.]]
            u = [b, c, z*y-x^2]
            v = A*u
        end
    end
    return v
end
```

Is the drop in performance linked to the higher precedence of spaces and tabs over the semicolon operator (as mentioned in the documentation)?

You are vertically concatenating rows. Column-major ordering means moving a lot of data around in `semicolon_concat_array`.
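A small sketch (my own toy example) of what column-major ordering means for concatenation:

```julia
A = [1 2; 3 4]
# Julia stores columns contiguously, so flattening walks down each column:
vec(A)                       # [1, 3, 2, 4]
# Concatenating columns appends whole contiguous blocks:
hcat([1, 3], [2, 4]) == A    # true
# Concatenating rows must interleave elements instead, moving more data around:
vcat([1 2], [3 4]) == A      # true, but built by shuffling row elements
```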


Here you've allocated three temporary `Vector` objects (`e1`, `e2`, and `e3`). Then, separately,

```julia
A = [e1 ;; e2 ;; e3]
```

concatenates them into a further array, which also needs to be allocated and managed by the garbage collector.

All these memory allocations are probably what makes this code slow. It's much better to use the 2D array syntax instead; then the compiler can allocate only a single array and fill it once.

As you’ve seen `StaticArrays` completely fixes this (for small arrays), by allowing the compiler to remove the GC-managed allocation completely.

The precedence of the surface syntax can't affect this, because parsing the syntax happens only once, very early during compilation. Even the column-major ordering shouldn't really matter for small arrays; that only becomes relevant for moderate-to-large arrays which don't fit into the processor's cache.

However, it's possible that the function inside `Base` for dealing with multidimensional concatenation is less efficient for some reason. Consider the difference in the way these syntaxes are actually passed to the functions inside `Base`:

```julia
julia> Meta.@lower [a b ; c d]
:($(Expr(:thunk, CodeInfo(
1 ─ %1 = Core.tuple(2, 2)
│   %2 = Base.hvcat(%1, a, b, c, d)           # <-- hvcat
└──      return %2
))))

julia> Meta.@lower [a; b ;; c ; d]
:($(Expr(:thunk, CodeInfo(
1 ─ %1 = Core.tuple(2, 2)
│   %2 = Base.hvncat(%1, false, a, b, c, d)   # <-- hvncat
└──      return %2
))))
```

`Base.hvncat` is a more general and very recent addition to the standard library, so it’s possible this isn’t quite as efficient as `Base.hvcat`.
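To check that the two builders really produce the same matrix, here is a small sketch (note the argument order differs: `hvcat` takes elements row-first, while the lowered `hvncat` call here takes them column-first; `hvncat` needs Julia ≥ 1.7):

```julia
a, b, c, d = 1.0, 2.0, 3.0, 4.0
M1 = Base.hvcat((2, 2), a, b, c, d)          # elements in row-major order
M2 = Base.hvncat((2, 2), false, a, c, b, d)  # row_first=false: column-major order
M1 == M2 == [a b; c d]                       # true
```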


By the way, you can also write this with the more compact `SA` syntax:

```julia
A = SA[a       b       0.5*i
       c       d       0.
       (x*y-z) (x*y+z) (j+1.)]
```

Thank you @Jeff_Emanuel and @Chris_Foster, you both answered my question. In summary, since Julia uses column-major ordering, concatenating rows is slow, which explains why `semicolon_concat_array` is less efficient than `column_semicolon_concat_array`. The latter is also not very efficient, because the double semicolon `;;` calls the function `hvncat`, which is apparently less efficient than `hvcat`.

By the way, according to the doc, `hvncat` should be more efficient when using the `dim` form, i.e. when the concatenation is along just one dimension (which is the case here). I would have thought that the more optimised function `hvcat` would be called in that case, but it is apparently more subtle.

And thanks to the other contributors: I realised that in order to achieve better performance, I should avoid global variables, be careful about types, avoid useless allocations, and use static arrays when possible.

Thanks everyone!

Nobody has mentioned it yet, but there is the [Performance Tips](https://docs.julialang.org/en/v1/manual/performance-tips/) section of the Julia manual, which covers these and other aspects.


This is almost certainly the biggest issue with the original code: non-`const` globals really are a performance disaster in an inner loop. You can use `const` globals, though, for cases where the globals are constant and used in many different functions.
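A minimal sketch of the `const` point (toy example):

```julia
x = 1.0        # non-const global: its type may change at any time,
f() = x + 1    # so `f` must look `x` up dynamically on every call

const y = 1.0  # const global: type (and value) are fixed,
g() = y + 1    # so the compiler can fully specialize `g`
f(), g()       # (2.0, 2.0) — same result, very different compiled code
```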
