Performance-wise, what is the best way to define (many) local arrays?

Hi everyone!

I am new to Julia and so far I have mostly experimented with basic things. However, I realised that the way arrays are defined can have a strong influence on performance.

Here is the situation: let's say I have a loop, and at each iteration I define a local array which may depend on global variables and then perform some elementary array operations. Since the loop is typically big, I would like to know the fastest way to define the local array. For the application I have in mind there are two nested loops, but I don't think that has an influence.

Reading the Julia docs, I realised that there are many possible ways to define the same array, using semicolons, commas, spaces, etc. I initially thought they were all equivalent, but it turns out they are not, and I do not fully understand why (I did not find anything about performance in the docs). Below is a minimal working example.

First I import some packages and define some global variables.

using StaticArrays
using BenchmarkTools

a = 1. 
b = 2. 
c = 3. 
d = 4. 
x = 5. 
y = 6. 
z = 7. 

N = 3000  # Loop size

Then, I tested seven ways to do the same thing.

  1. First the simplest one, using only spaces and tabs, would be
function standard_array()
    for i in 1:N
        for j in 1:N
            A = [a       b       0.5*i
                 c       d       0.
                 (x*y-z) (x*y+z) (j+1.)]  # The local array
            u = [b,c,z*y-x^2]             # Some local vector
            v = A*u                       # A simple operation
        end
    end
end
  2. The same function can be implemented using semicolons as follows.
function semicolon_concat_array()
    for i in 1:N
        for j in 1:N
            A = [[a     ;; b     ;; 0.5*i];
                 [c     ;; d     ;; 0.];
                 [x*y-z ;; x*y+z ;; j+1.]]
            u = [b,c,z*y-x^2]
            v = A*u
        end
    end
end
  3. Since the elements of the array have cumbersome expressions, I may prefer to define the columns first and then concatenate the result. Again, this can be done using spaces…
function column_simple_concat_array()
    for i in 1:N
        for j in 1:N
            e1 = [a, c, x*y-z]
            e2 = [b, d, x*y+z]
            e3 = [0.5*i, 0., j+1.]
            A = [e1 e2 e3]
            u = [b,c,z*y-x^2]
            v = A*u
        end
    end
end
  4. …or semicolons.
function column_semicolon_concat_array()
    for i in 1:N
        for j in 1:N
            e1 = [a, c, x*y-z]
            e2 = [b, d, x*y+z]
            e3 = [0.5*i, 0., j+1.]
            A = [e1 ;; e2 ;; e3]
            u = [b,c,z*y-x^2]
            v = A*u
        end
    end
end
  5. If the elements are very cumbersome, then maybe even an elementwise definition would be easier to handle.
function element_array()
    for i in 1:N
        for j in 1:N
            A = zeros(3,3)
            A[1,1] = a
            A[2,1] = c
            A[3,1] = x*y-z
            A[1,2] = b
            A[2,2] = d
            A[3,2] = x*y+z
            A[1,3] = 0.5*i
            A[2,3] = 0.
            A[3,3] = j+1.
            u = [b,c,z*y-x^2]
            v = A*u
        end
    end
end
  6. Finally, there are also two more exotic definitions: using static arrays…
function static_smatrix()
    for i in 1:N
        for j in 1:N
            A = @SMatrix [a       b       0.5*i
                          c       d       0.
                          (x*y-z) (x*y+z) (j+1.)]
            u = SVector(b,c,z*y-x^2)
            v = A*u
        end
    end
end
  7. …or tuples (with a by-hand multiplication):
function byhand_tuple()
    for i in 1:N
        for j in 1:N
            a11,a12,a13,a21,a22,a23,a31,a32,a33=a,b,0.5*i,c,d,0.,x*y-z,x*y+z,j+1.
            u1,u2,u3 = b,c,z*y-x^2
            v1 = a11*u1 + a12*u2 + a13*u3
            v2 = a21*u1 + a22*u2 + a23*u3
            v3 = a31*u1 + a32*u2 + a33*u3
            v = v1,v2,v3
        end
    end
end

I used @btime to benchmark these seven functions and here is what I obtained:

 [1] Standard array ("standard_array")
  7.252 s (140944979 allocations: 4.78 GiB)

 [2] Semicolon concatenation ("semicolon_concat_array")
  86.949 s (725944979 allocations: 23.42 GiB)

 [3] Column concatenation ("column_simple_concat_array")
  10.158 s (167944979 allocations: 6.39 GiB)

 [4] Column concatenation with semicolon ("column_semicolon_concat_array")
  30.007 s (266944979 allocations: 9.88 GiB)

 [5] Elementwise definition ("element_array")
  14.359 s (131944979 allocations: 4.11 GiB)

 [6] Static SMatrix ("static_smatrix")
  7.273 s (149944979 allocations: 3.84 GiB)

 [7] Tuples ("byhand_tuple")
  4.333 s (212944979 allocations: 3.31 GiB)

There are a few things I think I understand: the function [7] with tuples is the fastest, as I expected, since tuples are immutable. Also, [1] and [3] look similar, but [1] is faster than [3] because it has fewer allocations, fine. But I still do not get the following points.

  1. The definitions [2] and [4] involving semicolons are very bad, 7 to 20 times slower than the fastest definition [7]. Obviously there is something I do not understand with the semicolon operator.

  2. What slows down the elementwise definition [5]? I would expect it to be similar to [1] (with a comparable number of allocations), but it is twice as slow.

  3. Static arrays [6] do not seem to perform better than standard arrays [1]. I would expect them to be similar to tuples [7] (immutable objects and comparable numbers of allocations).

Sorry for the long post and thank you in advance for your help!

Using a static matrix, a tuple, or manually setting the elements by index should all be fine. However, your benchmarks are not very informative, because you are using non-const global variables (N, a, b, etc.). This is extremely detrimental to performance and will dominate your benchmark. Pass all these variables as parameters to your functions. Also, as far as I can see your test functions don't return anything from the computation, so the compiler may well apply optimizations that defeat the purpose of the test entirely.

Just an illustration:

julia> function static_smatrix(N,a,b,c,d,x,y,z)
           for i in 1:N
               for j in 1:N
                   A = @SMatrix [a       b       0.5*i
                                 c       d       0.
                                 (x*y-z) (x*y+z) (j+1.)]
                   u = SVector(b,c,z*y-x^2)
                   v = A*u
               end
           end
       end
static_smatrix (generic function with 2 methods)

julia> static_smatrix(1000,rand(7)...)

julia> @btime static_smatrix(1000,rand(7)...)
  192.284 ns (9 allocations: 240 bytes)

julia> @btime static_smatrix(10^6,rand(7)...)
  203.189 ns (10 allocations: 256 bytes)

julia> @btime static_smatrix(10^8,rand(7)...)
  200.542 ns (10 allocations: 256 bytes)
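
To also address the second issue (the functions not returning anything), here is a minimal sketch that accumulates and returns a result so the compiler cannot optimize the loop away (static_smatrix_sum is just an illustrative name; StaticArrays is assumed to be loaded as above):

function static_smatrix_sum(N, a, b, c, d, x, y, z)
    s = 0.0
    for i in 1:N, j in 1:N
        A = @SMatrix [a       b       0.5*i
                      c       d       0.
                      (x*y-z) (x*y+z) (j+1.)]
        u = SVector(b, c, z*y - x^2)
        s += sum(A * u)   # use the result so the compiler cannot drop the work
    end
    return s
end

Benchmarking this version, e.g. @btime static_smatrix_sum(1000, rand(7)...), forces the loop to actually run, so the timing should scale with N.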



Thanks for your super fast answer!

I did as you said, for instance redefining the first function as

function standard_array(a,b,c,d,x,y,z,N)
    v = []
    for i in 1:N
        for j in 1:N
            A = [a       b       0.5*i
                 c       d       0.
                 (x*y-z) (x*y+z) (j+1.)]
            u = [b,c,z*y-x^2]
            v = A*u
        end
    end
    return v
end

and similarly for the others. I obtain the following results:

 [1] Standard array
  4.076 s (108000001 allocations: 4.16 GiB)

 [2] Semicolon concatenation
  49.603 s (369000001 allocations: 13.14 GiB)

 [3] Column concatenation
  2.757 s (54000001 allocations: 4.43 GiB)

 [4] Column concatenation with semicolon
  12.596 s (108000001 allocations: 5.77 GiB)

 [5] Elementwise definition
  1.600 s (27000001 allocations: 2.41 GiB)

 [6] Static SMatrix
  802.793 ns (0 allocations: 0 bytes)

 [7] Tuples
  802.565 ns (0 allocations: 0 bytes)

Just as you said, the elementwise definition, using a static matrix, or using a tuple are all equivalent. This solves my second and third points. That leaves the first point: why do semicolons lead to such poor results? Also, I am a bit surprised that [3] is now slightly faster than [1]. In the end, maybe all these questions can be summarised as: “are there general rules for concatenating arrays/vectors in order to achieve the best performance?”

v = [] creates a Vector{Any}; give it an element type (e.g. v = Float64[]).
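
For instance, a quick REPL check shows the difference in element type:

julia> eltype([])
Any

julia> eltype(Float64[])
Float64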

Ok thanks for the tip. I did that and the results are similar.

[1] Standard array
  3.938 s (108000001 allocations: 4.16 GiB)

[2] Semicolon concatenation
  48.609 s (369000001 allocations: 13.14 GiB)

[3] Column concatenation
  2.782 s (54000001 allocations: 4.43 GiB)

[4] Column concatenation with semicolon
  12.749 s (108000001 allocations: 5.77 GiB)

[5] Elementwise definition
  1.620 s (27000001 allocations: 2.41 GiB)

[6] Static SMatrix
  802.554 ns (0 allocations: 0 bytes)

[7] Tuples
  802.301 ns (0 allocations: 0 bytes)

Could you post the current versions of those functions? You should by no means be seeing that many allocations per loop iteration in a type-stable function.

Here they are.

function standard_array(a,b,c,d,x,y,z,N)
    v = Float64[]
    for i in 1:N
        for j in 1:N
            A = [a       b       0.5*i
                 c       d       0.
                 (x*y-z) (x*y+z) (j+1.)]
            u = [b,c,z*y-x^2]
            v = A*u
        end
    end
    return v
end

function semicolon_concat_array(a,b,c,d,x,y,z,N)
    v = Float64[]
    for i in 1:N
        for j in 1:N
            A = [[a     ;; b     ;; 0.5*i];
                 [c     ;; d     ;; 0.];
                 [x*y-z ;; x*y+z ;; j+1.]]
            u = [b,c,z*y-x^2]
            v = A*u
        end
    end
    return v
end
function column_simple_concat_array(a,b,c,d,x,y,z,N)
    v = Float64[]
    for i in 1:N
        for j in 1:N
            e1 = [a, c, x*y-z]
            e2 = [b, d, x*y+z]
            e3 = [0.5*i, 0., j+1.]
            A = [e1 e2 e3]
            u = [b,c,z*y-x^2]
            v = A*u
        end
    end
    return v
end
function column_semicolon_concat_array(a,b,c,d,x,y,z,N)
    v = Float64[]
    for i in 1:N
        for j in 1:N
            e1 = [a, c, x*y-z]
            e2 = [b, d, x*y+z]
            e3 = [0.5*i, 0., j+1.]
            A = [e1 ;; e2 ;; e3]
            u = [b,c,z*y-x^2]
            v = A*u
        end
    end
    return v
end
function element_array(a,b,c,d,x,y,z,N)
    v = Float64[]
    for i in 1:N
        for j in 1:N
            A = zeros(3,3)
            A[1,1] = a
            A[2,1] = c
            A[3,1] = x*y-z
            A[1,2] = b
            A[2,2] = d
            A[3,2] = x*y+z
            A[1,3] = 0.5*i
            A[2,3] = 0.
            A[3,3] = j+1.
            u = [b,c,z*y-x^2]
            v = A*u
        end
    end
    return v
end
function static_smatrix(a,b,c,d,x,y,z,N)
    v = SVector(0.,0.,0.)
    for i in 1:N
        for j in 1:N
            A = @SMatrix [a       b       0.5*i
                          c       d       0.
                          (x*y-z) (x*y+z) (j+1.)]
            u = SVector(b,c,z*y-x^2)
            v = A*u
        end
    end
    return v
end
function byhand_tuple(a,b,c,d,x,y,z,N)
    v = ()
    for i in 1:N
        for j in 1:N
            a11,a12,a13,a21,a22,a23,a31,a32,a33=a,b,0.5*i,c,d,0.,x*y-z,x*y+z,j+1.
            u1,u2,u3 = b,c,z*y-x^2
            v1 = a11*u1 + a12*u2 + a13*u3
            v2 = a21*u1 + a22*u2 + a23*u3
            v3 = a31*u1 + a32*u2 + a33*u3
            v = v1,v2,v3
        end
    end
    return v
end

I call the first function with

@btime v1 = standard_array(1.,2.,3.,4.,5.,6.,7.,3000)

and similarly for the others.

Each of the X = [...] or X = zeros(3) constructions (matrices or vectors) allocates a new array; these are the allocations you see in the loops of the versions that do not use static arrays (or tuples).
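
You can see this directly with @allocated (in Base), which reports the bytes allocated by an expression; a small vector literal allocates on every evaluation, while an equivalent tuple does not (exact byte counts vary, so they are omitted here):

julia> @allocated [1.0, 2.0, 3.0]   # non-zero: a fresh Vector is heap-allocated each time

julia> @allocated (1.0, 2.0, 3.0)   # typically 0: an isbits tuple needs no heap allocation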

The other alternative (for larger arrays) is to preallocate them outside the loop and update the values in place, like this:

using LinearAlgebra  # needed for mul! below

function element_array(a,b,c,d,x,y,z,N)
    v = zeros(3) # allocate here
    u = zeros(3) # allocate here
    A = zeros(3,3) # allocate here
    for i in 1:N
        for j in 1:N
            A[1,1] = a
            A[2,1] = c
            A[3,1] = x*y-z
            A[1,2] = b
            A[2,2] = d
            A[3,2] = x*y+z
            A[1,3] = 0.5*i
            A[2,3] = 0.
            A[3,3] = j+1.
            u .= (b,c,z*y-x^2) # or element-by element, as with A
            LinearAlgebra.mul!(v,A,u) # in place matrix multiplication
        end
    end
    return v
end

(I didn't test this.) Something like this will not be better than the static-array alternatives if the arrays are small, but if you go to larger vectors or matrices, then it will be better.


Interesting! Thanks for the explanation.

However, for a dense matrix in higher dimensions an elementwise definition could become cumbersome, and concatenating rows or columns may be preferred. In that case, this brings me back to my original question about the semicolon operator ;.

What is bad about this function

function column_semicolon_concat_array(a,b,c,d,x,y,z,N)
    v = Float64[]
    for i in 1:N
        for j in 1:N
            e1 = [a, c, x*y-z]
            e2 = [b, d, x*y+z]
            e3 = [0.5*i, 0., j+1.]
            A = [e1 ;; e2 ;; e3]
            u = [b,c,z*y-x^2]
            v = A*u
        end
    end
    return v
end

and very bad about this one?

function semicolon_concat_array(a,b,c,d,x,y,z,N)
    v = Float64[]
    for i in 1:N
        for j in 1:N
            A = [[a     ;; b     ;; 0.5*i];
                 [c     ;; d     ;; 0.];
                 [x*y-z ;; x*y+z ;; j+1.]]
            u = [b,c,z*y-x^2]
            v = A*u
        end
    end
    return v
end

Is the drop in performance linked to the higher precedence of spaces and tabs over the semicolon operator (as mentioned in the documentation)?

You are vertically concatenating rows. Since Julia uses column-major ordering, that means moving a lot of data around in semicolon_concat_array.
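
For reference, Julia stores matrices column by column; a quick REPL check illustrates the memory order:

julia> A = [1 2; 3 4];

julia> vec(A)   # elements in memory (column-major) order
4-element Vector{Int64}:
 1
 3
 2
 4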


Here you’ve allocated three temporary Vector objects:

            e1 = [a, c, x*y-z]
            e2 = [b, d, x*y+z]
            e3 = [0.5*i, 0., j+1.]

Now, separately, you concatenate them into a further array, which also needs to be allocated and managed by the garbage collector:

            A = [e1 ;; e2 ;; e3]

All these memory allocations are probably what makes this code slow. Rather, it’s much better to use the 2D array syntax. Then the compiler can allocate only a single array and fill it once.

As you’ve seen, StaticArrays fixes this completely (for small arrays) by allowing the compiler to remove the GC-managed allocation altogether.

The precedence of the surface syntax can’t affect this, because parsing the syntax happens only a single time, and very early during compilation. Even the column major ordering shouldn’t really affect this for small arrays — that should only become very relevant for moderate to large size arrays which don’t fit into the processor’s cache.

However, it’s possible that the function inside Base for dealing with multidimensional concatenation is less efficient for some reason. Consider the difference in the way these syntaxes are actually passed to the functions inside Base:

julia> Meta.@lower [a b ; c d]
:($(Expr(:thunk, CodeInfo(
1 ─ %1 = Core.tuple(2, 2)
│   %2 = Base.hvcat(%1, a, b, c, d)           **********
└──      return %2
))))

julia> Meta.@lower [a; b ;; c ; d]
:($(Expr(:thunk, CodeInfo(
1 ─ %1 = Core.tuple(2, 2)
│   %2 = Base.hvncat(%1, false, a, b, c, d)   **********
└──      return %2
))))

Base.hvncat is a more general and very recent addition to the standard library, so it’s possible this isn’t quite as efficient as Base.hvcat.
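
If you want to compare the two code paths directly, you can benchmark the lowered calls themselves; a sketch (timings omitted here, since they will depend on your Julia version):

julia> using BenchmarkTools

julia> a, b, c, d = 1.0, 2.0, 3.0, 4.0;

julia> @btime Base.hvcat((2, 2), $a, $b, $c, $d);

julia> @btime Base.hvncat((2, 2), false, $a, $b, $c, $d);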


By the way, you can also write this with the more compact SA syntax:

            A = SA[a       b       0.5*i
                   c       d       0.
                   (x*y-z) (x*y+z) (j+1.)]

Thank you @Jeff_Emanuel and @c42f, you both answered my question. In summary, since Julia uses column-major ordering, concatenating rows is indeed slow, which explains why semicolon_concat_array is less efficient than column_semicolon_concat_array. The latter is also not very efficient because the double semicolon ;; calls the function hvncat, which is apparently less efficient than hvcat.

By the way, according to the docs, hvncat should be more efficient when using the “dim form”, i.e. when the concatenation is along just one dimension (which is the case here). I would have thought that the more optimized function hvcat would be called in this case, but it is apparently more subtle.

And thanks to the other contributors, I realised that in order to achieve better performance, I should avoid using global variables, be careful about types, avoid useless allocations and use static arrays when possible.

Thanks everyone!

Nobody has mentioned it yet, but there is Performance Tips · The Julia Language, which covers these and other aspects.


This would almost certainly be the biggest issue with the original code. Non-const globals really are a performance disaster in an inner loop. You can, however, try const globals for cases where the globals are constant and used in many different functions.

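Just to make that last point concrete, here is a minimal sketch of the const-global pattern (names are only illustrative):

const N = 3000    # a const global: the compiler knows its type (and value)
const a = 1.0

function loop_sum()
    s = 0.0
    for i in 1:N      # N and a are type-stable here, unlike non-const globals
        s += a * i
    end
    return s
end

With const bindings the compiler can specialise on the globals' types, so the usual type-instability penalty of globals disappears; passing the values as function arguments, as suggested above, works just as well.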