@amrods, splitting here the question you posted.
1. Create a mutable array inside a loop, explicitly:
julia> function f(x)
z = 0.
for i in 1:10
y = [ i+1, i+2 ]
z += x[1]*y[1] + x[2]*y[2]
end
z
end
f (generic function with 1 method)
julia> @btime f(x) setup=(x=rand(2))
242.694 ns (10 allocations: 960 bytes)
110.46200068979707
As you can see, there are 10 allocations, which come from the line y = [i+1,i+2]
. This is something that possibly the compiler could figure out, and solve it, but currently it does not. That can be solved in several ways (for example not creating at all the y
vector and multiplying directly by the elements in the next line, but that would be artificial in a general case). The most clear way to solve that is use static arrays:
julia> using StaticArrays
julia> function f(x)
z = 0.
for i in 1:10
y = @SVector [ i+1, i+2 ]
z += x[1]*y[1] + x[2]*y[2]
end
z
end
f (generic function with 1 method)
julia> @btime f(x) setup=(x=rand(2))
5.526 ns (0 allocations: 0 bytes)
26.367283900704752
To understand what happened here, we need to understand that there is more than way to create a vector in the computer. The previous one created the vector in the “heap”, which is one type of memory. There the vector can be mutable, change size, etc, and has a pointer. When using static arrays, the vector will have a constant size and won’t have an address. It is created then in one of the faster temporary memories (stack, or processor cache). It is possible (particularly in these simple examples) that the compiler just eliminates the need of the vector completely by simplifying the code. It is more likely to get these optimizations if the vector is static as well.
This is one case where it is possible that the Julia compiler will, at some point, optimize the first loop as well in the same way, by figuring out that the y
vector is local to the loop and can be static.
2. Creating vectors unintentionally with broadcasting
Probably one of the most common mistakes that Julia allows people to do is to create intermediate vectors by broadcasting:
julia> using LinearAlgebra
julia> x = rand(2); y = rand(2);
julia> function f(x,y)
z = 0.
for i in 1:10
z += norm(x .* y)
end
z
end
f (generic function with 2 methods)
julia> @btime f(x,y) setup=(x=rand(2);y=rand(2))
333.809 ns (10 allocations: 960 bytes)
5.583855623549817
Again, there are those 10 allocations and a lousy perfomance. The problem here is that x .* y
, which is the element-by-element multiplication of x
and y
creates a new array (of course in this simple case we can compute x .* y
only once outside the loop). Now, one solution is to use static arrays again for x
and y
:
julia> @btime f(x,y) setup=(x=rand(SVector{2,Float64});y=rand(SVector{2,Float64}))
3.388 ns (0 allocations: 0 bytes)
9.108217887882278
That is optimal, but depends on the input types, thus it is on the hands of the user, and may be not generalizable. The function could, however, guarantee that at least only one intermediary vector is allocated:
julia> function f(x,y)
z = 0.
tmp = zeros(2)
for i in 1:10
@. tmp = x * y
z += norm(tmp)
end
z
end
f (generic function with 2 methods)
julia> @btime f(x,y) setup=(x=rand(2);y=rand(2))
147.463 ns (1 allocation: 96 bytes)
4.352897649964319
Note that there is ony one allocation, and that the time is 2x faster than the first code (still much slower than the static array version, but again, these are artificial examples to illustrate the allocations, in a real case it could be impossible to use static arrays). Here, the @. tmp = x * y
does the element-by-element product but updated vector tmp
instead of creating a new vector at every iteration of the loop.
But note that this function still allocates. If this function was to be called from another piece of code that does a loop, that single allocation would become a lot of allocations and would be problematic. Then you would like to have this function also allocation-free, and then it would be good to completely preallocate tmp
outside f
:
julia> function f(x,y,tmp)
z = 0.
for i in 1:10
@. tmp = x * y
z += norm(tmp)
end
z
end
f (generic function with 3 methods)
julia> @btime f(x,y,tmp) setup=(x=rand(2);y=rand(2);tmp=zeros(2))
125.657 ns (0 allocations: 0 bytes)
1.0594243698772656
3. Functions that return (small) arrays of undetermined size
One it come down to optimizing performance, one has to take care of using the static
approach as much as possible, such that mutable arrays are not created. Again, forgive me by the artificiality of the examples. Suppose that one wants a function that returns a vector containing half of the elements of the input vector:
julia> f(x) = [ x[i] for i in 1:div(length(x),2) ]
f (generic function with 1 method)
This, of course, will allocate a new vector, the output vector. If that is called from within a loop, that will be very bad:
julia> function loop(x)
z = 0.
for i in 1:10
z += sum(f(x))
end
z
end
loop (generic function with 1 method)
julia> @btime loop(x) setup=(x=rand(4))
302.872 ns (10 allocations: 960 bytes)
18.00302470686425
How to solve that? We need to give the compiler all the possible information. Thus 1) let us define x
as a static array, with a fixed size. 2) let us define f
in such a way that the output of vector length is known at compile time:
julia> f(x::SVector{N,T}) where {N,T} = SVector{div(N,2),T}(ntuple(i->x[i],div(N,2)))
f (generic function with 2 methods)
julia> @btime loop(x) setup=(x=rand(SVector{4,Float64}))
2.099 ns (0 allocations: 0 bytes)
6.298705903007199
Now things got more complex: f
receives a static vector with specified length N
and element type T
. It returns also a static vector, but which is created with size div(N,2)
(this is the integer division, returning an integer), same type T
and the elements of x
from the first half of the vector, collected in a tuple of compile-time known size. This is also known at compile time, because Julia specializes f
to the type of input variable, and here the type of input includes the length of x
.
4. Slices in Julia allocate
This is another very common issue new users encounter. The above example could have been written in a more simple way as:
f(x) = x[1:div(length(x),2)]
Thus, we want the slice of the vector x
with half of its elements. Slice, in Julia, allocate new arrays by default:
julia> @btime f(x) setup=(x=rand(4));
29.445 ns (1 allocation: 96 bytes)
That is solved using views:
julia> f(x) = @view(x[1:div(length(x),2)])
f (generic function with 1 method)
julia> @btime f(x) setup=(x=rand(4))
2.520 ns (0 allocations: 0 bytes)
2-element view(::Vector{Float64}, 1:2) with eltype Float64:
0.7250861856178934
0.4313586640733562
But, of course, this is different from the previous function, since in the first case, mutating an element of the output array will not mutate the original array, and in the second case it will.
If anyone has other/new/better examples, let us know
(edit: fixed benchmark calls)