Memory allocations when returning vectors

I am trying to understand the best strategy for initializing arrays in Julia. Consider this small example:

using BenchmarkTools

function f(theta)
    [cos(theta) * sin(theta), sin(theta), 1.0]
end

function g(theta)
    result = Array{Float64}(undef, 3)
    result[1] = cos(theta) * sin(theta)
    result[2] = sin(theta)
    result[3] = 1.0
    result
end

function runtest()
    bf = @benchmark f(0.1)
    bg = @benchmark g(0.1)

    println("f(x, y): $(bf.allocs) allocations, $(bf.memory) bytes")
    println("g(x, y): $(bg.allocs) allocations, $(bg.memory) bytes")
end

runtest()

This is the output (using Julia 0.7-alpha on a 64-bit Linux system):

f(theta): 3 allocations, 144 bytes
g(theta): 1 allocations, 112 bytes

I don’t understand why f uses more memory than g: to me, the two functions seem completely equivalent. I tried using @code_typed and @code_warntype to shed some light, but everything looks fine to me:

julia> @code_warntype f(0.1)
Body::Array{Float64,1}
4 1 ─ %1 = invoke Main.cos(%%theta::Float64)::Float64                                                                                  │
  │   %2 = invoke Main.sin(%%theta::Float64)::Float64                                                                                  │
  │   %3 = Base.mul_float(%1, %2)::Float64                                                                                             │╻ *
  │   %4 = invoke Main.sin(%%theta::Float64)::Float64                                                                                  │
  │   %5 = invoke Base.vect(%3::Float64, %4::Vararg{Float64,N} where N, 1.0)::Array{Float64,1}                                         │
  └──      return %5                                                                                                                   │

julia> @code_warntype g(0.1)
Body::Array{Float64,1}
8  1 ─ %1 = $(Expr(:foreigncall, :(:jl_alloc_array_1d), Array{Float64,1}, svec(Any, Int64), :(:ccall), 2, Array{Float64,1}, 3, 3))::Array{Float64,1}
9  │   %2 = invoke Main.cos(%%theta::Float64)::Float64                                                                        │
   │   %3 = invoke Main.sin(%%theta::Float64)::Float64                                                                        │
   │   %4 = Base.mul_float(%2, %3)::Float64                                                                                   │╻  *
   │        Base.arrayset(true, %1, %4, 1)                                                                                    │╻  setindex!
10 │   %6 = invoke Main.sin(%%theta::Float64)::Float64                                                                        │
   │        Base.arrayset(true, %1, %6, 2)                                                                                    │╻  setindex!
11 │        Base.arrayset(true, %1, 1.0, 3)                                                                                   │╻  setindex!
12 └──      return %1                                                                                                         │

Why is g slightly more efficient than f?


The additional allocations probably happen because the initial array in f isn’t explicitly typed; if you use something like

function f(theta)
    Float64[cos(theta) * sin(theta), sin(theta), 1.0]
end

instead, f and g perform the same (and it’s only fair to give the element type in f since you also specify it in g).
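
If I understand the lowering correctly, the untyped literal goes through Base.vect (which is also visible in the @code_warntype output above) and has to work out the element type from its arguments at run time, whereas the typed literal knows the element type up front. A quick way to check (assuming Meta.@lower, available in recent Julia; the exact lowered form may vary between versions):

# Untyped vs. typed array literals lower to different calls:
Meta.@lower [1.0, 2.0, 3.0]          # lowers to a call to Base.vect(1.0, 2.0, 3.0)
Meta.@lower Float64[1.0, 2.0, 3.0]   # lowers to Base.getindex(Float64, 1.0, 2.0, 3.0)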


For this kind of fixed-size return you may want to return a tuple, named tuple, or even a custom type with named fields. Arrays are relatively heavyweight, since they are a very general, flexible data structure.
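
For instance (a minimal sketch; f_tuple, f_struct, and FResult are made-up names, not from the code above):

# Tuple version: the result is an immutable value, so no heap allocation is needed.
f_tuple(theta) = (cos(theta) * sin(theta), sin(theta), 1.0)

# Custom type with named fields, if the components have individual meaning.
struct FResult
    cs::Float64
    s::Float64
    one::Float64
end
f_struct(theta) = FResult(cos(theta) * sin(theta), sin(theta), 1.0)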


But isn’t the type inferred? I’ve always heard that you don’t need to specify types all over the place when the compiler can figure them out. Is there some ambiguity?

Julia has no problem inferring the correct type for the array in f(), and both functions (without the type parameter in f()) take exactly the same time to run for me on v0.7, so I’m not convinced that the extra allocation is worth worrying about. And, as Stefan said, the right answer is to return a tuple or a struct or a StaticArrays.SVector.

Mmm, I’m not sure one should dismiss this so quickly. I agree about Tuple/SArray, but there might be some low-hanging optimisation possible here, perhaps? The @code_native result is much simpler for Float64[...] than for just [...]. I also would have assumed that IPO constant propagation would have made this unnecessary. (Thinking again, the “problem” probably has nothing to do with that.)

Nevertheless, it would be weird if an inferred but untyped array constructor were slower than an explicitly typed one. I would have expected more or less identical @code_native output.

edit: Sorry, I had a typo in my code. Looks roughly the same now.

begin
    f_infer(N) = [i for i = 1:N]
    f_explicit(N) = Int[i for i = 1:N]
    f_collect(N) = collect(1:N)
    f_collect2(N) = collect(Int, 1:N)
end;

using BenchmarkTools

begin
    N = 10
    @btime f_infer($N);
    @btime f_explicit($N);
    @btime f_collect($N);
    @btime f_collect2($N);
    versioninfo()
end

giving

  48.355 ns (1 allocation: 160 bytes)
  42.287 ns (1 allocation: 160 bytes)
  56.595 ns (2 allocations: 192 bytes)
  52.170 ns (1 allocation: 160 bytes)
Julia Version 0.6.2


  51.958 ns (1 allocation: 160 bytes)
  44.967 ns (1 allocation: 160 bytes)
  57.162 ns (1 allocation: 160 bytes)
  50.603 ns (1 allocation: 160 bytes)
Julia Version 0.7.0-DEV.5183


One more observation:

function f1(theta)
    [theta, theta, 1.0]
end

function f2(theta)
    [theta, 2 * theta, 1.0]
end

On v0.7-alpha.8

julia> @btime f1(0.1);
  30.795 ns (1 allocation: 112 bytes)

julia> @btime f2(0.1);
  34.528 ns (2 allocations: 128 bytes)

EDIT: Both become equally optimal with the Float64 annotation. Hence inference is working on the Vector type in f1, but not quite in f2.
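
For reference, the annotated version mentioned in the edit would be:

function f2(theta)
    Float64[theta, 2 * theta, 1.0]
end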

Many thanks for your responses! The script I posted at the beginning was written with the intent of understanding what’s going on in a more complex program I am developing. The following script is more representative of what I am actually doing, which involves sequential calculations on a huge set of vectors:

# File: test2.jl
using BenchmarkTools
using LinearAlgebra
using Printf

function f(theta)
    # Calculations here are made up, the true function `f` is more complex
    v = [cos(theta) * sin(theta), sin(theta), 0.0]
    w = [cos(theta) * sin(theta), 0.0, cos(theta)]

    dot(v, w) * v + cross(v, w)
end

function g(theta)
    v = Float64[cos(theta) * sin(theta), sin(theta), 0.0]
    w = Float64[cos(theta) * sin(theta), 0.0, cos(theta)]

    dot(v, w) * v + cross(v, w)
end

function runtest()
    bf = @benchmark f(0.1)
    bg = @benchmark g(0.1)

    @printf("f(x, y): %d allocations, %d bytes, minimum time is %.1f ns\n",
            bf.allocs, bf.memory, minimum(bf.times))
    @printf("g(x, y): %d allocations, %d bytes, minimum time is %.1f ns\n",
            bg.allocs, bg.memory, minimum(bg.times))
end

runtest()

The body of function f must initialize a number of vectors (just two in this example, but more in the real code I’m developing) and do some maths on them. Function g is exactly the same as f, except that the vectors used internally are initialized using the Float64[…] trick. The results show the same pattern as before:

$ julia test2.jl
f(theta): 12 allocations, 672 bytes, minimum time is 196.0 ns
g(theta): 8 allocations, 608 bytes, minimum time is 175.5 ns

Consider that the real-world analogue of function f must be run on a few billion vectors, and that I am trying to make the code as efficient as possible. @pfitzseb’s idea allows g to save ~10% of the execution time with respect to f, which makes me quite happy!

Do note that (similar to what @StefanKarpinski and @rdeits were saying), if you know the size of the vectors in advance, it can be quite advantageous to use the StaticArrays package (which provides AbstractArray subtypes backed by Tuples).

julia> using StaticArrays

julia> function g(theta)
           v = @SVector [cos(theta) * sin(theta), sin(theta), 0.0]
           w = @SVector [cos(theta) * sin(theta), 0.0, cos(theta)]

           dot(v, w) * v + cross(v, w)
       end
g (generic function with 1 method)

julia> using BenchmarkTools

julia> @btime g(x) setup = x = rand()
  45.897 ns (0 allocations: 0 bytes)

Not sure if this is applicable in your case, but if it is, using StaticArrays can result in big performance gains.
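
As a rough sketch of the bulk use case, assuming you simply map over the angles (thetas and results are made-up names):

# Broadcasting h over many inputs: each call is allocation-free, and the
# SVector{3,Float64} results are collected into a single Vector in one allocation.
thetas = rand(1_000_000)
results = h.(thetas)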


Thanks @tkoolen, I wasn’t aware of the StaticArrays package; using plain tuples as suggested by @StefanKarpinski would have prevented me from directly calling dot and cross.
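
(Had I stayed with plain tuples, I suppose I could have written small helpers by hand; these dot3/cross3 names are just for illustration:)

# Hand-rolled 3-component dot and cross products for plain tuples.
dot3(v, w) = v[1] * w[1] + v[2] * w[2] + v[3] * w[3]
cross3(v, w) = (v[2] * w[3] - v[3] * w[2],
                v[3] * w[1] - v[1] * w[3],
                v[1] * w[2] - v[2] * w[1])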

I re-implemented my test script, adding a new function h that uses StaticArrays. I see that the amount of memory allocated is smaller, but the execution time is consistently larger than for both f and g. Here is the script:

using BenchmarkTools
using LinearAlgebra
using StaticArrays
using Printf

function f(theta)
    v = [cos(theta) * sin(theta), sin(theta), 0.0]
    w = [cos(theta) * sin(theta), 0.0, cos(theta)]

    dot(v, w) * v + cross(v, w)
end

function g(theta)
    v = Float64[cos(theta) * sin(theta), sin(theta), 0.0]
    w = Float64[cos(theta) * sin(theta), 0.0, cos(theta)]

    dot(v, w) * v + cross(v, w)
end

function h(theta)
    v = @SVector [cos(theta) * sin(theta), sin(theta), 0.0]
    w = @SVector [cos(theta) * sin(theta), 0.0, cos(theta)]

    dot(v, w) * v + cross(v, w)
end

printstats(fn, bench) = @printf("%s(theta): %d allocations, %d bytes, minimum time is %.1f ns\n",
            fn, bench.allocs, bench.memory, minimum(bench.times))

function runtest()
    bf = @benchmark f(0.1)
    bg = @benchmark g(0.1)
    bh = @benchmark h(0.1)

    printstats("f", bf)
    printstats("g", bg)
    printstats("h", bh)
end

runtest()

And here are the results. As the numbers are quite close, I re-ran the script three times to check that the behavior was repeatable (it was).

$ for i in 1 2 3; do echo ""; julia test2.jl; done

f(theta): 12 allocations, 672 bytes, minimum time is 202.6 ns
g(theta): 8 allocations, 608 bytes, minimum time is 182.4 ns
h(theta): 5 allocations, 320 bytes, minimum time is 217.9 ns

f(theta): 12 allocations, 672 bytes, minimum time is 204.1 ns
g(theta): 8 allocations, 608 bytes, minimum time is 183.4 ns
h(theta): 5 allocations, 320 bytes, minimum time is 217.3 ns

f(theta): 12 allocations, 672 bytes, minimum time is 204.6 ns
g(theta): 8 allocations, 608 bytes, minimum time is 184.4 ns
h(theta): 5 allocations, 320 bytes, minimum time is 217.2 ns

So, on my machine it does not seem that @SVector helps reduce the computational time in this particular example, as it did in your case… Is it due to the fact that I am using Julia 0.7-alpha?

Yeah, something weird is happening on v0.7. Running in v0.6.3, I get:

julia> @benchmark h(0.5)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     33.308 ns (0.00% GC)
  median time:      40.710 ns (0.00% GC)
  mean time:        44.508 ns (0.00% GC)
  maximum time:     146.214 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     990

which is much faster and has no memory allocation at all. I suspect this is a performance issue with StaticArrays on v0.7, and I’m sure it’s something that can be resolved.

The slowness with StaticArrays on v0.7 is fixed with StaticArrays master. After doing pkg> checkout StaticArrays, I get:

julia> @benchmark f(0.1)
BenchmarkTools.Trial: 
  memory estimate:  672 bytes
  allocs estimate:  12
  --------------
  minimum time:     182.939 ns (0.00% GC)
  median time:      218.062 ns (0.00% GC)
  mean time:        267.174 ns (12.00% GC)
  maximum time:     65.264 μs (99.57% GC)
  --------------
  samples:          10000
  evals/sample:     656

julia> @benchmark g(0.1)
BenchmarkTools.Trial: 
  memory estimate:  608 bytes
  allocs estimate:  8
  --------------
  minimum time:     166.280 ns (0.00% GC)
  median time:      191.356 ns (0.00% GC)
  mean time:        241.184 ns (12.28% GC)
  maximum time:     59.829 μs (99.41% GC)
  --------------
  samples:          10000
  evals/sample:     714

julia> @benchmark h(0.1)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     27.013 ns (0.00% GC)
  median time:      31.963 ns (0.00% GC)
  mean time:        35.483 ns (0.00% GC)
  maximum time:     139.469 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     992


You can use the @btime macro instead of having to write your own printing function:

@btime f(0.1)
@btime g(0.1)
@btime h(0.1)

259.095 ns (12 allocations: 672 bytes)
228.756 ns (8 allocations: 608 bytes)
41.918 ns (0 allocations: 0 bytes)

@rdeits, thanks for your suggestion. I had installed StaticArrays by running pkg> add StaticArrays; after running pkg> checkout StaticArrays, the results changed dramatically:

$ julia test2.jl
f(theta): 12 allocations, 672 bytes, minimum time is 200.7 ns
g(theta): 8 allocations, 608 bytes, minimum time is 181.1 ns
h(theta): 0 allocations, 0 bytes, minimum time is 32.9 ns

Excellent! That’s what I’m seeing as well.

By the way, doing checkout will be unnecessary once a new version of StaticArrays is tagged.
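
At that point, if I remember the Pkg commands correctly, pkg> free StaticArrays should move you back from the checked-out copy to the registered release.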