Converting Vector to Tuple seems not optimal

Hi there,

I am working on improving the memory-allocation performance of my code, and while learning how to do so I came across the fact that tuples are much cheaper to store than vectors. For this reason I was trying to create all new temporary “arrays” that will not be modified inside the function as tuples. These new arrays are created as combinations of Vector{Float64} arguments, so I have to convert the combination into a tuple within the function. Nevertheless, I found that this conversion makes the performance worse. Please refer to the following simple example:

Approach one:

using LinearAlgebra, BenchmarkTools

vec1 = rand(3); vec2 = rand(3);

function test(vec1::Vector{Float64}, vec2::Vector{Float64})
    temp1 = vec1 .+ 0.5
    temp2 = vec2 .+ 0.8
    return dot(temp1, temp2)
end

@benchmark test($vec1, $vec2)

The above gives this:

Approach 2:

vec1 = rand(3); vec2 = rand(3);

function test(vec1::Vector{Float64}, vec2::Vector{Float64})
    temp1 = tuple(vec1 .+ 0.5...)
    temp2 = tuple(vec2 .+ 0.8...)
    return dot(temp1, temp2)
end

@benchmark test($vec1, $vec2)

This second approach gives:

Can someone explain why this is so costly, and is there any advice on how to avoid defining new temporary arrays as Vectors, so as to avoid the high memory estimate?

Thanks a lot in advance!

The reason that Tuples are so cheap to store is that they encode their length in their type, so the compiler knows exactly how much space to set aside for them:

julia> typeof((1,2,3,4))
NTuple{4, Int64}

When you splat a vector into a Tuple, the compiler can’t know ahead of time how big the Tuple will be. This is known as a “type instability”, and it’s a classic performance pitfall in Julia.
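One way to see this directly (a sketch; `splat_to_tuple` is just an illustrative name, not from the code above) is to ask inference what the splat returns:

```julia
# Splatting a Vector into a tuple: the result's length depends on a
# runtime value, so inference can only produce an abstract type.
splat_to_tuple(v) = (v...,)

# The inferred return type is Tuple{Vararg{Float64}} -- a tuple of
# Float64s of *unknown* length, which is not a concrete type.
Base.return_types(splat_to_tuple, (Vector{Float64},))
```

Contrast this with the `NTuple{4, Int64}` above, where the length 4 is part of the type.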

Not only do the Tuples need to be heap allocated because the compiler doesn’t know how to set aside space for them, but it also doesn’t know ahead of time which method specialization of dot to call when you reach dot(temp1, temp2), so it has to perform a “dynamic dispatch” to find the method at runtime (and compile one if it hasn’t already). This can be a real performance killer.

Here’s one way you can speed up your test function and avoid temporary allocations altogether:

function test3(vec1::Vector{Float64}, vec2::Vector{Float64})
    s = 0.0
    for i in eachindex(vec1, vec2)
        s += (vec1[i] + 0.5) * (vec2[i] + 0.8)
    end
    s
end
julia> let vec1 = rand(3), vec2 = rand(3)
           @btime test3($vec1, $vec2)
       end;
  3.897 ns (0 allocations: 0 bytes)

Alternatively, if you’re always working with vectors of length 3, you can use StaticArrays.jl, whose SVectors are basically tuples under the hood; just avoid converting from a regular Vector to an SVector during performance-sensitive parts of your program:

julia> using StaticArrays

julia> function test(vec1::SVector{N, Float64}, vec2::SVector{N, Float64}) where {N}
           temp1 = vec1 .+ 0.5
           temp2 = vec2 .+ 0.8
           return dot(temp1, temp2)
       end
test (generic function with 2 methods)

julia> let vec1 = rand(SVector{3, Float64}), vec2 = rand(SVector{3, Float64})
           @btime test($vec1, $vec2)
       end;
  2.594 ns (0 allocations: 0 bytes)

A much simpler rule of thumb: it is cheapest if you do not need to convert at all.

Not necessarily. If they had done e.g.

temp1 = (vec1[1], vec1[2], vec1[3]) .+ 0.5
temp2 = (vec2[1], vec2[2], vec2[3]) .+ 0.8

It’d also be very fast. The conversion itself is cheap; the expensive part is the type instability.
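If the length really is known at the call site, another way to spell that conversion (a sketch, assuming length 3; `test_ntuple` is an illustrative name) is `ntuple` with a `Val` length, which keeps the length in the type domain:

```julia
using LinearAlgebra

# Val(3) carries the length as a type parameter, so the compiler knows
# each result is an NTuple{3, Float64} and nothing is heap allocated.
function test_ntuple(vec1::Vector{Float64}, vec2::Vector{Float64})
    temp1 = ntuple(i -> vec1[i], Val(3)) .+ 0.5
    temp2 = ntuple(i -> vec2[i], Val(3)) .+ 0.8
    return dot(temp1, temp2)
end
```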


Thanks a lot for your answer, it really helps! However, I have a follow-up question. Consider:

function test(vec1::Vector{Float64}, vec2::Vector{Float64})
    temp1::NTuple{10, Float64} = ((vec1 .+ 2.0)..., )
    temp2::NTuple{10, Float64} = ((vec2 .+ 3.0)..., )
    return dot(temp1, temp2)
end

Why is this still type-unstable? Can the compiler not infer the type and size of temp1 and temp2 from my annotation?

Thanks a lot!

It does:

julia> @code_warntype test(zeros(10), ones(10))
...
6 ─ %22 = Base.convert(%17, @_7)::Tuple{Vararg{Any, _A}} where _A
└──       (@_7 = Core.typeassert(%22, %17))
7 ┄       (temp2 = @_7::NTuple{10, Float64})
│   %25 = Main.dot(temp1, temp2)::Float64
└──       return %25

What it can’t infer is the output of ((vec1 .+ 2.0)..., ) and ((vec2 .+ 3.0)..., ); those can be tuples of any length.

A type annotation on a variable doesn’t work like it does in statically typed languages, where it specifies a name, a type, and the instance’s memory all at once. Julia variables don’t hold memory for instances; they’re entirely separate. So when you annotate a variable with a type (on the left side of an assignment expression or in a local/global statement), what you’re really doing is forcing every assignment to call a convert of the right-hand instance to the annotated type, followed by a typeassert that the conversion worked. An error is thrown if either fails, so the compiler can assume that the variable and its new instance have that type in the following code.
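In other words, a rough sketch of what the annotated assignment expands to (`rhs` is just an illustrative intermediate name):

```julia
vec1 = rand(10)

# The annotated assignment
#     temp1::NTuple{10, Float64} = ((vec1 .+ 2.0)...,)
# behaves roughly like a convert followed by a typeassert:
rhs = ((vec1 .+ 2.0)...,)  # still inferred as Tuple{Vararg{Float64}} here
temp1 = convert(NTuple{10, Float64}, rhs)::NTuple{10, Float64}
```

The convert/typeassert pair pins down the type of `temp1` for the code that follows, but it does nothing to make the splat on the right-hand side inferable.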

If your inputs don’t have a fixed length, I wouldn’t bother converting them to something that does; it’d just end up wasting time and memory.


Thanks a lot for your answer! Sorry if this is silly, but what I still do not understand is: since “the compiler can assume that the variable and its new instance have that type in the following code”,

why can’t the compiler specialize on temp1 and temp2? I thought type instability arose when there was not enough information to assume a specific type, but it seems like here my annotation should do the job. What am I missing?

I don’t know what you mean; the concrete annotated types of the variables temp1 and temp2 are known by the compiler. That’s why Main.dot(temp1, temp2)::Float64 is inferred.

But my annotation gives more information than just Float64, so why is it not specializing on the tuple type I constrained my variables to be?

You’re looking in the wrong place: (temp2 = @_7::NTuple{10, Float64}) is your variable. The output of dot is a scalar Float64, as expected; temp1 and temp2 are its inputs. I omitted most of the report for brevity, but your variables’ inferred types are also neatly written in one place before the lowered code:

julia> @code_warntype test(zeros(10), ones(10))
MethodInstance for test(::Vector{Float64}, ::Vector{Float64})
  from test(vec1::Vector{Float64}, vec2::Vector{Float64}) @ Main REPL[2]:1
Arguments
  #self#::Core.Const(test)
  vec1::Vector{Float64}
  vec2::Vector{Float64}
Locals
  temp2::NTuple{10, Float64}
  temp1::NTuple{10, Float64}
  @_6::Tuple{Vararg{Float64}}
  @_7::Tuple{Vararg{Float64}}
Body::Float64
...

See, temp1 and temp2 are nicely inferred. The problematic @_6 and @_7 are the ((vec1 .+ 2.0)..., ) and ((vec2 .+ 3.0)..., ) expressions.


Thanks a lot! I think I understand now. My issue was that I thought all the performance loss came from the compiler not being able to specialize dot() on tuple arguments, but that is not the case, thanks to my annotation (according to your message, it enables the compiler to assume that the instances in the following code are of that type). Instead, the performance loss comes from the fact that the type annotation does not prevent the type instability in the lines:

    temp1::NTuple{10, Float64} = ((vec1 .+ 2.0)..., )
    temp2::NTuple{10, Float64} = ((vec2 .+ 3.0)..., )

That’s right. Dynamic dispatch is actually not very slow (though it’s not instant); it’s typically all the things happening before and after a dynamic dispatch that are so slow (boxing of memory, runtime code generation, etc.). Looking up the right method to apply to an unknown type is, by comparison, pretty quick.

Rather than using type annotations, you could just use an appropriate constructor here:

julia> function test4(vec1::Vector{Float64}, vec2::Vector{Float64})
           temp1 = NTuple{10, Float64}(vec1) .+ 2.0
           temp2 = NTuple{10, Float64}(vec2) .+ 3.0
           dot(temp1, temp2)
       end;

julia> let vec1 = rand(10), vec2 = rand(10)
           @btime test4($vec1, $vec2)
       end;
  14.656 ns (0 allocations: 0 bytes)

Notice I’ve done two things here:

  1. I’ve directly told it what length of tuple to construct by calling NTuple{10, Float64}(x)
  2. I’ve done that before the broadcasting expression. When you write ((vec1 .+ 2.0)...,), the vec1 .+ 2.0 allocates an extra temporary array before the conversion to a Tuple; you want to first convert the array to a Tuple, and then do the broadcasting.
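Both points can be seen side by side in this sketch (length 10 assumed, as above; `slow` and `fast` are illustrative names):

```julia
vec1 = rand(10)

# Broadcast first, convert second: `vec1 .+ 2.0` allocates a temporary
# Vector, and the splat produces a tuple of statically unknown length.
slow = ((vec1 .+ 2.0)...,)

# Convert first, broadcast second: the constructor fixes the length, and
# broadcasting over a tuple yields another NTuple -- no heap allocation.
fast = NTuple{10, Float64}(vec1) .+ 2.0
```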