Vectorised vs Devectorised Performance

Hi All,
I was reading the interesting post Relationship between vectorised and unvectorised code.
The following two snippets are compared

function vectorized()
    a = [1.0, 1.0]
    b = [2.0, 2.0]
    x = [NaN, NaN]

    for i in 1:1000000
        x = a + b
    end

    return
end

and

function devectorized()
    a = [1.0, 1.0]
    b = [2.0, 2.0]
    x = [NaN, NaN]

    for i in 1:1000000
        for index in 1:2
            x[index] = a[index] + b[index]
        end
    end

    return
end

for the vectorised and devectorised cases respectively. So far so good.
Now, the statement I am not sure I fully get is:

it was possible to entirely avoid allocating any memory for storing changes to x. The devectorized code for the first set of snippets explicitly made clear to the compiler that no memory needed to be allocated. The vectorized code did not make this clear.

So I get that in the vectorised case a new allocation occurs on each iteration? Why would that happen at all?
And why is it made explicitly clear that this is not the case in the second, devectorised, snippet? Is it because a specific index

x[index] = a[index] + b[index]

is mentioned?
Thanks for clearing up my confusion.

x = a + b

creates a new array from the result of a + b, and assigns the label x to it. This new x has nothing to do with the x = [...] that was allocated previously, before the loop.
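
A quick way to see this (a minimal sketch, with illustrative variable names) is to compare object identities with ===:

a = [1.0, 1.0]; b = [2.0, 2.0]
x = [NaN, NaN]
x0 = x           # x0 labels the same array as x
x = a + b        # a + b allocates a NEW array; x is rebound to it
x === x0         # false: x now labels a different array; x0 still holds [NaN, NaN]
x0 .= a .+ b     # by contrast, .= writes into the existing array, allocating nothing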

If one wants to explicitly mutate the previous x, then one could use the “devectorized” alternative, or:

x .= a .+ b

(note the dots). That is syntactic sugar for

for index in eachindex(x,a,b)
    @inbounds x[index] = a[index] + b[index]
end

(which is the same as the “devectorized” version, except that eachindex will error if x, a, and b do not have the same length; that check is what makes @inbounds safe here, and the dot syntax makes the same assumption).
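
You can check that the broadcasted assignment does not allocate with Base.@allocated (a quick sketch; add! is just an illustrative name):

add!(x, a, b) = (x .= a .+ b; nothing)   # in-place broadcast wrapped in a function

x = [NaN, NaN]; a = [1.0, 1.0]; b = [2.0, 2.0]
add!(x, a, b)                # the first call compiles, which itself allocates
@allocated add!(x, a, b)     # later calls should report 0 bytes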

This is probably a good read: Assignment and mutation · JuliaNotes.jl


Got it, thank you!

Note that the performance analysis in the original link does not apply to Julia.

In R the vectorized code is much faster because loops are slow there. In Julia the loop is fast, just as fast as the broadcasted x .= a .+ b version.

In Julia you will pay a huge price if you allocate a new x at every iteration, which is what happens in the “vectorized” version of the original post.

Finally, those allocations are characteristic of mutable arrays. If the arrays were immutable, the original “vectorized” code would be fine, that is:

using StaticArrays
function vectorized()
    a = SVector(1.0, 1.0)
    b = SVector(2.0, 2.0)
    x = SVector(NaN, NaN)   # declare x in the function scope so `return x` works
    for i in 1:1000000
        x = a + b
    end
    return x
end

A new x is created at every iteration of the loop, but on the stack rather than on the heap, so it is fast. This can be used for small arrays. See also: Immutable variables · JuliaNotes.jl.
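
The key property is that an SVector is an immutable isbits value (a small illustrative check):

using StaticArrays
v = SVector(1.0, 2.0)
isbitstype(typeof(v))   # true: the value can live on the stack or in registers
# v[1] = 3.0            # would error: SVector is immutable; "changing" v means binding a new value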

ps: I see that the post continues with an analysis of Julia versions. Still, the post is confusing, in my opinion, and there is a general problem with the comparison: none of the functions returns anything (they all return nothing), so the compiler may optimize the work away, and the timings are prone to benchmarking artifacts. It is better to benchmark functions that actually return something that can be compared.

This is a better comparison:

julia> using BenchmarkTools

julia> function v1()
           a = [1.0, 1.0]
           b = [2.0, 2.0]
           x = [0.0, 0.0]
           for i in 1:10^6
               x = x + a + b
           end
           return x
       end
v1 (generic function with 1 method)

julia> @btime v1()
  35.474 ms (1000003 allocations: 76.29 MiB)
2-element Vector{Float64}:
 3.0e6
 3.0e6

julia> function v2()
           a = [1.0, 1.0]
           b = [2.0, 2.0]
           x = [0.0, 0.0]
           for i in 1:10^6
               x .= x .+ a .+ b
           end
           return x
       end
v2 (generic function with 1 method)

julia> @btime v2()
  5.339 ms (3 allocations: 240 bytes)
2-element Vector{Float64}:
 3.0e6
 3.0e6

julia> function v3()
           a = [1.0, 1.0]
           b = [2.0, 2.0]
           x = [0.0, 0.0]
           for i in 1:10^6
               for index in eachindex(x,a,b)
                   @inbounds x[index] = x[index] + a[index] + b[index] 
               end
           end
           return x
       end
v3 (generic function with 1 method)

julia> @btime v3()
  3.258 ms (3 allocations: 240 bytes)
2-element Vector{Float64}:
 3.0e6
 3.0e6

julia> using StaticArrays

julia> function v4()
           a = SVector(1.0, 1.0)
           b = SVector(2.0, 2.0)
           x = SVector(0.0, 0.0)
           for i in 1:10^6
               x = x + a + b
           end
           return x
       end
v4 (generic function with 1 method)

julia> @btime v4()
  2.005 ms (0 allocations: 0 bytes)
2-element SVector{2, Float64} with indices SOneTo(2):
 3.0e6
 3.0e6

It’s probably worth noting that the post was written several years before Julia had dot syntax and potentially before Julia had immutable structs.


@Imiq, oh that is very useful, thank you very much.
@johnmyleswhite, thanks, much appreciated.


Playing with these methods on my machine, I get:

julia> @btime v1();
  23.927 ms (1000003 allocations: 76.29 MiB)

julia> @btime v2();
  2.574 ms (3 allocations: 240 bytes)

julia> @btime v3();
  2.574 ms (3 allocations: 240 bytes)

julia> @btime v4();
  1.028 ms (0 allocations: 0 bytes)

Unlike in your test, v2 and v3 ran at exactly the same speed. Also, those static arrays sure make it snappy! :sweat_smile:

In addition, I tried a method with manual indexing:

julia> function v5()
           a = [1.0, 1.0]
           b = [2.0, 2.0]
           x = [0.0, 0.0]
           for i = 1:1_000_000, j = 1:2
               x[j] += a[j] + b[j]
           end
           x
       end

and its performance was a smidge better than using eachindex:

julia> @btime v5()
  2.058 ms (3 allocations: 240 bytes)

Okay, now let’s play a bit. Slight variations on the definition of v2. The results are a bit surprising (or not):

julia> function v6()
           a = [1.0, 1.0]
           b = [2.0, 2.0]
           x = [0.0, 0.0]
           for i = 1:1_000_000
               x .+= a + b
           end
           x
       end

julia> function v7()
           a = [1.0, 1.0]
           b = [2.0, 2.0]
           x = [0.0, 0.0]
           for i = 1:1_000_000
               x .+= a .+ b
           end
           x
       end

julia> function v8()
           a = [1.0, 1.0]
           b = [2.0, 2.0]
           x = [0.0, 0.0]
           for i = 1:1_000_000
               x = x .+ a .+ b
           end
           x
       end

and their respective times:

julia> @btime v6();
  23.145 ms (1000003 allocations: 76.29 MiB)

julia> @btime v7();
  2.572 ms (3 allocations: 240 bytes)

julia> @btime v8();
  19.013 ms (1000003 allocations: 76.29 MiB)

v6 dotted the assignment but not the addition, so a + b still allocated a temporary array on every iteration; v8 dotted the additions but not the assignment, so each iteration still materialized a new array and rebound x to it. In short, you need to dot everything to get the full speed benefit, as sketched below.
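
Roughly, here is what each loop body amounts to (an informal sketch, not the literal lowered code):

a = [1.0, 1.0]; b = [2.0, 2.0]; x = [0.0, 0.0]

# v6: x .+= a + b
tmp = a + b           # a + b allocates a new temporary every iteration
x .= x .+ tmp         # the assignment is in place, but the allocation already happened

# v8: x = x .+ a .+ b
tmp = x .+ a .+ b     # the additions fuse, but still materialize a new array
x = tmp               # x is rebound; the old array becomes garbage

# v7: x .+= a .+ b
x .= x .+ a .+ b      # one fused loop writing into the existing x: no allocation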

Conceptually, it is not about vectorized vs. devectorized code. It is about whether you create intermediate arrays or do everything in place.

Btw, you can use @. for fused broadcasting:
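
For example (a minimal sketch; @. adds the dot to every call and assignment in the expression):

x = [0.0, 0.0]; a = [1.0, 1.0]; b = [2.0, 2.0]
@. x = x + a + b   # equivalent to x .= x .+ a .+ b: one fused, in-place loop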

Broadcasting is a type of “vectorized” operation as well. For instance, in Python with NumPy arrays, a * b is element-wise multiplication, which is equivalent to a .* b in Julia.
