Broadcasting slower than for-loop

There has been some discussion on several fora about the speed of broadcasting vs. a for loop, but some aspects still escape me. Here is the output of the Julia script below:

Explicit for loop:
  1.310 μs (0 allocations: 0 bytes)
  1.309 μs (0 allocations: 0 bytes)
  1.305 μs (0 allocations: 0 bytes)
Broadcasting:
  8.831 μs (2 allocations: 78.17 KiB)
  8.599 μs (2 allocations: 78.17 KiB)
  8.336 μs (2 allocations: 78.17 KiB)
Broadcasting inbounds:
  8.062 μs (2 allocations: 78.17 KiB)
  8.676 μs (2 allocations: 78.17 KiB)
  9.170 μs (2 allocations: 78.17 KiB)

In this simple example, broadcasting causes two allocations and a performance penalty of a factor of about 6. I have tried the @inbounds macro, but I guess the allocations are the problem. The amount of memory allocated varies with the size of v.

So once again: why is broadcasting so much slower than the for loop, and why does the amount of allocated memory vary with the size of v?

#!/usr/bin/env julia

import BenchmarkTools

function explicit_for_loop(v::Vector{Float64})::Vector{Float64}
  for (index, value) in enumerate(v)
    v[index] = value^2
  end

  return v
end

function broadcasting(v::Vector{Float64})::Vector{Float64}
  v = v.^2

  return v
end

function broadcasting_inbounds(v::Vector{Float64})::Vector{Float64}
  @inbounds v = v.^2

  return v
end

function reset()::Vector{Float64}
  v::Vector{Float64} = [ x for x in range(1, 10000)]
  return v
end

v::Vector{Float64} = reset()
println("Explicit: ", explicit_for_loop(v))

v = reset()
println("Broadcast: ", broadcasting(v))

v = reset()
println("Broadcast: ", broadcasting_inbounds(v))

println("Explicit for loop:")
v = reset()
BenchmarkTools.@btime explicit_for_loop(v)
v = reset()
BenchmarkTools.@btime explicit_for_loop(v)
v = reset()
BenchmarkTools.@btime explicit_for_loop(v)

println("Broadcasting:")
v = reset()
BenchmarkTools.@btime broadcasting(v)
v = reset()
BenchmarkTools.@btime broadcasting(v)
v = reset()
BenchmarkTools.@btime broadcasting(v)

println("Broadcasting inbounds:")
v = reset()
BenchmarkTools.@btime broadcasting_inbounds(v)
v = reset()
BenchmarkTools.@btime broadcasting_inbounds(v)
v = reset()
BenchmarkTools.@btime broadcasting_inbounds(v)

First of all, note that it’s not necessary to annotate the type of every variable. Unless you want to restrict the types a variable can accept, Julia will infer the type of a variable when it’s passed as an argument to a function.
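For instance, here is a sketch of an annotation-free version (the name square_all is just illustrative): Julia specializes it on the argument type at the first call, so it is just as fast as the annotated version, and it now also accepts other element types.

```julia
# Sketch: an annotation-free broadcasting function. Julia compiles a
# specialized method for each argument type it is called with.
square_all(v) = v .^ 2

square_all([1.0, 2.0, 3.0])     # Vector{Float64}
square_all([1, 2, 3])           # works for Int too
square_all(Float32[1, 2, 3])    # and Float32
```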

Second, note that you’re not comparing the same operations.

function explicit_for_loop(v)
  for (index, value) in enumerate(v)
    v[index] = value^2
  end

  return v
end

This operation mutates a vector that was already created, so you won’t see any allocations. In contrast, the broadcasting operation allocates a new vector every time it executes.

function broadcasting(v)
  y = v.^2          # don't reuse an argument variable, it can lead to issues

  return y
end

So what the broadcasting operation is doing is equivalent to the following (I’m writing your for-loop in a more idiomatic way):

function explicit_for_loop(v)
    y = similar(v)      # or y = Vector{Float64}(undef, length(v))
    
    for index in eachindex(v)
      y[index] = v[index]^2
    end
  
    return y
end

And now, note that the benchmarks also require interpolating the variables with $. Otherwise they’re treated as global variables, and you’ll likely see more allocations than there really are.

using BenchmarkTools
v = rand(10)
@btime explicit_for_loop($v)    #note the $ to interpolate the variable, otherwise it's treated as a global
    #output: 39.695 ns (1 allocation: 144 bytes)

@btime broadcasting($v)    #note the $ to interpolate the variable, otherwise it's treated as a global
    #output: 45.233 ns (1 allocation: 144 bytes)

If you actually wanted to “update” v, rather than creating a new vector y, you could also do:

function broadcasting(v)
  v .= v.^2          # note the operator .=

  return v
end

or you can directly apply the for-loop that you were using.

8 Likes

I’ve never seen this one in the wild, but I thought it should work, and it did:

v .^= 2  # squares in-place

BTW, enumerate isn’t really the right idiom here, even though it works for Vector, since it always counts from one regardless of the array’s actual indices. Instead of enumerate, use pairs, which iterates the array’s own indices.
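For Vector the two happen to coincide, but the squaring loop rewritten with pairs looks like this (a sketch, with an illustrative name):

```julia
# Sketch: pairs(v) yields index => value pairs using the array's own
# indices, so this stays correct for any indexing scheme.
function square_with_pairs!(v)
    for (i, x) in pairs(v)
        v[i] = x^2
    end
    return v
end

square_with_pairs!([1.0, 2.0, 3.0])   # → [1.0, 4.0, 9.0]
```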

5 Likes

Thanks for your fast answer.

I thought vectors are mutable, so changing them in a function should not cause an issue. Am I missing something?

I added the dots before the = signs in the broadcasting functions. This indeed makes the allocations go away, and the performance is now the same as for the for loop.

Variables in Julia are just labels for objects. A variable can always be rebound to a different object, but that doesn’t change the underlying data. Nothing is being mutated.

When you write

y = y .^ 2

you’re saying “I want to take my label y and change it so that it now labels the result of y .^ 2.” The operation y .^ 2 allocates a whole new vector. The data y used to label is left unchanged.

For example, let’s make a vector y and then say z = y, and then re-bind y. What ends up happening?

julia> let y = [1, 2, 3]
           z = y
           y = [4, 5, 6]
           z
       end
3-element Vector{Int64}:
 1
 2
 3

At the end of this, z is no longer labelling the same data as y.

Mutation is different. When you mutate something, you’re changing the actual data that the label is pointing to. E.g.:

julia> let y = [1, 2, 3]
           z = y
           y .= [4, 5, 6]
           z
       end
3-element Vector{Int64}:
 4
 5
 6

In this case, z and y are still labelling the same data, and that data has been changed by the mutation.

4 Likes

I think you are right. y = [1, 2, 3] calls a constructor, so a new object is generated, and that object is likely to be allocated at a different memory location; its pointer changes. y .= [1, 2, 3] does not create a new object, but overwrites the same memory, in place. Calling pointer on the object gives it away. That is indeed an important take-away from this discussion. z = y does not create a new vector but a new reference to the same memory, so z[2] = 42 changes y[2]. An independent copy can be made with copy or deepcopy. I guess that as long as a function argument keeps pointing to the same memory, no issues are expected, but rebinding a vector argument to a new object must be risky. Now I wonder how my code could have worked in the first place.
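The pointer check mentioned above can be sketched like this (the bindings are only illustrative):

```julia
# Sketch: pointer reveals whether an operation reuses the same buffer.
y = [1.0, 2.0, 3.0]
p = pointer(y)

y .= y .^ 2        # in-place: the existing memory is overwritten
pointer(y) == p    # true, same buffer

y = y .^ 2         # rebinding: a freshly allocated vector
pointer(y) == p    # false, y now labels different memory
```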

Actually, a new object is created here too, on the right-hand side of the expression. The values in that object then overwrite the values in the object that y refers to.

It’s the same as writing

x = [1, 2, 3]  # create new object
y .= x  # overwrite values in y

The crucial difference compared to something like

y .= y.^2

is that the dot operations are fused, so the mutation of y[i] and the calculation of y[i]^2 happen in one ‘go’, and there is no need for intermediate storage. Fusing the broadcasts means that it’s not the same as

x = y.^2
y .= x
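The difference can be made visible with @allocated; this is a sketch (the helper names fused! and unfused! are illustrative): the fused form needs no temporary, while the two-step form allocates one.

```julia
# Sketch: comparing allocations of the fused and unfused forms.
fused!(y)   = (y .= y .^ 2; y)           # single fused loop, no temporary
unfused!(y) = (x = y .^ 2; y .= x; y)    # allocates the intermediate x

function measure()
    y = rand(1000)
    fused!(y); unfused!(y)               # warm-up, so compilation isn't counted
    (@allocated fused!(y)), (@allocated unfused!(y))
end

measure()   # first number should be 0, second roughly the size of one temporary vector
```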
6 Likes