There has been some discussion on several fora about the speed of broadcasting vs. an explicit for loop, but some aspects still escape me. Here is the output of the Julia script below:
Explicit for loop:
  1.310 μs (0 allocations: 0 bytes)
  1.309 μs (0 allocations: 0 bytes)
  1.305 μs (0 allocations: 0 bytes)
Broadcasting:
  8.831 μs (2 allocations: 78.17 KiB)
  8.599 μs (2 allocations: 78.17 KiB)
  8.336 μs (2 allocations: 78.17 KiB)
Broadcasting inbounds:
  8.062 μs (2 allocations: 78.17 KiB)
  8.676 μs (2 allocations: 78.17 KiB)
  9.170 μs (2 allocations: 78.17 KiB)
In this simple example, broadcasting causes two allocations and a performance penalty of roughly a factor of 6 to 7 (about 8.5 μs vs. 1.3 μs). I have tried the @inbounds macro, but it makes no measurable difference, so I guess the allocations are the problem. The number of allocations varies with the size of v.
So once again, why is the broadcasting so much slower than the for loop? And why does the number of allocations vary with the size of v?
#!/usr/bin/env julia
import BenchmarkTools
function explicit_for_loop(v::Vector{Float64})::Vector{Float64}
  for (index, value) in enumerate(v)
    v[index] = value^2
  end
  return v
end
function broadcasting(v::Vector{Float64})::Vector{Float64}
  v = v.^2
  return v
end
function broadcasting_inbounds(v::Vector{Float64})::Vector{Float64}
  @inbounds v = v.^2
  return v
end
function reset()::Vector{Float64}
  v::Vector{Float64} = [ x for x in range(1, 10000)]
  return v
end
v::Vector{Float64} = reset()
println("Explicit: ", explicit_for_loop(v))
v = reset()
println("Broadcast: ", broadcasting(v))
v = reset()
println("Broadcast inbounds: ", broadcasting_inbounds(v))
println("Explicit for loop:")
v = reset()
BenchmarkTools.@btime explicit_for_loop(v)
v = reset()
BenchmarkTools.@btime explicit_for_loop(v)
v = reset()
BenchmarkTools.@btime explicit_for_loop(v)
println("Broadcasting:")
v = reset()
BenchmarkTools.@btime broadcasting(v)
v = reset()
BenchmarkTools.@btime broadcasting(v)
v = reset()
BenchmarkTools.@btime broadcasting(v)
println("Broadcasting inbounds:")
v = reset()
BenchmarkTools.@btime broadcasting_inbounds(v)
v = reset()
BenchmarkTools.@btime broadcasting_inbounds(v)
v = reset()
BenchmarkTools.@btime broadcasting_inbounds(v)
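For comparison, here is a minimal sketch of an in-place broadcast, on the assumption that the allocations come from v.^2 building a new array which v is then rebound to. The names broadcasting_inplace! and broadcasting_alloc are hypothetical and not part of the script above. Base's @allocated is used instead of BenchmarkTools so the snippet has no dependencies. For scale, 10000 Float64 values take 10000 × 8 = 80000 bytes, i.e. 78.125 KiB, which is close to the 78.17 KiB reported per call.

```julia
# Hypothetical variant, not part of the original script: `.=` writes the
# fused broadcast result into the existing array instead of binding `v`
# to a freshly allocated one.
function broadcasting_inplace!(v::Vector{Float64})::Vector{Float64}
    v .= v .^ 2
    return v
end

# The allocating form used in the script above, for comparison.
function broadcasting_alloc(v::Vector{Float64})::Vector{Float64}
    return v .^ 2
end

v = collect(1.0:10_000.0)

# Warm up both functions so compilation is not counted below.
broadcasting_inplace!(copy(v))
broadcasting_alloc(v)

w = copy(v)
bytes_inplace = @allocated broadcasting_inplace!(w)
bytes_alloc = @allocated broadcasting_alloc(v)

println("in-place:   ", bytes_inplace, " bytes")
println("allocating: ", bytes_alloc, " bytes")
```

If the assumption holds, the allocating version should report at least the 80000 bytes of the result array, and the in-place version noticeably less. Like the explicit for loop, the in-place version overwrites its argument.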