Performance of map!()

#1

Inspired by this topic here: map vs loops vs broadcasts, I wanted to run my own tests on the speed of these approaches. As it turns out, the optimised for loop, map, and broadcasting are all roughly in the same ballpark.
Now I wanted to play the same game with a preallocated result array and found significant differences:

using BenchmarkTools

function forloop(e, v)
    @simd for i in eachindex(v)
        @inbounds e[i] = 2*v[i]^2 + v[i] + 5
    end
end

fmap(e, v) = map!(x -> 2x^2 + x + 5, e, v)
fbcs(e, v) = @. e = 2*v^2 + v + 5

v = rand(10000)
e = similar(v)

@btime for i in 1:100
    forloop(e, v)
end

@btime for i in 1:100
    fmap(e, v)
end

@btime for i in 1:100
    fbcs(e, v)
end
  336.145 μs (0 allocations: 0 bytes)   # forloop
  944.702 μs (0 allocations: 0 bytes)   # fmap
  340.421 μs (0 allocations: 0 bytes)   # fbcs

Am I using map!() correctly or is there a reason why it should be so slow compared to the other implementations?

#2

Just some general notes about benchmarking:

  1. It’s not necessary to put a loop around the code you want to benchmark (@btime already does that for you)
  2. It is necessary to interpolate (with $) any variables you use inside the code you are benchmarking. Otherwise you’re timing the lookup of a global variable at each function call, which will affect your results (see https://github.com/JuliaCI/BenchmarkTools.jl#quick-start). I wouldn’t expect it to affect the relative performance in this case, but it’s still worth getting right.

With that in mind:

julia> @btime forloop($e, $v)
  2.511 μs (0 allocations: 0 bytes)

julia> @btime fmap($e, $v);
  9.699 μs (0 allocations: 0 bytes)

julia> @btime fbcs($e, $v);
  2.611 μs (0 allocations: 0 bytes)

I’m surprised to see that map! is indeed slower as of Julia 1.1.0.

#3

The code for map! on a 1-D array ends up in the function map_n!:

function map_n!(f::F, dest::AbstractArray, As) where F
    for i = LinearIndices(As[1])
        dest[i] = f(ith_all(i, As)...)
    end
    return dest
end

where ith_all is defined as:

@inline ith_all(i, as) = (as[1][i], ith_all(i, tail(as))...)
#4

Looks like an @inbounds in that function could help.
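A minimal sketch of what that could look like (hypothetical — this is not the actual Base code, and ith_all is assumed to be in scope as shown above):

```julia
# Hypothetical variant of map_n! with bounds checking elided via @inbounds.
# Whether the reads inside ith_all also skip their checks depends on
# inlining and @propagate_inbounds, so treat this as a sketch only.
function map_n_inbounds!(f::F, dest::AbstractArray, As) where F
    for i in LinearIndices(As[1])
        @inbounds dest[i] = f(ith_all(i, As)...)
    end
    return dest
end
```

Of course, eliding the checks is only safe if the caller guarantees that dest and all the source arrays share the indices being iterated.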

#5

Is the ith_all function even needed? I’m guessing it was written before dot-broadcast notation was introduced.

getindex.(As, i) does the same thing as ith_all(i, As) for a tuple of vectors As and scalar index i.
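For example (made-up data, just to illustrate the equivalence — broadcasting over a tuple returns a tuple):

```julia
As = ([10, 20, 30], [0.1, 0.2, 0.3])  # a tuple of vectors
getindex.(As, 2)                      # (20, 0.2), one element from each
```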

#7

Thanks for the advice regarding the interpolation and the loop; I will certainly use this in my next benchmark run. As you have already shown, it does not affect the relative performance here, so I will keep the implementation in the post as is.
If the performance bottleneck is figured out, should one open a GitHub issue?

#8

Maybe open an issue pointing to this post?

#9

With the --check-bounds=no flag, all versions perform quite similarly (Julia 1.1.0, Windows 10):

julia> @btime forloop($e, $v);
  2.509 μs (0 allocations: 0 bytes)

julia> @btime fmap($e, $v);
  2.623 μs (0 allocations: 0 bytes)

julia> @btime fbcs($e, $v);
  2.537 μs (0 allocations: 0 bytes)