Performance of map!()

Inspired by this topic: map vs loops vs broadcasts, I wanted to run my own tests on the speed of these approaches. As it turns out, the optimised for loop, map, and broadcasting are all roughly in the same ballpark.
Now I wanted to play the same game with a predefined result array and found significant differences:

using BenchmarkTools

function forloop(e, v)
    @simd for i in eachindex(v)
        @inbounds e[i] = 2*v[i]^2 + v[i] + 5
    end
end

fmap(e, v) = map!(x -> 2x^2 + x + 5, e, v)
fbcs(e, v) = @. e = 2*v^2 + v + 5

v = rand(10000)
e = similar(v)

@btime for i in 1:100
    forloop(e, v)
end

@btime for i in 1:100
    fmap(e, v)
end

@btime for i in 1:100
    fbcs(e, v)
end
  336.145 μs (0 allocations: 0 bytes)   # forloop
  944.702 μs (0 allocations: 0 bytes)   # fmap
  340.421 μs (0 allocations: 0 bytes)   # fbcs

Am I using map!() correctly or is there a reason why it should be so slow compared to the other implementations?


Just some general notes about benchmarking:

  1. It’s not necessary to put a loop around the code you want to benchmark (@btime already does that for you)
  2. It is necessary to interpolate (with $) any variables you use inside the code you are benchmarking. Otherwise you’re timing the lookup of a global variable at each call, which will affect your results (see https://github.com/JuliaCI/BenchmarkTools.jl#quick-start). I wouldn’t expect it to change the relative performance in this case, but it’s still worth getting right.

With that in mind:

julia> @btime forloop($e, $v)
  2.511 μs (0 allocations: 0 bytes)

julia> @btime fmap($e, $v);
  9.699 μs (0 allocations: 0 bytes)

julia> @btime fbcs($e, $v);
  2.611 μs (0 allocations: 0 bytes)

I’m surprised to see that map! is indeed slower as of Julia 1.1.0.


The code for map! on a 1D array ends up in the function map_n!:

function map_n!(f::F, dest::AbstractArray, As) where F
    for i = LinearIndices(As[1])
        dest[i] = f(ith_all(i, As)...)
    end
    return dest
end

where ith_all is defined as:

@inline ith_all(i, as) = (as[1][i], ith_all(i, tail(as))...)

Looks like an @inbounds in that function could help.
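For reference, here is a sketch of what that change might look like. This is a hypothetical modification, not the actual Base source; the name map_n_inbounds! and the Tuple{} base case for ith_all are added here so the snippet is self-contained:

```julia
using Base: tail

# Local copy of Base's ith_all; the empty-tuple method terminates the
# recursion (in Base this method lives elsewhere).
@inline ith_all(i, as) = (as[1][i], ith_all(i, tail(as))...)
@inline ith_all(i, as::Tuple{}) = ()

# map_n! with an @inbounds added on the indexing (hypothetical change).
# @inbounds should also elide the bounds checks inside the inlined ith_all.
function map_n_inbounds!(f::F, dest::AbstractArray, As) where F
    for i in LinearIndices(As[1])
        @inbounds dest[i] = f(ith_all(i, As)...)
    end
    return dest
end

v = rand(100)
e = similar(v)
map_n_inbounds!(x -> 2x^2 + x + 5, e, (v,))
```

Whether the compiler can hoist all the checks this way would still need to be confirmed by benchmarking against the stock map!.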

Is the ith_all function even needed? I’m guessing it was written before dot-broadcast notation was introduced.

getindex.(As, i) does the same thing as ith_all(i, As) for a tuple of vectors As and scalar index i.

Thanks for the advice regarding the interpolation and the loop; I will certainly use it in my next benchmark run. As you have already shown, it does not affect the relative performance, so I will keep the implementation in my post as is.
If the cause of the performance bottleneck is figured out, should one open a GitHub issue?

Maybe open an issue linking to this post?

With the --check-bounds=no flag, all versions perform quite similarly (Julia 1.1.0, Windows 10):

julia> @btime forloop($e, $v);
  2.509 μs (0 allocations: 0 bytes)

julia> @btime fmap($e, $v);
  2.623 μs (0 allocations: 0 bytes)

julia> @btime fbcs($e, $v);
  2.537 μs (0 allocations: 0 bytes)

While wondering whether I should use map or a for loop in a program, my elementary test below showed that map was much slower than the for loop, even when I used the no-check-bounds kernel option in IJulia by doing:

julia> using IJulia
julia> installkernel("Julia no-check-bounds", "--check-bounds=no")

Did I do it wrong somewhere below?

using BenchmarkTools
@benchmark map((x)->x^2,1:5)
BenchmarkTools.Trial: 
  memory estimate:  128 bytes
  allocs estimate:  1
  --------------
  minimum time:     37.563 ns (0.00% GC)
  median time:      41.289 ns (0.00% GC)
  mean time:        52.538 ns (16.17% GC)
  maximum time:     46.484 μs (99.75% GC)
  --------------
  samples:          10000
  evals/sample:     993
@benchmark for x in 1:5
    (x)->x^2
end
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.599 ns (0.00% GC)
  median time:      1.700 ns (0.00% GC)
  mean time:        1.740 ns (0.00% GC)
  maximum time:     14.201 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000
versioninfo()
Julia Version 1.0.4
Commit 38e9fb7f80 (2019-05-16 03:38 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i5-3380M CPU @ 2.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, ivybridge)
Environment:
  JULIA_NUM_THREADS = 4

Those don’t do the same thing. In the second benchmark, you’re repeatedly creating a function x -> x^2, not evaluating it.

Right. Thank you. I created a function temp() to evaluate it, as below. The results show that map! is still slower, and

y .= y.^2

is the slowest, although it may not be an example to be compared directly with the others.

using BenchmarkTools
y = zeros(5);
@benchmark map!(x->x^2, y, 1:5)
BenchmarkTools.Trial: 
  memory estimate:  32 bytes
  allocs estimate:  1
  --------------
  minimum time:     30.986 ns (0.00% GC)
  median time:      31.591 ns (0.00% GC)
  mean time:        44.111 ns (17.76% GC)
  maximum time:     48.684 μs (99.85% GC)
  --------------
  samples:          10000
  evals/sample:     994
function temp(y)
    for x in 1:5
        y[x] = x^2
    end
end
temp (generic function with 1 method)
y = zeros(5);
@benchmark temp(y)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     16.432 ns (0.00% GC)
  median time:      16.533 ns (0.00% GC)
  mean time:        19.683 ns (0.00% GC)
  maximum time:     64.629 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998
y = collect(1:5);
@benchmark y .= y.^2
BenchmarkTools.Trial: 
  memory estimate:  64 bytes
  allocs estimate:  4
  --------------
  minimum time:     883.784 ns (0.00% GC)
  median time:      900.027 ns (0.00% GC)
  mean time:        975.177 ns (3.75% GC)
  maximum time:     367.854 μs (99.37% GC)
  --------------
  samples:          10000
  evals/sample:     37
julia> @btime $y .= $y.^2
  6.818 ns (0 allocations: 0 bytes)

See e.g. Function calls in global scope, benchmarking, etc.


Thank you, again. Learning a lot here. 🙂

It should’ve been:

using BenchmarkTools
y = collect(1:5);
@benchmark $y .= $y.^2
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     9.509 ns (0.00% GC)
  median time:      9.511 ns (0.00% GC)
  mean time:        9.927 ns (0.00% GC)
  maximum time:     86.085 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999