Mapreduce over selected indices


#1

I was writing a program involving a lot of mapreduce operations over pieces of arrays. To make things concrete here I’ll use the example of a sum over squared elements of an array:

using BenchmarkTools
using Random            # for shuffle (needed on Julia >= 0.7)

x = rand(10000)
f(a) = a^2
inds = shuffle(1:10000)[1:100]   # 100 random indices
xselect = x[inds]                # selection allocated up front

@benchmark sum(f, x)
@benchmark sum(f, xselect)
@benchmark sum(f, x[inds])
@benchmark sum(f, view(x, inds))

The first two benchmarks sum over the original array and over the selection when it has been allocated before the benchmark runs. The results for both on my system are:

BenchmarkTools.Trial:
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     4.415 μs (0.00% GC)
  median time:      4.832 μs (0.00% GC)
  mean time:        4.941 μs (0.00% GC)
  maximum time:     34.531 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     7
BenchmarkTools.Trial:
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     436.629 ns (0.00% GC)
  median time:      486.954 ns (0.00% GC)
  mean time:        516.817 ns (0.15% GC)
  maximum time:     8.565 μs (89.27% GC)
  --------------
  samples:          10000
  evals/sample:     197

So the sum over the preallocated selection is about 10x faster: a median of roughly 487 ns vs 4.832 μs. The other approaches, however, are not nearly as fast, because the index selection happens inside the benchmark:

julia> @benchmark sum(f, x[inds])
BenchmarkTools.Trial:
  memory estimate:  928 bytes
  allocs estimate:  3
  --------------
  minimum time:     1.079 μs (0.00% GC)
  median time:      1.225 μs (0.00% GC)
  mean time:        1.442 μs (2.30% GC)
  maximum time:     128.936 μs (92.15% GC)
  --------------
  samples:          10000
  evals/sample:     10
julia> @benchmark sum(f, view(x, inds))
BenchmarkTools.Trial:
  memory estimate:  224 bytes
  allocs estimate:  11
  --------------
  minimum time:     3.957 μs (0.00% GC)
  median time:      4.540 μs (0.00% GC)
  mean time:        4.947 μs (0.00% GC)
  maximum time:     54.025 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     7
julia> @benchmark sum(f, (x[i] for i in inds))
BenchmarkTools.Trial:
  memory estimate:  6.16 KiB
  allocs estimate:  394
  --------------
  minimum time:     28.575 μs (0.00% GC)
  median time:      36.739 μs (0.00% GC)
  mean time:        41.210 μs (0.73% GC)
  maximum time:     1.882 ms (91.48% GC)
  --------------
  samples:          10000
  evals/sample:     1

There’s no reason a new allocation needs to be made for this operation, so I thought the array-view or generator version might be faster, but the overhead involved in those expressions far outweighs any savings. In the end I wrote my own function to do the operation without any allocations or overhead:

function selectedSum(f, inds::AbstractArray, V::AbstractArray)
    s = zero(eltype(V))
    @inbounds for i in inds
        s += f(V[i])
    end
    return s
end

The results for this benchmark are:

julia> @benchmark selectedSum(f, inds, x)
BenchmarkTools.Trial:
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     168.218 ns (0.00% GC)
  median time:      186.162 ns (0.00% GC)
  mean time:        193.043 ns (0.39% GC)
  maximum time:     1.983 μs (85.30% GC)
  --------------
  samples:          10000
  evals/sample:     780

Even faster than the regular sum on the preallocated selection. The only fundamental difference between my selectedSum and the built-in sum is that the loop variable i iterates over the selected indices instead of all of them. I feel like there should be a way to pass sum or mapreduce some version of x together with the indices and have it execute this fast, but every option I tried had some drawback. I’m also not sure why the generator expression does so poorly, since the elements should be retrieved on demand as they are iterated, just as they are in selectedSum.

Has anyone else had to do this sort of task repeatedly and come up with another solution? Is there a way to pass x[inds] without actually allocating memory and allowing mapreduce and sum to perform optimally?
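One allocation-free variant worth noting (a sketch, not benchmarked on my machine) is to map over the indices themselves rather than over a copy of x, so the element lookup happens inside the function passed to sum or mapreduce. The f, x, and inds below are stand-ins for the definitions above:

```julia
# Sketch: reduce over the indices, so no x[inds] copy is ever allocated.
f(a) = a^2
x = rand(10000)
inds = rand(1:10000, 100)   # placeholder index selection

s1 = sum(i -> f(x[i]), inds)            # anonymous function over the indices
s2 = mapreduce(i -> f(x[i]), +, inds)   # equivalent mapreduce form

s1 ≈ s2   # both compute the same selected sum
```

Inside a function (where x and inds are locals rather than globals) this should compile to essentially the same loop as selectedSum.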


#2

One approach I find convenient in situations like this is to use a generator, for example (x[i]^2 for i in inds) here. It seems a bit faster than in your tests, although not quite as fast as the explicit loop.

using BenchmarkTools
using Random            # for seed! and shuffle (needed on Julia >= 0.7)

Random.seed!(123)
x = rand(10000)
inds = shuffle(1:10000)[1:100]

f1(x, inds) = sum(x->x^2, x[inds])
f2(x, inds) = sum(x->x^2, view(x, inds))
f3(x, inds) = sum(x[i]^2 for i in inds)
f4(x, inds) = sum(@inbounds(x[i]^2) for i in inds)

function f5(x, inds)
    s = zero(eltype(x))
    @inbounds for i in inds
        s += x[i]^2
    end
    return s
end

which results in

julia> @btime f1($x, $inds)
  320.348 ns (2 allocations: 912 bytes)

julia> @btime f2($x, $inds)
  240.729 ns (3 allocations: 80 bytes)

julia> @btime f3($x, $inds)
  123.728 ns (2 allocations: 48 bytes)

julia> @btime f4($x, $inds)
  116.949 ns (2 allocations: 48 bytes)

julia> @btime f5($x, $inds)
  91.784 ns (0 allocations: 0 bytes)

#3

I see now if I rewrite my generator expression version as follows the results are much better:

julia> @btime sum(f, ($x[i] for i in $inds))
  219.547 ns (4 allocations: 80 bytes)
35.07525073667715

Compared to before:

julia> @btime sum(f, (x[i] for i in inds))
  31.782 μs (394 allocations: 6.16 KiB)
35.07525073667715

I noticed you used $ with @btime in the examples. Does that reflect how you’d write this in an actual program, or is it just for the purposes of benchmarking? I’d really appreciate an explanation of why this speeds things up with @btime and how I should write fast code in general.


#4

The $ is only necessary for benchmarking: without it, x is an untyped global variable in the benchmarking scope, and the interpolation makes the benchmark behave as if x were a local variable inside a function. For an explanation, see the BenchmarkTools manual: https://github.com/JuliaCI/BenchmarkTools.jl/blob/master/doc/manual.md#interpolating-values-into-benchmark-expressions
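To make the contrast concrete, here is a minimal sketch (timings will vary by machine; the point is only that the two calls benchmark different things):

```julia
using BenchmarkTools

x = rand(10000)   # a global at the REPL

# Without $: the benchmarked expression refers to the untyped global `x`,
# so the timing includes dynamic-dispatch overhead you would not see
# inside a function.
@btime sum(abs2, x)

# With $: the value of x is interpolated into the benchmark, so the
# expression is timed as if x were a function-local variable.
@btime sum(abs2, $x)
```

Both calls return the same value; only the measured overhead differs, and the $ form reflects how the code would perform in an actual program.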