I was writing a program that involves a lot of mapreduce operations over pieces of arrays. To make things concrete, I'll use the example of a sum over squared elements of an array:
using BenchmarkTools
using Random # shuffle lives in the Random stdlib
x = rand(10000)
f(a) = a^2
inds = shuffle(1:10000)[1:100]
xselect = x[inds]
@benchmark sum(f, x)
@benchmark sum(f, xselect)
@benchmark sum(f, x[inds])
@benchmark sum(f, view(x, inds))
The first two benchmarks sum over the original array and over the selection, with the selection allocated before the benchmark. The results for both on my system are:
BenchmarkTools.Trial:
memory estimate: 16 bytes
allocs estimate: 1
--------------
minimum time: 4.415 μs (0.00% GC)
median time: 4.832 μs (0.00% GC)
mean time: 4.941 μs (0.00% GC)
maximum time: 34.531 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 7
BenchmarkTools.Trial:
memory estimate: 16 bytes
allocs estimate: 1
--------------
minimum time: 436.629 ns (0.00% GC)
median time: 486.954 ns (0.00% GC)
mean time: 516.817 ns (0.15% GC)
maximum time: 8.565 μs (89.27% GC)
--------------
samples: 10000
evals/sample: 197
So the sum over the selection is about 10x faster: 486 ns vs 4.832 μs at the median. The timings of the other approaches, however, are not nearly this fast, because the index selection occurs inside the benchmark:
julia> @benchmark sum(f, x[inds])
BenchmarkTools.Trial:
memory estimate: 928 bytes
allocs estimate: 3
--------------
minimum time: 1.079 μs (0.00% GC)
median time: 1.225 μs (0.00% GC)
mean time: 1.442 μs (2.30% GC)
maximum time: 128.936 μs (92.15% GC)
--------------
samples: 10000
evals/sample: 10
julia> @benchmark sum(f, view(x, inds))
BenchmarkTools.Trial:
memory estimate: 224 bytes
allocs estimate: 11
--------------
minimum time: 3.957 μs (0.00% GC)
median time: 4.540 μs (0.00% GC)
mean time: 4.947 μs (0.00% GC)
maximum time: 54.025 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 7
julia> @benchmark sum(f, (x[i] for i in inds))
BenchmarkTools.Trial:
memory estimate: 6.16 KiB
allocs estimate: 394
--------------
minimum time: 28.575 μs (0.00% GC)
median time: 36.739 μs (0.00% GC)
mean time: 41.210 μs (0.73% GC)
maximum time: 1.882 ms (91.48% GC)
--------------
samples: 10000
evals/sample: 1
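(A methodological aside: BenchmarkTools also has a setup keyword that can hoist the indexing out of the timed expression, so only the reduction itself is measured. A minimal sketch of that, assuming the same x and inds as above:

@benchmark sum(f, sel) setup=(sel = $x[$inds])

Here sel is rebuilt for each sample, but its construction is excluded from the reported times.)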
There’s no reason why this operation needs a fresh allocation, so I thought the array view or the generator version might be faster, but the overhead involved in those expressions far outweighs any savings. So I ended up writing my own function that does the operation without any allocations or overhead:
function selectedSum(f, inds::AbstractArray, V::AbstractArray)
    s = zero(eltype(V))       # accumulator of V's element type
    @inbounds for i in inds   # iterate only the selected indices
        s += f(V[i])
    end
    return s
end
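As a quick sanity check, this agrees with the allocating version up to floating-point roundoff (Base's sum uses pairwise summation, while this loop accumulates left to right):

selectedSum(f, inds, x) ≈ sum(f, x[inds]) # holds up to summation-order roundoff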
The results for this benchmark are:
julia> @benchmark selectedSum(f, inds, x)
BenchmarkTools.Trial:
memory estimate: 16 bytes
allocs estimate: 1
--------------
minimum time: 168.218 ns (0.00% GC)
median time: 186.162 ns (0.00% GC)
mean time: 193.043 ns (0.39% GC)
maximum time: 1.983 μs (85.30% GC)
--------------
samples: 10000
evals/sample: 780
Even faster than the regular sum over the preallocated selection. The only fundamental difference between my selectedSum and the built-in sum is that the loop variable i iterates over the selected indices instead of all of them. I feel like there should be a way to pass sum or mapreduce some version of x with the indices and have it run this fast, but every option I tried had some drawback. I'm also not sure why the generator expression does so poorly: as the elements are iterated, they should be retrieved on demand just as they are in selectedSum.
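(One possibility I haven't ruled out: x and inds are non-const globals here, and the generator closes over them, so the BenchmarkTools manual's advice to interpolate globals with $ may apply. A sketch of the re-run I would try, under that assumption:

@benchmark sum(f, ($x[i] for i in $inds))

If the global lookups are the bottleneck, this form might come out much closer to the view and slice versions.)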
Has anyone else had to do this sort of task repeatedly and come up with another solution? Is there a way to pass x[inds] without actually allocating memory, while still letting mapreduce and sum perform optimally?
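One more shape that avoids materializing x[inds] entirely is to reduce over the indices instead of the values; just a sketch, which I haven't benchmarked as carefully as the versions above:

selectedSum2(f, inds, V) = mapreduce(i -> f(V[i]), +, inds) # no intermediate V[inds] array

selectedSum2(f, inds, x) # same result as selectedSum(f, inds, x), up to roundoff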