Sum with generic implementation is slower than double sum

Why is the latter implementation slower? These results surprise me, especially since it makes fewer allocations.

using BenchmarkTools

mat = randn(1000, 1000)
@btime sum(sum(x -> x > 1, mat, dims=1)) # 135.166 μs (2 allocations: 7.95 KiB)
@btime sum(x -> x > 1, mat) # 138.792 μs (1 allocation: 16 bytes)

I think you’ve got some sampling error. I get results the other way around, with the single summation about 3% faster.
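
(Here single and double are just thin wrappers around the two expressions from the original post, roughly:

double(mat) = sum(sum(x -> x > 1, mat, dims=1))   # nested sums: builds an intermediate 1×1000 matrix
single(mat) = sum(x -> x > 1, mat)                # one generic sum over the whole matrix
)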

But … this is probably an artifact of how @btime works, which reports only the minimum time. Using @benchmark instead I get these results:

julia> @benchmark double(mat)
BenchmarkTools.Trial: 8606 samples with 1 evaluation.
 Range (min … max):  228.354 μs …   2.375 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     526.527 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   572.941 μs ± 224.422 μs  ┊ GC (mean ± σ):  0.04% ± 0.85%

           ▁▇██▆▄▄▃▄▃▂▂▁
  ▅▆▆▄▅▆▄▅▇███████████████▇▇▆▅▅▄▄▄▃▃▄▃▃▄▃▃▃▃▂▂▂▂▂▂▂▂▂▁▂▂▁▁▁▁▁▁▁ ▄
  228 μs           Histogram: frequency by time         1.32 ms <

 Memory estimate: 7.95 KiB, allocs estimate: 2.

julia> @benchmark single(mat)
BenchmarkTools.Trial: 8556 samples with 1 evaluation.
 Range (min … max):  227.849 μs …   2.756 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     525.841 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   576.029 μs ± 222.980 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

           ▁▇█▆▆▅▄▂▃▃▂▁▁
  ▄▅▅▄▅▅▄▅▅█████████████▇▇▆▆▅▅▄▄▄▄▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▁▁▁▂▁▁▁▁▁▁▁ ▃
  228 μs           Histogram: frequency by time         1.32 ms <

 Memory estimate: 16 bytes, allocs estimate: 1.

What you can see is that the tiny difference in minimum times is dwarfed by the run-to-run variation, to the point that I’m not sure there’s any meaningful difference at all.

That is slightly interesting, since one version makes two calls to sum with an intermediate allocation while the other (seemingly) does less work. But I suspect that the Julia code for sum is just very well optimized in both cases.
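
Most of the 7.95 KiB in the double version is presumably the 1×1000 intermediate matrix of Int64 counts returned by the inner sum(x -> x > 1, mat, dims=1) call (1000 Int64 values is roughly 7.8 KiB). You can check with something like:

tmp = sum(x -> x > 1, mat, dims=1)   # 1×1000 Matrix{Int64}, the intermediate result
sizeof(tmp)                          # 8000 bytes of data, close to the reported 7.95 KiB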


What is faster is count(x -> x > 1, mat), or equivalently count(>(1), mat).
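
For example (just a sketch; exact timings will vary by machine):

using BenchmarkTools
@btime count(x -> x > 1, mat)   # same result as sum(x -> x > 1, mat), but uses the specialized counting path
@btime count(>(1), mat)         # equivalent, written with a partially applied predicate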
