Sum with generic implementation is slower than double sum

Why is the latter implementation slower? These results surprise me, especially since it makes fewer allocations.

using BenchmarkTools

mat = randn(1000, 1000)
@btime sum(sum(x -> x > 1, mat, dims=1)) # 135.166 μs (2 allocations: 7.95 KiB)
@btime sum(x -> x > 1, mat) # 138.792 μs (1 allocation: 16 bytes)

I think you’ve got some sampling error. I get results the other way around, with the single summation about 3% faster.
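
(Here single and double are just thin wrappers around the two expressions from the original post, roughly:

double(mat) = sum(sum(x -> x > 1, mat, dims=1))   # nested sums: builds an intermediate 1×1000 matrix
single(mat) = sum(x -> x > 1, mat)                # one generic sum over the whole matrix
)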

But … this is probably an artifact of how @btime works, which reports only the minimum time. Using @benchmark instead I get these results:

julia> @benchmark double(mat)
BenchmarkTools.Trial: 8606 samples with 1 evaluation.
 Range (min … max):  228.354 μs …   2.375 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     526.527 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   572.941 μs ± 224.422 μs  ┊ GC (mean ± σ):  0.04% ± 0.85%

           ▁▇██▆▄▄▃▄▃▂▂▁
  ▅▆▆▄▅▆▄▅▇███████████████▇▇▆▅▅▄▄▄▃▃▄▃▃▄▃▃▃▃▂▂▂▂▂▂▂▂▂▁▂▂▁▁▁▁▁▁▁ ▄
  228 μs           Histogram: frequency by time         1.32 ms <

 Memory estimate: 7.95 KiB, allocs estimate: 2.

julia> @benchmark single(mat)
BenchmarkTools.Trial: 8556 samples with 1 evaluation.
 Range (min … max):  227.849 μs …   2.756 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     525.841 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   576.029 μs ± 222.980 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

           ▁▇█▆▆▅▄▂▃▃▂▁▁
  ▄▅▅▄▅▅▄▅▅█████████████▇▇▆▆▅▅▄▄▄▄▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▁▁▁▂▁▁▁▁▁▁▁ ▃
  228 μs           Histogram: frequency by time         1.32 ms <

 Memory estimate: 16 bytes, allocs estimate: 1.

What you can see is that the tiny difference in minimum times is dwarfed by the run-to-run variation, to the point that I’m not sure there’s any meaningful difference at all.

That is slightly interesting, since one version makes two calls to sum with an intermediate allocation while the other (seemingly) does less work. But I suspect that the Julia code for sum is just very well optimized in both cases.
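
Most of the 7.95 KiB in the double version is presumably the 1×1000 intermediate matrix of Int64 counts returned by the inner sum(x -> x > 1, mat, dims=1) call (1000 Int64 values is roughly 7.8 KiB). You can check with something like:

tmp = sum(x -> x > 1, mat, dims=1)   # 1×1000 Matrix{Int64}, the intermediate result
sizeof(tmp)                          # 8000 bytes of data, close to the reported 7.95 KiB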


What is faster is count(x -> x > 1, mat), or equivalently count(>(1), mat).
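
For example (just a sketch; exact timings will vary by machine):

using BenchmarkTools
@btime count(x -> x > 1, mat)   # same result as sum(x -> x > 1, mat), but uses the specialized counting path
@btime count(>(1), mat)         # equivalent, written with a partially applied predicate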
