sum(Array{Bool,1}) vs sum(Array(Int8,1})

I was reviewing old discussions around efficiency and I noticed that summing an array of Bool is still slower (julia-1.3.0-rc5) than summing the same array of values as Int8, which is puzzling since it seems it should involve exactly the same bits and operations.

Is there a reason that this should be so that I’m missing?

julia> using BenchmarkTools

julia> a = rand(Bool, 10000);

julia> b = Int8.(a);

julia> @benchmark sum($a)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.188 μs (0.00% GC)
  median time:      4.209 μs (0.00% GC)
  mean time:        4.466 μs (0.00% GC)
  maximum time:     21.049 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     7

julia> @benchmark sum($b)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.766 μs (0.00% GC)
  median time:      1.791 μs (0.00% GC)
  mean time:        2.005 μs (0.00% GC)
  maximum time:     9.272 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10
1 Like

They use slightly different implementations. sum(a) on a boolean array actually calls count(a) under the hood, whereas for an Int8 array it calls mapreduce(identity, Base.add_sum, a).

The underlying implementation is almost the same, actually — just a loop that sums up the contents, using Int accumulation. However, the mapreduce implementation for Int8 uses @simd whereas the count implementation does not, which is why you are seeing SIMD acceleration in the Int8 case.

The use of count (formerly countnz, formerly nnz) for summing Bool arrays was a performance optimization dating back to Julia 0.3 (https://github.com/JuliaLang/julia/pull/5288). It nowadays seems obsolete, and the appropriate thing to do seems to be to delete this specialized method for sum. (To implement this change, we’ll also need to make sure that Base.add_sum and Base.reduce_empty use Int accumulation for Bool arrays.) . We could also add @simd to the count implementation, of course. A PR would be welcome.

6 Likes

The diff changes base/reduce.jl adding
add_sum(x::Bool, y::Bool) = Int(x) + Int(y)
removing
sum(a::AbstractArray{Bool}) = count(a)
and leaving
reduce_empty(::typeof(+), ::Type{Bool}) = zero(Int)
as it was already correct.

And the change worked, sum(Array{Bool,1}) is twice as fast as it was in this morning’s benchmark. But something is still missing, sum(Array{Int8,1}) doubled its speed moving to 1.4, too, so Bool still loses.

julia> VERSION
v"1.4.0-DEV.513"

julia> using BenchmarkTools

julia> a = rand(Bool, 10000);

julia> @benchmark sum($a)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.029 μs (0.00% GC)
  median time:      2.089 μs (0.00% GC)
  mean time:        2.480 μs (0.00% GC)
  maximum time:     9.773 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     9

julia> b = Int8.(a);

julia> @benchmark sum($b)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     794.132 ns (0.00% GC)
  median time:      919.053 ns (0.00% GC)
  mean time:        1.069 μs (0.00% GC)
  maximum time:     4.338 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     38

It turns out that the changes did not have any impact on performance. After rebuilding the patch into v1.3.0-rc5 and benchmarking it against plain v1.3.0-rc5, there was no change in performance. There was no change in performance between v1.3 and v1.4, either, my laptop just runs benchmarks slower on battery power.

Looking at @code_typed, there’s a lot of preamble to invoking mapreduce in sum(Vector{Int8}) that isn’t present for sum(Vector{Bool}), so there must be another piece of codegen that isn’t recognizing sum(Vector{Bool}) as a candidate.

Okay, still trying to figure this out, running @code_llvm with the patch applied to master, I get:

julia> @code_llvm sum(bool)

;  @ reducedim.jl:652 within `sum'
define i64 @julia_sum_17062(%jl_value_t addrspace(10)* nonnull align 16 dereferenceable(40)) {
top:
; ┌ @ reducedim.jl:652 within `#sum#581'
; │┌ @ reducedim.jl:656 within `_sum' @ reducedim.jl:657
; ││┌ @ reducedim.jl:307 within `mapreduce'
; │││┌ @ reducedim.jl:307 within `#mapreduce#578'
; ││││┌ @ reducedim.jl:312 within `_mapreduce_dim'
       %1 = call i64 @julia__mapreduce_16947(%jl_value_t addrspace(10)* nonnull %0)
; └└└└└
  ret i64 %1
}

julia> @code_llvm sum(int)

;  @ reducedim.jl:652 within `sum'
define i64 @julia_sum_17063(%jl_value_t addrspace(10)* nonnull align 16 dereferenceable(40)) {
top:
; ┌ @ reducedim.jl:652 within `#sum#581'
; │┌ @ reducedim.jl:656 within `_sum' @ reducedim.jl:657
; ││┌ @ reducedim.jl:307 within `mapreduce'
; │││┌ @ reducedim.jl:307 within `#mapreduce#578'
; ││││┌ @ reducedim.jl:312 within `_mapreduce_dim'
; │││││┌ @ reduce.jl:299 within `_mapreduce'
[...]

which I take as both code generation paths matched up to reducedim.jl:312
_mapreduce_dim(f, op, ::NamedTuple{()}, A::AbstractArray, ::Colon) = _mapreduce(f, op, IndexStyle(A), A)
but the sum(bool) case failed to match the method reduce.jl:299
function _mapreduce(f, op, ::IndexLinear, A::AbstractArray{T}) where T
and bailed to vanilla code generation.

I don’t see why the match would fail:

julia> typeof(IndexStyle(bool))
IndexLinear

julia> typeof(bool) <: AbstractArray{T} where T
true

Am I reading this correctly?

1 Like

Are you sure it failed to match that method? Maybe it’s just that Int uses more inlining than Bool for some reason. You can benchmark _mapreduce(identity, Base.add_sum, IndexStyle(bool), bool) directly.

But maybe better file a PR to make it easier to reproduce and discuss.

Thanks for the response. I opened https://github.com/JuliaLang/julia/pull/34046.