I was reviewing old discussions around efficiency and I noticed that summing an array of Bool is still slower (julia-1.3.0-rc5) than summing the same array of values as Int8, which is puzzling since it seems it should involve exactly the same bits and operations.
Is there a reason that this should be so that I’m missing?
The underlying implementation is almost the same, actually — just a loop that sums up the contents, using Int accumulation. However, the mapreduce implementation for Int8uses @simd whereas the count implementation does not, which is why you are seeing SIMD acceleration in the Int8 case.
The use of count (formerly countnz, formerly nnz) for summing Bool arrays was a performance optimization dating back to Julia 0.3 (https://github.com/JuliaLang/julia/pull/5288). It nowadays seems obsolete, and the appropriate thing to do seems to be to delete this specialized method for sum. (To implement this change, we’ll also need to make sure that Base.add_sum and Base.reduce_empty use Int accumulation for Bool arrays.) . We could also add @simd to the count implementation, of course. A PR would be welcome.
The diff changes base/reduce.jl adding add_sum(x::Bool, y::Bool) = Int(x) + Int(y)
removing sum(a::AbstractArray{Bool}) = count(a)
and leaving reduce_empty(::typeof(+), ::Type{Bool}) = zero(Int)
as it was already correct.
And the change worked, sum(Array{Bool,1}) is twice as fast as it was in this morning’s benchmark. But something is still missing, sum(Array{Int8,1}) doubled its speed moving to 1.4, too, so Bool still loses.
julia> VERSION
v"1.4.0-DEV.513"
julia> using BenchmarkTools
julia> a = rand(Bool, 10000);
julia> @benchmark sum($a)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 2.029 μs (0.00% GC)
median time: 2.089 μs (0.00% GC)
mean time: 2.480 μs (0.00% GC)
maximum time: 9.773 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 9
julia> b = Int8.(a);
julia> @benchmark sum($b)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 794.132 ns (0.00% GC)
median time: 919.053 ns (0.00% GC)
mean time: 1.069 μs (0.00% GC)
maximum time: 4.338 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 38
It turns out that the changes did not have any impact on performance. After rebuilding the patch into v1.3.0-rc5 and benchmarking it against plain v1.3.0-rc5, there was no change in performance. There was no change in performance between v1.3 and v1.4, either, my laptop just runs benchmarks slower on battery power.
Looking at @code_typed, there’s a lot of preamble to invoking mapreduce in sum(Vector{Int8}) that isn’t present for sum(Vector{Bool}), so there must be another piece of codegen that isn’t recognizing sum(Vector{Bool}) as a candidate.
Okay, still trying to figure this out, running @code_llvm with the patch applied to master, I get:
julia> @code_llvm sum(bool)
; @ reducedim.jl:652 within `sum'
define i64 @julia_sum_17062(%jl_value_t addrspace(10)* nonnull align 16 dereferenceable(40)) {
top:
; ┌ @ reducedim.jl:652 within `#sum#581'
; │┌ @ reducedim.jl:656 within `_sum' @ reducedim.jl:657
; ││┌ @ reducedim.jl:307 within `mapreduce'
; │││┌ @ reducedim.jl:307 within `#mapreduce#578'
; ││││┌ @ reducedim.jl:312 within `_mapreduce_dim'
%1 = call i64 @julia__mapreduce_16947(%jl_value_t addrspace(10)* nonnull %0)
; └└└└└
ret i64 %1
}
julia> @code_llvm sum(int)
; @ reducedim.jl:652 within `sum'
define i64 @julia_sum_17063(%jl_value_t addrspace(10)* nonnull align 16 dereferenceable(40)) {
top:
; ┌ @ reducedim.jl:652 within `#sum#581'
; │┌ @ reducedim.jl:656 within `_sum' @ reducedim.jl:657
; ││┌ @ reducedim.jl:307 within `mapreduce'
; │││┌ @ reducedim.jl:307 within `#mapreduce#578'
; ││││┌ @ reducedim.jl:312 within `_mapreduce_dim'
; │││││┌ @ reduce.jl:299 within `_mapreduce'
[...]
which I take as both code generation paths matched up to reducedim.jl:312 _mapreduce_dim(f, op, ::NamedTuple{()}, A::AbstractArray, ::Colon) = _mapreduce(f, op, IndexStyle(A), A)
but the sum(bool) case failed to match the method reduce.jl:299 function _mapreduce(f, op, ::IndexLinear, A::AbstractArray{T}) where T
and bailed to vanilla code generation.
I don’t see why the match would fail:
julia> typeof(IndexStyle(bool))
IndexLinear
julia> typeof(bool) <: AbstractArray{T} where T
true
Are you sure it failed to match that method? Maybe it’s just that Int uses more inlining than Bool for some reason. You can benchmark _mapreduce(identity, Base.add_sum, IndexStyle(bool), bool) directly.
But maybe better file a PR to make it easier to reproduce and discuss.