It looks like xor may be suboptimal for two Bool values:
julia> @code_native xor(0x01,0x02)
.section __TEXT,__text,regular,pure_instructions
; ┌ @ int.jl:321 within `xor'
xorl %esi, %edi
movl %edi, %eax
retl
nopw %cs:(%eax,%eax)
; └
julia> @code_native xor(true,false)
.section __TEXT,__text,regular,pure_instructions
; ┌ @ bool.jl:75 within `xor'
; │┌ @ operators.jl:185 within `!='
; ││┌ @ bool.jl:75 within `=='
incl %eax
cmpb %dh, %bh
setne %al
; │└└
retl
; └
; ┌ @ bool.jl:75 within `<invalid>'
nopw (%eax,%eax)
Converting cum (a UInt8) to and from Bool values also takes some time; it speeds up a bit if you initialize cum = zero(eltype(H)) instead.
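
To illustrate the initialization point, here is a minimal sketch. It assumes H holds Bool elements, so zero(eltype(H)) is false and the accumulator stays a Bool throughout. The function name parity and the argument row are hypothetical; they stand in for whatever loop accumulates cum in your code.

```julia
# Hypothetical helper: XOR-reduce a vector of Bool values.
# `row` stands in for one row/column of H (assumed to have eltype Bool).
function parity(row::AbstractVector{Bool})
    # zero(eltype(row)) is `false`, so cum has the same type as the elements
    # and the loop body needs no UInt8 <-> Bool conversion.
    cum = zero(eltype(row))
    for x in row
        cum = xor(cum, x)
    end
    return cum
end

parity([true, false, true, true])  # true (odd number of `true` entries)
```

Whether this is actually faster in your case is worth checking with a benchmark on your real data, since the Bool method of xor shown above does not compile to a single xor instruction either.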