Interesting. Mind showing the @code_native debuginfo=:none syntax=:intel of both versions, or at least the main loop?

E.g., with the BitVector (the function test_avx! is defined below):

@code_native debuginfo=:none syntax=:intel test_avx!(BitVector(undef, length(a)), a, b)
L64:
vmovupd zmm0, zmmword ptr [r9 + 8*rdx]
vcmpltpd k0, zmm0, zmmword ptr [r10 + 8*rdx]
kmovb byte ptr [rdi], k0
add rdx, 8
inc rdi
cmp rdx, rsi
jl L64
and the Bool vector:

@code_native debuginfo=:none syntax=:intel test_avx!(Vector{Bool}(undef, length(a)), a, b)
L64:
vmovupd zmm0, zmmword ptr [r9 + 8*rdx]
vcmpltpd k0, zmm0, zmmword ptr [rdi + 8*rdx]
vpmovm2w xmm0, k0
vpsrlw xmm0, xmm0, 15
vpmovwb qword ptr [r8 + rdx], xmm0
add rdx, 8
cmp rdx, rsi
jl L64
With AVX512, the comparison (vcmpltpd) generates bitmasks directly. So the difference between the two versions is that with bitmasks, it can store those bits directly (kmovb). With Bools, it needs two extra instructions (vpmovm2w and vpsrlw) to turn the mask bits into a vector of bools, which vpmovwb then narrows to bytes and stores.
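To make the difference concrete, here is a rough scalar sketch of what one 8-element iteration amounts to in each case. The helper names and the Vector{UInt8} stand-in for the packed storage are purely illustrative (an actual BitVector packs its bits into UInt64 chunks, and this is not what LoopVectorization emits):

# BitVector-style path: pack 8 comparison results into one byte, then do a single 1-byte store (like kmovb).
function store_bits!(chunks::Vector{UInt8}, a, b, i)
    byte = 0x00
    for j in 0:7
        byte |= UInt8(a[i+j] < b[i+j]) << j
    end
    chunks[((i - 1) >> 3) + 1] = byte   # one byte holds 8 results
    chunks
end

# Vector{Bool}-style path: each result becomes its own 0x00/0x01 byte;
# the SIMD version first expands the mask (vpmovm2w, vpsrlw), then stores 8 bytes at once (vpmovwb).
function store_bytes!(c::Vector{Bool}, a, b, i)
    for j in 0:7
        c[i+j] = a[i+j] < b[i+j]
    end
    c
end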
We can also compare across sizes after removing GC as a factor:
using LoopVectorization, BenchmarkTools
function test_avx!(c, a, b)
@avx for i in eachindex(c)
c[i] = a[i] < b[i]
end
c
end
# More convenient test3 and test4 definitions
test3(a, b) = test_avx!(BitVector(undef, length(a)), a, b)
test4(a, b) = test_avx!(Vector{Bool}(undef, length(a)), a, b)
for _n in 2:13
n = 2 ^ _n
@show n
a = rand(n); b = rand(n);
cbit = BitVector(undef, n)
cbyte = Vector{Bool}(undef, n)
println("BitVector:")
@btime test_avx!($cbit, $a, $b)
println("Byte Vector:")
@btime test_avx!($cbyte, $a, $b)
println("")
end
I get:
n = 4
BitVector:
9.732 ns (0 allocations: 0 bytes)
Byte Vector:
11.343 ns (0 allocations: 0 bytes)
n = 8
BitVector:
9.377 ns (0 allocations: 0 bytes)
Byte Vector:
10.713 ns (0 allocations: 0 bytes)
n = 16
BitVector:
10.953 ns (0 allocations: 0 bytes)
Byte Vector:
12.574 ns (0 allocations: 0 bytes)
n = 32
BitVector:
12.291 ns (0 allocations: 0 bytes)
Byte Vector:
13.964 ns (0 allocations: 0 bytes)
n = 64
BitVector:
14.946 ns (0 allocations: 0 bytes)
Byte Vector:
15.982 ns (0 allocations: 0 bytes)
n = 128
BitVector:
18.746 ns (0 allocations: 0 bytes)
Byte Vector:
26.987 ns (0 allocations: 0 bytes)
n = 256
BitVector:
21.882 ns (0 allocations: 0 bytes)
Byte Vector:
32.635 ns (0 allocations: 0 bytes)
n = 512
BitVector:
40.731 ns (0 allocations: 0 bytes)
Byte Vector:
58.382 ns (0 allocations: 0 bytes)
n = 1024
BitVector:
73.779 ns (0 allocations: 0 bytes)
Byte Vector:
106.884 ns (0 allocations: 0 bytes)
n = 2048
BitVector:
143.350 ns (0 allocations: 0 bytes)
Byte Vector:
223.396 ns (0 allocations: 0 bytes)
n = 4096
BitVector:
385.217 ns (0 allocations: 0 bytes)
Byte Vector:
452.274 ns (0 allocations: 0 bytes)
n = 8192
BitVector:
784.822 ns (0 allocations: 0 bytes)
Byte Vector:
920.618 ns (0 allocations: 0 bytes)
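For a quick sense of scale, dividing the quoted timings at a couple of the larger sizes:

106.884 / 73.779    # n = 1024: the Vector{Bool} version takes ≈ 1.45x as long as the BitVector one
920.618 / 784.822   # n = 8192: ≈ 1.17x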
Looking a little more in depth, in part because LoopVectorization is choosing not to unroll this loop and I’m curious about whether or not that helps:
function test_avx_unroll4!(c, a, b)
@avx unroll=4 for i in eachindex(c)
c[i] = a[i] < b[i]
end
c
end
N = 1024;
cbit = BitVector(undef, N);
cbyte = Vector{Bool}(undef, N);
a = rand(N); b = rand(N);
using LinuxPerf
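# Helper: call f(args...) N times so the perf counters below aggregate over many runs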
foreachf!(f::F, N, args::Vararg{<:Any,A}) where {F,A} = foreach(_ -> f(args...), Base.OneTo(N))
Now, running the following twice:
@pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads)" foreachf!(test_avx!, 10_000_000, cbit, a, b)
@pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads)" foreachf!(test_avx!, 10_000_000, cbyte, a, b)
@pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads)" foreachf!(test_avx_unroll4!, 10_000_000, cbit, a, b)
@pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads)" foreachf!(test_avx_unroll4!, 10_000_000, cbyte, a, b)
I get the following on the second run; BitVector without extra unrolling:
julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads)" foreachf!(test_avx!, 10_000_000, cbit, a, b)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles 4.34e+09 60.1% # 4.3 cycles per ns
┌ instructions 1.44e+10 60.1% # 3.3 insns per cycle
│ branch-instructions 2.19e+09 60.1% # 15.2% of instructions
└ branch-misses 1.00e+07 60.1% # 0.5% of branch instructions
┌ task-clock 1.01e+09 100.0% # 1.0 s
│ context-switches 0.00e+00 100.0%
│ cpu-migrations 0.00e+00 100.0%
└ page-faults 0.00e+00 100.0%
┌ L1-dcache-load-misses 2.29e+05 19.9% # 0.0% of dcache loads
│ L1-dcache-loads 4.27e+09 19.9%
└ L1-icache-load-misses 3.44e+04 19.9%
┌ dTLB-load-misses 1.12e+04 19.9% # 0.0% of dTLB loads
└ dTLB-loads 4.27e+09 19.9%
┌ iTLB-load-misses 3.30e+04 40.0% # 2676.4% of iTLB loads
└ iTLB-loads 1.23e+03 40.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Here, we see that it ran at a clock speed of 4.3 GHz (4.3 cycles per ns == 4.3 billion cycles / s), and executed 3.3 instructions per clock cycle (IPC).
15.2% of those instructions were branches, and 0.5% of branches were mispredicted (real world branch prediction rate will probably be lower than when we’re just running the same code over and over again).
Our main loop was 7 instructions, of which 1 was a branch: do we repeat the loop, or exit?
julia> 1/7
0.14285714285714285
So we have a higher branch density overall than our main loop implies.
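As a rough sanity check (taking the scaled counts above at face value, and assuming essentially every iteration is the 8-element vectorized body), the main loop only accounts for part of those branches:

1024 ÷ 8              # 128 loop iterations per call
10_000_000 * 128      # ≈ 1.28e9 loop-back branches across all calls
2.19e9 - 1.28e9       # ≈ 9.1e8 branches from outside the main loop (call and benchmark-harness overhead)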
At this size, we had 0% L1 cache misses, so memory bottlenecking isn’t a problem. Now, looking at using Vector{Bool} instead:
julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads)" foreachf!(test_avx!, 10_000_000, cbyte, a, b)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles 5.62e+09 60.0% # 4.3 cycles per ns
┌ instructions 1.55e+10 60.0% # 2.8 insns per cycle
│ branch-instructions 2.16e+09 60.0% # 13.9% of instructions
└ branch-misses 1.01e+07 60.0% # 0.5% of branch instructions
┌ task-clock 1.31e+09 100.0% # 1.3 s
│ context-switches 0.00e+00 100.0%
│ cpu-migrations 0.00e+00 100.0%
└ page-faults 0.00e+00 100.0%
┌ L1-dcache-load-misses 3.32e+05 20.0% # 0.0% of dcache loads
│ L1-dcache-loads 4.18e+09 20.0%
└ L1-icache-load-misses 4.46e+04 20.0%
┌ dTLB-load-misses 3.30e+03 19.9% # 0.0% of dTLB loads
└ dTLB-loads 4.18e+09 19.9%
┌ iTLB-load-misses 7.84e+04 39.9% # 6023.1% of iTLB loads
└ iTLB-loads 1.30e+03 39.9%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Note that our main loop in this version had 8 instructions, one of which was a branch (the comparison → store required 2 extra instructions, but we had one less loop counter here; eliminating the extra loop counter in the BitVector version is something for me to look into. I’ll file a LoopVectorization issue).
So given that the density of branches in our main loop is now 12.5%, it’s natural for us to have a slightly lower percentage of branch instructions.
We still have a 0% L1 cache miss rate, yet instructions per cycle decreased from 3.3 to 2.8. That means on top of this version requiring more instructions (1.55e10 vs 1.44e10), the relative difference in the number of cycles needed was even greater.
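Putting numbers on that (using the counts from the two @pstats runs above):

1.55e10 / 1.44e10   # ≈ 1.08: ~8% more instructions for the Vector{Bool} version
5.62e9  / 4.34e9    # ≈ 1.29: but ~29% more cycles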
Now let’s look at the impact of unrolling the BitVector version:
julia> @pstats "cpu-cycles,(instructions,,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads)" foreachf!(test_avx_unroll4!, 10_000_000, cbit, a, b)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles 3.72e+09 60.0% # 4.3 cycles per ns
┌ instructions 1.06e+10 60.0% # 2.8 insns per cycle
│ branch-instructions 1.23e+09 60.0% # 11.6% of instructions
└ branch-misses 1.01e+07 60.0% # 0.8% of branch instructions
┌ task-clock 8.68e+08 100.0% # 868.5 ms
│ context-switches 0.00e+00 100.0%
│ cpu-migrations 0.00e+00 100.0%
└ page-faults 0.00e+00 100.0%
┌ L1-dcache-load-misses 2.43e+05 20.0% # 0.0% of dcache loads
│ L1-dcache-loads 4.28e+09 20.0%
└ L1-icache-load-misses 4.11e+04 20.0%
┌ dTLB-load-misses 5.14e+02 20.0% # 0.0% of dTLB loads
└ dTLB-loads 4.28e+09 20.0%
┌ iTLB-load-misses 8.32e+04 40.0% # 7892.9% of iTLB loads
└ iTLB-loads 1.05e+03 40.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Instructions per cycle was 2.8, lower than in the non-unrolled version, but the number of instructions required decreased by a much larger factor, resulting in fewer cycles and less time needed overall.
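A rough accounting, using the (rounded) figures from the two BitVector runs above:

1.44e10 / 1.06e10   # ≈ 1.36: ~36% more instructions without unrolling
3.3 / 2.8           # ≈ 1.18: IPC only dropped by ~18% when unrolling
4.34e9 / 3.72e9     # ≈ 1.17: net effect, ~17% fewer cycles with unroll=4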
The Vector{Bool} version:
julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads)" foreachf!(test_avx_unroll4!, 10_000_000, cbyte, a, b)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles 5.58e+09 60.0% # 4.3 cycles per ns
┌ instructions 1.27e+10 60.0% # 2.3 insns per cycle
│ branch-instructions 1.20e+09 60.0% # 9.5% of instructions
└ branch-misses 1.01e+07 60.0% # 0.8% of branch instructions
┌ task-clock 1.30e+09 100.0% # 1.3 s
│ context-switches 0.00e+00 100.0%
│ cpu-migrations 0.00e+00 100.0%
└ page-faults 0.00e+00 100.0%
┌ L1-dcache-load-misses 3.33e+05 20.0% # 0.0% of dcache loads
│ L1-dcache-loads 4.19e+09 20.0%
└ L1-icache-load-misses 3.81e+04 20.0%
┌ dTLB-load-misses 1.27e+03 20.0% # 0.0% of dTLB loads
└ dTLB-loads 4.19e+09 20.0%
┌ iTLB-load-misses 1.46e+05 40.0% # 11649.5% of iTLB loads
└ iTLB-loads 1.25e+03 40.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Even though we again have 0.0% of dcache loads resulting in misses, we again get about 0.5 fewer instructions per clock than the BitVector version, and this is about as fast as the Vector{Bool} version without unrolling, i.e. slower than either BitVector version.
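Collecting the cycle counts from the four @pstats runs above makes that comparison explicit:

# cycles: BitVector 4.34e9, Vector{Bool} 5.62e9, BitVector unroll=4 3.72e9, Vector{Bool} unroll=4 5.58e9
5.58e9 / 5.62e9   # ≈ 0.99: unroll=4 barely helps the Vector{Bool} version
3.72e9 / 4.34e9   # ≈ 0.86: it buys the BitVector version another ~14%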
I’m not sure what explains the difference in instructions per clock between the BitVector and Vector{Bool} versions.