# Fastest way to count unique elements in `Vector{Union{Bool, Missing})`

Given that `missing` is now in base I was testing ways to count unique element in a vector of boolean or missing. My code is below and I wonder if there is a faster way to achieve the count; I think not. I looked up the documentation for `missing` and only two functions were documented `skipmissing` and `ismissing`, I have used both.

Updated to include data generation and run as requested by @kristoffer.carlsson

``````# using Missings # not needed from 0.6.2 but was testing on next.juliabox.com

function fcount2(x::Vector{Union{Bool, Missing}})
lxsm = sum(ismissing, x)

x_skip_missing = skipmissing(x)
sumx = sum(x_skip_missing)

Dict{Union{Bool, Missing}, Int}(true => sumx, false => length(x) - lxsm - sumx,
missing => lxsm)
end

a = [true, false, missing]
bool = rand(a, 1_000_000_000)
@time fcount2(bools)
``````

A loop seems ~2.5x faster:

``````function my_count(x::Vector{Union{Bool, Missing}})
t, m = 0, 0
for v in x
ismissing(v) && (m +=1; continue)
v && (t += 1)
end
Dict{Union{Bool, Missing}, Int}(true => t, false => length(x) - t - m, missing => m)
end
``````
1 Like

Please also provide the input data you are using or how to generate it.

1 Like

My question is, why would you want to represent something that might be true, false, or missing this way?
This will likely take much more memory, and be a lot slower, than simply having two BitVectors, one to indicate the value, and another to indicate if there really is a value at that position.

I was thinking of a scenario where I am reading in a CSV file with true, false, missing values as one of the columns. I imagine `Union{Bool, Missing}` would be the `eltype` of the input array.

If I am new to Julia and I read in a CSV, this is what I get. Seems natural to want to solve this in fast way without “advanced” knowledge. The target is the 80%~90% of eventual Julia users who just want speed by default and does not yet know about those things that can be done.

I thought there was going to be some optimizations for `Union{T, Missing}` happening in 0.7 which would do roughly what you are describing. I know the fst package in R also uses the representation described effectively, actually it uses 2 bits.

On 0.7 the loop takes 60% of the time it does on 0.6 so some better optimizations there.

This is true. On 0.7, using two `BitVector`s is almost 2x faster than `Union{Bool, Missing}` and using two `Vector{Bool}` is over 3x faster (but they eat up more memory than `BitVector`s).

Did an experiment. The results were indeed very fast. ``````using Missings

# define an abstract array with two bits
struct TFM
tf::BitArray
ms::BitArray
end

# a function to count two bit arrays
function fcount_bitarray2(tfm1::TFM)
mscnt = sum(tfm1.ms)
tcnt = sum(tfm1.tf) - sum(tfm1.ms .& tfm1.tf)
Dict{Union{Bool, Missing}, Int}(true => tcnt, missing => mscnt, false => length(tfm1.ms) - tcnt - mscnt)
end

# create the data
a = [missing, true, false]
x = rand(a, 200_000_000)

y = Vector{Bool}(200_000_000)
ms = ismissing.(x)
y[ms] = rand(Bool,sum(ms))
y[.!ms] = x[.!ms]
tfm1 = TFM(y, ismissing.(x))

## run the below benchmarks on next.juliabox.com
gc()
@time res4 = fcount_bitarray2(tfm1) # 0.044 on next.juliabox.com
gc()
@time res2 = fcount2(x) # 4.28
gc()
@time res3 = my_count(x) # 3.24
gc()

res3 == res2 && res2 == res4
``````
2 Likes

Given how fast it is, I think it would be good to have fast method to convert from `Vector{Union{Bool, Missing}}` to the `TFM` (True/False/Missing vector) type.

Again wonder if there are even faster ways?

``````import Base.convert
function convert(::Type{TFM}, x::Vector{Union{Bool, Missing}})
y = Vector{Bool}(200_000_000)
ms = ismissing.(x)
y[.!ms] = x[.!ms]
tfm1 = TFM(y, ms)
tfm1
end

function convert_loop(::Type{TFM}, x::Vector{Union{Bool, Missing}})
y = Vector{Bool}(200_000_000)
for (i,xi) = enumerate(x)
ismissing(xi) || (y[i] = xi)
end
tfm1 = TFM(y, ismissing.(x))
tfm1
end

struct TFM
tf::BitArray
ms::BitArray
end

### data gen and tests
a = [true, false, missing]
x = rand(a, 200_000_000);

@time TFM(x); # 14.3s
@time convert_loop(TFM, x) #63s!!!!!
``````

Instead of `BitArray`, I think those should be `BitVector`

As far as making it as fast as possible would require using (on Intel/AMD) the AVX instructions, to operate on 128, 256, or even 512 bits at a time. (I used to write assembly code for these sorts of bit instructions )

I think `Vector{Bool}` being faster than `BitVector` in Julia must be some issue with not generating optimal code for `BitVector`. I’ve done a lot of work with bit vectors and sparse bit structures, and if the bit vectors are well optimized, I’ve never seen them slower than a vector of bytes.

I think we should probably only compare the `Union{Bool,Missing}` story with anything else on julia 0.7. Arrays of small unions were never meant to be fast on 0.6, so the advice there is pretty much simply don’t use them on 0.6 if you need speed.

Having said that, here is my attempt to speed up the `BitVector` version:

``````using BenchmarkTools
using Missings

n = 100_000

bool_data = rand(Bool, n)
bool_missing = rand(Bool, n);

bitvector_data = BitVector(bool_data);
bitvector_missing = BitVector(bool_missing)

data_union = Union{Bool,Missing}[i ? missing : i for i in zip(bool_data, bool_missing)]

function count_vector_bool(A::Vector{Bool}, B::Vector{Bool})
t,m = 0, 0
for i in 1:length(A)
B[i] && (m += 1; continue)
A[i] && (t += 1)
end
return t, length(A) - t - m, m
end

function count_missing(A::Vector{Union{Bool,Missing}})
t, m = 0, 0
for v in A
ismissing(v) && (m += 1; continue)
v && (t += 1)
end
return t, length(A) - t - m, m
end

function count_bitmap(A::BitVector, B::BitVector)
m = count(B)
t = 0
for i in 1:length(A.chunks)
t += count_ones(A.chunks[i] & ~B.chunks[i])
end
return t, length(A) - t - m, m
end

println("BitVector")
display(@benchmark count_bitmap(\$bitvector_data, \$bitvector_missing))
println()
println()
println("Bool")
display(@benchmark count_vector_bool(\$bool_data, \$bool_missing))
println()
println()
println("Union{Bool,Missing}")
display(@benchmark count_missing(\$data_union))
``````

On julia 0.6 I get these results:

``````BitVector
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     5.005 μs (0.00% GC)
median time:      5.385 μs (0.00% GC)
mean time:        6.092 μs (0.00% GC)
maximum time:     194.123 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     6

Bool
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     621.902 μs (0.00% GC)
median time:      673.981 μs (0.00% GC)
mean time:        736.019 μs (0.00% GC)
maximum time:     6.392 ms (0.00% GC)
--------------
samples:          6750
evals/sample:     1

Union{Bool,Missing}
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     1.483 ms (0.00% GC)
median time:      1.654 ms (0.00% GC)
mean time:        1.753 ms (0.00% GC)
maximum time:     3.022 ms (0.00% GC)
--------------
samples:          2840
evals/sample:     1
``````

On julia 0.7 I get:

``````BitVector
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     1.901 μs (0.00% GC)
median time:      2.053 μs (0.00% GC)
mean time:        2.298 μs (0.00% GC)
maximum time:     60.252 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     10

Bool
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     712.375 μs (0.00% GC)
median time:      775.098 μs (0.00% GC)
mean time:        848.171 μs (0.00% GC)
maximum time:     2.547 ms (0.00% GC)
--------------
samples:          5858
evals/sample:     1

Union{Bool,Missing}
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     965.166 μs (0.00% GC)
median time:      1.062 ms (0.00% GC)
mean time:        1.143 ms (0.00% GC)
maximum time:     2.021 ms (0.00% GC)
--------------
samples:          4355
evals/sample:     1
``````

Long story short: the bitvector is way, way, way faster than anything else. What is a bit surprising is that the `Union{Bool,Missing}` thing is not faster on julia 0.7, I would have hoped that it would at least be competitive with the two `Vector{Bool}` version. But maybe not all the optimizations for that case have made it into base, or maybe the way we’ve coded up that function is not ideal. Not sure.

EDIT: Initial version had an error, now corrected.

3 Likes

Yes, AFAIK all optimizations for small `Union` types have not been implemented yet.

Please don’t rehash this debate which has been going on for several years. We’ve decided to stop using `DataArray` and `NullableArray` to use standard `Array` instead, with optimizations to make it as fast as the custom approaches. We’re not there yet, but it’s already not too bad on 0.7 (and at least the memory representation is efficient). See the DataFrames 0.11 announcement post and all issues tagged with the `missing values`/`nullable` label for background.

I wasn’t attempting to rehash anything.
Somebody asked for help with what would be the fastest way to do something, to which I replied, and the results confirmed what I had said.

Also, just debating something doesn’t ensure that the best decision will be arrived at, especially when decisions are made based on who is making an argument, rather than what the evidence shows.

1 Like

Your suggestions to improve computation performance are welcome, what I object to is “why would you want to to represent something that might be true, false, or missing this way”. `Array{Union{T, Missing}` is the standard way of representing arrays with missing values in Julia 0.7, so the question was similar to asking “why would you use `Array{Bool}` to represent something that might be true or false”. @xiaodai’s original question about “the fastest way to count unique elements in `Vector{Union{T, Missing}`” is perfectly legitimate, and to address it one needs to provide solutions which start with a `Vector{Union{T, Missing}` object (which is e.g. the kind of object CSV.jl returns, as he noted).

What are you talking about? People who implemented `Array{Union{T, Misssing}}` support and optimizations are largely the same as those who developed `NullableArray` and `DataArray` before that. Do you suggest these people are arguing against themselves in an unfair fashion?

FWIW, it’s more efficient to avoid branches in this kind of loop. Calling `continue` is counter-productive since it prevents use of SIMD. This version appears to be faster than what was proposed above:

``````function count_vector_bool(A::Vector{Bool}, B::Vector{Bool})
t,m = 0, 0
@inbounds for i in 1:length(A)
m += B[i]
t += A[i]
end
return t, length(A) - t - m, m
end
``````

So this makes it an even hard goal to attain for `Union{Bool, Missing}`.

Unless optimizations can be implemented soon, an efficient approach would be to allow accessing the underlying components of `Array{Union{T, Missing}}` objects. See https://github.com/JuliaLang/julia/issues/24057.

2 Likes

While I really appreciate the usability advantage of not having to worry about different container types (as was instead the case with DataArrays and NullableArrays), in performance sensitive code it is useful to access the underlying `v.na` and `v.data` (in DataArray terminology). Apart from the example in this post, `v.data` seems like a decent way to get the concretely typed vector without copying in the case where there isn’t missing data. `view(v.data, find(!, v.na))` is probably also useful when instead there is missing data.

If the data layout of `Array{Union{T, Missing}}` is the same as `DataArray{T}`, then this sounds like a useful possibility.

1 Like

Your code shows advantages of `Union{Bool, Missing}` over two arrays . Simply put: `Union{Bool, Missing}` safeguards the programmer against mistakes.

``````julia> n = 100_000
100000

julia> bool_data = rand(Bool, n);

julia> bool_missing = rand(Bool, n);

julia> function count_vector_bool(A::Vector{Bool}, B::Vector{Bool})
t,m = 0, 0
@inbounds for i in 1:length(A)
m += B[i]
t += A[i]
end
return t, length(A) - t - m, m
end
count_vector_bool (generic function with 1 method)

julia> count_vector_bool(bool_data, bool_missing)
(50108, -339, 50231)
``````

and we have learned that we had `-339` `false`s in the data.

Probably something like this would be faster to avoid branching (I give two options, I have not optimized it):

``````function f(A::Vector{Bool}, B::Vector{Bool})
v = [0, 0, 0]
@inbounds for i in 1:length(A)
v[ifelse(B[i], 1, ifelse(A[i], 2, 3))] += 1
# other option:
# v[1+(!B[i])*(1+A[i])] += 1
end
return v
end
``````
1 Like

Here is a new version. This one takes inspiration from @nalimilan’s version in that it tries to avoid a branch, but fixes the code to return correct results. I also tried @bkamins’ version, but that was quite a bit slower. I also added `@inbounds` to all versions. So, the complete new benchmark looks like this:

``````using BenchmarkTools
using Missings

n = 100_000

bool_data = rand(Bool, n)
bool_missing = rand(Bool, n);

bitvector_data = BitVector(bool_data);
bitvector_missing = BitVector(bool_missing)

data_union = Union{Bool,Missing}[i ? missing : i for i in zip(bool_data, bool_missing)]

function count_vector_bool(A::Vector{Bool}, B::Vector{Bool})
t,m = 0, 0
@inbounds for i in 1:length(A)
m += B[i]
t += A[i] & !B[i]
end
return t, length(A) - t - m, m
end

function count_missing(A::Vector{Union{Bool,Missing}})
t, m = 0, 0
@inbounds for v in A
ismissing(v) && (m += 1; continue)
v && (t += 1)
end
return t, length(A) - t - m, m
end

function count_bitmap(A::BitVector, B::BitVector)
m = count(B)
t = 0
@inbounds for i in 1:length(A.chunks)
t += count_ones(A.chunks[i] & ~B.chunks[i])
end
return t, length(A) - t - m, m
end

println("BitVector")
display(@benchmark count_bitmap(\$bitvector_data, \$bitvector_missing))
println()
println()
println("Bool")
display(@benchmark count_vector_bool(\$bool_data, \$bool_missing))
println()
println()
println("Union{Bool,Missing}")
display(@benchmark count_missing(\$data_union))
``````

I get much better timings on julia 0.7 for both the bitvector and the bool vector version than on julia 0.6. So I’ll only post the 0.7 results here:

``````BitVector
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     890.316 ns (0.00% GC)
median time:      960.342 ns (0.00% GC)
mean time:        1.094 μs (0.00% GC)
maximum time:     19.327 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     38

Bool
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     46.756 μs (0.00% GC)
median time:      50.178 μs (0.00% GC)
mean time:        55.851 μs (0.00% GC)
maximum time:     1.176 ms (0.00% GC)
--------------
samples:          10000
evals/sample:     1

Union{Bool,Missing}
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     972.008 μs (0.00% GC)
median time:      1.106 ms (0.00% GC)
mean time:        1.187 ms (0.00% GC)
maximum time:     11.711 ms (0.00% GC)
--------------
samples:          4188
evals/sample:     1
``````

High level summary is that @nalimilan’s approach to avoid branches really pays off a lot. The bitvector version is still about 50 times faster than a version based on a vector of bools.

2 Likes

update dthe code below thanks to davidanthoff for pointing out bug; remvoed benchmark; will add back after work
The below is slower

``````function count_missing2(A::Vector{Union{Bool,Missing}})
t, m = 0, 0
@inbounds for v in A
ismissingv = ismissing(v)
m += ismissingv
if !ismissingv
t += v
end
end
return t, length(A) - t - m, m
end
``````