Fastest way to count unique elements in `Vector{Union{Bool, Missing})`


#1

Given that missing is now in base I was testing ways to count unique element in a vector of boolean or missing. My code is below and I wonder if there is a faster way to achieve the count; I think not. I looked up the documentation for missing and only two functions were documented skipmissing and ismissing, I have used both.

Updated to include data generation and run as requested by @kristoffer.carlsson

# using Missings # not needed from 0.6.2 but was testing on next.juliabox.com

function fcount2(x::Vector{Union{Bool, Missing}})
    lxsm = sum(ismissing, x)
    
    x_skip_missing = skipmissing(x)
    sumx = sum(x_skip_missing)
    
    Dict{Union{Bool, Missing}, Int}(true => sumx, false => length(x) - lxsm - sumx, 
        missing => lxsm)
end

a = [true, false, missing]
bool = rand(a, 1_000_000_000)
@time fcount2(bools)

#2

A loop seems ~2.5x faster:

function my_count(x::Vector{Union{Bool, Missing}})
    t, m = 0, 0
    for v in x
        ismissing(v) && (m +=1; continue)
        v && (t += 1)
    end 
    Dict{Union{Bool, Missing}, Int}(true => t, false => length(x) - t - m, missing => m)
end

#3

Please also provide the input data you are using or how to generate it.


#4

My question is, why would you want to represent something that might be true, false, or missing this way?
This will likely take much more memory, and be a lot slower, than simply having two BitVectors, one to indicate the value, and another to indicate if there really is a value at that position.


#5

I was thinking of a scenario where I am reading in a CSV file with true, false, missing values as one of the columns. I imagine Union{Bool, Missing} would be the eltype of the input array.

If I am new to Julia and I read in a CSV, this is what I get. Seems natural to want to solve this in fast way without “advanced” knowledge. The target is the 80%~90% of eventual Julia users who just want speed by default and does not yet know about those things that can be done.

I thought there was going to be some optimizations for Union{T, Missing} happening in 0.7 which would do roughly what you are describing. I know the fst package in R also uses the representation described effectively, actually it uses 2 bits.


#6

On 0.7 the loop takes 60% of the time it does on 0.6 so some better optimizations there.


#7

This is true. On 0.7, using two BitVectors is almost 2x faster than Union{Bool, Missing} and using two Vector{Bool} is over 3x faster (but they eat up more memory than BitVectors).


#8

Did an experiment. The results were indeed very fast.

newplot (1)

using Missings

# define an abstract array with two bits
struct TFM
    tf::BitArray
    ms::BitArray
end

# a function to count two bit arrays
function fcount_bitarray2(tfm1::TFM)
    mscnt = sum(tfm1.ms)
    tcnt = sum(tfm1.tf) - sum(tfm1.ms .& tfm1.tf)
    Dict{Union{Bool, Missing}, Int}(true => tcnt, missing => mscnt, false => length(tfm1.ms) - tcnt - mscnt)
end

# create the data
a = [missing, true, false]
x = rand(a, 200_000_000)

y = Vector{Bool}(200_000_000)
ms = ismissing.(x)
y[ms] = rand(Bool,sum(ms))
y[.!ms] = x[.!ms]
tfm1 = TFM(y, ismissing.(x))


## run the below benchmarks on next.juliabox.com
gc()
@time res4 = fcount_bitarray2(tfm1) # 0.044 on next.juliabox.com
gc()
@time res2 = fcount2(x) # 4.28 
gc()
@time res3 = my_count(x) # 3.24
gc()

res3 == res2 && res2 == res4

#9

Given how fast it is, I think it would be good to have fast method to convert from Vector{Union{Bool, Missing}} to the TFM (True/False/Missing vector) type.

Again wonder if there are even faster ways?

import Base.convert
function convert(::Type{TFM}, x::Vector{Union{Bool, Missing}})
    y = Vector{Bool}(200_000_000)
    ms = ismissing.(x)    
    y[.!ms] = x[.!ms]
    tfm1 = TFM(y, ms)
    tfm1
end

function convert_loop(::Type{TFM}, x::Vector{Union{Bool, Missing}})
    y = Vector{Bool}(200_000_000)
    for (i,xi) = enumerate(x)
        ismissing(xi) || (y[i] = xi)
    end
    tfm1 = TFM(y, ismissing.(x))
    tfm1
end

struct TFM
    tf::BitArray
    ms::BitArray
end

### data gen and tests
a = [true, false, missing]
x = rand(a, 200_000_000);

@time TFM(x); # 14.3s
@time convert_loop(TFM, x) #63s!!!!!

#10

Instead of BitArray, I think those should be BitVector

As far as making it as fast as possible would require using (on Intel/AMD) the AVX instructions, to operate on 128, 256, or even 512 bits at a time. (I used to write assembly code for these sorts of bit instructions :slight_smile: )


#11

I think Vector{Bool} being faster than BitVector in Julia must be some issue with not generating optimal code for BitVector. I’ve done a lot of work with bit vectors and sparse bit structures, and if the bit vectors are well optimized, I’ve never seen them slower than a vector of bytes.


#12

I think we should probably only compare the Union{Bool,Missing} story with anything else on julia 0.7. Arrays of small unions were never meant to be fast on 0.6, so the advice there is pretty much simply don’t use them on 0.6 if you need speed.

Having said that, here is my attempt to speed up the BitVector version:

using BenchmarkTools
using Missings

n = 100_000

bool_data = rand(Bool, n)
bool_missing = rand(Bool, n);

bitvector_data = BitVector(bool_data);
bitvector_missing = BitVector(bool_missing)

data_union = Union{Bool,Missing}[i[2] ? missing : i[1] for i in zip(bool_data, bool_missing)]

function count_vector_bool(A::Vector{Bool}, B::Vector{Bool})
    t,m = 0, 0
    for i in 1:length(A)
        B[i] && (m += 1; continue)
        A[i] && (t += 1)
    end
    return t, length(A) - t - m, m
end

function count_missing(A::Vector{Union{Bool,Missing}})
    t, m = 0, 0
    for v in A
        ismissing(v) && (m += 1; continue)
        v && (t += 1)
    end
    return t, length(A) - t - m, m
end

function count_bitmap(A::BitVector, B::BitVector)                              
    m = count(B)                                          
    t = 0                                                   
    for i in 1:length(A.chunks)                           
        t += count_ones(A.chunks[i] & ~B.chunks[i])         
    end                                                   
    return t, length(A) - t - m, m
end                                                     

println("BitVector")
display(@benchmark count_bitmap($bitvector_data, $bitvector_missing))
println()
println()
println("Bool")
display(@benchmark count_vector_bool($bool_data, $bool_missing))
println()
println()
println("Union{Bool,Missing}")
display(@benchmark count_missing($data_union))

On julia 0.6 I get these results:

BitVector                                    
BenchmarkTools.Trial:                        
  memory estimate:  0 bytes                  
  allocs estimate:  0                        
  --------------                             
  minimum time:     5.005 μs (0.00% GC)      
  median time:      5.385 μs (0.00% GC)      
  mean time:        6.092 μs (0.00% GC)      
  maximum time:     194.123 μs (0.00% GC)    
  --------------                             
  samples:          10000                    
  evals/sample:     6                        
                                             
Bool                                         
BenchmarkTools.Trial:                        
  memory estimate:  0 bytes                  
  allocs estimate:  0                        
  --------------                             
  minimum time:     621.902 μs (0.00% GC)    
  median time:      673.981 μs (0.00% GC)    
  mean time:        736.019 μs (0.00% GC)    
  maximum time:     6.392 ms (0.00% GC)      
  --------------                             
  samples:          6750                     
  evals/sample:     1                        
                                             
Union{Bool,Missing}                          
BenchmarkTools.Trial:                        
  memory estimate:  0 bytes                  
  allocs estimate:  0                        
  --------------                             
  minimum time:     1.483 ms (0.00% GC)      
  median time:      1.654 ms (0.00% GC)      
  mean time:        1.753 ms (0.00% GC)      
  maximum time:     3.022 ms (0.00% GC)      
  --------------                             
  samples:          2840                     
  evals/sample:     1                                                

On julia 0.7 I get:

BitVector                                  
BenchmarkTools.Trial:                      
  memory estimate:  0 bytes                
  allocs estimate:  0                      
  --------------                           
  minimum time:     1.901 μs (0.00% GC)    
  median time:      2.053 μs (0.00% GC)    
  mean time:        2.298 μs (0.00% GC)    
  maximum time:     60.252 μs (0.00% GC)   
  --------------                           
  samples:          10000                  
  evals/sample:     10                     
                                           
Bool                                       
BenchmarkTools.Trial:                      
  memory estimate:  0 bytes                
  allocs estimate:  0                      
  --------------                           
  minimum time:     712.375 μs (0.00% GC)  
  median time:      775.098 μs (0.00% GC)  
  mean time:        848.171 μs (0.00% GC)  
  maximum time:     2.547 ms (0.00% GC)    
  --------------                           
  samples:          5858                   
  evals/sample:     1                      
                                           
Union{Bool,Missing}                        
BenchmarkTools.Trial:                      
  memory estimate:  0 bytes                
  allocs estimate:  0                      
  --------------                           
  minimum time:     965.166 μs (0.00% GC)  
  median time:      1.062 ms (0.00% GC)    
  mean time:        1.143 ms (0.00% GC)    
  maximum time:     2.021 ms (0.00% GC)    
  --------------                           
  samples:          4355                   
  evals/sample:     1                                   

Long story short: the bitvector is way, way, way faster than anything else. What is a bit surprising is that the Union{Bool,Missing} thing is not faster on julia 0.7, I would have hoped that it would at least be competitive with the two Vector{Bool} version. But maybe not all the optimizations for that case have made it into base, or maybe the way we’ve coded up that function is not ideal. Not sure.

EDIT: Initial version had an error, now corrected.


#13

Yes, AFAIK all optimizations for small Union types have not been implemented yet.

Please don’t rehash this debate which has been going on for several years. We’ve decided to stop using DataArray and NullableArray to use standard Array instead, with optimizations to make it as fast as the custom approaches. We’re not there yet, but it’s already not too bad on 0.7 (and at least the memory representation is efficient). See the DataFrames 0.11 announcement post and all issues tagged with the missing values/nullable label for background.


#14

I wasn’t attempting to rehash anything.
Somebody asked for help with what would be the fastest way to do something, to which I replied, and the results confirmed what I had said.

Also, just debating something doesn’t ensure that the best decision will be arrived at, especially when decisions are made based on who is making an argument, rather than what the evidence shows.


#15

Your suggestions to improve computation performance are welcome, what I object to is “why would you want to to represent something that might be true, false, or missing this way”. Array{Union{T, Missing} is the standard way of representing arrays with missing values in Julia 0.7, so the question was similar to asking “why would you use Array{Bool} to represent something that might be true or false”. @xiaodai’s original question about "the fastest way to count unique elements in Vector{Union{T, Missing}" is perfectly legitimate, and to address it one needs to provide solutions which start with a Vector{Union{T, Missing} object (which is e.g. the kind of object CSV.jl returns, as he noted).

What are you talking about? People who implemented Array{Union{T, Misssing}} support and optimizations are largely the same as those who developed NullableArray and DataArray before that. Do you suggest these people are arguing against themselves in an unfair fashion?


#16

FWIW, it’s more efficient to avoid branches in this kind of loop. Calling continue is counter-productive since it prevents use of SIMD. This version appears to be faster than what was proposed above:

function count_vector_bool(A::Vector{Bool}, B::Vector{Bool})
    t,m = 0, 0
    @inbounds for i in 1:length(A)
        m += B[i]
        t += A[i]
    end
    return t, length(A) - t - m, m
end

So this makes it an even hard goal to attain for Union{Bool, Missing}.

Unless optimizations can be implemented soon, an efficient approach would be to allow accessing the underlying components of Array{Union{T, Missing}} objects. See https://github.com/JuliaLang/julia/issues/24057.


#17

While I really appreciate the usability advantage of not having to worry about different container types (as was instead the case with DataArrays and NullableArrays), in performance sensitive code it is useful to access the underlying v.na and v.data (in DataArray terminology). Apart from the example in this post, v.data seems like a decent way to get the concretely typed vector without copying in the case where there isn’t missing data. view(v.data, find(!, v.na)) is probably also useful when instead there is missing data.

If the data layout of Array{Union{T, Missing}} is the same as DataArray{T}, then this sounds like a useful possibility.


#18

Your code shows advantages of Union{Bool, Missing} over two arrays :slight_smile: . Simply put: Union{Bool, Missing} safeguards the programmer against mistakes.

julia> n = 100_000
100000

julia> bool_data = rand(Bool, n);

julia> bool_missing = rand(Bool, n);

julia> function count_vector_bool(A::Vector{Bool}, B::Vector{Bool})
           t,m = 0, 0
           @inbounds for i in 1:length(A)
               m += B[i]
               t += A[i]
           end
           return t, length(A) - t - m, m
       end
count_vector_bool (generic function with 1 method)

julia> count_vector_bool(bool_data, bool_missing)
(50108, -339, 50231)

and we have learned that we had -339 falses in the data.

Probably something like this would be faster to avoid branching (I give two options, I have not optimized it):

function f(A::Vector{Bool}, B::Vector{Bool})
    v = [0, 0, 0]
    @inbounds for i in 1:length(A)
        v[ifelse(B[i], 1, ifelse(A[i], 2, 3))] += 1
        # other option:
        # v[1+(!B[i])*(1+A[i])] += 1
    end
    return v
end

#19

Here is a new version. This one takes inspiration from @nalimilan’s version in that it tries to avoid a branch, but fixes the code to return correct results. I also tried @bkamins’ version, but that was quite a bit slower. I also added @inbounds to all versions. So, the complete new benchmark looks like this:

using BenchmarkTools
using Missings

n = 100_000

bool_data = rand(Bool, n)
bool_missing = rand(Bool, n);

bitvector_data = BitVector(bool_data);
bitvector_missing = BitVector(bool_missing)

data_union = Union{Bool,Missing}[i[2] ? missing : i[1] for i in zip(bool_data, bool_missing)]

function count_vector_bool(A::Vector{Bool}, B::Vector{Bool})
    t,m = 0, 0
    @inbounds for i in 1:length(A)
        m += B[i]
        t += A[i] & !B[i]
    end
    return t, length(A) - t - m, m
end

function count_missing(A::Vector{Union{Bool,Missing}})
    t, m = 0, 0
    @inbounds for v in A
        ismissing(v) && (m += 1; continue)
        v && (t += 1)
    end
    return t, length(A) - t - m, m
end

function count_bitmap(A::BitVector, B::BitVector)                              
    m = count(B)                                          
    t = 0                                                   
    @inbounds for i in 1:length(A.chunks)                           
        t += count_ones(A.chunks[i] & ~B.chunks[i])         
    end                                                   
    return t, length(A) - t - m, m
end                                                     

println("BitVector")
display(@benchmark count_bitmap($bitvector_data, $bitvector_missing))
println()
println()
println("Bool")
display(@benchmark count_vector_bool($bool_data, $bool_missing))
println()
println()
println("Union{Bool,Missing}")
display(@benchmark count_missing($data_union))

I get much better timings on julia 0.7 for both the bitvector and the bool vector version than on julia 0.6. So I’ll only post the 0.7 results here:

BitVector
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     890.316 ns (0.00% GC)
  median time:      960.342 ns (0.00% GC)
  mean time:        1.094 μs (0.00% GC)
  maximum time:     19.327 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     38

Bool
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     46.756 μs (0.00% GC)
  median time:      50.178 μs (0.00% GC)
  mean time:        55.851 μs (0.00% GC)
  maximum time:     1.176 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

Union{Bool,Missing}
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     972.008 μs (0.00% GC)
  median time:      1.106 ms (0.00% GC)
  mean time:        1.187 ms (0.00% GC)
  maximum time:     11.711 ms (0.00% GC)
  --------------
  samples:          4188
  evals/sample:     1

High level summary is that @nalimilan’s approach to avoid branches really pays off a lot. The bitvector version is still about 50 times faster than a version based on a vector of bools.


#20

update dthe code below thanks to davidanthoff for pointing out bug; remvoed benchmark; will add back after work
The below is slower

function count_missing2(A::Vector{Union{Bool,Missing}})
    t, m = 0, 0
    @inbounds for v in A
        ismissingv = ismissing(v)
        m += ismissingv
        if !ismissingv
            t += v
        end
    end
    return t, length(A) - t - m, m
end