Fastest way to count unique elements in `Vector{Union{Bool, Missing})`

I think that version is incorrect, you are now counting any value that is not missing as true. Note also that the if almost certainly still introduces a branch.

1 Like

R is pretty quick for this task. I don’t have Julia installed on the test computer but I estimate it’s 2x-3x faster

code here

a = sample(c(T,F, NA), 1e8, replace = T)

cnta = function(a) {
  na = sum(is.na(a))
  t = sum(a, na.rm=T)
  list(na = na, true = t, false = length(a) - na - t)
}

system.time(cnta(a))

Relevant issues are:
https://github.com/JuliaLang/julia/issues/23336
https://github.com/JuliaLang/julia/issues/23338

1 Like

If I make n much larger (so that the benchmark is more meaningful as a measure of performance, rather than cache size), I see this drop to about 15-25x. I think LLVM also is missing a couple of possible vectorization optimization (it’s doing 4xi8, but if we rearrange the problem, it notices it can do 4 x 4xi8 – and I think it could merge those, and go even wider). This should make it faster again (the rearrangement seems to make it about 2x faster, so fully vectorized should be even better). We should fully expect that the BitVector version is fastest for this operation, since this really is the algorithm that representation is most optimized for. But for non-vectorizing algorithms, BitVector performance can be just downright terrible (relative to a Bool Vector). And much of it’s current benchmark performance advantage above appears to be due to using a biased micro-benchmark that’s comparing an hand-optimized bit-vector implementation against a moderately optimized bool-vector implementation.

And as nalimilan pointed out, the count_missing isn’t optimized at all (lots of notable issues with code_warntype and code_llvm). We already know that someone needs to teach our system how to handle these better.

bkamins is correct that avoiding branches is good, but that code is avoiding the wrong branch (this version would inhibit the more important mem2reg optimization). You would want to write m += ifelse(B[i], 1, 0), but LLVM also easily recognizes that this branch can be removed (as being equivalent to B[i] & 1), so it’s not necessary.

3 Likes

Vectorized operations on 3 things in a computer is similar to vectorized operations of 4 things if packed into a small representation. Bioinformatics is a field that ends to deal with very long sequences of 4 things (ACGT or ACGU), and not surprisingly Bio.jl has some special tools for this.

http://biojulia.net/Bio.jl/latest/man/seq/symbols/#DNA-and-RNA-nucleotides-1

I think it would be interesting to try an algorithm that uses their special lower-bit DNA representations as a way to encode true/false/missing, and see what happens.

my google-fu only revealed that Union{T, Missing} has some efficient memory layout. I wonder if a pointer to the vector of union type and some bit twiddling can help make an fast algorithm?

update
got to get to work now but this works on v0.7. To count 2 billion records it’s 2.5 seconds vs 20 seconds before on my computer; so ~9x improvement!!!

function bittwiddling(A::Vector{Union{Bool,Missing}})
    ptra = Ptr{UInt8}(pointer(A))
    l = length(A)

    # load the missing
    nbools = sum(unsafe_load.(ptra .+ (l:l+l-1))) |> Int
    t = sum(unsafe_load.(ptra .+ (0:l-1))) |> Int

    t, nbools - t, l - nbools
end

a = [missing, true, false]
A = rand(a, 2^31-1)
@time bittwiddling(A) # 
4 Likes

Could you explain why that is the case? In my experience, the only time they are the same performance (and I’ve never seen packed bits be terrible relative to a vector of bytes), is when you are doing random access to the bits. In that case, both ways have a single memory access (which may involve reading a cache line in), and for packed bits, a single cycle to get the bit out of the loaded register. If there is any locality of reference at all, then having 8 times more bits in each cache block will make a big difference (something that I’ve found to be a very common case when accessing bits “randomly” (there are often a cluster of related bits)

a single cycle to get the bit out of the loaded register

What is this, a 1980’s PIC? A modern processor can do quite a lot in 1 cycle. For instance, I can issue two loads in that same amount of time (http://agner.org/optimize/instruction_tables.pdf), so you could have just cut throughput in third. It very probably didn’t, since there are so many other factors that can come into play (for starters, caches, pipelining, vectorization). I should install the Intel tool for computing these correctly, but too lazy right now to bother. So here’s just the assembly, manually annotated from the approximate information from Agner’s tables for Haswell.

@code_native Base.unsafe_bitgetindex(v.chunks, 1)
	sarq	$6, %rax            # 2/cycle
	addl	$63, %esi           # 4/cycle
	movq	(%rdi), %rcx        # 2/cycle
	movq	(%rcx,%rax,8), %rax # 2/cycle
	btq	%rsi, %rax              # 1/cycle

@code_native @inbounds getindex(v.chunks, 1)
	movq	(%rax), %rax          # 2/cycle
	movq	-8(%rax,%rsi,8), %rax # 2/cycle

This is, of course, a terribly inaccurate representation of the code a compiler would emit for either of those operations, as I’ve completely taken them out of context and their surrounding loop. And throughput also a fairly useless metric for estimating application performance (since it says nothing about uops/ports, instruction decoding, register usage, and memory-stalls). So lets just call this a silly teaser.

But also, this all so far assumes that your array is read-only and the only data you have is 3VL. Since Julia is a general-purpose language, I also tried the above benchmarks for summing bytes instead of counting bool (so that the fancy hand-written vectorization trick above would be harder to write), and the result was that the bitvector representation was a bit slower than the current representation (my observations placed it at around 50-100%). So yes, it’s not hard to show that doing more work can indeed be slower (and yes, this was true whether or not the data fit in cache – I used a range of sizes from 100 MB byte-mask / 12 MB bit-mask down to 100 KB byte-mask).

   function sum_bitmap(A::Vector, B::BitVector)
       t = 0
       @inbounds for i in 1:length(A)
           a = A[i]
           t += ifelse(B[i], a, zero(a))
       end                                                   
       return t
   end
   function sum_vector_bool(A::Vector, B::Vector{Bool})
       t, m = 0, 0
       @inbounds for i in 1:length(A)
           a = A[i]
           b = B[i]
           m += b
           t += ifelse(b, a, zero(a))
       end
       return t, length(A) - t - m, m
   end

xiaodai bittwiddling code

Looking at that code, I think you forgot to mask t with nbools? Anyways, that’s just the same as count_vector_bool + an extra initial copy of each vector. I fully expect to further teach the compiler to handle union-split vectors more intelligently before the 1.0 release (after the v0.7 freeze), such that the count_missing emits the same code as count_vector_bool. I’d also be happy to review PRs from anyone who wants to help with this, as there’s still lots of interesting work to be done and low-hanging fruit for improvements.

a single cycle to get the bit out of the loaded register

What is this, a '90s 8-bit microcontroller? A modern processor can do quite a lot in 1 cycle. For instance, my processor can issue additional two loads in that same amount of time (http://agner.org/optimize/instruction_tables.pdf), so you could have just cut the throughput in third. It very probably didn’t, since there are so many other factors that can come into play (for starters, caches, pipelining, vectorization). I should install the Intel tool for computing these correctly, but too lazy right now to bother. So here’s just the assembly, manually annotated from the approximate information from Agner’s tables for Haswell.

@code_native Base.unsafe_bitgetindex(v.chunks, 1)
	sarq	$6, %rax            # 2/cycle
	addl	$63, %esi           # 4/cycle
	movq	(%rdi), %rcx        # 2/cycle
	movq	(%rcx,%rax,8), %rax # 2/cycle
	btq	%rsi, %rax              # 1/cycle

@code_native @inbounds getindex(v.chunks, 1)
	movq	(%rax), %rax          # 2/cycle
	movq	-8(%rax,%rsi,8), %rax # 2/cycle

This is, of course, a terribly inaccurate representation of the code a compiler would emit for either of those operations, as I’ve completely taken them out of context and their surrounding loop. And throughput also a fairly useless metric for estimating application performance (since it says nothing about uops/ports, instruction decoding, register usage, and memory-stalls). So lets just call this a silly teaser.

But also, this all so far assumes that your array is read-only and the only data you have is 3VL. Since Julia is a general-purpose language, I also tried the above benchmarks for summing bytes instead of counting bool (so that the fancy hand-written vectorization trick above would be harder to write), and the result was that the bitvector representation was a bit slower than the current representation (my observations placed it at around 50-100%). So yes, it’s not hard to show that doing more work can indeed be slower (and yes, this was true whether or not the data fit in cache – I used a range of sizes from 100 MB byte-mask / 12 MB bit-mask down to 100 KB byte-mask).

   function sum_bitmap(A::Vector, B::BitVector)
       t = 0
       @inbounds for i in 1:length(A)
           a = A[i]
           t += ifelse(B[i], a, zero(a))
       end                                                   
       return t
   end
   function sum_vector_bool(A::Vector, B::Vector{Bool})
       t, m = 0, 0
       @inbounds for i in 1:length(A)
           a = A[i]
           b = B[i]
           m += b
           t += ifelse(b, a, zero(a))
       end
       return t, length(A) - t - m, m
   end

xiaodai bittwiddling code

Looking at that code, I think you forgot to mask t with nbools? Anyways, that’s just the same as count_vector_bool + an extra initial copy of each vector. I fully expect to further teach the compiler to handle union-split vectors more intelligently before the 1.0 release (after the v0.7 freeze), such that the count_missing emits the same code as count_vector_bool. I’d also be happy to review PRs from anyone who wants to help with this, as there’s still lots of interesting work to be done and low-hanging fruit for improvements.

good point was thinking about that as soon as i ledt fpr work. the issue is that we cant assume the bits that are trues are truely trues with checking if it’s a bool first.

Couldn’t help if for sum_bits of long arrays we sum bytes (or UInt16) using lookup table?

I know about caches, pipelining, vectorization. Please explain how one ALU instruction, which would normally end up executing at the same time as a load or other instruction, would cut throughput.
The back to back loads are what will likely hurt performance, I would think, even with both in cache, because the second can’t execute until the first is complete.

Also, if you really want to quickly count the bits, you’d use SSE2, AVX, or even AVX-512 (if available) instructions, not a loop like that. There is the popcnt instruction, and a faster way (depending on processor) using pshufb.

The question was about calculating the number of false, true, and missing bits?
For one thing, I’d make sure that all bits that value were true in the missing bitvector were false in the true/false vector.
Then you can simply do a very fast count of the set bits in the missing bitvector, do a fast count of the number of 1 bits in the value vector, and the number of false bits is the bit length - number 1 bits - number missing bits.

All of that can be heavily optimized with the SSE/AVX etc. instructions, but even without them,
you can still operate 64-bits at a time. If you know you have a lot to do (for example, a 64K bit bit map),
then you can write a very optimized loop that interleaves loading registers up with calculating the number bits in each 64/128/256 or 512 bit register and summing them up.

Nobody here suggests that there is a faster way of counting true/false/missing than popcount (or weird tricks because of too few ports for popcount) in two bitvectors.

What at-jameson remarked is that the choice of representation is a trade-off between the many things you want to do with your data. And for many cases, a bool ends up faster than a single bit, obviously depending both on code-context and on what you actually want to do.

That being said, are there any plans on trying to bit-stuff Union{T,missing}? For T=Bool, one could imagine using the representation 0=false, 1=true, 2=missing. For T non-inline stored, one could maybe abuse some part of the pointer? (e.g. 0 = #undef, -1 = missing, else=object)
For structs with padding, one could use the padding. I’m not sure whether there is a reasonable way to appropriate some value for missing in float. What semantics of NaN-payloads does julia currently guarantee? Maybe one could steal one for missing?

My contention is that that is more an artifact of the generated code in Julia, rather than an underlying case where using packed bits are really slower than using bytes.

Could you please give an example where, at the assembly level, where picking a bit from an aligned 64-bit word will be slower than getting a byte (when the code is inlined, so that the loads and operations can overlap with other instructions)? What are the many cases?

I think in that case, I’d usually pick something that kept the bits separate, to make things like popcnt or other SIMD instructions faster. (This is if you want the sets/gets/ and things like popcnt as fast as possible, not if you want to follow some general pattern for a vector of Union{Bool, missing}).

I think for best efficiency, I might have a structure, where you had aligned 64 byte blocks (for best caching efficiency, that represents 256 bits of false/true in the first 32 bytes, and 256 bits of present/missing in the second block. That way both halves can be loaded into AVX registers, the operations done on each half all at once. Does anybody have an approach that they think would be faster, both for random access to both bits, and for fast processing of them using SIMD instructions?

For pointers, Julia already plays with the low bits internally.
If the smallest allocation size is aligned 16 byte chunks, then each pointer has 4 bits that can be used, so you can have the bottom 4 bits used as an index into a type vector, and the 60 bits of payload (or 28 on 32-bit machines) could be used either to store a pointer or a bits type. This would be very useful for implementing fast arbitrary sized numeric types, or short strings (i.e. store up to 7 bytes, with 4 bits more than enough to store the size, type (ASCII or UTF8, etc. or 3 words), in addition to a pointer to memory for a longer string or bigint, big decimal (like Java), or ArbFloats.

I retract my statement. I cannot currently give you an example/context where picking a bit from an aligned 64-bit word will be significantly slower than getting a byte (supposing that the compiler does not emit silly code). The latter is a big caveat.

I was thinking about the latter, and also about whether onr can quickly convert Vector{Union{T, missing}} to Vector{T}.

I know; but I think we cannot, currently, take these bits from 0=#undef because everything (including gc) tests for CNull when deciding whether a pointer is dereferencable (please correct me if I’m wrong). Also, I am not sure whether there is still a flag combination left for “this is not a pointer”. More brutally, one could maybe steal the top bit from object references on 64bit archs (should belong to kernel space anyway, right?).

Inlined tiny strings and inlined small bigints are quite attractive. As are inlined small vectors (this juicy half cache line after the array struct looks so tempting).

Since this is the sort of thing that I always assembly optimized on lots of different architectures, I never worried about what a bad compiler might do. Since the Julia community is in charge of what the Julia compiler does (and has lots of smart people working on it!), I think we should start out by seeing what the best possible performance is in machine code, and then try to figure out how to make the compiler generate that,
instead of deciding what data structures to use based on what is currently faster (in some cases) due to the compiler.
(I trust Jameson and others to be able to make things faster - but if we are already tied to a data structure that is intrinsically slower [using 8x as much memory will do that!], then there won’t be much that can be done in the future.

I think there would need to be some sort of UnionPtrBits type, built into the system, that could handle up to 16 different types, and those would indicate to the GC whether the upper 60 bits represented a pointer value to be handled by the GC, or some opaque primitive bit type.