Hand-written loop slower than BitVector broadcast


MWE below

I have a use case for intersection of lists of sorted integer vectors with a vector of sorted integers i.e.

intersect!(vec::Vector{Int}, vectorOfVectors::Vector{Vector{Int}})

Because the vectorOfVectors can be quite large in and of itself, and because I can’t always “edit” my input vector, it tends to be better to iterate over vectorOfVectors and intersect into an intermediary BitVector/Vector that spans the original vec length and records whether an intersection was found:

    local commonPopulated = 0
    local commonOutput = falses(length(vec))  # one flag per entry of vec
    for list in vectorOfVectors
        if ismissing(list) continue end
        commonPopulated += sorted_intersect(vec, list, commonOutput)
    end

After this, commonOutput flags all intersections of vec with the vectors in vectorOfVectors, and commonPopulated counts them.

Because I care about the intersection with vec only, I wanted to reuse it and resize it based on the flags found. Below are my own attempts vs the inbuilt `(vector)[bitVector]` indexing, and my question is: why is mine so much slower, even though I reuse the memory vs the ~1.9 MiB allocated by the inbuilt version?

using BenchmarkTools

@inline function shiftBitFlags!(inputIndices, bitVector)

    local swapIndex = 0
    local comparisonValue = one(eltype(inputIndices))

    @inbounds @simd for ind in eachindex(inputIndices)
        if bitVector[ind] == comparisonValue
            swapIndex += 1
            if swapIndex != ind
                inputIndices[ind], inputIndices[swapIndex] = inputIndices[swapIndex], inputIndices[ind]
            end
        end
    end
    return swapIndex
end

@inline function shiftBitFlagsAndResize!(inputIndices, bitVector)
    swapped = shiftBitFlags!(inputIndices, bitVector)
    resize!(inputIndices, swapped)
    return swapped
end

bFalses = falses(1000000)
zInts = zeros(Int32,1000000)

for i in eachindex(bFalses)
    if rand(0:1) == 1
        bFalses[i] = 1
        zInts[i] = 1
    end
end

a = collect(Int32[i for i in 1:1000000])
f = (a)[bFalses]

@btime ($a)[$bFalses] #563.100 μs (2 allocations: 1.91 MiB)
@btime shiftBitFlagsAndResize!(c,$zInts) setup=(c =collect(Int32[i for i in 1: 1000000])) #  3.051 ms (0 allocations: 0 bytes)
@btime shiftBitFlagsAndResize!(c,$bFalses) setup=(c =collect(Int32[i for i in 1: 1000000])) #3.046 ms (0 allocations: 0 bytes)
cShiftInts =collect(Int32[i for i in 1: 1000000])
shiftInts = shiftBitFlagsAndResize!(cShiftInts,zInts)
cShiftBits =collect(Int32[i for i in 1: 1000000])
shiftBits = shiftBitFlagsAndResize!(cShiftBits,bFalses)

@show shiftInts, shiftBits, length(f), count(bFalses)

@show f == cShiftBits == cShiftInts
@show shiftInts == shiftBits == length(f)



Chances are this is an artifact from the way you benchmark. Please try again with

using BenchmarkTools
a = collect(Int32[i for i in 1: 1000000])
@btime ($a)[$bFalses] #not sure whether this makes sense tbh
c = collect(Int32[i for i in 1: 1000000])
@btime shiftBitFlagsAndResize!($c,$zInts)
c = collect(Int32[i for i in 1: 1000000])
@btime shiftBitFlagsAndResize!($c,$bFalses)

@btime runs the code multiple times and excludes first compilation time. Interpolation further avoids costs associated with accessing global variables.

Hi @abraemer ,

Thanks for the reply. I’ve updated the MWE to use @btime in place of @time and have added the timings I get. The difference in timings (to my mind, unexpected) and in allocations (fewer, as expected) remains.


I actually didn’t understand what code you compared this to, but note that iterating (specifically indexing) into BitArrays is slow, because you have to mask out the bits you’re not interested in.

BitVector operations are efficient when vectorized; if you need scalar indexing, it’s better to use Vector{Bool}, which uses more memory but doesn’t need masking.
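To illustrate the layout difference, here is a small sketch (note: `chunks` is an internal `BitArray` field, used here only for illustration):

```julia
b = trues(64)        # BitVector: 64 values packed into bits
v = Vector{Bool}(b)  # Vector{Bool}: one byte per value

@show sizeof(v)                           # 64 bytes
@show length(b.chunks) * sizeof(UInt64)   # 8 bytes: one UInt64 holds all 64 bits
```

A scalar read from v is a plain byte load, while a scalar read from b has to load the containing UInt64 chunk and mask out the bit.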


I think the most appropriate way is to skip all this bitvector stuff and just write the obvious code. If you can’t edit your input vector, then make a copy, probably still better than bitvector.

The obvious code would be something like:

function intersectSorted!(dst, left, right)
    li = 1
    ri = 1
    out = 1
    resize!(dst, length(left))
    @inbounds while li <= length(left) && ri <= length(right)
        litem = left[li]
        ritem = right[ri]
        li = ifelse(litem <= ritem, li + 1, li)
        ri = ifelse(ritem <= litem, ri + 1, ri)
        # We could make this in-place via left[out] = litem, but then we would
        # often overwrite the very thing we read in the next step; that would
        # have to come out of the store buffer, which is slower than L1.
        dst[out] = litem
        out = ifelse(litem == ritem, out + 1, out)
    end
    resize!(dst, out - 1)
end

And you would use that as e.g.

function intersect(vec::Vector{Int}, vectorOfVectors::Vector{Vector{Int}})
    res, tmp = copy(vec), copy(vec)
    for right in vectorOfVectors
        intersectSorted!(tmp, res, right)
        tmp, res = res, tmp
    end
    return res
end
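As a sanity check, here is the same merge-style routine in standalone form (same logic as above, repeated so the snippet runs on its own), applied to a small example:

```julia
# Standalone copy of the merge-style sorted intersection for a quick check.
function intersectSorted!(dst, left, right)
    li = ri = out = 1
    resize!(dst, length(left))
    @inbounds while li <= length(left) && ri <= length(right)
        litem = left[li]
        ritem = right[ri]
        li = ifelse(litem <= ritem, li + 1, li)
        ri = ifelse(ritem <= litem, ri + 1, ri)
        dst[out] = litem                       # only kept when both sides match
        out = ifelse(litem == ritem, out + 1, out)
    end
    resize!(dst, out - 1)
end

v = [1, 3, 5, 7, 9]
res = similar(v)
intersectSorted!(res, v, [3, 5, 9, 11])
@show res   # [3, 5, 9]
```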

If you want to go fancy you can consider multithreading, GPU, or SIMD. For SIMD you should not attempt to do a single intersectSorted!; instead, you should treat each SIMD lane as a thread. Unfortunately Julia has no direct support for that transformation, so you would need to do it by hand, via SIMD.jl.

Thanks for the reply. Can I ask: were you able to run the MWE and get the same results, i.e. that (a)[bitVector] was roughly 6 times faster than the hand-written loops?

Regarding your intersection description: I do already do this.

The way I see it, this is not surprising. Boolean vector indexing only makes quick work of transforming the indices; I think it does something equivalent to `a[findall(b)]`.

I’ll add a variant of the shift-bits function that I think is more efficient (I’ve seen @Dan use it in another topic, though I don’t remember which one).

function shiftBit(inputIndices, bitVector)
    b, e = 1, length(inputIndices)
    while b < e
        bitVector[b] && (b += 1; continue)
        bitVector[e] || (e -= 1; continue)
        inputIndices[b], inputIndices[e] = inputIndices[e], inputIndices[b]
        b += 1; e -= 1  # step past the swapped pair to avoid re-swapping it forever
    end
    return inputIndices
end

Thanks for the reply. I make use of this in my MWE as the inbuilt comparison (you don’t need the findall).

It may just be the way I’ve named the parameters in my MWE, but I actually test the shifting performance with both the BitVector and a normal integer vector, to mitigate the BitVector accessor slowness.


The collect does nothing useful here; the comprehension already produces a regular Vector, and collect perhaps allocates another one.

But can’t you write

a = Int32(1):Int32(1000000)
f = a[bFalses]


Thanks for the reply. I think the direction of this thread has gone awry; all I really want to understand is why:

@btime ($a)[$bFalses] #563.100 μs (2 allocations: 1.91 MiB)
@btime shiftBitFlagsAndResize!(c,$zInts) setup=(c =collect(Int32[i for i in 1: 1000000])) #  3.051 ms (0 allocations: 0 bytes)
@btime shiftBitFlagsAndResize!(c,$bFalses) setup=(c =collect(Int32[i for i in 1: 1000000])) #3.046 ms (0 allocations: 0 bytes)

even though they yield identical results.

I believe my MWE is fully copy-pastable. I’m just trying to understand why I can’t write an equivalent loop with at least similar performance.


You should take a look at what base is actually doing under the hood.

I originally wanted to walk you through how I think you could have figured this out yourself (a parallel construction, since I know the code).

Alas, if I didn’t know the code already, this would also have confused me, even using

julia> using Cthulhu
julia> a = Float64.(collect(1:10)); b=falses(10);
julia> @descend a[b]

So I’ll give you the keywords: This is called logical indexing (i.e. you index via a bunch of true/false booleans), and specifically using a BitVector.

You find the relevant code here. Looking at the git blame could theoretically lead you to the PR that introduced the relevant code snippet into Base. Looking at that PR again, it doesn’t explain the code either :frowning:

The job of your a[b] task is basically to take the bitarray b and iterate over all indices where b[idx] is true. Your branchy if b[idx] code will lose against branch-free bitfiddling code.

If you are interested, then I can explain the branchless bitfiddling code – but it is really not general purpose; it is specifically for “I have a UInt64, I want to iterate over all indices of set bits”.

(and yes, this is faster than with Bool, thanks to the glory of low-latency BLSR instruction)
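For reference, a sketch of the core trick (not the actual Base implementation): `x & (x - 1)` clears the lowest set bit of a `UInt64` and compiles to a single BLSR instruction, so the positions of the set bits can be walked without a data-dependent branch per bit:

```julia
# Hypothetical helper: collect the 1-based indices of all set bits in x.
function set_bit_indices(x::UInt64)
    inds = Int[]
    while x != 0
        push!(inds, trailing_zeros(x) + 1)  # index of the lowest set bit
        x &= x - 1                          # clear it (compiles to BLSR)
    end
    return inds
end

@show set_bit_indices(UInt64(0b10110))  # [2, 3, 5]
```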

thanks for the reply - very insightful - I think the below:

is essentially the answer, no? I still remain confused as to how, even when the indexing is faster, the filter, the creation of a new vector (since the original is untouched), and the return are still roughly 6 times faster! It is also equally fast for a BitVector and a Vector{Int32}, so the accessing shouldn’t be that slow.

I should add that I did actually try to investigate the Base code (where I would normally start), but when I couldn’t find it (thanks for the links) I came here.

Your offer of explaining the branchless code is well received, but please don’t trouble yourself too much; I imagine it will take more than a few minutes.


Your code cannot use SIMD, since each operation depends on the previous step (that’s where the branches come in), so a 6x speed difference seems roughly expected: AVX2 works on eight 32-bit integers in parallel. The cost of a vector allocation just isn’t high enough to make up the difference.

This surprised me, too. Indexing into a BitVector should cause a slowdown, but it seems to be below the noise floor here.

OK, but please consider removing all the collects from your code; it’s really quite painful to see. They really are all redundant, and if you care about performance, you should care about that.


Thanks for your response - very insightful.

There aren’t any collects in my actual code; this is purely a toy script to try to beat (a)[f], since I wanted to mitigate intermediate collections and allocations.

When I couldn’t, I came here.


PS: on this topic, have you ever got the tattoo? :slight_smile:

FWIW, the fast Base code doesn’t use SIMD either.

Back in 2018 when I last looked at that in detail, the core loop of the Base code was intended to run in ~1.7 cycles per selected index (at density 0.5 like bitrand output): There is a chain of dependent BLSR instructions, which have 1 cycle latency; and we incur one branch mispredict (20 cycles) for each UInt64, i.e. for each 32 selected indices. Other costs should be invisible, due to superscalar execution.

Naive branchy code pays one branch mispredict per selected index, i.e. 20 cycles per selected index.

But the Base code never managed that speed. From the GitHub link above, it originally managed 3-5 cycles per index, which is also what I measure today:

julia> using BenchmarkTools, Random
julia> b = Random.bitrand(64<<12); a = collect(1:length(b));
julia> t = @belapsed ($a)[$b];
julia> (t * 5e9) / (length(b) / 2)  # cycles per selected index, assuming 5 GHz

So there you have your ~ 6x factor.