Compiling to branch table

Hey @bennedich, since you obviously share the hobby of micro-optimization, I have a challenge for you (sorry for derailing the conversation):

findall on a BitArray really needs some love. Logical indexing by BitArray will get a 5x speedup in https://github.com/JuliaLang/julia/pull/29746, to 3 cycles per selected index (on Intel Broadwell) when most indices are selected (we only pay one branch miss per 64 bits of the BitArray).

We could use the same code in findall(B::BitVector), but I have no idea how to do a fast findall(B::BitMatrix): we need to produce Cartesian, not linear, indices.
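To make the BitVector case concrete, here is a minimal sketch of scanning the 64-bit chunks and emitting linear indices. It is not the code from the PR; `findall_linear_sketch` is a hypothetical name, and it relies on the internal `chunks` field of BitArray and on the unused bits of the last chunk being zero.

```julia
# Hypothetical helper, not the PR's implementation: collect the linear
# indices of all set bits by walking the underlying UInt64 chunks.
function findall_linear_sketch(B::BitArray)
    out = Vector{Int}(undef, count(B))   # one slot per set bit
    n = 0                                # number of indices written so far
    base = 0                             # linear offset of the current chunk
    for c in B.chunks                    # internal field: Vector{UInt64}
        while c != 0
            n += 1
            out[n] = base + trailing_zeros(c) + 1
            c &= c - one(c)              # clear the lowest set bit
        end
        base += 64
    end
    return out
end

B = rand(200) .> 0.5                     # a BitVector
findall_linear_sketch(B) == findall(B)
```

The inner loop runs once per set bit, so the only data-dependent branch exit happens once per chunk.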

On the other hand, we can go crazy here: we own all of the context, with not a single call into code we do not control. Nothing prevents us from, e.g., first collecting linear indices and then batch-converting them in place to Cartesian indices using some magic AVX bit-manipulation trick (as long as we find a pure LLVM idiom; declare ?? @llvm.x86.?? is obviously not admissible in Base, since Julia supports more architectures than x86). Even the cost of allocating temporaries can probably be amortized. Integer division is unlikely to pay off, though.
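As a scalar baseline for the two-pass idea, here is a rough sketch of the conversion step that avoids integer division. It assumes the linear indices arrive sorted in ascending order (which they do when collected chunk by chunk), so the column can be tracked with a running counter. `linear_to_cartesian_sketch` is a hypothetical name; the sketch is neither in-place nor vectorized, so it only illustrates the arithmetic, not the AVX trick.

```julia
# Hypothetical helper: convert sorted linear indices of a column-major
# matrix with `nrows` rows into CartesianIndex{2} values without dividing.
function linear_to_cartesian_sketch(lin::Vector{Int}, nrows::Int)
    out = Vector{CartesianIndex{2}}(undef, length(lin))
    col = 1
    colend = nrows                       # largest linear index in column `col`
    for k in eachindex(lin)
        i = lin[k]
        while i > colend                 # advance to the column containing i
            col += 1
            colend += nrows
        end
        out[k] = CartesianIndex(i - (colend - nrows), col)
    end
    return out
end

B = rand(7, 5) .> 0.4                    # a BitMatrix
lin = findall(vec(B))                    # sorted linear indices
linear_to_cartesian_sketch(lin, size(B, 1)) == findall(B)
```

The `while` loop advances at most once per column in total, so the conversion stays linear in the number of selected indices plus the number of columns.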