Compiling to branch table

For even better performance, we can use a packed version of the bit shifting solution to convert 8 codes at a time:

const ONE = 0x101010101010101
const SEVEN = 0x707070707070707

findcode_shift_packed(x) = ((x+ONE)>>1-x>>3&ONE)&SEVEN

With the codes stored in an Int8 array, we can wrap it as an Int64 array to operate on it efficiently (same memory):

src8 = rand(Int8[2,3,5,7,11], 10_000);
src64 = unsafe_wrap(Array, convert(Ptr{Int64}, pointer(src8)), length(src8)>>3);
dst64 = similar(src64);

broadcast!(findcode_shift_packed, dst64, src64);

dst8 = unsafe_wrap(Array, convert(Ptr{Int8}, pointer(dst64)), length(dst64)<<3);

Verification:

julia> for i=1:6; print("$(src8[i])->$(dst8[i])  "); end
3->2  5->3  7->4  2->1  3->2  11->5

Timings:

julia> @btime broadcast!($findcode_shift_packed, $dst64, $src64);
  207.870 ns (0 allocations: 0 bytes)
  
julia> @btime broadcast!(identity, $dst64, $src64);
  101.603 ns (0 allocations: 0 bytes)

code_native shows that this method uses 256 bit SSE2 instructions, operating on 32 codes in parallel. The timings above show that ~16.7 codes are handled per clock cycle. Correcting for the overhead of copying elements, ~32.0 codes are handled per clock cycle. 460 times faster than the original implementation!

1 Like