For even better performance, we can use a packed version of the bit shifting solution to convert 8 codes at a time:
const ONE = 0x101010101010101
const SEVEN = 0x707070707070707
findcode_shift_packed(x) = ((x+ONE)>>1-x>>3&ONE)&SEVEN
With the codes stored in an Int8
array, we can wrap it as an Int64
array to operate on it efficiently (same memory):
src8 = rand(Int8[2,3,5,7,11], 10_000);
src64 = unsafe_wrap(Array, convert(Ptr{Int64}, pointer(src8)), length(src8)>>3);
dst64 = similar(src64);
broadcast!(findcode_shift_packed, dst64, src64);
dst8 = unsafe_wrap(Array, convert(Ptr{Int8}, pointer(dst64)), length(dst64)<<3);
Verification:
julia> for i=1:6; print("$(src8[i])->$(dst8[i]) "); end
3->2 5->3 7->4 2->1 3->2 11->5
Timings:
julia> @btime broadcast!($findcode_shift_packed, $dst64, $src64);
207.870 ns (0 allocations: 0 bytes)
julia> @btime broadcast!(identity, $dst64, $src64);
101.603 ns (0 allocations: 0 bytes)
code_native
shows that this method uses 256 bit SSE2 instructions, operating on 32 codes in parallel. The timings above show that ~16.7 codes are handled per clock cycle. Correcting for the overhead of copying elements, ~32.0 codes are handled per clock cycle. 460 times faster than the original implementation!