I’ve found an interesting example of a slightly convoluted for-loop that might be a good showcase of Julia’s ability to generate efficient code. I need help optimizing this code, though, because I’m not sure I have either the fastest or the most elegant code. Pretty sure I got neither!

This is a quite relevant function, with a non-trivial operation, and I’m actually taking this from real-world Python project https://github.com/OsmoSystems/picamraw/blob/4936bea6188e5284d5ed80652d0ca79306815880/picamraw/main.py#L234

The way this works is: every 4 x UInt16 pixels in the output array come from 5 bytes of the input array, with the 5th byte providing the lowest two bits in every group of pixels.

Here’s the code I’ve come up with, first a direct porting of the Python code, with Numpy-style vectorial operations, and then an explicit for-loop. It already results in a speed-up of 80% in my machine. How much further can it be improved?

```
using BenchmarkTools
function pythonicconv5_8to4_16!(output_data::Array{UInt16}, pixel_bytes_2d::Array{UInt8})
# First, set aside cohort 4: the bytes that will be unpacked into the low bits to go with the first 4 bytes
cohort_4 = UInt16.(pixel_bytes_2d[:, 5:5:end])
for byte_cohort_index in 1:4
# High bits come from the input array,
# shifted left by two bits to make room for the low 2-bits which will come from the 5th byte
high_bits = UInt16.(pixel_bytes_2d[:, byte_cohort_index:5:end]) .<< 2
# Now process bits from cohort 4 and unpack the appropriate ones to be our low 2 bits:
# Shift the bits over so that the relevant ones are in the rightmost (lowest 2 bits) position
# eg. for byte 1, 0b00001100 -> 0b11
shifted_bits_from_cohort_4 = cohort_4 .>> ((byte_cohort_index-1) * 2)
# Mask the relevant ones (the lowest 2)
masked_low_bits = shifted_bits_from_cohort_4 .& 0x0003
# Finally, put the masked low bits into their place in the output array
output_data[:, byte_cohort_index:4:end] .= high_bits .| masked_low_bits
end
end
function conv5_8to4_16!(output_data::Array{UInt16}, pixel_bytes_2d::Array{UInt8})
m=4
ll = length(pixel_bytes_2d)
for n in 5:5:ll
low = pixel_bytes_2d[n]
output_data[m-3] = UInt16(pixel_bytes_2d[n-4]) << 2
output_data[m-2] = UInt16(pixel_bytes_2d[n-3]) << 2
output_data[m-1] = UInt16(pixel_bytes_2d[n-2]) << 2
output_data[m] = UInt16(pixel_bytes_2d[n-1]) << 2
output_data[m-3] |= low & 0x03
output_data[m-2] |= (low >> 2) & 0x03
output_data[m-1] |= (low >> 4) & 0x03
output_data[m] |= (low >> 6) & 0x03
m += 4
end
end
len = 1024
px = rand(UInt8, len, len*5÷4)
out = Array{UInt16, 2}(undef, len, len)
@btime conv5_8to4_16!(out, px)
@btime pythonicconv5_8to4_16!(out, px)
```