Compiler optimization challenge: Bayer pattern image unpacking

I’ve found an interesting example of a slightly convoluted for-loop that might be a good showcase of Julia’s ability to generate efficient code. I need help optimizing this code, though, because I’m not sure I have either the fastest or the most elegant code. Pretty sure I got neither!

This is a quite relevant function, with a non-trivial operation, and I’m actually taking this from real-world Python project https://github.com/OsmoSystems/picamraw/blob/4936bea6188e5284d5ed80652d0ca79306815880/picamraw/main.py#L234

The way this works is: every 4 x UInt16 pixels in the output array come from 5 bytes of the input array, with the 5th byte providing the lowest two bits in every group of pixels.

Here’s the code I’ve come up with, first a direct porting of the Python code, with Numpy-style vectorial operations, and then an explicit for-loop. It already results in a speed-up of 80% in my machine. How much further can it be improved?

using BenchmarkTools

function pythonicconv5_8to4_16!(output_data::Array{UInt16}, pixel_bytes_2d::Array{UInt8})
    # First, set aside cohort 4: the bytes that will be unpacked into the low bits to go with the first 4 bytes
    cohort_4 = UInt16.(pixel_bytes_2d[:, 5:5:end])
    for byte_cohort_index in 1:4
        # High bits come from the input array,
        # shifted left by two bits to make room for the low 2-bits which will come from the 5th byte
        high_bits = UInt16.(pixel_bytes_2d[:, byte_cohort_index:5:end]) .<< 2

        # Now process bits from cohort 4 and unpack the appropriate ones to be our low 2 bits:
        # Shift the bits over so that the relevant ones are in the rightmost (lowest 2 bits) position
        # eg. for byte 1, 0b00001100 -> 0b11
        shifted_bits_from_cohort_4 = cohort_4 .>> ((byte_cohort_index-1) * 2)
        # Mask the relevant ones (the lowest 2)
        masked_low_bits = shifted_bits_from_cohort_4 .& 0x0003

        # Finally, put the masked low bits into their place in the output array
        output_data[:, byte_cohort_index:4:end] .= high_bits .| masked_low_bits
    end
end


function conv5_8to4_16!(output_data::Array{UInt16}, pixel_bytes_2d::Array{UInt8})
    m=4
    ll = length(pixel_bytes_2d)
    for n in 5:5:ll
        low = pixel_bytes_2d[n]

        output_data[m-3] = UInt16(pixel_bytes_2d[n-4]) << 2
        output_data[m-2] = UInt16(pixel_bytes_2d[n-3]) << 2
        output_data[m-1] = UInt16(pixel_bytes_2d[n-2]) << 2
        output_data[m] = UInt16(pixel_bytes_2d[n-1]) << 2

        output_data[m-3] |= low & 0x03
        output_data[m-2] |= (low >> 2) & 0x03
        output_data[m-1] |= (low >> 4) & 0x03
        output_data[m] |= (low >> 6) & 0x03
        m += 4
    end
end

len = 1024
px = rand(UInt8, len, len*5÷4)
out = Array{UInt16, 2}(undef, len, len)

@btime conv5_8to4_16!(out, px)
@btime pythonicconv5_8to4_16!(out, px)