Could this function be faster?

By default, it is probably unrolling by 4 so that it can load a vector of 4 contiguous values from img and then shuffle them.
That might not work well for UInt8.

I’m currently getting crashes:

julia: /home/chriselrod/Documents/languages/juliarelease/src/ccall.cpp:879: jl_cgval_t emit_llvmcall(jl_codectx_t&, jl_value_t**, size_t): Assertion `*it == f->getFunctionType()->getParamType(i)' failed.

So I’ll have to look into what’s going on later.

You could also try @tturbo unroll=1 (or, equivalently, @turbo thread=true unroll=1).
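
For reference, here is a minimal sketch of where those options go. The loop is made up (`scale!`, `out`, and `s` are hypothetical names, not the function from the question), and it uses Float32 rather than UInt8 so it stays clear of the crash above:

```julia
using LoopVectorization

# Hypothetical example kernel: multiply every element of `img` by `s`.
# `unroll=1` asks LoopVectorization not to unroll the loop further,
# and `@tturbo` is the threaded variant of `@turbo`.
function scale!(out, img, s)
    @tturbo unroll=1 for i in eachindex(img)
        out[i] = img[i] * s
    end
    return out
end

img = rand(Float32, 1024)
out = similar(img)
scale!(out, img, 2f0)
```

Whether `unroll=1` actually helps will depend on the element type and the access pattern, so it is worth benchmarking both versions.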