By default, it is probably unrolling by 4 so that it can load a vector of 4 contiguous values from img and then shuffle them.
That might not work well for UInt8.
I’m currently getting crashes:
julia: /home/chriselrod/Documents/languages/juliarelease/src/ccall.cpp:879: jl_cgval_t emit_llvmcall(jl_codectx_t&, jl_value_t**, size_t): Assertion `*it == f->getFunctionType()->getParamType(i)' failed.
So I’ll have to look into what’s going on later.
You could also try @tturbo unroll=1 / @turbo thread=true unroll=1.
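For reference, here is a rough sketch of where the unroll=1 option goes. The function scale! and its arguments are made up for illustration (they are not from the code being discussed), and it uses Float32 rather than UInt8 to stay clear of the crash above.

```julia
using LoopVectorization

# Hypothetical example kernel: multiply every element of `img` by `s`.
# `unroll=1` asks LoopVectorization not to unroll the loop beyond the
# vector width, instead of letting it pick an unroll factor itself.
function scale!(out, img, s)
    @tturbo unroll=1 for i in eachindex(img)
        out[i] = s * img[i]
    end
    return out
end

img = rand(Float32, 64, 64)
out = similar(img)
scale!(out, img, 2f0)
```

Swapping @tturbo for @turbo thread=true should give the same behaviour, and dropping the threading gives a single-threaded version with the same unroll setting.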