Feel free to tag me. I don’t see every question and didn’t see this.
Yes (note, there was a bug in VectorizationBase that is now fixed in VectorizationBase 0.19.13, released about an hour ago):
using LoopVectorization
const lut = [Float32(i)/255 for i in 0:255];
f(a,b,c) = a*b+c
n = 334*3;
rgb = rand(UInt8, n);
out0= Vector{Float32}(undef, n ÷ 3);
out1 = similar(out0);
out2 = similar(out0);
function while_loop!(out, rgb)
    i = 1; j = 1; n = length(rgb)
    @inbounds while i <= n
        r = lut[1 + rgb[i]]
        g = lut[1 + rgb[i + 1]]
        b = lut[1 + rgb[i + 2]]
        out[j] = f(r, g, b)
        i += 3
        j += 1
    end
end
function for_loop!(out, rgb) # `ivdep` asserts to `@simd` that the table loads can't alias `out`
    @inbounds @simd ivdep for j in eachindex(out)
        r = lut[1 + rgb[3j - 2]]
        g = lut[1 + rgb[3j - 1]]
        b = lut[1 + rgb[3j]]
        out[j] = f(r, g, b)
    end
end
function avxfor_loop!(out, rgb)
    @avx for j in eachindex(out)
        r = lut[1 + rgb[3j - 2]]
        g = lut[1 + rgb[3j - 1]]
        b = lut[1 + rgb[3j]]
        out[j] = f(r, g, b)
    end
end
while_loop!(out0, rgb)
for_loop!(out1, rgb)
avxfor_loop!(out2, rgb)
out0 ≈ out1
out0 ≈ out2
Also, for me, not using a table seems much faster:
function cvt_for_loop!(out, rgb)
    s = one(Float32)/255
    @inbounds @simd for j in eachindex(out)
        r = rgb[3j - 2] * s
        g = rgb[3j - 1] * s
        b = rgb[3j] * s
        out[j] = f(r, g, b)
    end
end
function cvt_avxfor_loop!(out, rgb)
    s = one(Float32)/255
    @avx for j in eachindex(out)
        r = rgb[3j - 2] * s
        g = rgb[3j - 1] * s
        b = rgb[3j] * s
        out[j] = f(r, g, b)
    end
end
Although maybe your table involves a more complicated transform than the linear one I’m using.
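If your transform is nonlinear, a table can still make sense. A hypothetical sketch of such a table (the gamma-style exponent 2.2 is purely an illustrative assumption, not something from this thread), using the same 1-based indexing trick as `lut` above:

```julia
# Hypothetical nonlinear table: gamma-style decode instead of linear scaling.
# The 2.2 exponent is an illustrative assumption.
const gamma_lut = [Float32((i / 255)^2.2) for i in 0:255]

# UInt8 values are 0-based, so add 1 to index the 256-entry table.
gamma_decode(x::UInt8) = @inbounds gamma_lut[1 + x]
```

This drops into the table-based loops above in place of `lut`.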
Benchmarks:
julia> @benchmark while_loop!($out0, $rgb)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 302.780 ns (0.00% GC)
median time: 370.132 ns (0.00% GC)
mean time: 370.471 ns (0.00% GC)
maximum time: 435.040 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 227
julia> @benchmark for_loop!($out0, $rgb)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 208.995 ns (0.00% GC)
median time: 209.178 ns (0.00% GC)
mean time: 209.449 ns (0.00% GC)
maximum time: 241.436 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 550
julia> @benchmark avxfor_loop!($out0, $rgb)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 208.137 ns (0.00% GC)
median time: 208.157 ns (0.00% GC)
mean time: 208.637 ns (0.00% GC)
maximum time: 237.182 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 555
julia> @benchmark cvt_for_loop!($out0, $rgb)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 112.369 ns (0.00% GC)
median time: 112.491 ns (0.00% GC)
mean time: 112.669 ns (0.00% GC)
maximum time: 130.954 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 923
julia> @benchmark cvt_avxfor_loop!($out0, $rgb)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 79.738 ns (0.00% GC)
median time: 80.110 ns (0.00% GC)
mean time: 80.143 ns (0.00% GC)
maximum time: 97.106 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 968
That’s 80 ns for the @avx loop just relying on promoting UInt8 to Float32 and multiplying by one(Float32)/255, vs >200 ns for the lookup table variants.
Currently, f is still opaque to LoopVectorization, so it won’t do any extra unrolling beyond vectorization.
Replacing it with functions it knows about, like `*` and `+`, can help it make more informed decisions.
You can also manually add `@avx unroll=2` or `@avx unroll=4` to force more unrolling, and see whether some other unroll factor gives better performance. If it does while none of the functions are opaque (e.g., after replacing `f` with things LV knows about), you could file an issue and I could look into tweaking LV’s modeling.
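For instance, here is a sketch of the loop with `f`’s body written inline, so that none of the functions are opaque to LoopVectorization (this assumes `f(a,b,c) = a*b + c` as defined above):

```julia
using LoopVectorization

# Sketch: same as cvt_avxfor_loop!, but with `f` inlined so LoopVectorization
# sees only `*` and `+`, which it can model when choosing an unroll factor.
function cvt_avxfor_loop_inline!(out, rgb)
    s = one(Float32)/255
    @avx for j in eachindex(out)
        r = rgb[3j - 2] * s
        g = rgb[3j - 1] * s
        b = rgb[3j] * s
        out[j] = r * g + b  # the body of f(r, g, b)
    end
end
```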
julia> function cvt_avxfor_loop_u2!(out, rgb)
           s = one(Float32)/255
           @avx unroll=2 for j in eachindex(out)
               r = rgb[3j - 2] * s
               g = rgb[3j - 1] * s
               b = rgb[3j] * s
               out[j] = f(r, g, b)
           end
       end
cvt_avxfor_loop_u2! (generic function with 1 method)
julia> @benchmark cvt_avxfor_loop_u2!($out0, $rgb)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 78.921 ns (0.00% GC)
median time: 79.199 ns (0.00% GC)
mean time: 80.142 ns (0.00% GC)
maximum time: 108.159 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 968
On a computer with AVX512:
julia> using VectorizationBase
julia> pick_vector_width(Float32)
static(16)
On a computer with AVX(2) only, it would return StaticInt(8) instead. On computers with 128-bit vectors (like the M1 Mac), it would return StaticInt(4).
This is how I’d suggest getting the 8 to use (and it is what LoopVectorization does).
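As a minimal sketch, instead of hardcoding 8 you can derive the width from the hardware (the `Int(...)` conversion is because `pick_vector_width` returns a `StaticInt`):

```julia
using VectorizationBase: pick_vector_width

# Derive the SIMD width for Float32 from the host CPU instead of hardcoding it:
# 16 with AVX512, 8 with AVX(2), 4 with 128-bit vectors.
const W = Int(pick_vector_width(Float32))
```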
Your loads here also follow a pattern. While successive `r`s (and `g`s and `b`s) are offset by 3, `r`, `g`, and `b` are offset from each other by only 1.
This means that instead of performing three strided loads of 8 numbers each spaced apart by 3, we can load three contiguous vectors and then shuffle them into the desired `r`, `g`, and `b`.
help?> Unroll
search: Unroll VecUnroll

  Unroll{AU,F,N,AV,W,M,X}(i::I)

    •  AU: Unrolled axis
    •  F: Factor, step size per unroll. If AU == AV, F == W means successive loads. 1 would mean offset by 1, e.g. x[1:8], x[2:9], and x[3:10].
    •  N: How many times is it unrolled
    •  AV: Vectorized axis # 0 means not vectorized, some sort of reduction
    •  W: vector width
    •  M: bitmask indicating whether each factor is masked
    •  X: stride between loads of vectors along axis AV.
    •  i::I - index
julia> vload(stridedpointer(rgb), Unroll{1,1,3,1,Int(pick_vector_width(Float32)),zero(UInt),3}((1,)))
3 x Vec{16, UInt8}
Vec{16, UInt8}<0xb9, 0x4e, 0x11, 0x35, 0x6a, 0xa1, 0x82, 0x25, 0x06, 0xc9, 0x9b, 0x07, 0x3a, 0x1c, 0x04, 0x0c>
Vec{16, UInt8}<0x81, 0x8f, 0x7d, 0x4a, 0xaa, 0xf3, 0xc3, 0x50, 0x50, 0xcd, 0xe5, 0x81, 0xd2, 0x9f, 0x19, 0x84>
Vec{16, UInt8}<0x6f, 0xb1, 0xe2, 0x1e, 0xd3, 0xcf, 0x76, 0x7a, 0x5f, 0x0f, 0xe0, 0x50, 0x41, 0x3b, 0x99, 0x81>
julia> rgb[1:3:48]'
1×16 adjoint(::Vector{UInt8}) with eltype UInt8:
0xb9 0x4e 0x11 0x35 0x6a 0xa1 0x82 0x25 0x06 0xc9 0x9b 0x07 0x3a 0x1c 0x04 0x0c
julia> rgb[2:3:48]'
1×16 adjoint(::Vector{UInt8}) with eltype UInt8:
0x81 0x8f 0x7d 0x4a 0xaa 0xf3 0xc3 0x50 0x50 0xcd 0xe5 0x81 0xd2 0x9f 0x19 0x84
julia> rgb[3:3:48]'
1×16 adjoint(::Vector{UInt8}) with eltype UInt8:
0x6f 0xb1 0xe2 0x1e 0xd3 0xcf 0x76 0x7a 0x5f 0x0f 0xe0 0x50 0x41 0x3b 0x99 0x81
Benchmark vs the old approach:
julia> function loadold(rgb, i)
           sp = stridedpointer(rgb)
           r = vload(sp, (MM(pick_vector_width(Float32), i, StaticInt(3)),))
           g = vload(sp, (MM(pick_vector_width(Float32), i+1, StaticInt(3)),))
           b = vload(sp, (MM(pick_vector_width(Float32), i+2, StaticInt(3)),))
           VecUnroll((r, g, b))
       end
loadold (generic function with 1 method)
julia> loadold(rgb,1)
3 x Vec{16, UInt8}
Vec{16, UInt8}<0xb9, 0x4e, 0x11, 0x35, 0x6a, 0xa1, 0x82, 0x25, 0x06, 0xc9, 0x9b, 0x07, 0x3a, 0x1c, 0x04, 0x0c>
Vec{16, UInt8}<0x81, 0x8f, 0x7d, 0x4a, 0xaa, 0xf3, 0xc3, 0x50, 0x50, 0xcd, 0xe5, 0x81, 0xd2, 0x9f, 0x19, 0x84>
Vec{16, UInt8}<0x6f, 0xb1, 0xe2, 0x1e, 0xd3, 0xcf, 0x76, 0x7a, 0x5f, 0x0f, 0xe0, 0x50, 0x41, 0x3b, 0x99, 0x81>
julia> vload(stridedpointer(rgb), Unroll{1,1,3,1,Int(pick_vector_width(Float32)),zero(UInt),3}((1,)))
3 x Vec{16, UInt8}
Vec{16, UInt8}<0xb9, 0x4e, 0x11, 0x35, 0x6a, 0xa1, 0x82, 0x25, 0x06, 0xc9, 0x9b, 0x07, 0x3a, 0x1c, 0x04, 0x0c>
Vec{16, UInt8}<0x81, 0x8f, 0x7d, 0x4a, 0xaa, 0xf3, 0xc3, 0x50, 0x50, 0xcd, 0xe5, 0x81, 0xd2, 0x9f, 0x19, 0x84>
Vec{16, UInt8}<0x6f, 0xb1, 0xe2, 0x1e, 0xd3, 0xcf, 0x76, 0x7a, 0x5f, 0x0f, 0xe0, 0x50, 0x41, 0x3b, 0x99, 0x81>
julia> @benchmark loadold($rgb,1)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 22.970 ns (0.00% GC)
median time: 23.059 ns (0.00% GC)
mean time: 23.124 ns (0.00% GC)
maximum time: 39.215 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 996
julia> @benchmark vload(stridedpointer($rgb), Unroll{1,1,3,1,Int(pick_vector_width(Float32)),zero(UInt),3}((1,)))
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 2.187 ns (0.00% GC)
median time: 2.189 ns (0.00% GC)
mean time: 2.198 ns (0.00% GC)
maximum time: 17.585 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000
Let me know if this helps/if you have any other questions!
The `Unroll` API is pretty awkward, and the order of the parameters is nonsensical to the point that I have no clue what I was thinking (I added the docstring because I have to look it up every time myself…). Maybe I’ll clean it up for VectorizationBase 0.20.
But it’s meant to have fields that I (or LoopVectorization) can plug values into to get whatever load/store behavior is wanted, enabling optimizations like the one in this example.