Feel free to tag me. I don’t see every question and didn’t see this.
Yes (note: there was a bug in VectorizationBase that is now fixed in VectorizationBase 0.19.13, released about an hour ago):
```julia
using LoopVectorization

const lut = [Float32(i)/255 for i in 0:255];
f(a, b, c) = a*b + c
n = 334*3;
rgb = rand(UInt8, n);
out0 = Vector{Float32}(undef, n ÷ 3);
out1 = similar(out0);
out2 = similar(out0);

function while_loop!(out, rgb)
    i = 1; j = 1; n = length(rgb)
    @inbounds while i <= n
        r = lut[1 + rgb[i]]
        g = lut[1 + rgb[i + 1]]
        b = lut[1 + rgb[i + 2]]
        out[j] = f(r, g, b)
        i += 3
        j += 1
    end
end

function for_loop!(out, rgb) # `ivdep` needed for `@simd` when using a table
    @inbounds @simd ivdep for j in eachindex(out)
        r = lut[1 + rgb[3j - 2]]
        g = lut[1 + rgb[3j - 1]]
        b = lut[1 + rgb[3j]]
        out[j] = f(r, g, b)
    end
end

function avxfor_loop!(out, rgb)
    @avx for j in eachindex(out)
        r = lut[1 + rgb[3j - 2]]
        g = lut[1 + rgb[3j - 1]]
        b = lut[1 + rgb[3j]]
        out[j] = f(r, g, b)
    end
end

while_loop!(out0, rgb)
for_loop!(out1, rgb)
avxfor_loop!(out2, rgb)
out0 ≈ out1
out0 ≈ out2
```
Also, for me, not using a table seems much faster:
```julia
function cvt_for_loop!(out, rgb)
    s = one(Float32)/255
    @inbounds @simd for j in eachindex(out)
        r = rgb[3j - 2] * s
        g = rgb[3j - 1] * s
        b = rgb[3j] * s
        out[j] = f(r, g, b)
    end
end

function cvt_avxfor_loop!(out, rgb)
    s = one(Float32)/255
    @avx for j in eachindex(out)
        r = rgb[3j - 2] * s
        g = rgb[3j - 1] * s
        b = rgb[3j] * s
        out[j] = f(r, g, b)
    end
end
```
Although maybe your table involves a more complicated transform than the linear one I’m using.
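Even a nonlinear transform can often be computed directly rather than looked up. As a hypothetical sketch (the 2.2 gamma exponent and the function names here are my invention, not from the original table):

```julia
# Hypothetical sketch: a table encoding a nonlinear (gamma) transform,
# and the equivalent direct computation. The 2.2 exponent is made up
# purely for illustration.
gamma_lut = [Float32((i/255)^2.2) for i in 0:255]

lookup_gamma(x::UInt8) = @inbounds gamma_lut[1 + x]
direct_gamma(x::UInt8) = (Float32(x) * (1.0f0/255))^2.2f0

# The two agree to Float32 precision:
lookup_gamma(0x7f) ≈ direct_gamma(0x7f)
```

Whether the direct version wins depends on how expensive the transform is relative to the gather the table lookup requires.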
Benchmarks:
```julia
julia> @benchmark while_loop!($out0, $rgb)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     302.780 ns (0.00% GC)
  median time:      370.132 ns (0.00% GC)
  mean time:        370.471 ns (0.00% GC)
  maximum time:     435.040 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     227

julia> @benchmark for_loop!($out0, $rgb)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     208.995 ns (0.00% GC)
  median time:      209.178 ns (0.00% GC)
  mean time:        209.449 ns (0.00% GC)
  maximum time:     241.436 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     550

julia> @benchmark avxfor_loop!($out0, $rgb)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     208.137 ns (0.00% GC)
  median time:      208.157 ns (0.00% GC)
  mean time:        208.637 ns (0.00% GC)
  maximum time:     237.182 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     555

julia> @benchmark cvt_for_loop!($out0, $rgb)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     112.369 ns (0.00% GC)
  median time:      112.491 ns (0.00% GC)
  mean time:        112.669 ns (0.00% GC)
  maximum time:     130.954 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     923

julia> @benchmark cvt_avxfor_loop!($out0, $rgb)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     79.738 ns (0.00% GC)
  median time:      80.110 ns (0.00% GC)
  mean time:        80.143 ns (0.00% GC)
  maximum time:     97.106 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     968
```
That's 80 ns for the `@avx` loop just relying on promoting `UInt8` to `Float32` and multiplying by `one(Float32)/255`, vs. >200 ns for the lookup-table variants.
Currently, `f` is still opaque to LoopVectorization, so it won't do any extra unrolling beyond vectorization. Replacing it with functions it knows about, like `*` and `+`, can help it make more informed decisions.
You can also manually add `@avx unroll=2` or `@avx unroll=4` to force it to unroll more, and see if some other unrolling factor results in better performance (if it does while none of the functions are opaque, e.g. after replacing `f` with things LV knows about, you could file an issue and I could look into tweaking LV's modeling).
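As a sketch of the replacement (the function name below is mine), one can swap the call to `f` for `muladd`, which LoopVectorization's cost model recognizes:

```julia
using LoopVectorization

# Sketch: same kernel as cvt_avxfor_loop!, but with the opaque `f`
# replaced by `muladd`, so the `*` and `+` are visible to LV's model.
function cvt_avxfor_muladd!(out, rgb)
    s = one(Float32)/255
    @avx for j in eachindex(out)
        r = rgb[3j - 2] * s
        g = rgb[3j - 1] * s
        b = rgb[3j] * s
        out[j] = muladd(r, g, b)  # r*g + b, no longer opaque
    end
    return out
end
```

Whether this actually changes the chosen unroll factor is hardware-dependent; benchmark it the same way as above.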
```julia
julia> function cvt_avxfor_loop_u2!(out, rgb)
           s = one(Float32)/255
           @avx unroll=2 for j in eachindex(out)
               r = rgb[3j - 2] * s
               g = rgb[3j - 1] * s
               b = rgb[3j] * s
               out[j] = f(r, g, b)
           end
       end
cvt_avxfor_loop_u2! (generic function with 1 method)

julia> @benchmark cvt_avxfor_loop_u2!($out0, $rgb)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     78.921 ns (0.00% GC)
  median time:      79.199 ns (0.00% GC)
  mean time:        80.142 ns (0.00% GC)
  maximum time:     108.159 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     968
```
On a computer with AVX512:

```julia
julia> using VectorizationBase

julia> pick_vector_width(Float32)
static(16)
```

On a computer with AVX(2) only, it would return `StaticInt(8)` instead. On computers with 128-bit vectors (like the M1 Mac), it would return `StaticInt(4)`.
This is how I'd suggest getting the `8` to use (and it is what LoopVectorization does).
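A sketch of staying width-generic this way (variable names are mine):

```julia
using VectorizationBase

# Query the hardware SIMD width for Float32 instead of hard-coding 8 or 16.
W = Int(pick_vector_width(Float32))  # e.g. 16 with AVX512, 8 with AVX2, 4 with 128-bit SIMD
step = 3W                            # how far the flat rgb index advances per vectorized iteration
```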
Your loads here also have a pattern. While successive `r`s, `g`s, and `b`s are each offset by 3, the `r`, `g`, and `b` of a given pixel are offset from each other by only 1.
This means we could load them efficiently: instead of three loads of 8 numbers spaced apart by 3, we could load three contiguous vectors and then shuffle them until we get the desired `r`, `g`, and `b`.
```julia
help?> Unroll
search: Unroll VecUnroll

  Unroll{AU,F,N,AV,W,M,X}(i::I)

    •  AU: unrolled axis
    •  F: factor, step size per unroll. If AU == AV, F == W means successive loads.
       1 would mean offset by 1, e.g. x[1:8], x[2:9], and x[3:10].
    •  N: how many times it is unrolled
    •  AV: vectorized axis # 0 means not vectorized, some sort of reduction
    •  W: vector width
    •  M: bitmask indicating whether each factor is masked
    •  X: stride between loads of vectors along axis AV
    •  i::I: the index
```
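Reading those parameters off the concrete load used below (this annotation is mine, interpreting the call with `pick_vector_width(Float32) == 16`):

```julia
# Unroll{1, 1, 3, 1, 16, zero(UInt), 3}((1,)) means:
#   AU = 1    -> unroll along axis 1 (the only axis of a Vector)
#   F  = 1    -> successive unrolled loads start 1 element apart (r, then g, then b)
#   N  = 3    -> three unrolled loads, one per channel
#   AV = 1    -> axis 1 is also the vectorized axis
#   W  = 16   -> 16 lanes per vector (AVX512 width for Float32)
#   M  = 0    -> no lanes are masked
#   X  = 3    -> lanes along the vectorized axis are 3 elements apart (one per pixel)
#   i  = (1,) -> start at index 1
```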
```julia
julia> vload(stridedpointer(rgb), Unroll{1,1,3,1,Int(pick_vector_width(Float32)),zero(UInt),3}((1,)))
3 x Vec{16, UInt8}
Vec{16, UInt8}<0xb9, 0x4e, 0x11, 0x35, 0x6a, 0xa1, 0x82, 0x25, 0x06, 0xc9, 0x9b, 0x07, 0x3a, 0x1c, 0x04, 0x0c>
Vec{16, UInt8}<0x81, 0x8f, 0x7d, 0x4a, 0xaa, 0xf3, 0xc3, 0x50, 0x50, 0xcd, 0xe5, 0x81, 0xd2, 0x9f, 0x19, 0x84>
Vec{16, UInt8}<0x6f, 0xb1, 0xe2, 0x1e, 0xd3, 0xcf, 0x76, 0x7a, 0x5f, 0x0f, 0xe0, 0x50, 0x41, 0x3b, 0x99, 0x81>

julia> rgb[1:3:48]'
1×16 adjoint(::Vector{UInt8}) with eltype UInt8:
 0xb9  0x4e  0x11  0x35  0x6a  0xa1  0x82  0x25  0x06  0xc9  0x9b  0x07  0x3a  0x1c  0x04  0x0c

julia> rgb[2:3:48]'
1×16 adjoint(::Vector{UInt8}) with eltype UInt8:
 0x81  0x8f  0x7d  0x4a  0xaa  0xf3  0xc3  0x50  0x50  0xcd  0xe5  0x81  0xd2  0x9f  0x19  0x84

julia> rgb[3:3:48]'
1×16 adjoint(::Vector{UInt8}) with eltype UInt8:
 0x6f  0xb1  0xe2  0x1e  0xd3  0xcf  0x76  0x7a  0x5f  0x0f  0xe0  0x50  0x41  0x3b  0x99  0x81
```
Benchmark vs the old approach:
```julia
julia> function loadold(rgb, i)
           sp = stridedpointer(rgb)
           r = vload(sp, (MM(pick_vector_width(Float32), i,   StaticInt(3)),))
           g = vload(sp, (MM(pick_vector_width(Float32), i+1, StaticInt(3)),))
           b = vload(sp, (MM(pick_vector_width(Float32), i+2, StaticInt(3)),))
           VecUnroll((r, g, b))
       end
loadold (generic function with 1 method)

julia> loadold(rgb, 1)
3 x Vec{16, UInt8}
Vec{16, UInt8}<0xb9, 0x4e, 0x11, 0x35, 0x6a, 0xa1, 0x82, 0x25, 0x06, 0xc9, 0x9b, 0x07, 0x3a, 0x1c, 0x04, 0x0c>
Vec{16, UInt8}<0x81, 0x8f, 0x7d, 0x4a, 0xaa, 0xf3, 0xc3, 0x50, 0x50, 0xcd, 0xe5, 0x81, 0xd2, 0x9f, 0x19, 0x84>
Vec{16, UInt8}<0x6f, 0xb1, 0xe2, 0x1e, 0xd3, 0xcf, 0x76, 0x7a, 0x5f, 0x0f, 0xe0, 0x50, 0x41, 0x3b, 0x99, 0x81>

julia> vload(stridedpointer(rgb), Unroll{1,1,3,1,Int(pick_vector_width(Float32)),zero(UInt),3}((1,)))
3 x Vec{16, UInt8}
Vec{16, UInt8}<0xb9, 0x4e, 0x11, 0x35, 0x6a, 0xa1, 0x82, 0x25, 0x06, 0xc9, 0x9b, 0x07, 0x3a, 0x1c, 0x04, 0x0c>
Vec{16, UInt8}<0x81, 0x8f, 0x7d, 0x4a, 0xaa, 0xf3, 0xc3, 0x50, 0x50, 0xcd, 0xe5, 0x81, 0xd2, 0x9f, 0x19, 0x84>
Vec{16, UInt8}<0x6f, 0xb1, 0xe2, 0x1e, 0xd3, 0xcf, 0x76, 0x7a, 0x5f, 0x0f, 0xe0, 0x50, 0x41, 0x3b, 0x99, 0x81>

julia> @benchmark loadold($rgb, 1)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     22.970 ns (0.00% GC)
  median time:      23.059 ns (0.00% GC)
  mean time:        23.124 ns (0.00% GC)
  maximum time:     39.215 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     996

julia> @benchmark vload(stridedpointer($rgb), Unroll{1,1,3,1,Int(pick_vector_width(Float32)),zero(UInt),3}((1,)))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.187 ns (0.00% GC)
  median time:      2.189 ns (0.00% GC)
  mean time:        2.198 ns (0.00% GC)
  maximum time:     17.585 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000
```
Let me know if this helps/if you have any other questions!
The `Unroll` API is pretty awkward, and the order of the parameters is nonsensical to the point that I have no clue what I was thinking (I added the doc string because I have to look it up every time myself…). Maybe I'll clean it up for VectorizationBase 0.20.
But it's meant to have fields that I (or LoopVectorization) can plug values into to get whatever load/store optimizations are wanted, as in this example.