Experiments with VectorizationBase

I’ve been trying out VectorizationBase for some low-level SIMD coding. It has worked, but I suspect there may be better ways to go about it, or at least I need some pointers on how to adapt to some recent changes in this quickly developing package.

The problem I’m working on has the following structure: the input data is a UInt8 color vector rgbrgbrgb..., the output data is a Float32 vector, and the scalar processing can be described in the form

i, j = 1, 1
while i <= n
    r = lut[1 + rgb[i]]
    g = lut[1 + rgb[i + 1]]
    b = lut[1 + rgb[i + 2]]
    out[j] = f(r, g, b)
    i += 3
    j += 1
end

where lut is a 256 element Float32 lookup table and f is a function from 3 Float32 arguments to one Float32 output.

With VectorizationBase 0.18.2 I implemented this as

i, j = 1, 1
while i <= n - 23
    r = vload(pointer(rgb, i), MM(StaticInt(8), StaticInt(0), StaticInt(3)))
    g = vload(pointer(rgb, i), MM(StaticInt(8), StaticInt(1), StaticInt(3)))
    b = vload(pointer(rgb, i), MM(StaticInt(8), StaticInt(2), StaticInt(3)))
    rr = vload(pointer(lut), 4 * VectorizationBase.vconvert(Vec{8, UInt32}, r))
    gg = vload(pointer(lut), 4 * VectorizationBase.vconvert(Vec{8, UInt32}, g))
    bb = vload(pointer(lut), 4 * VectorizationBase.vconvert(Vec{8, UInt32}, b))
    t = f(rr, gg, bb)
    vstore!(pointer(out, j), t)
    i += 24
    j += 8
end

Questions, from higher level to lower level; the lower-level ones are less interesting if there are better higher-level solutions.

  1. Could this have been solved at the LoopVectorization @avx abstraction level?
  2. At the level of VectorizationBase, are there better ways to go about this? It’s obviously not ideal that I have hardcoded this to 8xFloat32 with no unrolling option.
  3. The vload and vstore! methods I used went away in VectorizationBase 0.19.1. I’ve found how to implement this in terms of __vload and __vstore!, but that is clearly way deeper into the internals than I should be. What vload and vstore! methods should I be using now?

This makes an attempt to summon @Elrod. No need to go into too much detail; I can probably figure it out if I get some general direction.

Feel free to tag me. I don’t see every question and didn’t see this.

Yes (note, there was a bug in VectorizationBase that is now fixed in VectorizationBase 0.19.13, released about an hour ago):

using LoopVectorization
const lut = [Float32(i)/255 for i in 0:255];
f(a,b,c) = a*b+c

n =  334*3;
rgb = rand(UInt8, n);
out0= Vector{Float32}(undef, n ÷ 3);
out1 = similar(out0);
out2 = similar(out0);

function while_loop!(out, rgb)
    i=1; j=1; n = length(rgb)
    @inbounds while i <= n
        r = lut[1 + rgb[i]]
        g = lut[1 + rgb[i + 1]]
        b = lut[1 + rgb[i + 2]]
        out[j] = f(r, g, b)
        i += 3
        j += 1
    end
end
function for_loop!(out, rgb) # ivdep needed for `@simd` when using a table
    @inbounds @simd ivdep for j in eachindex(out)
        r = lut[1 + rgb[3j - 2]]
        g = lut[1 + rgb[3j - 1]]
        b = lut[1 + rgb[3j]]
        out[j] = f(r, g, b)
    end
end
function avxfor_loop!(out, rgb)
    @avx for j in eachindex(out)
        r = lut[1 + rgb[3j - 2]]
        g = lut[1 + rgb[3j - 1]]
        b = lut[1 + rgb[3j]]
        out[j] = f(r, g, b)
    end
end
while_loop!(out0, rgb)
for_loop!(out1, rgb)
avxfor_loop!(out2, rgb)
out0 ≈ out1
out0 ≈ out2

Also, for me, not using a table seems much faster:

function cvt_for_loop!(out, rgb)
    s = one(Float32)/255
    @inbounds @simd for j in eachindex(out)
        r = rgb[3j - 2] * s
        g = rgb[3j - 1] * s
        b = rgb[3j]     * s
        out[j] = f(r, g, b)
    end
end
function cvt_avxfor_loop!(out, rgb)
    s = one(Float32)/255
    @avx for j in eachindex(out)
        r = rgb[3j - 2] * s
        g = rgb[3j - 1] * s
        b = rgb[3j]     * s
        out[j] = f(r, g, b)
    end
end

Although maybe your table involves a more complicated transform than the linear one I’m using.
Benchmarks:

julia> @benchmark while_loop!($out0, $rgb)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     302.780 ns (0.00% GC)
  median time:      370.132 ns (0.00% GC)
  mean time:        370.471 ns (0.00% GC)
  maximum time:     435.040 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     227

julia> @benchmark for_loop!($out0, $rgb)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     208.995 ns (0.00% GC)
  median time:      209.178 ns (0.00% GC)
  mean time:        209.449 ns (0.00% GC)
  maximum time:     241.436 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     550

julia> @benchmark avxfor_loop!($out0, $rgb)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     208.137 ns (0.00% GC)
  median time:      208.157 ns (0.00% GC)
  mean time:        208.637 ns (0.00% GC)
  maximum time:     237.182 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     555

julia> @benchmark cvt_for_loop!($out0, $rgb)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     112.369 ns (0.00% GC)
  median time:      112.491 ns (0.00% GC)
  mean time:        112.669 ns (0.00% GC)
  maximum time:     130.954 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     923

julia> @benchmark cvt_avxfor_loop!($out0, $rgb)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     79.738 ns (0.00% GC)
  median time:      80.110 ns (0.00% GC)
  mean time:        80.143 ns (0.00% GC)
  maximum time:     97.106 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     968

That’s 80 ns for the @avx loop just relying on promoting UInt8 to Float32 and multiplying by one(Float32)/255, vs >200 ns for the lookup table variants.

Currently, f is still opaque to LoopVectorization, so it won’t do any extra unrolling beyond vectorization.
Replacing it with functions it knows about, like * and + can help it make more informed decisions.
You can also manually add @avx unroll=2 or @avx unroll=4 to force it to unroll more, and see whether some other unrolling factor results in better performance. (If it does while none of the functions are opaque, e.g. after replacing f with things LV knows about, you could file an issue and I could look into tweaking LV’s modeling.)

julia> function cvt_avxfor_loop_u2!(out, rgb)
           s = one(Float32)/255
           @avx unroll=2 for j in eachindex(out)
               r = rgb[3j - 2] * s
               g = rgb[3j - 1] * s
               b = rgb[3j]     * s
               out[j] = f(r, g, b)
           end
       end
cvt_avxfor_loop_u2! (generic function with 1 method)

julia> @benchmark cvt_avxfor_loop_u2!($out0, $rgb)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     78.921 ns (0.00% GC)
  median time:      79.199 ns (0.00% GC)
  mean time:        80.142 ns (0.00% GC)
  maximum time:     108.159 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     968

On a computer with AVX512:

julia> using VectorizationBase

julia> pick_vector_width(Float32)
static(16)

On a computer with AVX(2) only, it would return StaticInt(8) instead. On computers with 128 bit vectors (like the M1 mac), it would return StaticInt(4).

This is how I’d suggest getting the 8 to use (and it is what LoopVectorization does).
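For example, the hardcoded width and strides in the original loop could be derived from the queried width. A minimal sketch:

```julia
using VectorizationBase: pick_vector_width

# Sketch: query the hardware SIMD width instead of hardcoding 8.
W = Int(pick_vector_width(Float32))  # 16 with AVX512, 8 with AVX2, 4 with 128-bit SIMD

# The original loop bound and strides then generalize from
#     while i <= n - 23 ... i += 24; j += 8
# to
#     while i <= n - (3W - 1) ... i += 3W; j += W
```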

Your loads here also have a pattern. While successive rs, gs, and bs are offset by 3, r, g, and b are offset from each other by only 1.
This means we could load them more efficiently: instead of loading 8 numbers spaced apart by 3 three times, we can load three contiguous vectors and then shuffle them until we get the desired r, g, and b.

help?> Unroll
search: Unroll VecUnroll

  Unroll{AU,F,N,AV,W,M,X}(i::I)

    •  AU: Unrolled axis

    •  F: Factor, step size per unroll. If AU == AV, F == W means successive loads. 1 would mean offset by 1, e.g. x[1:8], x[2:9], and x[3:10].

    •  N: How many times is it unrolled

    •  AV: Vectorized axis # 0 means not vectorized, some sort of reduction

    •  W: vector width

    •  M: bitmask indicating whether each factor is masked

    •  X: stride between loads of vectors along axis AV.

    •  i::I - index

julia> vload(stridedpointer(rgb), Unroll{1,1,3,1,Int(pick_vector_width(Float32)),zero(UInt),3}((1,)))
3 x Vec{16, UInt8}
Vec{16, UInt8}<0xb9, 0x4e, 0x11, 0x35, 0x6a, 0xa1, 0x82, 0x25, 0x06, 0xc9, 0x9b, 0x07, 0x3a, 0x1c, 0x04, 0x0c>
Vec{16, UInt8}<0x81, 0x8f, 0x7d, 0x4a, 0xaa, 0xf3, 0xc3, 0x50, 0x50, 0xcd, 0xe5, 0x81, 0xd2, 0x9f, 0x19, 0x84>
Vec{16, UInt8}<0x6f, 0xb1, 0xe2, 0x1e, 0xd3, 0xcf, 0x76, 0x7a, 0x5f, 0x0f, 0xe0, 0x50, 0x41, 0x3b, 0x99, 0x81>

julia> rgb[1:3:48]'
1×16 adjoint(::Vector{UInt8}) with eltype UInt8:
 0xb9  0x4e  0x11  0x35  0x6a  0xa1  0x82  0x25  0x06  0xc9  0x9b  0x07  0x3a  0x1c  0x04  0x0c

julia> rgb[2:3:48]'
1×16 adjoint(::Vector{UInt8}) with eltype UInt8:
 0x81  0x8f  0x7d  0x4a  0xaa  0xf3  0xc3  0x50  0x50  0xcd  0xe5  0x81  0xd2  0x9f  0x19  0x84

julia> rgb[3:3:48]'
1×16 adjoint(::Vector{UInt8}) with eltype UInt8:
 0x6f  0xb1  0xe2  0x1e  0xd3  0xcf  0x76  0x7a  0x5f  0x0f  0xe0  0x50  0x41  0x3b  0x99  0x81

Benchmark vs the old approach:

julia> function loadold(rgb, i)
           sp = stridedpointer(rgb)
           r = vload(sp, (MM(pick_vector_width(Float32), i, StaticInt(3)),))
           g = vload(sp, (MM(pick_vector_width(Float32), i+1, StaticInt(3)),))
           b = vload(sp, (MM(pick_vector_width(Float32), i+2, StaticInt(3)),))
           VecUnroll((r,g,b))
       end
loadold (generic function with 1 method)

julia> loadold(rgb,1)
3 x Vec{16, UInt8}
Vec{16, UInt8}<0xb9, 0x4e, 0x11, 0x35, 0x6a, 0xa1, 0x82, 0x25, 0x06, 0xc9, 0x9b, 0x07, 0x3a, 0x1c, 0x04, 0x0c>
Vec{16, UInt8}<0x81, 0x8f, 0x7d, 0x4a, 0xaa, 0xf3, 0xc3, 0x50, 0x50, 0xcd, 0xe5, 0x81, 0xd2, 0x9f, 0x19, 0x84>
Vec{16, UInt8}<0x6f, 0xb1, 0xe2, 0x1e, 0xd3, 0xcf, 0x76, 0x7a, 0x5f, 0x0f, 0xe0, 0x50, 0x41, 0x3b, 0x99, 0x81>

julia> vload(stridedpointer(rgb), Unroll{1,1,3,1,Int(pick_vector_width(Float32)),zero(UInt),3}((1,)))
3 x Vec{16, UInt8}
Vec{16, UInt8}<0xb9, 0x4e, 0x11, 0x35, 0x6a, 0xa1, 0x82, 0x25, 0x06, 0xc9, 0x9b, 0x07, 0x3a, 0x1c, 0x04, 0x0c>
Vec{16, UInt8}<0x81, 0x8f, 0x7d, 0x4a, 0xaa, 0xf3, 0xc3, 0x50, 0x50, 0xcd, 0xe5, 0x81, 0xd2, 0x9f, 0x19, 0x84>
Vec{16, UInt8}<0x6f, 0xb1, 0xe2, 0x1e, 0xd3, 0xcf, 0x76, 0x7a, 0x5f, 0x0f, 0xe0, 0x50, 0x41, 0x3b, 0x99, 0x81>

julia> @benchmark loadold($rgb,1)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     22.970 ns (0.00% GC)
  median time:      23.059 ns (0.00% GC)
  mean time:        23.124 ns (0.00% GC)
  maximum time:     39.215 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     996

julia> @benchmark vload(stridedpointer($rgb), Unroll{1,1,3,1,Int(pick_vector_width(Float32)),zero(UInt),3}((1,)))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.187 ns (0.00% GC)
  median time:      2.189 ns (0.00% GC)
  mean time:        2.198 ns (0.00% GC)
  maximum time:     17.585 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

Let me know if this helps/if you have any other questions!
The Unroll API is pretty awkward, and the order of the parameters is nonsensical to the point that I have no clue what I was thinking (I added the docstring because I have to look it up every time myself…). Maybe I’ll clean it up for VectorizationBase 0.20.

But it’s meant to have fields I (or LoopVectorization) can plug things in to get whatever behavior it wants to get optimizations on the loads/stores like in this example.


Do you mean this?

julia> avxfor_loop!(out2, rgb)
ERROR: Failed to parse LLVM assembly: 
llvmcall:6:9: error: '%res.0' defined with type 'i32' but expected 'i24'
ret i24 %res.0

I get that with both LoopVectorization 0.12.7 + VectorizationBase 0.19.14 and with the master branches of both packages.

In case it matters:

julia> Sys.cpu_info()[1]
Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz: 

Fixed in VectorizationBase 0.12.5.
I’ve now tested it on an AVX2 system. The way LoopVectorization handles the remainder n % 8 is slow without AVX512, so here the regular conversion loop cvt_for_loop! was fastest.

Using VectorizationBase, you could handle the n % 8 remainder differently and match cvt_for_loop!'s performance, but it’s probably already more or less optimal on systems without AVX512.
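For instance, the simplest way to handle that remainder without masked loads is to follow the SIMD kernel with the scalar loop from the top of the thread. A sketch, where scalar_tail! and the jstart argument are hypothetical names:

```julia
# Sketch: finish the last `length(out) % W` outputs with plain scalar code,
# picking up where the SIMD loop stopped. `f` is as defined earlier in the
# thread; `jstart` is the first output index not yet written by the SIMD loop.
function scalar_tail!(out, rgb, lut, jstart)
    @inbounds for j in jstart:length(out)
        r = lut[1 + rgb[3j - 2]]
        g = lut[1 + rgb[3j - 1]]
        b = lut[1 + rgb[3j]]
        out[j] = f(r, g, b)
    end
end
```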


Yes, both my lut and f are quite a bit more complex. Unfortunately they are somewhat proprietary so I can’t give a full reproduction. Using those in your functions I have timings on my AVX2 system of

  • while_loop!: 3.1 μs
  • for_loop!: 730 ns
  • avxfor_loop!: 1.0 μs
  • avxfor_loop2!: 920 ns
  • cvt_for_loop!: 3.1 μs
  • cvt_avxfor_loop!: 2.4 μs
  • cvt_avxfor_loop_u2!: 2.1 μs

For comparison my baseline VectorizationBase solution clocks in at 1.2 μs, which I guess goes to show how futile it is to try to SIMD by hand unless you really know what you’re doing (which I clearly don’t).

Still it would be interesting to see how far the VectorizationBase approach can be pushed with better data loading. From the numbers above it seems clear that table lookup should be cheaper in my case but I haven’t found how to do that with a VecUnroll. I have found that I can split the VecUnroll into r, g, b with

julia> vecrgb = vload(stridedpointer(rgb), Unroll{1,1,3,1,Int(pick_vector_width(Float32)),zero(UInt),3}((1,)))
3 x Vec{8, UInt8}
Vec{8, UInt8}<0x4c, 0xaf, 0x76, 0x5f, 0x7e, 0x13, 0x0c, 0x0d>
Vec{8, UInt8}<0xc9, 0x76, 0x02, 0xa1, 0x3e, 0x63, 0xa1, 0xf4>
Vec{8, UInt8}<0x9b, 0xf4, 0xef, 0xe2, 0xc1, 0x7f, 0x22, 0x7a>

julia> r, g, b = VectorizationBase.data(vecrgb)
(Vec{8, UInt8}<0x4c, 0xaf, 0x76, 0x5f, 0x7e, 0x13, 0x0c, 0x0d>, Vec{8, UInt8}<0xc9, 0x76, 0x02, 0xa1, 0x3e, 0x63, 0xa1, 0xf4>, Vec{8, UInt8}<0x9b, 0xf4, 0xef, 0xe2, 0xc1, 0x7f, 0x22, 0x7a>)

but I suspect that’s not quite the right abstraction. I’ve also managed to do table lookup after splitting with vload(stridedpointer(lut), (r,)), but that seems rather less likely to be the right thing to do. Computing f with a Vec or VecUnroll is unproblematic at least, and I guess I can figure out the right vstore! call to use once I have the appropriate data to store.

On the other hand, if you don’t gain experience doing it by hand, how would you learn how to write or improve packages like LoopVectorization?

Also, re the table vs conversion timings, my AVX2 system was Haswell, which has much slower gather instructions (used for SIMD table lookups) than Intel Skylake and beyond. Zen and Zen2 are slower than Haswell.

That’s reasonable. Once you’re doing table lookups into random indices, there aren’t really any possible optimizations.
But you could wait to split the VecUnroll until after loading:

julia> using VectorizationBase

julia> n =  334*3;

julia> rgb = rand(UInt8, n);

julia> const lut = [Float32(i)/256 for i in 0:255];

julia> vu_rgb = vload(stridedpointer(rgb), Unroll{1,1,3,1,Int(pick_vector_width(Float32)),zero(UInt),3}((1,)))
3 x Vec{16, UInt8}
Vec{16, UInt8}<0x60, 0x24, 0xdc, 0x69, 0xfc, 0x5a, 0x1b, 0xb0, 0xe3, 0x82, 0x24, 0xd7, 0xc8, 0x68, 0x9e, 0x43>
Vec{16, UInt8}<0xc1, 0x55, 0x03, 0x8f, 0x4d, 0xbf, 0x29, 0xbe, 0x56, 0x0f, 0x1f, 0xdc, 0xc6, 0x83, 0x25, 0x72>
Vec{16, UInt8}<0x01, 0xe5, 0xbd, 0xa4, 0x7a, 0x59, 0x8e, 0xd7, 0xb6, 0x38, 0x2d, 0xaf, 0xed, 0x5e, 0x63, 0x0c>

julia> vload(stridedpointer(lut), (vu_rgb,))
3 x Vec{16, Float32}
Vec{16, Float32}<0.37109375f0, 0.13671875f0, 0.85546875f0, 0.40625f0, 0.98046875f0, 0.34765625f0, 0.1015625f0, 0.68359375f0, 0.8828125f0, 0.50390625f0, 0.13671875f0, 0.8359375f0, 0.77734375f0, 0.40234375f0, 0.61328125f0, 0.2578125f0>
Vec{16, Float32}<0.75f0, 0.328125f0, 0.0078125f0, 0.5546875f0, 0.296875f0, 0.7421875f0, 0.15625f0, 0.73828125f0, 0.33203125f0, 0.0546875f0, 0.1171875f0, 0.85546875f0, 0.76953125f0, 0.5078125f0, 0.140625f0, 0.44140625f0>
Vec{16, Float32}<0.0f0, 0.890625f0, 0.734375f0, 0.63671875f0, 0.47265625f0, 0.34375f0, 0.55078125f0, 0.8359375f0, 0.70703125f0, 0.21484375f0, 0.171875f0, 0.6796875f0, 0.921875f0, 0.36328125f0, 0.3828125f0, 0.04296875f0>

It always wants either tuples or Unroll{...}(::Tuple) for indexing.
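Putting the pieces of this thread together, a combined kernel might look roughly like the sketch below. This is not verified against any particular VectorizationBase version; the vstore! index form (MM(W, j),) is an assumption mirroring the vload calls above, and a scalar tail loop would still be needed for the last length(out) % W outputs.

```julia
using VectorizationBase

# Sketch (untested): Unroll load + table gather + vector store.
function lut_kernel!(out, rgb, lut, f)
    W = pick_vector_width(Float32)
    Wi = Int(W)
    sp_rgb = stridedpointer(rgb)
    sp_lut = stridedpointer(lut)
    sp_out = stridedpointer(out)
    j = 1
    while j <= length(out) - Wi + 1
        # Load 3W contiguous bytes, shuffled into r/g/b lanes.
        vu = vload(sp_rgb, Unroll{1,1,3,1,Wi,zero(UInt),3}((3j - 2,)))
        # Gather from the 256-entry table. Caveat: as in the REPL example
        # above, this indexes with the raw byte values; the scalar code's
        # `1 +` offset would still need to be accounted for.
        tbl = vload(sp_lut, (vu,))
        r, g, b = VectorizationBase.data(tbl)
        # Assumed store form, mirroring the tuple-indexed vload calls.
        vstore!(sp_out, f(r, g, b), (MM(W, j),))
        j += Wi
    end
end
```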