How to use SIMD with a very simple example

Hello,

I am trying to understand how to use SIMD.jl.
After trying the vadd! example in the documentation I wanted use SIMD in a different example. No success so far.

This is what I would like to simdify:

function myfunc(a, b)
    if a > b
        return a - b
    else
        return a + b
    end
end
x = rand(1_000_000);
# do myfunc.(x, 2.) with explicit simd calls
myfunc.(x, 2.)

I tried the following

function myfunc_simd(x::Vector{T}, value::T, ::Type{Vec{N,T}}) where {N, T}
    @assert length(x) % N == 0
    result = Array{T}(undef, length(x))
    lane   = VecRange{N}(0)
    @inbounds for i in 1:N:length(x)
        result[lane + i] += x[lane+i] > value ? x[k+lane] - value : x[k+lane] +value
    end
end

But I gon an error

x = rand(Float32,1_000_000);
myfunc_simd(x, Float32(2.), Vec{8,Float32})

TypeError: non-boolean (Vec{8,Bool}) used in boolean context

Any ideas on how to do something like this?

Do you know about any tutorial containing examples using SIMD.jl ?

I have seen https://kristofferc.github.io/post/intrinsics/ , which is pretty awesome, but does not contain any example with “flag-boolean vectors” (masks) used to control an if-else statement like the one I have here.

Please ensure your code runs before you post it; it doesn’t produce the same error. You’ll need vifelse.

2 Likes

To quote from @kristoffer.carlsson s blog post

Any type of control flow inside the loop will likely mean the loop will not vectorize.

I guess that might be your problem here?

Maybe this is of some little help:

julia> using SIMD

julia> x = 8 * rand(Float32, 8)
8-element Array{Float32,1}:
 0.10612774
 5.7371054 
 0.8144512 
 6.429695  
 4.938607  
 7.0792313 
 0.9309225 
 4.3026505 

julia> v = vload(Vec{8, Float32}, x, 1)
<8 x Float32>[0.10612774, 5.7371054, 0.8144512, 6.429695, 4.938607, 7.0792313, 0.9309225, 4.3026505]

julia> vifelse(v > 2, v - 2, v + 2)
<8 x Float32>[2.1061277, 3.7371054, 2.8144512, 4.429695, 2.9386072, 5.0792313, 2.9309225, 2.3026505]
1 Like

You are right, post updated.

julia> function myfunc_simd(x::Vector{T}, value::T, ::Type{Vec{N,T}}) where {N, T}
           @assert length(x) % N == 0
           result = Array{T}(undef, length(x))
           lane   = VecRange{N}(0)
           @inbounds for i in 1:N:length(x)        
               x_vslice    = vload(Vec{N, T}, x, i) # i = 2*k+1 where k=1,2,3,4,...
               result[lane + i] = vifelse(x_vslice > 2, x_vslice - value, x_vslice + value)
           end
           return result
       end
myfunc_simd (generic function with 1 method)

Using the vifelse I managed to do the function that does what I want

julia> x = rand(Float32,1_000_000);

julia> result_1 = myfunc.(x,1);

julia> result_2 = myfunc_simd(x, Float32(1), Vec{8,Float32});

julia> result_1 == result_2
true

Nevertheless, both versions take a similar amount of time:


julia> using BenchmarkTools

jjulia> @btime myfunc.(x,1);
  887.890 μs (4 allocations: 3.81 MiB)

julia> @btime myfunc_simd(x, Float32(1), Vec{8,Float32});
  757.392 μs (2 allocations: 3.81 MiB)

Maybe for this example the compiler is smart enough to vectorize the result. Or there is not enough compute to be done.

In any case a good exercise to better understand how to do his manually.

vadd! case

julia> @btime vadd!($x,$y,Vec{32,Float32})
  660.512 μs (0 allocations: 0 bytes)

julia> @btime $x .= $x .+ $y;
  750.731 μs (0 allocations: 0 bytes)

You forgot to dot the plus sign. Also, you should interpolate the variables.

1 Like

I think that’s what’s happening in this case. In the code_llvm for broadcast(myfunc, x, 1), there are already occurrences of <8 x float>, and the native code uses ymm registers and vblendvps masked SIMD instructions.

And for the vadd! case, try @btime $x .+= $y.