Problem with LoopVectorization : @turbo, LoopVectorization.check_args

Hi,
I’m trying to optimize a loop and I get a warning about `LoopVectorization.check_args` on your inputs failed.

Here is my MWE:

psi = rand(ComplexF64,4,4)
dpsi = zeros(ComplexF64,4,4)

function test_turbo(dpsi::Array{ComplexF64,2}, psi::Array{ComplexF64,2}, t::Float64)
    @turbo for pos in 1:length(psi)
        dpsi[pos] = -2psi[pos]
    end
end

julia> test_turbo(dpsi, psi , 0.0)
┌ Warning: #= /path/to/file/test.jl:70 =#:
│ `LoopVectorization.check_args` on your inputs failed; running fallback `@inbounds @fastmath` loop instead.
│ Use `warn_check_args=false`, e.g. `@turbo warn_check_args=false ...`, to disable this warning.
└ @ Main ~/.julia/packages/LoopVectorization/ndGJi/src/condense_loopset.jl:825

It looks like I’m missing something very basic.
Any help will be welcome!

I think that package prefers you to use eachindex, eg. for pos in eachindex(psi)

IIRC, you need StructArrays.jl to make LoopVectorization work on Complex.

1 Like

Thank you!
I’ve now stuck in converting to a circular boundary condition lattice, is there any simple solution for that or I’ll just open a new issue?

I tried StructArrays.jl to make LoopVectorization work on Complex but with no success. Could you give an example on how to use it? Thanks.

The following link has more detail:

This is a little bit inelegant. But we don’t have better tools at present.
In general, SOA also helps LLVM to generate better simd code. If your data could be stored in StructArray from the very beginning. You might get better performance even without LV.jl. Just a example:

julia> a = randn(ComplexF64, 4096); b = randn(ComplexF64, 4096); c = similar(a, Float64);

julia> f(c, a, b) = c .= abs2.(a .* b) .+ abs2.(a .+ b).^3
f (generic function with 2 methods)

julia> @btime f($c, $a, $b);
  3.650 μs (0 allocations: 0 bytes)

julia> sa = StructArray(a); sb = StructArray(b);

julia> @btime f($c, $sa, $sb);
  2.267 μs (0 allocations: 0 bytes)
1 Like

Here is my example,
I worked with ghost cells so I also used OffsetArrays.jl for helping me with the indices.
There is a least factor 10 between using @turbo or not.

# mwe_struct_offset.jl
using LoopVectorization, OffsetArrays, StructArrays, BenchmarkTools

const StructOffset = StructArray{ComplexF64, 2, NamedTuple{(:re, :im), Tuple{OffsetMatrix{Float64, Matrix{Float64}}, OffsetMatrix{Float64, Matrix{Float64}}}}, Int64}
function ghost(dpsi::StructOffset, psi::StructOffset, param, time::Float64)
    # simple lattice model 
    m, Jt, Js, lat_size = param
    (L, M) = lat_size
   
    @turbo for t in 1:M, x=1:L
        interaction_u_re = -Jt*(psi.re[x,t+1] + psi.re[x,t-1])
        interaction_u_im = -Jt*(psi.im[x,t+1] + psi.im[x,t-1])

        interaction_u_re += -Js*(psi.re[x+1,t] + psi.re[x-1,t])
        interaction_u_im += -Js*(psi.im[x+1,t] + psi.im[x-1,t])

        dpsi.re[x,t] =  (m*psi.re[x,t] + interaction_u_re)
        dpsi.im[x,t] = -(m*psi.im[x,t] + interaction_u_im)
    end
end

L = 10
M = 10

ghost_psi  =  rand(ComplexF64,L+2,M+2)
psi_struct  = StructArray(OffsetArray(ghost_psi , -1, -1))
dpsi_struct = similar(psi_struct)

# applying circular/periodic boundary condition
psi_struct[1:L,0]   = psi_struct[1:L,M]
psi_struct[1:L,M+1] = psi_struct[1:L,1]
psi_struct[0,1:M]   = psi_struct[L,1:M]
psi_struct[L+1,1:M] = psi_struct[1,1:M]

param = (1.0, 1.0, 1.0, (L,M))

# with turbo
@btime ghost($dpsi_struct, $psi_struct, $param, 0.0)
76.352 ns (0 allocations: 0 bytes)
# without turbo
@btime ghost($dpsi_struct, $psi_struct, $param, 0.0)
957.318 ns (0 allocations: 0 bytes)
2 Likes

There are also tests without using StructArrays here:

Unfortunately, it all requires reinterpret and manually writing the complex operations.
The LoopVectorization rewrite will eliminate the need for this, but that’s probably at least a year away.

In the mean time, PRs for better documentation are always welcome.

2 Likes

StructArrays or reinterpret, which one is the better solution with LoopVecterization?

Depends on the rest of your code.
All else equal, StructArrays is better because that is the optimal data layout for SIMD. LV will need extra shuffles to permute the data otherwise.

But is the rest of your code prefers regular arrays, shuffling a bit within the loop could be faster.

2 Likes