Problem with LoopVectorization: @turbo, LoopVectorization.check_args

Hi,
I’m trying to optimize a loop, and I get a warning that `LoopVectorization.check_args` on my inputs failed.

Here is my MWE:

using LoopVectorization

psi = rand(ComplexF64, 4, 4)
dpsi = zeros(ComplexF64, 4, 4)

function test_turbo(dpsi::Array{ComplexF64,2}, psi::Array{ComplexF64,2}, t::Float64)
    @turbo for pos in 1:length(psi)
        dpsi[pos] = -2psi[pos]
    end
end

julia> test_turbo(dpsi, psi, 0.0)
┌ Warning: #= /path/to/file/test.jl:70 =#:
│ `LoopVectorization.check_args` on your inputs failed; running fallback `@inbounds @fastmath` loop instead.
│ Use `warn_check_args=false`, e.g. `@turbo warn_check_args=false ...`, to disable this warning.
└ @ Main ~/.julia/packages/LoopVectorization/ndGJi/src/condense_loopset.jl:825

It looks like I’m missing something very basic.
Any help will be welcome!

I think that package prefers you to use `eachindex`, e.g. `for pos in eachindex(psi)`.

IIRC, you need StructArrays.jl to make LoopVectorization work on Complex.

2 Likes

Thank you!
I’m now stuck on converting this to a lattice with circular boundary conditions; is there a simple solution for that, or should I just open a new topic?

I tried StructArrays.jl to make LoopVectorization work on complex arrays, but with no success. Could you give an example of how to use it? Thanks.

The following link has more detail:
https://github.com/JuliaSIMD/LoopVectorization.jl#dealing-with-structs
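
For the MWE above, a minimal (untested) sketch of the StructArrays route could look like this; the function name and the explicit re/im loop body are mine:

using LoopVectorization, StructArrays

psi_s  = StructArray(rand(ComplexF64, 4, 4))   # stores .re and .im as separate Float64 matrices
dpsi_s = similar(psi_s)

function test_turbo_soa!(dpsi, psi)
    @turbo for pos in eachindex(psi.re)
        # -2 * (re + im*i) has real part -2*re and imaginary part -2*im
        dpsi.re[pos] = -2 * psi.re[pos]
        dpsi.im[pos] = -2 * psi.im[pos]
    end
    return dpsi
end

test_turbo_soa!(dpsi_s, psi_s)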

This is a little inelegant, but we don’t have better tools at present.
In general, an SoA (struct-of-arrays) layout also helps LLVM generate better SIMD code. If your data can be stored in a StructArray from the very beginning, you might get better performance even without LV.jl. Just an example:

julia> using StructArrays, BenchmarkTools

julia> a = randn(ComplexF64, 4096); b = randn(ComplexF64, 4096); c = similar(a, Float64);

julia> f(c, a, b) = c .= abs2.(a .* b) .+ abs2.(a .+ b).^3
f (generic function with 2 methods)

julia> @btime f($c, $a, $b);
  3.650 μs (0 allocations: 0 bytes)

julia> sa = StructArray(a); sb = StructArray(b);

julia> @btime f($c, $sa, $sb);
  2.267 μs (0 allocations: 0 bytes)
1 Like

Here is my example.
I work with ghost cells, so I also used OffsetArrays.jl to help with the indices.
There is at least a factor of 10 between using @turbo and not.

# mwe_struct_offset.jl
using LoopVectorization, OffsetArrays, StructArrays, BenchmarkTools

const StructOffset = StructArray{ComplexF64, 2, NamedTuple{(:re, :im), Tuple{OffsetMatrix{Float64, Matrix{Float64}}, OffsetMatrix{Float64, Matrix{Float64}}}}, Int64}
function ghost(dpsi::StructOffset, psi::StructOffset, param, time::Float64)
    # simple lattice model 
    m, Jt, Js, lat_size = param
    (L, M) = lat_size
   
    @turbo for t in 1:M, x=1:L
        interaction_u_re = -Jt*(psi.re[x,t+1] + psi.re[x,t-1])
        interaction_u_im = -Jt*(psi.im[x,t+1] + psi.im[x,t-1])

        interaction_u_re += -Js*(psi.re[x+1,t] + psi.re[x-1,t])
        interaction_u_im += -Js*(psi.im[x+1,t] + psi.im[x-1,t])

        dpsi.re[x,t] =  (m*psi.re[x,t] + interaction_u_re)
        dpsi.im[x,t] = -(m*psi.im[x,t] + interaction_u_im)
    end
end

L = 10
M = 10

ghost_psi  =  rand(ComplexF64,L+2,M+2)
psi_struct  = StructArray(OffsetArray(ghost_psi , -1, -1))
dpsi_struct = similar(psi_struct)

# applying circular/periodic boundary condition
psi_struct[1:L,0]   = psi_struct[1:L,M]
psi_struct[1:L,M+1] = psi_struct[1:L,1]
psi_struct[0,1:M]   = psi_struct[L,1:M]
psi_struct[L+1,1:M] = psi_struct[1,1:M]

param = (1.0, 1.0, 1.0, (L,M))

# with @turbo
@btime ghost($dpsi_struct, $psi_struct, $param, 0.0)
76.352 ns (0 allocations: 0 bytes)
# without @turbo (same function redefined without the macro)
@btime ghost($dpsi_struct, $psi_struct, $param, 0.0)
957.318 ns (0 allocations: 0 bytes)
2 Likes

There are also tests without using StructArrays here:
https://github.com/JuliaSIMD/LoopVectorization.jl/blob/master/test/shuffleloadstores.jl

Unfortunately, it all requires reinterpret and manually writing the complex operations.
The LoopVectorization rewrite will eliminate the need for this, but that’s probably at least a year away.
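
For reference, a rough sketch of what that reinterpret approach looks like (my own illustration, not the package’s test code; whether check_args accepts the reinterpreted view may depend on the LoopVectorization version):

using LoopVectorization

a = rand(ComplexF64, 128); b = rand(ComplexF64, 128); c = similar(a)
# view each complex vector as a 2×128 Float64 matrix: row 1 = re, row 2 = im
ar = reinterpret(reshape, Float64, a)
br = reinterpret(reshape, Float64, b)
cr = reinterpret(reshape, Float64, c)

@turbo for i in axes(ar, 2)
    # complex multiply c = a*b written out by hand
    cr[1, i] = ar[1, i] * br[1, i] - ar[2, i] * br[2, i]
    cr[2, i] = ar[1, i] * br[2, i] + ar[2, i] * br[1, i]
end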

In the mean time, PRs for better documentation are always welcome.

2 Likes

StructArrays or reinterpret: which one is the better solution with LoopVectorization?

Depends on the rest of your code.
All else equal, StructArrays is better because that is the optimal data layout for SIMD. LV will need extra shuffles to permute the data otherwise.

But if the rest of your code prefers regular arrays, shuffling a bit within the loop could be faster.

2 Likes

Hi, I’ve been having the same issue and thought it was also because of the complex arrays I’m dealing with. However, I found the same kind of error (a check_args failure) in loops that deal only with real input. For example, this loop doesn’t work:

using LoopVectorization, StaticArrays  # assuming these are the packages in use (for @tturbo and SVector)

function residual!(r::AbstractArray{T,3}, v::AbstractArray{T,3}, f::AbstractArray{T,3}, x_vec::Tuple{AbstractVector{T}, AbstractVector{T}, AbstractVector{T}}, box_size::SVector{3,T}, box_min::SVector{3,T}, β::T, damping_factor::T, niterations::Int; los=nothing) where T <: AbstractFloat
    cell = map(T, box_size ./ size(v))
    cell2 = cell .^2
    icell2 = cell2 .^-1
    losn = los != nothing ? los ./ cell : nothing
    nmesh = size(v)
    xx, xy, xz = x_vec
    
    @tturbo for I in CartesianIndices(v) #This loop raises a warning and falls back to @inbounds @fastmath

        ix0, iy0, iz0 = Tuple(I)

        ixp = ix0 + 1 
        ixp = ixp > nmesh[1] ? ixp - nmesh[1] : ixp
        ixm = ix0 - 1 
        ixm = ixm < 1 ? ixm + nmesh[1] : ixm

        iyp = iy0 + 1 
        iyp = iyp > nmesh[2] ? iyp - nmesh[2] : iyp
        iym = iy0 - 1 
        iym = iym < 1 ? iym + nmesh[2] : iym

        izp = iz0 + 1 
        izp = izp > nmesh[3] ? izp - nmesh[3] : izp
        izm = iz0 - 1 
        izm = izm < 1 ? izm + nmesh[3] : izm
        
     

        px = los == nothing ? xx[I[1]] / cell[1] : losn[1]
        py = los == nothing ? xy[I[2]] / cell[2] : losn[2]
        pz = los == nothing ? xz[I[3]] / cell[3] : losn[3]

        g = β / (cell2[1] * px^2 + cell2[2] * py^2 + cell2[3] * pz^2)
        gpx2 = icell2[1] + g * px^2
        gpy2 = icell2[2] + g * py^2
        gpz2 = icell2[3] + g * pz^2

        r[I] = 2*(gpx2 + gpy2 + gpz2)*v[I] -
            (gpx2*(v[ixp, iy0, iz0]+v[ixm, iy0, iz0])+
            gpy2*(v[ix0, iyp, iz0]+v[ix0, iym, iz0])+
            gpz2*(v[ix0, iy0, izp]+v[ix0, iy0, izm])+
            g/2*(px*py*(v[ixp, iyp, iz0]+v[ixm, iym, iz0]
                        -v[ixm, iyp, iz0]-v[ixp, iym, iz0])+
                    px*pz*(v[ixp, iy0, izp]+v[ixm, iy0, izm]
                        -v[ixm, iy0, izp]-v[ixp, iy0, izm])+
                    py*pz*(v[ix0, iyp, izp]+v[ix0, iym, izm]
                        -v[ix0, iym, izp]-v[ix0, iyp, izm])));
        
        r[I] = los == nothing ? r[I] - g*(px*(v[ixp, iy0, iz0]-v[ixm, iy0, iz0])+
                    py*(v[ix0, iyp, iz0]-v[ix0, iym, iz0])+
                    pz*(v[ix0, iy0, izp]-v[ix0, iy0, izm])) : r[I]

    end # for v

    @tturbo for I in CartesianIndices(v) #This much simpler loop works ok.
        r[I] = f[I] - r[I]
    end #for
    r
 end #func

In this example, all arrays have real input. Could anyone explain why it fails in this case? Thanks in advance!

Can you give me code to copy/paste so I can run the above example?

@tturbo for I in CartesianIndices(v) #This loop raises a warning and falls back to @inbounds @fastmath

        ix0, iy0, iz0 = Tuple(I)

I’m not sure if that’d work. I don’t remember adding support for `Tuple(I)`, anyway.
So you could try replacing it with the individual loops manually, i.e.

@tturbo for iz0 = axes(z,3), iy0 = axes(z,2), ix0 = axes(z,1)
    ...
    r[ix0, iy0, iz0] = ...
end

It is a bit long to put here completely, I think, but I’ll add initialization with some random values:

using StaticArrays  # for @SVector below

grid_size = (128,128,128)
function x_vec(field::AbstractArray, box_size::SVector{3,T}, box_min::SVector{3,T}) where T<:Real
    dims = [size(field)...]
    cell_size = box_size ./ dims
    Tuple(map(T, collect(box_min[i] + 0.5 * cell_size[i]:cell_size[i]:box_min[i] + box_size[i])) for i in 1:3)
end #func
r = ones(Float32, grid_size...)
v = similar(r)
f = similar(r)
box_size = @SVector [ 1000f0, 1000f0, 1000f0]
box_min = @SVector [ 0f0, 0f0, 0f0]
x_vec = x_vec(f, box_size, box_min) # I realize the variable naming is not good, this is normally not an issue since x_vec is called differently outside the function.
β = 1f0
damping_factor = 0.5f0
niterations = 4
los = nothing # or [0,0,1] should work with both
residual!(r, v, f, x_vec, box_size, box_min, β, damping_factor, niterations; los=los)

I will also test looping as you suggest, but I wanted to use CartesianIndices to make sure I don’t mix things up and loop over the dimensions in the wrong (slow) order.
Would

ix0 = I[1]
iy0 = I[2]
iz0 = I[3]

work?

Thanks!

LoopVectorization.jl picks the order for you. If you place them in a slow order, it’ll swap them to a fast one. The original order they were in doesn’t actually matter/impact what LoopVectorization.jl does.

Note that CartesianIndices will often go in the wrong order, e.g.

julia> CartesianIndices(rand(2,5)') |> collect |> vec
10-element Vector{CartesianIndex{2}}:
 CartesianIndex(1, 1)
 CartesianIndex(2, 1)
 CartesianIndex(3, 1)
 CartesianIndex(4, 1)
 CartesianIndex(5, 1)
 CartesianIndex(1, 2)
 CartesianIndex(2, 2)
 CartesianIndex(3, 2)
 CartesianIndex(4, 2)
 CartesianIndex(5, 2)

the better order would be

 CartesianIndex(1, 1)
 CartesianIndex(1, 2)
 CartesianIndex(2, 1)
 CartesianIndex(2, 2)
 CartesianIndex(3, 1)
 CartesianIndex(3, 2)
 CartesianIndex(4, 1)
 CartesianIndex(4, 2)
 CartesianIndex(5, 1)
 CartesianIndex(5, 2)

LoopVectorization.jl will change the order based on Adjoint or PermutedDimsArrays; CartesianIndices will not and instead always assumes column major.
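
For example (a small sketch of mine, assuming an Adjoint of a plain dense matrix), writing the loops explicitly lets LV pick the traversal order that matches the memory layout:

using LoopVectorization

A = rand(2, 5)'          # 5×2 Adjoint; memory is contiguous along the second axis
B = zeros(size(A))

@turbo for j in axes(A, 2), i in axes(A, 1)
    B[i, j] = 2A[i, j]   # LV is free to reorder these loops to traverse A contiguously
end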

1 Like

I haven’t tried it, but I don’t think I ever added the ability to understand syntax like that.

The warning comes from indexing cell, cell2, and icell2.
You can extract the scalars, i.e. define

Base.Cartesian.@nexprs 3 i -> begin
    cell_i = cell[i]
    cell2_i = cell2[i]
    icell2_i = icell2[i]
end

before the loop, and then reference cell_1, cell_2, etc. inside it.
Also remove everything that indexes or unpacks the CartesianIndex I.
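
For reference, that @nexprs call expands to roughly the following hand-written equivalent:

cell_1 = cell[1];  cell2_1 = cell2[1];  icell2_1 = icell2[1]
cell_2 = cell[2];  cell2_2 = cell2[2];  icell2_2 = icell2[2]
cell_3 = cell[3];  cell2_3 = cell2[3];  icell2_3 = icell2[3]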

Hi, thanks for your suggestions. I’ve been implementing them in this loop

function prolong!(v1h::AbstractArray{T,3}, v2h::AbstractArray{T,3}) where T <: AbstractFloat
    nmesh = size(v2h)
    nmesh2 = size(v1h)
    Base.Cartesian.@nexprs 3 i -> begin
        nmesh_i = nmesh[i]
        nmesh2_i = nmesh2[i]
    end
    #@inbounds Threads.@threads for I in CartesianIndices(v2h)
    @tturbo for iz0 = axes(v2h,3), iy0 = axes(v2h,2), ix0 = axes(v2h,1)
        #ix0, iy0, iz0 = Tuple(I)
        

        ixp = ix0 + 1 
        ixp = ixp > nmesh_1 ? ixp - nmesh_1 : ixp
        ixm = ix0 - 1 
        ixm = ixm < 1 ? ixm + nmesh_1 : ixm

        iyp = iy0 + 1 
        iyp = iyp > nmesh_2 ? iyp - nmesh_2 : iyp
        iym = iy0 - 1 
        iym = iym < 1 ? iym + nmesh_2 : iym

        izp = iz0 + 1 
        izp = izp > nmesh_3 ? izp - nmesh_3 : izp
        izm = iz0 - 1 
        izm = izm < 1 ? izm + nmesh_3 : izm

        i2x0 = 2*ix0
        i2xp = i2x0 + 1
        i2xp = i2xp > nmesh2_1 ? i2xp - nmesh2_1 : i2xp

        i2y0 = 2*iy0
        i2yp = i2y0 + 1
        i2yp = i2yp > nmesh2_2 ? i2yp - nmesh2_2 : i2yp

        i2z0 = 2*iz0
        i2zp = i2z0 + 1
        i2zp = i2zp > nmesh2_3 ? i2zp - nmesh2_3 : i2zp

        v1h[i2x0, i2y0, i2z0] = v2h[ix0,iy0,iz0];
        v1h[i2xp, i2y0, i2z0] = (v2h[ix0,iy0,iz0] + v2h[ixp, iy0, iz0])/2;
        v1h[i2x0, i2yp, i2z0] = (v2h[ix0,iy0,iz0] + v2h[ix0, iyp, iz0])/2;
        v1h[i2x0, i2y0, i2zp] = (v2h[ix0,iy0,iz0] + v2h[ix0, iy0, izp])/2;
        v1h[i2xp, i2yp, i2z0] = (v2h[ix0,iy0,iz0] + v2h[ixp, iy0, iz0]
                              + v2h[ix0, iyp, iz0] + v2h[ixp, iyp, iz0])/4;
        v1h[i2x0, i2yp, i2zp] = (v2h[ix0,iy0,iz0] + v2h[ix0, iyp, iz0]
                              + v2h[ix0, iy0, izp] + v2h[ix0, iyp, izp])/4;
        v1h[i2xp, i2y0, i2zp] = (v2h[ix0,iy0,iz0] + v2h[ixp, iy0, iz0]
                              + v2h[ix0, iy0, izp] + v2h[ixp, iy0, izp])/4;
        v1h[i2xp, i2yp, i2zp] = (v2h[ix0,iy0,iz0] + v2h[ixp, iy0, iz0]
                              + v2h[ix0, iyp, iz0] + v2h[ix0, iy0, izp]
                              + v2h[ixp, iyp, iz0] + v2h[ixp, iy0, izp]
                              + v2h[ix0, iyp, izp] + v2h[ixp, iyp, izp])/8;
    end #for v2
    v1h
 end #func

but now when using @tturbo I get an out-of-memory error that I do not get if I uncomment the @threads loop and unpack ix0, iy0, iz0. Have you encountered this before? Should the vectorized code use more memory, or does it have something to do with compilation?

Thanks,

Please provide a copy/pastable reproducer. Defining the function/expanding the macro works for me.

Sorry, I forgot the initialization

v1h = zeros(Float32, 512,512,512)
v2h = rand(Float32, 256, 256, 256)
prolong!(v1h, v2h)

Can you try again with VectorizationBase 0.21.54 (which was just released)?
I get

julia> @benchmark prolong!($v1h, $v2h)
BenchmarkTools.Trial: 303 samples with 1 evaluation.
 Range (min … max):  16.214 ms …  20.038 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     16.333 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   16.511 ms ± 556.707 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▅█▆▂
  ▄█████▇▆▄▄▄▅▁▁▁▄▄▁▅▅▄▁▅▁▆▁▁▁▁▅▄▁▁▄▁▁▁▄▁▁▁▁▁▁▁▁▁▄▁▁▁▅▁▆▄▄▄▄▄▄ ▆
  16.2 ms       Histogram: log(frequency) by time      18.8 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark prolong_threads!($v1h, $v2h)
BenchmarkTools.Trial: 199 samples with 1 evaluation.
 Range (min … max):  24.969 ms …  27.071 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     25.063 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   25.136 ms ± 266.319 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▄▇██▆▄
  ▇██████▇▅▆▅▅▅▁▁▇▆▁▁▅▁▅▁▁▅▅▁▁▁▆▅▁▅▁▁▆▁▁▅▁▁▁▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▅▅ ▅
  25 ms         Histogram: log(frequency) by time      26.3 ms <

 Memory estimate: 10.59 KiB, allocs estimate: 109.

This is after moving the @inbounds inside the @threads loop in the @threads version (otherwise, it gets ignored).

You could probably improve performance by splitting the loop up to avoid those branches on indexes.
In theory, LoopVectorization should do that automatically, but it’s not something I’d gotten around to implementing.
I’ll very likely implement it in the rewrite, since it is a pattern that shows up so often.
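
To illustrate the idea (a minimal 1-D sketch of mine with a made-up stencil, not code from this thread): keep the hot loop branch-free over the interior and handle the periodic wrap-around points outside it.

using LoopVectorization

function smooth_periodic!(out, x)
    n = length(x)
    @turbo for i in 2:n-1               # interior: i-1 and i+1 never wrap
        out[i] = (x[i-1] + x[i] + x[i+1]) / 3
    end
    # only the two boundary points need the periodic wrap, so handle them separately
    out[1] = (x[n] + x[1] + x[2]) / 3
    out[n] = (x[n-1] + x[n] + x[1]) / 3
    return out
end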

Vectorizing those branches is very expensive, because it makes the loads/stores that use them go from fairly cheap vector moves to expensive gather/scatter instructions.

julia> using LinuxPerf

julia> foreachf(f::F, N::Int, arg1::A, args::Vararg{Any,K}) where {F,A,K} = foreach(_ -> f(arg1, args...), 1:N)
foreachf (generic function with 1 method)

julia> @time @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
         foreachf(prolong!, 100, v1h, v2h)
       end
  1.689482 seconds (32.32 k allocations: 1.924 MiB, 2.14% compilation time)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               1.01e+11   33.3%  #  3.4 cycles per ns
└ task-clock               2.93e+10   33.3%  # 29.3 s
┌ instructions             7.86e+09   66.7%  #  0.1 insns per cycle
│ branch-instructions      5.10e+07   66.7%  #  0.6% of insns
└ branch-misses            1.68e+05   66.7%  #  0.3% of branch insns
┌ L1-dcache-load-misses    1.29e+09   33.3%  # 53.2% of dcache loads
│ L1-dcache-loads          2.42e+09   33.3%
│ cache-misses             9.56e+08   33.3%  # 96.3% of cache refs
└ cache-references         9.93e+08   33.3%
                 aggregated from 18 threads
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @time @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
         foreachf(prolong_threads!, 100, v1h, v2h)
       end
  2.604119 seconds (43.48 k allocations: 2.966 MiB, 1.39% compilation time)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               1.71e+11   33.3%  #  3.7 cycles per ns
└ task-clock               4.58e+10   33.3%  # 45.8 s
┌ instructions             3.03e+11   66.7%  #  1.8 insns per cycle
│ branch-instructions      6.96e+09   66.7%  #  2.3% of insns
└ branch-misses            5.65e+06   66.7%  #  0.1% of branch insns
┌ L1-dcache-load-misses    1.07e+09   33.3%  #  1.1% of dcache loads
│ L1-dcache-loads          1.01e+11   33.3%
│ cache-misses             1.02e+09   33.3%  # 95.6% of cache refs
└ cache-references         1.06e+09   33.3%
                 aggregated from 18 threads
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Look at the difference in the number of instructions needed (and in instructions per clock cycle)!
The @threads version required 38.5 times as many instructions as the @tturbo version.
(Each measurement runs the function 100 times.)

How do I make sure I use the correct VectorizationBase? I just did `using VectorizationBase` and I got

│ Package VectorizationBase not found, but a package named VectorizationBase is available from a registry.
│ Install package?
│   (v1.7) pkg> add VectorizationBase
└ (y/n) [y]:
   Resolving package versions...
    Updating `/global/u1/d/dforero/.julia/nersc/perlmutter/environments/v1.7/Project.toml`
  [3d5dd08c] + VectorizationBase v0.21.49
  No Changes to `/global/u1/d/dforero/.julia/nersc/perlmutter/environments/v1.7/Manifest.toml`

If I try

Pkg.add(PackageSpec(name="VectorizationBase", version="0.21.54"))

I get

ERROR: Unsatisfiable requirements detected for package VectorizationBase [3d5dd08c]:
 VectorizationBase [3d5dd08c] log:
 ├─possible versions are: 0.1.0-0.21.49 or uninstalled
 └─restricted to versions 0.21.54 by an explicit requirement — no versions left