LoopVectorization does not support functions with kwargs?

I get an error when trying to use LoopVectorization below:

using LoopVectorization

function f(x; kw=1)
    x + kw
end

function add(z; kw=1)
    s = zero(eltype(kw))
    @turbo for x in 1:z
        s += f(x; kw=kw)
    end
    return s
end

ERROR: LoadError: LoadError: Expression not recognized.
$(Expr(:parameters, :($(Expr(:kw, :kw, :kw)))))

It seems that LoopVectorization does not support functions with kwargs, is that it?

Looks like you can get around this with a helper function, although there may be a better way.

julia> function add(z; kw=1)
           s = zero(eltype(kw))
           g(x) = f(x; kw = kw)
           @turbo for x in 1:z
               s += g(x)
           end
           return s
       end
add (generic function with 1 method)

julia> add(10)
65

Amusingly, the compiler is able to optimize the loop away entirely if you remove the @turbo:

julia> @btime add2(10^8) # sans @turbo
  1.100 ns (0 allocations: 0 bytes)
5000000150000000
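
(add2 is not defined above; presumably it is just add with the @turbo removed, something along these lines:)

# Reconstruction of the presumed add2: the same accumulation loop without
# @turbo, which the compiler can fold into a closed-form expression.
function add2(z; kw=1)
    s = zero(eltype(kw))
    for x in 1:z
        s += f(x; kw=kw)
    end
    return s
end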

Yes, that was just a toy example, in my real application @turbo will probably speed things up.


I’m getting a weird error:

ERROR: ArgumentError: invalid index: VectorizationBase.MM{4, 1, Int64}<1, 2, 3, 4> of type VectorizationBase.MM{4, 1, Int64}

This is a way to reproduce it:

using BenchmarkTools
using ComponentArrays
using UnPack
using LoopVectorization

const szK=5

θ = ComponentArray(
    α0 = rand(2, 2, 5),
    α = rand(3),
    αh = rand(2, 2),
    αga = rand(2, 46)
)

data = ComponentArray(
    ww = rand(2, 9, 2, 2, 5, 44),
    P_1 = rand(5, 44),
    ē = rand(5, 44)
)

@views function Uk(g, a, e, k, t; data=data, θ=θ)
    # params
    @unpack α0, α = θ
    # data
    @unpack ww, P_1, ē = data
    w = ww[g, a, e, 1, k, t]
    α0[g, e, k] + α[1] * w + α[2] * (e - ē[k, t])^2 + α[3] * P_1[k, t]
end

@views function α̃0(g, a, e, t; data=data, θ=θ)
    @unpack αh, αga = θ
    αh[1, g] + αh[2, g] * t + αga[g, a]
end

function U0(g, a, e, t; data=data, θ=θ)
    α̃0(g, a, e, t; data=data, θ=θ)
end

function Probk(g, a, e, k, t; data=data, θ=θ)
    u00 = U0(g, a, e, t; data=data, θ=θ)
    uk = ( k == 6 ? u00 : Uk(g, a, e, k, t; data=data, θ=θ) )
    hlp(g, a, e, kk, t) = Uk(g, a, e, kk, t; data=data, θ=θ)
    s = zero(eltype(θ))
    @turbo for kk=1:szK
        s += exp(hlp(g, a, e, kk, t) - uk)
    end
    inv(s + exp(u00 - uk))
end

Call Probk(1, 5, 2, 2, 1; data=data, θ=θ) to see the error.

That can be boiled down to

julia> using LoopVectorization

julia> using LoopVectorization: VectorizationBase

julia> ww = rand(2, 9, 2, 2, 5, 44);

julia> kk = VectorizationBase.Vec{4, Int64}(1, 2, 3, 4)
Vec{4, Int64}<1, 2, 3, 4>

julia> ww[1, 1, 1, 1, kk, 1]
ERROR: ArgumentError: invalid index: Vec{4, Int64}<1, 2, 3, 4> of type Vec{4, Int64}
Stacktrace:
 [1] to_index(i::Vec{4, Int64})
   @ Base .\indices.jl:300
 [2] to_index(A::Array{Float64, 6}, i::Vec{4, Int64})
   @ Base .\indices.jl:277
 [3] to_indices (repeats 5 times)
   @ .\indices.jl:333 [inlined]
 [4] to_indices(A::Array{Float64, 6}, I::Tuple{Int64, Int64, Int64, Int64, Vec{4, Int64}, Int64})
   @ Base .\indices.jl:324
 [5] getindex(::Array{Float64, 6}, ::Int64, ::Int64, ::Int64, ::Int64, ::Vararg{Any})
   @ Base .\abstractarray.jl:1218

LoopVectorization has a hard time propagating the getindex call within a wrapped function. I was able to get around it by hand-inlining the function call, but @Elrod may have a better solution:

julia> function Probk(g, a, e, k, t; data=data, θ=θ)
                  u00 = U0(g, a, e, t; data=data, θ=θ)
                  uk = ( k == 6 ? u00 : Uk(g, a, e, k, t; data=data, θ=θ) )
                  hlp(g, a, e, kk, t) = Uk(g, a, e, kk, t; data=data, θ=θ)
                  s = zero(eltype(θ))
                  @turbo for kk=1:szK
                      w = data.ww[g, a, e, 1, kk, t]
                      Uk = θ.α0[g, e, k] + θ.α[1] * w + θ.α[2] * (e - data.ē[k, t])^2 + θ.α[3] * data.P_1[k, t]
                      s += exp(Uk - uk)
                  end
                  inv(s + exp(u00 - uk))
              end
Probk (generic function with 1 method)

julia> Probk(1, 5, 2, 2, 1; data=data, θ=θ)
0.19252954882136444

I could define getindex methods to make it “work”, but the fundamental problem is that the current version of LoopVectorization cannot inspect what is going on inside your user-defined functions, and thus it is not able to reason about them, or to make anything but limited decisions with respect to them.

In this case that would probably be okay, since there weren't many decisions to be made, so it would most likely end up making the same ones regardless.
But imagine you had

foo(i, k, data) = data[i, k]

s = 0.0
@turbo for kk in 1:szK, ii in 1:szI
    s += foo(ii, kk, data)
end

how is it supposed to know whether it is better to SIMD the ii loop or the kk loop?
The plan is to eventually fix this by allowing LV to look inside and figure out what’s going on.

If you do want to hack it into working, we could define those getindex methods, with the caveat that performance can be suboptimal, e.g. in the above example with more than one loop.
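
For concreteness, here is a very rough sketch of the kind of getindex hack that would mean for the 6-index array above. It just gathers lane-by-lane with scalar loads, assuming Vec supports lane extraction via call syntax v(w); similar methods would be needed for MM indices. This is not something LoopVectorization actually defines for you:

using LoopVectorization.VectorizationBase: Vec

# Hypothetical fallback: index a 6-dimensional array with a Vec in the fifth
# slot by doing W scalar loads and repacking the results into a Vec.
function Base.getindex(A::AbstractArray, i1::Integer, i2::Integer, i3::Integer,
                       i4::Integer, v::Vec{W,<:Integer}, i6::Integer) where {W}
    Vec(ntuple(w -> A[i1, i2, i3, i4, v(w), i6], Val(W))...)
end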

Additionally, note that any code using AbstractSIMD types like Vec or MM is unlikely to be inlined by the compiler. Julia thinks these operations are something like 40x (EDIT: actually 10x) more expensive than they really are, so it ends up making very bad inlining decisions, forcing you to @inline everything yourself.
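
Concretely, if you go the hack route above, that means writing e.g. (reusing the definitions from the Probk example):

# Force-inline the wrapper so the compiler's inflated cost estimate for the
# SIMD intrinsics reached through Uk/getindex doesn't block inlining:
@inline hlp(g, a, e, kk, t) = Uk(g, a, e, kk, t; data=data, θ=θ)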


Is there a Julia issue filed about this? A simple MWE might make it fairly straightforward to add an appropriate cost, unless there’s something that requires a whole different model. But it seems much better to see if we can just fix that. See https://github.com/JuliaLang/julia/blob/c8cc1b56c15388ebb2d206dfb71d764fcf049ba3/base/compiler/tfuncs.jl#L49-L59


Oops, seems I misremembered:

add_tfunc(Core.Intrinsics.llvmcall, 3, INT_INF,
          (@nospecialize(fptr), @nospecialize(rt), @nospecialize(at), a...) -> instanceof_tfunc(rt)[1], 10)

For some reason I thought the cost was 40, but it seems to be 10.
Would the suggestion be to reduce the cost of llvmcall, or could there be some way to define costs on a per-llvmcall basis?

Also, mul_float being 4 and muladd_float being 5 while add_float is 1 seems excessive. Note that on Haswell, FMA was actually faster than add_float.
But it’s worth asking whether these costs should represent reciprocal throughput or latency. Currently, they look more like latency.
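
For reference, the neighbouring cost entries in base/compiler/tfuncs.jl look roughly like this (approximate form; the exact lines depend on the Julia version):

# Intrinsic cost table entries: name, min/max arg count, tfunc, cost
add_tfunc(add_float,    2, 2, math_tfunc, 1)
add_tfunc(mul_float,    2, 2, math_tfunc, 4)
add_tfunc(muladd_float, 3, 3, math_tfunc, 5)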


The numbers were mostly picked out of thin air (more precisely, from relatively generic costs at http://ithare.com/wp-content/uploads/part101_infographics_v08.png, which may be increasingly outdated). Updates that improve precision are welcome.

With regard to latency vs. throughput, I am not sure (I’d defer to your judgement). If we keep the current model, ideally the best approach might be to tune the costs against benchmarks to maximize performance, but that’s a pretty challenging road.


Sorry, I forgot to respond to that. Defining costs on a per-call basis does seem ideal, but it’s been long enough since I worked on that code that I don’t remember many details. It seems likely that we’d have to introduce either a “sentinel cost” (which, when encountered, invokes some other pathway) or generalize T_FFUNC_COST to be a Vector{Union{Int,Function}} or something. Again, I haven’t really invested any time looking at this yet, so take this for what little it’s worth.
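
To illustrate the second option (purely hypothetical, not how base/compiler is written today), the idea would be something like:

# Hypothetical sketch: a cost table entry is either a flat Int or a function
# of the argument types, and the cost model dispatches on which one it finds.
const T_FFUNC_COST = Vector{Union{Int,Function}}()

function builtin_cost(fidx::Int, argtypes)   # hypothetical helper name
    c = T_FFUNC_COST[fidx]
    return c isa Function ? c(argtypes) : c
end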