UndefVarError from LoopVectorization.@turbo

I’m having trouble using LoopVectorization.@turbo: it gives ERROR: UndefVarError: i1 not defined.

I suspect that the issue stems from the “strange” format of the weights and indices inputs. (The reason for this format is that I don’t know the lengths of the vectors in A in advance.)

Can I rewrite test1 so that @turbo works here?

using LoopVectorization

function test1(A)
    n1, n2 = size(A)
    n3 = 100

    # Generate nonsense inputs
    weights = Matrix{Vector{NTuple{2, Float64}}}(
        undef,
        n1,
        n2,
    )
    indices = Matrix{Vector{Int}}(
        undef,
        n1,
        n2,
    )
    nA = length(A[1])
    for ix in eachindex(weights)
        weights[ix] = Vector{NTuple{2, Float64}}(undef, n3)
        for i3 in eachindex(weights[ix])
            weights[ix][i3] = (rand(2)...,)
            weights[ix][i3] = weights[ix][i3] ./ sum(weights[ix][i3])
        end

        indices[ix] = sort(rand(1:nA - 1, n3))
    end

    # Computation
    B = zeros(n1, n2, n3)
    # This is the loop that throws ERROR: UndefVarError: i1 not defined
    @turbo for i3 in 1:n3
        for i2 in 1:n2
            for i1 in 1:n1
                for i4 in 0:1
                    B[i1, i2, i3] += weights[i1, i2][i3][1 + i4] * A[i1, i2][indices[i1, i2][i3] + i4]
                end
            end
        end
    end

    return B
end

n1 = 3
n2 = 7
A = Matrix{Vector{Float64}}(
    undef,
    n1,
    n2,
)
for ix in eachindex(A)
    A[ix] = sort(rand(60));
end

test1(A)
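
For reference, here is what I imagine a “flattened” version of the computation could look like, in case @turbo needs plain dense arrays rather than nested Vectors. This is only a sketch replacing the # Computation part of test1; the names Wflat, Iflat and Aflat are made up, and I have not checked whether @turbo accepts the indirect index into Aflat.

    # Sketch only: copy the nested containers into dense arrays so the hot loop
    # uses nothing but plain array indexing.
    Wflat = Array{Float64}(undef, 2, n3, n1, n2)
    Iflat = Array{Int}(undef, n3, n1, n2)
    Aflat = Array{Float64}(undef, nA, n1, n2)   # assumes all vectors in A have length nA
    for i2 in 1:n2, i1 in 1:n1
        Aflat[:, i1, i2] .= A[i1, i2]
        for i3 in 1:n3
            Iflat[i3, i1, i2] = indices[i1, i2][i3]
            Wflat[1, i3, i1, i2] = weights[i1, i2][i3][1]
            Wflat[2, i3, i1, i2] = weights[i1, i2][i3][2]
        end
    end

    B = zeros(n1, n2, n3)
    for i3 in 1:n3, i2 in 1:n2, i1 in 1:n1   # candidate loop nest for @turbo
        for i4 in 0:1
            B[i1, i2, i3] += Wflat[1 + i4, i3, i1, i2] * Aflat[Iflat[i3, i1, i2] + i4, i1, i2]
        end
    end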

I tend to agree. Assuming that you want to optimize the code, I also checked @tturbo (same problem) and @batch from Polyester (no speedup).
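
For concreteness, the @batch variant I have in mind is along these lines (a sketch only: thread the outermost loop with Polyester and leave the body as ordinary Julia, since @batch does not rewrite the indexing the way @turbo does):

    # Replace the @turbo loop in test1 with Polyester's @batch (needs `using Polyester`).
    # Each i3 iteration writes to its own slice of B, so threading over i3 is race-free.
    B = zeros(n1, n2, n3)
    @batch for i3 in 1:n3
        for i2 in 1:n2
            for i1 in 1:n1
                for i4 in 0:1
                    B[i1, i2, i3] += weights[i1, i2][i3][1 + i4] * A[i1, i2][indices[i1, i2][i3] + i4]
                end
            end
        end
    end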

A quick glance with a profiler revealed that most of the time is spent in the setup, not in the computation. In the setup you can fix at least two weaknesses (let’s wait and see what @DNF finds ;)

            # Allocates
            # weights[ix][i3] = (rand(2)...,)
            weights[ix][i3] = ntuple(_ -> rand(), 2)

and

        # Dynamic dispatch
        # indices[ix] = sort(rand(1:nA - 1, n3))
        indices[ix] = rand(1:nA - 1, n3)
        sort!(indices[ix])
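
Putting both changes into the setup loop of test1, it becomes:

    for ix in eachindex(weights)
        weights[ix] = Vector{NTuple{2, Float64}}(undef, n3)
        for i3 in eachindex(weights[ix])
            weights[ix][i3] = ntuple(_ -> rand(), 2)   # no temporary rand(2) vector
            weights[ix][i3] = weights[ix][i3] ./ sum(weights[ix][i3])
        end

        indices[ix] = rand(1:nA - 1, n3)               # concrete Vector{Int}
        sort!(indices[ix])                             # in-place, avoids the dispatch noted above
    end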

With these fixes, @btime drops from 400 μs to 80 μs, but the setup (specifically the sorting of indices) still dominates (hence no speedup from @batch).
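
For reference, the timings above were taken along these lines (a sketch with BenchmarkTools; absolute numbers will of course vary by machine):

    using BenchmarkTools

    n1, n2 = 3, 7
    A = Matrix{Vector{Float64}}(undef, n1, n2)
    for ix in eachindex(A)
        A[ix] = sort(rand(60))
    end

    @btime test1($A);   # ~400 μs with the original setup, ~80 μs with the two fixes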