Avoiding allocations of small but non-trivial arrays (work array alternative?)

Strange that it doesn’t work; I am only trying it on my phone (Arm), but it seems to run fine for me. Maybe try:

for k in eachindex(xnodes)
    x = Int(k == j)
    barw_j *= (xnodes[j] - xnodes[k]) + x
end

Note that when I tried it, I also used eachindex instead of 1:n but I’m not sure if it made a difference.

Or maybe

for k in eachindex(xnodes)
    x = ifelse(k == j, 1.0, 0.0)
    barw_j *= (xnodes[j] - xnodes[k]) + x
end
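
For context, here is a minimal self-contained sketch of how that branchless factor fits into the full weight computation (the function name and the outer loop over j are mine, not from the original code):

function barycentric_weights(xnodes)
    w = zeros(length(xnodes))
    for j in eachindex(xnodes)
        barw_j = 1.0
        for k in eachindex(xnodes)
            # when k == j the difference is zero, so adding the Bool (== 1) makes the factor 1.0
            barw_j *= (xnodes[j] - xnodes[k]) + (k == j)
        end
        w[j] = 1.0 / barw_j
    end
    return w
end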

As I mentioned upthread, you don’t need the Val type. Passing the length of an array as a separate argument does not look good, imo. If you know n statically, why are you not encoding it in the type of xnodes instead, as a static array?

I agree that passing the length of an array as a separate argument doesn’t look good. It’d be nice to find a way to get rid of the extra function argument; I just don’t know how to do that in my real-life scenario, where xnodes is actually a @view into a small subset of a large, standard Array. I’m not sure if I could re-cast that as a StaticArray…

Yeah, so this isn’t all that obvious, but I think the key is indexing into the large array with an SVector as index. Maybe there’s a more elegant solution, but this works for me:

function staticslice(x, start, ::Val{N}) where {N}
    # indexing with an SVector of indices returns an SVector, so no heap allocation
    ind = SVector{N}(start:start+N-1)
    return x[ind]
end

julia> x = rand(10_000);

julia> @btime staticslice($x, 200, Val(4))
  2.100 ns (0 allocations: 0 bytes)
4-element SVector{4, Float64} with indices SOneTo(4):
 0.521546493646345
 0.33167560079710856
 0.08923575401846218
 0.8255242560839948

This way you also get possible benefits of xnodes being an SVector.
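
For instance, a hypothetical usage sketch (interp1d here is just a placeholder for whatever interpolation routine consumes xnodes and fvals):

x = rand(10_000)
f = x.^4
xnodes = staticslice(x, 200, Val(4))   # SVector{4,Float64}, no allocation
fvals  = staticslice(f, 200, Val(4))
u = interp1d(xnodes, fvals, 0.3)       # placeholder interpolation call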


Moving to SVectors, switching a few things inside your function to reduce divisions, merging the two loops on j, moving the stopping condition earlier in the function, and a few other modifications gave me a 4x to 5x speedup:

using LoopVectorization
using StaticArrays
using BenchmarkTools

function Interp1d_no_array( xnodes, fvals, t )
    n = length(xnodes)
    numt, denomt = 0, 0

    @inbounds for j ∈ eachindex(xnodes)
        barw_j = 1.0
        for k ∈ 1:n
            if k != j
                barw_j *= (xnodes[j] - xnodes[k])
            end
        end
        barw_j = 1.0 / barw_j
        tdiff = t - xnodes[j]
        numt = numt + barw_j / tdiff * fvals[j]
        denomt = denomt + barw_j / tdiff
        if ( abs(tdiff) < 1e-14 )
            numt = fvals[j]
            denomt = 1.0
            break
        end
    end
    return numt / denomt
end
function run_Interp1d_no_array()
    xnodes = LinRange(0,1,4)
    fvals = xnodes.^4
    t = 0.3
    
    for i ∈ 1:100000
        u = Interp1d_no_array( xnodes, fvals, t )
    end
end

function Interp1d_MVector( xnodes, fvals, t, ::Val{n} ) where n 
    
    barw = zeros(MVector{n,Float64})
    for j ∈ 1:n
        barw[j] = 1.0
        for k ∈ 1:n
            if k != j
                barw[j] = barw[j] * (xnodes[j] - xnodes[k])
            end
        end
        barw[j] = 1.0/barw[j]
    end

    numt, denomt = 0, 0
    @inbounds for j ∈ eachindex(xnodes)
        tdiff = t - xnodes[j]
        numt = numt + barw[j] / tdiff * fvals[j]
        denomt = denomt + barw[j] / tdiff
        if ( abs(tdiff) < 1e-14 )
            numt = fvals[j]
            denomt = 1.0
            break
        end
    end
    return numt / denomt
end
function run_Interp1d_MVector()
    xnodes = LinRange(0,1,4)
    fvals = xnodes.^4
    t = 0.3
    
    n = Val(length(xnodes))
    for i ∈ 1:100000
        u = Interp1d_MVector( xnodes, fvals, t, n )
    end
end

function Interp1d_Hardcoded( xn, fvals, t )
    
    barw = SA[ 1.0/((xn[1]-xn[2])*(xn[1]-xn[3])*(xn[1]-xn[4])),
               1.0/((xn[2]-xn[1])*(xn[2]-xn[3])*(xn[2]-xn[4])),
               1.0/((xn[3]-xn[1])*(xn[3]-xn[2])*(xn[3]-xn[4])),
               1.0/((xn[4]-xn[1])*(xn[4]-xn[2])*(xn[4]-xn[3])) ]

    numt, denomt = 0, 0
    @inbounds for j ∈ eachindex(xn)
        tdiff = t - xn[j]
        numt = numt + barw[j] / tdiff * fvals[j]
        denomt = denomt + barw[j] / tdiff
        if ( abs(tdiff) < 1e-14 )
            numt = fvals[j]
            denomt = 1.0
            break
        end
    end
    return numt / denomt
end
function run_Interp1d_Hardcoded()
    xnodes = LinRange(0,1,4)
    fvals = xnodes.^4
    t = 0.3
    
    for i ∈ 1:100000
        u = Interp1d_Hardcoded( xnodes, fvals, t )
    end
end

function lrnv(xnodes, fvals, t)

    numt, denomt = 0.0, 0.0
    n = length(xnodes)

    @inbounds for j ∈ eachindex(xnodes)
        tdiff = t - xnodes[j]
        if (abs(tdiff) < 1e-14)
            return fvals[j]
        end
    end

    @inbounds for j ∈ eachindex(xnodes)
        
        xⱼ = xnodes[j]
        bⱼ = (t-xⱼ)
        for k ∈ 1:n
            if k != j
                bⱼ *= (xⱼ - xnodes[k])
            end
        end
        inv_bⱼ = 1 / bⱼ
        numt += inv_bⱼ * fvals[j]
        denomt += inv_bⱼ
    end
    return numt / denomt
end
function run_lrnv()
    xnodes = SVector{4}(LinRange(0,1,4))
    fvals = xnodes.^4
    t = 0.3
    u = 0.0
    for i ∈ 1:100000
        u += lrnv( xnodes,fvals, t)
    end
    return u
end

@btime run_Interp1d_no_array()  # 4.984 ms on my machine, 1 allocation
@btime run_Interp1d_MVector()   # 2.846 ms on my machine, 1 allocation
@btime run_Interp1d_Hardcoded() # 3.709 ms on my machine, 1 allocation
@btime run_lrnv()               # 0.728 ms on my machine, 0 allocation

I apologize for the @turbo non-working proposal; I was on my phone and did not run the code. I tried my best, but the k != j is a deal breaker for @turbo, sadly.

There is probably something smart to do to enable @simd or @turbo to work, but I did not find it.


In all cases, it is not clear if you know the size of xnodes at compile time. We are all assuming it is known to be 4. If we remove that restriction, by passing that size as a parameter, the allocations come back, even with the MVector version.

One solution is to introduce a function barrier on the loop:

using StaticArrays

function Interp1d_MVector( xnodes, fvals, t, ::Val{n} ) where n 
    
    barw = zeros(MVector{n,Float64})
    for j ∈ 1:n
        barw[j] = 1.0
        for k ∈ 1:n
            if k != j
                barw[j] = barw[j] * (xnodes[j] - xnodes[k])
            end
        end
        barw[j] = 1.0/barw[j]
    end

    numt, denomt = 0, 0
    @inbounds for j ∈ eachindex(xnodes)
        tdiff = t - xnodes[j]
        numt = numt + barw[j] / tdiff * fvals[j]
        denomt = denomt + barw[j] / tdiff
        if ( abs(tdiff) < 1e-14 )
            numt = fvals[j]
            denomt = 1.0
            break
        end
    end
    return numt / denomt
end

function the_loop(xnodes, fvals, t, n)
    for i ∈ 1:100000
        u = Interp1d_MVector(xnodes, fvals, t, n)
    end
end

function run_Interp1d_MVector(N)
    xnodes = LinRange(0,1,N)
    fvals = xnodes.^4
    t = 0.3
    n = Val(length(xnodes))
    the_loop(xnodes, fvals, t, n)
end

which gives:

julia> @btime run_Interp1d_MVector(4)
  3.586 ms (2 allocations: 144 bytes)

We can make that faster by removing a small type instability (numt and denomt had been initialized as integers), and also by making xnodes a static array:

using StaticArrays

function Interp1d_MVector2(xnodes::SVector{N,T}, fvals::SVector{N,T}, t) where {N,T}
    barw = zeros(MVector{N,T})
    for j ∈ 1:N
        barw[j] = one(T)
        for k ∈ 1:N
            if k != j
                barw[j] = barw[j] * (xnodes[j] - xnodes[k])
            end
        end
        barw[j] = inv(barw[j])
    end

    numt, denomt = zero(T), zero(T)
    @inbounds for j ∈ eachindex(xnodes)
        tdiff = t - xnodes[j]
        numt = numt + barw[j] / tdiff * fvals[j]
        denomt = denomt + barw[j] / tdiff
        if ( abs(tdiff) < 1e-14 )
            numt = fvals[j]
            denomt = one(T)
            break
        end
    end
    return numt / denomt
end

function the_loop(xnodes, fvals, t)
    for i ∈ 1:100000
        u = Interp1d_MVector2(xnodes, fvals, t)
    end
end

function run_Interp1d_MVector2(N)
    xnodes = SVector{N,Float64}(LinRange(0,1,N))
    fvals = xnodes.^4
    t = 0.3
    the_loop(xnodes, fvals, t)
end

which gives:

julia> @btime run_Interp1d_MVector2(4)
  1.036 ms (16 allocations: 800 bytes)

Probably with some other tweaks proposed above, it can be even faster. But you won’t get that complete code to be completely non-allocating if one assumes that the length of xnodes cannot be known at compile time. What is important is to isolate that from the hot loop.

edit: Seems you can use @turbo, eliminating the branch with ifelse:

    @turbo for j ∈ 1:N
        barw[j] = one(T)
        for k ∈ 1:N
            barw[j] = ifelse(k == j , barw[j] , barw[j] * (xnodes[j] - xnodes[k]) )
        end
        barw[j] = inv(barw[j])
    end

With which I get:

julia> @btime run_Interp1d_MVector2(4)
584.132 μs (16 allocations: 800 bytes)

(I’m not checking the results)

The whole run becomes non-allocating if N is hard coded, but the time does not change much.


Try replacing 1:N with eachindex(barw).
That should give LV some more information.

julia> @btime run_Interp1d_MVector2(4)
  515.025 μs (16 allocations: 800 bytes)

Thus, this is the fastest so far for me:

using StaticArrays
using LoopVectorization

function Interp1d_MVector2(xnodes::SVector{N,T}, fvals::SVector{N,T}, t) where {N,T}
    barw = zeros(MVector{N,T})
    @turbo for j ∈ eachindex(barw)
        barw[j] = one(T)
        for k ∈ 1:N
            barw[j] = ifelse(k == j , barw[j] , barw[j] * (xnodes[j] - xnodes[k]) )
        end
        barw[j] = inv(barw[j])
    end

    numt, denomt = zero(T), zero(T)
    for j ∈ eachindex(xnodes)
        tdiff = t - xnodes[j]
        numt = numt + barw[j] / tdiff * fvals[j]
        denomt = denomt + barw[j] / tdiff
        if ( abs(tdiff) < 1e-14 )
            numt = fvals[j]
            denomt = one(T)
            break
        end
    end
    return numt / denomt
end

function the_loop(xnodes, fvals, t)
    for i ∈ 1:100000
        u = Interp1d_MVector2(xnodes, fvals, t)
    end
end

function run_Interp1d_MVector2(N)
    xnodes = SVector{N,Float64}(LinRange(0,1,N))
    fvals = xnodes.^4
    t = 0.3
    the_loop(xnodes, fvals, t)
end

PS: replacing 1:N in the inner loop does not work; any particular reason for that?

You can replace the second, too.

Then I get:

ERROR: "Reduction not found."
Stacktrace:
  [1] reduce_to_onevecunroll

Curiously, if I actually return something from the loop (computing the sum of u in it), the function becomes zero-allocating:

using StaticArrays
using LoopVectorization

function Interp1d_MVector2(xnodes::SVector{N,T}, fvals::SVector{N,T}, t) where {N,T}
    barw = zeros(MVector{N,T})
    @turbo for j ∈ eachindex(barw)
        barw[j] = one(T)
        for k ∈ 1:N
            barw[j] = ifelse(k == j , barw[j] , barw[j] * (xnodes[j] - xnodes[k]) )
        end
        barw[j] = inv(barw[j])
    end

    numt, denomt = zero(T), zero(T)
    for j ∈ eachindex(xnodes)
        tdiff = t - xnodes[j]
        numt = numt + barw[j] / tdiff * fvals[j]
        denomt = denomt + barw[j] / tdiff
        if ( abs(tdiff) < 1e-14 )
            numt = fvals[j]
            denomt = one(T)
            break
        end
    end
    return numt / denomt
end

function the_loop(xnodes, fvals, t)
    u = zero(eltype(xnodes))
    for i ∈ 1:100000
        u += Interp1d_MVector2(xnodes, fvals, t)
    end
    return u
end

function run_Interp1d_MVector2(N)
    xnodes = SVector{N,Float64}(LinRange(0,1,N))
    fvals = xnodes.^4
    t = 0.3
    the_loop(xnodes, fvals, t)
end
julia> @btime run_Interp1d_MVector2(4)
  538.863 μs (0 allocations: 0 bytes)
7584.269662929191

This is also probably more realistic, otherwise the function does nothing overall.

Thanks for the ifelse, this is perfect. Moving the computation out of the ifelse statement ensures it is done in every case and allows @turbo to perform a little better:

using LoopVectorization
using StaticArrays
using BenchmarkTools

function v2(x,fvals, t) 
    @inbounds for j ∈ eachindex(x)
        tdiff = t - x[j]
        if abs(tdiff) < 1e-14
            return fvals[j]
        end
    end

    nₜ = 0.0
    dₜ = 0.0
    @turbo for j ∈ eachindex(x)
        xⱼ = x[j]
        bⱼ = t - xⱼ
        for k ∈ eachindex(x)
            rₖ = xⱼ - x[k]
            bⱼ = bⱼ * ifelse(k==j, 1.0, rₖ)
        end
        ibⱼ = inv(bⱼ)
        nₜ += ibⱼ * fvals[j]
        dₜ += ibⱼ
    end
    return nₜ / dₜ
end
function run_v2()
    xnodes = SVector{4,Float64}(LinRange(0,1,4))
    fvals = xnodes.^4
    t = 0.3
    u = 0.0
    for i ∈ 1:100000
        u += v2(xnodes,fvals, t)
    end
    return u
end

Giving:

@btime run_v2()               # 0.498 ms, 0 allocation

=( Feel free to file an issue, but I hope to have LoopModels working as a replacement by JuliaCon 2023, so I’m unlikely to spend any time on LoopVectorization.


Oh, wow, this is fast!

Casting a @view as a StaticArray ended up being completely straightforward (given that the size is known at compile-time)… I just do this:

xnodes = SVector{N}(@view(A[i:i+N-1]))

If I don’t do this, and just pass the @view, the function is still non-allocating (that was solved by unwrapping the vector barw, and the same holds for the MVector version), but there is a penalty of about 2x in performance.

function run_v2_view()
    A = LinRange(0,1,1_000_000)
    F = A.^4
    t = 0.3
    order = 6
    for i ∈ 1:10000
        xnodes = @view(A[i:i+order-1])
        fvals = @view(F[i:i+order-1])
        u = v2(xnodes,fvals, t)
    end
end

function run_v2_static_view()
    A = LinRange(0,1,1_000_000)
    F = A.^4
    t = 0.3
    order = 6
    for i ∈ 1:10000
        xnodes = SVector{order}(@view(A[i:i+order-1]))
        fvals = SVector{order}(@view(F[i:i+order-1]))
        u = v2(xnodes,fvals, t)
    end
end


@btime run_v2_view()
11.757 ms (2 allocations: 7.63 MiB)

@btime run_v2_static_view()
6.128 ms (2 allocations: 7.63 MiB)

(the allocations are outside of the hot loop, so no problem with that)

Hmmm, I don’t understand: I tested this in my real code, and these kinds of statements do seem to be triggering allocations… but on another level.

However, the new code with SIMD working and the original allocation problem solved seems to be the fastest by a good margin!

Ahh, this changes when using Threads:

function run_v2_view()
    A = LinRange(0,1,1_000_000)
    F = A.^4
    t = 0.3
    order = 6
    Threads.@threads for i ∈ 1:10000
        xnodes = @view(A[i:i+order-1])
        fvals = @view(F[i:i+order-1])
        u = v2(xnodes,fvals, t)
    end
end

function run_v2_static_view()
    A = LinRange(0,1,1_000_000)
    F = A.^4
    t = 0.3
    order = 6
    Threads.@threads for i ∈ 1:10000
        xnodes = SVector{order}(@view(A[i:i+order-1]))
        fvals = SVector{order}(@view(F[i:i+order-1]))
        u = v2(xnodes,fvals, t)
    end
end


@btime run_v2_view()
11.486 ms (51 allocations: 7.63 MiB)

@btime run_v2_static_view()
16.378 ms (240051 allocations: 19.84 MiB)

Looks like the multithreading is causing a type instability since order isn’t being constant-propagated across the task boundary. I’d rewrite it like this:

function run_v2_static_view(::Val{order}) where {order}
    A = LinRange(0,1,1_000_000)
    F = A.^4
    t = 0.3
    Threads.@threads for i ∈ 1:10000
        xnodes = SVector{order}(@view(A[i:i+order-1]))
        fvals = SVector{order}(@view(F[i:i+order-1]))
        u = v2(xnodes,fvals, t)
    end
end
julia> @btime run_v2_static_view(Val(6))
  3.371 ms (39 allocations: 7.63 MiB)

What happens if you do

function run_v2_static_index()
    A = LinRange(0,1,1_000_000)
    F = A.^4
    t = 0.3
    inds = SVector{6}(0:5)
    Threads.@threads for i ∈ 1:10000
        xnodes = A[inds.+i]
        fvals = F[inds.+i]
        u = v2(xnodes,fvals, t)
    end
end

?

That works too; however, it seems to be only slightly slower than the previous alternative.