In all cases, it is not clear if you know the size of xnodes at compile time. We are all assuming it is known to be 4. If we remove that restriction, by passing that size as a parameter, the allocations come back, even with the MVector version.
One solution is to introduce a function barrier on the loop:
using StaticArrays
function Interp1d_MVector( xnodes, fvals, t, ::Val{n} ) where n
barw = zeros(MVector{n,Float64})
for j ∈ 1:n
barw[j] = 1.0
for k ∈ 1:n
if k != j
barw[j] = barw[j] * (xnodes[j] - xnodes[k])
end
end
barw[j] = 1.0/barw[j]
end
numt, denomt = 0, 0
@inbounds for j ∈ eachindex(xnodes)
tdiff = t - xnodes[j]
numt = numt + barw[j] / tdiff * fvals[j]
denomt = denomt + barw[j] / tdiff
if ( abs(tdiff) < 1e-14 )
numt = fvals[j]
denomt = 1.0
break
end
end
return numt / denomt
end
function the_loop(xnodes, fvals, t, n)
for i ∈ 1:100000
u = Interp1d_MVector(xnodes, fvals, t, n)
end
end
function run_Interp1d_MVector(N)
xnodes = LinRange(0,1,N)
fvals = xnodes.^4
t = 0.3
n = Val(length(xnodes))
the_loop(xnodes, fvals, t, n)
end
which gives:
julia> @btime run_Interp1d_MVector(4)
3.586 ms (2 allocations: 144 bytes)
We can make that faster by removing some small type instability (you had initialized numt and denomt as integers), and also by making xnodes a static array:
using StaticArrays
function Interp1d_MVector2(xnodes::SVector{N,T}, fvals::SVector{N,T}, t) where {N,T}
barw = zeros(MVector{N,T})
for j ∈ 1:N
barw[j] = one(T)
for k ∈ 1:N
if k != j
barw[j] = barw[j] * (xnodes[j] - xnodes[k])
end
end
barw[j] = inv(barw[j])
end
numt, denomt = zero(T), zero(T)
@inbounds for j ∈ eachindex(xnodes)
tdiff = t - xnodes[j]
numt = numt + barw[j] / tdiff * fvals[j]
denomt = denomt + barw[j] / tdiff
if ( abs(tdiff) < 1e-14 )
numt = fvals[j]
denomt = one(T)
break
end
end
return numt / denomt
end
function the_loop(xnodes, fvals, t)
for i ∈ 1:100000
u = Interp1d_MVector2(xnodes, fvals, t)
end
end
function run_Interp1d_MVector2(N)
xnodes = SVector{N,Float64}(LinRange(0,1,4))
fvals = xnodes.^4
t = 0.3
the_loop(xnodes, fvals, t)
end
which gives:
julia> @btime run_Interp1d_MVector2(4)
1.036 ms (16 allocations: 800 bytes)
Probably with some other tweaks proposed above, it can be even faster. But you won´t get that complete code to be completely non-allocating if one assumes that the length of xnodes cannot be known at compile time. What is important is to isolate that from the hot loop.
edit: Seems you can use @turbo elimnating the branch with ifelse:
@turbo for j ∈ 1:N
barw[j] = one(T)
for k ∈ 1:N
barw[j] = ifelse(k == j , barw[j] , barw[j] * (xnodes[j] - xnodes[k]) )
end
barw[j] = inv(barw[j])
end
With which I get:
julia> @btime run_Interp1d_MVector2(4)
584.132 μs (16 allocations: 800 bytes)
(I’m not checking the results)
The whole run becomes non-allocating if N is hard coded, but the time does not change much.