In all cases, it is not clear if you know the size of xnodes
at compile time. We are all assuming it is known to be 4
. If we remove that restriction, by passing that size as a parameter, the allocations come back, even with the MVector version.
One solution is to introduce a function barrier on the loop:
using StaticArrays
function Interp1d_MVector( xnodes, fvals, t, ::Val{n} ) where n
barw = zeros(MVector{n,Float64})
for j ∈ 1:n
barw[j] = 1.0
for k ∈ 1:n
if k != j
barw[j] = barw[j] * (xnodes[j] - xnodes[k])
end
end
barw[j] = 1.0/barw[j]
end
numt, denomt = 0, 0
@inbounds for j ∈ eachindex(xnodes)
tdiff = t - xnodes[j]
numt = numt + barw[j] / tdiff * fvals[j]
denomt = denomt + barw[j] / tdiff
if ( abs(tdiff) < 1e-14 )
numt = fvals[j]
denomt = 1.0
break
end
end
return numt / denomt
end
function the_loop(xnodes, fvals, t, n)
for i ∈ 1:100000
u = Interp1d_MVector(xnodes, fvals, t, n)
end
end
function run_Interp1d_MVector(N)
xnodes = LinRange(0,1,N)
fvals = xnodes.^4
t = 0.3
n = Val(length(xnodes))
the_loop(xnodes, fvals, t, n)
end
which gives:
julia> @btime run_Interp1d_MVector(4)
3.586 ms (2 allocations: 144 bytes)
We can make that faster by removing some small type instability (you had initialized numt
and denomt
as integers), and also by making xnodes
a static array:
using StaticArrays
function Interp1d_MVector2(xnodes::SVector{N,T}, fvals::SVector{N,T}, t) where {N,T}
barw = zeros(MVector{N,T})
for j ∈ 1:N
barw[j] = one(T)
for k ∈ 1:N
if k != j
barw[j] = barw[j] * (xnodes[j] - xnodes[k])
end
end
barw[j] = inv(barw[j])
end
numt, denomt = zero(T), zero(T)
@inbounds for j ∈ eachindex(xnodes)
tdiff = t - xnodes[j]
numt = numt + barw[j] / tdiff * fvals[j]
denomt = denomt + barw[j] / tdiff
if ( abs(tdiff) < 1e-14 )
numt = fvals[j]
denomt = one(T)
break
end
end
return numt / denomt
end
function the_loop(xnodes, fvals, t)
for i ∈ 1:100000
u = Interp1d_MVector2(xnodes, fvals, t)
end
end
function run_Interp1d_MVector2(N)
xnodes = SVector{N,Float64}(LinRange(0,1,4))
fvals = xnodes.^4
t = 0.3
the_loop(xnodes, fvals, t)
end
which gives:
julia> @btime run_Interp1d_MVector2(4)
1.036 ms (16 allocations: 800 bytes)
Probably with some other tweaks proposed above, it can be even faster. But you won´t get that complete code to be completely non-allocating if one assumes that the length of xnodes
cannot be known at compile time. What is important is to isolate that from the hot loop.
edit: Seems you can use @turbo
elimnating the branch with ifelse
:
@turbo for j ∈ 1:N
barw[j] = one(T)
for k ∈ 1:N
barw[j] = ifelse(k == j , barw[j] , barw[j] * (xnodes[j] - xnodes[k]) )
end
barw[j] = inv(barw[j])
end
With which I get:
julia> @btime run_Interp1d_MVector2(4)
584.132 μs (16 allocations: 800 bytes)
(I’m not checking the results)
The whole run becomes non-allocating if N
is hard coded, but the time does not change much.