What about getindex.((Xdata,), iW), then?
Works, but is slower.
# Xc .= wW .* Xdata[iW] .+ (1.0 .- wW) .* getindex.((Xdata, ), iW .+ 1)
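Presumably the "Ref broadcasting" variant benchmarked below is the same line with the tuple wrapper swapped for Ref (a sketch, mirroring the commented line above):
# Xc .= wW .* Xdata[iW] .+ (1.0 .- wW) .* getindex.(Ref(Xdata), iW .+ 1)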
julia> main() # tuple trick
8.547 ms (0 allocations: 0 bytes)
julia> main() # Ref broadcasting
8.266 ms (1 allocation: 16 bytes)
I mean, it’s the de-facto standard (also mentioned in ?Ref) to use Ref as a wrapper in broadcasting rather than tuples. I don’t know why, but there seem to be some (performance) differences.
And 1 allocation of 16 bytes is really negligible in the context of the operations here. We’re talking about operations processing megabytes of data.
Avoiding all allocations is critical in some cases. I must admit I am stunned that Ref allocates and yet is the idiomatic wrapper for shielding an argument from broadcasting. That 16 bytes “isn’t much” is unpersuasive.
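That allocation is just the heap box for the RefValue itself, which is easy to confirm at the REPL (the exact count can depend on Julia version and escape analysis):
julia> v = rand(10);

julia> @allocated Ref(v)
16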
Yeah, I agree.
In this specific case you could create the Ref wrapper outside of the hot loop, but that’s probably not elegant.
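A minimal sketch of that hoisting, assuming a hypothetical outer time-stepping loop (nsteps stands in for whatever drives the OP’s hot loop):
Xref = Ref(Xdata)   # pay the 16-byte Ref allocation once, outside the hot loop
for step = 1:nsteps
    Xc .= wW .* getindex.(Xref, iW) .+ (1 .- wW) .* getindex.(Xref, iW .+ 1)
end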
Maybe a broadcast expert such as @mbauman can provide more details?
It would be more efficient (both in time and memory) to just write a loop for this, to avoid 3 passes over the arrays, and the code will be simpler too. (Note also that the slices Pdata[iW], Xdata[iW], and Xdata[iW.+1] are making copies; that will go away if you write a loop.) You don’t need to bend over backwards in Julia to vectorize all of your code!
That is, try something like:
for i = 1:length(Pc)
    iW = trunc(Int, (Pc[i] - Pmin) / dP) + 1   # index of the grid cell containing Pc[i]
    wW = 1 - (Pc[i] - Pdata[iW]) / dP          # linear interpolation weight for the left node
    Xc[i] = wW * Xdata[iW] + (1 - wW) * Xdata[iW + 1]
end
(A similar strategy applies more globally to your code as well. Why are you doing “vectorized” interpolation at all? Possibly the interpolation can be combined with another step in your code, to eliminate the need for the Pc and Xc arrays in the first place.)
The whole strategy of splitting up a long computation into a sequence of small vectorized steps, each of which does a small operation in a loop and produces a new array, is a suboptimal anti-pattern from languages (Matlab, Python, R) where loops are slow.
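Schematically (with hypothetical a, b, x, z, not the OP’s code), the contrast looks like this:
# Anti-pattern: each vectorized step is a separate pass that allocates a temporary array
y = a .* x   # pass 1, new array
z = y .+ b   # pass 2, another new array

# Fused alternative: one pass over the data, writing into a preallocated z
# (a single fused broadcast, z .= a .* x .+ b, does the same single pass)
for i in eachindex(x)
    z[i] = a * x[i] + b
end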
PS. There is no need to write constants as 1.0; Julia will automatically promote mixed-precision operations. In fact, it is better to write 1 - x than 1.0 - x, because the latter automatically promotes the result to at least double precision, whereas the former will use the precision of x (e.g. so that you can run the calculation in either single or double precision depending on the type of the input).
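A quick REPL check of that promotion behavior:
julia> x = 1.5f0;       # Float32 input

julia> typeof(1 - x)    # the integer literal adopts the precision of x
Float32

julia> typeof(1.0 - x)  # the Float64 literal forces promotion to double
Float64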
Yes, in the end I did use a loop, since it did not require additional allocation. Unfortunately, I got used to writing either long production C loop/if code or short, concise vectorised MATLAB prototypes. Suddenly Julia makes me sit in the middle, and I indeed need to reconsider these ideas…
Ref is mutable whereas tuples are immutable; sometimes that can make it hard for Julia to avoid allocating (and yes, 16 bytes is what you’d see for a Ref allocating). Tuples aren’t zero-dimensional, though, so they can act less “scalar-like” than Ref in some situations.
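One quick way to see that dimensionality difference at the REPL (a toy example):
julia> size(Ref(1) .+ 1)   # a Ref broadcasts as a zero-dimensional container
()

julia> (1,) .+ 1           # a 1-tuple broadcasts like a length-1 collection and yields a tuple
(2,)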
Just as a benchmark addition, the plain for loop is a lot faster:
function Itp1D_for( Xc, Pc, iW, wW, Xdata, Pdata, dP, Pmin )
    # the iW and wW arguments are immediately overwritten in the loop below
    for i = 1:length(Pc)
        iW = trunc(Int, (Pc[i] - Pmin) / dP) + 1
        wW = 1 - (Pc[i] - Pdata[iW]) / dP
        Xc[i] = wW * Xdata[iW] + (1 - wW) * Xdata[iW + 1]
    end
end
julia> main() # for loop
5.749 ms (0 allocations: 0 bytes)
julia> main() # Ref broadcasting
8.266 ms (1 allocation: 16 bytes)
Shouldn’t this rather be div(Pc[i] - Pmin, dP)?
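One caveat worth a quick REPL check: div on two floats returns a float, so an Int conversion would still be needed before indexing:
julia> div(7.2, 2.0)   # truncated division, but the result is still a Float64
3.0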
And this is the kind of stuff where the OP can add a @turbo:
using LoopVectorization

function Itp1D( Xc, Pc, iW, wW, Xdata, Pdata, dP, Pmin )
    @turbo for i = 1:length(Pc)   # SIMD-vectorizes the loop
        iW = trunc(Int, (Pc[i] - Pmin) / dP) + 1
        wW = 1 - (Pc[i] - Pdata[iW]) / dP
        Xc[i] = wW * Xdata[iW] + (1 - wW) * Xdata[iW + 1]
    end
end
julia> @btime main_noturbo()
237.058 ms (16 allocations: 45.81 MiB)
julia> @btime main_turbo()
81.057 ms (16 allocations: 45.81 MiB)
(I’m running the OP’s complete original code here.)
Now do @tturbo
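Presumably the only change needed is swapping the macro for LoopVectorization’s threaded variant (a sketch, not the posters’ actual code):
@tturbo for i = 1:length(Pc)   # like @turbo, but additionally threads the loop
    iW = trunc(Int, (Pc[i] - Pmin) / dP) + 1
    wW = 1 - (Pc[i] - Pdata[iW]) / dP
    Xc[i] = wW * Xdata[iW] + (1 - wW) * Xdata[iW + 1]
end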
julia> Threads.nthreads()
4
julia> @btime main()
37.210 ms (16 allocations: 45.81 MiB)
Excellent! Should one worry about the “16 allocations: 45.81 MiB”? I mean, would it add up if one called the function many more times? Why is there any allocation there, by the way?
There I benchmarked the complete code, including all the data definitions, so that is where those allocations come from.