Undesired memory allocation during interpolation

What about

getindex.((Xdata,), iW)

then?

Works, but is slower.

# Xc .= wW .* Xdata[iW] .+ (1.0 .- wW) .* getindex.((Xdata,), iW .+ 1)

julia> main() # tuple trick
  8.547 ms (0 allocations: 0 bytes)

julia> main() # Ref broadcasting
  8.266 ms (1 allocation: 16 bytes)

I mean, it’s the de-facto standard (also mentioned in ?Ref) to use Ref as the wrapper in broadcasting, not tuples. I don’t know why, but there seem to be some (performance) differences.

And one allocation of 16 bytes is really negligible in the context of the operations here. We’re talking about operations processing megabytes of data.
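
For reference, the idiom from ?Ref looks like this (a toy example, not from the original code):

julia> getindex.(Ref([10, 20, 30]), [1, 3])   # Ref shields the array from broadcasting
2-element Vector{Int64}:
 10
 30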

Avoiding all allocations is critical in some cases. I must admit I am stunned that Ref allocates and yet is the idiomatic wrapper for shielding an argument from broadcasting. That 16 bytes “isn’t much” is unpersuasive.

Yeah, I agree.
In this specific case you could create the Ref wrapper outside of the hot loop, but that’s probably not elegant.
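
A minimal sketch of that hoisting, reusing the thread’s variable names (the surrounding time loop and the names nt and it are assumed for illustration):

Xref = Ref(Xdata)   # the 16-byte wrapper is allocated once, up front
for it = 1:nt       # hot loop; nt and it are illustrative names
    Xc .= wW .* getindex.(Xref, iW) .+ (1.0 .- wW) .* getindex.(Xref, iW .+ 1)
end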

Maybe a broadcast expert such as @mbauman can provide more details?

It would be more efficient (in both time and memory) to just write a loop for this, to avoid three passes over the arrays, and the code will be simpler too. (Note also that the slices Pdata[iW], Xdata[iW], and Xdata[iW.+1] are making copies. That will go away if you write a loop.) You don’t need to bend over backwards to vectorize all of your code in Julia!

That is, try something like:

for i = 1:length(Pc)
    iW = trunc(Int, (Pc[i] - Pmin) / dP) + 1    # index of the left grid point
    wW = 1 - (Pc[i] - Pdata[iW]) / dP           # linear interpolation weight
    Xc[i] = wW * Xdata[iW] + (1 - wW) * Xdata[iW + 1]
end

(A similar strategy applies more globally to your code as well. Why are you doing “vectorized” interpolation at all? Possibly the interpolation can be combined with another step in your code, eliminating the need for the Pc and Xc arrays in the first place.)

The whole strategy of splitting up a long computation into a sequence of small vectorized steps, each of which does a small operation in a loop and produces a new array, is a suboptimal anti-pattern carried over from languages (Matlab, Python, R) where loops are slow.
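
A minimal contrast, with placeholder arrays a, b, c and a preallocated out:

tmp = a .* b        # vectorized step 1: allocates a temporary, one pass over the data
out = tmp .+ c      # vectorized step 2: another allocation, another pass

out .= a .* b .+ c  # fused dots: one pass, no temporary (an explicit loop compiles similarly)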

PS. There is no need to write constants as 1.0; Julia will automatically promote mixed-precision operations. In fact, it is better to write 1 - x than 1.0 - x: the latter promotes the result to at least double precision, whereas the former uses the precision of x (so you can run the calculation in either single or double precision depending on the type of the input).
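
A quick REPL illustration of that point (the value here is assumed):

julia> x = 0.25f0;        # a Float32 value

julia> typeof(1 - x)      # the integer literal adopts the precision of x
Float32

julia> typeof(1.0 - x)    # the Float64 literal forces promotion
Float64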

Yes, in the end I did use a loop, since it did not require additional allocation. Unfortunately, I got used to writing either long production C loop/if code or short, concise vectorised MATLAB prototypes. Julia suddenly puts me in the middle, and I indeed need to reconsider these habits…

Ref is mutable whereas tuples are immutable; sometimes that can make it hard for Julia to avoid allocating (and yes, 16 bytes is what you’d see for a Ref allocating). Tuples aren’t zero-dimensional, though, so they can act less “scalar-like” than Ref in some situations.
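
A small illustration of that dimensionality difference:

julia> Ref(1) .+ 1   # zero-dimensional: broadcast unwraps the result to a scalar
2

julia> (1,) .+ 1     # one-dimensional: broadcast returns a 1-tuple
(2,)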

Just as a benchmark addition: the loop version is a lot faster:

function Itp1D_for( Xc, Pc, iW, wW, Xdata, Pdata, dP, Pmin )
    for i = 1:length(Pc)
        iW = trunc(Int, (Pc[i] - Pmin) / dP) + 1
        wW = 1 - (Pc[i] - Pdata[iW]) / dP
        Xc[i] = wW * Xdata[iW] + (1 - wW) * Xdata[iW + 1]
    end
end


julia> main() # for loop
  5.749 ms (0 allocations: 0 bytes)

julia> main() # Ref broadcasting
  8.266 ms (1 allocation: 16 bytes)

Shouldn’t this rather be

div(Pc[i] - Pmin, dP) 

And this is the kind of stuff where the OP can add a @turbo:

using LoopVectorization
function Itp1D( Xc, Pc, iW, wW, Xdata, Pdata, dP, Pmin )
    @turbo for i = 1:length(Pc)
        iW = trunc(Int, (Pc[i] - Pmin) / dP) + 1
        wW = 1 - (Pc[i] - Pdata[iW]) / dP
        Xc[i] = wW * Xdata[iW] + (1 - wW) * Xdata[iW + 1]
    end
end

julia> @btime main_noturbo()
  237.058 ms (16 allocations: 45.81 MiB)

julia> @btime main_turbo()
  81.057 ms (16 allocations: 45.81 MiB)

(I’m running the OP’s complete original code here.)

Now do @tturbo :grin:
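
(For readers following along: @tturbo is LoopVectorization’s threaded variant of @turbo, so presumably the only change to the function above is the macro. A sketch, with a hypothetical name:)

using LoopVectorization
function Itp1D_tturbo( Xc, Pc, iW, wW, Xdata, Pdata, dP, Pmin )  # name assumed here
    @tturbo for i = 1:length(Pc)   # same loop as @turbo, but also threaded
        iW = trunc(Int, (Pc[i] - Pmin) / dP) + 1
        wW = 1 - (Pc[i] - Pdata[iW]) / dP
        Xc[i] = wW * Xdata[iW] + (1 - wW) * Xdata[iW + 1]
    end
end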

julia> Threads.nthreads()
4

julia> @btime main()
  37.210 ms (16 allocations: 45.81 MiB)

Excellent! Should one worry about:

16 allocations: 45.81 MiB

I mean, would it add up if one calls the function many more times?
Why is there any allocation there at all, by the way?

There I benchmarked the complete code, including all the data definitions, so that is where those allocations come from.
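
One way to confirm that the kernel itself stays allocation-free is to time it in isolation, interpolating the arguments with $ so that globals don’t inflate the numbers (a sketch, reusing the signature from above):

using BenchmarkTools
@btime Itp1D_for($Xc, $Pc, $iW, $wW, $Xdata, $Pdata, $dP, $Pmin)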
