Too many allocations?

I’m thinking about optimizing performance and my mind initially goes to reducing the number of allocations. Below is a function with a loop that executes ~ 1,000,000 times.
When I “time” the routine, I find that it executes ~9 million reallocations. I just have no idea whether this is a lot and I should try to reduce it or whether this is about right.

function RK2(pend::pendulum;poincare::Bool= false)
    r = [pend.thetaInitial,pend.omegaInitial]
    t = 0.0
    N = round(Integer,pend.tMax/pend.dt)
    M = round(Integer,pend.omegaD * pend.tMax/(2π))
    theta = zeros(Float64,N+1,1)
    omega = zeros(Float64,N+1,1)
    saveTheta = zeros(Float64,M,1)
    saveOmega = zeros(Float64,M,1)
    omega = zeros(Float64,N+1,1)
    
    time = zeros(Float64,N+1,1)
    theta[1] = pend.thetaInitial
    omega[1] = pend.omegaInitial
    time[1] = 0.0
    counter = 2
    index = 1
    for i=1:N
        k1 =  pend.dt .* getDerivs(pend,r,t)  
        # test = getDerivs(pend,r,t + 1/2 * pend.dt)
        k2 = pend.dt .* getDerivs(pend,r .+ 0.5 .* k1,t + 1/2 * pend.dt)
        
        r += k2
        if r[1] > π && pend.resets
            r[1] -= 2π
        elseif r[1] < -π && pend.resets
            r[1] += 2π
        end

        if inPhase(pend,t)
            saveTheta[index] = r[1]
            saveOmega[index] = r[2]
            index += 1
        end
        t+= pend.dt
        theta[counter] = r[1]
        omega[counter] = r[2]
        time[counter] = t
        counter += 1
    end
    if poincare
        saveTheta,saveOmega
    else
        time,theta,omega  
    end
end

You need to give runnable code, otherwise it is impossible to say.

1 Like
mutable struct pendulum
    thetaInitial::Float64
    omegaInitial::Float64
    dt::Float64
    tMax::Float64
    l::Float64
    q::Float64
    Fd::Float64
    omegaD::Float64
    resets::Bool
    offset::Float64
    g::Float64
    
    function pendulum(thetaInitial::Float64, omegaInitial::Float64, dt::Float64,tMax::Float64,l::Float64,q::Float64,Fd::Float64,omegaD::Float64;resets::Bool=true,offset::Float64 = 0.0,g::Float64=9.8)
        #Only here to handle the defaults

        new(thetaInitial,omegaInitial,dt,tMax,l,q,Fd,omegaD,resets,offset,g)
    end
end

function getDerivs(pend::pendulum, vec::Array{Float64,1},t::Float64)

    dOmega = -pend.g/pend.l * sin(vec[1]) - pend.q * vec[2] + pend.Fd * sin(pend.omegaD * t)

    return [vec[2],dOmega]

end


function inPhase(pend::pendulum,t::Float64)
    if abs(pend.omegaD * (t - pend.offset)/( 2π ) - round(Integer,pend.omegaD * (t - pend.offset)/(2π) ) ) < pend.dt * pend.omegaD/(4π)
        return true
    else
        return false
    end

end
function RK4(pend::pendulum;poincare::Bool= false)
    r = [pend.thetaInitial,pend.omegaInitial]
    t = 0.0
    N = round(Integer,pend.tMax/pend.dt)
    M = round(Integer,pend.omegaD * pend.tMax/(2π))
    theta = zeros(Float64,N+1,1)
    omega = zeros(Float64,N+1,1)
    saveTheta = zeros(Float64,M,1)
    saveOmega = zeros(Float64,M,1)
    omega = zeros(Float64,N+1,1)
    
    time = zeros(Float64,N+1,1)
    theta[1] = pend.thetaInitial
    omega[1] = pend.omegaInitial
    time[1] = 0.0
    counter = 2
    index = 1
    for i=1:N
        k1 =  pend.dt .* getDerivs(pend,r,t)  
        # test = getDerivs(pend,r,t + 1/2 * pend.dt)
        k2 = pend.dt .* getDerivs(pend,r .+ 0.5 .* k1,t + 1/2 * pend.dt)
        
        r += k2
        if r[1] > π && pend.resets
            r[1] -= 2π
        elseif r[1] < -π && pend.resets
            r[1] += 2π
        end

        if inPhase(pend,t)
            saveTheta[index] = r[1]
            saveOmega[index] = r[2]
            index += 1
        end
        t+= pend.dt
        theta[counter] = r[1]
        omega[counter] = r[2]
        time[counter] = t
        counter += 1
    end
    if poincare
        saveTheta,saveOmega
    else
        time,theta,omega  
    end
end

using Profile
using Plots
plotly()
    
thetaInitial = 0.2
omegaInitial = 0.0
dt = 0.001
tMax = 16000.0
l = 9.8
q = 0.5
Fd = 1.2
omegaD = Float64(2/3)
resets = false
offset = 0.0
myPendulum = pendulum(thetaInitial,omegaInitial,dt,tMax,l,q,Fd,omegaD;resets=true)
@time saveTheta,saveOmega = RK4(myPendulum,poincare=true)

scatter(saveTheta,saveOmega,ms=1)
#mySecondPendulum = pendulum(0.202,omegaInitial,dt,tMax,l,q,Fd,omegaD,resets=true)
#timeTwo,thetaTwo = RK4(mySecondPendulum)
#plot!(timeOne,thetaTwo)

#plot!(timeOne,thetaTwo,yaxis=:log,ylim = (1e-6,10))
#Profile.print()

This is my first attempt at a Julia code. If you see anything that is bad practice or have any suggestions, I welcome them.

Thanks

This stands out to me as an easy thing to improve. You’re returning two values (which is fine), but you’re returning them by constructing an Array to hold those two elements. Constructing a new Array is rather expensive (if you do it a million times, at least), and that’s one of your sources of frequent allocations.

Instead, you can return a Tuple, which is basically free to construct. That would look like:

return vec[2], dOmega

or equivalently, (vec[2], dOmega) because the parentheses are optional here

5 Likes

Thank you. Great Idea. Can I use a tuple to do vector math upon returning from the function or will I have to do it component by component.

I should mention that this code is already much faster than it’s python counterpart. I’m just looking to build intuition for optimizing Julia code. Thanks in advance.

You can probably preallocate and re-use k1 and k2. .= assigns in place, = make a new array which would allocate. Also omegaD = Float64(2/3) the Float64 is uneccesary as 2/3 already returns a Float64.

3 Likes

Good Idea. I had that thought but I neglected the .= so I didn’t get an improvement.

You can broadcast over a tuple (e.g. you can do x .+ 1 if x is a tuple), but if you want to do anything more vector-like, then check out GitHub - JuliaArrays/StaticArrays.jl: Statically sized arrays for Julia . An SVector from StaticArrays is as cheap as a tuple (it actually is a tuple under the hood) but acts like a proper vector.

3 Likes

You guys are all awesome. I preallocated k1 and k2 and returned a tuple instead of an array. This reduced the number of allocations from 9 M to about 1 M and sped the code up by a factor of ~5. The code was already crazy fast and now it’s ridiculous fast. Thanks a ton for your quick and helpful replies.

4 Likes

You could use Parameters.jl to initialize the structure instead of the inner constructor (which, in that case, can be also an outer constructor, that is more common). And the name of the struct is recommended to be in upper case (struct Pendulum):

julia> using Parameters

julia> @with_kw mutable struct A
          x :: Float64
          b :: Bool = false
       end
A

julia> a = A(x=1.0)
A
  x: Float64 1.0
  b: Bool false



From your code I could not see why your Pendulum has to be mutable. Immutable structus will be faster, in general.

Also: Array{Float64,1} is the same as Vector{Float64} (but you probably want just AbstractVector in the function signature. That will allow your function to receive views and vectors of other types of numbers (like Float32).

This line probably allocates:

should better be r .+= k2 (with the dot).

Also, are you sure you want theta = zeros(Float64,N+1,1) instead of zeros(N+1)? which is a vector, not a two-dimensional array in which the second dimension has only one element?

2 Likes

Great suggestions. Thank you!. I’m not familiar with Parameters.jl so I’ll read up on it. Inner constructors are not typical? Is it just a question of style or are there efficiency reasons?

Generally they are used when you need to impose constraints for the parameters, and/or if the constructor is self-referential (in which case an outer constructor would loop until segfault). This is very well written here:

https://docs.julialang.org/en/v1/manual/constructors/

2 Likes

Final question on this thread:

What mental routine do you go through for determining how many allocations is appropriate and expected. (i.e. When to stop optimizing and just let it go.) In this case, I’m executing a loop 1 M times and I’m getting ~ 1 M reallocations. Seems reasonable???

If you do these changes, you reduce even further the allocations:

    rtmp = zeros(2)  # preallocate a temporary vector here
    for i=1:N
        k1 = pend.dt .* getDerivs(pend,r,t)
        rtmp .= r .+ 0.5 .* k1 # compute it here
        k2 = pend.dt .* getDerivs(pend,rtmp,t + 1/2 * pend.dt)
        r .+= k2 # add the dot

julia> @time saveTheta,saveOmega = RK4(myPendulum,poincare=true)
  0.977168 seconds (19 allocations: 488.308 MiB, 4.39% gc time)

(now the loop does not allocate anything).

Some tips here on how to find the allocations: Disabling allocations

My impression here is: loops that perform critical tasks should only allocate if one is clearly aware of the reason of the allocation. Otherwise it is a good idea to search where the allocation is and try to solve it. Of course, if the execution time is still a problem at all.

3 Likes

It can be hard to have such a mental model for allocations if a function is complicated and does many things.

My strategy is to split code into functions (where it makes sense) and profile the program to see where time is spent. Then, I can focus on one function at a time and manually look for what operations might be allocating or generally inefficient. Most of the time, those allocations can be minimized or eliminated, e.g. by passing in preallocated arrays and using in-place operations.

1 Like

This seems like a classic case where you should be using StaticArrays: a small array whose size (2) is a compile-time constant. This would also eliminate heap allocations, and would probably be faster.

In general, my feeling is that if your innermost computational loop has any allocations proportional to the number of loop iterations, you are doing something “wrong”.

8 Likes

The above probably doesn’t hurt performance much, but the pattern is quite strange:

function compare(a, b)
    if a > b
        return true
    else
        return false
    end
end

can be replaced with

function compare(a, b)
    return a > b
end

because a > b already is a boolean value. You’re basically writing “if x is true, return true, else return false”, instead of just returning “x” in the first place.

Another style thing: round(Integer, x) should probably be round(Int, x). It doesn’t seem to hurt, it’s just something I’ve never seen before, as Integer is an abstract type.

3 Likes

Shouldn’t this return an error (as confusing as it could be for a new user)?

Why? I think it makes sense as it is:

julia> f(x) = typeof(round(Integer, x))
f (generic function with 1 method)

julia> f(1.1)
Int64

julia> f(big(1.1))
BigInt

2 Likes