Trying to improve readability of an ODE function; attempt using views adds allocations

Hi, I’ve got the following model in which (so far) the odd elements are one type of quantity (currents, in this case) and the even elements are another (voltages), although that could change as the model is elaborated. Writing the equations in u[i] notation is harder to read than using more meaningful names like i[n] and v[n]. I recognize that @ode_def from ParameterizedFunctions.jl addresses that, but I’m using a for loop to create a large number of repeated sections (in this example, a lumped model of a transmission line), so I’m not sure I can use @ode_def. I tried using views, but that slowed things down and added a large number of allocations.

Here’s the ODE function as first written:

function lcs(du,u,p,t)
    L, C, R, RT = p
    n=1
    du[n] = 1/L*(V0(t)-u[n+1]-R*ssq(u[n]))        # di/dt
    for lump in 1:NUM_LUMPS
        n+=1
        du[n] = 1/C*(u[n-1]-u[n+1])               # dv/dt
        du[n+1] = 1/L*(u[n]-u[n+2]-R*ssq(u[n+1])) # di/dt
        n+=1
    end
    du[n+1] = 1/C*(u[n]-u[n+1]/RT)                # dv/dt
end

and here’s a version using views to allow for (slightly) more readable equations:

function lcs(du,u,p,t)
    di = view(du,1:2:length(du))
    i  = view(u, 1:2:length(u))  # odd indices are currents
    dv = view(du,2:2:length(du))
    v  = view(u, 2:2:length(u))  # even indices are voltages
    L, C, R, RT = p
    di[1] = 1/L*(V0(t)-v[1]-R*ssq(i[1]))     
    for n in 1:NUM_LUMPS
        dv[n] = 1/C*(i[n]-i[n+1])          
        di[n+1] = 1/L*(v[n]-v[n+1]-R*ssq(i[n+1])) 
    end
    n = NUM_LUMPS+1
    dv[n] = 1/C*(i[n]-v[n]/RT)               
end

The views version works (I appear to be getting the same results), but I go from 0.625387 seconds (3.71 M allocations: 184.470 MiB, 5.91% gc time) to 1.343471 seconds (35.14 M allocations: 671.296 MiB, 11.95% gc time).

I expected each view to create “a view into the parent array with the given indices instead of making a copy”, and did not expect additional allocations. Thanks for any insight.

You can also use embedded functions instead, e.g. i(n) = u[2n-1]. That’s pretty powerful for more complex examples, though it’s trickier with mutation (something like set_di(n, val) = (du[2n-1] = val) is not as nice). Regarding performance, this should be OK, modulo the performance of captured variables in closures (https://github.com/JuliaLang/julia/issues/15276).
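For instance, here is a minimal sketch of that idea applied to your function (the accessor names i, v, set_di!, and set_dv! are mine; V0, ssq, and NUM_LUMPS are as in your code):

function lcs(du, u, p, t)
    L, C, R, RT = p
    i(n) = u[2n-1]                     # odd indices are currents
    v(n) = u[2n]                       # even indices are voltages
    set_di!(n, val) = (du[2n-1] = val) # write di/dt for lump n
    set_dv!(n, val) = (du[2n]   = val) # write dv/dt for lump n

    set_di!(1, 1/L*(V0(t) - v(1) - R*ssq(i(1))))
    for n in 1:NUM_LUMPS
        set_dv!(n,   1/C*(i(n) - i(n+1)))
        set_di!(n+1, 1/L*(v(n) - v(n+1) - R*ssq(i(n+1))))
    end
    n = NUM_LUMPS + 1
    set_dv!(n, 1/C*(i(n) - v(n)/RT))
end

Because these are closures over u and du, the 2n-1/2n index mapping lives in one place and the equations read much like the views version.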

I am surprised that this is happening. In theory, the views should get inlined and elided in that kind of code. Could you try this?

function lcs(du,u,p,t)
    L, C, R, RT = p
    @inbounds begin
        di = view(du,1:2:length(du))
        i  = view(u, 1:2:length(u))  # odd indices are currents
        dv = view(du,2:2:length(du))
        v  = view(u, 2:2:length(u))  # even indices are voltages
        di[1] = 1/L*(V0(t)-v[1]-R*ssq(i[1]))
        for n in 1:NUM_LUMPS
            dv[n] = 1/C*(i[n]-i[n+1])
            di[n+1] = 1/L*(v[n]-v[n+1]-R*ssq(i[n+1]))
        end
        n = NUM_LUMPS+1
        dv[n] = 1/C*(i[n]-v[n]/RT)
    end
end

The reason for this rearrangement is that the optimizer sometimes gets confused by possible error paths and allocates objects eagerly, just in case they are later needed for an error message if some exception is thrown.

I can’t reproduce what you are seeing (using some naive defaults for V0, NUM_LUMPS, and ssq):

julia> const NUM_LUMPS = 999
999

julia> V0(t) = t
V0 (generic function with 1 method)

julia> ssq(x) = x
ssq (generic function with 1 method)

julia> function lcs(du,u,p,t)
           L, C, R, RT = p
           n=1
           du[n] = 1/L*(V0(t)-u[n+1]-R*ssq(u[n]))        # di/dt
           for lump in 1:NUM_LUMPS
               n+=1
               du[n] = 1/C*(u[n-1]-u[n+1])               # dv/dt
               du[n+1] = 1/L*(u[n]-u[n+2]-R*ssq(u[n+1])) # di/dt
               n+=1
           end
           du[n+1] = 1/C*(u[n]-u[n+1]/RT)                # dv/dt
       end
lcs (generic function with 1 method)

julia> function lcs2(du,u,p,t)
           di = view(du,1:2:length(du))
           i  = view(u, 1:2:length(u))  # odd indices are currents
           dv = view(du,2:2:length(du))
           v  = view(u, 2:2:length(u))  # even indices are voltages
           L, C, R, RT = p
           di[1] = 1/L*(V0(t)-v[1]-R*ssq(i[1]))     
           for n in 1:NUM_LUMPS
               dv[n] = 1/C*(i[n]-i[n+1])          
               di[n+1] = 1/L*(v[n]-v[n+1]-R*ssq(i[n+1])) 
           end
           n = NUM_LUMPS+1
           dv[n] = 1/C*(i[n]-v[n]/RT)               
       end
lcs2 (generic function with 1 method)

julia> using BenchmarkTools

julia> @btime lcs($(rand(2000)), $(rand(2000)), (1E0, 1E0, 1E0, 1E0), 1);
  1.857 μs (0 allocations: 0 bytes)

julia> @btime lcs2($(rand(2000)), $(rand(2000)), (1E0, 1E0, 1E0, 1E0), 1);
  1.655 μs (0 allocations: 0 bytes)

This is on the upcoming Julia (1.2), built from source. @klaff, could you share what you are seeing in a similarly self-contained way, so people can help you?

Sorry, I didn’t give enough context to make it runnable. Here are 3 versions.

  1. original: 0.623364 seconds (3.71 M allocations: 184.502 MiB, 5.81% gc time):
using DifferentialEquations

NUM_LUMPS = 80

"ODE model of cascaded LC circuits with nonlinear Rs"
function lcs(du,u,p,t)
    L, C, R, RT = p
    n=1
    du[n] = 1/L*(V0(t)-u[n+1]-R*ssq(u[n]))        # di/dt
    for lump in 1:NUM_LUMPS
        n+=1
        du[n] = 1/C*(u[n-1]-u[n+1])               # dv/dt
        du[n+1] = 1/L*(u[n]-u[n+2]-R*ssq(u[n+1])) # di/dt
        n+=1
    end
    du[n+1] = 1/C*(u[n]-u[n+1]/RT)                # dv/dt
end

ssq(x) = x>0 ? x^2 : -x^2
V0(t)=10.0*(t<10.0 ? t/5.0-(t/5.0)^2/4 : 1.0)

u0 = zeros(2*NUM_LUMPS+2)
# TODO use labelled vector here
p = [1.0, 1.0, 0.002, 100.0]
tspan = (0.0, 800.0)
prob = ODEProblem(lcs, u0, tspan, p)
@time sol = solve(prob, Tsit5(), reltol=1e-8, abstol=1e-8)

using Plots
plot(sol, vars=[2,NUM_LUMPS, NUM_LUMPS*2], legend=false)
  2. using views without @inbounds (1.257474 seconds (35.15 M allocations: 671.780 MiB, 12.95% gc time)):
using DifferentialEquations

NUM_LUMPS = 80

"ODE model of cascaded LC circuits with nonlinear Rs"
function lcs(du,u,p,t)
    di = view(du,1:2:length(du))
    i  = view(u, 1:2:length(u))  # odd indices are currents
    dv = view(du,2:2:length(du))
    v  = view(u, 2:2:length(u))  # even indices are voltages
    L, C, R, RT = p
    di[1] = 1/L*(V0(t)-v[1]-R*ssq(i[1]))     
    for n in 1:NUM_LUMPS
        dv[n] = 1/C*(i[n]-i[n+1])               # dv/dt
        di[n+1] = 1/L*(v[n]-v[n+1]-R*ssq(i[n+1])) # di/dt
    end
    n = NUM_LUMPS+1
    dv[n] = 1/C*(i[n]-v[n]/RT)                # dv/dt
end

ssq(x) = x>0 ? x^2 : -x^2
V0(t)=10.0*(t<10.0 ? t/5.0-(t/5.0)^2/4 : 1.0)

u0 = zeros(2*NUM_LUMPS+2)
# TODO use labelled vector here
p = [1.0, 1.0, 0.002, 100.0]
tspan = (0.0, 800.0)
prob = ODEProblem(lcs, u0, tspan, p)
@time sol = solve(prob, Tsit5(), reltol=1e-8, abstol=1e-8)

using Plots
plot(sol, vars=[2,NUM_LUMPS, NUM_LUMPS*2], legend=false)
  3. with views and with @inbounds (1.368173 seconds (35.08 M allocations: 701.929 MiB, 11.61% gc time)):
using DifferentialEquations

NUM_LUMPS = 80

"ODE model of cascaded LC circuits with nonlinear Rs"
function lcs(du,u,p,t)
    @inbounds begin
        di = view(du,1:2:length(du))
        i  = u[1:2:length(u)]  # odd indices are currents (note: slicing makes a copy, not a view)
        dv = view(du,2:2:length(du))
        v  = u[2:2:length(u)]  # even indices are voltages (note: slicing makes a copy, not a view)
        L, C, R, RT = p
        di[1] = 1/L*(V0(t)-v[1]-R*ssq(i[1]))
        for n in 1:NUM_LUMPS
            dv[n] = 1/C*(i[n]-i[n+1])                 # dv/dt
            di[n+1] = 1/L*(v[n]-v[n+1]-R*ssq(i[n+1])) # di/dt
        end
        n = NUM_LUMPS+1
        dv[n] = 1/C*(i[n]-v[n]/RT)                    # dv/dt
    end
end

ssq(x) = x>0 ? x^2 : -x^2
V0(t)=10.0*(t<10.0 ? t/5.0-(t/5.0)^2/4 : 1.0)

u0 = zeros(2*NUM_LUMPS+2)
# TODO use labelled vector here
p = [1.0, 1.0, 0.002, 100.0]
tspan = (0.0, 800.0)
prob = ODEProblem(lcs, u0, tspan, p)
@time sol = solve(prob, Tsit5(), reltol=1e-8, abstol=1e-8)

using Plots
plot(sol, vars=[2,NUM_LUMPS, NUM_LUMPS*2], legend=false)

I just realized I didn’t read your reply closely enough. Here’s your version of my function in full context (1.262483 seconds (35.14 M allocations: 671.353 MiB, 13.10% gc time)):

using DifferentialEquations

NUM_LUMPS = 80

"ODE model of cascaded LC circuits with nonlinear Rs"
function lcs(du,u,p,t)
    L, C, R, RT = p
    @inbounds begin
        di = view(du,1:2:length(du))
        i  = view(u, 1:2:length(u))  # odd indices are currents
        dv = view(du,2:2:length(du))
        v  = view(u, 2:2:length(u))  # even indices are voltages
        di[1] = 1/L*(V0(t)-v[1]-R*ssq(i[1]))
        for n in 1:NUM_LUMPS
            dv[n] = 1/C*(i[n]-i[n+1])
            di[n+1] = 1/L*(v[n]-v[n+1]-R*ssq(i[n+1]))
        end
        n = NUM_LUMPS+1
        dv[n] = 1/C*(i[n]-v[n]/RT)
    end
end

ssq(x) = x>0 ? x^2 : -x^2
V0(t)=10.0*(t<10.0 ? t/5.0-(t/5.0)^2/4 : 1.0)

u0 = zeros(2*NUM_LUMPS+2)
# TODO use labelled vector here
p = [1.0, 1.0, 0.002, 100.0]
tspan = (0.0, 800.0)
prob = ODEProblem(lcs, u0, tspan, p)
@time sol = solve(prob, Tsit5(), reltol=1e-8, abstol=1e-8)

using Plots
plot(sol, vars=[2,NUM_LUMPS, NUM_LUMPS*2], legend=false)

Shouldn’t you use const NUM_LUMPS = 80 in order to ensure type stability when reading the global variable?

The same issue is present in the old (non-refactored) version. However, that version is written such that lump from for lump in 1:NUM_LUMPS is never used, so only the iteration itself and the last two lines after the loop involve dynamic dispatch. In contrast, the new version uses for n in 1:NUM_LUMPS, which means that n is ::Any inside the loop, and hence almost everything becomes ::Any.
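As a rough illustration (a toy sketch, not the model from this thread; NUM_A, NUM_B, f_nonconst, and f_const are made-up names), @code_warntype shows how a non-const global loop bound propagates Any:

using InteractiveUtils  # provides @code_warntype outside the REPL

NUM_A = 10          # non-const global: its type is unknown to the compiler
const NUM_B = 10    # const global: type is known

function f_nonconst(x)
    s = 0.0
    for n in 1:NUM_A    # 1:NUM_A is inferred as Any, so n is Any inside the loop
        s += x[n]       # and s becomes Any too -> dynamic dispatch and boxing
    end
    return s
end

function f_const(x)
    s = 0.0
    for n in 1:NUM_B    # concrete UnitRange{Int}, so n::Int and s stays Float64
        s += x[n]
    end
    return s
end

@code_warntype f_nonconst(rand(10))  # n and s flagged as Any
@code_warntype f_const(rand(10))     # everything concretely inferred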

Making NUM_LUMPS const doesn’t change the allocations. Nor does replacing every instance of NUM_LUMPS with 80, so if it’s a type issue, that’s not where it is. I’ll explore the change in how the loop variable is used.

In my testing, your second example modified only by adding const NUM_LUMPS = 80 completely fixes the issue; in fact, it then becomes faster than your first example.
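Another way to sidestep the issue entirely (just a sketch, reusing your V0 and ssq) is to not read the global inside the RHS at all and instead derive the lump count from length(u):

# Sketch: derive the lump count from the state vector itself
# (u has 2*NUM_LUMPS + 2 entries in this model), so the loop bound
# is an ordinary Int and no global is read inside the function.
function lcs(du, u, p, t)
    L, C, R, RT = p
    num_lumps = (length(u) - 2) ÷ 2
    di = view(du, 1:2:length(du))
    i  = view(u,  1:2:length(u))   # odd indices are currents
    dv = view(du, 2:2:length(du))
    v  = view(u,  2:2:length(u))   # even indices are voltages
    di[1] = 1/L*(V0(t) - v[1] - R*ssq(i[1]))
    for n in 1:num_lumps
        dv[n] = 1/C*(i[n] - i[n+1])
        di[n+1] = 1/L*(v[n] - v[n+1] - R*ssq(i[n+1]))
    end
    n = num_lumps + 1
    dv[n] = 1/C*(i[n] - v[n]/RT)
end

You could also pass the lump count through p alongside the other parameters.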

I think I was getting confused by comparing things in different cells of the same Jupyter notebook, so I put them in different notebooks, and my results now agree with yours (I think), although I’ve noticed something else that surprises me. If I restart the kernel and rerun the cell, the first run takes longer and does a lot more allocation (no surprise there), and the second run is faster. But then the third run is faster still!

So for a given file, the allocations might go from 17.8 M to 2.3 M to 1.6 M, while the memory goes from 935 MiB to 154 MiB to 122 MiB. I don’t know why it drops again from the second run to the third; it seems stable from there on.

I don’t have time right now (and it may be a few days), but I want to see if I can get back to my previous (confusing) state to learn how to avoid it in the future.

Thanks for your help.

Rather than worrying about the speed of the first call, check out BenchmarkTools.jl (https://github.com/JuliaCI/BenchmarkTools.jl) for a more precise and robust way to benchmark your code. BenchmarkTools runs your code many times, measuring the minimum, mean, and maximum execution times along with memory usage, which should give you a clearer picture of how your code actually performs.
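For example (just a usage sketch; du, u, and p here are throwaway inputs, and prob is the ODEProblem defined earlier in the thread):

using BenchmarkTools

du = zeros(2*NUM_LUMPS + 2)
u  = rand(2*NUM_LUMPS + 2)
p  = [1.0, 1.0, 0.002, 100.0]

# Benchmark just the RHS function; $-interpolation keeps non-const globals
# out of the benchmarked expression itself.
@btime lcs($du, $u, $p, 1.0)

# Or benchmark the whole solve (prob as defined above):
@btime solve($prob, Tsit5(), reltol=1e-8, abstol=1e-8);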
