I have stumbled upon a frustrating performance issue with either DifferentialEquations.jl or the Julia compiler (or maybe my code?). The motivating use-case is the application of the 1-dimensional heat equation (and possibly additional dynamics) to a heterogeneous surface that must be broken up into (somewhat) homogeneous pieces for discretization. Thus, we have a large vector representing the energy state of the entire surface, but we need to call different forms of the heat equation for different chunks. For the purposes of this example, we’ll just call the same function for each chunk.
The problem is that, when views are used at both levels (i.e. for the chunks as well as inside the second-order finite difference), the ODE solve takes a >800x performance hit, apparently from allocations. The slowdown does not reproduce when benchmarking the functions standalone. The solve is fast when the top-level chunks are plain slices, even though slicing should, in theory, involve more allocation.
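(For context, and stepping outside the ODE problem entirely: the reason views "should" be cheaper is that slicing copies while a view aliases the parent array. A minimal illustration, with throwaway names of my own:)

```julia
x = collect(1.0:10.0)

s = x[2:5]        # slice: allocates a fresh Vector, independent of x
v = @view x[2:5]  # view: a SubArray wrapping x, no copy

v[1] = 99.0       # writes through to the parent
# x[2] is now 99.0, while s is unaffected
```

So in isolation the view-based code path is the one you would expect to be allocation-free, which is exactly what the standalone benchmarks below show.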
I’ve talked to Chris R. on Slack and he thinks it might be some weird compiler issue. So if anyone here is more experienced with the nuances of Julia’s compiler, I would really appreciate your help!
I have prepared a simple MWE to demonstrate the problem. See below. This might be clearer than me blabbering anyway!
```julia
using DifferentialEquations, OrdinaryDiffEq, DiffEqOperators
using ComponentArrays
using BenchmarkTools

const nknots = 100
const h = 1.0/(nknots+1)
const t0, t1 = (0.0, 1.0)

u0 = randn(300)
u0 = ComponentArray(a=u0[1:100], b=u0[101:200], c=u0[201:300])

@inline function heat_conduction(du,u,p,t)
    du .= zero(eltype(u))
    u₃ = @view u[3:end]
    u₂ = @view u[2:end-1]
    u₁ = @view u[1:end-2]
    @. du[2:end-1] = (u₃ - 2*u₂ + u₁)/h^2.0
    nothing
end

function func_linidx_with_slices(du,u,p,t)
    du1, u1 = du[1:100], u[1:100]
    du2, u2 = du[101:200], u[101:200]
    du3, u3 = du[201:300], u[201:300]
    heat_conduction(du1,u1,p,t)
    heat_conduction(du2,u2,p,t)
    heat_conduction(du3,u3,p,t)
end

function func_linidx_with_views(du,u,p,t)
    du1, u1 = (@view du[1:100]), @view u[1:100]
    du2, u2 = (@view du[101:200]), @view u[101:200]
    du3, u3 = (@view du[201:300]), @view u[201:300]
    heat_conduction(du1,u1,p,t)
    heat_conduction(du2,u2,p,t)
    heat_conduction(du3,u3,p,t)
end

prob = ODEProblem(func_linidx_with_slices, u0, (t0, t1))
sol = @btime solve(prob, Tsit5(), saveat=0.2)
# 70.031 μs (271 allocations: 234.27 KiB)

prob = ODEProblem(func_linidx_with_views, u0, (t0, t1))
sol = @btime solve(prob, Tsit5(), saveat=0.2)
# 83.342 ms (11714 allocations: 242.91 KiB)

du0 = similar(u0)
du0 .= 0.0

@benchmark func_linidx_with_slices($du0,$u0,nothing,0.0)
# BenchmarkTools.Trial:
#   memory estimate:  5.25 KiB
#   allocs estimate:  6
#   --------------
#   minimum time:     900.111 ns (0.00% GC)
#   median time:      1.019 μs (0.00% GC)
#   mean time:        1.357 μs (15.32% GC)
#   maximum time:     72.786 μs (95.15% GC)
#   --------------
#   samples:          10000
#   evals/sample:     36

@benchmark func_linidx_with_views($du0,$u0,nothing,0.0)
# BenchmarkTools.Trial:
#   memory estimate:  0 bytes
#   allocs estimate:  0
#   --------------
#   minimum time:     374.263 ns (0.00% GC)
#   median time:      375.234 ns (0.00% GC)
#   mean time:        393.908 ns (0.00% GC)
#   maximum time:     1.268 μs (0.00% GC)
#   --------------
#   samples:          10000
#   evals/sample:     205
```
The main thing to note here is the discrepancy between benchmarking the full ODE solve and benchmarking the RHS functions standalone. Standalone, the view-based function is allocation-free and faster, as expected; inside `solve`, it is three orders of magnitude slower than the slice-based one.
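One diagnostic that might help localize this (a sketch of my own; `make_counting_rhs` is a made-up helper, not part of the MWE): wrap the RHS so we can measure how much it allocates per call *as the solver invokes it*, rather than in an isolated benchmark.

```julia
# Wrap an in-place RHS f!(du,u,p,t) so that every call accumulates the bytes
# it allocates into a Ref we can inspect after the solve.
function make_counting_rhs(f!)
    nalloc = Ref(0)  # running total of bytes allocated inside f!
    function counting!(du, u, p, t)
        nalloc[] += @allocated f!(du, u, p, t)
        nothing
    end
    return counting!, nalloc
end

rhs!, nalloc = make_counting_rhs(func_linidx_with_views)
solve(ODEProblem(rhs!, u0, (t0, t1)), Tsit5(), saveat=0.2)
nalloc[]  # nonzero would confirm the allocations happen inside the RHS calls themselves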