Using views causes significant slowdown in simple PDE

I have stumbled upon a frustrating performance issue with either DifferentialEquations.jl or the Julia compiler (or maybe my own code?). The motivating use case is applying the 1-dimensional heat equation (and possibly additional dynamics) to a heterogeneous surface that must be broken up into (somewhat) homogeneous pieces for discretization. We therefore have one large vector representing the energy state of the entire surface, but we need to call a different form of the heat equation on each chunk. For the purposes of this example, we'll just call the same function for each chunk.

The problem is that, when views are used at both levels (i.e. for the chunks as well as for computing the second-order finite difference), there is a >800x performance hit, apparently due to allocations. This is not reproducible when benchmarking the functions standalone. The ODE solve is fast when using slices for the chunks at the top level, even though that should, in theory, involve more allocation.

I’ve talked to Chris R. on Slack and he thinks it might be some weird compiler issue. So if anyone here is more experienced with the nuances of Julia’s compiler, I would really appreciate your help!

I have prepared a simple MWE to demonstrate the problem; see below. It might be clearer than my blabbering anyway!

using DifferentialEquations, OrdinaryDiffEq, DiffEqOperators
using ComponentArrays
using BenchmarkTools

const nknots = 100
const h = 1.0/(nknots+1)
const t0,t1 = (0.0,1.0)
u0 = randn(300)
u0 = ComponentArray(a=u0[1:100],b=u0[101:200],c=u0[201:300])

@inline function heat_conduction(du,u,p,t)
  du .= zero(eltype(u))
  u₃ = @view u[3:end]
  u₂ = @view u[2:end-1]
  u₁ = @view u[1:end-2]
  # second-order central finite difference for ∂²u/∂x² on interior knots
  @. du[2:end-1] = (u₃ - 2*u₂ + u₁)/h^2
  nothing
end

function func_linidx_with_slices(du,u,p,t)
    du1, u1 = du[1:100], u[1:100]
    du2, u2 = du[101:200], u[101:200]
    du3, u3 = du[201:300], u[201:300]
    heat_conduction(du1,u1,p,t)
    heat_conduction(du2,u2,p,t)
    heat_conduction(du3,u3,p,t)
end

function func_linidx_with_views(du,u,p,t)
    du1, u1 = (@view du[1:100]),@view u[1:100]
    du2, u2 = (@view du[101:200]),@view u[101:200]
    du3, u3 = (@view du[201:300]),@view u[201:300]
    heat_conduction(du1,u1,p,t)
    heat_conduction(du2,u2,p,t)
    heat_conduction(du3,u3,p,t)
end

prob = ODEProblem(func_linidx_with_slices, u0, (t0, t1))
sol = @btime solve(prob, Tsit5(), saveat=0.2)
# 70.031 μs (271 allocations: 234.27 KiB)
prob = ODEProblem(func_linidx_with_views, u0, (t0, t1))
sol = @btime solve(prob, Tsit5(), saveat=0.2)
# 83.342 ms (11714 allocations: 242.91 KiB)
du0 = similar(u0)
du0 .= 0.0
@benchmark func_linidx_with_slices($du0,$u0,nothing,0.0)
# BenchmarkTools.Trial:
#   memory estimate:  5.25 KiB
#   allocs estimate:  6
#   --------------
#   minimum time:     900.111 ns (0.00% GC)
#   median time:      1.019 μs (0.00% GC)
#   mean time:        1.357 μs (15.32% GC)
#   maximum time:     72.786 μs (95.15% GC)
#   --------------
#   samples:          10000
#   evals/sample:     36
@benchmark func_linidx_with_views($du0,$u0,nothing,0.0)
# BenchmarkTools.Trial: 
#   memory estimate:  0 bytes
#   allocs estimate:  0
#   --------------
#   minimum time:     374.263 ns (0.00% GC)
#   median time:      375.234 ns (0.00% GC)
#   mean time:        393.908 ns (0.00% GC)
#   maximum time:     1.268 μs (0.00% GC)
#   --------------
#   samples:          10000
#   evals/sample:     205

The main thing to note here is the apparent difference between benchmarking the ODE solve vs the standalone functions. In the latter case, the function using views is more efficient, as expected.

Your func_linidx_with_slices does not modify du itself, does it? It modifies only the slices, which are copies of parts of du. Hence, the ODE algorithm just sees zero change and can step quickly without doing any real work.
Caveat: I'm on my phone and can't check that right now; this is just a guess based on the code.
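
The slice-vs-view distinction can be seen in isolation. A minimal sketch, independent of the MWE above:

```julia
du = zeros(4)

s = du[1:2]        # indexing with a range makes a copy
s .= 1.0           # mutates only the copy
du                 # du is unchanged, still all zeros

v = @view du[1:2]  # a view aliases du's memory
v .= 1.0           # writes through to the parent array
du                 # du[1:2] are now 1.0
```

This is exactly why the views version does real work inside the solver while the slices version leaves du untouched.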


Wow, I’m dumb. You’re totally right, that’s gotta be it. Thanks!
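
For anyone finding this later: a slices version would only be a fair comparison if it copied the results back into du. A hedged sketch (the function name is mine, not from the original MWE):

```julia
function func_linidx_with_slices_writeback(du,u,p,t)
    # slices are copies, so the results must be written
    # back into du for the solver to see them
    du1, u1 = du[1:100], u[1:100]
    du2, u2 = du[101:200], u[101:200]
    du3, u3 = du[201:300], u[201:300]
    heat_conduction(du1,u1,p,t)
    heat_conduction(du2,u2,p,t)
    heat_conduction(du3,u3,p,t)
    du[1:100] .= du1
    du[101:200] .= du2
    du[201:300] .= du3
    nothing
end
```

With the write-back in place, the solver actually integrates the heat equation, so its cost should be comparable to (or worse than) the views version rather than the artificially fast numbers above.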