Performance comparison for equivalent DiffEqFlux models

Hello,

I am considering these three implementations of the same ODE right-hand side:

# Packages used by the model definitions and the benchmark below
using OrdinaryDiffEq, DiffEqFlux, Flux, ComponentArrays, UnPack, BenchmarkTools, Statistics

# In-place version: the derivative is written into the preallocated D
function minimalModelNonAlloc!(D, u, p, t)
	M = p
	v = 0.2 .* ones(length(u))

	D .= v - M * u.^3 - u.^5
	return nothing
end

# Out-of-place version: the derivative is returned as a newly allocated vector
function minimalModelAlloc(u, p, t)
	M = p
	v = 0.2 .* ones(length(u))

	return v - M * u.^3 - u.^5
end

# Out-of-place version with the parameters stored in a ComponentArray container
function minimalModelAllocContainer(u, p, t)
	@unpack M = p
	v = 0.2 .* ones(length(u))

	return v - M * u.^3 - u.^5
end

and compare their performance when optimising a loss function with respect to M (see below) for different dimensions of u. In the last model I use a ComponentArray as the container for p.
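For reference, here is a minimal sketch of the container pattern the third model relies on (just an illustration with a toy 3×3 matrix; it assumes ComponentArrays.jl and UnPack.jl):

using ComponentArrays, UnPack

p = ComponentArray(M = ones(3, 3))   # labelled parameter container
@unpack M = p                        # equivalent to M = p.M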

I find the following result:
[Performance test plot]
(There is also a similar result for the allocated memory, but I am only allowed to upload one file :confused: )

Okay, unsurprisingly, the containerised version fares worse than minimalModelAlloc, which I expected, since the container adds some overhead. But the magnitude of the gap still surprises me. The biggest surprise, however, is that the non-allocating version is by far the worst: we are talking 1.9 seconds per run vs. ~80 ms for minimalModelAlloc at the same length(u). These results are completely unintuitive to me. Could anyone please tell me what part of the equation I am missing? :smiley:

The benchmarking code I use reads:

dims     = [2, 5, 10, 20, 50, 100, 200]
memories = []
times    = []
for dim ∈ dims
	@info "Processing $dim dimensions"

	u0    = zeros(dim)
	# pTest = ones(dim, dim)                       # plain-matrix parameters for the first two models
	pTest = ComponentArray(M = ones(dim, dim))     # container parameters for the third model
	prob  = ODEProblem(minimalModelAllocContainer, u0, (0.0, 10.0), pTest)

	function lossBenchmark(p)
		sol  = solve(prob, Tsit5(), p = p)
		loss = sum(abs2, sol[end])
		return loss, sol
	end

	benchmark = @benchmark DiffEqFlux.sciml_train($lossBenchmark, $pTest, $(ADAM(0.05)), maxiters = 30) samples = 100 seconds = 30

	push!(memories, benchmark.memory)
	push!(times,    benchmark.times |> mean)
end

All models are tested with exactly the same u0 and M, so that cannot be the reason. I also know there is still a lot of room for improvement; my goal here was just to make the three approaches comparable. Thanks in advance for any input on this puzzling finding :slight_smile:

It has to do with how reverse-mode AD is implemented. Reverse-mode AD requires knowing the primal values in order to reverse through them, and in order to handle mutation it drops into a fallback behavior based on scalar mode.
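As a quick standalone illustration (a toy example, not the actual DiffEqSensitivity code path): a plain reverse-mode tool such as Zygote refuses mutation outright, which is why a fallback path is needed for the in-place model at all.

using Zygote

f_mut(x)  = (y = zero(x); y .= x .^ 2; sum(y))   # mutates the buffer y
f_pure(x) = sum(x .^ 2)                          # same result, no mutation

Zygote.gradient(f_pure, rand(3))   # works
Zygote.gradient(f_mut,  rand(3))   # errors: "Mutating arrays is not supported"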

https://diffeq.sciml.ai/stable/analysis/sensitivity/#Internal-Automatic-Differentiation-Options-(ADKwargs)

https://diffeqflux.sciml.ai/dev/ControllingAdjoints/

Basically, you might want to try benchmarking with:

	function lossBenchmark(p)
		sol  = solve(prob, Tsit5(), p = p, sensealg = InterpolatingAdjoint(autojacvec = ReverseDiffVJP(true)))
		loss = sum(abs2, sol[end])
		return loss, sol
	end

That would compile the tape and improve the speed of scalar mode. You’ll probably see a smaller difference in that case (and all will be faster than before).
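To see what "compile the tape" means in isolation, here is a small self-contained ReverseDiff example with a toy loss (just an illustration, not the code path inside the adjoint):

using ReverseDiff

toyLoss(u) = sum(abs2, 0.2 .- u.^3 .- u.^5)     # stand-in for the real loss

u0    = rand(5)
tape  = ReverseDiff.GradientTape(toyLoss, u0)   # record the operations once
ctape = ReverseDiff.compile(tape)               # compile so later calls just replay the tape
grad  = similar(u0)
ReverseDiff.gradient!(grad, ctape, u0)          # fast replay, no re-tracing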

The choice of sensealg can matter a lot for this. But yes, reverse-mode can be quite unintuitive in terms of allocation handling, or at least it will look like it breaks the standard rules.
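If you want to see how much the choice matters at your problem sizes, a rough sweep like the following can help (just a sketch; the algorithm list is only a suggestion, all names come from DiffEqSensitivity, and prob / pTest are assumed to be set up as in your loop):

using OrdinaryDiffEq, DiffEqSensitivity, Zygote, BenchmarkTools, Statistics

sensealgs = (
	InterpolatingAdjoint(autojacvec = ReverseDiffVJP(true)),   # compiled-tape VJPs
	QuadratureAdjoint(autojacvec = ReverseDiffVJP(true)),
	InterpolatingAdjoint(autojacvec = ZygoteVJP()),
	ForwardDiffSensitivity(),                                   # forward mode, fine for few parameters
)

for sensealg in sensealgs
	b = @benchmark Zygote.gradient(
		p -> sum(abs2, solve($prob, Tsit5(), p = p, sensealg = $sensealg)[end]),
		$pTest) samples = 20 seconds = 30
	@info "benchmarked" sensealg mean_time = mean(b.times)
end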

Thanks for the answer! The performance of the allocating and non-allocating models is now indeed quite comparable. The model using ComponentArrays, however, now fails with type TrackedArray has no field M in the line @unpack M = p; full stack trace below.

This aligns with what I observed in earlier testing, so I assume ComponentArrays do not work properly with this AD path? I know this now goes a little off topic, but are there any containers that do work with AD, or how hard would it be to create a struct with the necessary functions for this?

type TrackedArray has no field M
in top-level scope at Performance Tests.jl:38
in macro expansion at BenchmarkTools\eCEpo\src\execution.jl:287
in warmup at BenchmarkTools\eCEpo\src\execution.jl:141
in #warmup#45 at BenchmarkTools\eCEpo\src\execution.jl:141 
in run##kw at BenchmarkTools\eCEpo\src\execution.jl:94 
in run##kw at BenchmarkTools\eCEpo\src\execution.jl:94 
in #run#40 at BenchmarkTools\eCEpo\src\execution.jl:94
in run_result##kw at BenchmarkTools\eCEpo\src\execution.jl:32 
in #run_result#37 at BenchmarkTools\eCEpo\src\execution.jl:32 
in invokelatest##kw at base\essentials.jl:711 
in #invokelatest#1 at base\essentials.jl:716 
in  at base\essentials.jl:715
in _run##kw at BenchmarkTools\eCEpo\src\execution.jl:399 
in #_run#38 at BenchmarkTools\eCEpo\src\execution.jl:405
in ##sample#564 at BenchmarkTools\eCEpo\src\execution.jl:377
in ##core#563 at BenchmarkTools\eCEpo\src\execution.jl:371
in sciml_train##kw at DiffEqFlux\FZMwP\src\train.jl:77 
in sciml_train##kw at DiffEqFlux\FZMwP\src\train.jl:77 
in #sciml_train#31 at DiffEqFlux\FZMwP\src\train.jl:42
in maybe_with_logger at DiffEqBase\V7P18\src\utils.jl:259
in  at DiffEqFlux\FZMwP\src\train.jl:43
in macro expansion at ProgressLogging\BBN0b\src\ProgressLogging.jl:328 
in macro expansion at DiffEqFlux\FZMwP\src\train.jl:98 
in gradient at Zygote\1GXzF\src\compiler\interface.jl:54
in  at Zygote\1GXzF\src\compiler\interface.jl:177
in  at Zygote\1GXzF\src\compiler\interface2.jl
in #33 at DiffEqFlux\FZMwP\src\train.jl:99 
in  at ZygoteRules\OjfTt\src\adjoint.jl:59
in #175 at Zygote\1GXzF\src\lib\lib.jl:182 
in  at Zygote\1GXzF\src\compiler\interface2.jl
in lossBenchmark at Performance Tests.jl:29 
in  at Zygote\1GXzF\src\compiler\interface2.jl
in solve##kw at DiffEqBase\V7P18\src\solve.jl:100 
in  at ZygoteRules\OjfTt\src\adjoint.jl:59
in  at Zygote\1GXzF\src\lib\lib.jl:182
in  at Zygote\1GXzF\src\compiler\interface2.jl
in #solve#460 at DiffEqBase\V7P18\src\solve.jl:102 
in  at ZygoteRules\OjfTt\src\adjoint.jl:59
in #175 at Zygote\1GXzF\src\lib\lib.jl:182 
in #498#back at ZygoteRules\OjfTt\src\adjoint.jl:59 
in  at DiffEqSensitivity\WiCRA\src\local_sensitivity\concrete_solve.jl:144
in  at DiffEqSensitivity\WiCRA\src\local_sensitivity\sensitivity_interface.jl:6
in #adjoint_sensitivities#42 at DiffEqSensitivity\WiCRA\src\local_sensitivity\sensitivity_interface.jl:6
in _adjoint_sensitivities at DiffEqSensitivity\WiCRA\src\local_sensitivity\sensitivity_interface.jl:13
in _adjoint_sensitivities at DiffEqSensitivity\WiCRA\src\local_sensitivity\sensitivity_interface.jl:13
in #_adjoint_sensitivities#43 at DiffEqSensitivity\WiCRA\src\local_sensitivity\sensitivity_interface.jl:17
in ODEAdjointProblem##kw at DiffEqSensitivity\WiCRA\src\local_sensitivity\interpolating_adjoint.jl:161 
in #ODEAdjointProblem#114 at DiffEqSensitivity\WiCRA\src\local_sensitivity\interpolating_adjoint.jl:173
in ODEInterpolatingAdjointSensitivityFunction at DiffEqSensitivity\WiCRA\src\local_sensitivity\interpolating_adjoint.jl:22 
in #ODEInterpolatingAdjointSensitivityFunction#110 at DiffEqSensitivity\WiCRA\src\local_sensitivity\interpolating_adjoint.jl:55
in  at DiffEqSensitivity\WiCRA\src\local_sensitivity\adjoint_common.jl:27
in #adjointdiffcache#73 at DiffEqSensitivity\WiCRA\src\local_sensitivity\adjoint_common.jl:131
in ReverseDiff.GradientTape at ReverseDiff\vScHI\src\api\tape.jl:204
in ReverseDiff.GradientTape at ReverseDiff\vScHI\src\api\tape.jl:207
in  at DiffEqSensitivity\WiCRA\src\local_sensitivity\adjoint_common.jl:132
in  at DiffEqBase\V7P18\src\diffeqfunction.jl:248
in minimalModelAllocContainer at Performance Tests.jl:17
in macro expansion at UnPack\EkESO\src\UnPack.jl:101 
in unpack at UnPack\EkESO\src\UnPack.jl:34 
in getproperty at base\Base.jl:33

ReverseDiff only supports a small set of array types, so it’s a ReverseDiff.jl issue; Zygote.jl is fine with this. Unless you want to get in the weeds for fun, I’d say just do what you can for now; Diffractor (or Enzyme) will solve a lot of these little issues in a better way.
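If you want to keep the ComponentArray container in the meantime, one thing worth trying (an untested sketch, based on the point above that Zygote handles these arrays) is to switch the internal VJP choice from ReverseDiff to Zygote, so that @unpack M = p never sees a ReverseDiff TrackedArray:

	function lossBenchmark(p)
		sol  = solve(prob, Tsit5(), p = p, sensealg = InterpolatingAdjoint(autojacvec = ZygoteVJP()))
		loss = sum(abs2, sol[end])
		return loss, sol
	end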