Large ODE Solver for Metal.jl

Is there any implementation of ODE solving using Metal.jl backends? I know there is DiffEqGPU, but its Metal support is only for ensemble problems. I cannot, for instance, do

using OrdinaryDiffEq, Metal
A = MtlArray(rand(Float32, 1024, 1024))  # example system matrix on the GPU
u0 = MtlArray(rand(Float32, 1024))       # example initial condition
f(u, p, t) = A * u
tspan = (0.0f0, 1.0f0)
prob = ODEProblem(f, u0, tspan)
solve(prob, Tsit5())

or

solve(prob, GPUTsit5())

where A and u0 are Metal arrays. Tsit5() copies back and forth to the CPU, and GPUTsit5 is designed for the ensemble solver. Am I just going about this the wrong way, or would I need to write my own solver algorithm to do this?
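For contrast, the ensemble path that DiffEqGPU does support on Metal looks roughly like this; the Lorenz system and exact keyword names here are illustrative, based on the DiffEqGPU docs, and may differ by version:

using DiffEqGPU, OrdinaryDiffEq, StaticArrays, Metal

# Out-of-place, StaticArray-based RHS, as the GPU ensemble kernels require
function lorenz(u, p, t)
    du1 = p[1] * (u[2] - u[1])
    du2 = u[1] * (p[2] - u[3]) - u[2]
    du3 = u[1] * u[2] - p[3] * u[3]
    return SVector{3}(du1, du2, du3)
end

u0 = @SVector [1.0f0, 0.0f0, 0.0f0]
p = @SVector [10.0f0, 28.0f0, 8.0f0 / 3.0f0]
prob_ens = ODEProblem{false}(lorenz, u0, (0.0f0, 10.0f0), p)
monteprob = EnsembleProblem(prob_ens, safetycopy = false)

# Solves many small, independent copies of the ODE on the Apple GPU
sol = solve(monteprob, GPUTsit5(), EnsembleGPUKernel(Metal.MetalBackend());
            trajectories = 10_000)

What I want is the opposite: one large ODE whose state lives on the GPU, not many small ones.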


It does not; Tsit5 doesn't copy back and forth to the CPU. Where did you get that idea?

If I set up the same problem, with f(u,p,t) = A*u as the function, the number of allocations when A and u are MtlArrays is almost 50x higher than when both are plain Float32 arrays (and the MtlArray version also takes significantly longer). My assumption is that there are steps in the Tsit5 algorithm that aren’t handled properly in Metal, and maybe the data is getting stored in non-Metal arrays and copied back and forth? I’m not sure what other reason there would be for such high memory allocations. It is not just Tsit5 either; the Vern solvers have the same problem, and I’m guessing all the other solvers will as well.

I could be totally wrong and the issue is elsewhere. I just don’t have any idea of what it would be.
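For reference, this is roughly how I’ve been comparing the two; the size is just one of the ones I tested, and the setup is illustrative:

using OrdinaryDiffEq, Metal, BenchmarkTools

n = 1024                                  # illustrative size
A_cpu = rand(Float32, n, n); u0_cpu = rand(Float32, n)
A_gpu = MtlArray(A_cpu);     u0_gpu = MtlArray(u0_cpu)

f(u, p, t) = p * u                        # pass the matrix in through p
prob_cpu = ODEProblem(f, u0_cpu, (0.0f0, 1.0f0), A_cpu)
prob_gpu = ODEProblem(f, u0_gpu, (0.0f0, 1.0f0), A_gpu)

@benchmark solve($prob_cpu, Tsit5())      # compare times and allocation counts
@benchmark solve($prob_gpu, Tsit5())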

What about f(du,u,p,t) = mul!(du,A,u)? You’re using the allocating path instead of the non-allocating path.
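A minimal sketch of that in-place formulation (array sizes are illustrative):

using OrdinaryDiffEq, Metal, LinearAlgebra

A = MtlArray(rand(Float32, 1024, 1024))
u0 = MtlArray(rand(Float32, 1024))

# In-place RHS: mul! writes A*u into the preallocated du buffer
f!(du, u, p, t) = mul!(du, A, u)

prob = ODEProblem(f!, u0, (0.0f0, 1.0f0))
solve(prob, Tsit5())

With the in-place form, the solver can reuse its internal caches instead of allocating a fresh array for every RHS evaluation.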

MtlArray operations are generally pretty slow until things get large. Are you testing 50,000x50,000 matrices?

I just gave that a try, and it still has very large allocations and is orders of magnitude slower.
Also no, I am testing with smaller matrices; the largest I tried were a couple around 4000x4000 (at which point performance was becoming marginally equivalent). Is there a way to speed up MtlArray operations, or is this an optimization that needs to happen in Metal.jl or on Apple’s end before it is fast for intermediate-scale problems?

Though even if the operations themselves are slower, I’m still a bit confused by the memory allocations. I feel like they should be comparable.

Metal operations are much slower at that size. See for example:
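You can also measure the crossover directly; a rough sketch with illustrative sizes, assuming Metal.@sync behaves like its CUDA counterpart:

using Metal, BenchmarkTools, LinearAlgebra

for n in (256, 1024, 4096, 8192)
    A, u, du = rand(Float32, n, n), rand(Float32, n), zeros(Float32, n)
    Ag, ug, dug = MtlArray(A), MtlArray(u), MtlArray(du)
    t_cpu = @belapsed mul!($du, $A, $u)
    t_gpu = @belapsed Metal.@sync mul!($dug, $Ag, $ug)  # sync so the kernel is timed
    println("n = $n: CPU $(round(t_cpu; sigdigits=3)) s, GPU $(round(t_gpu; sigdigits=3)) s")
end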

It should be comparable, and it is worth looking into. But first make it in-place like I showed; then it should not be allocating in the steps.

Ah, interesting. I have seen that before, but I wasn’t sure how much it would translate to ODE solving.

I did try the in-place version; the number of allocations went down, but nowhere near as much as it did for the CPU version. In fact, the gap in allocations actually got worse: ~120x more for the MtlArray.

What does the allocation profiler say the source is?

Oh, uh, I have just been using @benchmark. I’m not very familiar with the memory-allocation profiler; it might take some time for me to figure out how to interpret it.


Is this the right thing to look at? (following [Profiling · The Julia Language] using PProf.)
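In case it helps, this is roughly the sequence I ran, following the manual (a sketch; the sample_rate and problem setup are illustrative):

using OrdinaryDiffEq, Metal, LinearAlgebra, Profile, PProf

A = MtlArray(rand(Float32, 1024, 1024))
u0 = MtlArray(rand(Float32, 1024))
f!(du, u, p, t) = mul!(du, A, u)
prob = ODEProblem(f!, u0, (0.0f0, 1.0f0))
solve(prob, Tsit5())                       # warm up so compilation isn't profiled

Profile.Allocs.clear()
Profile.Allocs.@profile sample_rate = 1 solve(prob, Tsit5())
PProf.Allocs.pprof(from_c = false)         # opens the pprof web UI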

Use the VS Code profiler to get it into a flamegraph? Usually I find that makes it easier to find the lines of code the allocations can be attributed to.
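With the Julia VS Code extension, that is one macro in the extension’s REPL; a minimal sketch, reusing prob from above:

# Requires the Julia VS Code extension's integrated REPL
@profview_allocs solve(prob, Tsit5()) sample_rate = 1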