Performance tips for differential equation RHS

Hello, I’ve been working on a project which relies on being able to swiftly solve something like the following system:

using DifferentialEquations
using StaticArrays
using Einsum
using Random
using Graphs
using LinearAlgebra # needed for Diagonal below

function wattsstrogatzmatrix(size, neighbors, rewiring_prob)
    g = watts_strogatz(size, 2*neighbors, rewiring_prob)
    coupling_matrix = adjacency_matrix(g)
    coupling_matrix = Matrix(coupling_matrix)
    coupling_matrix = coupling_matrix - Diagonal(vec(sum(coupling_matrix, dims=2))) # Ensure zero row sum
    return -coupling_matrix
end

function fhn_eom(x, params)
    a = params[1]
    eps = params[2]
    dx = (x[1] - (x[1]^3)/3 - x[2])/eps
    dy = x[1] + a
    return SVector{2}(dx, dy)
end

function bmatrix(phi, eps)
    return -[cos(phi)/eps sin(phi)/eps; -sin(phi) cos(phi)]
    #return SArray{Tuple{2,2}}(-cos(phi)/eps, sin(phi), -sin(phi)/eps, -cos(phi)) # column-major, equivalent static version
end

function coupled_fhn_eom!(dx, x, a, eps, coupling_strength, coupling_matrix, b)
    N = length(coupling_matrix[1, :])
    eachneuron = reshape(x, (2, N))
    coupling_terms = b * eachneuron
    @einsum coupling[i, j] := coupling_matrix[i, r] * coupling_terms[j, r]
    eom_params = SVector{2}(a, eps)
    for i in range(1, N)
        thisneuron = @view eachneuron[:, i]
        thiscoupling = @view coupling[i, :]
        dx_i = fhn_eom(thisneuron, eom_params) .+ coupling_strength .* thiscoupling
        # dx_i = fhn_eom(eachneuron[:, i], eom_params) .+ coupling_strength .* coupling[i, :]
        dx[2*i-1:2*i] = dx_i
    end
    nothing
end

N = 175
eps = 0.05
a = 0.5
b = bmatrix(pi/2-0.1, eps)
σ = 0.0506
G = wattsstrogatzmatrix(N, 3, 1);#0.232)

x_0 = zeros(2*N)
x_0[2 .* (1:N) .- 1] = rand(N) .* 2 .* a .- a
x_0[2 .* (1:N)] = rand(N) .* 2 .* (-a + a^3 / 3) .- (-a + a^3 / 3)

prob = ODEProblem((dx, x, params, t) -> coupled_fhn_eom!(dx, x, params[1], params[2], params[3], G, b), x_0, (0.0, 1000.0), [a, eps, σ])
sol = solve(prob);

I’m fairly new to Julia and have tried to optimize this as much as I could with tips I found online. I’ve had some success; the first versions were much slower than this. However, I’m having a hard time figuring out how to optimize it further with more Julia-specific techniques, which I’m keen to learn.

On my PC, this takes close to 7 seconds (measured with @btime).

Any tips would be appreciated!

In successive order of going deeper into the rabbit hole of performance:

For the first one: the Juno profiler has since been replaced by VS Code and @profview. See Profiler · Julia in VS Code.

Also, for sharing flamegraphs, I highly recommend StatProfilerHTML.jl (GitHub - tkluck/StatProfilerHTML.jl) as a nice tool to generate flamegraphs that share all of the information.

Take a stab at some of this, and when you’re ready, isolate the RHS into a function you can call directly and share some flame graphs to highlight the next steps in the optimization process.
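For example (a sketch; dx here is a hypothetical pre-allocated output buffer, and the remaining names are from the code above):

using BenchmarkTools

dx = similar(x_0)
coupled_fhn_eom!(dx, x_0, a, eps, σ, G, b)   # call once first so compilation is excluded
@btime coupled_fhn_eom!($dx, $x_0, $a, $eps, $σ, $G, $b)
@profview for _ in 1:10_000                  # flame graph of just the RHS
    coupled_fhn_eom!(dx, x_0, a, eps, σ, G, b)
end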


Thank you! I generated the flame graph. It’s the first time I’ve published a website, so hopefully this works. Here is the flame graph.

I noticed most of the time is spent expanding the macro, so by defining a function from the output of @macroexpand @einsum… the time dropped from 7 s to around 2.5 s. However, I believe there could be some further improvements, since I didn’t really touch the @macroexpand output (well, I did, but then it stopped working).

Here is the function that replaced the einsum line:


function einsum_expansion(coupling, coupling_matrix, coupling_terms)
    # the @macroexpand output, minus its source-location comments
    quote
        local var"##T#236" = eltype(coupling)
        if size(coupling_matrix, 2) == size(coupling_terms, 2)
            nothing
        else
            Base.throw(Base.AssertionError("size(coupling_matrix, 2) == size(coupling_terms, 2)"))
        end
        let i, j, r
            begin
                $(Expr(:inbounds, true))
                local var"#25#val" = for j = 1:min(size(coupling, 2), size(coupling_terms, 1))
                            for i = 1:min(size(coupling, 1), size(coupling_matrix, 1))
                                begin
                                    local var"##s#237" = zero(var"##T#236")
                                    for r = 1:size(coupling_matrix, 2)
                                        var"##s#237" += coupling_matrix[i, r] * coupling_terms[j, r]
                                    end
                                    coupling[i, j] = var"##s#237"
                                end
                            end
                        end
                $(Expr(:inbounds, :pop))
                var"#25#val"
            end
        end
        coupling
    end
    return coupling
end

With some small tweaks to the original coupled_fhn_eom! function, displayed here:

function coupled_fhn_eom!(dx, x, a, eps, coupling_strength, coupling_matrix, b)
    N = length(coupling_matrix[1, :])
    eachneuron = reshape(x, (2, N))
    coupling_terms = b * eachneuron
    coupling = zeros(Float64, (N, 2))
    einsum_expansion(coupling, coupling_matrix, coupling_terms)
    eom_params = SVector{2}(a, eps)
    for i in range(1, N)
        thisneuron = @view eachneuron[:, i]
        thiscoupling = @view coupling[i, :]
        dx_i::SVector{2,Float64} = fhn_eom(thisneuron, eom_params) .+ coupling_strength .* thiscoupling
        dx[2*i-1:2*i] = dx_i
    end
    nothing
end

I don’t believe I can host more than one GitHub page, so here is an image of the new flame graph:

I think I could gain further improvements if I took care of the yellow Array block. Opening the file it links to reveals that the following function is being called:

Array{T,1}(::UndefInitializer, m::Int) where {T} =
    ccall(:jl_alloc_array_1d, Array{T,1}, (Any, Int), Array{T,1}, m)

Is there any way I can speed this up? I still don’t understand type stability very well; would learning about it help?

It sucks to say, but when I tested it against the original code, I noticed the outputs were all wrong. Back to square one, I think. How could I make the macro expansion work? I have now tested with some code I actually understand, but it’s just as slow as the code I began with (7 s). Here is the new einsum function (which works):

function einsum_expansion!(coupling, coupling_matrix, coupling_terms, N)
    # coupling[i, j] = Σ_r coupling_matrix[i, r] * coupling_terms[j, r]
    for j = 1:2
        for i = 1:N
            local s = zero(Float64)
            for r = 1:N
                @inbounds s += coupling_matrix[i, r] * coupling_terms[j, r]
            end
            @inbounds coupling[i, j] = s
        end
    end
    nothing
end
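For reference, a quick sanity check that this kernel matches the matrix-product form (throwaway names and sizes, just for the check):

M = rand(4, 4); T = rand(2, 4); C = zeros(4, 2)
einsum_expansion!(C, M, T, 4)
@assert C ≈ M * transpose(T)  # C[i, j] == Σ_r M[i, r] * T[j, r]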

Macro expansion happens at compile time, so I think you might be profiling the wrong thing.

How so? I added this line for profiling:

@profview solve(prob);

Also tried

@profview for i in 1:100000 coupled_fhn_eom!(dx, x_0, a, eps, σ, G, b) end

Both gave me more or less the same information (most of the time is spent in the coupled_fhn_eom! function, on the @einsum line). Should I be doing something else?

Make sure solve is run once before doing the profile, otherwise your profile will show the compilation process.
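In code:

solve(prob);            # first call includes compilation
@profview solve(prob);  # now the profile shows only runtime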

The allocation cost is the line coupling = zeros(Float64, (N, 2)). Remove that allocation by pre-allocating it.

Thank you! I’m now caching and profiling as follows:

couplingcache = zeros(Float64, (N, 2))
prob = ODEProblem((dx, x, params, t) -> coupled_fhn_eom!(dx, x, params[1], params[2], params[3], G, b, couplingcache), x_0, (0.0, 1000.0), [a, eps, σ])
sol = solve(prob);

@profview solve(prob);
@btime solve(prob);

It reduced the memory allocations by about 10% and sped things up accordingly. However, the flame graph remains unchanged, so I’m still a bit unsure whether this is the correct way to do it.

Is everything in dx_i::SVector{2,Float64} = fhn_eom(thisneuron, eom_params) .+ coupling_strength .* thiscoupling a static vector? If so, then the ::SVector{2,Float64} is redundant. But if not, then it will allocate and then convert to a static vector, which isn’t what you want. Make sure that you never construct it.

The easiest way to fix this line is to just make a cache vector for dx_i and then change this to:

dx_i .= fhn_eom(thisneuron, eom_params) .+ coupling_strength .* thiscoupling
dx[2*i-1:2*i] .= dx_i

Once that allocation is handled, the only thing that shows up in the profile is the einsum. Did you try replacing einsum with Tullio.jl?
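That is, something like this as a drop-in for the @einsum line (an untested sketch; same index pattern as before):

using Tullio

@tullio coupling[i, j] := coupling_matrix[i, r] * coupling_terms[j, r]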


I gave this a try, and maybe (probably?) I did it poorly, because somehow my performance came out worse. What did help, though, was removing dx_i altogether and doing everything on the same line, like this:

dx[2*i-1:2*i] .= fhn_eom(thisneuron, eom_params) .+ coupling_strength .* thiscoupling

That one change improved the speed by 12% and cut the number and size of the allocations in half.


Alright cool, then all that’s left is the einsum.

Thank you! Both @tullio and mihalybaci’s modifications gave about a 10% improvement each! I tried some further options for Tullio (LoopVectorization and some of the others listed on the GitHub page), but the default behaviour seemed to perform best. I failed to mention that coupling_matrix is a symmetric matrix (though it would be nice if the code also admitted asymmetric couplings for future reference), so I also tried just using mul! with the transpose of the coupling matrix described originally, stored directly so that no transpose() call is needed (and later calling view on the other dimension). This was slower, even with a pre-allocated output matrix. Is it reasonable to conclude that there’s just a limit to how fast I can compute the coupling array?
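For reference, the mul! variant I tried looked roughly like this (a sketch of it; coupling2 stands for a hypothetical pre-allocated 2×N buffer):

using LinearAlgebra

# Since coupling_matrix is symmetric, coupling2[j, i] = Σ_r coupling_terms[j, r] * coupling_matrix[r, i]
# equals the transposed coupling array, so no transpose() call is needed;
# the inner loop then reads @view coupling2[:, i] instead of coupling[i, :].
mul!(coupling2, coupling_terms, coupling_matrix)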

What’s the newest stacktrace?

I just updated the web page (here). It’s a bit different from the last one, since it now profiles just the coupled_fhn_eom! function, with a 90-element network instead of 150 (just for speed when profiling). Thank you again for the help with this!

All of the time is Tullio, so focus on that.

Per the Tullio.jl README (GitHub - mcabbott/Tullio.jl), it looks like it can be made faster if you add LoopVectorization.jl as an extension. Also, you want to use the = form instead of :=, writing into a pre-allocated array so that the allocation goes away.
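Concretely (coupling_cache here stands for a pre-allocated N×2 array):

@tullio coupling[i, j] := coupling_matrix[i, r] * coupling_terms[j, r]       # := allocates a new array every call
@tullio coupling_cache[i, j] = coupling_matrix[i, r] * coupling_terms[j, r]  # = writes in place, no allocation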

Seems like adding LoopVectorization slows it down, though. The following is without LoopVectorization:

[screenshot of benchmark results]

And with LoopVectorization, I get the following:

[screenshot of benchmark results]

The red text, if I’m not mistaken, means that Julia had to do some type inference at runtime. How do I know where that is happening, and how do I fix it? I can narrow it down to the loop, but at least the container already has a defined element type.
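So far the only thing I’ve thought of trying (not sure it’s the right tool) is @code_warntype on the same call used for profiling, which prints the inferred types and marks unstable ones in red:

@code_warntype coupled_fhn_eom!(dx, x_0, a, eps, σ, G, b, couplingcache)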

How much did the preallocation do?

I added a line to the function that separates the creation of the array from the @tullio line, like so:

    coupling = zeros(Float64, N, 2)
    @tullio coupling[i, j] = coupling_matrix[i, r] * coupling_terms[j, r]

I know this just moves the allocation to another line, but profiling shows that only 2% of the time is spent creating the array, so for cleaner code in the rest of the project I think it may be worth not caching it. Anyway, here are the results of a proper cache implementation:

[screenshot of benchmark results]

It’s admittedly more than a 2% performance increase. I’m still not sure it’s worth restructuring the existing code.

Just pre-allocate that and enclose it (i.e., capture the cache in the closure you pass to ODEProblem).
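For example, mirroring the couplingcache pattern you already used:

coupling = zeros(N, 2)  # allocated once, captured by the closure below
rhs! = (dx, x, p, t) -> coupled_fhn_eom!(dx, x, p[1], p[2], p[3], G, b, coupling)
prob = ODEProblem(rhs!, x_0, (0.0, 1000.0), [a, eps, σ])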

Then everything left in the profile is Tullio, which seems like a lower bound.


Thank you for the help! It seems that with that, it’s as fast as it’ll go.