Performance tips for differential equation RHS

Hello, I’ve been working on a project which relies on being able to swiftly solve something like the following system:

using DifferentialEquations
using StaticArrays
using Einsum
using Random
using Graphs
using LinearAlgebra # needed for Diagonal below

function wattsstrogatzmatrix(size, neighbors, rewiring_prob)
    g = watts_strogatz(size, 2*neighbors, rewiring_prob)
    coupling_matrix = adjacency_matrix(g)
    coupling_matrix = Matrix(coupling_matrix)
    coupling_matrix = coupling_matrix - Diagonal(vec(sum(coupling_matrix, dims=2))) # Ensure zero row sum
    return -coupling_matrix
end

function fhn_eom(x, params)
    a = params[1]
    eps = params[2]
    dx = (x[1] - (x[1]^3)/3 - x[2])/eps
    dy = x[1] + a
    return SVector{2}(dx, dy)
end

function bmatrix(phi, eps)
    return -[cos(phi)/eps sin(phi)/eps; -sin(phi) cos(phi)]
    #return SArray{Tuple{2,2}}(-cos(phi)/eps, sin(phi), -sin(phi)/eps, -cos(phi)) # column-major, equivalent static version
end

function coupled_fhn_eom!(dx, x, a, eps, coupling_strength, coupling_matrix, b)
    N = length(coupling_matrix[1, :])
    eachneuron = reshape(x, (2, N))
    coupling_terms = b * eachneuron
    @einsum coupling[i, j] := coupling_matrix[i, r] * coupling_terms[j, r]
    eom_params = SVector{2}(a, eps)
    for i in range(1, N)
        thisneuron = @view eachneuron[:, i]
        thiscoupling = @view coupling[i, :]
        dx_i = fhn_eom(thisneuron, eom_params) .+ coupling_strength .* thiscoupling
        # dx_i = fhn_eom(eachneuron[:, i], eom_params) .+ coupling_strength .* coupling[i, :]
        dx[2*i-1:2*i] = dx_i
    end
    nothing
end

N = 175
eps = 0.05
a = 0.5
b = bmatrix(pi/2-0.1, eps)
σ = 0.0506
G = wattsstrogatzmatrix(N, 3, 1);#0.232)

x_0 = zeros(2*N)
x_0[2 .* (1:N) .- 1] = rand(N) .* 2 .* a .- a
x_0[2 .* (1:N)] = rand(N) .* 2 .* (-a + a^3 / 3) .- (-a + a^3 / 3)

prob = ODEProblem((dx, x, params, t) -> coupled_fhn_eom!(dx, x, params[1], params[2], params[3], G, b), x_0, (0.0, 1000.0), [a, eps, σ])
sol = solve(prob);

I’m fairly new to Julia and have tried to optimize this as much as I could with tips I found online. I’ve had some success; the first versions were much slower than this. However, I’m having a hard time figuring out how to optimize it further with more Julia-specific techniques, which I’m keen to learn.

On my PC, this takes close to 7 seconds (measured with @btime).

Any tips would be appreciated!

In successive order of going deeper into the rabbit hole of performance:

For the first one: the Juno profiler has since been replaced by VS Code and @profview. See Profiler · Julia in VS Code.

Also, for sharing flamegraphs, I highly recommend StatProfilerHTML.jl (GitHub - tkluck/StatProfilerHTML.jl) as a nice tool to generate flamegraphs that share all of the information.

Take a stab at some of this, and when you’re ready, isolate the RHS into a function you can call directly and share some flame graphs to highlight the next steps in the optimization process.
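For example (a sketch; dx here is a hypothetical pre-allocated output buffer, and the remaining names are from the code above):

using BenchmarkTools

dx = similar(x_0)
coupled_fhn_eom!(dx, x_0, a, eps, σ, G, b)   # call once first so compilation is excluded
@btime coupled_fhn_eom!($dx, $x_0, $a, $eps, $σ, $G, $b)
@profview for _ in 1:10_000                  # flame graph of just the RHS
    coupled_fhn_eom!(dx, x_0, a, eps, σ, G, b)
end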


Thank you! I generated the flame graph. It’s the first time I’ve published a website, so hopefully this works. Here is the flame graph.

I noticed most of the time is spent expanding the macro, so by defining a function from the output of @macroexpand @einsum… the time dropped from 7 s to around 2.5 s. However, I believe there could be some further improvements, since I didn’t really touch the @macroexpand output (well, I did, but then it stopped working).

Here is the function that replaced the einsum line:


function einsum_expansion(coupling, coupling_matrix, coupling_terms)
    # the @macroexpand output, minus its source-location comments
    quote
        local var"##T#236" = eltype(coupling)
        if size(coupling_matrix, 2) == size(coupling_terms, 2)
            nothing
        else
            Base.throw(Base.AssertionError("size(coupling_matrix, 2) == size(coupling_terms, 2)"))
        end
        let i, j, r
            begin
                $(Expr(:inbounds, true))
                local var"#25#val" = for j = 1:min(size(coupling, 2), size(coupling_terms, 1))
                            for i = 1:min(size(coupling, 1), size(coupling_matrix, 1))
                                begin
                                    local var"##s#237" = zero(var"##T#236")
                                    for r = 1:size(coupling_matrix, 2)
                                        var"##s#237" += coupling_matrix[i, r] * coupling_terms[j, r]
                                    end
                                    coupling[i, j] = var"##s#237"
                                end
                            end
                        end
                $(Expr(:inbounds, :pop))
                var"#25#val"
            end
        end
        coupling
    end
    return coupling
end

With some small tweaks to the original coupled_fhn_eom! function, displayed here:

function coupled_fhn_eom!(dx, x, a, eps, coupling_strength, coupling_matrix, b)
    N = length(coupling_matrix[1, :])
    eachneuron = reshape(x, (2, N))
    coupling_terms = b * eachneuron
    coupling = zeros(Float64, (N, 2))
    einsum_expansion(coupling, coupling_matrix, coupling_terms)
    eom_params = SVector{2}(a, eps)
    for i in range(1, N)
        thisneuron = @view eachneuron[:, i]
        thiscoupling = @view coupling[i, :]
        dx_i::SVector{2,Float64} = fhn_eom(thisneuron, eom_params) .+ coupling_strength .* thiscoupling
        dx[2*i-1:2*i] = dx_i
    end
    nothing
end

I don’t believe I can host more than one GitHub page, so here is an image of the new flame graph:

I think I could gain further improvements if I took care of the yellow Array block. Opening the file it links to reveals that the following function is being called:

Array{T,1}(::UndefInitializer, m::Int) where {T} =
    ccall(:jl_alloc_array_1d, Array{T,1}, (Any, Int), Array{T,1}, m)

Is there any way I can speed this up? I still don’t understand type stability very well; would learning about it help?

It sucks to say, but when I tested it against the original code, I noticed the outputs were all wrong. Back to square one, I think. How could I make the macro expansion work? I have now tested with some code I actually understand, but it’s just as slow as the code I began with (7 s). Here is the new einsum function (which works):

function einsum_expansion!(coupling, coupling_matrix, coupling_terms, N)
    # coupling[i, j] = Σ_r coupling_matrix[i, r] * coupling_terms[j, r]
    for j = 1:2
        for i = 1:N
            local s = zero(Float64)
            for r = 1:N
                @inbounds s += coupling_matrix[i, r] * coupling_terms[j, r]
            end
            @inbounds coupling[i, j] = s
        end
    end
    nothing
end
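For reference, a quick sanity check that this kernel matches the matrix-product form (throwaway names and sizes, just for the check):

M = rand(4, 4); T = rand(2, 4); C = zeros(4, 2)
einsum_expansion!(C, M, T, 4)
@assert C ≈ M * transpose(T)  # C[i, j] == Σ_r M[i, r] * T[j, r]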

Macro expansion happens at compile time, so I think you might be profiling the wrong thing.

How so? I added this line for profiling:

@profview solve(prob);

Also tried

@profview for i in 1:100000 coupled_fhn_eom!(dx, x_0, a, eps, σ, G, b) end

Both gave me more or less the same information (most of the time is spent in the coupled_fhn_eom! function, on the @einsum line). Should I be doing something else?

Make sure solve is run once before doing the profile, otherwise your profile will show the compilation process.
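In code:

solve(prob);            # first call includes compilation
@profview solve(prob);  # now the profile shows only runtime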

The allocation cost is the line coupling = zeros(Float64, (N, 2)). Remove that allocation by pre-allocating it.

Thank you! I’m now caching and profiling as follows:

couplingcache = zeros(Float64, (N, 2))
prob = ODEProblem((dx, x, params, t) -> coupled_fhn_eom!(dx, x, params[1], params[2], params[3], G, b, couplingcache), x_0, (0.0, 1000.0), [a, eps, σ])
sol = solve(prob);

@profview solve(prob);
@btime solve(prob);

It reduced the memory allocations by about 10% and sped things up accordingly. However, the flame graph remains unchanged, so I’m still a bit unsure whether this is the correct way to do it.

Is everything in dx_i::SVector{2,Float64} = fhn_eom(thisneuron, eom_params) .+ coupling_strength .* thiscoupling a static vector? If so, then the ::SVector{2,Float64} is redundant. But if not, then it will allocate and then convert to a static vector, which isn’t what you want. Make sure that you never construct it.

The easiest way to fix this line is to just make a cache vector for dx_i and then change this to:

dx_i .= fhn_eom(thisneuron, eom_params) .+ coupling_strength .* thiscoupling
dx[2*i-1:2*i] .= dx_i

Once that allocation is handled, the only thing that shows up in the profile is the einsum. Did you try replacing einsum with Tullio.jl?
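That is, something like this as a drop-in for the @einsum line (an untested sketch; same index pattern as before):

using Tullio

@tullio coupling[i, j] := coupling_matrix[i, r] * coupling_terms[j, r]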


I gave this a try, and maybe (probably?) I did it poorly, because somehow my performance came out worse. What did help, though, was removing dx_i altogether and doing everything on the same line, like this:

dx[2*i-1:2*i] .= fhn_eom(thisneuron, eom_params) .+ coupling_strength .* thiscoupling

That one change improved the speed by 12% and cut the number and size of the allocations in half.


Alright cool, then all that’s left is the einsum.

Thank you! Both @tullio and mihalybaci’s modifications gave about a 10% improvement each! I tried some further options for Tullio (LoopVectorization and some of the others listed on the GitHub page), but the default behaviour seemed to perform best. I failed to mention that coupling_matrix is a symmetric matrix (though it would be nice if the code also admitted asymmetric couplings for future reference), so I also tried just using mul! with the transpose of the coupling matrix described originally, stored directly so that no transpose() call is needed (and later calling view on the other dimension). This was slower, even with a pre-allocated output matrix. Is it reasonable to conclude that there’s just a limit to how fast I can compute the coupling array?
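For reference, the mul! variant I tried looked roughly like this (a sketch of it; coupling2 stands for a hypothetical pre-allocated 2×N buffer):

using LinearAlgebra

# Since coupling_matrix is symmetric, coupling2[j, i] = Σ_r coupling_terms[j, r] * coupling_matrix[r, i]
# equals the transposed coupling array, so no transpose() call is needed;
# the inner loop then reads @view coupling2[:, i] instead of coupling[i, :].
mul!(coupling2, coupling_terms, coupling_matrix)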

What’s the newest stacktrace?

I just updated the web page (here). It’s a bit different from the last one, since it now profiles just the coupled_fhn_eom! function, with a 90-element network instead of 150 (just for speed when profiling). Thank you again for the help with this!

All of the time is Tullio, so focus on that.

Per the Tullio.jl README (GitHub - mcabbott/Tullio.jl), it looks like it can be made faster if you add LoopVectorization.jl as an extension. Also, you want to use the = form instead of :=, writing into a pre-allocated array so that the allocation goes away.
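Concretely (coupling_cache here stands for a pre-allocated N×2 array):

@tullio coupling[i, j] := coupling_matrix[i, r] * coupling_terms[j, r]       # := allocates a new array every call
@tullio coupling_cache[i, j] = coupling_matrix[i, r] * coupling_terms[j, r]  # = writes in place, no allocation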

Seems like adding LoopVectorization slows it down, though. The following is without LoopVectorization:

[screenshot of benchmark results]

And with LoopVectorization, I get the following:

[screenshot of benchmark results]

The red text, if I’m not mistaken, means that Julia had to do some type inference at runtime. How do I know where that is happening, and how do I fix it? I can narrow it down to the loop, but at least the container already has a defined element type.
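So far the only thing I’ve thought of trying (not sure it’s the right tool) is @code_warntype on the same call used for profiling, which prints the inferred types and marks unstable ones in red:

@code_warntype coupled_fhn_eom!(dx, x_0, a, eps, σ, G, b, couplingcache)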

How much did the preallocation do?

I added a line to the function that separates the creation of the array from the @tullio line, like so:

    coupling = zeros(Float64, N, 2)
    @tullio coupling[i, j] = coupling_matrix[i, r] * coupling_terms[j, r]

I know this just moves the allocation to another line, but profiling shows that only 2% of the time is spent creating the array, so for cleaner code in the rest of the project I think it may be worth not caching it. Anyway, here are the results of a proper cache implementation:

[screenshot of benchmark results]

It’s admittedly more than a 2% performance increase. I’m still not sure it’s worth restructuring the existing code.

Just pre-allocate that and enclose it (i.e., capture the cache in the closure you pass to ODEProblem).
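For example, mirroring the couplingcache pattern you already used:

coupling = zeros(N, 2)  # allocated once, captured by the closure below
rhs! = (dx, x, p, t) -> coupled_fhn_eom!(dx, x, p[1], p[2], p[3], G, b, coupling)
prob = ODEProblem(rhs!, x_0, (0.0, 1000.0), [a, eps, σ])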

Then everything left in the profile is Tullio, which seems like a lower bound.


Thank you for the help! It seems that with that, it’s as fast as it’ll go.