Parallelization for large ODEs, Multithreading fails

Hey everyone, I am new to Julia and I want to solve a large ODE system as fast as possible. I've tried multithreading with both Threads.@threads and EnsembleThreads(), but both reduce performance. I've also tried multiprocessing and a GPU implementation, but they didn't help either. It is possible that I am not using these tools properly, since I am new to Julia. Is it possible to use parallel methods when allocations are large?

Are there any other ways to run this code faster? Any other tips that could improve the performance of this code would be an invaluable help.

Thanks!

Here is the code:

using OrdinaryDiffEq, SparseArrays, Random, LinearAlgebra

mutable struct Param
    α::Float64
    γ::Float64
    β::Vector{Float64}
    λ::Float64
    G::SparseMatrixCSC
    coup_alloc::Vector{Float64}
end

function mainfun()
    N = 4941;
    nλ=100;

    A = rand(MersenneTwister(0),N,N);
    Aij = reshape([A[j] > 0.9993 ? 1.0 : 0.0 for j in 1:N*N],N,N)
    degree = sum(Aij,dims=2);
    Gmatrix = diagm(vec(degree))-Aij;
    sparseL = sparse(Gmatrix); 
    
    α = 0.1;
    γ = 18;
    bf_vec = 0.0995 .+ 0.01*rand(MersenneTwister(0),N);
    
    lambda = range(0,0.35,length = nλ);
    u0 = [-2.68 -24.05 0.0] .+ 0.01.*rand(MersenneTwister(0),N,3)
    coupling = similar(u0[:,2]);
    par = Param(α,γ,bf_vec,0.0,sparseL,coupling);

    function rossler!(du, u, p, t)
        x = @view u[:,1];
        y = @view u[:,2];
        z = @view u[:,3];
        β = @view p.β[:];
        coup = @view p.coup_alloc[:];
        mul!(coup,p.G,y);
        @. du[:,1] = - y - z;
        @. du[:,2] = x + p.α*y - p.λ*coup;
        @. du[:,3] = β - p.γ*z + z*x;
    end

    for j in 1:nλ
        par.λ = lambda[j];
        println(j, ".\tλ : ", par.λ)
        prob = ODEProblem(rossler!, u0, (0, 1500), par);
        solve(prob, BS3(), saveat = 1400:0.25:1500);
    end
end

Y = @time mainfun();

In general, you pay a penalty for kicking off a new thread, so parallelizing at the for-loop level makes you pay that penalty on every iteration. You might try something like

# rest of main body

# Do one loop per thread
n = Threads.nthreads()
Threads.@threads for i in 1:n
    # do every nth j all within a single thread
    for j in i:n:nλ
        # loop body
    end
end

I have actually tried this, but it ruins the performance.

I’m definitely out of my area of expertise here, so I can’t help you with optimizing much, but I do notice a couple of things:

  1. `par.λ = lambda[j]` is a mutating operation: par changes with each iteration. This means that if you try to run this in parallel, you’ll have multiple threads trying to manipulate the same object simultaneously. This is a data race problem. I don’t think parallelization at the top-level loop is an option here.
  2. It’s generally best to define all your functions at the top level. I’d recommend pulling rossler! out of main. I don’t think it will affect performance here, just a good practice.
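Point 1 can be sketched with a standalone toy example (the struct and names here are illustrative, not from the original code): each task works on its own `deepcopy` of the mutable parameter object, so no two tasks mutate the same λ or the same scratch buffer.

```julia
using Base.Threads

# Illustrative mutable parameter struct, standing in for Param.
mutable struct Toy
    λ::Float64
    buf::Vector{Float64}
end

shared = Toy(0.0, zeros(3))
results = Vector{Float64}(undef, 8)

Threads.@threads for j in 1:8
    p = deepcopy(shared)    # private copy, including the buffer
    p.λ = 0.5 * j           # safe: no other task sees this copy
    p.buf .= p.λ            # reuse the preallocated scratch space
    results[j] = sum(p.buf)
end
```

Applied to the original code, this would mean `deepcopy`-ing `par` (including its `coup_alloc` buffer) inside each loop iteration before setting `par.λ`, at the cost of one allocation per iteration.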

Looking at this optimization tutorial, performance FAQ, and the compile vs. runtime docs, I don’t see any mention of leveraging parallelization by the user. I think this is because the solver is already working in parallel under the hood, so it should be taking advantage of whatever parallelization is available on your hardware without you going out of your way.

Probably better to start with the approaches brought up in these resources. (Sorry I don’t know enough to know what does and does not apply to your specific use case :grimacing: )

What about a static scheduler? Threads.@threads :static for i in 1:n. Or using Polyester?
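For illustration, a minimal standalone sketch of the `:static` scheduler (with Polyester installed, swapping `Threads.@threads :static` for Polyester's `@batch` follows the same pattern with lower per-loop overhead):

```julia
using Base.Threads

# Sketch: the :static scheduler splits 1:100 into fixed contiguous
# chunks, one per thread, with no dynamic task migration.
out = zeros(100)
Threads.@threads :static for i in 1:100
    out[i] = i^2    # each iteration writes a distinct slot: no data race
end
```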

Some are, not this one.

Show a flame graph: where is the code actually spending its time?
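One minimal way to collect that data is the Profile stdlib (sketch with an illustrative stand-in workload; ProfileView.jl or `@profview` in VS Code render the same samples as a flame graph):

```julia
using Profile, LinearAlgebra

# Illustrative hot function: a dense matmul dominates the samples.
function work()
    A = rand(500, 500)
    s = 0.0
    for _ in 1:50
        s += sum(A * A)
    end
    return s
end

work()                     # warm up so compilation isn't profiled
Profile.clear()
@profile work()
Profile.print(maxdepth = 8)  # text tree of where time is spent
```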

Glad someone who knows what they’re doing is here now :slight_smile:

What about a static scheduler? Threads.@threads :static for i in 1:n. Or using Polyester?

Thanks Chris, I tried both the static scheduler and Polyester, and they didn’t help.

Show a flame graph: where is the code actually spending its time?

As for profiling, what the flame graph shows is that the solve call takes almost the whole time, and most of that is spent in the multiplication mul!(coup,p.G,y).
Here is what it looks like:

Thanks anyway, I’ll check them out.

Yeah, so anything else really doesn’t matter. How big is that matmul?

It is a multiplication of a 4941×4941 sparse matrix by a 4941-element vector.
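For context, that Laplacian is extremely sparse, so mul! only touches the stored entries; a quick density check, rebuilding the matrix as in the original post (with an equivalent sparse-friendly construction to avoid the intermediate dense diagonal matrix):

```julia
using SparseArrays, Random, LinearAlgebra

# Rebuild the graph Laplacian from the original post and measure
# how sparse it actually is.
N = 4941
A = rand(MersenneTwister(0), N, N)
Aij = map(a -> a > 0.9993 ? 1.0 : 0.0, A)
degree = vec(sum(Aij, dims = 2))
G = spdiagm(0 => degree) - sparse(Aij)   # same values as diagm(degree) - Aij

density = nnz(G) / N^2
println("nnz = ", nnz(G), ", density ≈ ", round(density, sigdigits = 2))
```

With roughly 0.07 % of the off-diagonal entries surviving the 0.9993 threshold, each mul! is cheap on its own; the total cost comes from calling it at every solver step.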

Did you try MKL?

No, I didn’t know about that. I will try it out.

MKL doesn’t work for sparse matrices, does it?

Forgot about that. Yeah it’s just SuiteSparse still. Basic optimizations may be at its limit then.

What’s the reason for using BS3?

BS3() was the fastest algorithm that worked for me.

What about methods like ROCK2?

I checked that; BS3 is faster.