Parallelization for large ODEs, Multithreading fails

Hey everyone, I am new to Julia and I want to solve a large ODE system as fast as possible. I've tried multithreading with both Threads.@threads and EnsembleThreads(), but both reduce performance. I've also tried multiprocessing and a GPU implementation, but they didn't help either. It is possible that I am not using these tools properly, since I am new to Julia. Is it possible to use parallel methods when allocations are large?

Are there any other ways to run this code faster? Any other tips that could improve the performance of this code would be an invaluable help.

Thanks!

Here is the code:

using OrdinaryDiffEq, SparseArrays, Random, LinearAlgebra

mutable struct Param
    α::Float64
    γ::Float64
    β::Vector{Float64}
    λ::Float64
    G::SparseMatrixCSC
    coup_alloc::Vector{Float64}
end

function mainfun()
    N = 4941;
    nλ=100;

    A = rand(MersenneTwister(0),N,N);
    Aij = reshape([A[j] > 0.9993 ? 1.0 : 0.0 for j in 1:N*N],N,N)
    degree = sum(Aij,dims=2);
    Gmatrix = diagm(vec(degree))-Aij;
    sparseL = sparse(Gmatrix); 
    
    α = 0.1;
    γ = 18;
    bf_vec = 0.0995 .+ 0.01*rand(MersenneTwister(0),N);
    
    lambda = range(0,0.35,length = nλ);
    u0 = [-2.68 -24.05 0.0] .+ 0.01.*rand(MersenneTwister(0),N,3)
    coupling = similar(u0[:,2]);
    par = Param(α,γ,bf_vec,0.0,sparseL,coupling);

    function rossler!(du, u, p, t)
        x = @view u[:,1];
        y = @view u[:,2];
        z = @view u[:,3];
        β = @view p.β[:];
        coup = @view p.coup_alloc[:];
        mul!(coup,p.G,y);
        @. du[:,1] = - y - z;
        @. du[:,2] = x + p.α*y - p.λ*coup;
        @. du[:,3] = β - p.γ*z + z*x;
    end

    for j in 1:nλ
        par.λ = lambda[j];
        println(j, ".\tλ : ", par.λ)
        prob = ODEProblem(rossler!, u0, (0, 1500), par);
        solve(prob, BS3(), saveat = 1400:0.25:1500);
    end
end

Y = @time mainfun();

In general, you pay a penalty for kicking off a new thread, so parallelizing at the for-loop level makes you pay that penalty on every iteration. You might try something like

# rest of main body

# Do one loop per thread
n = Threads.nthreads()
Threads.@threads for i in 1:n
    # do every nth j all within a single thread
    for j in i:n:nλ
        # loop body
    end
end

I have actually tried this, but it ruins the performance.

I’m definitely out of my area of expertise here, so I can’t help you with optimizing much, but I do notice a couple of things:

  1. `par.λ = lambda[j]` is a mutating operation: par changes with each iteration. This means that if you try to run this in parallel, you’ll have multiple threads trying to manipulate the same object simultaneously. This is a data race problem. I don’t think parallelization at the top-level loop is an option here.
  2. It’s generally best to define all your functions at the top level. I’d recommend pulling rossler! out of main. I don’t think it will affect performance here, just a good practice.
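Point 1 can be sketched with a standalone toy example (the struct and names here are illustrative, not from the original code): each task works on its own `deepcopy` of the mutable parameter object, so no two tasks mutate the same λ or the same scratch buffer.

```julia
using Base.Threads

# Illustrative mutable parameter struct, standing in for Param.
mutable struct Toy
    λ::Float64
    buf::Vector{Float64}
end

shared = Toy(0.0, zeros(3))
results = Vector{Float64}(undef, 8)

Threads.@threads for j in 1:8
    p = deepcopy(shared)    # private copy, including the buffer
    p.λ = 0.5 * j           # safe: no other task sees this copy
    p.buf .= p.λ            # reuse the preallocated scratch space
    results[j] = sum(p.buf)
end
```

Applied to the original code, this would mean `deepcopy`-ing `par` (including its `coup_alloc` buffer) inside each loop iteration before setting `par.λ`, at the cost of one allocation per iteration.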

Looking at this optimization tutorial, performance FAQ, and the compile vs. runtime docs, I don’t see any mention of leveraging parallelization by the user. I think this is because the solver is already working in parallel under the hood, so it should be taking advantage of whatever parallelization is available on your hardware without you going out of your way.

Probably better to start with the approaches brought up in these resources. (Sorry I don’t know enough to know what does and does not apply to your specific use case :grimacing: )

What about a static scheduler? Threads.@threads :static for i in 1:n. Or using Polyester?
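For illustration, a minimal standalone sketch of the `:static` scheduler (with Polyester installed, swapping `Threads.@threads :static` for Polyester's `@batch` follows the same pattern with lower per-loop overhead):

```julia
using Base.Threads

# Sketch: the :static scheduler splits 1:100 into fixed contiguous
# chunks, one per thread, with no dynamic task migration.
out = zeros(100)
Threads.@threads :static for i in 1:100
    out[i] = i^2    # each iteration writes a distinct slot: no data race
end
```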

Some are, not this one.

Show a flame graph: where is the code actually spending its time?
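One minimal way to collect that data is the Profile stdlib (sketch with an illustrative stand-in workload; ProfileView.jl or `@profview` in VS Code render the same samples as a flame graph):

```julia
using Profile, LinearAlgebra

# Illustrative hot function: a dense matmul dominates the samples.
function work()
    A = rand(500, 500)
    s = 0.0
    for _ in 1:50
        s += sum(A * A)
    end
    return s
end

work()                     # warm up so compilation isn't profiled
Profile.clear()
@profile work()
Profile.print(maxdepth = 8)  # text tree of where time is spent
```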

Glad someone who knows what they’re doing is here now :slight_smile:

What about a static scheduler? Threads.@threads :static for i in 1:n. Or using Polyester?

Thanks Chris, I tried both the static scheduler and Polyester, and they didn’t help.

Show a flame graph: where is the code actually spending its time?

As for profiling, what the flame graph shows is that the solve call takes almost the whole time, and most of that is spent in the multiplication mul!(coup,p.G,y).
Here is what it looks like:

Thanks anyway, I’ll check them out.

Yeah, so anything else really doesn’t matter. How big is that matmul?

It is a multiplication of a 4941×4941 sparse matrix by a 4941-element vector.
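For context, that Laplacian is extremely sparse, so mul! only touches the stored entries; a quick density check, rebuilding the matrix as in the original post (with an equivalent sparse-friendly construction to avoid the intermediate dense diagonal matrix):

```julia
using SparseArrays, Random, LinearAlgebra

# Rebuild the graph Laplacian from the original post and measure
# how sparse it actually is.
N = 4941
A = rand(MersenneTwister(0), N, N)
Aij = map(a -> a > 0.9993 ? 1.0 : 0.0, A)
degree = vec(sum(Aij, dims = 2))
G = spdiagm(0 => degree) - sparse(Aij)   # same values as diagm(degree) - Aij

density = nnz(G) / N^2
println("nnz = ", nnz(G), ", density ≈ ", round(density, sigdigits = 2))
```

With roughly 0.07 % of the off-diagonal entries surviving the 0.9993 threshold, each mul! is cheap on its own; the total cost comes from calling it at every solver step.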

Did you try MKL?

No, I didn’t know about that. I will try it out.

MKL doesn’t work for sparse matrices, does it?

Forgot about that. Yeah it’s just SuiteSparse still. Basic optimizations may be at its limit then.

What’s the reason for using BS3?

BS3() was the fastest algorithm that worked for me.

What about methods like ROCK2?

I checked that; BS3 is faster.