Distributed Differential Equations with large matrices

Hello,

I’m trying to improve the performance of my differential equation system, whose right-hand side is

function drho_dt(rho, p, t)
    global L, L_t
    A, w_l = p
    return (L + A * sin(w_l * t) * L_t) * rho
end

with L and L_t two constant sparse matrices whose sizes range from 900x900 to 2000x2000 or more. Since they are constant, I previously declared them as global variables, which sped things up.
What I need is to study the different behaviors as the frequency w_l is varied.
I can already do this with the EnsembleThreads() method, which parallelizes all the ODEs across my 12 cores (with calculation times of about 1 min for 50 trajectories).
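Concretely, the threaded version looks like this (w_l_l is the list of frequencies I scan over):

# prob_func swaps in the i-th frequency for each trajectory
prob_func(prob, i, repeat) = remake(prob, p=[prob.p[1], w_l_l[i]])

tspan = (0.0, 15.0 / gam_c)
prob = ODEProblem(drho_dt, rho0_vec, tspan, [A, w_l])
ensemble_prob = EnsembleProblem(prob, prob_func=prob_func, safetycopy=false)
sim = solve(ensemble_prob, BS3(), EnsembleThreads(), trajectories=length(w_l_l))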

Nevertheless, I have access to two other servers with 12 cores each, for a total of 36 cores, so I turned to the EnsembleDistributed() method.
After writing the code required for the Distributed method, which can be summarized as

using Distributed

addprocs(11; restrict=false)
addprocs([("test@host1", :auto)], tunnel=true, exename = "/home/test/julia-1.6.1/bin/julia", dir = "/home/test/alberto/Rabi Raman Scattering")
addprocs([("ezio@host2", 12)], tunnel=true, exename = "/home/ezio/julia-1.6.1/bin/julia", dir = "/home/ezio/alberto/Rabi Raman Scattering")

@everywhere include("MyQuantumModule.jl")
@everywhere using .MyQuantumModule
@everywhere using LinearAlgebra
@everywhere using DifferentialEquations
@everywhere using SparseArrays

I immediately thought of declaring L and L_t on all the processes (@everywhere L = $L), and then simply writing

@everywhere function drho_dt(rho, p, t)
    global L, L_t
    A, w_l = p
    return (L + A * sin(w_l * t) * L_t) * rho
end

@everywhere function prob_func(prob,i,repeat)
  remake(prob,p=[prob.p[1], w_l_l[i]])
end

p = [A, w_l]
tspan = (0.0, 15.0 / gam_c)

prob = ODEProblem(drho_dt, rho0_vec, tspan, p)
ensemble_prob = EnsembleProblem(prob, prob_func=prob_func, safetycopy=false)
@time sim = solve(ensemble_prob, BS3(), EnsembleDistributed(), trajectories=length(w_l_l))

but L and L_t are very large (despite being sparse), and it takes a lot of time to send those variables to all the remote processes.

My second plan was to build these two matrices independently on each process:

@everywhere begin
    # ... some code to build L and L_t, i.e.
    L = ...   # some stuff
    L_t = ... # some stuff
end
# The code written above executes very quickly.

... and then the same code for the differential equation.

With this method I can see all the processors on all the servers working (via htop), but the solver is much slower than with the EnsembleThreads() method. I tried with 25x25 matrices: the Threads method took about 8 seconds, while the Distributed one took 300!

Am I doing something wrong?

Does BS3 even make sense here? That likely is slowing it down a lot. ROCK2? ROCK4? Tsit5? BS3 is a really specific case and you should make sure to check a few others.
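For example, a quick single-trajectory check along these lines (just a sketch, reusing the prob from your post) should show which one fits best:

using DifferentialEquations

# Time one trajectory with a few candidate solvers before running the full ensemble
for alg in (BS3(), Tsit5(), ROCK2(), ROCK4())
    @time solve(prob, alg, save_everystep = false)
end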

Adding processes disables BLAS threading. For a problem of this size, it’s probably best to let the cache-efficient methods be used, so just use one process per server and then BLAS.set_num_threads(12).
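Schematically, with the hostnames from your post, that would be:

using Distributed

# One worker per machine instead of one worker per core
addprocs(1)  # single local worker
addprocs([("test@host1", 1)], tunnel = true, exename = "/home/test/julia-1.6.1/bin/julia")
addprocs([("ezio@host2", 1)], tunnel = true, exename = "/home/ezio/julia-1.6.1/bin/julia")

# Let BLAS use all 12 cores of each machine inside every worker
@everywhere using LinearAlgebra
@everywhere BLAS.set_num_threads(12)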

Do you mean something like this?

@everywhere BLAS.set_num_threads(12)

A = sprand(100, 100, 0.01)
@everywhere A = $A

@everywhere function du_dt(u,p,t)
    global A
    return A * u
end

@everywhere function prob_func(prob,i,repeat)
  remake(prob,u0=rand()*prob.u0)
end

u0 = rand(100)
t_span = (0.0, 100.0)

# Linear ODE with a random initial condition, solved from t = 0.0 to t = 100.0
prob = ODEProblem(du_dt,u0,t_span)

ensemble_prob = EnsembleProblem(prob,prob_func=prob_func)
@time sim = solve(ensemble_prob,Tsit5(),EnsembleDistributed(),trajectories=1000)

It takes a lot more time than the EnsembleThreads() method. Looking at the Task Manager, I see that all the processors work for 2-3 s, and then for more than 5 minutes data is being downloaded at about 300 KB/s.

Did you set up the ClusterManager to use the MPI backend?

No, the summarized code (with an example matrix) is

using PyPlot
using DSP # Signal Processing
using ProgressMeter
using Distributed

addprocs(11; restrict=false)
addprocs([("test@host1", :auto)], tunnel=true, exename = "/home/test/julia-1.6.1/bin/julia", dir = "/home/test/alberto/Rabi Raman Scattering")
addprocs([("ezio@host2", 12)], tunnel=true, exename = "/home/ezio/julia-1.6.1/bin/julia", dir = "/home/ezio/alberto/Rabi Raman Scattering")

@everywhere include("MyQuantumModule.jl")
@everywhere using .MyQuantumModule
@everywhere using LinearAlgebra
@everywhere using DifferentialEquations
@everywhere using SparseArrays

@everywhere BLAS.set_num_threads(12) # if needed

A = sprand(100, 100, 0.01)
@everywhere A = $A

@everywhere function du_dt(u,p,t)
    global A
    return A * u
end

@everywhere function prob_func(prob,i,repeat)
  remake(prob,u0=rand()*prob.u0)
end

u0 = rand(100)
t_span = (0.0, 100.0)

# Linear ODE with a random initial condition, solved from t = 0.0 to t = 100.0
prob = ODEProblem(du_dt,u0,t_span)

ensemble_prob = EnsembleProblem(prob,prob_func=prob_func)
@time sim = solve(ensemble_prob,Tsit5(),EnsembleDistributed(),trajectories=1000)

Yeah if you need faster transfer times, making sure it uses the interconnects would be very helpful.

Another optimization might be to make this function and the matrix multiplications work in-place. Something along the lines of this:

const drho_cache = similar(rho0_vec)
function drho_dt(drho, rho, p, t)
    global L, L_t
    A, w_l = p
    mul!(drho, L, rho) # write L*rho into drho
    mul!(drho_cache, L_t, rho) # write L_t*rho into drho_cache
    drho .+= (A * sin(w_l * t)) .* drho_cache # add A*sin(w_l*t)*drho_cache to drho in place
    return drho
end

(Hopefully I didn’t mess up the order of matrices here).
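Note that with the 4-argument signature the function is detected as in-place automatically, so the problem is constructed exactly as before:

prob = ODEProblem(drho_dt, rho0_vec, tspan, p)  # in-place form detected from the 4-argument drho_dt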

Also, I’m just taking a guess here, but since you’re doing quantum stuff: is L a superoperator of some kind (e.g. a Lindblad superoperator)? These are usually defined by their action on the density operator. You can do the same here by just specifying the functional application of L on rho, or you can use LinearMaps.jl to define L by the function with which it acts on rho. This would save you some memory, as you never actually have to store L as a matrix.
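For instance (apply_L! here stands for a hypothetical in-place function that computes the action of your superoperator on the vectorized rho; it is not something from your code):

using LinearAlgebra, LinearMaps

# apply_L!(out, rho_vec) is assumed to write L*rho_vec into out without ever
# materializing L as a matrix (hypothetical placeholder for your model)
N = length(rho0_vec)
L_map = LinearMap{ComplexF64}(apply_L!, N; ismutating = true)

drho = L_map * rho0_vec        # allocating matrix-vector product
mul!(drho, L_map, rho0_vec)    # in-place version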

For the differential equation solvers, you’d need to specify it as an AbstractDiffEqOperator, and then ETD2 would work. LinearMaps.jl does not have all of the right actions.

https://diffeq.sciml.ai/stable/features/diffeq_operator/

Yes, it is for quantum research purposes. I tried the DiffEqOperator method, but I have not seen any improvement (and the result is always zero!). I don’t know if the code is correct:

function update_func(A,u,p,t)
    A_l, w_l = p
    A =  L + A_l * sin(w_l * t) * L_t
end

A = DiffEqArrayOperator(L, update_func=update_func)
prob = ODEProblem(A, rho0_vec, tspan, p)
@time sol = solve(prob, MagnusGL6(), dt = (tspan[2] - tspan[1]) / 1000)

However, I guessed that the problem was the huge amount of data transfer, since each ODE returns an array of all the states at all times. Since the states are big arrays, the amount of data to transfer is too large. My question is why DifferentialEquations.jl implements Distributed computing if it can only be used for small systems: if I need all the time-dependent states, I cannot use Distributed computing.
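Presumably the transfer could be reduced by returning only the final state from each trajectory, e.g. with an output_func and save_everystep=false (I have not explored this thoroughly):

# Sketch: send only the final state of each trajectory back to the master process
ensemble_prob = EnsembleProblem(prob,
    prob_func = prob_func,
    output_func = (sol, i) -> (sol.u[end], false))  # (result, rerun)

sim = solve(ensemble_prob, BS3(), EnsembleDistributed(),
            trajectories = length(w_l_l), save_everystep = false)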

In the end, since I need only the last state, my workaround was to declare all the variables and functions @everywhere. Then I declared

@everywhere function f(some arguments)
    return only a Float64
end

and that I can parallelize with pmap. Very easy!
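Schematically (simplified: here f takes only the frequency, while my real f has a few more arguments; prob, A, etc. are assumed to be defined @everywhere as described above):

@everywhere function f(w_l)
    prob_i = remake(prob, p = [A, w_l])
    sol = solve(prob_i, BS3(), save_everystep = false)
    return abs(sol.u[end][1])   # some Float64 extracted from the final state
end

results = pmap(f, w_l_l)   # one scalar per frequency, computed across all workers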

You’re creating a new A in the scope of the function rather than updating the argument. That’s probably why everything is zero. Use broadcasting: A .= L + A_l * sin(w_l * t) * L_t (mind the .).


Thank you for your help. It now returns non-zero values. The dynamics are similar to the original, but not identical. Moreover, it takes from 50 s to 300 s instead of 4 s (depending on the dt parameter). The main code is

function update_func(L,u,p,t)
    A, w_l = p
    L .= L_0 + A * sin(w_l * t) * L_t
end

L = DiffEqArrayOperator(L_0, update_func=update_func)

prob = ODEProblem(L, rho0_vec, tspan, p)
@time sol = solve(prob, MagnusGL4(), dt = (tspan[2] - tspan[1]) / 1000)

where, as mentioned, L_0 and L_t are large sparse matrices.

I don’t know whether this method can significantly increase the performance. With the distributed approach I described above, I can speed up my problem using my laptop and the two other host servers.

Of course, it would be perfect if the time could be decreased even further!

Make that update function non-allocating.

Isn’t it already a non-allocating function? I’m updating its components at every call.

If not, how do I do that? With two for loops, L[i,j] = L_0[i,j] + ...?

I guess not.
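The right-hand side L_0 + A * sin(w_l * t) * L_t builds a temporary matrix on every call before .= copies it into L. A sketch of a version that avoids that, assuming L, L_0 and L_t all share exactly the same sparsity pattern (otherwise the stored entries don't line up):

using SparseArrays

function update_func(L, u, p, t)
    A, w_l = p
    # Overwrite only the stored values of L; valid only when L, L_0 and L_t
    # have identical sparsity patterns.
    nonzeros(L) .= nonzeros(L_0) .+ (A * sin(w_l * t)) .* nonzeros(L_t)
    return L
end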