Simplest way to convert a program for parallel (multithreaded) runs on multiple servers/cores?

You just turn on sparse=true in the generation call.
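
In case it helps to see it spelled out, here is a minimal sketch, assuming the "generation call" means rebuilding the problem from a modelingtoolkitize'd system (prob and prob_sparse are placeholder names, and details vary a bit with your ModelingToolkit version):

    using ModelingToolkit, DifferentialEquations

    # Symbolically trace the existing ODEProblem...
    sys = modelingtoolkitize(prob)

    # ...then regenerate it with an analytical Jacobian stored as a sparse matrix.
    prob_sparse = ODEProblem(sys, [], prob.tspan, jac = true, sparse = true)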

With CVODE_BDF finally working and used instead of Rodas5, the runtime of my main program for Npix = 1601 was reduced from 8 hours to 2 hours! Wow!

One last question, if I may… since I am running on a system with multiple cores, is there a straightforward way to make the DifferentialEquations.solve call (from my sample program above) run in a distributed way across the multiple cores, to save time and memory? (BTW, I run the program from the IDE, not from the command line.)

(I tried reading through “Multi-processing and Distributed Computing · The Julia Language”, but I’m afraid that it did not illuminate me.)

Thanks again…!

Concerning your code, you can accelerate that inner loop as follows (start Julia with julia -t N, where N is the number of threads you want to use):

    # Update the interior points in parallel across threads
    Threads.@threads for i in 2:(Npts-1)
        du[i,2] = u[i,1]
        du[i,1] = (-k*u[i,2]) + ((u[(i+1),2]-(2*u[i,2])+u[(i-1),2])/(DX^2))
    end

If that makes a difference for overall performance, you might want to rewrite the loop to take better advantage of multi-threading, but I would test this first.

You can also, on a single core, use:

using LoopVectorization

and change that loop to:

    # SIMD-vectorize the loop body (newer LoopVectorization releases call this macro @turbo)
    @avx for i in 2:(Npts-1)
        du[i,2] = u[i,1]
        du[i,1] = (-k*u[i,2]) + ((u[(i+1),2]-(2*u[i,2])+u[(i-1),2])/(DX^2))
    end

That provides some additional speedup.

But those things will only make a difference if the computation of that loop matters for the overall performance, which I am not sure about (I could not run your example).

The bottleneck is very likely the computation of the Jacobian. Also, if I recall correctly, you can pass linear-solver options like dense or banded to CVODE_BDF.
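
For concreteness, a hedged sketch of what those Sundials options look like (the bandwidths below are illustrative only; the right values depend on how your unknowns are ordered):

    using Sundials

    # Dense direct linear solver (this is what plain CVODE_BDF() already uses):
    alg_dense = CVODE_BDF(linear_solver = :Dense)

    # Banded Jacobian: cheaper factorizations when the coupling is local,
    # but jac_upper/jac_lower must match your actual stencil and ordering.
    alg_band = CVODE_BDF(linear_solver = :Band, jac_upper = 1, jac_lower = 1)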

The simplest thing to try is CVODE_BDF(linear_solver=:GMRES), which would be matrix-free. Without a preconditioner you might get a speedup or you might not, but it’s at least easy to check.
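
In the context of the sample program above, that would look something like this (prob being the ODEProblem already constructed there):

    using Sundials, DifferentialEquations

    # Matrix-free: the Jacobian is never formed or factorized; CVODE only
    # needs Jacobian-vector products, which it approximates internally.
    sol = DifferentialEquations.solve(prob, CVODE_BDF(linear_solver = :GMRES))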

Hi again folks, I have implemented some of these changes, and the results are of the “impossibly too good to be true” variety.

For my real (very large) program, solving the coupled ODE system with Rodas5() originally took over 3 days; the whole program, with everything included, took over 82 hours!

Using CVODE_BDF() the program took only 2.5 hours (148 minutes).

Then, using CVODE_BDF(linear_solver=:GMRES), it took less than 1 hour (51 minutes), and most of that time wasn’t even spent on ODE solving!

I was enormously skeptical about this impossible amount of speedup, so I compared the ODE solutions… and the biggest difference in any value between the different solvers was less than 3E-10.
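
(In case anyone wants to run the same check, here is one possible way, assuming sol_bdf and sol_gmres are the two solution objects and tspan is the shared integration span:)

    # Evaluate both solutions on a common time grid via their interpolants
    # and take the worst-case absolute difference over all components.
    ts = range(tspan[1], tspan[2], length = 1000)
    maxdiff = maximum(maximum(abs.(a .- b)) for (a, b) in zip(sol_bdf(ts).u, sol_gmres(ts).u))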

The solver doesn’t even seem to scale with the number of pixels anymore! Running my simpler program shown above with CVODE_BDF(linear_solver=:GMRES) (noting that it must include the corrected line "maxiters = Int(1e7)" to run successfully), going from Npix = 801 to 1601 to 3201 only increased the runtime of the DifferentialEquations.solve call from 4.87 to 5.87 to 6.52 seconds!

I’m not sure how any of this is possible, but I’m glad it is! Thanks for all of your suggestions… I’m still implementing more of them as necessary…
