Simplest way to convert a program for parallel (multithreaded) runs on multiple servers/cores?

ChrisRackauckas · February 18, 2021, 4:06pm

You just set sparse=true in the generation call.

CosmoProf · February 18, 2021, 9:40pm

With CVODE_BDF finally working, using that instead of Rodas5, the runtime of my main program for Npix = 1601 was reduced from 8 hours to 2 hours! Wow!

CosmoProf · February 19, 2021, 8:48am

One last question, if I may… since I am running on a system with multiple cores, is there a straightforward way to make the command DifferentialEquations.solve (from my sample program above) run in a distributed way across those cores, to save time and memory? (BTW, I run the program from the IDE, not from the command line.)

(I tried reading through “Multi-processing and Distributed Computing · The Julia Language”, but I’m afraid that it did not illuminate me.)

Thanks again…!

lmiq · February 19, 2021, 11:04am

Concerning your code, you can accelerate that inner loop with threads (start Julia with julia -t N, where N is the number of threads you want to use, typically the number of cores):

    Threads.@threads for i in 2:(Npts-1)
        du[i,2] = u[i,1]
        du[i,1] = (-k*u[i,2]) + ((u[(i+1),2]-(2*u[i,2])+u[(i-1),2])/(DX^2))
    end

If that makes a difference for overall performance, you might want to rewrite the loop to actually take advantage of multi-threading, but I would test that first.
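A minimal sketch of how one might test this: it times the serial and threaded versions of the loop on their own and checks that they give the same result. The values of Npts, DX and k and the function names rhs_serial! and rhs_threaded! are assumptions for illustration, not taken from the original program.

```julia
# Assumed (hypothetical) parameter values; the original program defines its own.
const Npts = 200_000
const DX = 0.01
const k = 1.0

# Serial version of the inner loop
function rhs_serial!(du, u)
    for i in 2:(Npts-1)
        du[i,2] = u[i,1]
        du[i,1] = (-k*u[i,2]) + ((u[i+1,2] - 2*u[i,2] + u[i-1,2]) / DX^2)
    end
    return du
end

# Same loop, with iterations split across the available threads
function rhs_threaded!(du, u)
    Threads.@threads for i in 2:(Npts-1)
        du[i,2] = u[i,1]
        du[i,1] = (-k*u[i,2]) + ((u[i+1,2] - 2*u[i,2] + u[i-1,2]) / DX^2)
    end
    return du
end

u = rand(Npts, 2)
du_serial = zeros(Npts, 2)
du_threaded = zeros(Npts, 2)

# Warm-up calls so that compilation time is not measured
rhs_serial!(du_serial, u)
rhs_threaded!(du_threaded, u)

t_serial = @elapsed rhs_serial!(du_serial, u)
t_threaded = @elapsed rhs_threaded!(du_threaded, u)
println("threads = ", Threads.nthreads(),
        ", serial = ", t_serial, " s, threaded = ", t_threaded, " s")
```

Each iteration writes only to row i of du, so the iterations are independent and safe to run in parallel. If Threads.nthreads() prints 1, Julia was started without -t and no speedup should be expected.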

You can also (on a single core) use:

    using LoopVectorization

and change that loop to:

    @avx for i in 2:(Npts-1)
        du[i,2] = u[i,1]
        du[i,1] = (-k*u[i,2]) + ((u[(i+1),2]-(2*u[i,2])+u[(i-1),2])/(DX^2))
    end

That provides some additional speedup.

But those things will only make a difference if that loop accounts for a significant share of the overall runtime, which I am not sure about (I could not run your example).
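If installing LoopVectorization is not an option, a rough dependency-free stand-in is Base's @inbounds @simd. This is a swapped-in technique, not what @avx does internally, and the gain may well be smaller. The sketch below assumes hypothetical values for Npts, DX and k and passes them as arguments; the function names are made up for illustration.

```julia
# Plain loop, for reference
function rhs_plain!(du, u, k, DX)
    Npts = size(u, 1)
    for i in 2:(Npts-1)
        du[i,2] = u[i,1]
        du[i,1] = (-k*u[i,2]) + ((u[i+1,2] - 2*u[i,2] + u[i-1,2]) / DX^2)
    end
    return du
end

# Same loop with bounds checks removed and SIMD vectorization permitted
function rhs_simd!(du, u, k, DX)
    Npts = size(u, 1)
    @inbounds @simd for i in 2:(Npts-1)
        du[i,2] = u[i,1]
        du[i,1] = (-k*u[i,2]) + ((u[i+1,2] - 2*u[i,2] + u[i-1,2]) / DX^2)
    end
    return du
end

# Assumed (hypothetical) sizes and parameters
u = rand(100_000, 2)
du_plain = zeros(size(u))
du_simd = zeros(size(u))

# Warm-up calls so that compilation time is not measured
rhs_plain!(du_plain, u, 1.0, 0.01)
rhs_simd!(du_simd, u, 1.0, 0.01)

t_plain = @elapsed rhs_plain!(du_plain, u, 1.0, 0.01)
t_simd = @elapsed rhs_simd!(du_simd, u, 1.0, 0.01)
println("plain = ", t_plain, " s, simd = ", t_simd, " s")
```

@inbounds is only safe here because the loop range 2:(Npts-1) keeps i-1 and i+1 inside the array.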

rveltz · February 19, 2021, 11:30am

The bottleneck is very likely the computation of the Jacobian. Also, you can pass linear-solver options to CVODE_BDF, such as dense or banded, if I recall correctly.
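Whether a banded solver helps depends on how the unknowns are ordered in memory. The sketch below (plain Julia, no packages) computes the lower and upper bandwidth of the Jacobian sparsity pattern implied by the loop above, for two layouts: the Npts×2 array used in the thread and a hypothetical interleaved 2×Npts array. The helper names bandwidths, columnwise and interleaved are made up for illustration.

```julia
# Lower/upper bandwidth of the Jacobian pattern for a given index map.
# index(i, c) gives the position of component c at grid point i in vec(u).
function bandwidths(Npts, index)
    lower = 0
    upper = 0
    for i in 2:(Npts-1)
        # du[i,1] depends on u[i-1,2], u[i,2] and u[i+1,2]
        row = index(i, 1)
        for j in (i-1, i, i+1)
            col = index(j, 2)
            lower = max(lower, row - col)
            upper = max(upper, col - row)
        end
        # du[i,2] depends on u[i,1]
        row = index(i, 2)
        col = index(i, 1)
        lower = max(lower, row - col)
        upper = max(upper, col - row)
    end
    return lower, upper
end

Npts = 101                                # assumed grid size
columnwise(i, c) = (c - 1) * Npts + i     # vec(u) for an Npts×2 array
interleaved(i, c) = 2 * (i - 1) + c       # vec(u) for a 2×Npts array

bw_columnwise = bandwidths(Npts, columnwise)
bw_interleaved = bandwidths(Npts, interleaved)
println("Npts×2 layout: ", bw_columnwise)
println("2×Npts layout: ", bw_interleaved)
```

With the Npts×2 layout the bandwidth grows with Npts, so a banded factorization would save little; with the interleaved layout it stays at a few entries. If I recall the Sundials.jl options correctly, the bandwidths would then be passed as CVODE_BDF(linear_solver=:Band, jac_lower=..., jac_upper=...), but that is worth checking against the current documentation.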

ChrisRackauckas · February 19, 2021, 1:56pm

The simplest thing to try is CVODE_BDF(linear_solver=:GMRES), which would be matrix-free. Without a preconditioner you might get a speedup or you might not, but it’s at least easy to check.
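A minimal sketch of that check, assuming DifferentialEquations and Sundials are installed. The right-hand side is a stand-in built from the loop earlier in the thread; the grid size, parameters, initial condition, time span and tolerances are all assumptions for illustration, not values from the original program.

```julia
using DifferentialEquations, Sundials

# Hypothetical stand-in for the original right-hand side
function rhs!(du, u, p, t)
    k, DX = p
    Npts = size(u, 1)
    fill!(du, 0.0)                        # boundary rows are held fixed
    for i in 2:(Npts-1)
        du[i,2] = u[i,1]
        du[i,1] = (-k*u[i,2]) + ((u[i+1,2] - 2*u[i,2] + u[i-1,2]) / DX^2)
    end
    return nothing
end

# Assumed problem setup
Npts = 201
DX = 1.0 / (Npts - 1)
x = range(0.0, 1.0; length = Npts)
u0 = zeros(Npts, 2)
u0[:, 2] .= sin.(pi .* x)                 # smooth profile, zero at both ends
prob = ODEProblem(rhs!, u0, (0.0, 1.0), (1.0, DX))

opts = (reltol = 1e-8, abstol = 1e-8, maxiters = Int(1e7), save_everystep = false)

# Default (dense) linear solver versus matrix-free GMRES
sol_dense = solve(prob, CVODE_BDF(); opts...)
sol_gmres = solve(prob, CVODE_BDF(linear_solver = :GMRES); opts...)

# Sanity check: the two solvers should agree to roughly the requested tolerance
maxdiff = maximum(abs.(sol_dense.u[end] .- sol_gmres.u[end]))
println("largest difference between the two final states: ", maxdiff)
```

Wrapping each solve call in @time (after one warm-up run) shows whether GMRES actually pays off for a given problem size.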

CosmoProf · February 22, 2021, 2:00am

Hi again folks, I have implemented some of these changes, and the results are of the “impossibly too good to be true” variety.

For my real (very large) program, solving the coupled ODE system originally took over 3 days… using Rodas5(), the whole program with everything took over 82 hours!

Using CVODE_BDF() the program took only 2.5 hours (148 minutes).

Then, using CVODE_BDF(linear_solver=:GMRES) took less than 1 hour (51 minutes), and most of that time wasn’t even doing ODE solving!

I was enormously skeptical about this impossible amount of speedup, so I compared the ODE solutions… and the biggest difference in any value between the different solvers was less than 3E-10.

The solver doesn’t even seem to scale with the number of pixels anymore! Running my simpler program shown above with CVODE_BDF(linear_solver=:GMRES) – noting that it must include the corrected line, "maxiters = Int(1e7)", to run successfully – going from Npix = 801 to 1601 to 3201 only increased the DifferentialEquations.solve command runtime from 4.87 to 5.87 to 6.52 seconds!

I’m not sure how any of this is possible, but I’m glad it is! Thanks for all of your suggestions… I’m still implementing more of them as necessary…
