Checking that work is being sent to processors: GPU vs Multiple CPUs

In the Parameter Parallelism section of the DiffEqGPU.jl repo’s documentation there is some code. I got it to work, but I want to check that it’s actually running on the processing units of the GPU. Is there some way to see that the workload is running on the individual processing units?

I also want to compare this workload running solve with EnsembleThreads() instead. In particular, I want to see that work is being sent to all of my CPU cores. Is there a way to see this in real time? I checked top expecting to see multiple instances of Julia running, but I didn’t see this.

If you are using NVIDIA GPUs you can check nvidia-smi in a console (e.g. nvidia-smi -l 1 to refresh it every second), which reports how much GPU memory each process is using (similar to top, but for the GPU). Usually this is enough for me, but there are more sophisticated profiling tools available for CUDA.

Do you have any ideas about the CPUs? top shows me the % used by each process, but I have a 40-core processor that I’m testing on and I only see ~500% usage when I should expect to see >3000%…

Also, when I check nvidia-smi I don’t see any processes.

I am not 100% sure but usually the diagnostic tools external to Julia are not totally accurate. Maybe someone else knows a good tool for profiling.

Personally, I test the scaling of functions to know what I can expect, which is done with some simple benchmarking: either I change the number of threads used and look at the speedup for the same workload, or I change the problem size to see how much more performance the extra cores buy.
For example, if you were to plot the parallel speedup vs. problem size, you would expect to see something like this, which hopefully shows a speedup equal to the number of threads for large enough problem sizes:

Obviously, you don’t need to go to this much effort, as you can just manually run some quick checks in the REPL.
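Such a quick REPL check might look like this sketch (a toy workload I made up for illustration, not the DiffEqGPU example — start Julia with several threads first, e.g. `julia --threads=4`):

```julia
# Compare a serial map against Threads.@spawn at a few problem sizes;
# for large enough n the speedup should approach nthreads().
using Base.Threads

function work(n)
    s = 0.0
    for i in 1:n
        s += sin(i)
    end
    return s
end

work(10)  # warm up so compilation doesn't skew the first timing

for n in (10^4, 10^6, 10^8)
    t_serial   = @elapsed foreach(work, fill(n, nthreads()))
    t_threaded = @elapsed foreach(fetch, [@spawn work(n) for _ in 1:nthreads()])
    println("n = $n: speedup ≈ $(round(t_serial / t_threaded, digits = 2))x")
end
```

For small n the threaded version can even be slower (task overhead dominates), which is exactly what the speedup-vs-size plot shows.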

This likely means that the program doesn’t use the GPU at all.

uh oh – but this package is specifically designed to run on GPUs. The code I ran is an example they share with us.

What are you running?

using DiffEqGPU, OrdinaryDiffEq
function lorenz(du,u,p,t)
    du[1] = p[1]*(u[2]-u[1])
    du[2] = u[1]*(p[2]-u[3]) - u[2]
    du[3] = u[1]*u[2] - p[3]*u[3]
end

u0 = Float32[1.0;0.0;0.0]
tspan = (0.0f0,100.0f0)
p = [10.0f0,28.0f0,8/3f0]
prob = ODEProblem(lorenz,u0,tspan,p)
prob_func = (prob,i,repeat) -> remake(prob,p=rand(Float32,3).*p)
monteprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy=false)

@time sol = solve(monteprob,Tsit5(),EnsembleGPUArray(),trajectories=10_000,saveat=1.0f0)
@time sol = solve(monteprob, GPUTsit5(), EnsembleGPUKernel(), trajectories = 10_000, adaptive = true, dt = 0.1f0, save_everystep = false)

Those last two calls will definitely use GPUs, and you’ll see it in nvidia-smi via the utilization percentages. Note, though, that those examples only run for something like <1 second IIRC, so you’ll need to be fast (or make the example bigger).

Multiple instances are different from multiple threads. Multiple instances would be the result of multiprocessing, i.e. using Distributed. With multithreading, top just shows a single Julia process at >100% CPU utilization (press H inside top to see the individual threads; htop is usually a lot nicer for investigating this kind of thing).
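One thing worth double-checking (an assumption on my part, but it would cap EnsembleThreads() well below your 40 cores): Julia only uses the threads it was started with.

```shell
# EnsembleThreads() can only use the threads Julia was started with;
# the default is a single thread. Set the count before launching Julia
# (40 here, to match the core count):
export JULIA_NUM_THREADS=40
echo "$JULIA_NUM_THREADS"
# Alternatively, pass it at launch time: julia --threads=40
# Inside the REPL, Threads.nthreads() reports the active count.
```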

Ya I’m running this –

For the last two lines:

  • I modified the penultimate line so that trajectories=1000000, saveat=200.0f0.
    I checked nvidia-smi and saw nothing.

  • In the last line, GPUTsit5() didn’t load. I am using the DiffEqGPU library, so I’m unsure why this is giving me an error:
    UndefVarError: GPUTsit5 not defined

I did not know this – I’ll check htop, thanks!

How long did it take to solve? It can be hard to see with nvidia-smi if it only lasts a second.

What package version?

  1. I kept re-running nvidia-smi while solve was running and kept seeing nothing, so I terminated the solve call. Does solve need to finish running for me to see something in nvidia-smi?

  2. I’m running Julia 1.6.7. Is this what you mean?

Did you terminate it when it was compiling?

oh hm, maybe.

Since this exchange, I’ve been letting solve run without terminating it (trajectories=1000000,saveat=200.0f0). So the call has been running at least 10 mins. I still don’t see anything in nvidia-smi.

I don’t know how compilation is supposed to work in Julia. Am I waiting for Julia to compile 10^6 kernels (I think this is the right term for this) to send to the graphics processing units?

That’s odd. It should be a few seconds or so.

It just compiles one kernel function.

Is there a way to check if I have everything configured correctly?

I’m running Julia 1.6.7 (downloaded from website and run directly from the bin folder). I then added the packages DiffEqGPU, OrdinaryDiffEq.

You probably want to use v1.8.1 with this. I haven’t tested on LTS, but I know v1.8.1 should be fine. If that works, I can probably pin down what’s going on with LTS, but most things using advanced compiler toolchains run best with the latest release.
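As for checking the configuration itself, a quick sanity check (a sketch, assuming CUDA.jl is installed — it’s the backend DiffEqGPU uses for NVIDIA cards):

```julia
# Verify that Julia can actually talk to the GPU before blaming DiffEqGPU.
using CUDA

CUDA.functional()      # should return true if the driver and toolkit are usable
CUDA.versioninfo()     # prints driver, toolkit, and device details
collect(CUDA.ones(4))  # tiny end-to-end test: allocate on the GPU, copy back
```

If CUDA.functional() returns false, that points at the driver/toolkit setup rather than at DiffEqGPU.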

okay let me try this with 1.8.1

The same thing is occurring with 1.8.1: solve hasn’t finished after more than 10 minutes with trajectories=1000000, saveat=200.0f0, and nvidia-smi shows nothing. Any ideas?

Update: It took half an hour to run the first time. I ran it a second time, in case compilation caching made it faster, and checked nvidia-smi again; I still saw nothing. The second run took about the same time.