GPU simulation time of neural ODE program

Hello,

I’m trying to run the following neural ODE program on the GPU. When I run it on the GPU (NVIDIA GeForce GTX 1050), the `Flux.train!(loss_n_ode, ps, data, opt, cb = cb)` call takes much longer than running on the CPU alone, as shown below.

GPU: 362.607972 seconds (719.08 M allocations: 24.526 GiB, 3.86% gc time)
CPU: 35.925501 seconds (150.99 M allocations: 13.473 GiB, 9.14% gc time)

Could you tell me how to fix the issue so the GPU version runs faster?

using Flux, DiffEqFlux, DifferentialEquations, Plots, CSV, CuArrays

# Read CSV file
flamedata = CSV.read("./results.csv")
flamedata2 = flamedata[2:2:6002,:]

u0 = Float32[0.; 0.11189834407236525; 0.8881016559276348; 0.]|>gpu

datasize = 3001
tspan = (0.0f0,0.0003f0)|>gpu

t = range(tspan[1],tspan[2],length=datasize)|>gpu

ode_data2 = Matrix(flamedata2[[1, 4, 7, 9]])  # select the four state columns
ode_data2 = transpose(ode_data2)
ode_data2 = convert(Array{Float32}, ode_data2)
ode_data = ode_data2|>gpu
ode_data[1,:] = tanh.(ode_data[1,:]*100)|>gpu

dudt = Chain(
        Dense(4,32,swish),
        Dense(32,16,swish),
        Dense(16,8,swish),
        Dense(8,4))|>gpu

ps = Flux.params(dudt)|>gpu
n_ode = x->neural_ode(dudt,gpu(x),gpu(tspan),AutoTsit5(Rodas5(autodiff=false)),saveat=t,dtmin=1.0E-14,maxiters=1e10,reltol=1e-7,abstol=1e-9)

function predict_n_ode()
  n_ode(u0)
end
loss_n_ode() = sum(abs2,ode_data .- predict_n_ode())

data = Iterators.repeated((), 5)
opt = ADAM(0.1, (0.9, 0.999))
cb = function () # callback function to observe training
  display(loss_n_ode())
  # plot current prediction against data, one series per state variable
  cur_pred = Flux.data(predict_n_ode())
  pl1 = plot()
  for i in 1:4
    plot!(pl1, t, ode_data[i,:], label="data$i", lw=2)
    plot!(pl1, t, cur_pred[i,:], label="prediction$i", lw=2)
  end
  gui(plot(pl1))
end

# Display the ODE with the initial parameter values.
cb()

@time Flux.train!(loss_n_ode, ps, data, opt, cb = cb)

The problem here is Rodas5. If you just use Tsit5 you’ll be fine for now (or use a ROCK method). Jacobian construction without autodiff is slow on the GPU. We are building a new forward-mode autodiff precisely for this reason, and I think we will have something to show by the end of next week.
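(Editor's note: a minimal sketch of what swapping the solver looks like, reusing the `neural_ode` call from the question; `ROCK4` is one of the stabilized explicit ROCK methods in OrdinaryDiffEq and is named here only as an illustration of "a ROCK method".)

```julia
# Explicit solvers avoid building a Jacobian on the GPU entirely.
# Tsit5 is the standard non-stiff Runge-Kutta choice:
n_ode = x -> neural_ode(dudt, gpu(x), gpu(tspan), Tsit5(),
                        saveat=t, reltol=1e-7, abstol=1e-9)

# ROCK4 is a stabilized explicit method that tolerates mild stiffness
# while still needing no Jacobian:
# n_ode = x -> neural_ode(dudt, gpu(x), gpu(tspan), ROCK4(),
#                         saveat=t, reltol=1e-7, abstol=1e-9)
```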


Dear Chris,

Thank you for your prompt reply.

I tried the two ODE algorithms shown below. However, the GPU runs were still slower than the CPU-only runs.

n_ode = x->neural_ode(dudt,gpu(x),gpu(tspan),Tsit5(),saveat=t,dtmin=1.0E-14,maxiters=1e10,reltol=1e-7,abstol=1e-9)
286.258497 seconds (699.98 M allocations: 23.458 GiB, 3.58% gc time)

n_ode = x->neural_ode(dudt,gpu(x),gpu(tspan),AutoTsit5(Tsit5()),saveat=t,dtmin=1.0E-14,maxiters=1e10,reltol=1e-7,abstol=1e-9)
302.667069 seconds (700.02 M allocations: 23.523 GiB, 3.62% gc time)

For your case, the matrices are still too small. A 32x32 matmul isn’t large enough to overcome the kernel call cost. The GPU will just help scale larger layers.
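(Editor's note: a quick sketch of how to see this crossover yourself. It assumes `CuArrays.@sync`, which blocks until the kernel finishes so the GPU timing is honest; the size list is arbitrary.)

```julia
using BenchmarkTools, CuArrays

# Compare CPU vs GPU matmul across sizes. For small matrices the fixed
# kernel-launch overhead dominates; the GPU only pulls ahead once the
# matrices are large enough to keep it busy.
for n in (32, 256, 2048)
    A  = rand(Float32, n, n);  B  = rand(Float32, n, n)
    dA = cu(A);                dB = cu(B)
    print("n = $n  CPU: "); @btime $A * $B
    print("n = $n  GPU: "); @btime CuArrays.@sync $dA * $dB
end
```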

Dear Chris,

The difference in simulation time between CPU and GPU became smaller than before when I increased the number of parameters in the neural network layers, as shown below.

dudt = Chain(
        Dense(4,256,swish),
        Dense(256,128,swish),
        Dense(128,64,swish),
        Dense(64,4))|>gpu
GPU:  279.341810 seconds (713.16 M allocations: 23.783 GiB, 4.39% gc time)
CPU:  172.805359 seconds (137.03 M allocations: 361.775 GiB, 21.18% gc time)

GPU simulation is still slower than the CPU, since the GPU in my laptop (NVIDIA GeForce GTX 1050) is not very powerful. Therefore, I will use the CPU when running neural ODE code on this laptop. I hope a future version of DiffEqFlux will support MPI or OpenMP.

What does your computer show at that size?

julia> using BenchmarkTools, CuArrays

julia> A = cu(rand(4,256)); B = cu(rand(256,128));

julia> @btime A*B
  3.406 μs (19 allocations: 592 bytes)
4×128 CuArray{Float32,2}:
 70.0315  66.4814  71.9346  71.5927  69.5655  65.5431  …  70.0417  63.0131  70.5124  67.8303  68.7845  65.3313
 59.1719  59.0919  63.0657  59.825   61.1214  57.4047     56.0934  52.0603  59.7451  61.2429  58.7901  57.6035
 61.2603  60.0343  65.5527  65.2218  60.3737  59.5472     59.8305  53.4624  64.5687  62.1284  63.2896  57.3726
 67.5751  64.2045  71.7379  71.1343  68.4805  63.8018     66.5502  60.405   68.8187  67.7905  65.8484  64.2003

julia> A = rand(4,256); B = rand(256,128);

julia> @btime A*B
  31.357 μs (1 allocation: 4.13 KiB)
4×128 Array{Float64,2}:
 62.8857  63.1846  60.4739  58.7771  59.831   62.9932  …  57.5166  59.5414  60.4711  59.8159  63.8812  68.3361
 65.5783  65.8799  62.7385  63.8594  63.1873  65.7564     62.3821  61.9301  64.7102  60.7978  62.7974  67.8307
 65.8959  68.8122  61.7253  65.4755  64.1282  66.0544     64.3825  63.6596  67.8998  62.6193  67.2302  71.0087
 65.2023  65.7254  63.5841  64.6367  62.0464  64.6357     64.951   63.2562  65.7277  62.2329  67.7674  66.351

You don’t need OpenMP support because all of the matmuls are already multithreaded. Check your CPU usage and you should see it using all cores automatically.
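(Editor's note: a sketch of how to check and force the BLAS thread count; `BLAS.set_num_threads` lives in the standard library's `LinearAlgebra` module.)

```julia
using LinearAlgebra

# OpenBLAS, the BLAS shipped with Julia, multithreads matmuls on its own.
# If CPU usage looks low, try forcing the thread count explicitly:
BLAS.set_num_threads(Sys.CPU_THREADS)

# Then re-run a matmul and watch CPU usage:
A = rand(Float32, 256, 256); B = rand(Float32, 256, 256)
@time A * B
```

Note that OpenBLAS only spawns extra threads for matrices above an internal size threshold, so very small layers (like the 4×32 matmuls here) may legitimately show low CPU usage.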

Dear Chris,

The matmul timings on my computer are as follows.

julia> using BenchmarkTools, CuArrays

julia> A = cu(rand(4,256)); B = cu(rand(256,128));

julia> @btime A*B
  3.699 μs (19 allocations: 592 bytes)
4×128 CuArray{Float32,2}:
 63.3189  58.2083  67.3136  56.9396  63.9851  62.1672  …  62.191   62.884   65.5695  63.5641  62.2439  58.2934
 67.5529  59.881   63.778   60.0911  66.3862  61.5216     62.1403  59.062   67.0465  66.907   66.459   59.2792
 65.3215  59.3315  62.6717  54.5476  64.2242  63.1418     60.2677  61.5783  64.8999  64.8344  63.723   57.5102
 68.5587  61.1417  66.4793  59.7681  67.7273  64.8121     63.6989  61.4582  65.1183  67.98    65.8259  61.5581

julia> A = rand(4,256); B = rand(256,128);

julia> @btime A*B
  32.820 μs (1 allocation: 4.13 KiB)
4×128 Array{Float64,2}:
 66.8995  67.3647  62.8093  63.0263  64.6963  66.201   …  62.7533  63.4769  64.2497  63.7303  65.1254  64.2671
 60.0078  58.9639  60.5994  59.0173  58.324   61.5357     61.5452  63.7883  60.4779  60.7345  60.6196  61.6659
 66.8389  67.7833  66.9895  64.8955  66.6696  67.4492     65.1754  71.0193  66.63    66.6087  68.0931  67.0778
 63.7755  66.7935  66.5961  62.9272  66.9261  66.9876     65.8134  66.7046  65.4326  65.149   66.0973  71.374

The attached figure shows CPU usage during the neural ODE run. My laptop has an Intel Core i7-7700HQ (4 physical cores, 8 logical processors). It seems that not all cores were used during the simulation. My simulation environment is as follows.

A 1050 isn’t a good contender against the 7700HQ, especially given their clock speeds, and a Dense(256, 128) layer is still fairly small for a GPU. At least you can validate that it works, which motivates trying it on a faster GPU.

Dear Marmaduke,

Thank you for your reply.

Actually, Dense(32,16)-sized layers are sufficient for my case for now, so I will use the CPU. I confirmed that the GPU computation worked without any errors for my case. If I use a larger data set, I will try the GPU again.