Saving values from GPU during Euler stepping in ODE

arrayfire
gpu
gpuarrays

#1

Dear all,

I have written a simple Euler stepping algorithm for a particular problem which runs on GPU. The vector y_current is big (~200_000) and it is updated like y_current .= y_current .+ dt * RHS. A working example can be found here. It was done in ArrayFire. I have a working example in CLArrays but it is 10x slower.

I can’t save every step in a matrix because I need a lot of steps. Instead, I want to record sum(y_current) at every step. However, including a save mechanism slows down the whole thing 50 times and it is not useful then.

Hence, I am doing something like

for ii=1:10000
     updateEuler!(y_current,dt)
     #save data, a is own by the GPU
     a[ii] = sum(y_current)
end

Not being written in an array fashion, this is slow indeed.

Does anyone ones a trick to save the sum value by staying on the GPU? May be one has to write a specific kernel…

Thank you for your help and suggestions,

Best regards,


#2

I don’t think that’s the issue here. Returning a scalar from a GPU reduction should be cheap. Why is a on the GPU you though?


#3

The Euler step is something like

y_current .= y_current .+ dt .* (F.(y_current) .+ sum(y_current) )

Removing .+ sum(y_current) runs much faster going from 1.1s to 0.16s. It don’t expect computing a sum to be so slow.

Why is a on the GPU you though?

I thought sum(y_current) would stay on the gpu, so I initialised a on the gpu as well.


#4

No, it’ll reduce to a scalar on the CPU. That’s why the indexing is slow.

I believe you don’t want to broadcast ArrayFire commands and let its internal memory management handle it. I think if you do that, the scalar add will be much faster?


#5

Blockquote I think if you do that, the scalar add will be much faster?

Indeed, it improves a bit.

Now removing the sum changes the timing from 0.85s to 0.13s.

Still, I did not expect the sum to decrease the speed like that.


#6

Do allowslow(AFArray, false) and see if your code still runs.


#7

I use @gaika version which does not seem to have that functionality. Thank you for the suggestion.


#8

But @gaika implemented that function? Try ArrayFire.allowslow(AFArray, false)


#9

The code was merged with official version, my branch is stale now.


#10

@gaika You mean I can use the official ArrayFire to get your code?


#11

Main reason the sum is slow is that it stops JIT and forces evaluation of all AFArrays, even temporary.


#12

Yes, updated description of my fork to that effect.


#13

Oh, can we “avoid” this behaviour?

The dot product behaves the same?


#14

Then using u*y_current with u=ones(AFArray,1,N) is faster than sum (I tried)… :joy:


#15

Can you post the full CLArray version? I can see if I can make it faster :slight_smile:


#16

I can already tell you that I made a mistake. It is not that slow compared to ArrayFire…

using CLArrays, GPUArrays
CLArrays.init(CLArrays.devices()[2])
const TY  = Float32
# using ProgressMeter, StatsBase, Distributions, JLD




const N     = 200_000
const w     = TY(1.0)
const H0    = TY(2.)
const Iext  = TY(1.0)

F(v::TY) = Iext - v*0.1f0
f(v::TY) = max(v-0.2f0,0f0).^20

# F(v) = Iext - v*0.1
# f(v) = max(v-0.2,0).^20

function clstep!(y_current,rates,kappa,dt,a,ii,rnd_cl)
	kappa     .= sqrt.(2 * H0 * sum(rates,1)/N) .* w
	y_current .= y_current .+ dt .* (F.(y_current) .+ kappa)
	rates     .= f.(y_current)
	GPUArrays.rand!(rnd_cl)
	y_current .= y_current .* (rates .< -log.(rnd_cl) ./dt )
	if mod(ii,500) == 0
		push!(a,kappa)
end


function euler_cl(T::TY,dt::TY,n::Int32,y0, plot_ = false)
    println("\n--> Euler algo, up to T = ",n*dt)
    n = min(n,Int(ceil(T/dt)))
	println("OK")	
    y_current = copy(y0)
    rndcl     = copy(y0) # to hold randomness
    rates     = f.(y_current)
	println("OK")
    kappa     = sqrt.(2 * H0 * sum(rates,1)/N) .* w
    nb_jumps  = 0
	y_current .= y_current .+ dt .* (F.(y_current) .+ kappa)
	println("OK for Euler step")	

    # variables for saving data
    tout = TY[]
    a = TY[]
	
    for ii=1:n
        # @show ii
        clstep!(y_current,rates,kappa,dt,a,ii,rndcl)
        # save data
        push!(tout,ii*dt)

    end
    return tout,a
end

#######################################################################################
xc0 = rand(TY,N) |> CLArray;
@show typeof(xc0)


parms = TY.([0., 0.])
dt = TY(1./(5.4N))
t,a_tot = euler_cl(TY(5.5),dt*2,Int32(10),xc0);
t,a_tot = @time euler_cl(TY(5.5),dt,Int32(1000),xc0,false);

#17

Okay I got it a bit faster:

Initial version: 2.363006s

1.0s -> Insert forgotten points (meaning only partially fused kernels, and tmp allocations)
0.7s -> using sum(x) instead of sum(x, 1), possibly doesn’t generalize anymore? But the current version was 1D anyways
0.4s -> moving everything into one gpu kernel

final version:


function clstep!(state, randstate, ratesum, kappa, y_current, rates, dt)
    i = @linearidx(y_current, state)
    yi = y_current[i] # create tmp variables to not needlessly access array multiple times
    ytmp = yi + dt * (F(yi) + kappa)
    rate = (rates[i] = f(ytmp))
    rnd = GPUArrays.gpu_rand(Float32, state, randstate)
    y_current[i] = ytmp * (rate < -log(rnd) / dt)
    return
end

function euler_cl(T::TY,dt::TY,n::Int32,y0, plot_ = false)
    println("\n--> Euler algo, up to T = ",n*dt)
    n = min(n,Int(ceil(T/dt)))
	println("OK")
    y_current = copy(y0)
    rndcl     = copy(y0) # to hold randomness
    rates     = f.(y_current)
	println("OK")
    kappa     = sqrt.(2 * H0 * sum(rates)/N) .* w
    nb_jumps  = 0
	y_current .= y_current .+ dt .* (F.(y_current) .+ kappa)
	println("OK for Euler step")

    # variables for saving data
    tout = TY[]
    a = TY[]
    randstate = GPUArrays.cached_state(y_current)
    for ii=1:n
        ratesum = sum(rates)
        kappa = sqrt(2f0 * H0 * ratesum ./ N) .* w
        gpu_call(clstep!, y_current, (randstate, ratesum, kappa, y_current, rates, dt))
        # save data
        push!(tout, ii*dt)
    end
    return tout, a
end

#######################################################################################
xc0 = rand(TY,N) |> CLArray;

parms = TY.([0., 0.])
dt = TY(1. / (5.4N))
t,a_tot = euler_cl(TY(5.5),dt*2,Int32(10),xc0);


t,a_tot = @time euler_cl(TY(5.5), dt,Int32(1000),xc0);

I removed the push!(a, kappa) since there was a type mismatch… I guess that shouldn’t change anything for the benchmarks…

Best,
Simon


Generating Random Number from inside Kernel
#18

@sdanisch a bit?

:heart_eyes:


#19

I thought it would be straightforward to port this example to CuArrays from CLArrays, but I’m getting some errors that a GPU newbie like myself can’t easily resolve. Is there a guide to go between the two?


#20

Do you really mean it or rates = (rates[i] = f(ytmp))?