Dear Oscar,
Thanks for your suggestion. For now, I only work with partial differential equations.
You can use an arbitrary ODE solver to timestep your PDEs. I believe something like this should work.
using OrdinaryDiffEqTsit5
# formulate energy as ODE
@views function energy3(dT, T, p, t)
Fo, h, N = p
# interior points: Fo/h times the sum of the second differences in x and y
@. dT[2:N-1,2:N-1] = (Fo / h) * ((T[3:N,2:N-1] - 2 * T[2:N-1,2:N-1] + T[1:N-2,2:N-1]) + (T[2:N-1,3:N] - 2 * T[2:N-1,2:N-1] + T[2:N-1,1:N-2]))
# fixed-temperature walls: these columns do not change in time
dT[1:N,1] .= 0;
dT[1:N,N] .= 0;
# adiabatic top and bottom rows
dT[1,1:N] .= dT[2,1:N];
dT[N,1:N] .= dT[N-1,1:N];
end
# parameters
N = 101;
Fo = 0.01;
Th = 0.5;
Tc = -0.5;
h = 1/(N-1);
# initial T
T0 = zeros(N, N)
T0[1:N,1] .= Th;
T0[1:N,N] .= Tc;
prob = ODEProblem(energy3, T0, (0, 1000), (Fo, h, N))
sol = solve(prob, Tsit5())
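In case it is useful: sol is a solution object, so for example sol.u[end] is the temperature field at the final time, and (with the default dense output) sol(t) interpolates at intermediate times:
Tfinal = sol.u[end]   # N×N temperature field at t = 1000
Tmid = sol(500.0)     # interpolated field at t = 500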
Semantically it does copy the array. Implementation-wise it's done with copy-on-write, so they share the data until either reference gets mutated.
The same happens with arguments to functions.
I dare to ask another question. Now let's switch to GPU computing. First, let's execute the code on the CPU:
N = 2001;
Fo = 0.01;
h = 1/(N-1);
T=(zeros(N,N));
#T=CuArray((zeros(N,N)));
Th = 0.5;
Tc = -0.5;
Time = 100;
dt=0.01;
function energy(Time, N, T, Th, Fo, dt, Tc, h)
for iter = 1:Time
Tn = T;
@. T[2:N-1,2:N-1] = Tn[2:N-1,2:N-1] + Fo * dt * ((Tn[3:N,2:N-1] - 2 * Tn[2:N-1,2:N-1] + Tn[1:N-2,2:N-1]) / h + (Tn[2:N-1,3:N] - 2 * Tn[2:N-1,2:N-1] + Tn[2:N-1,1:N-2]) / h)
T[1:N,1] .= Th;
T[1:N,N] .= Tc;
T[1,1:N] .= T[2,1:N];
T[N,1:N] .= T[N-1,1:N];
end
end
@time energy(Time, N, T, Th, Fo, dt, Tc, h)
8.872417 seconds (1.09 M allocations: 20.899 GiB, 4.28% gc time, 3.45% compilation time)
Now, let's switch to the GPU:
N = 2001;
Fo = 0.01;
h = 1/(N-1);
#T=(zeros(N,N));
T=CuArray((zeros(N,N)));
Th = 0.5;
Tc = -0.5;
Time = 100;
dt=0.01;
function energy(Time, N, T, Th, Fo, dt, Tc, h)
for iter = 1:Time
Tn = T;
@. T[2:N-1,2:N-1] = Tn[2:N-1,2:N-1] + Fo * dt * ((Tn[3:N,2:N-1] - 2 * Tn[2:N-1,2:N-1] + Tn[1:N-2,2:N-1]) / h + (Tn[2:N-1,3:N] - 2 * Tn[2:N-1,2:N-1] + Tn[2:N-1,1:N-2]) / h)
T[1:N,1] .= Th;
T[1:N,N] .= Tc;
T[1,1:N] .= T[2,1:N];
T[N,1:N] .= T[N-1,1:N];
end
end
@time energy(Time, N, T, Th, Fo, dt, Tc, h)
0.685531 seconds (1.17 M allocations: 64.837 MiB, 3.00% gc time, 60.79% compilation time)
It seems like a good speed-up. However, Matlab executed this code in 0.21 seconds on the GPU. Can anyone suggest easy-to-understand optimization tips, or should I go straight to writing a CUDA kernel?
On a GPU, you definitely want to use Float32 for all your numbers. You also definitely should be using an ODE solver as previously mentioned.
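A rough sketch of how those two suggestions combine, assuming the energy3 right-hand side and N from earlier in the thread (untested here; GPU compatibility of the boundary-condition broadcasts is assumed):
using CUDA, OrdinaryDiffEqTsit5
T0 = CUDA.zeros(Float32, N, N)             # Float32 storage on the GPU
T0[1:N, 1] .= 0.5f0                        # hot wall
T0[1:N, N] .= -0.5f0                       # cold wall
p = (0.01f0, Float32(1 / (N - 1)), N)      # Fo, h, N with Float32 scalars
prob = ODEProblem(energy3, T0, (0.0f0, 1000.0f0), p)
sol = solve(prob, Tsit5())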
Why does Julia not use such a mechanism?
Copy-on-write makes multithreading and getindex more expensive. It's very far from a free optimization.
Also, if Arr2 = Arr1, we want mutations to Arr2 to be visible in Arr1. This sharing is central to Julia semantics, and is sorely missed in Matlab.
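A minimal illustration of that sharing, including through function arguments (set_second! is just an ad-hoc helper for this example):
Arr1 = zeros(3)
Arr2 = Arr1                      # no copy: both names refer to the same array
Arr2[1] = 1.0
Arr1[1]                          # 1.0, the mutation is visible through Arr1

set_second!(x) = (x[2] = 2.0)    # arguments are passed the same way, without copying
set_second!(Arr1)
Arr2[2]                          # 2.0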
Or do you mean that slices could be views until they are mutated?
Yes, absolutely.
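For reference, this is the current behavior (slices copy eagerly, views share memory):
A = collect(1:5)
s = A[2:4]            # a slice copies: s has its own memory
s[1] = 100
A                     # unchanged: [1, 2, 3, 4, 5]

v = @view A[2:4]      # a view shares memory with A
v[1] = 100
A                     # now [1, 100, 3, 4, 5]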
In the future, of course, it will be Float32; I just ran this test case. In Matlab, double precision was also used. As for the ODE solver, I don't feel it is suitable in my case, since I study different numerical schemes for computational fluid dynamics in my research.
There are two important optimizations:

- Use @views as mentioned before. That will make slices (like Tn[2:N-1,2:N-1]) not allocate new temporary arrays.

- Again: the line Tn = T does not really make sense here, and this turns out to be important for performance. With Tn = T you are just saying that Tn and T are the same array. This is wrong in the loop version of the code. In this last version, with the broadcast, it is not wrong, but to avoid aliasing (concurrent writing to the array T), the compiler will create temporary copies of the slices of the array wherever it cannot prove that such aliasing does not occur.
In summary, use the function like this:
@views function energy(Time, N, T, Th, Fo, dt, Tc, h)
Tn = similar(T)
for iter = 1:Time
Tn .= T;
# the rest is identical
I get, before these changes:
julia> @btime energy($Time, $N, $T, $Th, $Fo, $dt, $Tc, $h)
109.071 ms (66900 allocations: 1.82 MiB)
After these two changes:
julia> @btime energy($Time, $N, $T, $Th, $Fo, $dt, $Tc, $h)
2.984 ms (43405 allocations: 978.33 KiB)
Thus a ~30x speedup.
Edit: The path to optimal code is to eliminate the allocation of intermediates. Here, for example, I pass Tn as an additional argument, so that nothing is allocated inside the function. Then I get this:
julia> @views function energy(Time, N, T, Th, Fo, dt, Tc, h, Tn) # Additional parameter Tn
for iter = 1:Time
Tn .= T;
@. T[2:N-1,2:N-1] = Tn[2:N-1,2:N-1] + Fo * dt * ((Tn[3:N,2:N-1] - 2 * Tn[2:N-1,2:N-1] + Tn[1:N-2,2:N-1]) / h + (Tn[2:N-1,3:N] - 2 * Tn[2:N-1,2:N-1] + Tn[2:N-1,1:N-2]) / h)
T[1:N,1] .= Th;
T[1:N,N] .= Tc;
T[1,1:N] .= T[2,1:N];
T[N,1:N] .= T[N-1,1:N];
end
end
energy (generic function with 2 methods)
julia> T = zeros(N,N); Tn = similar(T); # running on CPU
julia> @btime energy($Time, $N, $T, $Th, $Fo, $dt, $Tc, $h, $Tn)
1.382 s (0 allocations: 0 bytes)
Very well, the function does not allocate any intermediates when running on the CPU. This is probably a reasonably good implementation, and running it on the GPU confirms it. The previous version, even with the use of @views, was allocating and thus was slower:
julia> @views function energy_old(Time, N, T, Th, Fo, dt, Tc, h) # previous version, without the extra Tn argument
for iter = 1:Time
Tn = T;
@. T[2:N-1,2:N-1] = Tn[2:N-1,2:N-1] + Fo * dt * ((Tn[3:N,2:N-1] - 2 * Tn[2:N-1,2:N-1] + Tn[1:N-2,2:N-1]) / h + (Tn[2:N-1,3:N] - 2 * Tn[2:N-1,2:N-1] + Tn[2:N-1,1:N-2]) / h)
T[1:N,1] .= Th;
T[1:N,N] .= Tc;
T[1,1:N] .= T[2,1:N];
T[N,1:N] .= T[N-1,1:N];
end
end
energy_old (generic function with 1 method)
julia> @btime energy_old($Time, $N, $T, $Th, $Fo, $dt, $Tc, $h)
3.193 s (800 allocations: 11.91 GiB)
The allocations come from the fact that the compiler was unable to guarantee, in the broadcast, that concurrent writing to T was not occurring.
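A small illustration of that unaliasing copy (shift_aliased! and shift_buffered! are just names for this example): when the destination and the source of a broadcast are overlapping views of the same array, Base copies the source defensively; with a separate buffer it does not.
using BenchmarkTools

shift_aliased!(A) = @views A[2:end] .= A[1:end-1]                 # source and destination alias
shift_buffered!(A, B) = (B .= A; @views A[2:end] .= B[1:end-1])   # source read from a separate buffer

A = rand(10^6); B = similar(A);
@btime shift_aliased!($A);       # allocates: a defensive copy is made to break the aliasing
@btime shift_buffered!($A, $B);  # 0 allocations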
You can use an ODE solver for that, of course: there are about 300 already available and tested for correctness.
Just to expand on the recommendations about functions: it is idiomatic in Julia to indicate a mutating function by adding ! to its name, and to have the mutated variable as the first argument. Even better, follow the function signature convention of OrdinaryDiffEq.jl, like @Oscar_Smith did, but with the !:
energy3!(dT, T, p, t)
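For example, a renamed version of the earlier right-hand side could look like this (same body as energy3 above, only the in-place naming convention applied; a sketch, assuming T0, Fo, h, N from the first example):
@views function energy3!(dT, T, p, t)
Fo, h, N = p
@. dT[2:N-1,2:N-1] = (Fo / h) * ((T[3:N,2:N-1] - 2 * T[2:N-1,2:N-1] + T[1:N-2,2:N-1]) + (T[2:N-1,3:N] - 2 * T[2:N-1,2:N-1] + T[2:N-1,1:N-2]))
dT[1:N,1] .= 0;
dT[1:N,N] .= 0;
dT[1,1:N] .= dT[2,1:N];
dT[N,1:N] .= dT[N-1,1:N];
return nothing
end
prob = ODEProblem(energy3!, T0, (0, 1000), (Fo, h, N))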
You may be interested in PDE packages, many of which are organized under JuliaFEM.jl, for example HeatTransfer.jl.
JuliaFEM.jl is extremely outdated now. HeatTransfer.jl hasn't even been updated in 4 years. A slightly more up-to-date reference for the available PDE packages is probably(?) GitHub - JuliaPDE/SurveyofPDEPackages: Survey of the packages of the Julia ecosystem for solving partial differential equations
Many thanks for the clear and concise explanation! With these changes, Julia executes the GPU code ~1.6× faster than Matlab. However, when I use Float32 variables, I get almost the same performance as with Float64.
N = 2001;
Nx = N; Ny = N;
Fo = 0.01;
h = 1/(N-1);
#T=(zeros(N,N));
T=CuArray((zeros(N,N)));
Tn = similar(T)
Th = 0.5;
Tc = -0.5;
Time = 10000;
dt=0.01;
@views function energy_2(Time, N, T, Th, Fo, dt, Tc, h, Tn)
for iter = 1:Time
Tn .= T;
@. T[2:N-1,2:N-1] = Tn[2:N-1,2:N-1] + Fo * dt * ((Tn[3:N,2:N-1] - 2 * Tn[2:N-1,2:N-1] + Tn[1:N-2,2:N-1]) / h + (Tn[2:N-1,3:N] - 2 * Tn[2:N-1,2:N-1] + Tn[2:N-1,1:N-2]) / h)
T[1:N,1] .= Th;
T[1:N,N] .= Tc;
T[1,1:N] .= T[2,1:N];
T[N,1:N] .= T[N-1,1:N];
end
end
##@time energy_2(Time, N, T, Th, Fo, dt, Tc, h, Tn)
@btime energy_2($Time, $N, $T, $Th, $Fo, $dt, $Tc, $h, $Tn)
17.028 s (3079034 allocations: 517.09 MiB)
And with Float32:
N = 2001;
Fo = 0.01;
h = 1/(N-1);
T=CUDA.zeros(Float32,N,N);
Tn = similar(T)
Th = 0.5;
Tc = -0.5;
Time = 10000;
dt=0.01;
Fo_gpu = CUDA.fill(Float32(Fo),1)
dt_gpu = CUDA.fill(Float32(dt),1)
Th_gpu = CUDA.fill(Float32(Th),1)
Tc_gpu = CUDA.fill(Float32(Tc),1)
h_gpu = CUDA.fill(Float32(h),1);
@views function energy(Time, N, T, Th, Fo, dt, Tc, h, Tn)
for iter = 1:Time
Tn .= T;
@. T[2:N-1,2:N-1] = Tn[2:N-1,2:N-1] + Fo * dt * ((Tn[3:N,2:N-1] - 2 * Tn[2:N-1,2:N-1] + Tn[1:N-2,2:N-1]) / h + (Tn[2:N-1,3:N] - 2 * Tn[2:N-1,2:N-1] + Tn[2:N-1,1:N-2]) / h)
T[1:N,1] .= Th;
T[1:N,N] .= Tc;
T[1,1:N] .= T[2,1:N];
T[N,1:N] .= T[N-1,1:N];
end
end
##@time energy(Time, N, T, Th, Fo, dt, Tc, h, Tn)
@btime energy($Time, $N, $T, $Th, $Fo, $dt, $Tc, $h, $Tn)
15.284 s (3628806 allocations: 577.51 MiB)
Is this an expected result for this test case?
Dear Chris,
Thank you for this note. I develop hybrid lattice Boltzmann schemes, which I believe are still unavailable in commercial software and in free libraries or packages.
The unexpectedly low performance might come from Float64 in the intermediate computations. Maybe a better pattern would be something like this:
function energy(
Time::Int32,
T::AbstractMatrix{F},
Th::F,
Fo::F,
dt::F,
Tc::F,
h::F,
Tn::AbstractMatrix{F}
) where {F<:AbstractFloat}
N = size(T, 1)
@views for iter = 1:Time
Tn .= T;
@. T[2:N-1,2:N-1] = Tn[2:N-1,2:N-1] + Fo * dt * ((Tn[3:N,2:N-1] - 2 * Tn[2:N-1,2:N-1] + Tn[1:N-2,2:N-1]) / h + (Tn[2:N-1,3:N] - 2 * Tn[2:N-1,2:N-1] + Tn[2:N-1,1:N-2]) / h)
T[1:N,1] .= Th;
T[1:N,N] .= Tc;
T[1,1:N] .= T[2,1:N];
T[N,1:N] .= T[N-1,1:N];
end
end
function energy(
Time::Integer,
T::AbstractMatrix{F},
Th::Real,
Fo::Real,
dt::Real,
Tc::Real,
h::Real,
Tn::AbstractMatrix{F}
) where {F<:AbstractFloat}
Fo1, dt1, Th1, Tc1, h1 = convert.(F, (Fo, dt, Th, Tc, h))
return energy(Int32(Time), T, Th1, Fo1, dt1, Tc1, h1, Tn)
end
This way, the scalar arguments will first be converted to eltype(T), to avoid mixed-precision operations.
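A usage sketch, assuming the Float32 GPU setup from above (the Float64 scalars Time, Th, Fo, dt, Tc, h already defined): the untyped method converts everything and forwards to the typed one, so the call site stays simple:
T = CUDA.zeros(Float32, N, N)
Tn = similar(T)
energy(Time, T, Th, Fo, dt, Tc, h, Tn)   # scalars are converted to Float32 internally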
I think you only need Fo = 1.0f0 (a 32-bit scalar), not a one-element CuArray.
And you forgot to change the function call to use the new values.
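Concretely (just a guess at the intended fix), using the 9-argument energy from above with plain Float32 scalars instead of the one-element Fo_gpu-style arrays:
Fo32, dt32, Th32, Tc32, h32 = Float32.((Fo, dt, Th, Tc, h))
@btime energy($Time, $N, $T, $Th32, $Fo32, $dt32, $Tc32, $h32, $Tn)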
That said, I tested with Float32 on my machine and didn't get a significant improvement either. Maybe a more experienced CUDA user can explain.
I noticed that too. But after changing the values inside the function, I got the same performance result. Anyway, many thanks for this useful discussion.
It is very unclear to me which intermediate computations could be in Float64, since all variables are Float32. After the computation, I checked every variable just in case, and they were all Float32.
Many thanks for your code modification. Unfortunately, I did not understand how to run it properly.