Julia can be dramatically slower than Matlab when solving 2D PDE

Dear Oscar,
Thanks for your suggestion. For now, I only work with partial differential equations.

You can use an arbitrary ODE solver to timestep your PDEs. I believe something like this should work.

using OrdinaryDiffEqTsit5
# formulate energy as ODE
@views function energy3(dT, T, p, t)
    Fo, h, N = p
    # interior points: Fo/h times the discrete Laplacian in both directions
    @. dT[2:N-1,2:N-1] = (Fo / h) * ((T[3:N,2:N-1] - 2 * T[2:N-1,2:N-1] + T[1:N-2,2:N-1]) + (T[2:N-1,3:N] - 2 * T[2:N-1,2:N-1] + T[2:N-1,1:N-2]))
    # boundary conditions
    dT[1:N,1] .= 0;
    dT[1:N,N] .= 0;
    dT[1,1:N] .= dT[2,1:N];
    dT[N,1:N] .= dT[N-1,1:N];
end

# parameters
N = 101;
Fo = 0.01;
Th = 0.5;
Tc = -0.5;
h = 1/(N-1);

# initial T
T0 = zeros(N, N)
T0[1:N,1] .= Th;
T0[1:N,N] .= Tc;
prob = ODEProblem(energy3, T0, (0, 1000), (Fo, h, N))
sol = solve(prob, Tsit5())
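
If storing the solution at every time step takes too much memory, the saving behavior can be reduced with the standard solver keywords, e.g.:

sol = solve(prob, Tsit5(); save_everystep = false)  # keep only the initial and final states
T_final = sol.u[end]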

Semantically it does copy the array. Implementation-wise it's done with copy-on-write, so they share the data until either reference gets mutated.
The same happens with arguments to functions.


I dare to ask another question. Now let's switch to GPU computing. First, let's execute the code on the CPU:

N = 2001;
Fo = 0.01;
h = 1/(N-1);
T=(zeros(N,N));
#T=CuArray((zeros(N,N)));
Th = 0.5;
Tc = -0.5;
Time = 100;
dt=0.01;
function energy(Time, N, T, Th, Fo, dt, Tc, h)
    for iter = 1:Time
        Tn = T;
        @. T[2:N-1,2:N-1] = Tn[2:N-1,2:N-1] + Fo * dt * ((Tn[3:N,2:N-1] - 2 * Tn[2:N-1,2:N-1] + Tn[1:N-2,2:N-1]) / h + (Tn[2:N-1,3:N] - 2 * Tn[2:N-1,2:N-1] + Tn[2:N-1,1:N-2]) / h)
        T[1:N,1] .= Th;
        T[1:N,N] .= Tc;
        T[1,1:N] .= T[2,1:N];
        T[N,1:N] .= T[N-1,1:N];
    end
end
@time energy(Time, N, T, Th, Fo, dt, Tc, h)
8.872417 seconds (1.09 M allocations: 20.899 GiB, 4.28% gc time, 3.45% compilation time)

Now, let's switch to the GPU:

using CUDA
N = 2001;
Fo = 0.01;
h = 1/(N-1);
#T=(zeros(N,N));
T=CuArray((zeros(N,N)));
Th = 0.5;
Tc = -0.5;
Time = 100;
dt=0.01;
function energy(Time, N, T, Th, Fo, dt, Tc, h)
    for iter = 1:Time
        Tn = T;
        @. T[2:N-1,2:N-1] = Tn[2:N-1,2:N-1] + Fo * dt * ((Tn[3:N,2:N-1] - 2 * Tn[2:N-1,2:N-1] + Tn[1:N-2,2:N-1]) / h + (Tn[2:N-1,3:N] - 2 * Tn[2:N-1,2:N-1] + Tn[2:N-1,1:N-2]) / h)
        T[1:N,1] .= Th;
        T[1:N,N] .= Tc;
        T[1,1:N] .= T[2,1:N];
        T[N,1:N] .= T[N-1,1:N];
    end
end
@time energy(Time, N, T, Th, Fo, dt, Tc, h)
0.685531 seconds (1.17 M allocations: 64.837 MiB, 3.00% gc time, 60.79% compilation time)

It seems like a good speedup. However, Matlab executed this code in 0.21 seconds on the GPU. Can someone suggest any easy-to-understand optimization tips? Or should I straightaway try to write a CUDA kernel?
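
(For reference, I imagine a hand-written kernel would look roughly like the sketch below; diffusion_kernel! and the launch configuration are just my guess, and the boundary conditions would still be handled separately.)

using CUDA

function diffusion_kernel!(T, Tn, Fo, dt, h, N)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    if 2 <= i <= N - 1 && 2 <= j <= N - 1
        @inbounds T[i, j] = Tn[i, j] + Fo * dt *
            ((Tn[i+1, j] - 2 * Tn[i, j] + Tn[i-1, j]) / h +
             (Tn[i, j+1] - 2 * Tn[i, j] + Tn[i, j-1]) / h)
    end
    return nothing
end

Tn = similar(T)   # separate buffer holding the previous time step
Tn .= T
threads = (16, 16)
blocks  = (cld(N, 16), cld(N, 16))
@cuda threads=threads blocks=blocks diffusion_kernel!(T, Tn, Fo, dt, h, N)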

On a GPU, you definitely want to use Float32 for all your numbers. You also definitely should be using an ODE solver as previously mentioned.
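
For example (a minimal sketch, assuming CUDA.jl, with N as in the script above; plain Float32 literals are enough, there is no need to wrap scalars in GPU arrays):

using CUDA
T  = CUDA.zeros(Float32, N, N)   # Float32 array allocated on the GPU
Fo = 0.01f0                      # Float32 scalar literals
dt = 0.01f0
h  = 1.0f0 / (N - 1)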

Why does Julia not use such a mechanism?

Copy-on-write makes multithreading and getindex more expensive. It's very far from a free optimization.

Also, if Arr2 = Arr1, we want mutations to Arr2 to be visible in Arr1. This sharing is central to Julia semantics, and is sorely missed in Matlab.
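
For example:

A = [1, 2, 3]
B = A        # B is just another name for the same array, not a copy
B[1] = 10
A[1]         # returns 10: the mutation is visible through both names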

Or do you mean that slices could be views until they are mutated?

Yes, absolutely.

In the future it will of course be Float32; this was just a test case, and in Matlab double precision was also used. As for the ODE solver, I don't feel it is suitable in my case, since I study different numerical schemes for computational fluid dynamics in my research.

There are two important optimizations:

  1. Use @views as mentioned before. That will make slices (like Tn[2:N-1,2:N-1]) not allocate new temporary arrays.

  2. Again: the line Tn = T does not really make sense here, and this turns out to be important for performance. With Tn = T you are just saying that Tn and T are the same array. In the loop version of the code this is simply wrong. In this last version, with the broadcast, it is not wrong, but to avoid aliasing (concurrent reading and writing of the array T) the compiler will create temporary copies of the slices wherever it cannot prove that aliasing does not occur.

In summary, use the function like this:

@views function energy(Time, N, T, Th, Fo, dt, Tc, h)
      Tn = similar(T)
      for iter = 1:Time
          Tn .= T;
# the rest is identical

I get, before these changes:

julia> @btime energy($Time, $N, $T, $Th, $Fo, $dt, $Tc, $h)
  109.071 ms (66900 allocations: 1.82 MiB)

After these two changes:

julia> @btime energy($Time, $N, $T, $Th, $Fo, $dt, $Tc, $h)
  2.984 ms (43405 allocations: 978.33 KiB)

Thus a ~30x speedup.

Edit: the path to optimal code is to eliminate the allocation of intermediates. Here, for example, I pass Tn as an additional argument, so that nothing is allocated inside the function. Then I get this:

julia> @views function energy(Time, N, T, Th, Fo, dt, Tc, h, Tn) # Additional parameter Tn
           for iter = 1:Time
               Tn .= T;
       @. T[2:N-1,2:N-1] = Tn[2:N-1,2:N-1] + Fo * dt * ((Tn[3:N,2:N-1] - 2 * Tn[2:N-1,2:N-1] + Tn[1:N-2,2:N-1]) / h + (Tn[2:N-1,3:N] - 2 * Tn[2:N-1,2:N-1] + Tn[2:N-1,1:N-2]) / h)
       T[1:N,1] .=  Th;
       T[1:N,N] .=  Tc;
       T[1,1:N] .=  T[2,1:N];
       T[N,1:N] .=  T[N-1,1:N];
           end
       end
energy (generic function with 2 methods)

julia> T = zeros(N,N); Tn = similar(T); # running on CPU

julia> @btime energy($Time, $N, $T, $Th, $Fo, $dt, $Tc, $h, $Tn)
  1.382 s (0 allocations: 0 bytes)

Very well: the function does not allocate any intermediates when running on the CPU. This is probably a reasonably good implementation, and running it on the GPU confirms it. The previous version, even with the use of @views, was allocating and thus was slower:

julia> @views function energy_old(Time, N, T, Th, Fo, dt, Tc, h) # old version, without the extra Tn argument
           for iter = 1:Time
               Tn = T;
       @. T[2:N-1,2:N-1] = Tn[2:N-1,2:N-1] + Fo * dt * ((Tn[3:N,2:N-1] - 2 * Tn[2:N-1,2:N-1] + Tn[1:N-2,2:N-1]) / h + (Tn[2:N-1,3:N] - 2 * Tn[2:N-1,2:N-1] + Tn[2:N-1,1:N-2]) / h)
       T[1:N,1] .=  Th;
       T[1:N,N] .=  Tc;
       T[1,1:N] .=  T[2,1:N];
       T[N,1:N] .=  T[N-1,1:N];
           end
       end
energy_old (generic function with 1 method)

julia> @btime energy_old($Time, $N, $T, $Th, $Fo, $dt, $Tc, $h)
  3.193 s (800 allocations: 11.91 GiB)

The allocations come from the fact that the compiler was unable to guarantee, in the broadcast, that concurrent writing to T was not occurring.
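
To make this concrete, here is a small standalone check (my own sketch; step_aliased! and step_separate! are just illustrative names) showing that the defensive copies appear only when the right-hand-side views come from the destination array itself:

using BenchmarkTools
A = rand(1000, 1000); B = copy(A);

step_aliased!(A)     = @views A[2:end-1, :] .= A[1:end-2, :] .+ A[3:end, :]
step_separate!(A, B) = @views A[2:end-1, :] .= B[1:end-2, :] .+ B[3:end, :]

@btime step_aliased!($A)       # allocates: broadcast copies the RHS views to protect against aliasing
@btime step_separate!($A, $B)  # 0 allocations: the RHS comes from a different array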


You can use an ODE solver for that, of course: there are about 300 already available and tested for correctness.


Just to expand on the recommendations about functions: it is idiomatic in Julia to indicate a mutating function by adding ! to its name, and to have the mutated variable as the first argument. Even better, follow the in-place function signature of OrdinaryDiffEq.jl as @Oscar_Smith did, but with !:
energy3!(dT, T, p, t)

You may be interested in PDE packages, many of which are organized under JuliaFEM.jl, for example HeatTransfer.jl.

JuliaFEM.jl is extremely outdated now. HeatTransfer.jl hasn't even been updated in 4 years. A slightly more up-to-date reference for the available PDE packages is probably(?) the JuliaPDE/SurveyofPDEPackages repository on GitHub (a survey of the packages of the Julia ecosystem for solving partial differential equations).


Many thanks for the clear and concise explanation! With these changes, Julia executes the GPU code ~1.6x faster than Matlab. However, when I use Float32 variables, I get almost the same performance as with Float64.

N = 2001;
Nx = N; Ny = N;
Fo = 0.01;
h = 1/(N-1);
#T=(zeros(N,N));
T=CuArray((zeros(N,N)));
Tn = similar(T)
Th = 0.5;
Tc = -0.5;
Time = 10000;
dt=0.01; 
@views function energy_2(Time, N, T, Th, Fo, dt, Tc, h, Tn)
    for iter = 1:Time
        Tn .= T;
        @. T[2:N-1,2:N-1] = Tn[2:N-1,2:N-1] + Fo * dt * ((Tn[3:N,2:N-1] - 2 * Tn[2:N-1,2:N-1] + Tn[1:N-2,2:N-1]) / h + (Tn[2:N-1,3:N] - 2 * Tn[2:N-1,2:N-1] + Tn[2:N-1,1:N-2]) / h)
        T[1:N,1] .= Th;
        T[1:N,N] .= Tc;
        T[1,1:N] .= T[2,1:N];
        T[N,1:N] .= T[N-1,1:N];
    end
end
##@time energy_2(Time, N, T, Th, Fo, dt, Tc, h, Tn)
@btime energy_2($Time, $N, $T, $Th, $Fo, $dt, $Tc, $h, $Tn)
17.028 s (3079034 allocations: 517.09 MiB)

And with Float32:

N = 2001;
Fo = 0.01;
h = 1/(N-1);
T=CUDA.zeros(Float32,N,N);
Tn = similar(T)
Th = 0.5;
Tc = -0.5;
Time = 10000;
dt=0.01;
Fo_gpu = CUDA.fill(Float32(Fo),1)
dt_gpu = CUDA.fill(Float32(dt),1)
Th_gpu = CUDA.fill(Float32(Th),1)
Tc_gpu = CUDA.fill(Float32(Tc),1)
h_gpu = CUDA.fill(Float32(h),1);
@views function energy(Time, N, T, Th, Fo, dt, Tc, h, Tn)
    for iter = 1:Time
        Tn .= T;
        @. T[2:N-1,2:N-1] = Tn[2:N-1,2:N-1] + Fo * dt * ((Tn[3:N,2:N-1] - 2 * Tn[2:N-1,2:N-1] + Tn[1:N-2,2:N-1]) / h + (Tn[2:N-1,3:N] - 2 * Tn[2:N-1,2:N-1] + Tn[2:N-1,1:N-2]) / h)
        T[1:N,1] .= Th;
        T[1:N,N] .= Tc;
        T[1,1:N] .= T[2,1:N];
        T[N,1:N] .= T[N-1,1:N];
    end
end
##@time energy(Time, N, T, Th, Fo, dt, Tc, h, Tn)
@btime energy($Time, $N, $T, $Th, $Fo, $dt, $Tc, $h, $Tn)
15.284 s (3628806 allocations: 577.51 MiB)

Is this an expected result for this test case?

Dear Chris,
Thank you for this note. I develop hybrid lattice Boltzmann schemes, which I believe are still unavailable in commercial software and in free libraries or packages.

The unexpectedly low performance might come from Float64 values in the intermediate computations. Maybe a better pattern would be something like this:

function energy(
    Time::Int32,
    T::AbstractMatrix{F},
    Th::F,
    Fo::F,
    dt::F,
    Tc::F,
    h::F,
    Tn::AbstractMatrix{F}
) where {F<:AbstractFloat}
    N = size(T, 1)
    @views for iter = 1:Time
        Tn .= T;
        @. T[2:N-1,2:N-1] = Tn[2:N-1,2:N-1] + Fo * dt * ((Tn[3:N,2:N-1] - 2 * Tn[2:N-1,2:N-1] + Tn[1:N-2,2:N-1]) / h + (Tn[2:N-1,3:N] - 2 * Tn[2:N-1,2:N-1] + Tn[2:N-1,1:N-2]) / h)
        T[1:N,1] .=  Th;
        T[1:N,N] .=  Tc;
        T[1,1:N] .=  T[2,1:N];
        T[N,1:N] .=  T[N-1,1:N];
    end
end

function energy(
    Time::Integer,
    T::AbstractMatrix{F},
    Th::Real,
    Fo::Real,
    dt::Real,
    Tc::Real,
    h::Real,
    Tn::AbstractMatrix{F}
) where {F<:AbstractFloat}
    Th1, Fo1, dt1, Tc1, h1 = convert.(F, (Th, Fo, dt, Tc, h))
    return energy(Int32(Time), T, Th1, Fo1, dt1, Tc1, h1, Tn)
end

This way, the scalar arguments will first be converted to eltype(T), avoiding mixed-precision operations.
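
For example, a call could then look like this (just a usage sketch; the literal values mirror the earlier posts):

using CUDA
N  = 2001
T  = CUDA.zeros(Float32, N, N)
Tn = similar(T)
# plain Float64 literals are fine here; the wrapper converts them to eltype(T) = Float32
energy(10_000, T, 0.5, 0.01, 0.01, -0.5, 1/(N-1), Tn)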


I think you only need Fo = 1.0f0 (a 32-bit scalar), rather than a one-element CuArray.

And you forgot to change the function call to use the new values.

That said, I tested with 32 bits on my machine and didn't get a significant improvement either. Maybe a more experienced CUDA user can explain.


I noticed that too. But after changing the values passed to the function, I got the same performance result. Anyway, many thanks for this useful discussion.

It is very unclear to me what intermediate computations could be in Float64, since all the variables are Float32 :thinking: Just in case, I checked every variable after the computation, and they were all Float32.
Many thanks for your code modification. Unfortunately, I did not understand how to run it properly.