# Optimize a simple code for performance

Hi folks,
the other day I was discussing with a colleague about how to speed up things in Julia for performance, and was wondering if anybody could help me improve (so as to learn) the following code.
This is a minimal part of a code, where one is supposed to evaluate the laplacian in cartesian coordinates of a given function discretised in an array. I have here a grid, the tabulated function and the tabulated (analytical) laplacian

``````xmax    =  4.0
xmin    = -xmax
Nx      =  2001
x       = collect(range(xmin, stop=xmax,length=Nx))
dx      = x-x;

ymax    =  3.0
ymin    = -ymax
Ny      =  2001
y       = collect(range(ymin, stop=ymax,length=Ny))
dy      = y-y;

# Testing the 2D case

fi     = zeros(Nx,Ny);
lap_fi = zero(fi);
for i in 1:Nx
xx = x[i]
for j in 1:Ny
yy = y[j]
fi[i,j]     = exp(-0.1*(xx*xx+2.0*yy*yy))
lap_fi[i,j] = fi[i,j]*(-0.6 + 0.04*xx*xx + 0.16*yy*yy)
end
end
lap_fi
``````

Then I wrote a function the good-old-fortran way… I mean, performing all the operations element-wise

``````function Lapla(y::Array{T}, dr::Array{T}) where {T}
Nx, Ny    = size(y)
idx2,idy2 = 1.0./(dr.*dr)
d2y       = zero(y)
@inbounds for i in 2:Nx-1
@inbounds for j in 2:Ny-1
d2y[i,j]  = idx2*(y[i+1,j] - 2.0*y[i,j] + y[i-1,j])
d2y[i,j] += idy2*(y[i,j+1] - 2.0*y[i,j] + y[i,j-1])
end
end
@inbounds for j in 2:Ny-1
d2y[1,j]    = idy2*(y[1,j+1]       - 2.0*y[1,j]       + y[1,j-1]) +
idx2*(2.0*y[1,j]     - 5.0*y[2,j]       + 4.0*y[3,j]       - y[4,j])
d2y[end,j]  = idy2*(y[end,j+1]     - 2.0*y[end,j]     + y[end,j-1]) +
idx2*(2.0*y[end,j]   - 5.0*y[end-1,j]   + 4.0*y[end-2,j]   - y[end-3,j])
end
@inbounds for j in 2:Nx-1
d2y[j,1]    = idx2*(y[j+1,1]       - 2.0*y[j,1]       + y[j-1,1]) +
idy2*(2.0*y[j,1]     - 5.0*y[j,2]       + 4.0*y[j,3]       - y[j,4])
d2y[j,end]  = idx2*(y[j+1,end]     - 2.0*y[j,end]     + y[j-1,end]) +
idy2*(2.0*y[j,end]   - 5.0*y[j,end-1] + 4.0*y[j,end-2]   - y[j,end-3])
end
d2y[1,1]        = idy2*(2.0*y[1,1]     - 5.0*y[1,2]       + 4.0*y[1,3]       - y[1,4]) +
idx2*(2.0*y[1,1]     - 5.0*y[2,1]       + 4.0*y[3,1]       - y[4,1])
d2y[1,end]      = idy2*(2.0*y[1,end]   - 5.0*y[1,end-1]   + 4.0*y[1,end-2]   - y[1,end-3]) +
idx2*(2.0*y[1,end]   - 5.0*y[2,end]     + 4.0*y[3,end]     - y[4,end])
d2y[end,1]      = idy2*(2.0*y[end,1]   - 5.0*y[end,2]     + 4.0*y[end,3]     - y[end,4]) +
idx2*(2.0*y[end,1]   - 5.0*y[end-1,1]   + 4.0*y[end-2,1]   - y[end-3,1])
d2y[end,end]    = idy2*(2.0*y[end,end] - 5.0*y[end,end-1] + 4.0*y[end,end-2] - y[end,end-3]) +
idx2*(y[end,end]     - 5.0*y[end-1,end] + 4.0*y[end-2,end] - y[end-3,end])
d2y
end;
``````

with the result

``````@btime Lapla(fi,[dx,dy])
>  90.100 ms (4 allocations: 30.55 MiB)
``````

…so quite fast with only 4 allocations. I’m happy with the result… But one of the reasons why I moved to Julia was to avoid having to write code like this, and use functions that operate on vectors/arrays directly.
I then wrote the following code

``````function Laplacian_1D!(d2y, y, dr)
idx2 = 1.0/prod(dr)^2
Nx   = size(y,1)
@inbounds for i in 2:Nx-1
d2y[i] = idx2*(y[i+1]-2*y[i]+y[i-1])
end
d2y     = idx2*(2.0*y   - 5.0*y     + 4.0*y     - y)
d2y[end]   = idx2*(2.0*y[end] - 5.0*y[end-1] + 4.0*y[end-2] - y[end-3])
end;

function Laplacian_2D!(d2y, y, dr)
dx, dy     = dr
# idx2, idy2 = 1.0./(dr.*dr)
Nx, Ny     = size(y)
aux        = zero(d2y)
@inbounds for i in 1:Ny
Laplacian_1D!(view(aux,:,i),view(y,:,i),dx)
end
@inbounds for i in 1:Nx
Laplacian_1D!(view(d2y,i,:),view(y,i,:),dy)
end
d2y .+= aux
end;
``````

which does the laplacian in 1 dimension, and uses this function to compute the laplacian in two dimensions. Then

``````d2y = zero(fi[:,1]);
@btime Laplacian_1D!(d2y,view(fi,:,1),dx)
>   1.688 μs (3 allocations: 80 bytes)
``````

with 3 allocations (nice!), and

``````d2xy = zero(fi)
@btime Laplacian_2D!(d2xy,view(fi,:,:),[dx,dy])
>   100.073 ms (8010 allocations: 30.91 MiB)
``````

which is similar in terms of computing time, though a 10% worse, probably(?) but to allocations.

Now the question is: can the second way (with Laplacian_1D! and Laplacian_2D!) be modified slightly to perform as the Lapla function above? Is the speed & allocation penalty unavoidable without having to re-think the whole problem?
I keep wondering if I must code everything element wise as in Lapla all the time… which I’d like to avoid.

Ferran.

First, you’ve got a small bug in your `Lapla` implementation: the last value is computed incorrectly. Although bugs of this nature are difficult to avoid, I guess this says something about the balance between code quality/correctness vs sheer speed.

Fixing this makes both your implementations yield the same results, as checked by a test like the following:

``````julia> using LinearAlgebra;
julia> steps = (dx,dy);
julia> let d2xy = zero(fi)
norm(Lapla(fi, steps) - Laplacian_2D!(d2xy2,fi,steps))
end

0.0
``````

Now that both implementations produce the same results, optimization can begin.Starting with your code, I get these reference timings on my machine:

``````julia> let d2xy = zero(fi)
@btime Lapla(\$fi, \$steps);
@btime Laplacian_2D!(\$d2xy,\$fi,\$steps);
nothing
end
24.429 ms (2 allocations: 30.55 MiB)
31.646 ms (8006 allocations: 30.91 MiB)
``````

This is not really fair, because `Laplacian_2D!` operates in-place and does not have to allocate space to store its results, like `Lapla` does. Fixing `Lapla` so that it operates in-place widens the performance gap between both versions: the overhead of the second version went from ~30% to ~50%.

``````julia> let d2xy = zero(fi)
@btime Lapla!(\$d2xy,\$fi, \$steps);
@btime Laplacian_2D!(\$d2xy,\$fi,\$steps);
nothing
end
20.570 ms (0 allocations: 0 bytes)
31.763 ms (8006 allocations: 30.91 MiB)
``````

Since the `Laplacian_2D!` version is meant to operate in place, it should also not allocate so much memory. This can be fixed by getting rid of the `aux` vector:

``````function Laplacian_1D!(d2y, y, dr)
idx2 = 1.0/prod(dr)^2
Nx   = size(y,1)
@inbounds for i in 2:Nx-1
d2y[i] += idx2*(y[i+1]-2*y[i]+y[i-1])
end
@inbounds d2y   += idx2*(2.0*y   - 5.0*y     + 4.0*y     - y)
@inbounds d2y[end] += idx2*(2.0*y[end] - 5.0*y[end-1] + 4.0*y[end-2] - y[end-3])
end

function Laplacian_2D!(d2y, y, dr)
dx, dy     = dr
Nx, Ny     = size(y)

@inbounds for i in 1:Ny
Laplacian_1D!(view(d2y,:,i),view(y,:,i),dx)
end

@inbounds for i in 1:Nx
Laplacian_1D!(view(d2y,i,:),view(y,i,:),dy)
end
d2y
end
``````
``````julia> let d2xy = zero(fi)
@btime Lapla!(\$d2xy, \$fi, \$steps);
@btime Laplacian_2D!(\$d2xy,\$fi,\$steps);
nothing
end
19.764 ms (0 allocations: 0 bytes)
26.161 ms (8004 allocations: 375.19 KiB)
``````

It does not get rid of all allocations, but the memory usage is more reasonable. And we’re back to ~30% overhead.

The remaining 8k allocations are probably due to the creation of temporary objects with the view. (I think there are plans to get rid of them in future Julia versions). I don’t think these allocations explain the performance loss, but let’s make sure of this: by slightlmy modifying `Laplacian_1D!` to get an additional index, we can make it operate on 2D arrays:

``````function Laplacian_1D!(d2y, y, dr, j)
idx2 = 1.0/prod(dr)^2
Nx   = size(y,1)
@inbounds for i in 2:Nx-1
d2y[i,j] += idx2*(y[i+1,j]-2*y[i,j]+y[i-1,j])
end
@inbounds d2y[1,j]   += idx2*(2.0*y[1,j]   - 5.0*y[2,j]     + 4.0*y[3,j]     - y[4,j])
@inbounds d2y[end,j] += idx2*(2.0*y[end,j] - 5.0*y[end-1,j] + 4.0*y[end-2,j] - y[end-3,j])
end

function Laplacian_2Dv2!(d2y, y, dr)
dx, dy     = dr
Nx, Ny     = size(y)

@inbounds for i in 1:Ny
Laplacian_1D!(d2y, y, dx, i)
end

@inbounds for i in 1:Nx
Laplacian_1D!(view(d2y,i,:),view(y,i,:),dy)
end
d2y
end
``````
``````julia> let d2xy = zero(fi)
@btime Lapla!(\$d2xy, \$fi, \$steps);
@btime Laplacian_2D!(\$d2xy,\$fi,\$steps);
@btime Laplacian_2Dv2!(\$d2xy,\$fi,\$steps);
nothing
end
19.526 ms (0 allocations: 0 bytes)
25.246 ms (8004 allocations: 375.19 KiB)
26.015 ms (4002 allocations: 187.59 KiB)
``````

Doing this only on the first loop, we get half as many allocations, but no significant impact on the run time.

I think the performance issues of your second version are actually related to the memory access pattern: the first loop is very fast because it accesses memory in the correct order, while the second loop accesses memory in a very inefficient pattern. In the `Lapla` version, the inefficient memory accesses are interspersed among correct ones, and I would guess they are somehow handled more efficiently.

I think a Julian way to rewrite the `Lapla` version in a more readable way could involve `CartesianIndices` (see for example this blog post), but I’m not sure how…

Complete code
``````fi, lap_fi, dx, dy = let
xmax    =  4.0
xmin    = -xmax
Nx      =  2001
x       = collect(range(xmin, stop=xmax,length=Nx))
dx      = x-x;

ymax    =  3.0
ymin    = -ymax
Ny      =  2001
y       = collect(range(ymin, stop=ymax,length=Ny))
dy      = y-y;

# Testing the 2D case

fi     = zeros(Nx,Ny);
lap_fi = zero(fi);
for i in 1:Nx
xx = x[i]
for j in 1:Ny
yy = y[j]
fi[i,j]     = exp(-0.1*(xx*xx+2.0*yy*yy))
lap_fi[i,j] = fi[i,j]*(-0.6 + 0.04*xx*xx + 0.16*yy*yy)
end
end
fi, lap_fi, dx, dy
end

function Lapla!(d2y, y::Array{T}, dr) where {T}
Nx, Ny    = size(y)
idx2,idy2 = 1.0./(dr.*dr)
@inbounds for i in 2:Nx-1
@inbounds for j in 2:Ny-1
d2y[i,j]  = idx2*(y[i+1,j] - 2.0*y[i,j] + y[i-1,j])
d2y[i,j] += idy2*(y[i,j+1] - 2.0*y[i,j] + y[i,j-1])
end
end
@inbounds for j in 2:Ny-1
d2y[1,j]    = idy2*(y[1,j+1]       - 2.0*y[1,j]       + y[1,j-1]) +
idx2*(2.0*y[1,j]     - 5.0*y[2,j]       + 4.0*y[3,j]       - y[4,j])
d2y[end,j]  = idy2*(y[end,j+1]     - 2.0*y[end,j]     + y[end,j-1]) +
idx2*(2.0*y[end,j]   - 5.0*y[end-1,j]   + 4.0*y[end-2,j]   - y[end-3,j])
end
@inbounds for j in 2:Nx-1
d2y[j,1]    = idx2*(y[j+1,1]       - 2.0*y[j,1]       + y[j-1,1]) +
idy2*(2.0*y[j,1]     - 5.0*y[j,2]       + 4.0*y[j,3]       - y[j,4])
d2y[j,end]  = idx2*(y[j+1,end]     - 2.0*y[j,end]     + y[j-1,end]) +
idy2*(2.0*y[j,end]   - 5.0*y[j,end-1] + 4.0*y[j,end-2]   - y[j,end-3])
end
d2y[1,1]        = idy2*(2.0*y[1,1]     - 5.0*y[1,2]       + 4.0*y[1,3]       - y[1,4]) +
idx2*(2.0*y[1,1]     - 5.0*y[2,1]       + 4.0*y[3,1]       - y[4,1])
d2y[1,end]      = idy2*(2.0*y[1,end]   - 5.0*y[1,end-1]   + 4.0*y[1,end-2]   - y[1,end-3]) +
idx2*(2.0*y[1,end]   - 5.0*y[2,end]     + 4.0*y[3,end]     - y[4,end])
d2y[end,1]      = idy2*(2.0*y[end,1]   - 5.0*y[end,2]     + 4.0*y[end,3]     - y[end,4]) +
idx2*(2.0*y[end,1]   - 5.0*y[end-1,1]   + 4.0*y[end-2,1]   - y[end-3,1])
d2y[end,end]    = idy2*(2.0*y[end,end] - 5.0*y[end,end-1] + 4.0*y[end,end-2] - y[end,end-3]) +
idx2*(2.0*y[end,end] - 5.0*y[end-1,end] + 4.0*y[end-2,end] - y[end-3,end])
d2y
end

function Laplacian_1D!(d2y, y, dr)
idx2 = 1.0/prod(dr)^2
Nx   = size(y,1)
@inbounds for i in 2:Nx-1
d2y[i] += idx2*(y[i+1]-2*y[i]+y[i-1])
end
@inbounds d2y   += idx2*(2.0*y   - 5.0*y     + 4.0*y     - y)
@inbounds d2y[end] += idx2*(2.0*y[end] - 5.0*y[end-1] + 4.0*y[end-2] - y[end-3])
end

function Laplacian_2D!(d2y, y, dr)
dx, dy     = dr
Nx, Ny     = size(y)

@inbounds for i in 1:Ny
Laplacian_1D!(view(d2y,:,i),view(y,:,i),dx)
end

@inbounds for i in 1:Nx
Laplacian_1D!(view(d2y,i,:),view(y,i,:),dy)
end
d2y
end

function Laplacian_1D!(d2y, y, dr, j)
idx2 = 1.0/prod(dr)^2
Nx   = size(y,1)
@inbounds for i in 2:Nx-1
d2y[i,j] += idx2*(y[i+1,j]-2*y[i,j]+y[i-1,j])
end
@inbounds d2y[1,j]   += idx2*(2.0*y[1,j]   - 5.0*y[2,j]     + 4.0*y[3,j]     - y[4,j])
@inbounds d2y[end,j] += idx2*(2.0*y[end,j] - 5.0*y[end-1,j] + 4.0*y[end-2,j] - y[end-3,j])
end

function Laplacian_2Dv2!(d2y, y, dr)
dx, dy     = dr
Nx, Ny     = size(y)

@inbounds for i in 1:Ny
Laplacian_1D!(d2y, y, dx, i)
end

@inbounds for i in 1:Nx
Laplacian_1D!(view(d2y,i,:),view(y,i,:),dy)
end
d2y
end

using LinearAlgebra
steps = (dx,dy)
let d2xy1 = zero(fi), d2xy2 = zero(fi)
Lapla!(d2xy1, fi, steps)
Laplacian_2D!(d2xy2,fi,steps)
println(norm(d2xy1 - d2xy2))

fill!(d2xy2, 0)
Laplacian_2Dv2!(d2xy2,fi,steps)
println(norm(d2xy1 - d2xy2))
end

using BenchmarkTools
let d2xy = zero(fi)
@btime Lapla!(\$d2xy, \$fi, \$steps);
@btime Laplacian_2D!(\$d2xy,\$fi,\$steps);
@btime Laplacian_2Dv2!(\$d2xy,\$fi,\$steps);
nothing
end
``````
3 Likes

Inlining the inner function gets rid of the view allocations. But there is still some time difference.

2 Likes

Hi,

Thanks for the tips I noticed the bug once posted, but that was actually a bad
copy-paste from my more elaborated and working code, where I had to do much cleaning and apparently did too much. The original algorithm was coded in fortran many years ago and has been flawlessly working since then, with no bugs Regarding the question itself, I actually didn’t want to use a different form for the laplacian function in 1D, 2D or 3D. I just wanted to use multiple dispatch to decide the case. Actually I use defined user-defined types for that

``````abstract type USER_DEFINED_TYPE end;

abstract type Numerical_Derivatives <: USER_DEFINED_TYPE end;

struct Hard_End_Points <: Numerical_Derivatives end;
struct Cartesian_Hard_End_Points    <: Numerical_Derivatives end;
struct Cartesian_Hard_End_Points_1D <: Numerical_Derivatives end;
struct Cartesian_Hard_End_Points_2D <: Numerical_Derivatives end;
struct Cartesian_Hard_End_Points_3D <: Numerical_Derivatives end;
``````

so that the laplacian function is always called the same way

``````Lap(d2f,grid,dr,calc)
``````

with

``````calc=Cartesian_Hard_Hend_Points_1D()
``````

or

``````calc=Cartesian_Hard_Hend_Points_2D()
``````

etc. In this way (though I did not say anything like that before), adding a new parameter j as in your Laplacian_1D! would break that. More importantly, your version does not do what mine does. Your version requires the array d2y in Laplacian_1D! to be initialised to 0, or otherwise it will add the result of the laplacian to the already stored values in d2y. I say this because I already found myself these two improvements, but realised that this was not what the original Lapla routine is doing…

In any case, more that specifically optimising this, I believe the question is more general, regarding performance penalty when trying to write a function acting on arrays (as Laplacian_2D!) vs acting on individual elements as Laplacian_1D! or Lapla do). I can understand that a little of overhead must be there, but I was expecting that one could make the time of executions of both versions could be essentially the same -which is not in the present case.
I mean, maybe I can re-think the whole thing and code it in a clever but anti-natural way, but I’m not really interested on that, I believe that coding the laplacian of a 2D function in a 2D array is the most natural way of write things… not the smartest probably, but the simplest one.

Thanks a lot for your kind help Ferran.