Optimize a simple code for performance


#1

Hi folks,
the other day I was discussing with a colleague about how to speed up things in Julia for performance, and was wondering if anybody could help me improve (so as to learn) the following code.
This is a minimal part of a code, where one is supposed to evaluate the laplacian in cartesian coordinates of a given function discretised in an array. I have here a grid, the tabulated function and the tabulated (analytical) laplacian

xmax    =  4.0
xmin    = -xmax
Nx      =  2001
x       = collect(range(xmin, stop=xmax,length=Nx))
dx      = x[2]-x[1];

ymax    =  3.0
ymin    = -ymax
Ny      =  2001
y       = collect(range(ymin, stop=ymax,length=Ny))
dy      = y[2]-y[1];

# Testing the 2D case

fi     = zeros(Nx,Ny);
lap_fi = zero(fi);
for i in 1:Nx
    xx = x[i]
    for j in 1:Ny
        yy = y[j]
        fi[i,j]     = exp(-0.1*(xx*xx+2.0*yy*yy))
        lap_fi[i,j] = fi[i,j]*(-0.6 + 0.04*xx*xx + 0.16*yy*yy)
    end
end
lap_fi

Then I wrote a function the good-old-fortran way… I mean, performing all the operations element-wise

function Lapla(y::Array{T}, dr::Array{T}) where {T}
    Nx, Ny    = size(y)
    idx2,idy2 = 1.0./(dr.*dr)
    d2y       = zero(y)
    @inbounds for i in 2:Nx-1
        @inbounds for j in 2:Ny-1
            d2y[i,j]  = idx2*(y[i+1,j] - 2.0*y[i,j] + y[i-1,j])
            d2y[i,j] += idy2*(y[i,j+1] - 2.0*y[i,j] + y[i,j-1])
        end
    end
    @inbounds for j in 2:Ny-1
        d2y[1,j]    = idy2*(y[1,j+1]       - 2.0*y[1,j]       + y[1,j-1]) +
                      idx2*(2.0*y[1,j]     - 5.0*y[2,j]       + 4.0*y[3,j]       - y[4,j])
        d2y[end,j]  = idy2*(y[end,j+1]     - 2.0*y[end,j]     + y[end,j-1]) +
                      idx2*(2.0*y[end,j]   - 5.0*y[end-1,j]   + 4.0*y[end-2,j]   - y[end-3,j])
    end
    @inbounds for j in 2:Nx-1
        d2y[j,1]    = idx2*(y[j+1,1]       - 2.0*y[j,1]       + y[j-1,1]) + 
                      idy2*(2.0*y[j,1]     - 5.0*y[j,2]       + 4.0*y[j,3]       - y[j,4])
        d2y[j,end]  = idx2*(y[j+1,end]     - 2.0*y[j,end]     + y[j-1,end]) + 
                      idy2*(2.0*y[j,end]   - 5.0*y[j,end-1] + 4.0*y[j,end-2]   - y[j,end-3])
    end
    d2y[1,1]        = idy2*(2.0*y[1,1]     - 5.0*y[1,2]       + 4.0*y[1,3]       - y[1,4]) +
                      idx2*(2.0*y[1,1]     - 5.0*y[2,1]       + 4.0*y[3,1]       - y[4,1])
    d2y[1,end]      = idy2*(2.0*y[1,end]   - 5.0*y[1,end-1]   + 4.0*y[1,end-2]   - y[1,end-3]) + 
                      idx2*(2.0*y[1,end]   - 5.0*y[2,end]     + 4.0*y[3,end]     - y[4,end])
    d2y[end,1]      = idy2*(2.0*y[end,1]   - 5.0*y[end,2]     + 4.0*y[end,3]     - y[end,4]) + 
                      idx2*(2.0*y[end,1]   - 5.0*y[end-1,1]   + 4.0*y[end-2,1]   - y[end-3,1])
    d2y[end,end]    = idy2*(2.0*y[end,end] - 5.0*y[end,end-1] + 4.0*y[end,end-2] - y[end,end-3]) +
                      idx2*(y[end,end]     - 5.0*y[end-1,end] + 4.0*y[end-2,end] - y[end-3,end])
    d2y
end;

with the result

@btime Lapla(fi,[dx,dy])
>  90.100 ms (4 allocations: 30.55 MiB)

…so quite fast with only 4 allocations. I’m happy with the result… But one of the reasons why I moved to Julia was to avoid having to write code like this, and use functions that operate on vectors/arrays directly.
I then wrote the following code

function Laplacian_1D!(d2y, y, dr)
    idx2 = 1.0/prod(dr)^2
    Nx   = size(y,1)   
    @inbounds for i in 2:Nx-1
        d2y[i] = idx2*(y[i+1]-2*y[i]+y[i-1])
    end
    d2y[1]     = idx2*(2.0*y[1]   - 5.0*y[2]     + 4.0*y[3]     - y[4])
    d2y[end]   = idx2*(2.0*y[end] - 5.0*y[end-1] + 4.0*y[end-2] - y[end-3])
end;    

function Laplacian_2D!(d2y, y, dr)
    dx, dy     = dr
    # idx2, idy2 = 1.0./(dr.*dr)
    Nx, Ny     = size(y)
    aux        = zero(d2y)
    @inbounds for i in 1:Ny
        Laplacian_1D!(view(aux,:,i),view(y,:,i),dx)
    end
    @inbounds for i in 1:Nx
        Laplacian_1D!(view(d2y,i,:),view(y,i,:),dy)
    end
    d2y .+= aux
end;    

which does the laplacian in 1 dimension, and uses this function to compute the laplacian in two dimensions. Then

d2y = zero(fi[:,1]);
@btime Laplacian_1D!(d2y,view(fi,:,1),dx)
>   1.688 μs (3 allocations: 80 bytes)

with 3 allocations (nice!), and

d2xy = zero(fi)
@btime Laplacian_2D!(d2xy,view(fi,:,:),[dx,dy])
>   100.073 ms (8010 allocations: 30.91 MiB)

which is similar in terms of computing time, though a 10% worse, probably(?) but to allocations.

Now the question is: can the second way (with Laplacian_1D! and Laplacian_2D!) be modified slightly to perform as the Lapla function above? Is the speed & allocation penalty unavoidable without having to re-think the whole problem?
I keep wondering if I must code everything element wise as in Lapla all the time… which I’d like to avoid.

Thanks for your help,

Ferran.


#2

First, you’ve got a small bug in your Lapla implementation: the last value is computed incorrectly. Although bugs of this nature are difficult to avoid, I guess this says something about the balance between code quality/correctness vs sheer speed.

Fixing this makes both your implementations yield the same results, as checked by a test like the following:

julia> using LinearAlgebra;
julia> steps = (dx,dy);
julia> let d2xy = zero(fi)
           norm(Lapla(fi, steps) - Laplacian_2D!(d2xy2,fi,steps))
       end

0.0

Now that both implementations produce the same results, optimization can begin.Starting with your code, I get these reference timings on my machine:

julia> let d2xy = zero(fi)
           @btime Lapla($fi, $steps);
           @btime Laplacian_2D!($d2xy,$fi,$steps);
           nothing
       end
  24.429 ms (2 allocations: 30.55 MiB)
  31.646 ms (8006 allocations: 30.91 MiB)

This is not really fair, because Laplacian_2D! operates in-place and does not have to allocate space to store its results, like Lapla does. Fixing Lapla so that it operates in-place widens the performance gap between both versions: the overhead of the second version went from ~30% to ~50%.

julia> let d2xy = zero(fi)
           @btime Lapla!($d2xy,$fi, $steps);
           @btime Laplacian_2D!($d2xy,$fi,$steps);
           nothing
       end
  20.570 ms (0 allocations: 0 bytes)
  31.763 ms (8006 allocations: 30.91 MiB)

Since the Laplacian_2D! version is meant to operate in place, it should also not allocate so much memory. This can be fixed by getting rid of the aux vector:

function Laplacian_1D!(d2y, y, dr)
    idx2 = 1.0/prod(dr)^2
    Nx   = size(y,1)
    @inbounds for i in 2:Nx-1
        d2y[i] += idx2*(y[i+1]-2*y[i]+y[i-1])
    end
    @inbounds d2y[1]   += idx2*(2.0*y[1]   - 5.0*y[2]     + 4.0*y[3]     - y[4])
    @inbounds d2y[end] += idx2*(2.0*y[end] - 5.0*y[end-1] + 4.0*y[end-2] - y[end-3])
end

function Laplacian_2D!(d2y, y, dr)
    dx, dy     = dr
    Nx, Ny     = size(y)

    @inbounds for i in 1:Ny
        Laplacian_1D!(view(d2y,:,i),view(y,:,i),dx)
    end

    @inbounds for i in 1:Nx
        Laplacian_1D!(view(d2y,i,:),view(y,i,:),dy)
    end
    d2y
end
julia> let d2xy = zero(fi)
           @btime Lapla!($d2xy, $fi, $steps);
           @btime Laplacian_2D!($d2xy,$fi,$steps);
           nothing
       end
  19.764 ms (0 allocations: 0 bytes)
  26.161 ms (8004 allocations: 375.19 KiB)

It does not get rid of all allocations, but the memory usage is more reasonable. And we’re back to ~30% overhead.


The remaining 8k allocations are probably due to the creation of temporary objects with the view. (I think there are plans to get rid of them in future Julia versions). I don’t think these allocations explain the performance loss, but let’s make sure of this: by slightlmy modifying Laplacian_1D! to get an additional index, we can make it operate on 2D arrays:

function Laplacian_1D!(d2y, y, dr, j)
    idx2 = 1.0/prod(dr)^2
    Nx   = size(y,1)
    @inbounds for i in 2:Nx-1
        d2y[i,j] += idx2*(y[i+1,j]-2*y[i,j]+y[i-1,j])
    end
    @inbounds d2y[1,j]   += idx2*(2.0*y[1,j]   - 5.0*y[2,j]     + 4.0*y[3,j]     - y[4,j])
    @inbounds d2y[end,j] += idx2*(2.0*y[end,j] - 5.0*y[end-1,j] + 4.0*y[end-2,j] - y[end-3,j])
end

function Laplacian_2Dv2!(d2y, y, dr)
    dx, dy     = dr
    Nx, Ny     = size(y)

    @inbounds for i in 1:Ny
        Laplacian_1D!(d2y, y, dx, i)
    end

    @inbounds for i in 1:Nx
        Laplacian_1D!(view(d2y,i,:),view(y,i,:),dy)
    end
    d2y
end
julia> let d2xy = zero(fi)
           @btime Lapla!($d2xy, $fi, $steps);
           @btime Laplacian_2D!($d2xy,$fi,$steps);
           @btime Laplacian_2Dv2!($d2xy,$fi,$steps);
           nothing
       end
  19.526 ms (0 allocations: 0 bytes)
  25.246 ms (8004 allocations: 375.19 KiB)
  26.015 ms (4002 allocations: 187.59 KiB)

Doing this only on the first loop, we get half as many allocations, but no significant impact on the run time.


I think the performance issues of your second version are actually related to the memory access pattern: the first loop is very fast because it accesses memory in the correct order, while the second loop accesses memory in a very inefficient pattern. In the Lapla version, the inefficient memory accesses are interspersed among correct ones, and I would guess they are somehow handled more efficiently.

I think a Julian way to rewrite the Lapla version in a more readable way could involve CartesianIndices (see for example this blog post), but I’m not sure how…


Complete code
fi, lap_fi, dx, dy = let
    xmax    =  4.0
    xmin    = -xmax
    Nx      =  2001
    x       = collect(range(xmin, stop=xmax,length=Nx))
    dx      = x[2]-x[1];

    ymax    =  3.0
    ymin    = -ymax
    Ny      =  2001
    y       = collect(range(ymin, stop=ymax,length=Ny))
    dy      = y[2]-y[1];

    # Testing the 2D case

    fi     = zeros(Nx,Ny);
    lap_fi = zero(fi);
    for i in 1:Nx
        xx = x[i]
        for j in 1:Ny
            yy = y[j]
            fi[i,j]     = exp(-0.1*(xx*xx+2.0*yy*yy))
            lap_fi[i,j] = fi[i,j]*(-0.6 + 0.04*xx*xx + 0.16*yy*yy)
        end
    end
    fi, lap_fi, dx, dy
end

function Lapla!(d2y, y::Array{T}, dr) where {T}
    Nx, Ny    = size(y)
    idx2,idy2 = 1.0./(dr.*dr)
    @inbounds for i in 2:Nx-1
        @inbounds for j in 2:Ny-1
            d2y[i,j]  = idx2*(y[i+1,j] - 2.0*y[i,j] + y[i-1,j])
            d2y[i,j] += idy2*(y[i,j+1] - 2.0*y[i,j] + y[i,j-1])
        end
    end
    @inbounds for j in 2:Ny-1
        d2y[1,j]    = idy2*(y[1,j+1]       - 2.0*y[1,j]       + y[1,j-1]) +
                      idx2*(2.0*y[1,j]     - 5.0*y[2,j]       + 4.0*y[3,j]       - y[4,j])
        d2y[end,j]  = idy2*(y[end,j+1]     - 2.0*y[end,j]     + y[end,j-1]) +
                      idx2*(2.0*y[end,j]   - 5.0*y[end-1,j]   + 4.0*y[end-2,j]   - y[end-3,j])
    end
    @inbounds for j in 2:Nx-1
        d2y[j,1]    = idx2*(y[j+1,1]       - 2.0*y[j,1]       + y[j-1,1]) +
                      idy2*(2.0*y[j,1]     - 5.0*y[j,2]       + 4.0*y[j,3]       - y[j,4])
        d2y[j,end]  = idx2*(y[j+1,end]     - 2.0*y[j,end]     + y[j-1,end]) +
                      idy2*(2.0*y[j,end]   - 5.0*y[j,end-1] + 4.0*y[j,end-2]   - y[j,end-3])
    end
    d2y[1,1]        = idy2*(2.0*y[1,1]     - 5.0*y[1,2]       + 4.0*y[1,3]       - y[1,4]) +
                      idx2*(2.0*y[1,1]     - 5.0*y[2,1]       + 4.0*y[3,1]       - y[4,1])
    d2y[1,end]      = idy2*(2.0*y[1,end]   - 5.0*y[1,end-1]   + 4.0*y[1,end-2]   - y[1,end-3]) +
                      idx2*(2.0*y[1,end]   - 5.0*y[2,end]     + 4.0*y[3,end]     - y[4,end])
    d2y[end,1]      = idy2*(2.0*y[end,1]   - 5.0*y[end,2]     + 4.0*y[end,3]     - y[end,4]) +
                      idx2*(2.0*y[end,1]   - 5.0*y[end-1,1]   + 4.0*y[end-2,1]   - y[end-3,1])
    d2y[end,end]    = idy2*(2.0*y[end,end] - 5.0*y[end,end-1] + 4.0*y[end,end-2] - y[end,end-3]) +
                      idx2*(2.0*y[end,end] - 5.0*y[end-1,end] + 4.0*y[end-2,end] - y[end-3,end])
    d2y
end

function Laplacian_1D!(d2y, y, dr)
    idx2 = 1.0/prod(dr)^2
    Nx   = size(y,1)
    @inbounds for i in 2:Nx-1
        d2y[i] += idx2*(y[i+1]-2*y[i]+y[i-1])
    end
    @inbounds d2y[1]   += idx2*(2.0*y[1]   - 5.0*y[2]     + 4.0*y[3]     - y[4])
    @inbounds d2y[end] += idx2*(2.0*y[end] - 5.0*y[end-1] + 4.0*y[end-2] - y[end-3])
end

function Laplacian_2D!(d2y, y, dr)
    dx, dy     = dr
    Nx, Ny     = size(y)

    @inbounds for i in 1:Ny
        Laplacian_1D!(view(d2y,:,i),view(y,:,i),dx)
    end

    @inbounds for i in 1:Nx
        Laplacian_1D!(view(d2y,i,:),view(y,i,:),dy)
    end
    d2y
end



function Laplacian_1D!(d2y, y, dr, j)
    idx2 = 1.0/prod(dr)^2
    Nx   = size(y,1)
    @inbounds for i in 2:Nx-1
        d2y[i,j] += idx2*(y[i+1,j]-2*y[i,j]+y[i-1,j])
    end
    @inbounds d2y[1,j]   += idx2*(2.0*y[1,j]   - 5.0*y[2,j]     + 4.0*y[3,j]     - y[4,j])
    @inbounds d2y[end,j] += idx2*(2.0*y[end,j] - 5.0*y[end-1,j] + 4.0*y[end-2,j] - y[end-3,j])
end

function Laplacian_2Dv2!(d2y, y, dr)
    dx, dy     = dr
    Nx, Ny     = size(y)

    @inbounds for i in 1:Ny
        Laplacian_1D!(d2y, y, dx, i)
    end

    @inbounds for i in 1:Nx
        Laplacian_1D!(view(d2y,i,:),view(y,i,:),dy)
    end
    d2y
end



using LinearAlgebra
steps = (dx,dy)
let d2xy1 = zero(fi), d2xy2 = zero(fi)
    Lapla!(d2xy1, fi, steps)
    Laplacian_2D!(d2xy2,fi,steps)
    println(norm(d2xy1 - d2xy2))

    fill!(d2xy2, 0)
    Laplacian_2Dv2!(d2xy2,fi,steps)
    println(norm(d2xy1 - d2xy2))
end

using BenchmarkTools
let d2xy = zero(fi)
    @btime Lapla!($d2xy, $fi, $steps);
    @btime Laplacian_2D!($d2xy,$fi,$steps);
    @btime Laplacian_2Dv2!($d2xy,$fi,$steps);
    nothing
end

#3

Inlining the inner function gets rid of the view allocations. But there is still some time difference.


#4

Hi,

Thanks for the tips :slight_smile: I noticed the bug once posted, but that was actually a bad
copy-paste from my more elaborated and working code, where I had to do much cleaning and apparently did too much. The original algorithm was coded in fortran many years ago and has been flawlessly working since then, with no bugs :slight_smile:

Regarding the question itself, I actually didn’t want to use a different form for the laplacian function in 1D, 2D or 3D. I just wanted to use multiple dispatch to decide the case. Actually I use defined user-defined types for that

abstract type USER_DEFINED_TYPE end;

abstract type Numerical_Derivatives <: USER_DEFINED_TYPE end;

struct Hard_End_Points <: Numerical_Derivatives end;
struct Cartesian_Hard_End_Points    <: Numerical_Derivatives end;    
struct Cartesian_Hard_End_Points_1D <: Numerical_Derivatives end;    
struct Cartesian_Hard_End_Points_2D <: Numerical_Derivatives end;    
struct Cartesian_Hard_End_Points_3D <: Numerical_Derivatives end;    

so that the laplacian function is always called the same way

Lap(d2f,grid,dr,calc)

with

calc=Cartesian_Hard_Hend_Points_1D()

or

calc=Cartesian_Hard_Hend_Points_2D() 

etc. In this way (though I did not say anything like that before), adding a new parameter j as in your Laplacian_1D! would break that. More importantly, your version does not do what mine does. Your version requires the array d2y in Laplacian_1D! to be initialised to 0, or otherwise it will add the result of the laplacian to the already stored values in d2y. I say this because I already found myself these two improvements, but realised that this was not what the original Lapla routine is doing…

In any case, more that specifically optimising this, I believe the question is more general, regarding performance penalty when trying to write a function acting on arrays (as Laplacian_2D!) vs acting on individual elements as Laplacian_1D! or Lapla do). I can understand that a little of overhead must be there, but I was expecting that one could make the time of executions of both versions could be essentially the same -which is not in the present case.
I mean, maybe I can re-think the whole thing and code it in a clever but anti-natural way, but I’m not really interested on that, I believe that coding the laplacian of a 2D function in a 2D array is the most natural way of write things… not the smartest probably, but the simplest one.

Thanks a lot for your kind help :slight_smile:

Ferran.