Optimize a simple code for performance

Ferran_Mazzanti · January 19, 2019, 11:24am

Hi folks,
the other day I was discussing with a colleague about how to speed up things in Julia for performance, and was wondering if anybody could help me improve (so as to learn) the following code.
This is a minimal part of a code, where one is supposed to evaluate the laplacian in cartesian coordinates of a given function discretised in an array. I have here a grid, the tabulated function and the tabulated (analytical) laplacian

xmax    =  4.0
xmin    = -xmax
Nx      =  2001
x       = collect(range(xmin, stop=xmax,length=Nx))
dx      = x[2]-x[1];

ymax    =  3.0
ymin    = -ymax
Ny      =  2001
y       = collect(range(ymin, stop=ymax,length=Ny))
dy      = y[2]-y[1];

# Testing the 2D case

fi     = zeros(Nx,Ny);
lap_fi = zero(fi);
for i in 1:Nx
    xx = x[i]
    for j in 1:Ny
        yy = y[j]
        fi[i,j]     = exp(-0.1*(xx*xx+2.0*yy*yy))
        lap_fi[i,j] = fi[i,j]*(-0.6 + 0.04*xx*xx + 0.16*yy*yy)
    end
end
lap_fi

Then I wrote a function the good-old-fortran way… I mean, performing all the operations element-wise

function Lapla(y::Array{T}, dr::Array{T}) where {T}
    Nx, Ny    = size(y)
    idx2,idy2 = 1.0./(dr.*dr)
    d2y       = zero(y)
    @inbounds for i in 2:Nx-1
        @inbounds for j in 2:Ny-1
            d2y[i,j]  = idx2*(y[i+1,j] - 2.0*y[i,j] + y[i-1,j])
            d2y[i,j] += idy2*(y[i,j+1] - 2.0*y[i,j] + y[i,j-1])
        end
    end
    @inbounds for j in 2:Ny-1
        d2y[1,j]    = idy2*(y[1,j+1]       - 2.0*y[1,j]       + y[1,j-1]) +
                      idx2*(2.0*y[1,j]     - 5.0*y[2,j]       + 4.0*y[3,j]       - y[4,j])
        d2y[end,j]  = idy2*(y[end,j+1]     - 2.0*y[end,j]     + y[end,j-1]) +
                      idx2*(2.0*y[end,j]   - 5.0*y[end-1,j]   + 4.0*y[end-2,j]   - y[end-3,j])
    end
    @inbounds for j in 2:Nx-1
        d2y[j,1]    = idx2*(y[j+1,1]       - 2.0*y[j,1]       + y[j-1,1]) + 
                      idy2*(2.0*y[j,1]     - 5.0*y[j,2]       + 4.0*y[j,3]       - y[j,4])
        d2y[j,end]  = idx2*(y[j+1,end]     - 2.0*y[j,end]     + y[j-1,end]) + 
                      idy2*(2.0*y[j,end]   - 5.0*y[j,end-1] + 4.0*y[j,end-2]   - y[j,end-3])
    end
    d2y[1,1]        = idy2*(2.0*y[1,1]     - 5.0*y[1,2]       + 4.0*y[1,3]       - y[1,4]) +
                      idx2*(2.0*y[1,1]     - 5.0*y[2,1]       + 4.0*y[3,1]       - y[4,1])
    d2y[1,end]      = idy2*(2.0*y[1,end]   - 5.0*y[1,end-1]   + 4.0*y[1,end-2]   - y[1,end-3]) + 
                      idx2*(2.0*y[1,end]   - 5.0*y[2,end]     + 4.0*y[3,end]     - y[4,end])
    d2y[end,1]      = idy2*(2.0*y[end,1]   - 5.0*y[end,2]     + 4.0*y[end,3]     - y[end,4]) + 
                      idx2*(2.0*y[end,1]   - 5.0*y[end-1,1]   + 4.0*y[end-2,1]   - y[end-3,1])
    d2y[end,end]    = idy2*(2.0*y[end,end] - 5.0*y[end,end-1] + 4.0*y[end,end-2] - y[end,end-3]) +
                      idx2*(y[end,end]     - 5.0*y[end-1,end] + 4.0*y[end-2,end] - y[end-3,end])
    d2y
end;

with the result

@btime Lapla(fi,[dx,dy])
>  90.100 ms (4 allocations: 30.55 MiB)

…so quite fast with only 4 allocations. I’m happy with the result… But one of the reasons why I moved to Julia was to avoid having to write code like this, and use functions that operate on vectors/arrays directly.
I then wrote the following code

function Laplacian_1D!(d2y, y, dr)
    idx2 = 1.0/prod(dr)^2
    Nx   = size(y,1)   
    @inbounds for i in 2:Nx-1
        d2y[i] = idx2*(y[i+1]-2*y[i]+y[i-1])
    end
    d2y[1]     = idx2*(2.0*y[1]   - 5.0*y[2]     + 4.0*y[3]     - y[4])
    d2y[end]   = idx2*(2.0*y[end] - 5.0*y[end-1] + 4.0*y[end-2] - y[end-3])
end;    

function Laplacian_2D!(d2y, y, dr)
    dx, dy     = dr
    # idx2, idy2 = 1.0./(dr.*dr)
    Nx, Ny     = size(y)
    aux        = zero(d2y)
    @inbounds for i in 1:Ny
        Laplacian_1D!(view(aux,:,i),view(y,:,i),dx)
    end
    @inbounds for i in 1:Nx
        Laplacian_1D!(view(d2y,i,:),view(y,i,:),dy)
    end
    d2y .+= aux
end;

which does the laplacian in 1 dimension, and uses this function to compute the laplacian in two dimensions. Then

d2y = zero(fi[:,1]);
@btime Laplacian_1D!(d2y,view(fi,:,1),dx)
>   1.688 μs (3 allocations: 80 bytes)

with 3 allocations (nice!), and

d2xy = zero(fi)
@btime Laplacian_2D!(d2xy,view(fi,:,:),[dx,dy])
>   100.073 ms (8010 allocations: 30.91 MiB)

which is similar in terms of computing time, though a 10% worse, probably(?) but to allocations.

Now the question is: can the second way (with Laplacian_1D! and Laplacian_2D!) be modified slightly to perform as the Lapla function above? Is the speed & allocation penalty unavoidable without having to re-think the whole problem?
I keep wondering if I must code everything element wise as in Lapla all the time… which I’d like to avoid.

Thanks for your help,

Ferran.

ffevotte · January 19, 2019, 9:29pm

First, you’ve got a small bug in your Lapla implementation: the last value is computed incorrectly. Although bugs of this nature are difficult to avoid, I guess this says something about the balance between code quality/correctness vs sheer speed.

Fixing this makes both your implementations yield the same results, as checked by a test like the following:

julia> using LinearAlgebra;
julia> steps = (dx,dy);
julia> let d2xy = zero(fi)
           norm(Lapla(fi, steps) - Laplacian_2D!(d2xy2,fi,steps))
       end

0.0

Now that both implementations produce the same results, optimization can begin.Starting with your code, I get these reference timings on my machine:

julia> let d2xy = zero(fi)
           @btime Lapla($fi, $steps);
           @btime Laplacian_2D!($d2xy,$fi,$steps);
           nothing
       end
  24.429 ms (2 allocations: 30.55 MiB)
  31.646 ms (8006 allocations: 30.91 MiB)

This is not really fair, because Laplacian_2D! operates in-place and does not have to allocate space to store its results, like Lapla does. Fixing Lapla so that it operates in-place widens the performance gap between both versions: the overhead of the second version went from ~30% to ~50%.

julia> let d2xy = zero(fi)
           @btime Lapla!($d2xy,$fi, $steps);
           @btime Laplacian_2D!($d2xy,$fi,$steps);
           nothing
       end
  20.570 ms (0 allocations: 0 bytes)
  31.763 ms (8006 allocations: 30.91 MiB)

Since the Laplacian_2D! version is meant to operate in place, it should also not allocate so much memory. This can be fixed by getting rid of the aux vector:

function Laplacian_1D!(d2y, y, dr)
    idx2 = 1.0/prod(dr)^2
    Nx   = size(y,1)
    @inbounds for i in 2:Nx-1
        d2y[i] += idx2*(y[i+1]-2*y[i]+y[i-1])
    end
    @inbounds d2y[1]   += idx2*(2.0*y[1]   - 5.0*y[2]     + 4.0*y[3]     - y[4])
    @inbounds d2y[end] += idx2*(2.0*y[end] - 5.0*y[end-1] + 4.0*y[end-2] - y[end-3])
end

function Laplacian_2D!(d2y, y, dr)
    dx, dy     = dr
    Nx, Ny     = size(y)

    @inbounds for i in 1:Ny
        Laplacian_1D!(view(d2y,:,i),view(y,:,i),dx)
    end

    @inbounds for i in 1:Nx
        Laplacian_1D!(view(d2y,i,:),view(y,i,:),dy)
    end
    d2y
end

julia> let d2xy = zero(fi)
           @btime Lapla!($d2xy, $fi, $steps);
           @btime Laplacian_2D!($d2xy,$fi,$steps);
           nothing
       end
  19.764 ms (0 allocations: 0 bytes)
  26.161 ms (8004 allocations: 375.19 KiB)

It does not get rid of all allocations, but the memory usage is more reasonable. And we’re back to ~30% overhead.

The remaining 8k allocations are probably due to the creation of temporary objects with the view. (I think there are plans to get rid of them in future Julia versions). I don’t think these allocations explain the performance loss, but let’s make sure of this: by slightlmy modifying Laplacian_1D! to get an additional index, we can make it operate on 2D arrays:

function Laplacian_1D!(d2y, y, dr, j)
    idx2 = 1.0/prod(dr)^2
    Nx   = size(y,1)
    @inbounds for i in 2:Nx-1
        d2y[i,j] += idx2*(y[i+1,j]-2*y[i,j]+y[i-1,j])
    end
    @inbounds d2y[1,j]   += idx2*(2.0*y[1,j]   - 5.0*y[2,j]     + 4.0*y[3,j]     - y[4,j])
    @inbounds d2y[end,j] += idx2*(2.0*y[end,j] - 5.0*y[end-1,j] + 4.0*y[end-2,j] - y[end-3,j])
end

function Laplacian_2Dv2!(d2y, y, dr)
    dx, dy     = dr
    Nx, Ny     = size(y)

    @inbounds for i in 1:Ny
        Laplacian_1D!(d2y, y, dx, i)
    end

    @inbounds for i in 1:Nx
        Laplacian_1D!(view(d2y,i,:),view(y,i,:),dy)
    end
    d2y
end

julia> let d2xy = zero(fi)
           @btime Lapla!($d2xy, $fi, $steps);
           @btime Laplacian_2D!($d2xy,$fi,$steps);
           @btime Laplacian_2Dv2!($d2xy,$fi,$steps);
           nothing
       end
  19.526 ms (0 allocations: 0 bytes)
  25.246 ms (8004 allocations: 375.19 KiB)
  26.015 ms (4002 allocations: 187.59 KiB)

Doing this only on the first loop, we get half as many allocations, but no significant impact on the run time.

I think the performance issues of your second version are actually related to the memory access pattern: the first loop is very fast because it accesses memory in the correct order, while the second loop accesses memory in a very inefficient pattern. In the Lapla version, the inefficient memory accesses are interspersed among correct ones, and I would guess they are somehow handled more efficiently.

I think a Julian way to rewrite the Lapla version in a more readable way could involve CartesianIndices (see for example this blog post), but I’m not sure how…

Complete code

fi, lap_fi, dx, dy = let
    xmax    =  4.0
    xmin    = -xmax
    Nx      =  2001
    x       = collect(range(xmin, stop=xmax,length=Nx))
    dx      = x[2]-x[1];

    ymax    =  3.0
    ymin    = -ymax
    Ny      =  2001
    y       = collect(range(ymin, stop=ymax,length=Ny))
    dy      = y[2]-y[1];

    # Testing the 2D case

    fi     = zeros(Nx,Ny);
    lap_fi = zero(fi);
    for i in 1:Nx
        xx = x[i]
        for j in 1:Ny
            yy = y[j]
            fi[i,j]     = exp(-0.1*(xx*xx+2.0*yy*yy))
            lap_fi[i,j] = fi[i,j]*(-0.6 + 0.04*xx*xx + 0.16*yy*yy)
        end
    end
    fi, lap_fi, dx, dy
end

function Lapla!(d2y, y::Array{T}, dr) where {T}
    Nx, Ny    = size(y)
    idx2,idy2 = 1.0./(dr.*dr)
    @inbounds for i in 2:Nx-1
        @inbounds for j in 2:Ny-1
            d2y[i,j]  = idx2*(y[i+1,j] - 2.0*y[i,j] + y[i-1,j])
            d2y[i,j] += idy2*(y[i,j+1] - 2.0*y[i,j] + y[i,j-1])
        end
    end
    @inbounds for j in 2:Ny-1
        d2y[1,j]    = idy2*(y[1,j+1]       - 2.0*y[1,j]       + y[1,j-1]) +
                      idx2*(2.0*y[1,j]     - 5.0*y[2,j]       + 4.0*y[3,j]       - y[4,j])
        d2y[end,j]  = idy2*(y[end,j+1]     - 2.0*y[end,j]     + y[end,j-1]) +
                      idx2*(2.0*y[end,j]   - 5.0*y[end-1,j]   + 4.0*y[end-2,j]   - y[end-3,j])
    end
    @inbounds for j in 2:Nx-1
        d2y[j,1]    = idx2*(y[j+1,1]       - 2.0*y[j,1]       + y[j-1,1]) +
                      idy2*(2.0*y[j,1]     - 5.0*y[j,2]       + 4.0*y[j,3]       - y[j,4])
        d2y[j,end]  = idx2*(y[j+1,end]     - 2.0*y[j,end]     + y[j-1,end]) +
                      idy2*(2.0*y[j,end]   - 5.0*y[j,end-1] + 4.0*y[j,end-2]   - y[j,end-3])
    end
    d2y[1,1]        = idy2*(2.0*y[1,1]     - 5.0*y[1,2]       + 4.0*y[1,3]       - y[1,4]) +
                      idx2*(2.0*y[1,1]     - 5.0*y[2,1]       + 4.0*y[3,1]       - y[4,1])
    d2y[1,end]      = idy2*(2.0*y[1,end]   - 5.0*y[1,end-1]   + 4.0*y[1,end-2]   - y[1,end-3]) +
                      idx2*(2.0*y[1,end]   - 5.0*y[2,end]     + 4.0*y[3,end]     - y[4,end])
    d2y[end,1]      = idy2*(2.0*y[end,1]   - 5.0*y[end,2]     + 4.0*y[end,3]     - y[end,4]) +
                      idx2*(2.0*y[end,1]   - 5.0*y[end-1,1]   + 4.0*y[end-2,1]   - y[end-3,1])
    d2y[end,end]    = idy2*(2.0*y[end,end] - 5.0*y[end,end-1] + 4.0*y[end,end-2] - y[end,end-3]) +
                      idx2*(2.0*y[end,end] - 5.0*y[end-1,end] + 4.0*y[end-2,end] - y[end-3,end])
    d2y
end

function Laplacian_1D!(d2y, y, dr)
    idx2 = 1.0/prod(dr)^2
    Nx   = size(y,1)
    @inbounds for i in 2:Nx-1
        d2y[i] += idx2*(y[i+1]-2*y[i]+y[i-1])
    end
    @inbounds d2y[1]   += idx2*(2.0*y[1]   - 5.0*y[2]     + 4.0*y[3]     - y[4])
    @inbounds d2y[end] += idx2*(2.0*y[end] - 5.0*y[end-1] + 4.0*y[end-2] - y[end-3])
end

function Laplacian_2D!(d2y, y, dr)
    dx, dy     = dr
    Nx, Ny     = size(y)

    @inbounds for i in 1:Ny
        Laplacian_1D!(view(d2y,:,i),view(y,:,i),dx)
    end

    @inbounds for i in 1:Nx
        Laplacian_1D!(view(d2y,i,:),view(y,i,:),dy)
    end
    d2y
end



function Laplacian_1D!(d2y, y, dr, j)
    idx2 = 1.0/prod(dr)^2
    Nx   = size(y,1)
    @inbounds for i in 2:Nx-1
        d2y[i,j] += idx2*(y[i+1,j]-2*y[i,j]+y[i-1,j])
    end
    @inbounds d2y[1,j]   += idx2*(2.0*y[1,j]   - 5.0*y[2,j]     + 4.0*y[3,j]     - y[4,j])
    @inbounds d2y[end,j] += idx2*(2.0*y[end,j] - 5.0*y[end-1,j] + 4.0*y[end-2,j] - y[end-3,j])
end

function Laplacian_2Dv2!(d2y, y, dr)
    dx, dy     = dr
    Nx, Ny     = size(y)

    @inbounds for i in 1:Ny
        Laplacian_1D!(d2y, y, dx, i)
    end

    @inbounds for i in 1:Nx
        Laplacian_1D!(view(d2y,i,:),view(y,i,:),dy)
    end
    d2y
end



using LinearAlgebra
steps = (dx,dy)
let d2xy1 = zero(fi), d2xy2 = zero(fi)
    Lapla!(d2xy1, fi, steps)
    Laplacian_2D!(d2xy2,fi,steps)
    println(norm(d2xy1 - d2xy2))

    fill!(d2xy2, 0)
    Laplacian_2Dv2!(d2xy2,fi,steps)
    println(norm(d2xy1 - d2xy2))
end

using BenchmarkTools
let d2xy = zero(fi)
    @btime Lapla!($d2xy, $fi, $steps);
    @btime Laplacian_2D!($d2xy,$fi,$steps);
    @btime Laplacian_2Dv2!($d2xy,$fi,$steps);
    nothing
end

mohamed82008 · January 19, 2019, 9:42pm

Inlining the inner function gets rid of the view allocations. But there is still some time difference.

Ferran_Mazzanti · January 20, 2019, 9:36am

Hi,

Thanks for the tips I noticed the bug once posted, but that was actually a bad
copy-paste from my more elaborated and working code, where I had to do much cleaning and apparently did too much. The original algorithm was coded in fortran many years ago and has been flawlessly working since then, with no bugs

Regarding the question itself, I actually didn’t want to use a different form for the laplacian function in 1D, 2D or 3D. I just wanted to use multiple dispatch to decide the case. Actually I use defined user-defined types for that

abstract type USER_DEFINED_TYPE end;

abstract type Numerical_Derivatives <: USER_DEFINED_TYPE end;

struct Hard_End_Points <: Numerical_Derivatives end;
struct Cartesian_Hard_End_Points    <: Numerical_Derivatives end;    
struct Cartesian_Hard_End_Points_1D <: Numerical_Derivatives end;    
struct Cartesian_Hard_End_Points_2D <: Numerical_Derivatives end;    
struct Cartesian_Hard_End_Points_3D <: Numerical_Derivatives end;

so that the laplacian function is always called the same way

Lap(d2f,grid,dr,calc)

with

calc=Cartesian_Hard_Hend_Points_1D()

or

calc=Cartesian_Hard_Hend_Points_2D()

etc. In this way (though I did not say anything like that before), adding a new parameter j as in your Laplacian_1D! would break that. More importantly, your version does not do what mine does. Your version requires the array d2y in Laplacian_1D! to be initialised to 0, or otherwise it will add the result of the laplacian to the already stored values in d2y. I say this because I already found myself these two improvements, but realised that this was not what the original Lapla routine is doing…

In any case, more that specifically optimising this, I believe the question is more general, regarding performance penalty when trying to write a function acting on arrays (as Laplacian_2D!) vs acting on individual elements as Laplacian_1D! or Lapla do). I can understand that a little of overhead must be there, but I was expecting that one could make the time of executions of both versions could be essentially the same -which is not in the present case.
I mean, maybe I can re-think the whole thing and code it in a clever but anti-natural way, but I’m not really interested on that, I believe that coding the laplacian of a 2D function in a 2D array is the most natural way of write things… not the smartest probably, but the simplest one.

Thanks a lot for your kind help

Ferran.

Topic		Replies	Views
optimising for loop performance Performance	14	3013	July 8, 2018
Optimize benchmark-code Performance question	11	670	December 10, 2019
Performance & Profiling Tips for Beginner Code Performance performance , profiling	14	1693	June 26, 2023
Performance input on my code Performance	8	837	October 15, 2018
Optimizing code with array assignments Performance	12	415	June 15, 2023

Optimize a simple code for performance

Related topics