Performance & Profiling Tips for Beginner Code

physh · June 20, 2023, 2:56am

Dear all, I have been using Julia for a few months now, writing code mostly without worrying too much about performance (profiling very briefly with @time/@btime to make sure nothing too horribly is happening though).

Intro:
But now, I feel the time has come to improve my knowledge on performance, so that I can write faster and more efficient code in the future directly from the get-go, without profiling a lot afterwards. As I believe that learning is best by doing an example, I humbly ask for your help in improving my code. For that I present below a typical code of mine which builds matrices and subsequently calculates the eigenvalues and -functions. Here the matrix elements basically stem from some complex Gaussian integrals. My aim is not to over-optimizie this very code, but it is an excerpt of one of my actual codes (I have changed all variable names) and it representy my usual way of coding. In this sense it is not an urgent request for help, but rather to learn some new techniques.

If you feel that this question should be moved to the “New to Julia” subforum, please feel free to do so.

Code:
(the file is called PerformanceTest.jl)

using LinearAlgebra

# Script is below

## Functions:
function mat_xyz(var_params, d_arr, abcvals; mu_g=1.0, prefac=-10.0, theta = 0.0)

    (;s1,snmax,nmax,S1,SNmax,Nmax) = var_params;
    (;avals,bvals,cvals) = abcvals;
    
    # Initialization:
    dims = nmax*Nmax;
    X  = zeros(ComplexF64,dims,dims) # create only marix W=X+Y and add both X and Y directly?
    Y  = zeros(ComplexF64,dims,dims)
    Z  = zeros(ComplexF64,dims,dims)

    # Gaussian widths nu:
    nu_arr=zeros(nmax);nu_arr[1] = 1/s1^2;
    if nmax > 1; nu_arr[2:nmax] = @. 1/s1^2 * (s1/snmax)^(2*((2:nmax)-1)/(nmax-1)); end
    NU_arr=zeros(Nmax);NU_arr[1] = 1/S1^2;
    if Nmax > 1; NU_arr[2:Nmax] = @. 1/S1^2 * (S1/SNmax)^(2*((2:Nmax)-1)/(Nmax-1)); end

    # Norm:
    norm(nu) = 2*(2*nu)^(3/4)/pi^(1/4);

    # Complex factor:
    cfac = exp(-2*theta*pi/180*im);
    nu_arr = nu_arr * cfac; NU_arr = NU_arr * cfac;

    # Calculations of matrices:
    for a=avals
        for b=bvals

    # fill only lower triangular part due to symmetry
    for i=1:dims
        n=div(i-1,Nmax)+1;
        N=rem(i-1,Nmax)+1;
        for j=1:i
            nprime=div(j-1,Nmax)+1;
            Nprime=rem(j-1,Nmax)+1;
            
            if j>i
                continue
            end

            nua = nu_arr[n]';
            nub = nu_arr[nprime];
            NUa = NU_arr[N]';
            NUb = NU_arr[Nprime];
            nus = (nua=nua,nub=nub,NUa=NUa,NUb=NUb);

            norma = norm(nua)';
            normb = norm(nub);
            Norma = norm(NUa)';
            Normb = norm(NUb);
            norms = (norma=norma,normb=normb,Norma=Norma,Normb=Normb);

            Z_temp = mat_z(a,b,d_arr,nus,norms); # for reusing in mat_x
            Z[i,j] = Z[i,j] + Z_temp;
            X[i,j] = X[i,j] + mat_x(a,b,d_arr,nus,norms;mat_z_element=Z_temp);

            for c=cvals
                Y[i,j] = Y[i,j] + mat_y(a,b,c,d_arr,nus,norms,prefac,mu_g);
            end

        end
    end # loop-end over i,j = dims x dims
        end
    end  # loop-end over a,b

    Z = Z + tril(Z,-1)' #Symmetric(Z, :L)
    X = X + transpose(tril(X,-1));
    Y = Y + transpose(tril(Y,-1)) # transpose instead of ' ! (we dont want adjoint...)
    
    return X,Y,Z
end


function mat_z(a::Int64,b::Int64,d_arr::Array{Float64},nus,norms)

    (;norma,normb,Norma,Normb) = norms;
    (;nua,nub,NUa,NUb) = nus;

    Ma = [[nua',0] [0,NUa']];
    Mb = [[nub,0] [0,NUb]];

    Mab = jtrafo(a,b,d_arr)'*Ma*jtrafo(a,b,d_arr) + Mb;

    element = norma*normb*Norma*Normb/sqrt(4*pi)^4 * (pi^2/det(Mab))^(3/2);
    return element
end


function mat_x(a::Int64,b::Int64,d_arr::Array{Float64},nus,norms;mat_z_element=-1.0)

    (;norma,normb,Norma,Normb) = norms;
    (;nua,nub,NUa,NUb) = nus;

    if mat_z_element == -1.0
        mat_z_element = mat_z(a,b,d_arr,nus,norms)
        println("mat_z_element should not be -1 !")
    end

    Ma = [[nua',0] [0,NUa']];
    Mb = [[nub,0] [0,NUb]];

    Mab = jtrafo(a,b,d_arr)'*Ma*jtrafo(a,b,d_arr) + Mb;
    eta = det(Mab)/Mab[2,2];
    zeta = det(Mab)/Mab[1,1];

    da = d_arr[mod(b+1,3)+1]; # b-1; maps 1,2,3 onto 3,1,2.
    db = d_arr[b];
    dc = d_arr[mod(b,3)+1]; # b+1; maps 1,2,3 onto 2,3,1
    dur = dc*da/(dc+da);
    duR = db*(da+dc)/(da+db+dc);

    element = mat_z_element*(3*nub/dur*(1-nub/eta) + 3*NUb/duR*(1-NUb/zeta));
    return element
end


function mat_y(a::Int64,b::Int64,c::Int64,d_arr::Array{Float64},nus,norms,prefac::Float64,mu_g)

    (;norma,normb,Norma,Normb) = norms;
    (;nua,nub,NUa,NUb) = nus;
    Ma = [[nua',0] [0,NUa']];
    Mb = [[nub,0] [0,NUb]];
    Mc = [[mu_g,0.0] [0.0,0.0]]

    Macb = jtrafo(a,c,d_arr)'*Ma*jtrafo(a,c,d_arr) + jtrafo(b,c,d_arr)'*Mb*jtrafo(b,c,d_arr) + Mc;

    element = prefac*norma*normb*Norma*Normb/sqrt(4*pi)^4 * (pi^2/det(Macb))^(3/2);
    return element;
end


# Function for Variable-transformation
function jtrafo(i::Int64,f::Int64,d_arr::Array{Float64})
    da = d_arr[i];db = d_arr[mod(i,3)+1];dc = d_arr[mod(i+1,3)+1]
    j = Array{Float64}(undef,2,2)
    if f==i
        j[1,1]=1.0;j[2,2]=1.0;
        j[1,2]=0.0;j[2,1]=0.0;
    elseif mod(f-i,3)==1    # trafo a->b (+1)
        j[1,1]=-da/(da+dc)
        j[1,2]=1.0
        j[2,1]=-dc*(da+db+dc)/(da+dc)/(db+dc);
        j[2,2]=-db/(db+dc)
    elseif mod(f-i,3)==2    # trafo a->c (+2)
        j[1,1]=-da/(da+db)
        j[1,2]=-1.0
        j[2,1]=+db*(da+db+dc)/(da+db)/(db+dc);
        j[2,2]=-dc/(db+dc);
    end
    return j    
end

## End of function definitions

# Beginning of script:
var_params = (; s1=0.9,snmax=69.5,nmax=20,S1=0.9,SNmax=69.5,Nmax=20);
d_arr = [4.0 81.0 81.0];
abcvals = (;avals=2:2,bvals=2:3,cvals=2:3);
mu_g = 1.0;
prefac = -10.0;
theta = 10.0;

X,Y,Z = mat_xyz(var_params, d_arr, abcvals; mu_g=1.0, prefac=-10.0, theta = 0.0);

evals,evecs = eigen(X+Y,Z);
return nothing
#End of script

Profiling attempts

Currently @btime include("PerformanceTest.jl") yields almost exactly 1 second of runtime: 998.592 ms (12749484 allocations: 1.34 GiB).
With the help of some @time commands, I see that most of the time is spent in the big loops over i,j (which was expected I guess).
I have tried playing around using explicit ::DataTypes in the function arguments, but this has not really changed anything
I have used @code_warntype for the functions mat_x, mat_y, mat_z, and jtrafo without any DataType being marked in red which according to the documentation should reveal any problems with DataTypes.
I have used ProfileView.@profview X,Y,Z = mat_xyz(var_params,d_arr,abcvals;mu_g = 1.0,theta=5.0,prefac=-10.0). I attach the saved results of a few runs to this post. Apart from the first run (which expectedly looks totally different), I dont see any red parts, but several yellow ones on the top (which if I understand the documentation correctly should also indicate some unoptimal parts). However I have a hard time understanding if there is indeed a problem and what it is exactly.

Conclusion:
I have done some profiling, but to be honest, from the results that I see it is very hard to understand (for me!) whether there is actually much improvement to be found or not. Unfortunately I am not able to write similar code in C, or Fortran to quickly test against this implementation. Understanding where the bottleneck is and what should be done against it, are then also the next question.

Thank you!

mkitti · June 20, 2023, 4:49am

I have not looked too thoroughly through this yet, but my sense is that further optimization is going to come from managing memory allocations. Essentially, the idea is to figure out if you can reuse existing allocated memory rather than allocating new memory everytime. Preferably you would allocate all memory before the for loops, and then use that repeatedly.

In particular, take a look at the memory allocation profiler:
https://docs.julialang.org/en/v1/manual/profile/#Memory-allocation-analysis

Salmon · June 20, 2023, 8:07am

It is often wise not to think too much about performance when you first write code. Make sure it is as simple as possible so that it runs correctly. Then, making performance improvements while keeping sure the results stay correct is much easier and from my experience also faster overall.

Any way, there are a few things that are probably worth doing by default. As mentioned by mkitti, it is often a good idea to avoid doing unnecessarily many small allocations. Your profiler shows yellow bars, which means that there is time being spent with garbage collection. Sometimes that is hard to avoid, but here it looks like there might be low-hanging fruit:

    Ma = [[nua',0] [0,NUa']];

Lines like this allocate more than they need. If I understand correctly, you want to create a 2x2 matrix (?). However this object is an array of arrays. Small arrays where the size is known at compile time like that should usually be made by using StaticArrays which will avoid allocations and also have many optimized routines for linear algebra.
Similarly the function

function jtrafo(i::Int64,f::Int64,d_arr::Array{Float64})
    da = d_arr[i];db = d_arr[mod(i,3)+1];dc = d_arr[mod(i+1,3)+1]
    j = Array{Float64}(undef,2,2)
    if f==i
        j[1,1]=1.0;j[2,2]=1.0;
        j[1,2]=0.0;j[2,1]=0.0;
    ...

could be made much more efficient by switching to static array instead, i.e.

function jtrafo(i::Int64,f::Int64,d_arr::Array{Float64})
    da = d_arr[i]
    db = d_arr[mod(i,3)+1]
    dc = d_arr[mod(i+1,3)+1]

    if f==i
        return SA[ #shorthand for static array
            1.0 0.0;
            0.0 1.0
        ]
    ...

by the way in Julia there is no need to use ; at the end of a line

Personally, I’d say if the code is type-stable and does not spend much time in the garbage collector it is often quite close to optimal. There is no harm in using a profiler to find performance bottlenecks and spend more time optimizing those with more effort, but these kinds of optimization attempts should only be done if you know it is actually going to make a difference.

Salmon · June 20, 2023, 8:29am

Here is an example of what you can get by using static arrays:

julia> @time XOLD,YOLD,ZOLD = mat_xyz(var_params,d_arr,abcvals;mu_g = 1.0,theta=5.0,prefac=-10.0); #old code
  1.173104 seconds (12.51 M allocations: 1.310 GiB, 11.40% gc time)

julia> @time X,Y,Z = mat_xyz(var_params,d_arr,abcvals;mu_g = 1.0,theta=5.0,prefac=-10.0);
  0.177127 seconds (28 allocations: 21.975 MiB) # new code
julia> X ≈ XOLD && Y ≈ YOLD &&  Z ≈ ZOLD
true

New code using static arrays

using LinearAlgebra,StaticArrays

# Script is below

## Functions:
function mat_xyz(var_params, d_arr, abcvals; mu_g=1.0, prefac=-10.0, theta = 0.0)

    (;s1,snmax,nmax,S1,SNmax,Nmax) = var_params;
    (;avals,bvals,cvals) = abcvals;
    
    # Initialization:
    dims = nmax*Nmax;
    X  = zeros(ComplexF64,dims,dims) # create only marix W=X+Y and add both X and Y directly?
    Y  = zeros(ComplexF64,dims,dims)
    Z  = zeros(ComplexF64,dims,dims)

    # Gaussian widths nu:
    nu_arr=zeros(nmax);nu_arr[1] = 1/s1^2;
    if nmax > 1; nu_arr[2:nmax] = @. 1/s1^2 * (s1/snmax)^(2*((2:nmax)-1)/(nmax-1)); end
    NU_arr=zeros(Nmax);NU_arr[1] = 1/S1^2;
    if Nmax > 1; NU_arr[2:Nmax] = @. 1/S1^2 * (S1/SNmax)^(2*((2:Nmax)-1)/(Nmax-1)); end

    # Norm:
    norm(nu) = 2*(2*nu)^(3/4)/pi^(1/4);

    # Complex factor:
    cfac = exp(-2*theta*pi/180*im);
    nu_arr = nu_arr * cfac; NU_arr = NU_arr * cfac;

    # Calculations of matrices:
    for a=avals
        for b=bvals

    # fill only lower triangular part due to symmetry
    for i=1:dims
        n=div(i-1,Nmax)+1;
        N=rem(i-1,Nmax)+1;
        for j=1:i
            nprime=div(j-1,Nmax)+1;
            Nprime=rem(j-1,Nmax)+1;
            
            if j>i
                continue
            end

            nua = nu_arr[n]';
            nub = nu_arr[nprime];
            NUa = NU_arr[N]';
            NUb = NU_arr[Nprime];
            nus = (nua=nua,nub=nub,NUa=NUa,NUb=NUb);

            norma = norm(nua)';
            normb = norm(nub);
            Norma = norm(NUa)';
            Normb = norm(NUb);
            norms = (norma=norma,normb=normb,Norma=Norma,Normb=Normb);

            Z_temp = mat_z(a,b,d_arr,nus,norms); # for reusing in mat_x
            Z[i,j] = Z[i,j] + Z_temp;
            X[i,j] = X[i,j] + mat_x(a,b,d_arr,nus,norms;mat_z_element=Z_temp);

            for c=cvals
                Y[i,j] = Y[i,j] + mat_y(a,b,c,d_arr,nus,norms,prefac,mu_g);
            end

        end
    end # loop-end over i,j = dims x dims
        end
    end  # loop-end over a,b

    Z = Z + tril(Z,-1)' #Symmetric(Z, :L)
    X = X + transpose(tril(X,-1));
    Y = Y + transpose(tril(Y,-1)) # transpose instead of ' ! (we dont want adjoint...)
    
    return X,Y,Z
end


function mat_z(a::Int64,b::Int64,d_arr::Array{Float64},nus,norms)

    (;norma,normb,Norma,Normb) = norms;
    (;nua,nub,NUa,NUb) = nus;

    Ma = SA[nua' 0 ; 0 NUa']
    Mb = SA[nub 0 ; 0 NUb]

    Mab = jtrafo(a,b,d_arr)'*Ma*jtrafo(a,b,d_arr) + Mb;

    element = norma*normb*Norma*Normb/sqrt(4*pi)^4 * (pi^2/det(Mab))^(3/2);
    return element
end


function mat_x(a::Int64,b::Int64,d_arr::Array{Float64},nus,norms;mat_z_element=-1.0)

    (;norma,normb,Norma,Normb) = norms;
    (;nua,nub,NUa,NUb) = nus;

    if mat_z_element == -1.0
        mat_z_element = mat_z(a,b,d_arr,nus,norms)
        println("mat_z_element should not be -1 !")
    end

    Ma = SA[nua' 0 ; 0 NUa'];
    Mb = SA[nub 0 ; 0 NUb];

    Mab = jtrafo(a,b,d_arr)'*Ma*jtrafo(a,b,d_arr) + Mb;
    eta = det(Mab)/Mab[2,2];
    zeta = det(Mab)/Mab[1,1];

    da = d_arr[mod(b+1,3)+1]; # b-1; maps 1,2,3 onto 3,1,2.
    db = d_arr[b];
    dc = d_arr[mod(b,3)+1]; # b+1; maps 1,2,3 onto 2,3,1
    dur = dc*da/(dc+da);
    duR = db*(da+dc)/(da+db+dc);

    element = mat_z_element*(3*nub/dur*(1-nub/eta) + 3*NUb/duR*(1-NUb/zeta));
    return element
end


function mat_y(a::Int64,b::Int64,c::Int64,d_arr::Array{Float64},nus,norms,prefac::Float64,mu_g)

    (;norma,normb,Norma,Normb) = norms;
    (;nua,nub,NUa,NUb) = nus;
    Ma = SA[nua' 0 ;0 NUa']
    Mb = SA[nub 0 ;0 NUb]
    Mc = SA[mu_g 0.0 ;0.0 0.0]

    Macb = jtrafo(a,c,d_arr)'*Ma*jtrafo(a,c,d_arr) + jtrafo(b,c,d_arr)'*Mb*jtrafo(b,c,d_arr) + Mc

    element = prefac*norma*normb*Norma*Normb/sqrt(4*pi)^4 * (pi^2/det(Macb))^(3/2);
    return element;
end


# Function for Variable-transformation
function jtrafo(i::Int64,f::Int64,d_arr::Array{Float64})
    da = d_arr[i];db = d_arr[mod(i,3)+1];dc = d_arr[mod(i+1,3)+1]
    if f==i
        return SA[1.0 0.0;0.0 1.0]

    elseif mod(f-i,3)==1    # trafo a->b (+1)
        return SA[-da/(da+dc) 1.0;
        -dc*(da+db+dc)/(da+dc)/(db+dc) -db/(db+dc)]

    elseif mod(f-i,3)==2    # trafo a->c (+2)
        return SA[-da/(da+db) -1.0; +db*(da+db+dc)/(da+db)/(db+dc) -dc/(db+dc)]
    end
    error("something went wrong in jtrafo")
end

## End of function definitions

# Beginning of script:
var_params = (; s1=0.9,snmax=69.5,nmax=20,S1=0.9,SNmax=69.5,Nmax=20);
d_arr = [4.0 81.0 81.0];
abcvals = (;avals=2:2,bvals=2:3,cvals=2:3);
mu_g = 1.0;
prefac = -10.0;
theta = 10.0;

X,Y,Z = mat_xyz(var_params, d_arr, abcvals; mu_g=1.0, prefac=-10.0, theta = 0.0);

evals,evecs = eigen(X+Y,Z);
return nothing
#End of script

nsajko · June 20, 2023, 9:21am

In some places you use this expression, however it is mathematically equivalent to 16 * pi^2, which is surely more accurate. In fact the latter expression happens to evaluate to the correctly rounded Float64 representation of 16 \pi^2.

In some places you use constructions like A = A + B. The problem with this is that A + B needs to be allocated. You could avoid the allocation by doing something like this instead A .+= B. Also see this section of Performance Tips in the Manual.

I like to do profiling in the (VS) Code editor with the Julia extension, with the commands @profview and @profview_allocs. Usually no additional packages are necessary for the profiling as the Julia extension bundles relevant code. You get an interactive (try hovering over the elements, clicking on them, etc.) flamegraph, and can also open files with highlighted costly lines of code (I definitely recommend this). Just don’t forget to run the profiling one time in advance, just to allow Julia to compile all functions.

Also see the Julia extension docs for profiling: Profiler · Julia in VS Code

nsajko · June 20, 2023, 9:27am

It’s always a good idea to profile code. Otherwise, without measurements and visualizations, one might potentially misunderstand the performance behavior and spend misguided effort at what seems like optimization, but could also turn out to be pessimization.

lmiq · June 20, 2023, 11:22am

A good idea is to check for the performance of each function separately, for instance using

julia> @btime mat_z(... inputs ...)

where you provide to the function some representative inputs. When these are functions called multiple times within loops, that is important, and easier to understand than the complete profile.

The most important optimizations will come from removing temporary allocations, as mentioned above. This post may help: Common allocation mistakes

physh · June 21, 2023, 2:31am

Dear all, many thanks already for all your hints, this is exactly what I was hoping to get. I will try to answer them all:

1a)

This is simply a reminiscent of the formula I derived by hand/Mathematica and stems from some spherical harmonics. When writing such code, I try to stay with the original form in order to not introduce any typo and for easier debugging. As expected, changing it to 16 \pi^2 did not really make any relevant difference.

1b)

I tried this by changing my code to

Z .+= tril(Z,-1)';
X .+= transpose(tril(X,-1));
Y .+= transpose(tril(Y,-1));

however this changed the benchmark result from

julia> @btime include("PerformanceTest.jl")
  1.010 s (12749484 allocations: 1.34 GiB)

to

@btime include("PerformanceTest.jl")
  1.078 s (12789251 allocations: 1.33 GiB)

I don’t really understand why this actually increases the number of allocations? (The memory is however slightly less…)

1c)

Unfortunately, due to some bug, I experience some memory leak when activating the Julia extension in VSCode. But I will keep it in mind when this bug might be fixed in the future.

Is there a way to see and analyze the number of calls to this functions within one execution?

Wow, that is quite the improvement! Thank you for the hint with using StaticArrays. I am pretty sure that I have read about this once in the performance tips or somewhere else, but I guess it was not so revealing to me until I have applied it to actual code of myself. As you showed,

@btime X,Y,Z = mat_xyz(var_params, d_arr, abcvals; mu_g=1.0, prefac=-10.0, theta = 0.0);
  92.730 ms (19 allocations: 14.65 MiB)

which is a big improvement. I will certainly be using StaticArrays in the future.

However I am curious, when calling the entire script, why does

@time X,Y,Z = mat_xyz(var_params, d_arr, abcvals; mu_g=1.0, prefac=-10.0, theta = 0.0);
@time evals,evecs = eigen(X+Y,Z);

yield

  0.553434 seconds (249.05 k allocations: 30.053 MiB, 2.41% gc time, 71.89% compilation time)
  0.281798 seconds (20 allocations: 14.695 MiB)

even after several executions? Why is it recompiling the function every time?

Salmon · June 21, 2023, 7:53am

physh:

However I am curious, when calling the entire script, why does
@time X,Y,Z = mat_xyz(var_params, d_arr, abcvals; mu_g=1.0, prefac=-10.0, theta = 0.0);
@time evals,evecs = eigen(X+Y,Z);
yield
  0.553434 seconds (249.05 k allocations: 30.053 MiB, 2.41% gc time, 71.89% compilation time)
  0.281798 seconds (20 allocations: 14.695 MiB)
even after several executions? Why is it recompiling the function every time?

I think If you include the entire script you effectively re-define the function which needs to be compiled again (although not the functions that are called from other packages, see this thread). If you want accurate timings, @btime is good, in this case it probably is sufficient to simply call the function two times in the script and only time the second call.
If the compilation time actually matters, you probably should change your workflow by using Revise.jl. For instance, you can have a script that contains the function definitions and another one which runs the calculation and includes the former as includet("FunctionDefinitions.jl") (Or, perhaps even better, make a package, which is actually not difficult in julia).
Revise will make sure that your function definitions stay up to date while only recompiling when it is necessary.

lmiq · June 21, 2023, 9:41am

I don’t mean anything sophisticated there. If the function is called from within a loop, it called many times.

nsajko · June 21, 2023, 10:56am

That’s what a profiler is for.

alfaromartino · June 21, 2023, 8:54pm

# Gaussian widths nu:
    nu_arr=zeros(nmax);nu_arr[1] = 1/s1^2;
    if nmax > 1; nu_arr[2:nmax] = @. 1/s1^2 * (s1/snmax)^(2*((2:nmax)-1)/(nmax-1)); end
    NU_arr=zeros(Nmax);NU_arr[1] = 1/S1^2;
    if Nmax > 1; NU_arr[2:Nmax] = @. 1/S1^2 * (S1/SNmax)^(2*((2:Nmax)-1)/(Nmax-1)); end

Skimming through it, I can see some easy improvements you could have here.
First, if you initialize arrays, use similar or undef arrays (see here Arrays · The Julia Language). It avoids filling values with zeros.
Second, once you initialize an array, “updating” the values is better. This means @. should be at the start of the line (which ensures that you use .= rather than =).
Third, if you work with slices of an array, it’s better to use @views, which you can add at the beginning of the function.
Fourth, StaticArrays are always helpful when you have a vector with a few elements (rough rule of thumb 100, but I choose not to use them for vectors with more than 50).

Example:

#approach 1

vec1   = rand(10)
vec2   = rand(10)

function approach1(vec1,vec2)
    nu_arr = zeros(10)    

    nu_arr[2:8] = @. vec1[2:8] + vec2[2:8]
end


#approach 2
using StaticArrays
Svec1   = SVector(vec1...)
Svec2   = SVector(vec2...)
@views function approach2(vec1, vec2)
    nu_arr = similar(vec1)

    @. nu_arr[2:8] = vec1[2:8] + vec2[2:8]
end


@btime approach1($vec1,$vec2)  #155.044 ns (4 allocations: 480 bytes)
@btime approach2($Svec1,$Svec2)#12.412 ns (1 allocation: 96 bytes)

A personal recommendation is also to make the code clearer. It’ll help you a lot when you have to debug. Two suggestions for doing this would be to split the functions you have into smaller functions. They’re too long, making it harder to detect type instabilities or where you have room for improvements. Also, I suggest not writing conditionals in a single line. If you want to write a conditional in a single line, I’d use
nmax > 1 && nu_arr[2:nmax] =...
which means <condition> && <do this if condition is true>
instead of
if nmax > 1; nu_arr[2:nmax] =... end

physh · June 25, 2023, 2:17am

Ok, usually I do it that way. And indeed, testing your suggestion of calling the function twice showed your predicted behavior. So it was simply due to the form of this exmaple, where I wanted to have everything in a single file.

I agree. But for example when using the ProfileView.@profview, it is not clear to me how to see it. Maybe you can suggest to me another tool?

alfaromartino:

First, if you initialize arrays, use similar or undef arrays (see here Arrays · The Julia Language). It avoids filling values with zeros.
Second, once you initialize an array, “updating” the values is better. This means @. should be at the start of the line (which ensures that you use .= rather than =).
Third, if you work with slices of an array, it’s better to use @views, which you can add at the beginning of the function.
Fourth, StaticArrays are always helpful when you have a vector with a few elements (rough rule of thumb 100, but I choose not to use them for vectors with more than 50).
[…]
A personal recommendation is also to make the code clearer. It’ll help you a lot when you have to debug. Two suggestions for doing this would be to split the functions you have into smaller functions. They’re too long, making it harder to detect type instabilities or where you have room for improvements. Also, I suggest not writing conditionals in a single line.

Thank you for several suggestions:

I think I have used that in the past, however in my example, the arrays X,Y,Z are initialized by filling with zeros, because later in the loops I add terms on each array element: Z[i,j] = Z[i,j] + Z_temp;I think by using an undef array I once ran into problems with this. However maybe there is aynway a better way to write the values into the array elements? Otherwise, for the nu_arr I agree.
Easy fix, I will try it in the future.
I am not sure if I understand this correctly. Can I simply/stupidly add this in front of all of my functions which has some array-sliced operation in order to save allocations? Or can this lead to unexpected behaviour if misused?
Sounds like a good rule of thumb.
Interesting, I didnt feel my functions are too long. Could you maybe give me a more explicit suggestion? After your comment I would think that the functions mat_x, mat_y, mat_z, jtrafo are fine, but maybe within mat_xyz I could outsource the function defintions and calculation of the widths, and the norms. Moreover, I can write another function which contains the loop-structure. In there, I was anyway wondering whether it would be more efficient to have the “big” loops over i and j before or rather “outside” of the loops over a and b? Is there also some kind of rule of thumb?
I mostly wrote it in one line, to not clog up the looking of the function. But maybe this goes hand in had with comment 5 to make the functions simpler. In any case I will go and use && for simple conditionals.

alfaromartino · June 25, 2023, 8:56pm

Writing short functions will help you a lot. Read this (it’s not about Julia, but the point made is the same).

I’ve started a blog. Maybe the last three posts about easy fixes to enhance performance can be useful to you. It includes a post about @views.
https://alfaromartino.github.io/blog/list_posts/

Salmon · June 26, 2023, 9:18am

To be honest, I would not worry about this too much. It is true that using zeros is strictly speaking slower but I think the number of cases where that really matters are small (please correct me if Im wrong). Usually, the major performance loss comes from the fact that you allocate a bunch of memory which shouldnt be done in tight loops where performance matters anyways. Personally, I like to avoid using undef initializers when possible, since they can lead to annoying bugs with undefined behaviour if you do not explicitly initialize them with something (and do not use += as you mentioned before).

It definitely does matter whether you use this or not. It can lead to different results and can cause both faster and sometimes also slower code. Without using @views, you are creating a copy of the array (at the specified locations). Modifying this copy for example will not change the original, which may be desired. It can even be faster to not use views since views do not have to be contiguous in memory, i.e. two neighboring elements might be a long distance away from each other making iterating over it slower (and less likely to profit from SIMD/LoopVectorization).
As a rule of thumb: if it is in a loop that is called very often, you want to reduce allocations as much as possible so you use @views. If you do operations on the array and not want to modify the original array, you should not use views. If you create the slice only as an initial step before some performance intensive operation, test if it is faster if you do not use @views.

This is a very common topic of discussion and people’s opinions tend to differ slightly but it is true that your functions tend to be longer than they probably have to be. A good rule of thumb is to ask yourself “What does my function do?”, and be specific about the answer. Often, your answer might be something of the sort “given the input parameters, I create and initialize datatype x, then I compute some metadata and finally I operate on the datatype using the metadata”. This answer implies that your function might benefit from defining (at least) three subfunctions, one that initializes the datatype, one the computes metadata and one that performs the operation.
(Some people would maybe go as far as to say that each function should be shortened down to the absolute minimum of what is possible, i.e. create lots of functions with ~1-5 lines, but arguably that can also lead to difficult to read code)

Topic		Replies	Views
Reducing allocations to optimize (to match Java speed) Performance	22	1341	December 10, 2018
Memory allocation and performance Performance memory-allocation	7	650	June 26, 2020
Porting code from MatLab - performance tips New to Julia	18	427	June 26, 2024
Creating the same matrix in Julia as in Matlab takes longer time New to Julia	28	2188	June 28, 2017
Common allocation mistakes Performance memory-allocation	47	7174	August 21, 2023

Performance & Profiling Tips for Beginner Code

Related topics