# Performance & Profiling Tips for Beginner Code

Dear all, I have been using Julia for a few months now, writing code mostly without worrying too much about performance (profiling very briefly with @time/@btime to make sure nothing too horribly is happening though).

Intro:
But now, I feel the time has come to improve my knowledge on performance, so that I can write faster and more efficient code in the future directly from the get-go, without profiling a lot afterwards. As I believe that learning is best by doing an example, I humbly ask for your help in improving my code. For that I present below a typical code of mine which builds matrices and subsequently calculates the eigenvalues and -functions. Here the matrix elements basically stem from some complex Gaussian integrals. My aim is not to over-optimizie this very code, but it is an excerpt of one of my actual codes (I have changed all variable names) and it representy my usual way of coding. In this sense it is not an urgent request for help, but rather to learn some new techniques.

If you feel that this question should be moved to the “New to Julia” subforum, please feel free to do so.

Code:
(the file is called PerformanceTest.jl)

using LinearAlgebra

# Script is below

## Functions:
function mat_xyz(var_params, d_arr, abcvals; mu_g=1.0, prefac=-10.0, theta = 0.0)

(;s1,snmax,nmax,S1,SNmax,Nmax) = var_params;
(;avals,bvals,cvals) = abcvals;

# Initialization:
dims = nmax*Nmax;
X  = zeros(ComplexF64,dims,dims) # create only marix W=X+Y and add both X and Y directly?
Y  = zeros(ComplexF64,dims,dims)
Z  = zeros(ComplexF64,dims,dims)

# Gaussian widths nu:
nu_arr=zeros(nmax);nu_arr = 1/s1^2;
if nmax > 1; nu_arr[2:nmax] = @. 1/s1^2 * (s1/snmax)^(2*((2:nmax)-1)/(nmax-1)); end
NU_arr=zeros(Nmax);NU_arr = 1/S1^2;
if Nmax > 1; NU_arr[2:Nmax] = @. 1/S1^2 * (S1/SNmax)^(2*((2:Nmax)-1)/(Nmax-1)); end

# Norm:
norm(nu) = 2*(2*nu)^(3/4)/pi^(1/4);

# Complex factor:
cfac = exp(-2*theta*pi/180*im);
nu_arr = nu_arr * cfac; NU_arr = NU_arr * cfac;

# Calculations of matrices:
for a=avals
for b=bvals

# fill only lower triangular part due to symmetry
for i=1:dims
n=div(i-1,Nmax)+1;
N=rem(i-1,Nmax)+1;
for j=1:i
nprime=div(j-1,Nmax)+1;
Nprime=rem(j-1,Nmax)+1;

if j>i
continue
end

nua = nu_arr[n]';
nub = nu_arr[nprime];
NUa = NU_arr[N]';
NUb = NU_arr[Nprime];
nus = (nua=nua,nub=nub,NUa=NUa,NUb=NUb);

norma = norm(nua)';
normb = norm(nub);
Norma = norm(NUa)';
Normb = norm(NUb);
norms = (norma=norma,normb=normb,Norma=Norma,Normb=Normb);

Z_temp = mat_z(a,b,d_arr,nus,norms); # for reusing in mat_x
Z[i,j] = Z[i,j] + Z_temp;
X[i,j] = X[i,j] + mat_x(a,b,d_arr,nus,norms;mat_z_element=Z_temp);

for c=cvals
Y[i,j] = Y[i,j] + mat_y(a,b,c,d_arr,nus,norms,prefac,mu_g);
end

end
end # loop-end over i,j = dims x dims
end
end  # loop-end over a,b

Z = Z + tril(Z,-1)' #Symmetric(Z, :L)
X = X + transpose(tril(X,-1));
Y = Y + transpose(tril(Y,-1)) # transpose instead of ' ! (we dont want adjoint...)

return X,Y,Z
end

function mat_z(a::Int64,b::Int64,d_arr::Array{Float64},nus,norms)

(;norma,normb,Norma,Normb) = norms;
(;nua,nub,NUa,NUb) = nus;

Ma = [[nua',0] [0,NUa']];
Mb = [[nub,0] [0,NUb]];

Mab = jtrafo(a,b,d_arr)'*Ma*jtrafo(a,b,d_arr) + Mb;

element = norma*normb*Norma*Normb/sqrt(4*pi)^4 * (pi^2/det(Mab))^(3/2);
return element
end

function mat_x(a::Int64,b::Int64,d_arr::Array{Float64},nus,norms;mat_z_element=-1.0)

(;norma,normb,Norma,Normb) = norms;
(;nua,nub,NUa,NUb) = nus;

if mat_z_element == -1.0
mat_z_element = mat_z(a,b,d_arr,nus,norms)
println("mat_z_element should not be -1 !")
end

Ma = [[nua',0] [0,NUa']];
Mb = [[nub,0] [0,NUb]];

Mab = jtrafo(a,b,d_arr)'*Ma*jtrafo(a,b,d_arr) + Mb;
eta = det(Mab)/Mab[2,2];
zeta = det(Mab)/Mab[1,1];

da = d_arr[mod(b+1,3)+1]; # b-1; maps 1,2,3 onto 3,1,2.
db = d_arr[b];
dc = d_arr[mod(b,3)+1]; # b+1; maps 1,2,3 onto 2,3,1
dur = dc*da/(dc+da);
duR = db*(da+dc)/(da+db+dc);

element = mat_z_element*(3*nub/dur*(1-nub/eta) + 3*NUb/duR*(1-NUb/zeta));
return element
end

function mat_y(a::Int64,b::Int64,c::Int64,d_arr::Array{Float64},nus,norms,prefac::Float64,mu_g)

(;norma,normb,Norma,Normb) = norms;
(;nua,nub,NUa,NUb) = nus;
Ma = [[nua',0] [0,NUa']];
Mb = [[nub,0] [0,NUb]];
Mc = [[mu_g,0.0] [0.0,0.0]]

Macb = jtrafo(a,c,d_arr)'*Ma*jtrafo(a,c,d_arr) + jtrafo(b,c,d_arr)'*Mb*jtrafo(b,c,d_arr) + Mc;

element = prefac*norma*normb*Norma*Normb/sqrt(4*pi)^4 * (pi^2/det(Macb))^(3/2);
return element;
end

# Function for Variable-transformation
function jtrafo(i::Int64,f::Int64,d_arr::Array{Float64})
da = d_arr[i];db = d_arr[mod(i,3)+1];dc = d_arr[mod(i+1,3)+1]
j = Array{Float64}(undef,2,2)
if f==i
j[1,1]=1.0;j[2,2]=1.0;
j[1,2]=0.0;j[2,1]=0.0;
elseif mod(f-i,3)==1    # trafo a->b (+1)
j[1,1]=-da/(da+dc)
j[1,2]=1.0
j[2,1]=-dc*(da+db+dc)/(da+dc)/(db+dc);
j[2,2]=-db/(db+dc)
elseif mod(f-i,3)==2    # trafo a->c (+2)
j[1,1]=-da/(da+db)
j[1,2]=-1.0
j[2,1]=+db*(da+db+dc)/(da+db)/(db+dc);
j[2,2]=-dc/(db+dc);
end
return j
end

## End of function definitions

# Beginning of script:
var_params = (; s1=0.9,snmax=69.5,nmax=20,S1=0.9,SNmax=69.5,Nmax=20);
d_arr = [4.0 81.0 81.0];
abcvals = (;avals=2:2,bvals=2:3,cvals=2:3);
mu_g = 1.0;
prefac = -10.0;
theta = 10.0;

X,Y,Z = mat_xyz(var_params, d_arr, abcvals; mu_g=1.0, prefac=-10.0, theta = 0.0);

evals,evecs = eigen(X+Y,Z);
return nothing
#End of script


Profiling attempts

1. Currently @btime include("PerformanceTest.jl") yields almost exactly 1 second of runtime: 998.592 ms (12749484 allocations: 1.34 GiB).

2. With the help of some @time commands, I see that most of the time is spent in the big loops over i,j (which was expected I guess).

3. I have tried playing around using explicit ::DataTypes in the function arguments, but this has not really changed anything

4. I have used @code_warntype for the functions mat_x, mat_y, mat_z, and jtrafo without any DataType being marked in red which according to the documentation should reveal any problems with DataTypes.

5. I have used ProfileView.@profview X,Y,Z = mat_xyz(var_params,d_arr,abcvals;mu_g = 1.0,theta=5.0,prefac=-10.0). I attach the saved results of a few runs to this post. Apart from the first run (which expectedly looks totally different), I dont see any red parts, but several yellow ones on the top (which if I understand the documentation correctly should also indicate some unoptimal parts). However I have a hard time understanding if there is indeed a problem and what it is exactly.

Conclusion:
I have done some profiling, but to be honest, from the results that I see it is very hard to understand (for me!) whether there is actually much improvement to be found or not. Unfortunately I am not able to write similar code in C, or Fortran to quickly test against this implementation. Understanding where the bottleneck is and what should be done against it, are then also the next question.

Thank you!

4 Likes

I have not looked too thoroughly through this yet, but my sense is that further optimization is going to come from managing memory allocations. Essentially, the idea is to figure out if you can reuse existing allocated memory rather than allocating new memory everytime. Preferably you would allocate all memory before the for loops, and then use that repeatedly.

In particular, take a look at the memory allocation profiler:
https://docs.julialang.org/en/v1/manual/profile/#Memory-allocation-analysis

4 Likes

It is often wise not to think too much about performance when you first write code. Make sure it is as simple as possible so that it runs correctly. Then, making performance improvements while keeping sure the results stay correct is much easier and from my experience also faster overall.

Any way, there are a few things that are probably worth doing by default. As mentioned by mkitti, it is often a good idea to avoid doing unnecessarily many small allocations. Your profiler shows yellow bars, which means that there is time being spent with garbage collection. Sometimes that is hard to avoid, but here it looks like there might be low-hanging fruit:

    Ma = [[nua',0] [0,NUa']];


Lines like this allocate more than they need. If I understand correctly, you want to create a 2x2 matrix (?). However this object is an array of arrays. Small arrays where the size is known at compile time like that should usually be made by using StaticArrays which will avoid allocations and also have many optimized routines for linear algebra.
Similarly the function

function jtrafo(i::Int64,f::Int64,d_arr::Array{Float64})
da = d_arr[i];db = d_arr[mod(i,3)+1];dc = d_arr[mod(i+1,3)+1]
j = Array{Float64}(undef,2,2)
if f==i
j[1,1]=1.0;j[2,2]=1.0;
j[1,2]=0.0;j[2,1]=0.0;
...


could be made much more efficient by switching to static array instead, i.e.

function jtrafo(i::Int64,f::Int64,d_arr::Array{Float64})
da = d_arr[i]
db = d_arr[mod(i,3)+1]
dc = d_arr[mod(i+1,3)+1]

if f==i
return SA[ #shorthand for static array
1.0 0.0;
0.0 1.0
]
...


by the way in Julia there is no need to use ; at the end of a line Personally, I’d say if the code is type-stable and does not spend much time in the garbage collector it is often quite close to optimal. There is no harm in using a profiler to find performance bottlenecks and spend more time optimizing those with more effort, but these kinds of optimization attempts should only be done if you know it is actually going to make a difference.

4 Likes

Here is an example of what you can get by using static arrays:

julia> @time XOLD,YOLD,ZOLD = mat_xyz(var_params,d_arr,abcvals;mu_g = 1.0,theta=5.0,prefac=-10.0); #old code
1.173104 seconds (12.51 M allocations: 1.310 GiB, 11.40% gc time)

julia> @time X,Y,Z = mat_xyz(var_params,d_arr,abcvals;mu_g = 1.0,theta=5.0,prefac=-10.0);
0.177127 seconds (28 allocations: 21.975 MiB) # new code
julia> X ≈ XOLD && Y ≈ YOLD &&  Z ≈ ZOLD
true

New code using static arrays
using LinearAlgebra,StaticArrays

# Script is below

## Functions:
function mat_xyz(var_params, d_arr, abcvals; mu_g=1.0, prefac=-10.0, theta = 0.0)

(;s1,snmax,nmax,S1,SNmax,Nmax) = var_params;
(;avals,bvals,cvals) = abcvals;

# Initialization:
dims = nmax*Nmax;
X  = zeros(ComplexF64,dims,dims) # create only marix W=X+Y and add both X and Y directly?
Y  = zeros(ComplexF64,dims,dims)
Z  = zeros(ComplexF64,dims,dims)

# Gaussian widths nu:
nu_arr=zeros(nmax);nu_arr = 1/s1^2;
if nmax > 1; nu_arr[2:nmax] = @. 1/s1^2 * (s1/snmax)^(2*((2:nmax)-1)/(nmax-1)); end
NU_arr=zeros(Nmax);NU_arr = 1/S1^2;
if Nmax > 1; NU_arr[2:Nmax] = @. 1/S1^2 * (S1/SNmax)^(2*((2:Nmax)-1)/(Nmax-1)); end

# Norm:
norm(nu) = 2*(2*nu)^(3/4)/pi^(1/4);

# Complex factor:
cfac = exp(-2*theta*pi/180*im);
nu_arr = nu_arr * cfac; NU_arr = NU_arr * cfac;

# Calculations of matrices:
for a=avals
for b=bvals

# fill only lower triangular part due to symmetry
for i=1:dims
n=div(i-1,Nmax)+1;
N=rem(i-1,Nmax)+1;
for j=1:i
nprime=div(j-1,Nmax)+1;
Nprime=rem(j-1,Nmax)+1;

if j>i
continue
end

nua = nu_arr[n]';
nub = nu_arr[nprime];
NUa = NU_arr[N]';
NUb = NU_arr[Nprime];
nus = (nua=nua,nub=nub,NUa=NUa,NUb=NUb);

norma = norm(nua)';
normb = norm(nub);
Norma = norm(NUa)';
Normb = norm(NUb);
norms = (norma=norma,normb=normb,Norma=Norma,Normb=Normb);

Z_temp = mat_z(a,b,d_arr,nus,norms); # for reusing in mat_x
Z[i,j] = Z[i,j] + Z_temp;
X[i,j] = X[i,j] + mat_x(a,b,d_arr,nus,norms;mat_z_element=Z_temp);

for c=cvals
Y[i,j] = Y[i,j] + mat_y(a,b,c,d_arr,nus,norms,prefac,mu_g);
end

end
end # loop-end over i,j = dims x dims
end
end  # loop-end over a,b

Z = Z + tril(Z,-1)' #Symmetric(Z, :L)
X = X + transpose(tril(X,-1));
Y = Y + transpose(tril(Y,-1)) # transpose instead of ' ! (we dont want adjoint...)

return X,Y,Z
end

function mat_z(a::Int64,b::Int64,d_arr::Array{Float64},nus,norms)

(;norma,normb,Norma,Normb) = norms;
(;nua,nub,NUa,NUb) = nus;

Ma = SA[nua' 0 ; 0 NUa']
Mb = SA[nub 0 ; 0 NUb]

Mab = jtrafo(a,b,d_arr)'*Ma*jtrafo(a,b,d_arr) + Mb;

element = norma*normb*Norma*Normb/sqrt(4*pi)^4 * (pi^2/det(Mab))^(3/2);
return element
end

function mat_x(a::Int64,b::Int64,d_arr::Array{Float64},nus,norms;mat_z_element=-1.0)

(;norma,normb,Norma,Normb) = norms;
(;nua,nub,NUa,NUb) = nus;

if mat_z_element == -1.0
mat_z_element = mat_z(a,b,d_arr,nus,norms)
println("mat_z_element should not be -1 !")
end

Ma = SA[nua' 0 ; 0 NUa'];
Mb = SA[nub 0 ; 0 NUb];

Mab = jtrafo(a,b,d_arr)'*Ma*jtrafo(a,b,d_arr) + Mb;
eta = det(Mab)/Mab[2,2];
zeta = det(Mab)/Mab[1,1];

da = d_arr[mod(b+1,3)+1]; # b-1; maps 1,2,3 onto 3,1,2.
db = d_arr[b];
dc = d_arr[mod(b,3)+1]; # b+1; maps 1,2,3 onto 2,3,1
dur = dc*da/(dc+da);
duR = db*(da+dc)/(da+db+dc);

element = mat_z_element*(3*nub/dur*(1-nub/eta) + 3*NUb/duR*(1-NUb/zeta));
return element
end

function mat_y(a::Int64,b::Int64,c::Int64,d_arr::Array{Float64},nus,norms,prefac::Float64,mu_g)

(;norma,normb,Norma,Normb) = norms;
(;nua,nub,NUa,NUb) = nus;
Ma = SA[nua' 0 ;0 NUa']
Mb = SA[nub 0 ;0 NUb]
Mc = SA[mu_g 0.0 ;0.0 0.0]

Macb = jtrafo(a,c,d_arr)'*Ma*jtrafo(a,c,d_arr) + jtrafo(b,c,d_arr)'*Mb*jtrafo(b,c,d_arr) + Mc

element = prefac*norma*normb*Norma*Normb/sqrt(4*pi)^4 * (pi^2/det(Macb))^(3/2);
return element;
end

# Function for Variable-transformation
function jtrafo(i::Int64,f::Int64,d_arr::Array{Float64})
da = d_arr[i];db = d_arr[mod(i,3)+1];dc = d_arr[mod(i+1,3)+1]
if f==i
return SA[1.0 0.0;0.0 1.0]

elseif mod(f-i,3)==1    # trafo a->b (+1)
return SA[-da/(da+dc) 1.0;
-dc*(da+db+dc)/(da+dc)/(db+dc) -db/(db+dc)]

elseif mod(f-i,3)==2    # trafo a->c (+2)
return SA[-da/(da+db) -1.0; +db*(da+db+dc)/(da+db)/(db+dc) -dc/(db+dc)]
end
error("something went wrong in jtrafo")
end

## End of function definitions

# Beginning of script:
var_params = (; s1=0.9,snmax=69.5,nmax=20,S1=0.9,SNmax=69.5,Nmax=20);
d_arr = [4.0 81.0 81.0];
abcvals = (;avals=2:2,bvals=2:3,cvals=2:3);
mu_g = 1.0;
prefac = -10.0;
theta = 10.0;

X,Y,Z = mat_xyz(var_params, d_arr, abcvals; mu_g=1.0, prefac=-10.0, theta = 0.0);

evals,evecs = eigen(X+Y,Z);
return nothing
#End of script

4 Likes

In some places you use this expression, however it is mathematically equivalent to 16 * pi^2, which is surely more accurate. In fact the latter expression happens to evaluate to the correctly rounded Float64 representation of 16 \pi^2.

In some places you use constructions like A = A + B. The problem with this is that A + B needs to be allocated. You could avoid the allocation by doing something like this instead A .+= B. Also see this section of Performance Tips in the Manual.

I like to do profiling in the (VS) Code editor with the Julia extension, with the commands @profview and @profview_allocs. Usually no additional packages are necessary for the profiling as the Julia extension bundles relevant code. You get an interactive (try hovering over the elements, clicking on them, etc.) flamegraph, and can also open files with highlighted costly lines of code (I definitely recommend this). Just don’t forget to run the profiling one time in advance, just to allow Julia to compile all functions.

Also see the Julia extension docs for profiling: Profiler · Julia in VS Code

3 Likes

It’s always a good idea to profile code. Otherwise, without measurements and visualizations, one might potentially misunderstand the performance behavior and spend misguided effort at what seems like optimization, but could also turn out to be pessimization.

1 Like

A good idea is to check for the performance of each function separately, for instance using

julia> @btime mat_z(... inputs ...)


where you provide to the function some representative inputs. When these are functions called multiple times within loops, that is important, and easier to understand than the complete profile.

The most important optimizations will come from removing temporary allocations, as mentioned above. This post may help: Common allocation mistakes

1 Like

Dear all, many thanks already for all your hints, this is exactly what I was hoping to get. I will try to answer them all:

1a)

This is simply a reminiscent of the formula I derived by hand/Mathematica and stems from some spherical harmonics. When writing such code, I try to stay with the original form in order to not introduce any typo and for easier debugging. As expected, changing it to 16 \pi^2 did not really make any relevant difference.

1b)

I tried this by changing my code to

Z .+= tril(Z,-1)';
X .+= transpose(tril(X,-1));
Y .+= transpose(tril(Y,-1));


however this changed the benchmark result from

julia> @btime include("PerformanceTest.jl")
1.010 s (12749484 allocations: 1.34 GiB)


to

@btime include("PerformanceTest.jl")
1.078 s (12789251 allocations: 1.33 GiB)


I don’t really understand why this actually increases the number of allocations? (The memory is however slightly less…)

1c)

Unfortunately, due to some bug, I experience some memory leak when activating the Julia extension in VSCode. But I will keep it in mind when this bug might be fixed in the future.

Is there a way to see and analyze the number of calls to this functions within one execution?

Wow, that is quite the improvement! Thank you for the hint with using StaticArrays. I am pretty sure that I have read about this once in the performance tips or somewhere else, but I guess it was not so revealing to me until I have applied it to actual code of myself. As you showed,

@btime X,Y,Z = mat_xyz(var_params, d_arr, abcvals; mu_g=1.0, prefac=-10.0, theta = 0.0);
92.730 ms (19 allocations: 14.65 MiB)


which is a big improvement. I will certainly be using StaticArrays in the future.

However I am curious, when calling the entire script, why does

@time X,Y,Z = mat_xyz(var_params, d_arr, abcvals; mu_g=1.0, prefac=-10.0, theta = 0.0);
@time evals,evecs = eigen(X+Y,Z);


yield

  0.553434 seconds (249.05 k allocations: 30.053 MiB, 2.41% gc time, 71.89% compilation time)
0.281798 seconds (20 allocations: 14.695 MiB)


even after several executions? Why is it recompiling the function every time?

I think If you include the entire script you effectively re-define the function which needs to be compiled again (although not the functions that are called from other packages, see this thread). If you want accurate timings, @btime is good, in this case it probably is sufficient to simply call the function two times in the script and only time the second call.
If the compilation time actually matters, you probably should change your workflow by using Revise.jl. For instance, you can have a script that contains the function definitions and another one which runs the calculation and includes the former as includet("FunctionDefinitions.jl") (Or, perhaps even better, make a package, which is actually not difficult in julia).
Revise will make sure that your function definitions stay up to date while only recompiling when it is necessary.

I don’t mean anything sophisticated there. If the function is called from within a loop, it called many times.

That’s what a profiler is for.

# Gaussian widths nu:
nu_arr=zeros(nmax);nu_arr = 1/s1^2;
if nmax > 1; nu_arr[2:nmax] = @. 1/s1^2 * (s1/snmax)^(2*((2:nmax)-1)/(nmax-1)); end
NU_arr=zeros(Nmax);NU_arr = 1/S1^2;
if Nmax > 1; NU_arr[2:Nmax] = @. 1/S1^2 * (S1/SNmax)^(2*((2:Nmax)-1)/(Nmax-1)); end


Skimming through it, I can see some easy improvements you could have here.
First, if you initialize arrays, use similar or undef arrays (see here Arrays · The Julia Language). It avoids filling values with zeros.
Second, once you initialize an array, “updating” the values is better. This means @. should be at the start of the line (which ensures that you use .= rather than =).
Third, if you work with slices of an array, it’s better to use @views, which you can add at the beginning of the function.
Fourth, StaticArrays are always helpful when you have a vector with a few elements (rough rule of thumb 100, but I choose not to use them for vectors with more than 50).

Example:

#approach 1

vec1   = rand(10)
vec2   = rand(10)

function approach1(vec1,vec2)
nu_arr = zeros(10)

nu_arr[2:8] = @. vec1[2:8] + vec2[2:8]
end

#approach 2
using StaticArrays
Svec1   = SVector(vec1...)
Svec2   = SVector(vec2...)
@views function approach2(vec1, vec2)
nu_arr = similar(vec1)

@. nu_arr[2:8] = vec1[2:8] + vec2[2:8]
end

@btime approach1($vec1,$vec2)  #155.044 ns (4 allocations: 480 bytes)
@btime approach2($Svec1,$Svec2)#12.412 ns (1 allocation: 96 bytes)



A personal recommendation is also to make the code clearer. It’ll help you a lot when you have to debug. Two suggestions for doing this would be to split the functions you have into smaller functions. They’re too long, making it harder to detect type instabilities or where you have room for improvements. Also, I suggest not writing conditionals in a single line. If you want to write a conditional in a single line, I’d use
nmax > 1 && nu_arr[2:nmax] =...
which means <condition> && <do this if condition is true>
if nmax > 1; nu_arr[2:nmax] =... end

Ok, usually I do it that way. And indeed, testing your suggestion of calling the function twice showed your predicted behavior. So it was simply due to the form of this exmaple, where I wanted to have everything in a single file.

I agree. But for example when using the ProfileView.@profview, it is not clear to me how to see it. Maybe you can suggest to me another tool?

Thank you for several suggestions:

1. I think I have used that in the past, however in my example, the arrays X,Y,Z are initialized by filling with zeros, because later in the loops I add terms on each array element: Z[i,j] = Z[i,j] + Z_temp;I think by using an undef array I once ran into problems with this. However maybe there is aynway a better way to write the values into the array elements? Otherwise, for the nu_arr I agree.

2. Easy fix, I will try it in the future.

3. I am not sure if I understand this correctly. Can I simply/stupidly add this in front of all of my functions which has some array-sliced operation in order to save allocations? Or can this lead to unexpected behaviour if misused?

4. Sounds like a good rule of thumb.

5. Interesting, I didnt feel my functions are too long. Could you maybe give me a more explicit suggestion? After your comment I would think that the functions mat_x, mat_y, mat_z, jtrafo are fine, but maybe within mat_xyz I could outsource the function defintions and calculation of the widths, and the norms. Moreover, I can write another function which contains the loop-structure. In there, I was anyway wondering whether it would be more efficient to have the “big” loops over i and j before or rather “outside” of the loops over a and b? Is there also some kind of rule of thumb?

6. I mostly wrote it in one line, to not clog up the looking of the function. But maybe this goes hand in had with comment 5 to make the functions simpler. In any case I will go and use && for simple conditionals.

I’ve started a blog. Maybe the last three posts about easy fixes to enhance performance can be useful to you. It includes a post about @views.
It definitely does matter whether you use this or not. It can lead to different results and can cause both faster and sometimes also slower code. Without using @views, you are creating a copy of the array (at the specified locations). Modifying this copy for example will not change the original, which may be desired. It can even be faster to not use views since views do not have to be contiguous in memory, i.e. two neighboring elements might be a long distance away from each other making iterating over it slower (and less likely to profit from SIMD/LoopVectorization).