Dear all, I have been using Julia for a few months now, writing code mostly without worrying too much about performance (profiling very briefly with @time/@btime to make sure nothing too horribly is happening though).
Intro:
But now, I feel the time has come to improve my knowledge on performance, so that I can write faster and more efficient code in the future directly from the getgo, without profiling a lot afterwards. As I believe that learning is best by doing an example, I humbly ask for your help in improving my code. For that I present below a typical code of mine which builds matrices and subsequently calculates the eigenvalues and functions. Here the matrix elements basically stem from some complex Gaussian integrals. My aim is not to overoptimizie this very code, but it is an excerpt of one of my actual codes (I have changed all variable names) and it representy my usual way of coding. In this sense it is not an urgent request for help, but rather to learn some new techniques.
If you feel that this question should be moved to the “New to Julia” subforum, please feel free to do so.
Code:
(the file is called PerformanceTest.jl)
using LinearAlgebra
# Script is below
## Functions:
function mat_xyz(var_params, d_arr, abcvals; mu_g=1.0, prefac=10.0, theta = 0.0)
(;s1,snmax,nmax,S1,SNmax,Nmax) = var_params;
(;avals,bvals,cvals) = abcvals;
# Initialization:
dims = nmax*Nmax;
X = zeros(ComplexF64,dims,dims) # create only marix W=X+Y and add both X and Y directly?
Y = zeros(ComplexF64,dims,dims)
Z = zeros(ComplexF64,dims,dims)
# Gaussian widths nu:
nu_arr=zeros(nmax);nu_arr[1] = 1/s1^2;
if nmax > 1; nu_arr[2:nmax] = @. 1/s1^2 * (s1/snmax)^(2*((2:nmax)1)/(nmax1)); end
NU_arr=zeros(Nmax);NU_arr[1] = 1/S1^2;
if Nmax > 1; NU_arr[2:Nmax] = @. 1/S1^2 * (S1/SNmax)^(2*((2:Nmax)1)/(Nmax1)); end
# Norm:
norm(nu) = 2*(2*nu)^(3/4)/pi^(1/4);
# Complex factor:
cfac = exp(2*theta*pi/180*im);
nu_arr = nu_arr * cfac; NU_arr = NU_arr * cfac;
# Calculations of matrices:
for a=avals
for b=bvals
# fill only lower triangular part due to symmetry
for i=1:dims
n=div(i1,Nmax)+1;
N=rem(i1,Nmax)+1;
for j=1:i
nprime=div(j1,Nmax)+1;
Nprime=rem(j1,Nmax)+1;
if j>i
continue
end
nua = nu_arr[n]';
nub = nu_arr[nprime];
NUa = NU_arr[N]';
NUb = NU_arr[Nprime];
nus = (nua=nua,nub=nub,NUa=NUa,NUb=NUb);
norma = norm(nua)';
normb = norm(nub);
Norma = norm(NUa)';
Normb = norm(NUb);
norms = (norma=norma,normb=normb,Norma=Norma,Normb=Normb);
Z_temp = mat_z(a,b,d_arr,nus,norms); # for reusing in mat_x
Z[i,j] = Z[i,j] + Z_temp;
X[i,j] = X[i,j] + mat_x(a,b,d_arr,nus,norms;mat_z_element=Z_temp);
for c=cvals
Y[i,j] = Y[i,j] + mat_y(a,b,c,d_arr,nus,norms,prefac,mu_g);
end
end
end # loopend over i,j = dims x dims
end
end # loopend over a,b
Z = Z + tril(Z,1)' #Symmetric(Z, :L)
X = X + transpose(tril(X,1));
Y = Y + transpose(tril(Y,1)) # transpose instead of ' ! (we dont want adjoint...)
return X,Y,Z
end
function mat_z(a::Int64,b::Int64,d_arr::Array{Float64},nus,norms)
(;norma,normb,Norma,Normb) = norms;
(;nua,nub,NUa,NUb) = nus;
Ma = [[nua',0] [0,NUa']];
Mb = [[nub,0] [0,NUb]];
Mab = jtrafo(a,b,d_arr)'*Ma*jtrafo(a,b,d_arr) + Mb;
element = norma*normb*Norma*Normb/sqrt(4*pi)^4 * (pi^2/det(Mab))^(3/2);
return element
end
function mat_x(a::Int64,b::Int64,d_arr::Array{Float64},nus,norms;mat_z_element=1.0)
(;norma,normb,Norma,Normb) = norms;
(;nua,nub,NUa,NUb) = nus;
if mat_z_element == 1.0
mat_z_element = mat_z(a,b,d_arr,nus,norms)
println("mat_z_element should not be 1 !")
end
Ma = [[nua',0] [0,NUa']];
Mb = [[nub,0] [0,NUb]];
Mab = jtrafo(a,b,d_arr)'*Ma*jtrafo(a,b,d_arr) + Mb;
eta = det(Mab)/Mab[2,2];
zeta = det(Mab)/Mab[1,1];
da = d_arr[mod(b+1,3)+1]; # b1; maps 1,2,3 onto 3,1,2.
db = d_arr[b];
dc = d_arr[mod(b,3)+1]; # b+1; maps 1,2,3 onto 2,3,1
dur = dc*da/(dc+da);
duR = db*(da+dc)/(da+db+dc);
element = mat_z_element*(3*nub/dur*(1nub/eta) + 3*NUb/duR*(1NUb/zeta));
return element
end
function mat_y(a::Int64,b::Int64,c::Int64,d_arr::Array{Float64},nus,norms,prefac::Float64,mu_g)
(;norma,normb,Norma,Normb) = norms;
(;nua,nub,NUa,NUb) = nus;
Ma = [[nua',0] [0,NUa']];
Mb = [[nub,0] [0,NUb]];
Mc = [[mu_g,0.0] [0.0,0.0]]
Macb = jtrafo(a,c,d_arr)'*Ma*jtrafo(a,c,d_arr) + jtrafo(b,c,d_arr)'*Mb*jtrafo(b,c,d_arr) + Mc;
element = prefac*norma*normb*Norma*Normb/sqrt(4*pi)^4 * (pi^2/det(Macb))^(3/2);
return element;
end
# Function for Variabletransformation
function jtrafo(i::Int64,f::Int64,d_arr::Array{Float64})
da = d_arr[i];db = d_arr[mod(i,3)+1];dc = d_arr[mod(i+1,3)+1]
j = Array{Float64}(undef,2,2)
if f==i
j[1,1]=1.0;j[2,2]=1.0;
j[1,2]=0.0;j[2,1]=0.0;
elseif mod(fi,3)==1 # trafo a>b (+1)
j[1,1]=da/(da+dc)
j[1,2]=1.0
j[2,1]=dc*(da+db+dc)/(da+dc)/(db+dc);
j[2,2]=db/(db+dc)
elseif mod(fi,3)==2 # trafo a>c (+2)
j[1,1]=da/(da+db)
j[1,2]=1.0
j[2,1]=+db*(da+db+dc)/(da+db)/(db+dc);
j[2,2]=dc/(db+dc);
end
return j
end
## End of function definitions
# Beginning of script:
var_params = (; s1=0.9,snmax=69.5,nmax=20,S1=0.9,SNmax=69.5,Nmax=20);
d_arr = [4.0 81.0 81.0];
abcvals = (;avals=2:2,bvals=2:3,cvals=2:3);
mu_g = 1.0;
prefac = 10.0;
theta = 10.0;
X,Y,Z = mat_xyz(var_params, d_arr, abcvals; mu_g=1.0, prefac=10.0, theta = 0.0);
evals,evecs = eigen(X+Y,Z);
return nothing
#End of script
Profiling attempts

Currently
@btime include("PerformanceTest.jl")
yields almost exactly 1 second of runtime: 998.592 ms (12749484 allocations: 1.34 GiB). 
With the help of some
@time
commands, I see that most of the time is spent in the big loops over i,j (which was expected I guess). 
I have tried playing around using explicit
::DataTypes
in the function arguments, but this has not really changed anything 
I have used @code_warntype for the functions
mat_x
,mat_y
,mat_z
, andjtrafo
without any DataType being marked in red which according to the documentation should reveal any problems with DataTypes. 
I have used
ProfileView.@profview X,Y,Z = mat_xyz(var_params,d_arr,abcvals;mu_g = 1.0,theta=5.0,prefac=10.0)
. I attach the saved results of a few runs to this post. Apart from the first run (which expectedly looks totally different), I dont see any red parts, but several yellow ones on the top (which if I understand the documentation correctly should also indicate some unoptimal parts). However I have a hard time understanding if there is indeed a problem and what it is exactly.
Conclusion:
I have done some profiling, but to be honest, from the results that I see it is very hard to understand (for me!) whether there is actually much improvement to be found or not. Unfortunately I am not able to write similar code in C, or Fortran to quickly test against this implementation. Understanding where the bottleneck is and what should be done against it, are then also the next question.
Thank you!