My Julia code is somehow much slower than the MATLAB code

Hi, I’m new to Julia, so I used my existing MATLAB code to write a Julia counterpart.
I heard that Julia prefers the performance-demanding part to be inside a function, so I did that.
However, the performance still lags far behind MATLAB’s. I want to know what makes my Julia code inefficient.

The MATLAB code comes first, followed by the Julia code.

%First, set up given data%
alpha=0.261;beta=0.99;delta=0.0176;n_star=1;
z_grid=[0.9526;0.9760;1;1.0246;1.0497];
Pi_z=[0.9149 0.0823 0.0028 0 0;...
    0.0206 0.9163 0.0618 0.0014 0;...
    0.0005 0.0412 0.9167 0.0412 0.0005;...
    0 0.0014 0.0618 0.9163 0.0206;...
    0 0 0.0028 0.0823 0.9149];
%Next, get the k_star and k_L and k_H%
k_star=(1/alpha*(1/beta-1+delta))^(1/(alpha-1));
k_L=k_star*0.85;k_H=1.15*k_star;
disp(round(k_star,4));disp(round(k_L,4));disp(round(k_H,4));
%Setting up an 499 points of kgrid%
k_grid=(k_L:((k_H-k_L)/498):k_H).';
disp(median(k_grid));

%%(b)%%
%First, set up the length variables%
n_k=length(k_grid);n_z=length(z_grid);
%Now initialize a V array%
V=zeros(n_k,n_z);
%Make settings for the iteration%
%First, make an array for each z_i%
k_prime_grid_mat_proto=repmat(k_grid,[1 n_z n_k]);%build prototype k' array with 3-d block array%
k_grid_mat=permute(k_prime_grid_mat_proto,[3 2 1]);%build k array with 3-d block array%;
k_prime_grid_mat=k_prime_grid_mat_proto.*(k_grid_mat.^alpha.*z_grid.'+k_grid_mat.*(1-delta)-k_prime_grid_mat_proto>=0);
k_prime_grid_mat(k_prime_grid_mat==0)=nan;
logC=log(k_grid_mat.^alpha.*z_grid.'+k_grid_mat.*(1-delta)-k_prime_grid_mat);
%build k' array that assigns nan to k' that is out of bound%

precision=0.01;distance=100;sz=size(k_prime_grid_mat);%precision and initial distance%

i=1;%iteration number%

%Start iteration%
while (distance > precision)
    [TV,Tk_index]=max(logC+repmat(V*Pi_z.',[1 1 n_k]).*beta,[],'omitnan');%max within column omitting nan%
    TV=reshape(permute(TV,[1,3,2]),[],size(TV,2));%change 1x5x499 array into 499x5 matrix%
    distance=max(max(TV-V));
    i=i+1;disp(i);disp(distance);
    if distance<precision
        break
    else
    V=TV;k_index=Tk_index;
    end
end

%reviving k-rules from the k_index array%
k_prime_sol=k_prime_grid_mat_proto(Tk_index);
k_prime_sol_mat=reshape(permute(k_prime_sol,[1,3,2]),[],size(k_prime_sol,2));%change array into matrix%
k_prime_sol_previous=k_prime_grid_mat_proto(k_index);
k_prime_sol_previous_mat=reshape(permute(k_prime_sol_previous,[1,3,2]),[],size(k_prime_sol_previous,2));%change array into matrix%
%matrix of previous capital to compare with solutions%
distance_k=max(max(abs(k_prime_sol_mat-k_prime_sol_previous_mat)));

%answers%
disp(i);disp(distance);disp(distance_k);
And here is the Julia code:

using Statistics
##(a)##
#First, set up the given data#
const alpha=0.261;const beta=0.99;const delta=0.0176;const n_star=1;
const z_grid=[0.9526;0.9760;1;1.0246;1.0497];
const Pi_z=[0.9149 0.0823 0.0028 0 0;
    0.0206 0.9163 0.0618 0.0014 0;
    0.0005 0.0412 0.9167 0.0412 0.0005;
    0 0.0014 0.0618 0.9163 0.0206;
    0 0 0.0028 0.0823 0.9149];
#Next, get the k_star and k_L and k_H#
const k_star=(1/alpha*(1/beta-1+delta))^(1/(alpha-1));
const k_L=k_star*0.85;const k_H=1.15*k_star;
println(round(k_star,digits=4));println(round(k_L,digits=4));println(round(k_H,digits=4));
#Setting up an 499 points of kgrid#
const k_grid=[k_L:((k_H-k_L)/498):k_H;];
println(median(k_grid))

##(b)##
#First, set up the length variables#
const n_k=length(k_grid);const n_z=length(z_grid);
#Make settings for the iteration#
#First, make an array for each z_i#
const k_prime_grid_mat_proto=repeat(k_grid,1, n_z, n_k);
const k_grid_mat=permutedims(k_prime_grid_mat_proto,(3,2,1));
k_prime_grid_mat=k_prime_grid_mat_proto.*(k_grid_mat.^alpha.*transpose(z_grid)+k_grid_mat.*(1-delta)-k_prime_grid_mat_proto.>0);
k_prime_grid_mat[k_prime_grid_mat.==0].=NaN;#build k' array that is out of bound#
logC=log.(k_grid_mat.^alpha.*transpose(z_grid)+k_grid_mat.*(1-delta)-k_prime_grid_mat);
logC[isnan.(logC)].=-Inf;
    
const precision=0.01;
const startdistance=100;#precision and initial distance#
#iteration number#

#start iteration#
function loop_over_global(distance, precision)
    #Now initialize a V array#
    global V=zeros(n_k,n_z);global i=1;
    while distance>precision
        global (TV,Tk_index)=findmax(logC::Array{Float64, 3}+repeat(V*transpose(Pi_z),1, 1, n_k).*beta,dims=1);
        global TV=transpose(reshape(TV,5,499));
        distance=maximum(TV-V);
        println(i);println(distance);
        if distance≤precision
            break
        else
            global V=TV; global k_index=Tk_index; i=i+1;
        end
    end
    return (TV,V,Tk_index,k_index,distance);
end

global (TV,V,Tk_index,k_index,distance)=loop_over_global(startdistance,precision);
#Reviving k-rules from the k_index array#
k_prime_sol_mat=transpose(reshape(k_prime_grid_mat_proto[Tk_index],5,499));
k_prime_sol_mat_previous=transpose(reshape(k_prime_grid_mat_proto[k_index],5,499));#matrix of previous capital to compare with solutions#
distance_k_prime_sol=maximum(abs.(k_prime_sol_mat-k_prime_sol_mat_previous));

##for information; how to convert a cartesian index array into matrix with only necessary info(first entry of the cartesian index)##
Tk_index=first.(Tuple.(Tk_index));Tk_index=transpose(reshape(Tk_index,5,499));
    
#answers#
println(i);println(distance);println(distance_k_prime_sol);

Thanks for reading this cumbersome code. I hope I can become fluent in this fantastic language.

I haven’t grokked the code, but as a first idea here: it’s the inner loop body that should mostly be in a function. Can you try that (extract the body of the while to a function) and see if it runs faster?

I didn’t quite get the idea of extracting the body of the while loop into a function; I thought that was what I had done.
Are you talking about the second code block? The first one is MATLAB’s.

The idea is:

function main(…)
    while {condition}
      {result} = loop_body(…)
    end
end

function loop_body(…)
    …
end

At first glance: you are creating global variables in the inner loop. That would be bad in any language on purely stylistic grounds, but in Julia, working with non-const untyped global variables is particularly bad for performance.

Have you had a chance to read Performance Tips · The Julia Language? It is a particularly important part of the documentation for people coming from other languages, as Julia has slightly different habits than Python and Matlab.
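
A toy illustration of the global-variable point, in the spirit of the examples in Performance Tips (this is not your code):

using BenchmarkTools

x = rand(1000)           # non-const, untyped global

function sum_global()    # reads the global x inside the loop
    s = 0.0
    for v in x
        s += v
    end
    return s
end

function sum_arg(a)      # identical loop, but the array is a function argument
    s = 0.0
    for v in a
        s += v
    end
    return s
end

# @btime sum_global()    # typically much slower and allocates, because x's type is not known to the compiler
# @btime sum_arg($x)     # fast, no allocations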

Side note: you are printing from inside your inner loop. That would throw off the measurement results both for the Matlab and Julia examples.


A quick guess: this line, distance=maximum(TV-V), probably causes significant allocations that slow down your code. TV-V allocates a new array on every iteration, which can easily be slower than the rest of the computation combined, i.e. reserving the memory takes longer than actually computing the values that go into it.

Also, could you share how you benchmarked? I would strongly suggest using @btime from BenchmarkTools.jl when measuring the speed of Julia code, to make sure you are not measuring compilation time.
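
For example (a sketch, using the function name from your post):

using BenchmarkTools

# $ interpolates the arguments so the timing measures the call itself rather than
# global lookups; @btime also runs the call repeatedly, so the reported time does
# not include compilation.
@btime loop_over_global($startdistance, $precision)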



In Julia there is no need for a semicolon at the end of each line.

Is there a way to calculate the sup norm of the difference of two value functions while avoiding something like distance=maximum(TV-V)?

I just measured with my phone, but the difference was so significant that the measurement didn’t matter.
But I would like to use the package from now on. Thanks!

Thank you for the suggestion. I tried to change the code as below.

function loop_over_global(distance, precision)
    #Now initialize a V array#
    V=zeros(n_k,n_z);i=1;k_index=Array{CartesianIndex{3}, 3}
    while distance>precision
        k_index=k_index;
        (TV,Tk_index,distance)=loop_body(V);
        println(i);println(distance);
        if distance≤precision
            return (TV,V,Tk_index,distance,k_index,i);
            break
        else
            V=TV; k_index=Tk_index::Array{CartesianIndex{3}, 3}; i=i+1;
        end
    end
end

function loop_body(V)
    (A,B)=findmax(logC::Array{Float64,3}+repeat(V*transpose(Pi_z),1,1,n_k).*beta,dims=1);
    A1=transpose(reshape(A,size(V)[2],size(V)[1]));
    C=maximum(A1-V);
    return (A1,B,C)
end

(TV,V,Tk_index,distance,k_index,i)=loop_over_global(startdistance,precision);

However, I think there was not much difference from the previous one. Is there something I missed?

I’m on holiday without my computer, so I can’t be very helpful, but I can tell that this is inefficient. You never need to use repeat, and you shouldn’t: this is one of the good things about broadcasting. You should not create arrays with many repeated entries. Also, get rid of the Array{Float64,3} annotation; it can only do harm.

Try something like

(A, B) = findmax(logC .+ V .* transpose(Pi_z) .* beta; dims=1)

I cannot test it, so cannot guarantee it works, maybe you need a reshape, for example.

Also, you don’t need to use repeat in Matlab either; it also supports broadcasting, though implicitly, without dots.

Generally, just remove repeat from your programming vocabulary (in any programming language). Imho, you will never need it for anything, it’s just a waste of time and memory.
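
In this particular case, a sketch of that idea (untested, and assuming logC has size (n_k, n_z, n_k) with k' along the first dimension as in the original post; the matrix product V * transpose(Pi_z) is kept because it computes the conditional expectation, only the repeat goes away):

EV = V * transpose(Pi_z)                 # expected continuation value, an (n_k, n_z) matrix

# Broadcasting the (n_k, n_z) matrix against the (n_k, n_z, n_k) array implicitly
# extends it along the trailing dimension, so no repeated copy of EV is allocated.
TV, Tk_index = findmax(logC .+ beta .* EV; dims = 1)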

(Btw, your code would be more readable if you put spaces around operators, etc.)


And please do not put ; at the end of the lines…

If I’m reading this line correctly, you’re setting k_index equal to a datatype and then in the hot loop you’re changing its value to an instance of that type.
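
If the variable only needs a binding before the loop, one option (a sketch, shaped like what findmax(...; dims=1) returns for these arrays) is to allocate an actual array of that element type instead of binding the type object:

# An uninitialized 1 x n_z x n_k array of CartesianIndex{3}, i.e. an instance,
# rather than the type Array{CartesianIndex{3}, 3} itself.
k_index = Array{CartesianIndex{3}, 3}(undef, 1, n_z, n_k)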

Can you post a complete running example of your code, and the time you are getting?

Regarding how to compute that maximum without allocating an intermediate array, here are some options:

maximum(a1[i] - v[i] for i in eachindex(a1,v))

This is not faster for a single call, but does not allocate the intermediate array A1-V, and since this is being called from within the loop, that can be faster for the whole timing.

The fastest you can get for that operation is probably something like this:

julia> using LoopVectorization

julia> function f2(a1,v)
           m = zero(eltype(a1))
           @turbo for i in eachindex(a1,v)
               val = a1[i] - v[i]
               m = ifelse(val > m, val, m)
           end
           return m
       end
f2 (generic function with 1 method)

julia> @btime f2($a1, $v)
  86.956 ns (0 allocations: 0 bytes)
0.9590868980321948

This is about 15 times faster than maximum(a1-v).

(I don’t know how much time is being spent on this operation, though.)

Is this better than m = max(m, val)?

A more readable version of the code, with more line breaks and no useless semicolons:

using Statistics
const alpha=0.261
const beta=0.99
const delta=0.0176
const n_star=1

const z_grid=[0.9526;0.9760;1;1.0246;1.0497]
const Pi_z=[0.9149 0.0823 0.0028 0 0
    0.0206 0.9163 0.0618 0.0014 0
    0.0005 0.0412 0.9167 0.0412 0.0005
    0 0.0014 0.0618 0.9163 0.0206
    0 0 0.0028 0.0823 0.9149]
#Next, get the k_star and k_L and k_H#
const k_star=(1/alpha*(1/beta-1+delta))^(1/(alpha-1))
const k_L=k_star*0.85
const k_H=1.15*k_star
println(round(k_star,digits=4));println(round(k_L,digits=4));println(round(k_H,digits=4))
#Setting up an 499 points of kgrid#
const k_grid=[k_L:((k_H-k_L)/498):k_H;]
println(median(k_grid))

##(b)##
#First, set up the length variables#
const n_k=length(k_grid)
const n_z=length(z_grid)
#Make settings for the iteration#
#First, make an array for each z_i#
const k_prime_grid_mat_proto=repeat(k_grid,1, n_z, n_k)
const k_grid_mat=permutedims(k_prime_grid_mat_proto,(3,2,1))
k_prime_grid_mat=k_prime_grid_mat_proto.*(k_grid_mat.^alpha.*transpose(z_grid)+k_grid_mat.*(1-delta)-k_prime_grid_mat_proto.>0)
k_prime_grid_mat[k_prime_grid_mat.==0].=NaN#build k' array that is out of bound#
logC=log.(k_grid_mat.^alpha.*transpose(z_grid)+k_grid_mat.*(1-delta)-k_prime_grid_mat)
logC[isnan.(logC)].=-Inf
    
const precision=0.01
const startdistance=100#precision and initial distance#
#iteration number#

#start iteration#
function loop_over_global(distance, precision)
    #Now initialize a V array#
    global V=zeros(n_k,n_z)
    global i=1
    while distance>precision
        global (TV,Tk_index)=findmax(logC::Array{Float64, 3}+repeat(V*transpose(Pi_z),1, 1, n_k).*beta,dims=1)
        global TV=transpose(reshape(TV,5,499))
        distance=maximum(TV-V)
        println(i)
        println(distance)
        if distance≤precision
            break
        else
            global V=TV
            global k_index=Tk_index
            i=i+1
        end
    end
    return (TV,V,Tk_index,k_index,distance)
end

global (TV,V,Tk_index,k_index,distance)=loop_over_global(startdistance,precision)
#Reviving k-rules from the k_index array#
k_prime_sol_mat=transpose(reshape(k_prime_grid_mat_proto[Tk_index],5,499))
k_prime_sol_mat_previous=transpose(reshape(k_prime_grid_mat_proto[k_index],5,499))#matrix of previous capital to compare with solutions#
distance_k_prime_sol=maximum(abs.(k_prime_sol_mat-k_prime_sol_mat_previous))

##for information; how to convert a cartesian index array into matrix with only necessary info(first entry of the cartesian index)##
Tk_index=first.(Tuple.(Tk_index))
Tk_index=transpose(reshape(Tk_index,5,499))
    
#answers#
println(i)
println(distance)
println(distance_k_prime_sol)

Timing all of it takes around 11.43 seconds for me (occasionally as low as 8).
Putting all that (except using Statistics) in a begin-end block, and annotating with @profview, lets me know that 88% of the time is spent in poptask:


By the way, removing all const annotations changes the runtime from around 11.43 to around 11.6 for me, so I will do that, as it allows BenchmarkTools’s @btime.

Removing the hot loop printing takes the time down to 10.3 s the first time, and 8.648 s the next time. There is clearly some variation in the runs, but it seems to cut about a second.

Simply removing the global annotations (which had no functionality!) takes it down to 5.8 seconds. A new @profview without constant and global annotations, and without inner-loop printing, shows that still 88% of the time is spent in poptask.

I am not sure what that is, nor how to fix it.

Zooming in on the 12% of the time spent on loop_over_global shows the following “flamegraph”. It can be a bit cryptic at first, but tells you everything about where the time is spent. It can probably help others spot the performance problems in the inner loop.


In my tests that allowed vectorization, and max didn’t.
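
For reference, the variant the question refers to would just change the reduction line in the same sketch:

using LoopVectorization

function f2_max(a1, v)
    m = zero(eltype(a1))
    @turbo for i in eachindex(a1, v)
        m = max(m, a1[i] - v[i])   # max(...) instead of the ifelse(...) above
    end
    return m
end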

This “pop_task” stuff is because of unused threads.
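
If you want to check or rule that out, you can look at the session’s thread count and, for benchmarking, start Julia single-threaded (just a suggestion; I have not verified it against this profile):

Threads.nthreads()        # how many threads the current session was started with

# For benchmarking without extra threads, start Julia with a single thread, e.g.:
#   julia --threads=1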

I would also say that your code is clearly written in a Matlab kind of way, and a Julian formulation would probably look very different. But it takes some effort to see what your code is actually trying to do, which is a key part of making an improved rewrite. A general tip to make it easier to help you is to boil the code down to as small and clear an example as possible, which is then easier to optimize.

One of the big problems is that the timing varies significantly - you need to get rid of this randomness to make it easy to optimize.

One smaller thing is e.g. the following line:

k_grid=[k_L:((k_H-k_L)/498):k_H;]

You are in fact (WRONG, I did not know the syntax well enough when writing this) making a 1-element vector containing a range as its only element, which does not seem to be what you want, especially considering your calls to length(k_grid). But if I remove the semicolon, your code breaks. It is just messy code to understand and optimize, which is a problem unrelated to Julia or Matlab.

That’s because the semicolon is syntactically significant. Cf. the construction with two elements:

julia> [1:2, 3:5]
2-element Vector{UnitRange{Int64}}:
 1:2
 3:5

julia> [1:2; 3:5]
5-element Vector{Int64}:
 1
 2
 3
 4
 5

Effectively, [a:b;] is equivalent to collect(a:b), but most likely using the range a:b directly would work equally well and be more efficient.
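
Applied to the grid above, that would be something like (a sketch; range with a length keyword gives the same 499 evenly spaced points without materializing a vector):

k_grid = range(k_L, k_H; length = 499)              # lazy range of 499 points from k_L to k_H
# or, if an actual Vector{Float64} is really needed:
k_grid_vec = collect(range(k_L, k_H; length = 499))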

5 Likes