My Julia code is somehow much slower than the MATLAB code

Thank you for all the relevant information about my field of study! A lot of things to study, phew.
I also found out that the non-generic version of your code is as fast as @Seif_Shebl's code above. It seemed like the generic version needed some extra loading (compilation) time before it could run.
Honestly, I have no idea how these versions differ from mine in their inner workings, but I'll try to understand it! :slight_smile:
Thanks a lot for your support.

I think my computer differs from yours in some way. On my machine, this code was actually a little slower than your previous one: 1.870 s versus 1.530 s with @btime. But if the performance difference comes from the multithreading part, I don't understand why the MATLAB code is so consistently faster on my computer.
I’m learning a lot, thx!!

This is weird. Make sure you set the --check-bounds=no option when you launch Julia for your script. It's well known in the Julia community that code relying on vectorized loops is quite fast. Make sure the loops actually get vectorized by inspecting the emitted LLVM code, via @code_llvm main() for instance. You should see LLVM code containing <4 x double>, which means SIMD vectorization is working.
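For a quick check, here is a minimal sketch (axpy! and its arguments are just illustrative; the LLVM output of a whole main() is hard to scan, so a small kernel is easier):

using InteractiveUtils  # provides @code_llvm (loaded by default in the REPL)

function axpy!(y, a, x)
    @inbounds @simd for i in eachindex(y, x)
        y[i] += a * x[i]
    end
    return y
end

@code_llvm axpy!(zeros(8), 2.0, ones(8))  # on an AVX2 CPU, look for <4 x double>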

Something strange is happening with your Julia code. Notice that the MATLAB code runs at the same speed on our computers, so the Julia code should also run at similar speeds. Maybe some people here can confirm before we look further into what is happening.

Also, what version of Julia are you using? CPU thread count? CPU type? I'm using the latest Julia 1.9 alpha, and my computer has a 4-core Intel i7-4790K CPU.

Can you give me an example of how to use --check-bounds=no? I don't know how to use it.
Yes, I agree it is weird: mine is a laptop that should perform worse than your machine, yet it matches your MATLAB run time but not your Julia one. My laptop specs are below.


Also, my Julia version is 1.7.3. Could this be a source of the difference? I am looking into how to update Julia, but that needs a bit of time to understand, too… :sweat_smile:
I also don't see <4 x double> in the output of @code_llvm main(), even though I did using LoopVectorization. What might be the problem?

My current code:
I added the println(i) inside the loop to make the comparison fair.

using LinearAlgebra
using Statistics
using BenchmarkTools
using LoopVectorization


function main()
    ## (a) ##
    # First, set up the given data#
    α = 0.261
    β = 0.99
    δ = 0.0176
    zGrid = [0.9526, 0.9760, 1, 1.0246, 1.0497]
    Π_z = [0.9149 0.0823 0.0028 0      0
           0.0206 0.9163 0.0618 0.0014 0
           0.0005 0.0412 0.9167 0.0412 0.0005
           0      0.0014 0.0618 0.9163 0.0206
           0      0      0.0028 0.0823 0.9149]
    # Next, get the kStar and kL and kH #
    kStar = (1 / α * (1 / β - 1 + δ))^(1 / (α - 1))
    kL = kStar * 0.85
    kH = 1.15 * kStar
    println(kStar)
    println(kL)
    println(kH)

    # Set up a grid of 499 points for k #
    kGrid = range(kL, kH, 499)
    println(median(kGrid))

    ## (b) ##
    # First, set up the length variables #
    nk = length(kGrid)
    nz = length(zGrid)

    # Make settings for the iteration #
    # First, make an array for each z_i #
    k′Grid = repeat(kGrid, 1, nz, nk)
    kGridMat = permutedims(k′Grid, (3, 2, 1))
    k′GridMat = k′Grid .* (kGridMat .^ α .* zGrid' .+ kGridMat .* (1 - δ) .- k′Grid .> 0)
    replace!(k′GridMat, 0 => NaN) 
    logC = log.(kGridMat .^ α .* zGrid' .+ kGridMat .* (1 - δ) .- k′GridMat)
    replace!(logC, NaN => -Inf)

    # precision and initial distance #
    precision = 0.01
    distance = 100    

    # start iteration #
    i = 1
    V = zeros(nk, nz)
    VΠ_z = similar(V)
    logCC = similar(logC)    
    while distance > precision
        mul!(VΠ_z, V, Π_z')  # expected continuation value for each (k′, z)
        d = -Inf
        for k = 1:nk, j = 1:nz
            m = -Inf
            for ik = 1:nk  # renamed from i to avoid shadowing the iteration counter
                logCC[ik, j, k] = logC[ik, j, k] + VΠ_z[ik, j] * β
                m = ifelse(m > logCC[ik, j, k], m, logCC[ik, j, k])
            end
            d = max(d, m - V[k, j])  # track the largest Bellman update over all states
            V[k, j] = m
        end
        distance = d
        i += 1
        println(i)
    end
    # Add one more iteration to find Tk_index
    _, k_index = findmax(logCC, dims=1)
    mul!(VΠ_z, V, Π_z')
    logCC .= logC .+ VΠ_z .* β
    _, Tk_index = findmax(logCC, dims=1)

    # Recovering the k-policy from the index arrays #
    k′sol = transpose(reshape(k′Grid[Tk_index], nz, nk))
    k′sol_prev = transpose(reshape(k′Grid[k_index], nz, nk)) # previous iteration's policy, for comparison #
    dist_k′sol = maximum(abs(a - b) for (a, b) in zip(k′sol, k′sol_prev))

    # answers #
    println(i)
    println(distance)
    println(dist_k′sol)
end
@btime main()
@code_llvm main()

Thanks a lot for the whole thing, everyone.

I was not able to run your code at first, but I found a minor typo (changed k_index = CartesianIndex{3}[;;;] into k_index = CartesianIndex{3}[]) and finally found that your loop_over_withcache! is also very fast (around 1.5 s on my laptop).
I honestly don't know what makes your code so different from mine. Can you explain what role similar plays, or whatever else is essential to the difference in allocations?
Thanks a lot!

I mean to start Julia like this:

julia.exe --optimize=3 -t auto --check-bounds=no 

Then, include your file as: include("path/to/file.jl")

This will turn off bounds checking for all code, so while you are still debugging it's not recommended to start Julia like this. The other, easier option for you now is just to add @inbounds in front of the while loop and keep your workflow as is.

@inbounds while ..
    ...
end

The version of Julia is not so important here; it makes a negligible difference.

This is how it looks when you issue @code_llvm main() in the REPL. Notice these <4 x double> in the middle of a lot of emitted code.

Finally, my CPU is 4th gen, which is about 10 years old, while yours is 11th gen, so it should be powerful enough to compete with a desktop CPU of that age.


Wow, I don't know why, but when I added @inbounds while as you said, @btime showed me 887.031 ms, which is definitely faster than my MATLAB code.
Thanks a lot!
As a novice in Julia, I am very confused: to my eyes, your code contains a lot of for loops (vectorized loops, of course, but still loops), which until now, with my background in R and MATLAB, were something to be avoided. Could you explain what is happening? Also, why was my first code so heavy in memory while yours is so light?

For loops are fast in Julia, like they are in compiled languages like C or C++. There’s almost never a reason to avoid them, and they can often be the simplest way of writing highly performant code.
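As a quick, self-contained illustration (mysum is a made-up name; exact timings depend on the machine), a hand-written loop is competitive with the built-in reduction:

using BenchmarkTools

function mysum(x)
    s = zero(eltype(x))
    @inbounds @simd for i in eachindex(x)
        s += x[i]
    end
    return s
end

x = rand(10^6);

@btime sum($x);    # built-in, implemented as a loop internally
@btime mysum($x);  # explicit loop, comparable speed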


Sometimes, this is worth checking. Just yesterday, updating to the latest Julia version gave me a 70% performance increase on one of my tasks. This is obviously not always guaranteed but sometimes it can make a difference.

There's a tool called juliaup which lets you update Julia seamlessly. You can read about it on its GitHub page. juliaup lets you have multiple versions of Julia installed at the same time, so you can always try the latest version with the option of reverting if something in your code breaks or gets slower (unlikely, but it could happen).
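For reference, the day-to-day commands look like this (see juliaup --help or the README for the authoritative list):

juliaup update        # update the installed Julia channels
juliaup add 1.8       # install another version side by side
juliaup default 1.8   # make that version the default julia
juliaup status        # list installed versions and the default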


Those 'vectorized' calls in MATLAB or R just send the inputs to a C/Fortran library that uses a loop. It's all loops; MATLAB/R/NumPy just hide them from you.

If you use a fast language, like C/Fortran/Julia, you don’t need to send your data somewhere else to do the loops for you, you just write the loops yourself, directly.
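To make that concrete (a sketch with made-up variable names), the broadcast call and the explicit loop below do exactly the same work:

x = rand(1000); y = rand(1000);

z1 = x .+ y  # 'vectorized' syntax: Julia compiles this to a loop anyway

z2 = similar(x)
for i in eachindex(x, y)
    z2[i] = x[i] + y[i]  # the same loop, written out by hand
end

z1 == z2  # true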


Not really a typo, but this is a feature that is only available with Julia v1.8 or newer: add syntax for empty n-dimensional arrays by simeonschaub · Pull Request #41618 · JuliaLang/julia · GitHub

Can you explain what role similar has or anything that is essential in the difference of allocation?

similar(V) alone is not responsible for the performance gain; I could also have used zeros(n_k, n_z). Its purpose is just to create another array with the same size and element type as V. Try ?similar in the REPL to learn more about it.
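For instance, in the REPL:

julia> V = zeros(3, 2);

julia> W = similar(V);  # same size and element type as V; contents are uninitialized

julia> size(W) == size(V), eltype(W) == eltype(V)
(true, true)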

What I tried to do in my version is to prevent computations like logC::Array{Float64,3} .+ V*transpose(Pi_z).*beta inside the loop_body functions from creating a new array, and hence an allocation, on every call, which happens in every step of the while loop. So my version sets up an array called tmp_product with the right size before the while loop and forwards it to loop_body_withcache!. In there I use the .= assignment (notice the dot!!!) to tell Julia that it should store the result of the right-hand-side computation into tmp_product. Using a plain = would have discarded the buffer I set up before.
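Here is a small sketch of the effect, with made-up sizes (tmp plays the role of tmp_product):

using BenchmarkTools

A = rand(100, 100); B = rand(100, 100); β = 0.99;
tmp = similar(A);

@btime $A .+ $B .* $β;          # plain broadcast: allocates a fresh array on every call
@btime $tmp .= $A .+ $B .* $β;  # in-place broadcast: reuses tmp, zero allocations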

The effect of this is that the program needs around 100x less memory to do the same work. When I wrote the code I knew one could shave off some more allocations by keeping findmax from creating new arrays all the time, but I was too lazy on my train :smile:. Luckily @Seif_Shebl showed how it can be done.


That's awesome, considering that I was struggling to turn 'easy with a for loop' problems into array operations. Thank you for explaining it.

Thank you for the explanation. I feel ashamed of all the time I spent in the past changing loops into vectorized code. :joy:

Thanks a lot for your explanation. I am starting to understand some of it (though I'm still working on that understanding :joy:).

I am determined to understand the whole thing and reproduce it myself; otherwise I cannot benefit from learning one of the fastest languages. In this regard, thank you for giving me valuable material to learn from.

I have some questions, by the way. In your code, I noticed that you set up a make_cache function for V, TV, tmp_product and then reset them to all zeros again inside the loop_withcache! function. Is there a particular reason for this? (That is, why did you make a separate function for generating the right-sized V, TV, tmp_product?)

Also, I am confused about the relationship between allocations and arrays. At first, I thought avoiding findmax on an array was the key. But then I found that @Seif_Shebl's code also has two findmax calls on arrays, in lines 74 and 75. Why don't these hurt performance? Don't these arrays allocate just like my original ones?

Lastly, what actually is the difference between .= and = between arrays? Does the former consume less memory, or is it just about performance? (Or both?)

To be fair, my first version did not have the fill! in there. If you leave that out you will see that running

cache = make_cache(n_k, n_z)
(Tk_index,distance,k_index,i) = loop_withcache!(startdistance,precision,cache)

will only be slow the first time you run it; all subsequent runs will be much faster. The reason is that V, TV, tmp_product will still contain the results from the previous run and, hence, there is no need to iterate again. If you look at the EDIT of that post (click the little pencil at the beginning of my answer) you will see that I initially reported a 1000x speedup because of that.
To make the benchmark fair again, I then added the fill!s so that the buffers are reset every time the function is run and, thus, every call does 'real' work.

(The reason why you made a separate function for generating the right sized V, TV, tmp_product.)

The reason is that I can create the cache before I call loop_withcache!; otherwise @benchmark would count those allocations too.
That said, you could also have created the cache inside loop_withcache! and the benchmark would still have shown fewer allocations overall compared to your versions, because the real saving comes from using .= in the inner loop. However, creating the cache outside of the function allows you to reuse it multiple times. E.g., imagine calling loop_withcache! inside another loop many times, each iteration with slightly modified parameters; preparing the cache separately would allow you to reuse it there too (see the sketch below).
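A sketch of that pattern (the bodies of make_cache and loop_withcache! here are stand-ins, not the actual code from this thread):

function make_cache(n_k, n_z)
    V = zeros(n_k, n_z)
    TV = zeros(n_k, n_z)
    tmp_product = zeros(n_k, n_z)
    return (V = V, TV = TV, tmp_product = tmp_product)
end

cache = make_cache(499, 5)  # allocate the buffers once
for θ in 0.1:0.1:1.0        # hypothetical parameter sweep
    fill!(cache.V, 0.0)     # reset so every solve does 'real' work
    fill!(cache.TV, 0.0)
    fill!(cache.tmp_product, 0.0)
    # loop_withcache!(startdistance, precision, cache)  # reuses the same buffers
end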

Also, I am confused about the relationship between allocation and arrays. First, I thought avoiding findmax with an array inside was the key. But next, I found out that @Seif_Shebl 's code also happened to have two of the findmax that contains an array in lines 74 and 75. Why are these not damaging the performance? Are these arrays not consuming data allocation as my original arrays?

I guess you are referring to this one: My julia code is somehow much slower than the matlab code - #49 by Seif_Shebl

In my version, findmax is called in every iteration of the while loop, of which we need 412 (see the benchmark results). @Seif_Shebl's version, on the other hand, calls findmax exactly once, after the expensive while loop is done. It is not that findmax needs to be avoided at all costs in Julia, but you should avoid running it in a hot loop (a loop that needs to do lots of work) when you cannot cache the result. As his example demonstrates, the cost of the while loop itself clearly dominates the cost of running findmax once, so hoisting findmax out of the loop is what makes his version faster than mine.
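You can see the per-call cost directly (array size taken from the problem above; numbers are machine dependent):

using BenchmarkTools

A = rand(499, 5, 499);

@btime findmax($A, dims=1);  # allocates fresh value and index arrays on every call,
                             # so paying this 412 times inside the loop adds up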

For the next points, let me say that I am not 100% an expert on Julia's memory model and my vocabulary might be a bit off. Others should please correct me if I am using the wrong terminology here.

Lastly, what actually makes the difference between .= and = between arrays? Is the former consuming less data? or is it just related to performance? (or both?)

Consider

julia> N = 2;

julia> a = zeros(N)
2-element Vector{Float64}:
 0.0
 0.0

julia> b = ones(N)
2-element Vector{Float64}:
 1.0
 1.0

julia> a = b
2-element Vector{Float64}:
 1.0
 1.0

julia> a[1] = 2;

julia> a
2-element Vector{Float64}:
 2.0
 1.0

julia> b
2-element Vector{Float64}:
 2.0
 1.0

julia> a = zeros(N)
2-element Vector{Float64}:
 0.0
 0.0

julia> b = ones(N)
2-element Vector{Float64}:
 1.0
 1.0

julia> a .= b
2-element Vector{Float64}:
 1.0
 1.0

julia> a[1] = 2;

julia> a
2-element Vector{Float64}:
 2.0
 1.0

julia> b
2-element Vector{Float64}:
 1.0
 1.0

As you can see, using = makes a and b refer to the same array: if you change an element of a, this is reflected in b and vice versa. Note that in this example the array to which a referred before the assignment (the zeros) is now lost (it might still be around if another variable refers to it, but that is not the case in our example). In fact, you would not have needed to set up a with zeros(N) at all.
However, if you use .= instead, then the contents of b are copied over to a: changing elements of a is not reflected in b and vice versa. For this to work, it is important that before running a .= b there already exists an a such that size(a) == size(b). The benefit is that Julia does not have to allocate new storage; it just overwrites the contents of the array a already refers to.

If you understood this concept, then it should be easy to see why using .= above caused fewer allocations.
