I just decided to migrate from Python+Fortran to Julia as Julia was faster in my test

Hello everyone!

I just wanted to say hi and share what made me change my mind about my current setup.

I am an applied mathematician working in scientific computing and geospatial data analysis. On the scientific computing side, I will definitely be switching from Python+Fortran to Julia for my next project. As for geospatial data analysis, I believe I will be sticking with pure Python for the moment, as I’m only using it as a wrapper around efficient C libraries.

First, I wanted to say that I didn’t believe the publicized performance tests until I ran my own. I thought maybe they were cherry-picking stuff, running toy problems, etc. So I went for the most significant (and yet simple) operation that sucks up most of the computational resources in my area: the evaluation of lots and lots of complex exponential functions.

I still can’t believe that Julia is about 30% faster than gfortran in my test, with multithreading and full optimization flags.

I wrote a blog post with details of my experience. I’m also new to blogging so I would really appreciate any feedback. Here you can find the link:

https://www.matecdev.com/articles/numpy-julia-fortran.html

I look forward to blogging more about Julia, as writing pushes me to learn more and to document technical details that don’t end up in papers and the like. I will also be checking how well Julia packages cover my most frequent topics.

Regards!

51 Likes

I recommend trying Tullio.jl; with it you can write:

using Tullio
function eval_exp_tullio(N)
    a = range(0, stop=2*pi, length=N)
    @tullio A[i,j] := exp(100*im*(a[i]^2 + a[j]^2))
    return A
end

and here’s the timing (top half is original, bottom half is Tullio):

running loop on 4 threads 
  0.007978 seconds (34 allocations: 15.261 MiB)
  0.033045 seconds (25 allocations: 61.037 MiB)
  0.074765 seconds (25 allocations: 137.331 MiB)
  0.128596 seconds (25 allocations: 244.143 MiB)
  0.222373 seconds (25 allocations: 381.472 MiB)
  0.480560 seconds (24 allocations: 549.318 MiB, 28.88% gc time)
  0.494272 seconds (24 allocations: 747.683 MiB, 3.44% gc time)
  0.613162 seconds (26 allocations: 976.565 MiB, 5.34% gc time)
  0.800912 seconds (25 allocations: 1.207 GiB, 4.84% gc time)
  1.002543 seconds (26 allocations: 1.490 GiB, 4.72% gc time)
running loop on 4 threads 
  0.011456 seconds (4.88 k allocations: 15.552 MiB)
  0.024558 seconds (46 allocations: 61.038 MiB)
  0.054313 seconds (46 allocations: 137.332 MiB)
  0.095881 seconds (46 allocations: 244.143 MiB)
  0.153821 seconds (46 allocations: 381.472 MiB)
  0.346016 seconds (46 allocations: 549.319 MiB, 36.62% gc time)
  0.315549 seconds (46 allocations: 747.683 MiB, 5.61% gc time)
  0.419185 seconds (46 allocations: 976.565 MiB, 7.71% gc time)
  0.529250 seconds (46 allocations: 1.207 GiB, 7.63% gc time)
  0.639491 seconds (46 allocations: 1.490 GiB, 7.36% gc time)
6 Likes

In the article, I think the link provided in the last part of the sentence,

In any case, trying new things always comes with some risk, and at least Julia’s approach to avoiding the Python2/3 disaster is that of supporting at least two adjacent versions simultaneously.

is wrong. It just redirects to the same article.

2 Likes

For exponentials with imaginary arguments, check out the cis function. cis(x) is the same as exp(im*x), and may have some optimizations.
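
For instance, a quick REPL check (my own example, not from the original post):

julia> cis(1.0) ≈ exp(im * 1.0)   # same value; cis goes straight to sincos instead of the general complex exp
true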

6 Likes

Are those your own libraries or public ones (just curious about the geospatial part)? Note that Julia is very, very handy for wrapping C libraries.

1 Like

Welcome @martin.d.maas! Julia can be quite the rabbit hole. I mostly came for the speed, but at this point I’d probably stay even if it were slow because the language has so many other interesting aspects to it.

A couple of performance notes:

  • The matrix A doesn’t need to be initialized with zeros; instead, you can grab it as an uninitialized chunk of memory by writing A = Matrix{ComplexF64}(undef, N, N). This saves the program from having to set everything to zero first.
  • In your loop, all array accesses are in-bounds by construction, so you can annotate the loop with the @inbounds macro to stop Julia from doing any bounds checks (it may already have figured out on its own that it can drop these).
  • As mentioned above by @DNF, cis(...) is more efficient than exp(im * ...).

Hence, I’d write

function eval_exp_tweaked(N)
    a = range(0, stop=2*pi, length=N)
    A = Matrix{ComplexF64}(undef, N, N)
    @inbounds Threads.@threads for i in 1:N
        for j in 1:N
            A[i,j] = cis(100*(a[i]^2 + a[j]^2))
        end
    end
    return A
end

and for benchmarking, I strongly suggest checking out the @btime or @benchmark macros from BenchmarkTools.jl; they take many samples to get statistically meaningful results, and they exclude compilation time.

Running my tweaked version, I see we can squeeze out quite a bit of extra performance for the smaller sizes:

using BenchmarkTools

function eval_exp(N)
    a = range(0, stop=2*pi, length=N)
    A = zeros(Complex{Float64},N,N)
    Threads.@threads for i in 1:N
        for j in 1:N
            A[i,j] = exp(100*im*(a[i]^2 + a[j]^2))
        end
    end
    return A
end

function eval_exp_tweaked(N)
    a = range(0, stop=2*pi, length=N)
    A = Matrix{ComplexF64}(undef, N, N)
    @inbounds Threads.@threads for i in 1:N
        for j in 1:N
            A[i,j] = cis(100*(a[i]^2 + a[j]^2))
        end
    end
    return A
end

print(string("running loop on ", Threads.nthreads(), " threads \n"))
for N in 1_000:1_000:10_000
    @show N
    A = @btime eval_exp($N)
    B = @btime eval_exp_tweaked($N)
    @assert A ≈ B
    println()
end
running loop on 6 threads 
N = 1000
  5.909 ms (33 allocations: 15.26 MiB)
  3.877 ms (33 allocations: 15.26 MiB)

N = 2000
  45.875 ms (33 allocations: 61.04 MiB)
  23.908 ms (33 allocations: 61.04 MiB)

N = 3000
  96.361 ms (35 allocations: 137.33 MiB)
  55.392 ms (33 allocations: 137.33 MiB)

N = 4000
  173.731 ms (36 allocations: 244.14 MiB)
  92.259 ms (35 allocations: 244.14 MiB)

N = 5000
  259.352 ms (36 allocations: 381.47 MiB)
  157.808 ms (35 allocations: 381.47 MiB)

N = 6000
  409.806 ms (35 allocations: 549.32 MiB)
  258.681 ms (35 allocations: 549.32 MiB)

N = 7000
  536.165 ms (36 allocations: 747.68 MiB)
  363.541 ms (35 allocations: 747.68 MiB)

N = 8000
  985.929 ms (37 allocations: 976.57 MiB)
  836.564 ms (36 allocations: 976.57 MiB)

N = 9000
  1.008 s (38 allocations: 1.21 GiB)
  668.207 ms (37 allocations: 1.21 GiB)

N = 10000
  1.446 s (43 allocations: 1.49 GiB)
  995.583 ms (41 allocations: 1.49 GiB)

Tullio.jl is one of my favourite packages and definitely worth looking into for this stuff though.

9 Likes

I think the loops are still backwards – the first index wants to be innermost, since Julia arrays are column-major. And this is one of the cases where collecting the range into an actual Vector helps, instead of recomputing its elements each time. Adding some variants to Mason’s timing loop:

julia> for N in 1_000:1_000:10_000
           @show N
           A = @btime eval_exp($N)
           B = @btime eval_exp_tweaked($N)    # mason's, above
           B = @btime eval_exp_tweaked_2($N)  # loops re-ordered
           B = @btime eval_exp_tweaked_3($N)  # range collected
           B = @btime eval_exp_tullio($N)     # is using @fastmath, perhaps unwisely
           @assert A ≈ B
           println()
       end
N = 1000
  5.724 ms (24 allocations: 15.26 MiB)
  4.116 ms (23 allocations: 15.26 MiB)
  3.830 ms (23 allocations: 15.26 MiB)
  2.826 ms (24 allocations: 15.27 MiB)
  3.256 ms (47 allocations: 15.27 MiB)

N = 2000
  22.612 ms (24 allocations: 61.04 MiB)
  15.405 ms (23 allocations: 61.04 MiB)
  14.909 ms (23 allocations: 61.04 MiB)
  11.466 ms (25 allocations: 61.05 MiB)
  12.024 ms (47 allocations: 61.05 MiB)

N = 5000
  148.182 ms (25 allocations: 381.47 MiB)
  102.851 ms (26 allocations: 381.47 MiB)
  93.575 ms (24 allocations: 381.47 MiB)
  72.733 ms (27 allocations: 381.51 MiB)
  79.471 ms (48 allocations: 381.51 MiB)

N = 10000
  609.136 ms (24 allocations: 1.49 GiB)
  451.781 ms (26 allocations: 1.49 GiB)
  376.157 ms (25 allocations: 1.49 GiB)
  300.795 ms (26 allocations: 1.49 GiB)
  303.499 ms (48 allocations: 1.49 GiB)

julia> function eval_exp(N)
           a = range(0, stop=2*pi, length=N)
           A = zeros(Complex{Float64},N,N)
           Threads.@threads for i in 1:N
               for j in 1:N
                   A[i,j] = exp(100*im*(a[i]^2 + a[j]^2))
               end
           end
           return A
       end
eval_exp (generic function with 1 method)

julia> function eval_exp_tweaked(N)
           a = range(0, stop=2*pi, length=N)
           A = Matrix{ComplexF64}(undef, N, N)
           @inbounds Threads.@threads for i in 1:N
               for j in 1:N
                   A[i,j] = cis(100*(a[i]^2 + a[j]^2))
               end
           end
           return A
       end
eval_exp_tweaked (generic function with 1 method)

julia> function eval_exp_tweaked_2(N) # loops re-ordered
           a = range(0, stop=2*pi, length=N)
           A = Matrix{ComplexF64}(undef, N, N)
           @inbounds Threads.@threads for j in 1:N
               for i in 1:N
                   A[i,j] = cis(100*(a[i]^2 + a[j]^2))
               end
           end
           return A
       end
eval_exp_tweaked_2 (generic function with 1 method)

julia> function eval_exp_tweaked_3(N) # ... and range collected
           a = collect(range(0, stop=2*pi, length=N))
           A = Matrix{ComplexF64}(undef, N, N)
           @inbounds Threads.@threads for j in 1:N
               for i in 1:N
                   A[i,j] = cis(100*(a[i]^2 + a[j]^2))
               end
           end
           return A
       end
eval_exp_tweaked_3 (generic function with 1 method)

julia> using Tullio

julia> function eval_exp_tullio(N)
         a = collect(range(0, stop=2*pi, length=N))
         @tullio A[i,j] := cis(100*(a[i]^2 + a[j]^2))
       end
eval_exp_tullio (generic function with 1 method)

julia> versioninfo()
Julia Version 1.7.0-DEV.1102
Commit a0241b9226 (2021-05-13 23:27 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin20.4.0)
  CPU: Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, cyclone)
Environment:
  JULIA_NUM_THREADS = 4

As you can see, people like playing this game around here! Since A is symmetric, you could also compute just the upper triangle and wrap it in Symmetric, depending on what you do next. Or copy the data across.
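
For illustration, here is a minimal sketch of that idea (my own variant, not benchmarked above; the function name is made up): fill only the upper triangle and let Symmetric mirror it, so roughly half the exponentials are evaluated.

using LinearAlgebra

function eval_exp_symmetric(N)
    a = collect(range(0, stop=2*pi, length=N))
    A = Matrix{ComplexF64}(undef, N, N)
    Threads.@threads for j in 1:N
        @inbounds for i in 1:j              # compute only the upper triangle
            A[i,j] = cis(100*(a[i]^2 + a[j]^2))
        end
    end
    return Symmetric(A)                     # reads only the stored (upper) triangle
end

Note that the work per column grows with j, so the default @threads chunking is not perfectly balanced, and whether the wrapper pays off depends on what you do with A afterwards.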

4 Likes

Hello again! Thank you guys for your kind responses.

I’ve fixed my link, I was meaning to point here:

Oops, I’d forgotten about the loop index order (too much mixed-language programming lately, I suppose). I do get a 10% improvement in both languages when I get it right. I didn’t know Julia also follows column-major ordering :slight_smile:

I’ll definitely check out Tullio.jl.

My test example was actually too simple: I usually need exponentials of general complex arguments, not purely imaginary ones. So I’ll check whether the comparison still holds in the general case (maybe Julia is figuring that out while gfortran is not? Both languages know that “a” and “k” are real…).

As for other performance libraries, I was just wondering about linking to MKL. I had done it in gfortran in the past, but I need to find again how I did it (documenting these things would really help).

In Julia, I see this package:

but it only mentions linear algebra functions. MKL also has things like complex exponentials and square roots.

Anyway, I was even willing to sacrifice some performance (no more than 10-20%, actually) if it meant the resulting code was written in a single language to begin with. I am delighted that I seem to be getting a performance gain on top of the code simplicity (and, overall, better code in many ways than mixed Python+Fortran).

I have many more things to try, like Tullio.jl, MKL, and distributed memory.

And by the way, this was all about computational electromagnetics. Those complex exponentials are the expensive part of a Green’s function: my vector a is actually sampled from a geometry, and “k” could be a complex number.

As for the geospatial things I do, I’m just using GeoPandas, which has a ton of wrappers, mainly around the GEOS library, but it also handles I/O in every file format I need thanks to Fiona, which is itself a wrapper around GDAL. If Julia has good wrapping capabilities, the same could be done there too, I guess.

4 Likes

These two versions seem to be slightly faster (yet that may be noise):

function eval_exp_tweaked_4_serial(N) # ... and range collected
    a = collect(range(0, stop=2*pi, length=N))
    A = Matrix{ComplexF64}(undef, N, N)
    for (j,aj) in pairs(a)
        for i in eachindex(a)
            @inbounds A[i,j] = cis(100*(a[i]^2 + aj^2))
        end
    end
    return A
end

function eval_exp_tweaked_4_parallel(N) # ... and range collected
    a = collect(range(0, stop=2*pi, length=N))
    A = Matrix{ComplexF64}(undef, N, N)
    Threads.@threads for j in eachindex(a)
        for i in eachindex(a)
            @inbounds A[i,j] = cis(100*(a[i]^2 + a[j]^2))
        end
    end
    return A
end

Result:

tweaked3:  15.497 ms (3 allocations: 15.27 MiB)
tweaked3_p:  5.139 ms (24 allocations: 15.27 MiB)
tweaked4:  14.773 ms (3 allocations: 15.27 MiB)
tweaked4_p:  4.971 ms (24 allocations: 15.27 MiB)

But I wanted to show iteration using pairs and eachindex, which are nice because they do not depend on the array’s indexing style, and because the accesses are guaranteed to be in bounds without needing the annotation. pairs probably gives some speedup depending on how indices and values are reused inside the loop.

Code
using Test, BenchmarkTools

function eval_exp_tweaked_3_parallel(N) # ... and range collected
    a = collect(range(0, stop=2*pi, length=N))
    A = Matrix{ComplexF64}(undef, N, N)
    @inbounds Threads.@threads for j in 1:N
        for i in 1:N
            A[i,j] = cis(100*(a[i]^2 + a[j]^2))
        end
    end
    return A
end

function eval_exp_tweaked_3_serial(N) # ... and range collected
    a = collect(range(0, stop=2*pi, length=N))
    A = Matrix{ComplexF64}(undef, N, N)
    @inbounds for j in 1:N
        for i in 1:N
            A[i,j] = cis(100*(a[i]^2 + a[j]^2))
        end
    end
    return A
end

function eval_exp_tweaked_4_serial(N) # ... and range collected
    a = collect(range(0, stop=2*pi, length=N))
    A = Matrix{ComplexF64}(undef, N, N)
    for (j,aj) in pairs(a)
        for i in eachindex(a)
            @inbounds A[i,j] = cis(100*(a[i]^2 + aj^2))
        end
    end
    return A
end

function eval_exp_tweaked_4_parallel(N) # ... and range collected
    a = collect(range(0, stop=2*pi, length=N))
    A = Matrix{ComplexF64}(undef, N, N)
    Threads.@threads for j in eachindex(a)
        for i in eachindex(a)
            @inbounds A[i,j] = cis(100*(a[i]^2 + a[j]^2))
        end
    end
    return A
end

N = 10000
@test eval_exp_tweaked_3_serial(N) ≈ eval_exp_tweaked_4_serial(N) ≈
      eval_exp_tweaked_3_parallel(N) ≈ eval_exp_tweaked_4_parallel(N)

print("tweaked3:"); @btime eval_exp_tweaked_3_serial($N) 

print("tweaked3_p:"); @btime eval_exp_tweaked_3_parallel($N) 

print("tweaked4:"); @btime eval_exp_tweaked_4_serial($N)

print("tweaked4_p:"); @btime eval_exp_tweaked_4_parallel($N)

nothing

2 Likes

Nah, Julia was not figuring out it was purely imaginary. I saw a concrete boost from switching to cis.

Yeah, definitely. It’s just also fun to see how fast these things can be tuned in pure Julia anyway :slight_smile:

FYI, @inbounds should go inside closure-creating macros like @threads and @spawn (as in leandromartinez98’s eval_exp_tweaked_4_parallel). Ref: What is the cause of this performance difference between Julia and Cython? - #4 by tkf

(Personally, I believe the scope of @inbounds should be as small as possible in general. Putting it outside a macro call is especially worrisome, as you usually don’t know what code the macro will generate.)
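
A minimal sketch of the difference, reusing the kernel from the earlier posts (the helper names are just for illustration):

# @inbounds placed outside @threads does not propagate into the closure that
# @threads creates, so the bounds checks in the loop body may still run:
function fill_outside!(A, a)
    @inbounds Threads.@threads for j in eachindex(a)
        for i in eachindex(a)
            A[i,j] = cis(100*(a[i]^2 + a[j]^2))
        end
    end
    return A
end

# Preferred: @inbounds on the inner loop, inside the closure:
function fill_inside!(A, a)
    Threads.@threads for j in eachindex(a)
        @inbounds for i in eachindex(a)
            A[i,j] = cis(100*(a[i]^2 + a[j]^2))
        end
    end
    return A
end

# usage: fill_inside!(Matrix{ComplexF64}(undef, N, N), collect(range(0, 2pi, length=N)))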

6 Likes

To get even further into the performance rabbit hole, it’s always fun to bring out LoopVectorization.jl. It doesn’t understand how to handle complex numbers yet, but fortunately we can just reinterpret a matrix of complex floats as a 3-D array of real floats and then operate on that. This example requires a new feature in VectorizationBase.jl, version 0.20.4, so make sure you use that version or newer if you try this.

using LoopVectorization: @tturbo

function eval_exp_tturbo(N)
    A = Matrix{Complex{Float64}}(undef, N, N)
    a = range(0, stop=2*pi, length=N)
    _A = reinterpret(reshape, Float64, A)
    @tturbo for i in 1:N, j in 1:N
        Aij_im, Aij_re = sincos(100*(a[i]^2 + a[j]^2))
        _A[1,i,j] = Aij_re
        _A[2,i,j] = Aij_im
    end
    A
end
print(string("running loop on ", Threads.nthreads(), " threads \n"))
for N in 1_000:1_000:10_000
    @show N
    A = @btime eval_exp($N)
    B = @btime eval_exp_tweaked_4_parallel($N)
    C = @btime eval_exp_tturbo($N)
    @assert A ≈ B ≈ C
    println()
end
running loop on 6 threads 
N = 1000
  7.497 ms (33 allocations: 15.26 MiB)
  2.720 ms (34 allocations: 15.27 MiB)
  999.171 μs (2 allocations: 15.26 MiB)

N = 2000
  36.936 ms (33 allocations: 61.04 MiB)
  13.246 ms (34 allocations: 61.05 MiB)
  6.112 ms (2 allocations: 61.04 MiB)

N = 3000
  74.096 ms (33 allocations: 137.33 MiB)
  29.550 ms (35 allocations: 137.35 MiB)
  13.667 ms (2 allocations: 137.33 MiB)

N = 4000
  131.016 ms (35 allocations: 244.14 MiB)
  52.159 ms (36 allocations: 244.17 MiB)
  24.066 ms (2 allocations: 244.14 MiB)

N = 5000
  215.135 ms (35 allocations: 381.47 MiB)
  80.854 ms (37 allocations: 381.51 MiB)
  37.622 ms (2 allocations: 381.47 MiB)

N = 6000
  319.400 ms (35 allocations: 549.32 MiB)
  122.928 ms (37 allocations: 549.37 MiB)
  62.631 ms (3 allocations: 549.32 MiB)

N = 7000
  491.836 ms (35 allocations: 747.68 MiB)
  164.792 ms (38 allocations: 747.74 MiB)
  79.859 ms (5 allocations: 747.68 MiB)

N = 8000
  883.225 ms (37 allocations: 976.57 MiB)
  227.527 ms (37 allocations: 976.63 MiB)
  114.281 ms (6 allocations: 976.56 MiB)

N = 9000
  827.222 ms (35 allocations: 1.21 GiB)
  283.081 ms (37 allocations: 1.21 GiB)
  162.987 ms (4 allocations: 1.21 GiB)

N = 10000
  1.095 s (37 allocations: 1.49 GiB)
  329.651 ms (38 allocations: 1.49 GiB)
  156.945 ms (4 allocations: 1.49 GiB)
6 Likes

VectorizationBase 0.20.4. I’ll tag a new LoopVectorization release after tests pass, but the latest release as of now should work as long as you have VectorizationBase >= v"0.20.4".

1 Like

Does this also apply to a matrix that I want to populate for finite or boundary element methods (no complex numbers)? That is, should I not initialize it with zeros, and instead allocate uninitialized memory?

Whenever you write before reading, you can create the array as Array{T}(undef, dims...).
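
A small sketch of the distinction (the kernel and size below are placeholders, just for illustration): undef is fine when every entry is assigned before it is ever read, but if you accumulate with +=, the old value is read first, so you do need zeros.

n = 4
kernel(i, j) = 1.0 / (i + j)               # placeholder "element" value

# Fine with undef: every entry is written exactly once before any read.
K = Matrix{Float64}(undef, n, n)
for j in 1:n, i in 1:n
    K[i,j] = kernel(i, j)
end

# Needs zeros: += reads the current entry first, which would be
# uninitialized garbage if K2 had been created with undef.
K2 = zeros(n, n)
for j in 1:n, i in 1:n
    K2[i,j] += kernel(i, j)                # typical assembly/accumulation pattern
end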

4 Likes

Regarding geospatial data analysis, there is a list of native libraries and wrappers in Julia here (in case you want to experiment with alternatives for that part of your workflow too).

2 Likes

Cool, I’ll look into JuliaGeo!

I believe one possible reason to make the switch could be parallelization (GPU would be amazing, but that’s in another league, I guess). For example, some of the GeoPandas functions I use are efficient because they have been Cythonized, but they are single-threaded… so getting multithreading is actually non-trivial: it means going down to modifying the Cython code, dividing your data into chunks and using a pool of workers, or relying on Dask… all of which looks a lot more complicated than just using Threads.@threads… so I guess there could be a case for making the switch here as well.

I still can’t believe the performance difference I’m getting with the optimized version of my micro-benchmark. I made slight modifications to the test, like using a complex value of k and adding a square root. I’m using the “undef” matrix instead of zeros, besides having the loops in the correct order. I couldn’t get Tullio to work for now (the function seems to be stuck forever), and I don’t see any improvement from @inbounds either (once the loop order is correct).

function eval_exp(N)
    a = range(0, stop=2*pi, length=N)
    A = Matrix{ComplexF64}(undef, N, N)

    Threads.@threads for j in 1:N
        for i in 1:N
            A[i,j] = exp((100+im)*im*sqrt(a[i]^2 + a[j]^2))
        end
    end

    return A
end

Now, the comparison is a massive 2.5X in favor of Julia vs F2PY.

I’m starting to wonder whether I’m being unfair to the Fortran side. Maybe there is some overhead in using F2PY versus a standalone code, or maybe it’s auto-parallelization vs OpenMP, or an issue of gfortran vs ifort. Or Julia is automatically relying on some optimized math library along the lines of Intel’s MKL, versus having to link to MKL manually in gfortran (which I used to do, but I’m on a fresh OS so I haven’t even downloaded Intel’s MKL)…

Anyway, F2PY with auto-parallel is one of the convenient ways of using (and distributing) Fortran code imho. I was expecting that to be optimal in terms of performance, but it seems to be far from it.

2 Likes

Well, in the single-threaded versions the performance difference is 50% (which is already quite significant). In the multithreaded versions, OpenMP is slightly (10%) better than gfortran’s auto-parallelization, but the gap jumps from 50% to a 2X speed factor when moving from single-threaded to multi-threaded. I’ve also checked online, and there seems to be no difference between F2PY and standalone gfortran.

So the performance difference must be a combination of efficient exponentials and, importantly, very efficient multi-threading in Julia. There is also a slight overhead in Python-to-Fortran data communication, since NumPy defaults to row-major ordering while Fortran is column-major, but I’d expect that to be a minor detail.

Here’s how your new kernel could be written using LoopVectorization.jl

using LoopVectorization: @tturbo

function eval_exp_tvectorize(N)
    a = range(0, stop=2*pi, length=N)
    A = Matrix{ComplexF64}(undef, N, N)

    _A = reinterpret(reshape, Float64, A)
    @tturbo for j in 1:N, i in 1:N
        x = sqrt(a[i]^2 + a[j]^2)
        prefac = exp(-x)
        s, c = sincos(100*x)
        
        _A[1,i,j] = prefac * c
        _A[2,i,j] = prefac * s
    end
    return A
end

It might be a little hard to tell what I’m doing here, but I’m basically writing

\exp\left(i (100+i) \sqrt{a_i^2 + a_j^2} \right) = \exp\left(-\sqrt{a_i^2 + a_j^2} \right)\left(\cos\left(100 \sqrt{a_i^2 + a_j^2} \right) + i \sin\left(100 \sqrt{a_i^2 + a_j^2} \right) \right)
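
A quick numeric sanity check of that rewrite (my own, for an arbitrary real x):

julia> x = 1.234;

julia> exp((100 + im) * im * x) ≈ exp(-x) * (cos(100x) + im*sin(100x))
true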

Here’s how it stacks up against your version:

function eval_exp(N)
    a = range(0, stop=2*pi, length=N)
    A = Matrix{ComplexF64}(undef, N, N)

    Threads.@threads for j in 1:N
        for i in 1:N
            A[i,j] = exp((100+im)*im*sqrt(a[i]^2 + a[j]^2))
        end
    end
    return A
end
using BenchmarkTools
print(string("running loop on ", Threads.nthreads(), " threads \n"))
for N in 1_000:1_000:10_000
    @show N
    A = @btime eval_exp($N)
    B = @btime eval_exp_tvectorize($N)
    @assert A ≈ B
    println()
end

running loop on 6 threads 
N = 1000
  4.868 ms (33 allocations: 15.26 MiB)
  1.332 ms (2 allocations: 15.26 MiB)

N = 2000
  22.427 ms (33 allocations: 61.04 MiB)
  8.384 ms (2 allocations: 61.04 MiB)

N = 3000
  51.026 ms (33 allocations: 137.33 MiB)
  19.169 ms (2 allocations: 137.33 MiB)

N = 4000
  90.038 ms (33 allocations: 244.14 MiB)
  35.894 ms (2 allocations: 244.14 MiB)

N = 5000
  140.807 ms (35 allocations: 381.47 MiB)
  54.000 ms (4 allocations: 381.47 MiB)

N = 6000
  208.826 ms (35 allocations: 549.32 MiB)
  76.401 ms (3 allocations: 549.32 MiB)

N = 7000
  276.598 ms (35 allocations: 747.68 MiB)
  106.647 ms (4 allocations: 747.68 MiB)

N = 8000
  371.500 ms (35 allocations: 976.57 MiB)
  187.425 ms (5 allocations: 976.56 MiB)

N = 9000
  460.661 ms (36 allocations: 1.21 GiB)
  233.272 ms (5 allocations: 1.21 GiB)

N = 10000
  616.968 ms (35 allocations: 1.49 GiB)
  242.980 ms (5 allocations: 1.49 GiB)

As you can see, this is a huge win. I wouldn’t be super surprised if Fortran could go a bit faster here, but I’ll also point out that this isn’t the first time Julia and LoopVectorization.jl have slayed Fortran performance numbers (see “Simple summation 8x slower than in Julia” on the Fortran Discourse), with no real suggestions from experts on how to do better without improving the compiler.

I should also note that my CPU is a Zen 1+ CPU, which means its SIMD performance isn’t that great. I’d expect an even more impressive gap between these results on an AVX-512 Intel CPU, or maybe a Zen 3 CPU.

Regarding MKL, Julia does not use it by default; it needs to be installed manually.

7 Likes

A lot of the difference is that Julia has taken the perspective that special functions are just normal Julia code. Fortran and C treat them more as special intrinsics. Julia 1.6 has a much faster exp function (for reals) than 1.5 in part because the implementation is just normal Julia code, so I was able to come in as a relative outsider with faster code. One of the places Julia has excelled is making as much as possible as transparent as possible. This has the major advantage that it makes it much easier for normal users to make really meaningful contributions to the language itself.
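
For example, you can jump straight to that implementation from the REPL (a small illustration; the exact file and line depend on your Julia version):

julia> @which exp(1.0)    # points at the pure-Julia method in Base.Math (base/special/exp.jl)

julia> @edit exp(1.0)     # opens that source file in your editor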

14 Likes