I just decided to migrate from Python+Fortran to Julia as Julia was faster in my test

Welcome @martin.d.maas! Julia can be quite the rabbit hole. I mostly came for the speed, but at this point I’d probably stay even if it were slow because the language has so many other interesting aspects to it.

A couple of performance notes:

  • The matrix A doesn’t need to be initialized with zeros; instead, you can grab it as an uninitialized chunk of memory by writing A = Matrix{ComplexF64}(undef, N, N). This saves the program from having to set every entry to zero first (see the quick sketch after this list).
  • In your loop, all your array accesses are in-bounds by construction, so you can annotate the loop with the @inbounds macro to stop Julia from doing any bounds checks (it may already have figured out on its own that it can drop them).
  • As mentioned above by @DNF, cis(...) is more efficient than exp(im * ...).
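
A quick sketch of the first and third points (the undef matrix holds arbitrary garbage until written, so every entry must be assigned before it is read):

# `undef` skips the zero-fill pass that `zeros` performs;
# the contents are whatever happened to be in that memory.
A = Matrix{ComplexF64}(undef, 4, 4)

# cis(x) is equivalent to exp(im*x) but is computed directly from sincos(x)
x = 1.2345
cis(x) ≈ exp(im * x)   # true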

Hence, I’d write

function eval_exp_tweaked(N)
    a = range(0, stop=2*pi, length=N)
    A = Matrix{ComplexF64}(undef, N, N)
    @inbounds Threads.@threads for i in 1:N
        for j in 1:N
            A[i,j] = cis(100*(a[i]^2 + a[j]^2))
        end
    end
    return A
end

and for benchmarking, I strongly suggest checking out the @btime or @benchmark macros from BenchmarkTools.jl; they take many samples to get statistically significant results and omit compilation time.
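
For example, a minimal usage sketch (the $ interpolates the argument so the cost of accessing an untyped global isn’t part of what gets measured):

using BenchmarkTools

N = 1_000
@btime eval_exp_tweaked($N)       # one summary line: minimum time and allocations
@benchmark eval_exp_tweaked($N)   # full sample statistics (min/median/mean/max)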

Running my tweaked version, I see we can squeeze out quite a bit of extra performance for the smaller sizes:

using BenchmarkTools

function eval_exp(N)
    a = range(0, stop=2*pi, length=N)
    A = zeros(Complex{Float64},N,N)
    Threads.@threads for i in 1:N
        for j in 1:N
            A[i,j] = exp(100*im*(a[i]^2 + a[j]^2))
        end
    end
    return A
end

function eval_exp_tweaked(N)
    a = range(0, stop=2*pi, length=N)
    A = Matrix{ComplexF64}(undef, N, N)
    @inbounds Threads.@threads for i in 1:N
        for j in 1:N
            A[i,j] = cis(100*(a[i]^2 + a[j]^2))
        end
    end
    return A
end

println("running loop on ", Threads.nthreads(), " threads")
for N in 1_000:1_000:10_000
    @show N
    A = @btime eval_exp($N)
    B = @btime eval_exp_tweaked($N)
    @assert A ≈ B
    println()
end

This prints:

running loop on 6 threads
N = 1000
  5.909 ms (33 allocations: 15.26 MiB)
  3.877 ms (33 allocations: 15.26 MiB)

N = 2000
  45.875 ms (33 allocations: 61.04 MiB)
  23.908 ms (33 allocations: 61.04 MiB)

N = 3000
  96.361 ms (35 allocations: 137.33 MiB)
  55.392 ms (33 allocations: 137.33 MiB)

N = 4000
  173.731 ms (36 allocations: 244.14 MiB)
  92.259 ms (35 allocations: 244.14 MiB)

N = 5000
  259.352 ms (36 allocations: 381.47 MiB)
  157.808 ms (35 allocations: 381.47 MiB)

N = 6000
  409.806 ms (35 allocations: 549.32 MiB)
  258.681 ms (35 allocations: 549.32 MiB)

N = 7000
  536.165 ms (36 allocations: 747.68 MiB)
  363.541 ms (35 allocations: 747.68 MiB)

N = 8000
  985.929 ms (37 allocations: 976.57 MiB)
  836.564 ms (36 allocations: 976.57 MiB)

N = 9000
  1.008 s (38 allocations: 1.21 GiB)
  668.207 ms (37 allocations: 1.21 GiB)

N = 10000
  1.446 s (43 allocations: 1.49 GiB)
  995.583 ms (41 allocations: 1.49 GiB)

Tullio.jl is one of my favourite packages, though, and it’s definitely worth looking into for this kind of thing.
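
As a sketch of what that might look like here (untested; @tullio with := allocates the result array and multithreads on its own when Julia is started with threads):

using Tullio

function eval_exp_tullio(N)
    a = range(0, stop=2*pi, length=N)
    # `:=` creates A; the ranges of i and j are inferred from `a`
    @tullio A[i,j] := cis(100 * (a[i]^2 + a[j]^2))
    return A
end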
