Welcome @martin.d.maas! Julia can be quite the rabbit hole. I mostly came for the speed, but at this point I’d probably stay even if it were slow because the language has so many other interesting aspects to it.
A couple of performance notes:
- The matrix `A` doesn't need to be initialized with zeros; instead, you can grab it as an uninitialized chunk of memory by writing `A = Matrix{ComplexF64}(undef, N, N)`. This saves the program from having to set everything to zero first.
- In your loop, all your array accesses are in-bounds by construction, so you can annotate the loop with the `@inbounds` macro to stop Julia from doing any bounds checks (it may already have figured out on its own that it can drop these).
- As mentioned above by @DNF, `cis(...)` is more efficient than `exp(im * ...)`.

Hence, I'd write
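As a quick sanity check on the first and last points (a minimal sketch, the variable names are just for illustration): an `undef` allocation holds arbitrary contents until you write to it, and `cis(x)` agrees with `exp(im * x)` up to floating-point rounding:

```julia
# `undef` skips the zero-fill: entries hold arbitrary bits until assigned
A = Matrix{ComplexF64}(undef, 2, 2)
A[1, 1] = cis(0.7)   # fine, as long as we write each entry before reading it

# `cis(x)` computes exp(im*x) directly, without the complex multiplication
x = 0.7
@assert cis(x) ≈ exp(im * x)
```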
```julia
function eval_exp_tweaked(N)
    a = range(0, stop=2*pi, length=N)
    A = Matrix{ComplexF64}(undef, N, N)
    @inbounds Threads.@threads for i in 1:N
        for j in 1:N
            A[i,j] = cis(100*(a[i]^2 + a[j]^2))
        end
    end
    return A
end
```
and for benchmarking, I strongly suggest checking out the `@btime` or `@benchmark` macros from BenchmarkTools.jl; they take many samples to get statistically significant results, and they omit compilation time.
Running my tweaked version, I see we can squeeze out quite a bit of extra performance for the smaller sizes:
```julia
using BenchmarkTools

function eval_exp(N)
    a = range(0, stop=2*pi, length=N)
    A = zeros(Complex{Float64}, N, N)
    Threads.@threads for i in 1:N
        for j in 1:N
            A[i,j] = exp(100*im*(a[i]^2 + a[j]^2))
        end
    end
    return A
end

function eval_exp_tweaked(N)
    a = range(0, stop=2*pi, length=N)
    A = Matrix{ComplexF64}(undef, N, N)
    @inbounds Threads.@threads for i in 1:N
        for j in 1:N
            A[i,j] = cis(100*(a[i]^2 + a[j]^2))
        end
    end
    return A
end

print(string("running loop on ", Threads.nthreads(), " threads \n"))
for N in 1_000:1_000:10_000
    @show N
    A = @btime eval_exp($N)
    B = @btime eval_exp_tweaked($N)
    @assert A ≈ B
    println()
end
```
```
running loop on 6 threads
N = 1000
  5.909 ms (33 allocations: 15.26 MiB)
  3.877 ms (33 allocations: 15.26 MiB)

N = 2000
  45.875 ms (33 allocations: 61.04 MiB)
  23.908 ms (33 allocations: 61.04 MiB)

N = 3000
  96.361 ms (35 allocations: 137.33 MiB)
  55.392 ms (33 allocations: 137.33 MiB)

N = 4000
  173.731 ms (36 allocations: 244.14 MiB)
  92.259 ms (35 allocations: 244.14 MiB)

N = 5000
  259.352 ms (36 allocations: 381.47 MiB)
  157.808 ms (35 allocations: 381.47 MiB)

N = 6000
  409.806 ms (35 allocations: 549.32 MiB)
  258.681 ms (35 allocations: 549.32 MiB)

N = 7000
  536.165 ms (36 allocations: 747.68 MiB)
  363.541 ms (35 allocations: 747.68 MiB)

N = 8000
  985.929 ms (37 allocations: 976.57 MiB)
  836.564 ms (36 allocations: 976.57 MiB)

N = 9000
  1.008 s (38 allocations: 1.21 GiB)
  668.207 ms (37 allocations: 1.21 GiB)

N = 10000
  1.446 s (43 allocations: 1.49 GiB)
  995.583 ms (41 allocations: 1.49 GiB)
```
Tullio.jl is one of my favourite packages and definitely worth looking into for this stuff though.
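For reference, here's a rough sketch of what the same kernel might look like with Tullio.jl (untested here; `eval_exp_tullio` is just an illustrative name). The `@tullio` macro writes the loop nest for you and multithreads it automatically for large arrays:

```julia
using Tullio

# Sketch: the same computation as an index-notation kernel.
# `:=` tells @tullio to allocate a new array A with axes inferred from i and j.
function eval_exp_tullio(N)
    a = range(0, stop=2*pi, length=N)
    @tullio A[i, j] := cis(100 * (a[i]^2 + a[j]^2))
    return A
end
```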