Here’s how your new kernel could be written using LoopVectorization.jl:
using LoopVectorization: @tturbo

function eval_exp_tturbo(N)
    a = range(0, stop=2*pi, length=N)
    A = Matrix{ComplexF64}(undef, N, N)
    # View A as a 2×N×N Float64 array: _A[1,i,j] is the real part of A[i,j]
    # and _A[2,i,j] the imaginary part, so the loop only works with real numbers.
    _A = reinterpret(reshape, Float64, A)
    @tturbo for j in 1:N, i in 1:N
        x = sqrt(a[i]^2 + a[j]^2)
        prefac = exp(-x)
        s, c = sincos(100*x)
        _A[1,i,j] = prefac * c
        _A[2,i,j] = prefac * s
    end
    return A
end
It might be a little hard to tell what I’m doing here, but since i(100+i) = -1 + 100i, I’m basically writing
\exp\left(i (100+i) \sqrt{a_i^2 + a_j^2} \right) = \exp\left(-\sqrt{a_i^2 + a_j^2} \right)\left(\cos\left(100 \sqrt{a_i^2 + a_j^2} \right) + i \sin\left(100 \sqrt{a_i^2 + a_j^2} \right) \right)
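To double-check that rewrite, here’s a quick numerical sanity check of the identity (a minimal sketch; the test value of x is arbitrary):
let x = 1.2345
    lhs = exp((100 + im) * im * x)                      # original form
    rhs = exp(-x) * (cos(100 * x) + im * sin(100 * x))  # rewritten form
    @assert lhs ≈ rhs
end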
Here’s how it stacks up against your version:
function eval_exp(N)
    a = range(0, stop=2*pi, length=N)
    A = Matrix{ComplexF64}(undef, N, N)
    Threads.@threads for j in 1:N
        for i in 1:N
            A[i,j] = exp((100+im)*im*sqrt(a[i]^2 + a[j]^2))
        end
    end
    return A
end
using BenchmarkTools

println("running loop on ", Threads.nthreads(), " threads")
for N in 1_000:1_000:10_000
    @show N
    A = @btime eval_exp($N)
    B = @btime eval_exp_tturbo($N)
    @assert A ≈ B
    println()
end
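A side note before the numbers: the thread count is fixed when Julia starts, so this only runs on 6 threads because the session was launched that way (sketch, assuming Julia 1.5 or newer):
Threads.nthreads()  # reports 1 unless Julia was started with threads enabled,
                    # e.g. julia -t 6 or JULIA_NUM_THREADS=6 julia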
running loop on 6 threads
N = 1000
4.868 ms (33 allocations: 15.26 MiB)
1.332 ms (2 allocations: 15.26 MiB)
N = 2000
22.427 ms (33 allocations: 61.04 MiB)
8.384 ms (2 allocations: 61.04 MiB)
N = 3000
51.026 ms (33 allocations: 137.33 MiB)
19.169 ms (2 allocations: 137.33 MiB)
N = 4000
90.038 ms (33 allocations: 244.14 MiB)
35.894 ms (2 allocations: 244.14 MiB)
N = 5000
140.807 ms (35 allocations: 381.47 MiB)
54.000 ms (4 allocations: 381.47 MiB)
N = 6000
208.826 ms (35 allocations: 549.32 MiB)
76.401 ms (3 allocations: 549.32 MiB)
N = 7000
276.598 ms (35 allocations: 747.68 MiB)
106.647 ms (4 allocations: 747.68 MiB)
N = 8000
371.500 ms (35 allocations: 976.57 MiB)
187.425 ms (5 allocations: 976.56 MiB)
N = 9000
460.661 ms (36 allocations: 1.21 GiB)
233.272 ms (5 allocations: 1.21 GiB)
N = 10000
616.968 ms (35 allocations: 1.49 GiB)
242.980 ms (5 allocations: 1.49 GiB)
As you can see, this is a huge win: between roughly 2x and 3.7x faster at every size tested. I wouldn’t be super surprised if Fortran could go a bit faster here, but I’ll also point out that this isn’t the first time Julia and LoopVectorization.jl have slayed Fortran performance numbers (Simple summation 8x slower than in Julia - Fortran Discourse), with no real suggestions from experts on how to do better without improving the compiler.
I should also note that my CPU is a Zen 1+ CPU, which means its SIMD performance isn’t that great. I’d expect an even more impressive gap between these results on an AVX-512 capable Intel CPU, or maybe a Zen 3 CPU.
Regarding MKL, Julia does not use it by default; it needs to be installed and loaded manually.
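If you do want to try MKL, here’s a minimal sketch of the usual route (assuming Julia 1.7 or newer and the MKL.jl package; note this only affects BLAS/LAPACK calls, not a hand-written kernel like the one above):
using Pkg
Pkg.add("MKL")       # install Intel MKL through the MKL.jl package
using MKL            # loading it switches the default BLAS/LAPACK backend to MKL
using LinearAlgebra
BLAS.get_config()    # should now report MKL as the active BLAS library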