Simple speed comparison with Matlab

Hi, Julia experts:

I am new to Julia and I want to check the differences with Matlab.

I have written a simple Matlab example:

N=100000
tic
for K=[1:N]
sum(cos(2 * pi * [0:N]/N) .* exp(-j * 2 * K * pi/N * [0:N]))/N
end
toc

I have translated the example to Julia in this way:

N=100000
@time begin
for K in 1:N
sum(cos.(2 * pi * (0:N)/N) .* exp.(-1im * 2 * K * pi/N * (0:N)))/N
end
end

I have obtained better results with Matlab (80 sec. vs 210 sec.).

I would like to know if the problem is my translation (which is very direct) or whether Matlab is simply faster for this kind of operation on my machine (Apple ARM M2 Max).

I am using the latest (stable) versions of Julia and Matlab.

Thanks!

josu

This is a known issue. Matlab does a better job with vectorized elementary functions than Julia. That said, you could make this a decent amount faster with sum(cospi.(2/N * (0:N)) .* cispi.(-2 * K / N * (0:N)))/N. cis stands for cos + im*sin and is a faster way to compute exp(im*x), and Julia provides cospi and the associated trig functions to do a cheaper and easier argument reduction. I also think your version is missing a couple of dots, which would speed things up.
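
For concreteness, a minimal sketch of that suggestion written out as a full loop (demo_cispi is just an illustrative name; it mirrors the original loop):

function demo_cispi(N)
    for K in 1:N
        # cospi(2x/N) == cos(2*pi*x/N) and cispi(-2K*x/N) == exp(-2im*pi*K*x/N),
        # so the explicit pi and im drop out of the arguments
        sum(cospi.(2 / N .* (0:N)) .* cispi.(-2 * K / N .* (0:N))) / N
    end
end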

1 Like

OK, thanks very much, Oscar. I am going to try with other kinds of examples.

Thanks a lot!

@josujugo: To elaborate on what Oscar is saying: you can certainly make some small modifications that on my device (Mac with M1) led to roughly a doubling of the speed. The first is to avoid unnecessary allocations by preallocating the intermediate vector (see the functions below).

My fastest version runs in 84 seconds, about the same as your Matlab version. You can even multithread this for an attractive speed-up!

using Pkg; Pkg.activate("."); Pkg.instantiate()

using BenchmarkTools

function f(N)
    for K in 1:N
        sum(cos.(2 * pi * (0:N) / N) .* exp.(-2im * K * pi / N * (0:N))) / N
    end
end

function f_dots(N)
    for K in 1:N
        sum(cos.(2 .* pi .* (0:N) ./ N) .* exp.(-2im .* K .* pi ./ N .* (0:N))) / N
    end
end

function f_dots_and_prealloc(N)
    inters = Vector{ComplexF64}(undef, N+1)
    for K in 1:N
        inters .= cos.(2 .* pi .* (0:N) ./ N) .* exp.(-2im .* K .* pi ./ N .* (0:N))
        sum(inters) / N
    end
end

function f_dots_and_prealloc_2(N)
    inters = Vector{ComplexF64}(undef, N+1)
    cos_only = cos.(2 .* pi .* (0:N) ./ N)
    for K in 1:N
        inters .= cos_only .* exp.(-2im .* K .* pi ./ N .* (0:N))
        sum(inters) / N
    end
end

function f_dots_and_prealloc_cispi(N)
    inters = Vector{ComplexF64}(undef, N+1)
    cos_only = cos.(2 .* pi .* (0:N) ./ N)
    for K in 1:N
        inters .= cos_only .* cispi.(-2 .* K .* pi ./ N .* (0:N))
        sum(inters) / N
    end
end

function test(N)
    @time f(N)
    @time f_dots(N)
    @time f_dots_and_prealloc(N)
    @time f_dots_and_prealloc_2(N)
    @time f_dots_and_prealloc_cispi(N)
end

test(10_000)

gives me

# f
  2.054132 seconds (30.00 k allocations: 1.526 GiB, 2.82% gc time)
# f_dots
  1.888829 seconds (30.00 k allocations: 1.526 GiB, 2.51% gc time)
# f_dots_and_prealloc
  1.786435 seconds (3 allocations: 160.062 KiB)
# f_dots_and_prealloc_2
  1.304906 seconds (6 allocations: 256.125 KiB)
# f_dots_and_prealloc_cispi
  0.846121 seconds (6 allocations: 256.125 KiB)

Running test(100_000) gave me (not plugged in to power!) a runtime of 84 seconds for f_dots_and_prealloc_cispi, which I would expect to be even faster on your device!

4 Likes

Also note that if you’re trying to time things, it is always good to wrap them inside a function, to let the compiler work its magic! If I directly run

N = 10_000
@time begin for K in 1:N
    sum(cos.(2 * pi * (0:N) / N) .* exp.(-2im * K * pi / N * (0:N))) / N
end end

I get

  1.971349 seconds (178.98 k allocations: 1.533 GiB, 3.04% gc time)

which has many more allocations!

Fastest I can get for now:

using LinearAlgebra
function f_dots_and_prealloc_cispi_dot(N)
    inters = Vector{ComplexF64}(undef, N+1)
    cos_only = cos.(2 .* pi .* (0:N) ./ N)
    for K in 1:N
        inters .= cispi.(-2 .* K .* pi ./ N .* (0:N))
        cos_only ⋅ inters / N
    end
end

gives for N = 10,000:

  0.753205 seconds (6 allocations: 256.125 KiB)

Matlab will compile your block to avoid allocating too much and will use threading for the mapping, so in Julia you should start Julia with julia -t auto and run

julia> using ThreadsX
julia> function f_dots_and_prealloc_cispi(N)
           inters = Vector{ComplexF64}(undef, N+1)
           for K in 1:N
               ThreadsX.map!(n->cos(2 * pi * n/N) * exp(-1im * 2 * K * pi/N * n),inters,0:N)
               sum(inters)
           end
       end
f_dots_and_prealloc_cispi (generic function with 1 method)

julia> @time f_dots_and_prealloc_cispi(N)
 37.008138 seconds (97.44 M allocations: 10.244 GiB, 5.82% gc time, 1.52% compilation time)

you will need to add ThreadsX.jl (or any other threading package handling map!). Julia does not use threading by default (apart from BLAS operations).
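
If you prefer to avoid a dependency, here is a rough sketch of the same idea with Base's built-in Threads.@threads (still start Julia with -t auto; f_threads is just an illustrative name):

function f_threads(N)
    inters = Vector{ComplexF64}(undef, N + 1)
    for K in 1:N
        # fill the intermediate vector in parallel across the available threads
        Threads.@threads for n in 0:N
            inters[n + 1] = cos(2 * pi * n / N) * exp(-1im * 2 * K * pi / N * n)
        end
        sum(inters) / N
    end
end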

2 Likes

Thanks very much for your suggestions! I see that I have a lot to check and to learn.

I will study all the tricks proposed.

My first tests with your code confirm the same values. Really good ones!

1 Like

Just to systematize a bit:

Global variables are bad, in particular if they are not given a concrete type. If globals are needed, they should be typed or declared const, e.g. N::Int = 10000 or const N = 10000.

Code should be in functions. It should be possible to deduce the type of every expression in a function from the types of the actual arguments. Don’t do things like x = ifelse(something, 0, 1.0), because 0 is an Int and 1.0 is a Float64; unless something is a compile-time constant, the compiler won’t know whether x will be an Int or a Float64.

Loops are not bad. They are often better than “vectorized” code.

Allocations can be bad. In particular in threaded code.

Forgetting a dot somewhere in a broadcast expression can make it very slow; instead, you can use @. at the start of your expression and remove the other dots (see the sketch below).
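
As an illustration of the @. point, the loop body could look like this (a sketch, equivalent to JADekker's f_dots above; f_atdot is just an illustrative name):

function f_atdot(N)
    xs = 0:N
    for K in 1:N
        # @. inserts a dot on every call and operator in the expression
        sum(@. cos(2 * pi * xs / N) * exp(-1im * 2 * K * pi / N * xs)) / N
    end
end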

MATLAB also has cospi, so it’s fair to make that change there as well, though I’m not sure whether it will make a performance impact. As for cis, there’s good indication that MATLAB optimizes complex exponentials (not the same as exp with a general complex input), being on par with a tabulated implementation. It outpaces the equivalent cos(Q)+1i*sin(Q) by around the expected gains of Julia’s cis(x) over cos(x)+im*sin(x), which is in turn faster than exp(::Complex). That benchmark is as far as they could go; I imagine it’s harder to dig further into MATLAB’s implementation of exp, or possibly the JIT. MATLAB isn’t too clear on how its type system treats complex numbers, so a possible explanation for that benchmark is that (unlike in Julia) wholly imaginary inputs are somehow distinguished from other complex numbers, which allows exp in particular to use a cheaper cis.

exp secretly doing cis probably wouldn’t explain all or even most of the 2.6x difference observed, I’d hazard a guess that multithreading plays a role. In general, base MATLAB is optimized pretty well, I wouldn’t worry about losing performance there; in fact, more direct control over performance can also mean requiring more effort, which I think Julia demonstrated here. You may have to resort to MEX (write C/C++ or try to compile it from MATLAB) to optimize your MATLAB code in more cases, but you may not run into those cases in practice, especially if you’re mostly sticking to highly optimized base functions like this.

josujugo’s dots for Julia are actually nearly as good as it gets, and dotting everything with @. isn’t a general performance boost. The undotted operations among scalars and ranges are fast, and the dotted operations are actually special-cased to do the same thing when order of operations allow for it. JADekker demonstrated that adding those dots alone doesn’t reduce allocations for a significant performance boost; the modest gain is likely due to optimization of the fused broadcast kernel. In some cases, dotting everything may unnecessarily put more operations into the broadcast loop that the compiler can’t hoist back out, or add extra dimensions to a loop that makes it more expensive than carefully separated loops.
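
A quick way to see that special-casing (a sketch; the concrete range types may differ between Julia versions):

julia> 2 * pi * (0:10) / 10 isa AbstractRange   # undotted scalar-range arithmetic stays a lazy range
true

julia> (0:10) .* 2 isa AbstractRange            # the simple dotted form is special-cased back to a range
true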

2 Likes

we probably should add a branch for exp(::Complex) that checks for zero real part and calls cis for you. It’s probably fairly common for people to not use cis and exp(::Complex) is expensive enough that it’s probably worth the branch.

exp(::Complex) has a branch checking for a zero imaginary part to make a similar optimization to cis, funnily enough. Not sure if another check for a zero real part will get the (50%-ish for me) gains to approach cis though; there are many runtime checks in there that cis just doesn’t have to make.

it won’t recover full performance (especially compared to cispi), but saving the call to exp should help a decent amount
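
For concreteness, a rough sketch of such a branch (myexp is a hypothetical stand-in, not Base's actual exp):

# hypothetical helper: exp(0 + iy) == cis(y), so a purely imaginary argument
# can skip the general complex exp path
myexp(z::Complex) = iszero(real(z)) ? cis(imag(z)) : exp(z)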

1 Like

Shouldn’t the pi also be taken out of the cispi input expression? cis is what takes im out.

Also to add to my caution about dotting everything, it’s possible to go even further here and remove all allocations and throw every operation into the kernel for a performance loss (though zero allocations may be a desirable tradeoff sometimes):

julia> function f_noalloc(N)
         for K in 1:N
           # sum(cos.(2 * pi * (0:N)/N) .* exp.(-1im * 2 * K * pi/N * (0:N)))/N
           sum(x -> cos(2*pi*x/N) * exp(-1im * 2 * K * pi/N * x), 0:N)/N
         end
       end
f_noalloc (generic function with 1 method)

julia> function f_noalloc_cispi(N)
         for K in 1:N
           # sum(x -> cospi(2*x/N) * cispi(-2 * K/N * x), 0:N)/N
           sum(x -> cos(2*pi*x/N) * cispi(-2 * K/N * x), 0:N)/N
         end
       end
f_noalloc_cispi (generic function with 1 method)

julia> @time f_noalloc(10_000)
  4.222765 seconds

julia> @time f_noalloc_cispi(10_000)
  2.223600 seconds

julia> # see JADekker's implementations for the following

julia> @time f(10_000)
  3.875294 seconds (30.00 k allocations: 1.492 GiB, 1.64% gc time)

julia> @time f_dots_and_prealloc_cispi(10_000)
  2.085591 seconds (6 allocations: 234.686 KiB)

julia> @time f_dots_and_prealloc_cispi_dot(10_000)
  2.024006 seconds (6 allocations: 234.686 KiB)

Yeah, I see I forgot to take the pi out indeed.