Simple speed comparison with Matlab

Hi, Julia experts:

I am new to Julia and I want to check the differences with Matlab.

I have written a simple Matlab example:

N=100000
tic
for K=[1:N]
sum(cos(2 * pi * [0:N]/N) .* exp(-j * 2 * K * pi/N * [0:N]))/N
end
toc

I have translated the example to Julia in this way:

N=100000
@time begin
for K in 1:N
sum(cos.(2 * pi * (0:N)/N) .* exp.(-1im * 2 * K * pi/N * (0:N)))/N
end
end

I have obtained better results with Matlab (80 sec. vs 210 sec.).

I would like to know if the problem is my translation (which is very direct) or whether Matlab is simply faster for this kind of operation on my machine (Apple ARM M2 Max).

I am using the latest (stable) versions of Julia and Matlab.

Thanks!

josu

This is a known issue. Matlab does a better job with vectorized elementary functions than Julia. That said, you could make this a decent amount faster with sum(cospi.(2/N * (0:N)) .* cispi.(-2 * K / N * (0:N)))/N. cis stands for cos + im*sin and is a faster way to compute exp(im*x), and Julia provides cospi and the associated trig functions to do a cheaper and easier argument reduction. I also think your version is missing a couple of dots, which would speed things up.
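
For concreteness, a minimal sketch of that suggestion written out as a full loop (demo_cispi is just an illustrative name; it mirrors the original loop):

function demo_cispi(N)
    for K in 1:N
        # cospi(2x/N) == cos(2*pi*x/N) and cispi(-2K*x/N) == exp(-2im*pi*K*x/N),
        # so the explicit pi and im drop out of the arguments
        sum(cospi.(2 / N .* (0:N)) .* cispi.(-2 * K / N .* (0:N))) / N
    end
end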

1 Like

OK, thanks very much, Oscar. I am going to try with other kinds of examples.

Thanks a lot!

@josujugo: To elaborate on what Oscar is saying: you can certainly make some small modifications that on my device (Mac with M1) led to roughly a doubling of the speed. The first is to avoid unnecessary allocations by preallocating the intermediate vector (see the functions below).

My fastest version runs in 84 seconds, about the same as your Matlab version. You can even multithread this for an attractive speed-up!

using Pkg; Pkg.activate("."); Pkg.instantiate()

using BenchmarkTools

function f(N)
    for K in 1:N
        sum(cos.(2 * pi * (0:N) / N) .* exp.(-2im * K * pi / N * (0:N))) / N
    end
end

function f_dots(N)
    for K in 1:N
        sum(cos.(2 .* pi .* (0:N) ./ N) .* exp.(-2im .* K .* pi ./ N .* (0:N))) / N
    end
end

function f_dots_and_prealloc(N)
    inters = Vector{ComplexF64}(undef, N+1)
    for K in 1:N
        inters .= cos.(2 .* pi .* (0:N) ./ N) .* exp.(-2im .* K .* pi ./ N .* (0:N))
        sum(inters) / N
    end
end

function f_dots_and_prealloc_2(N)
    inters = Vector{ComplexF64}(undef, N+1)
    cos_only = cos.(2 .* pi .* (0:N) ./ N)
    for K in 1:N
        inters .= cos_only .* exp.(-2im .* K .* pi ./ N .* (0:N))
        sum(inters) / N
    end
end

function f_dots_and_prealloc_cispi(N)
    inters = Vector{ComplexF64}(undef, N+1)
    cos_only = cos.(2 .* pi .* (0:N) ./ N)
    for K in 1:N
        inters .= cos_only .* cispi.(-2 .* K .* pi ./ N .* (0:N))
        sum(inters) / N
    end
end

function test(N)
    @time f(N)
    @time f_dots(N)
    @time f_dots_and_prealloc(N)
    @time f_dots_and_prealloc_2(N)
    @time f_dots_and_prealloc_cispi(N)
end

test(10_000)

gives me

# f
  2.054132 seconds (30.00 k allocations: 1.526 GiB, 2.82% gc time)
# f_dots
  1.888829 seconds (30.00 k allocations: 1.526 GiB, 2.51% gc time)
# f_dots_and_prealloc
  1.786435 seconds (3 allocations: 160.062 KiB)
# f_dots_and_prealloc_2
  1.304906 seconds (6 allocations: 256.125 KiB)
# f_dots_and_prealloc_cispi
  0.846121 seconds (6 allocations: 256.125 KiB)

Running test(100_000) gave me (not plugged in to power!) a runtime of 84 seconds for f_dots_and_prealloc_cispi, which I would expect to be even faster on your device!

4 Likes

Also note that if you’re trying to time things, it is always good to wrap them inside a function, to let the compiler work its magic! If I directly run

N = 10_000
@time begin for K in 1:N
    sum(cos.(2 * pi * (0:N) / N) .* exp.(-2im * K * pi / N * (0:N))) / N
end end

I get

  1.971349 seconds (178.98 k allocations: 1.533 GiB, 3.04% gc time)

which has many more allocations!

Fastest I can get for now:

using LinearAlgebra
function f_dots_and_prealloc_cispi_dot(N)
    inters = Vector{ComplexF64}(undef, N+1)
    cos_only = cos.(2 .* pi .* (0:N) ./ N)
    for K in 1:N
        inters .= cispi.(-2 .* K .* pi ./ N .* (0:N))
        cos_only ⋅ inters / N
    end
end

gives for N = 10,000:

  0.753205 seconds (6 allocations: 256.125 KiB)

Matlab will compile your block to avoid allocating too much and will use threading for the mapping, so in Julia you should start Julia with julia -t auto and run

julia> using ThreadsX
julia> function f_dots_and_prealloc_cispi(N)
           inters = Vector{ComplexF64}(undef, N+1)
           for K in 1:N
               ThreadsX.map!(n->cos(2 * pi * n/N) * exp(-1im * 2 * K * pi/N * n),inters,0:N)
               sum(inters)
           end
       end
f_dots_and_prealloc_cispi (generic function with 1 method)

julia> @time f_dots_and_prealloc_cispi(N)
 37.008138 seconds (97.44 M allocations: 10.244 GiB, 5.82% gc time, 1.52% compilation time)

you will need to add ThreadsX.jl (or any other threading package handling map!). Julia does not use threading by default (apart from BLAS operations).
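
If you prefer to avoid a dependency, here is a rough sketch of the same idea with Base's built-in Threads.@threads (still start Julia with -t auto; f_threads is just an illustrative name):

function f_threads(N)
    inters = Vector{ComplexF64}(undef, N + 1)
    for K in 1:N
        # fill the intermediate vector in parallel across the available threads
        Threads.@threads for n in 0:N
            inters[n + 1] = cos(2 * pi * n / N) * exp(-1im * 2 * K * pi / N * n)
        end
        sum(inters) / N
    end
end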

2 Likes

Thanks very much for your suggestions! I see that I have a lot to check and to learn.

I will study all the tricks proposed.

My first tests with your code confirm the same values. Really good ones!

1 Like

Just to systematize a bit:

Global variables are bad, in particular if they are not given a concrete type. If globals are needed, they should be typed or declared const, e.g. N::Int = 10000 or const N = 10000.

Code should be in functions. It should be possible to deduce the type of every expression in a function from the types of the actual arguments. Don’t do things like x = ifelse(something, 0, 1.0), because 0 is an Int and 1.0 is a Float64; unless something is a compile-time constant, the compiler won’t know whether x will be an Int or a Float64.

Loops are not bad. They are often better than “vectorized” code.

Allocations can be bad. In particular in threaded code.

Forgetting a dot somewhere in a broadcast expression can make it very slow; instead, you can use @. at the start of your expression and remove the other dots (see the sketch below).
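
As an illustration of the @. point, the loop body could look like this (a sketch, equivalent to JADekker's f_dots above; f_atdot is just an illustrative name):

function f_atdot(N)
    xs = 0:N
    for K in 1:N
        # @. inserts a dot on every call and operator in the expression
        sum(@. cos(2 * pi * xs / N) * exp(-1im * 2 * K * pi / N * xs)) / N
    end
end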

MATLAB also has cospi, so it’s fair to make that change there as well, though I’m not sure whether it will make a performance impact. As for cis, there’s good indication that MATLAB optimizes complex exponentials (not the same as exp with a general complex input), being on par with a tabulated implementation. It outpaces the equivalent cos(Q)+1i*sin(Q) by around the expected gains of Julia’s cis(x) over cos(x)+im*sin(x), which is in turn faster than exp(::Complex). That benchmark is as far as they could go; I imagine it’s harder to dig further into MATLAB’s implementation of exp, or possibly the JIT. MATLAB isn’t too clear on how its type system treats complex numbers, so a possible explanation for that benchmark is that (unlike in Julia) wholly imaginary inputs are somehow distinguished from other complex numbers, which allows exp in particular to use a cheaper cis.

exp secretly doing cis probably wouldn’t explain all or even most of the 2.6x difference observed, I’d hazard a guess that multithreading plays a role. In general, base MATLAB is optimized pretty well, I wouldn’t worry about losing performance there; in fact, more direct control over performance can also mean requiring more effort, which I think Julia demonstrated here. You may have to resort to MEX (write C/C++ or try to compile it from MATLAB) to optimize your MATLAB code in more cases, but you may not run into those cases in practice, especially if you’re mostly sticking to highly optimized base functions like this.

josujugo’s dots for Julia are actually nearly as good as it gets, and dotting everything with @. isn’t a general performance boost. The undotted operations among scalars and ranges are fast, and the dotted operations are actually special-cased to do the same thing when order of operations allow for it. JADekker demonstrated that adding those dots alone doesn’t reduce allocations for a significant performance boost; the modest gain is likely due to optimization of the fused broadcast kernel. In some cases, dotting everything may unnecessarily put more operations into the broadcast loop that the compiler can’t hoist back out, or add extra dimensions to a loop that makes it more expensive than carefully separated loops.
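
A quick way to see that special-casing (a sketch; the concrete range types may differ between Julia versions):

julia> 2 * pi * (0:10) / 10 isa AbstractRange   # undotted scalar-range arithmetic stays a lazy range
true

julia> (0:10) .* 2 isa AbstractRange            # the simple dotted form is special-cased back to a range
true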

2 Likes

we probably should add a branch for exp(::Complex) that checks for zero real part and calls cis for you. It’s probably fairly common for people to not use cis and exp(::Complex) is expensive enough that it’s probably worth the branch.

exp(::Complex) has a branch checking for a zero imaginary part to make a similar optimization to cis, funnily enough. Not sure if another check for a zero real part will get the (50%-ish for me) gains to approach cis though; there are many runtime checks in there that cis just doesn’t have to make.

it won’t recover full performance (especially compared to cispi), but saving the call to exp should help a decent amount
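
For concreteness, a rough sketch of such a branch (myexp is a hypothetical stand-in, not Base's actual exp):

# hypothetical helper: exp(0 + iy) == cis(y), so a purely imaginary argument
# can skip the general complex exp path
myexp(z::Complex) = iszero(real(z)) ? cis(imag(z)) : exp(z)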

1 Like

Shouldn’t the pi also be taken out of the cispi input expression? cis is what takes im out.

Also to add to my caution about dotting everything, it’s possible to go even further here and remove all allocations and throw every operation into the kernel for a performance loss (though zero allocations may be a desirable tradeoff sometimes):

julia> function f_noalloc(N)
         for K in 1:N
           # sum(cos.(2 * pi * (0:N)/N) .* exp.(-1im * 2 * K * pi/N * (0:N)))/N
           sum(x -> cos(2*pi*x/N) * exp(-1im * 2 * K * pi/N * x), 0:N)/N
         end
       end
f_noalloc (generic function with 1 method)

julia> function f_noalloc_cispi(N)
         for K in 1:N
           # sum(x -> cospi(2*x/N) * cispi(-2 * K/N * x), 0:N)/N
           sum(x -> cos(2*pi*x/N) * cispi(-2 * K/N * x), 0:N)/N
         end
       end
f_noalloc_cispi (generic function with 1 method)

julia> @time f_noalloc(10_000)
  4.222765 seconds

julia> @time f_noalloc_cispi(10_000)
  2.223600 seconds

julia> # see JADekker's implementations for the following

julia> @time f(10_000)
  3.875294 seconds (30.00 k allocations: 1.492 GiB, 1.64% gc time)

julia> @time f_dots_and_prealloc_cispi(10_000)
  2.085591 seconds (6 allocations: 234.686 KiB)

julia> @time f_dots_and_prealloc_cispi_dot(10_000)
  2.024006 seconds (6 allocations: 234.686 KiB)

Yeah, I see I forgot to take the pi out indeed.