I am new to Julia and I want to check the differences with Matlab.
I have written a simple Matlab example:
N=100000
tic
for K=[1:N]
    sum(cos(2 * pi * [0:N]/N) .* exp(-j * 2 * K * pi/N * [0:N]))/N
end
toc
I have translated the example to Julia in this way:
N=100000
@time begin
    for K in 1:N
        sum(cos.(2 * pi * (0:N)/N) .* exp.(-1im * 2 * K * pi/N * (0:N)))/N
    end
end
I have obtained better results with Matlab (80 sec. vs 210 sec.).
I would like to know whether the problem is my translation (which is very direct) or whether Matlab is simply faster at this kind of operation on my machine (Apple ARM M2 Max).
I am using the latest (stable) versions of Julia and Matlab.
This is a known issue: Matlab does a better job with vectorized elementary functions than Julia. That said, you could make this a decent amount faster with sum(cospi.(2/N * (0:N)) .* cispi.(-2 * K / N * (0:N)))/N. cis stands for cos + im*sin and is a faster way to compute exp(im*x), and Julia provides cospi and the associated trig functions to make the argument reduction cheaper and easier. I also think this code is missing a couple of dots, which would speed things up.
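For concreteness, here is that suggestion wrapped in a function so it can be timed like the other variants below (the name f_cospi_cispi is just illustrative):

function f_cospi_cispi(N)
    for K in 1:N
        # cospi(x) == cos(pi*x) and cispi(x) == cis(pi*x) == exp(im*pi*x),
        # so the factors of pi are folded into the function names
        sum(cospi.(2 / N * (0:N)) .* cispi.(-2 * K / N * (0:N))) / N
    end
end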
@josujugo: To elaborate on what Oscar is saying: you can certainly make some small modifications that, on my device (a Mac with an M1), led to roughly a doubling of the speed. The first is to avoid unnecessary allocations by pre-allocating the array of intermediate results, as in the variants below.
My fastest version comes in at 84 seconds, just like your Matlab version. You can even multithread this for an attractive speed-up!
using Pkg; Pkg.activate("."); Pkg.instantiate()
using BenchmarkTools
function f(N)
    for K in 1:N
        sum(cos.(2 * pi * (0:N) / N) .* exp.(-2im * K * pi / N * (0:N))) / N
    end
end

function f_dots(N)
    for K in 1:N
        sum(cos.(2 .* pi .* (0:N) ./ N) .* exp.(-2im .* K .* pi ./ N .* (0:N))) / N
    end
end

function f_dots_and_prealloc(N)
    inters = Vector{ComplexF64}(undef, N+1)
    for K in 1:N
        inters .= cos.(2 .* pi .* (0:N) ./ N) .* exp.(-2im .* K .* pi ./ N .* (0:N))
        sum(inters) / N
    end
end

function f_dots_and_prealloc_2(N)
    inters = Vector{ComplexF64}(undef, N+1)
    cos_only = cos.(2 .* pi .* (0:N) ./ N)
    for K in 1:N
        inters .= cos_only .* exp.(-2im .* K .* pi ./ N .* (0:N))
        sum(inters) / N
    end
end

function f_dots_and_prealloc_cispi(N)
    inters = Vector{ComplexF64}(undef, N+1)
    cos_only = cos.(2 .* pi .* (0:N) ./ N)
    for K in 1:N
        inters .= cos_only .* cispi.(-2 .* K .* pi ./ N .* (0:N))
        sum(inters) / N
    end
end

function test(N)
    @time f(N)
    @time f_dots(N)
    @time f_dots_and_prealloc(N)
    @time f_dots_and_prealloc_2(N)
    @time f_dots_and_prealloc_cispi(N)
end

test(10_000)
test(10_000)
Running test(100_000) gave me (not plugged in to power!) a runtime of 84 seconds for f_dots_and_prealloc_cispi, which I would expect to be even faster on your device!
Also note that if you're trying to time things, it is always good to wrap them inside a function, to let the compiler work its magic! If I directly run
N = 10_000
@time begin
    for K in 1:N
        sum(cos.(2 * pi * (0:N) / N) .* exp.(-2im * K * pi / N * (0:N))) / N
    end
end
I get
1.971349 seconds (178.98 k allocations: 1.533 GiB, 3.04% gc time)
using LinearAlgebra
function f_dots_and_prealloc_cispi_dot(N)
    inters = Vector{ComplexF64}(undef, N+1)
    cos_only = cos.(2 .* pi .* (0:N) ./ N)
    for K in 1:N
        inters .= cispi.(-2 .* K .* pi ./ N .* (0:N))
        cos_only ⋅ inters / N
    end
end
Matlab will compile your block to avoid allocating too much and will use threading for the mapping, so in Julia you should start Julia with julia -t auto and run
julia> using ThreadsX
julia> function f_dots_and_prealloc_cispi(N)
           inters = Vector{ComplexF64}(undef, N+1)
           for K in 1:N
               ThreadsX.map!(n -> cos(2 * pi * n/N) * exp(-1im * 2 * K * pi/N * n), inters, 0:N)
               sum(inters)
           end
       end
f_dots_and_prealloc_cispi (generic function with 1 method)
julia> @time f_dots_and_prealloc_cispi(N)
37.008138 seconds (97.44 M allocations: 10.244 GiB, 5.82% gc time, 1.52% compilation time)
You will need to add ThreadsX.jl (or any other threading package handling map!). Julia does not use threading by default (besides BLAS operations).
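If you would rather avoid a dependency, here is a rough sketch of the same idea using only Base's Threads.@threads (again assuming Julia was started with -t auto; the function name is just illustrative):

using Base.Threads

function f_threads_base(N)
    inters = Vector{ComplexF64}(undef, N+1)
    for K in 1:N
        # fill the preallocated buffer in parallel, then reduce serially
        @threads for n in 0:N
            inters[n+1] = cos(2 * pi * n / N) * exp(-1im * 2 * K * pi / N * n)
        end
        sum(inters)
    end
end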
Global variables are bad, in particular if they are not annotated with a concrete type. If globals are needed, they should be typed or declared const, e.g. N::Int = 10000 or const N = 10000.
Code should be in functions. The type of every expression in a function should be deducible from the types of the actual arguments. Don't do things like x = ifelse(something, 0, 1.0), because 0 is an Int and 1.0 is a Float64; unless something is a compile-time constant, the compiler won't know whether x will be an Int or a Float64 (see the sketch after this list).
Loops are not bad. They are often better than "vectorized" code.
Allocations can be bad. In particular in threaded code.
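A minimal sketch of the type-stability point above (the function names are just illustrative):

# Return type is Union{Int64, Float64}: callers cannot infer a concrete type
unstable(flag) = ifelse(flag, 0, 1.0)

# Both branches are Float64, so the result is always inferable
stable(flag) = ifelse(flag, 0.0, 1.0)

# Likewise for globals: annotate or declare const so the compiler knows the type
const N_FIXED = 10_000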
MATLAB also has cospi, so it's fair to make that change there as well, though I'm not sure whether it will make a performance impact. As for cis, there's good indication that MATLAB optimizes complex exponentials (not the same as exp with complex input), being on par with a tabulated implementation. It outpaces the equivalent cos(Q)+1i*sin(Q) by around the expected gains of Julia's cis(x) over cos(x)+im*sin(x), which is in turn faster than exp(::Complex). That benchmark is as far as they could go; I imagine it's harder to dig further into MATLAB's implementation of exp or possibly the JIT. MATLAB isn't too clear on how its type system treats complex numbers, so a possible explanation for that benchmark is that (unlike Julia) wholly imaginary inputs are somehow distinguished from other complex numbers, allowing exp in particular to use a cheaper cis.
exp secretly doing cis probably wouldn't explain all or even most of the 2.6x difference observed; I'd hazard a guess that multithreading plays a role. In general, base MATLAB is optimized pretty well, so I wouldn't worry about losing performance there; in fact, more direct control over performance can also mean requiring more effort, which I think Julia demonstrated here. You may have to resort to MEX (write C/C++ or try to compile it from MATLAB) to optimize your MATLAB code in more cases, but you may not run into those cases in practice, especially if you're mostly sticking to highly optimized base functions like this.
josujugo's dots for Julia are actually nearly as good as it gets, and dotting everything with @. isn't a general performance boost. The undotted operations among scalars and ranges are fast, and the dotted operations are actually special-cased to do the same thing when the order of operations allows for it. JADekker demonstrated that adding those dots alone doesn't reduce allocations for a significant performance boost; the modest gain is likely due to optimization of the fused broadcast kernel. In some cases, dotting everything may unnecessarily put more operations into the broadcast loop that the compiler can't hoist back out, or add extra dimensions to a loop that make it more expensive than carefully separated loops.
we probably should add a branch for exp(::Complex) that checks for a zero real part and calls cis for you. It's probably fairly common for people to not use cis, and exp(::Complex) is expensive enough that it's probably worth the branch.
exp(::Complex) has a branch checking for a zero imaginary part to make a similar optimization as cis, funnily enough. Not sure if another check for a zero real part will get the (50%-ish for me) gains to approach cis though; there are many runtime checks in there that cis just doesn't have to make.
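For concreteness, a hypothetical sketch of the proposed branch (not what Base actually does):

# Illustrative only: dispatch to the cheaper cis when the real part is exactly zero
function exp_maybe_cis(z::Complex)
    iszero(real(z)) ? cis(imag(z)) : exp(z)
end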
Shouldn't the pi also be taken out of the cispi input expression? cis is what takes im out.
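That is, with the stray pi removed (and, while we're at it, cospi for the real factor too), the preallocating variant would read something like:

function f_prealloc_cispi_fixed(N)   # illustrative corrected version
    inters = Vector{ComplexF64}(undef, N+1)
    cos_only = cospi.(2 .* (0:N) ./ N)                        # cospi(x) == cos(pi*x)
    for K in 1:N
        inters .= cos_only .* cispi.(-2 .* K ./ N .* (0:N))   # cispi(x) == cis(pi*x)
        sum(inters) / N
    end
end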
Also to add to my caution about dotting everything: it's possible to go even further here, remove all allocations, and throw every operation into the kernel, for a performance loss (though zero allocations may be a desirable tradeoff sometimes):
julia> function f_noalloc(N)
           for K in 1:N
               # sum(cos.(2 * pi * (0:N)/N) .* exp.(-1im * 2 * K * pi/N * (0:N)))/N
               sum(x -> cos(2*pi*x/N) * exp(-1im * 2 * K * pi/N * x), 0:N)/N
           end
       end
f_noalloc (generic function with 1 method)
julia> function f_noalloc_cispi(N)
           for K in 1:N
               # sum(x -> cospi(2*x/N) * cispi(-2 * K/N * x), 0:N)/N
               sum(x -> cos(2*pi*x/N) * cispi(-2 * K/N * x), 0:N)/N
           end
       end
f_noalloc_cispi (generic function with 1 method)
julia> @time f_noalloc(10_000)
4.222765 seconds
julia> @time f_noalloc_cispi(10_000)
2.223600 seconds
julia> # see JADekker's implementations for the following
julia> @time f(10_000)
3.875294 seconds (30.00 k allocations: 1.492 GiB, 1.64% gc time)
julia> @time f_dots_and_prealloc_cispi(10_000)
2.085591 seconds (6 allocations: 234.686 KiB)
julia> @time f_dots_and_prealloc_cispi_dot(10_000)
2.024006 seconds (6 allocations: 234.686 KiB)
Indeed, for @josujugo: this is a general performance principle in "fast" languages like Julia. Since you don't have to vectorize operations for speed, it's actually wasteful to use vectorization when doing reductions (sum is a reduction, it "reduces" an array to a single number).
For reductions you should generally avoid broadcasting and just
get one input number
feed it through your mapping (sin/exp etc)
add it to your sum
This is basically what Benny's code is doing. There are some extra subtleties surrounding SIMD and numerical inaccuracies, but the above is the general principle. Just avoid intermediate arrays completely. Even if you pre-allocate, as @JADekker's code does, that means you must iterate twice over the array.
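As a minimal illustration of the difference with a simpler mapping:

x = rand(1000)
sum(abs2.(x))   # broadcasts first: allocates a temporary length-1000 array, then reduces it
sum(abs2, x)    # maps and reduces in one pass, with no temporary array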
And I'll just nitpick your Matlab code for good measure:
You shouldn't write [0:N]; just write 0:N or (0:N). I know it works, because of how Matlab treats vectors, but it's not good style, and I'm pretty sure that the Matlab linter will put some squiggly lines under that code telling you not to do it.
It looks like you are making a range and then putting it into an array, and even though it ends up correct it's confusing to the reader. If you do this in Julia (which you didn't), things will not work correctly.
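For comparison, in Julia the brackets do not expand the range, they wrap it:

julia> [0:3]          # a 1-element Vector holding the range itself
1-element Vector{UnitRange{Int64}}:
 0:3

julia> collect(0:3)   # the actual elements
4-element Vector{Int64}:
 0
 1
 2
 3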