# How can this code be faster than MATLAB?

``````using BenchmarkTools

zvec, δvec = randn(10000) * 1e-2, randn(10000) * 1e-3

# Physical constant
const restenergy = 0.511e6
const clight = 3e8

# Computation settings
nturns = 10001

# Machine parameter
struct Par
v1::Float64
v2::Float64
ϕs::Float64
ϕ2s::Float64
h1::Int64
h::Int64
circum::Float64
centerenergy::Float64
αc::Float64
r::Float64
end
function Par(v1, v2, ϕs, ϕ2s, h1, h, circum, centerenergy, αc)
Par(v1, v2, ϕs, ϕ2s, h1, h, circum, centerenergy, αc, v2 / v1)
end

function _revolution_cache(p::Par)
c1 = p.v1 / p.centerenergy
c2 = sin(p.ϕs) + p.r * sin(p.ϕ2s)
c3 = 1 / (p.centerenergy / restenergy)^2
k1 = 2 * p.h1 * pi / p.circum
k2 = 2 * p.h1 * p.h * pi / p.circum
(; c1, c2, c3, k1, k2)
end

function revolution_with_cache(z::Number, δ::Number, p::Par, c)
δ = δ + c.c1 * (sin(p.ϕs - c.k1 * z) + p.r * sin(p.ϕ2s - c.k2 * z) - c.c2)
η = p.αc - c.c3 / (1 + δ)^2
z = z - p.circum * η * δ
z, δ
end

function revolution!(zvec::Vector, δvec::Vector, nturns::Int64, p::Par)
c = _revolution_cache(p)
nparticles = length(zvec)
@inbounds for j = 1:nturns
@inbounds @simd for i = 1:nparticles
zvec[i], δvec[i] = revolution_with_cache(zvec[i], δvec[i], p, c)
end
end
zvec, δvec
end

p = Par(3.6395053705870048e+06, 655751.6971214055228756, 2.1575815005097385, 6.0620746573056303, 756, 3, 1360.4, 6e9, 1.56e-5)

@btime revolution!(\$zvec, \$δvec, \$nturns, \$p);
``````

Hello everyone. I’m trying to modify this code faster than MATLAB.

``````zvec = randn(10000,1) * 1e-2;
deltavec = randn(10000,1) * 1e-3;

%%

restenergy = 0.511e6;
clight = 3e8;

nturns = 10001;
nparticle = length(zvec);

v1 = 3.6395053705870048e+06;
v2 = 655751.6971214055228756;
r = v2/v1;
phi_s = 2.1575815005097385;
phi_2s = 6.0620746573056303;
h1 = 756;
h = 3;
circum = 1360.4;
centerenergy = 6e9;
alpha_c = 0.0000156;

%%

tic;

for i = 1:nturns
deltavec = deltavec + v1/centerenergy * (sin(phi_s - 2 * h1 * pi / circum .* zvec) - sin(phi_s) + r * sin(phi_2s - 2 * h1 * h * pi / circum .* zvec) - r * sin(phi_2s));
etavec = alpha_c - 1./(centerenergy/restenergy .* (1 + deltavec)).^2;
zvec = zvec - circum .* etavec .* deltavec;
end

toc
``````

On my computer, the Julia code runs for approximately 1.73 seconds, while the MATLAB code runs for 0.63 seconds.

How else can I optimize the code to double or triple its speed?

Great that you are already using BenchmarkTools!

Did you check if your inner function `revolution_with_cache` is allocating any memory, e.g. using the @time macro?

``````julia> c = _revolution_cache(p);

julia> @time revolution_with_cache(0.0, 0.0, p, c);
0.004438 seconds (133 allocations: 7.297 KiB, 99.78% compilation time)

julia> @time revolution_with_cache(0.0, 0.0, p, c);
0.000005 seconds (1 allocation: 32 bytes)
``````

Hello,

`0.000005`s seems okay. Calling 10000 times takes `0.05`s.

But you are calling it 10000*10000 times, or am I wrong?

You have a nested loop…

Sorry, my fault

So it takes …`500` seconds? I don’t know, I’m a novice

You could pass `zvec` and `δvec` and i to this function to avoid the allocation and modify the elements in place…

You could also try to use `@inline` in front of the function to avoid the call overhead…

``````function revolution!(zvec::Vector, δvec::Vector, nturns::Int64, p::Par)
c = _revolution_cache(p)
nparticles = length(zvec)
@inbounds for j = 1:nturns
@inbounds @simd for i = 1:nparticles
δvec[i] += c.c1 * (sin(p.ϕs - c.k1 * zvec[i]) + p.r * sin(p.ϕ2s - c.k2 * zvec[i]) - c.c2)
zvec[i] -= p.circum * (p.αc - c.c3 / (1 + δvec[i])^2) * δvec[i]
end
end
zvec, δvec
end
``````

I tried this.
This function is absolutely inlined and replace values in place.
But,

``````julia> @btime revolution!(\$zvec, \$δvec, \$nturns, \$p);
1.735 s (0 allocations: 0 bytes)

julia> @btime revolution!(\$zvec, \$δvec, \$nturns, \$p);
1.729 s (0 allocations: 0 bytes)
``````

I can reproduce you results. The next question is, is using Matlab multiple threads to calculate the result?

This is the code:

``````for i = 1:nturns
deltavec = deltavec + v1/centerenergy * (sin(phi_s - 2 * h1 * pi / circum .* zvec) - sin(phi_s) + r * sin(phi_2s - 2 * h1 * h * pi / circum .* zvec) - r * sin(phi_2s));
etavec = alpha_c - 1./(centerenergy/restenergy .* (1 + deltavec)).^2;
zvec = zvec - circum .* etavec .* deltavec;
end
``````

Any Matlab expert around who could answer this question?

1 Like

Ok, if I start matlab with one thread:

``````matlab --use-single-comp-thread
``````

your code takes 1.284115s to execute, the Julia version needs 955.048 ms, so Julia is faster.

The question now is, how to modify your Julia code such that it uses multi-threading…

I think I leave the answer to others, must work now…

But have a look at: GitHub - JuliaSIMD/LoopVectorization.jl: Macro(s) for vectorizing loops.

UPDATE:
When using LoopVectorization I get:

``````# Matlab
# Julia with @tturbo macro:
``````

So on Ryzen 7950x Julia can be 11 times faster than Matlab.

2 Likes

LoopVectorization.jl will help those `sin` calls.

``````julia> @btime revolution2!(\$zvec, \$δvec, \$nturns, \$p);
1.994 s (0 allocations: 0 bytes)

julia> @btime revolution3!(\$zvec, \$δvec, \$nturns, \$p);
251.768 ms (0 allocations: 0 bytes)

julia> @btime revolution4!(\$zvec, \$δvec, \$nturns, \$p);
30.650 ms (0 allocations: 0 bytes)
``````

This was with

``````julia> function revolution3!(zvec::Vector, δvec::Vector, nturns::Int64, p::Par)
c = _revolution_cache(p)
nparticles = length(zvec)
@inbounds for j = 1:nturns
@turbo for i = 1:nparticles
δvec[i] += c.c1 * (sin(p.ϕs - c.k1 * zvec[i]) + p.r * sin(p.ϕ2s - c.k2 * zvec[i]) - c.c2)
zvec[i] -= p.circum * (p.αc - c.c3 / (1 + δvec[i])^2) * δvec[i]
end
end
zvec, δvec
end
revolution3! (generic function with 1 method)
``````

While `revolution2!` was `@inbounds @simd` instead, and `revolution4!` with `@tturbo`.
Results will vary, but it should help.

4 Likes

can this help?

``````@inbounds for j = 1:nturns
δvec[i] += c.c1 * (sin(p.ϕs - c.k1 * zvec[i]) + p.r * sin(p.ϕ2s - c.k2 * zvec[i]) - c.c2)
zvec[i] -= p.circum * (p.αc - c.c3 / (1 + δvec[i])^2) * δvec[i]
end
``````
1 Like

I get

``````julia> function revolution5!(zvec::Vector, δvec::Vector, nturns::Int64, p::Par)
c = _revolution_cache(p)
nparticles = length(zvec)
@inbounds for j = 1:nturns
@inbounds begin; δvec[i] += c.c1 * (sin(p.ϕs - c.k1 * zvec[i]) + p.r * sin(p.ϕ2s - c.k2 * zvec[i]) - c.c2)
zvec[i] -= p.circum * (p.αc - c.c3 / (1 + δvec[i])^2) * δvec[i]; end
end
end
zvec, δvec
end
revolution5! (generic function with 1 method)

julia> @btime revolution5!(\$zvec, \$δvec, \$nturns, \$p);
437.087 ms (1666342 allocations: 185.60 MiB)

Which is an improvement, but on 28 threads, it is still substantially slower than `@turbo` on a single thread.