How can this code be faster than MATLAB?

using BenchmarkTools

zvec, δvec = randn(10000) * 1e-2, randn(10000) * 1e-3

# Physical constant
const restenergy = 0.511e6
const clight = 3e8

# Computation settings
nturns = 10001

# Machine parameter
struct Par
    v1::Float64
    v2::Float64
    ϕs::Float64
    ϕ2s::Float64
    h1::Int64
    h::Int64
    circum::Float64
    centerenergy::Float64
    αc::Float64
    r::Float64
end
function Par(v1, v2, ϕs, ϕ2s, h1, h, circum, centerenergy, αc)
    Par(v1, v2, ϕs, ϕ2s, h1, h, circum, centerenergy, αc, v2 / v1)
end

function _revolution_cache(p::Par)
    c1 = p.v1 / p.centerenergy
    c2 = sin(p.ϕs) + p.r * sin(p.ϕ2s)
    c3 = 1 / (p.centerenergy / restenergy)^2
    k1 = 2 * p.h1 * pi / p.circum
    k2 = 2 * p.h1 * p.h * pi / p.circum
    (; c1, c2, c3, k1, k2)
end

function revolution_with_cache(z::Number, δ::Number, p::Par, c)
    δ = δ + c.c1 * (sin(p.ϕs - c.k1 * z) + p.r * sin(p.ϕ2s - c.k2 * z) - c.c2)
    η = p.αc - c.c3 / (1 + δ)^2
    z = z - p.circum * η * δ
    z, δ
end

function revolution!(zvec::Vector, δvec::Vector, nturns::Int64, p::Par)
    c = _revolution_cache(p)
    nparticles = length(zvec)
    @inbounds for j = 1:nturns
        @inbounds @simd for i = 1:nparticles
            zvec[i], δvec[i] = revolution_with_cache(zvec[i], δvec[i], p, c)
        end
    end
    zvec, δvec
end

p = Par(3.6395053705870048e+06, 655751.6971214055228756, 2.1575815005097385, 6.0620746573056303, 756, 3, 1360.4, 6e9, 1.56e-5)

@btime revolution!($zvec, $δvec, $nturns, $p);

Hello everyone. I’m trying to modify this code faster than MATLAB.

zvec = randn(10000,1) * 1e-2;
deltavec = randn(10000,1) * 1e-3;

%%

restenergy = 0.511e6;
clight = 3e8;

nturns = 10001;
nparticle = length(zvec);

v1 = 3.6395053705870048e+06;
v2 = 655751.6971214055228756;
r = v2/v1;
phi_s = 2.1575815005097385;
phi_2s = 6.0620746573056303;
h1 = 756;
h = 3;
circum = 1360.4;
centerenergy = 6e9;
alpha_c = 0.0000156;

%%

tic;

for i = 1:nturns
    deltavec = deltavec + v1/centerenergy * (sin(phi_s - 2 * h1 * pi / circum .* zvec) - sin(phi_s) + r * sin(phi_2s - 2 * h1 * h * pi / circum .* zvec) - r * sin(phi_2s));
    etavec = alpha_c - 1./(centerenergy/restenergy .* (1 + deltavec)).^2;
    zvec = zvec - circum .* etavec .* deltavec;
end

toc

On my computer, the Julia code runs for approximately 1.73 seconds, while the MATLAB code runs for 0.63 seconds.

How else can I optimize the code to double or triple its speed?

Please read: Performance Tips · The Julia Language

Great that you are already using BenchmarkTools!

Did you check if your inner function revolution_with_cache is allocating any memory, e.g. using the @time macro?

julia> c = _revolution_cache(p);

julia> @time revolution_with_cache(0.0, 0.0, p, c);
  0.004438 seconds (133 allocations: 7.297 KiB, 99.78% compilation time)

julia> @time revolution_with_cache(0.0, 0.0, p, c);
  0.000005 seconds (1 allocation: 32 bytes)

Hello,

0.000005s seems okay. Calling 10000 times takes 0.05s.

But you are calling it 10000*10000 times, or am I wrong?

You have a nested loop…

:dizzy_face:Sorry, my fault

So it takes …500 seconds? I don’t know, I’m a novice :dizzy_face:

You could pass zvec and δvec and i to this function to avoid the allocation and modify the elements in place…

You could also try to use @inline in front of the function to avoid the call overhead…

function revolution!(zvec::Vector, δvec::Vector, nturns::Int64, p::Par)
    c = _revolution_cache(p)
    nparticles = length(zvec)
    @inbounds for j = 1:nturns
        @inbounds @simd for i = 1:nparticles
            δvec[i] += c.c1 * (sin(p.ϕs - c.k1 * zvec[i]) + p.r * sin(p.ϕ2s - c.k2 * zvec[i]) - c.c2)
            zvec[i] -= p.circum * (p.αc - c.c3 / (1 + δvec[i])^2) * δvec[i]
        end
    end
    zvec, δvec
end

I tried this.
This function is absolutely inlined and replace values in place.
But,

julia> @btime revolution!($zvec, $δvec, $nturns, $p);
  1.735 s (0 allocations: 0 bytes)

julia> @btime revolution!($zvec, $δvec, $nturns, $p);
  1.729 s (0 allocations: 0 bytes)

I can reproduce you results. The next question is, is using Matlab multiple threads to calculate the result?

This is the code:

for i = 1:nturns
    deltavec = deltavec + v1/centerenergy * (sin(phi_s - 2 * h1 * pi / circum .* zvec) - sin(phi_s) + r * sin(phi_2s - 2 * h1 * h * pi / circum .* zvec) - r * sin(phi_2s));
    etavec = alpha_c - 1./(centerenergy/restenergy .* (1 + deltavec)).^2;
    zvec = zvec - circum .* etavec .* deltavec;
end

Any Matlab expert around who could answer this question?

1 Like

Ok, if I start matlab with one thread:

matlab --use-single-comp-thread

your code takes 1.284115s to execute, the Julia version needs 955.048 ms, so Julia is faster.

The question now is, how to modify your Julia code such that it uses multi-threading…

I think I leave the answer to others, must work now…

But have a look at: GitHub - JuliaSIMD/LoopVectorization.jl: Macro(s) for vectorizing loops.

UPDATE:
When using LoopVectorization I get:

# Matlab
# singlethreaded: 1.28s
# multithreaded:  0.39s
# Julia with @tturbo macro:
# singlethreaded: 0.26s
# multithreaded:  0.035s

So on Ryzen 7950x Julia can be 11 times faster than Matlab.

2 Likes

LoopVectorization.jl will help those sin calls.

julia> @btime revolution2!($zvec, $δvec, $nturns, $p);
  1.994 s (0 allocations: 0 bytes)

julia> @btime revolution3!($zvec, $δvec, $nturns, $p);
  251.768 ms (0 allocations: 0 bytes)

julia> @btime revolution4!($zvec, $δvec, $nturns, $p);
  30.650 ms (0 allocations: 0 bytes)

This was with

julia> function revolution3!(zvec::Vector, δvec::Vector, nturns::Int64, p::Par)
           c = _revolution_cache(p)
           nparticles = length(zvec)
           @inbounds for j = 1:nturns
               @turbo for i = 1:nparticles
                   δvec[i] += c.c1 * (sin(p.ϕs - c.k1 * zvec[i]) + p.r * sin(p.ϕ2s - c.k2 * zvec[i]) - c.c2)
                   zvec[i] -= p.circum * (p.αc - c.c3 / (1 + δvec[i])^2) * δvec[i]
               end
           end
           zvec, δvec
       end
revolution3! (generic function with 1 method)

While revolution2! was @inbounds @simd instead, and revolution4! with @tturbo.
Results will vary, but it should help.

4 Likes

can this help?

@inbounds for j = 1:nturns
    Threads.@threads for i = 1:nparticles
        δvec[i] += c.c1 * (sin(p.ϕs - c.k1 * zvec[i]) + p.r * sin(p.ϕ2s - c.k2 * zvec[i]) - c.c2)
        zvec[i] -= p.circum * (p.αc - c.c3 / (1 + δvec[i])^2) * δvec[i]
    end
1 Like

I get

julia> function revolution5!(zvec::Vector, δvec::Vector, nturns::Int64, p::Par)
           c = _revolution_cache(p)
           nparticles = length(zvec)
           @inbounds for j = 1:nturns
               Threads.@threads for i = 1:nparticles
                   @inbounds begin; δvec[i] += c.c1 * (sin(p.ϕs - c.k1 * zvec[i]) + p.r * sin(p.ϕ2s - c.k2 * zvec[i]) - c.c2)
                   zvec[i] -= p.circum * (p.αc - c.c3 / (1 + δvec[i])^2) * δvec[i]; end
               end
           end
           zvec, δvec
       end
revolution5! (generic function with 1 method)

julia> @btime revolution5!($zvec, $δvec, $nturns, $p);
  437.087 ms (1666342 allocations: 185.60 MiB)

julia> Threads.nthreads(), Sys.CPU_THREADS
(28, 28)

Which is an improvement, but on 28 threads, it is still substantially slower than @turbo on a single thread.

4 Likes

Thank you all.

1 Like