How can this code be faster than MATLAB?

XuJingye2022 · July 5, 2023, 3:44am

using BenchmarkTools

zvec, δvec = randn(10000) * 1e-2, randn(10000) * 1e-3

# Physical constant
const restenergy = 0.511e6
const clight = 3e8

# Computation settings
nturns = 10001

# Machine parameter
struct Par
    v1::Float64
    v2::Float64
    ϕs::Float64
    ϕ2s::Float64
    h1::Int64
    h::Int64
    circum::Float64
    centerenergy::Float64
    αc::Float64
    r::Float64
end
function Par(v1, v2, ϕs, ϕ2s, h1, h, circum, centerenergy, αc)
    Par(v1, v2, ϕs, ϕ2s, h1, h, circum, centerenergy, αc, v2 / v1)
end

function _revolution_cache(p::Par)
    c1 = p.v1 / p.centerenergy
    c2 = sin(p.ϕs) + p.r * sin(p.ϕ2s)
    c3 = 1 / (p.centerenergy / restenergy)^2
    k1 = 2 * p.h1 * pi / p.circum
    k2 = 2 * p.h1 * p.h * pi / p.circum
    (; c1, c2, c3, k1, k2)
end

function revolution_with_cache(z::Number, δ::Number, p::Par, c)
    δ = δ + c.c1 * (sin(p.ϕs - c.k1 * z) + p.r * sin(p.ϕ2s - c.k2 * z) - c.c2)
    η = p.αc - c.c3 / (1 + δ)^2
    z = z - p.circum * η * δ
    z, δ
end

function revolution!(zvec::Vector, δvec::Vector, nturns::Int64, p::Par)
    c = _revolution_cache(p)
    nparticles = length(zvec)
    @inbounds for j = 1:nturns
        @inbounds @simd for i = 1:nparticles
            zvec[i], δvec[i] = revolution_with_cache(zvec[i], δvec[i], p, c)
        end
    end
    zvec, δvec
end

p = Par(3.6395053705870048e+06, 655751.6971214055228756, 2.1575815005097385, 6.0620746573056303, 756, 3, 1360.4, 6e9, 1.56e-5)

@btime revolution!($zvec, $δvec, $nturns, $p);

Hello everyone. I’m trying to modify this code faster than MATLAB.

zvec = randn(10000,1) * 1e-2;
deltavec = randn(10000,1) * 1e-3;

%%

restenergy = 0.511e6;
clight = 3e8;

nturns = 10001;
nparticle = length(zvec);

v1 = 3.6395053705870048e+06;
v2 = 655751.6971214055228756;
r = v2/v1;
phi_s = 2.1575815005097385;
phi_2s = 6.0620746573056303;
h1 = 756;
h = 3;
circum = 1360.4;
centerenergy = 6e9;
alpha_c = 0.0000156;

%%

tic;

for i = 1:nturns
    deltavec = deltavec + v1/centerenergy * (sin(phi_s - 2 * h1 * pi / circum .* zvec) - sin(phi_s) + r * sin(phi_2s - 2 * h1 * h * pi / circum .* zvec) - r * sin(phi_2s));
    etavec = alpha_c - 1./(centerenergy/restenergy .* (1 + deltavec)).^2;
    zvec = zvec - circum .* etavec .* deltavec;
end

toc

On my computer, the Julia code runs for approximately 1.73 seconds, while the MATLAB code runs for 0.63 seconds.

How else can I optimize the code to double or triple its speed?

ufechner7 · July 5, 2023, 3:57am

Please read: Performance Tips · The Julia Language

Great that you are already using BenchmarkTools!

Did you check if your inner function revolution_with_cache is allocating any memory, e.g. using the @time macro?

XuJingye2022 · July 5, 2023, 4:21am

julia> c = _revolution_cache(p);

julia> @time revolution_with_cache(0.0, 0.0, p, c);
  0.004438 seconds (133 allocations: 7.297 KiB, 99.78% compilation time)

julia> @time revolution_with_cache(0.0, 0.0, p, c);
  0.000005 seconds (1 allocation: 32 bytes)

Hello,

0.000005s seems okay. Calling 10000 times takes 0.05s.

ufechner7 · July 5, 2023, 4:27am

But you are calling it 10000*10000 times, or am I wrong?

You have a nested loop…

XuJingye2022 · July 5, 2023, 4:29am

Sorry, my fault

So it takes …500 seconds? I don’t know, I’m a novice

ufechner7 · July 5, 2023, 4:39am

You could pass zvec and δvec and i to this function to avoid the allocation and modify the elements in place…

You could also try to use @inline in front of the function to avoid the call overhead…

XuJingye2022 · July 5, 2023, 5:22am

function revolution!(zvec::Vector, δvec::Vector, nturns::Int64, p::Par)
    c = _revolution_cache(p)
    nparticles = length(zvec)
    @inbounds for j = 1:nturns
        @inbounds @simd for i = 1:nparticles
            δvec[i] += c.c1 * (sin(p.ϕs - c.k1 * zvec[i]) + p.r * sin(p.ϕ2s - c.k2 * zvec[i]) - c.c2)
            zvec[i] -= p.circum * (p.αc - c.c3 / (1 + δvec[i])^2) * δvec[i]
        end
    end
    zvec, δvec
end

I tried this.
This function is absolutely inlined and replace values in place.
But,

julia> @btime revolution!($zvec, $δvec, $nturns, $p);
  1.735 s (0 allocations: 0 bytes)

julia> @btime revolution!($zvec, $δvec, $nturns, $p);
  1.729 s (0 allocations: 0 bytes)

ufechner7 · July 5, 2023, 5:55am

I can reproduce you results. The next question is, is using Matlab multiple threads to calculate the result?

This is the code:

for i = 1:nturns
    deltavec = deltavec + v1/centerenergy * (sin(phi_s - 2 * h1 * pi / circum .* zvec) - sin(phi_s) + r * sin(phi_2s - 2 * h1 * h * pi / circum .* zvec) - r * sin(phi_2s));
    etavec = alpha_c - 1./(centerenergy/restenergy .* (1 + deltavec)).^2;
    zvec = zvec - circum .* etavec .* deltavec;
end

Any Matlab expert around who could answer this question?

ufechner7 · July 5, 2023, 6:07am

Ok, if I start matlab with one thread:

matlab --use-single-comp-thread

your code takes 1.284115s to execute, the Julia version needs 955.048 ms, so Julia is faster.

The question now is, how to modify your Julia code such that it uses multi-threading…

I think I leave the answer to others, must work now…

But have a look at: GitHub - JuliaSIMD/LoopVectorization.jl: Macro(s) for vectorizing loops.

UPDATE:
When using LoopVectorization I get:

# Matlab
# singlethreaded: 1.28s
# multithreaded:  0.39s
# Julia with @tturbo macro:
# singlethreaded: 0.26s
# multithreaded:  0.035s

So on Ryzen 7950x Julia can be 11 times faster than Matlab.

Elrod · July 5, 2023, 6:08am

LoopVectorization.jl will help those sin calls.

julia> @btime revolution2!($zvec, $δvec, $nturns, $p);
  1.994 s (0 allocations: 0 bytes)

julia> @btime revolution3!($zvec, $δvec, $nturns, $p);
  251.768 ms (0 allocations: 0 bytes)

julia> @btime revolution4!($zvec, $δvec, $nturns, $p);
  30.650 ms (0 allocations: 0 bytes)

This was with

julia> function revolution3!(zvec::Vector, δvec::Vector, nturns::Int64, p::Par)
           c = _revolution_cache(p)
           nparticles = length(zvec)
           @inbounds for j = 1:nturns
               @turbo for i = 1:nparticles
                   δvec[i] += c.c1 * (sin(p.ϕs - c.k1 * zvec[i]) + p.r * sin(p.ϕ2s - c.k2 * zvec[i]) - c.c2)
                   zvec[i] -= p.circum * (p.αc - c.c3 / (1 + δvec[i])^2) * δvec[i]
               end
           end
           zvec, δvec
       end
revolution3! (generic function with 1 method)

While revolution2! was @inbounds @simd instead, and revolution4! with @tturbo.
Results will vary, but it should help.

longemen3000 · July 5, 2023, 6:12am

can this help?

@inbounds for j = 1:nturns
    Threads.@threads for i = 1:nparticles
        δvec[i] += c.c1 * (sin(p.ϕs - c.k1 * zvec[i]) + p.r * sin(p.ϕ2s - c.k2 * zvec[i]) - c.c2)
        zvec[i] -= p.circum * (p.αc - c.c3 / (1 + δvec[i])^2) * δvec[i]
    end

Elrod · July 5, 2023, 6:14am

I get

julia> function revolution5!(zvec::Vector, δvec::Vector, nturns::Int64, p::Par)
           c = _revolution_cache(p)
           nparticles = length(zvec)
           @inbounds for j = 1:nturns
               Threads.@threads for i = 1:nparticles
                   @inbounds begin; δvec[i] += c.c1 * (sin(p.ϕs - c.k1 * zvec[i]) + p.r * sin(p.ϕ2s - c.k2 * zvec[i]) - c.c2)
                   zvec[i] -= p.circum * (p.αc - c.c3 / (1 + δvec[i])^2) * δvec[i]; end
               end
           end
           zvec, δvec
       end
revolution5! (generic function with 1 method)

julia> @btime revolution5!($zvec, $δvec, $nturns, $p);
  437.087 ms (1666342 allocations: 185.60 MiB)

julia> Threads.nthreads(), Sys.CPU_THREADS
(28, 28)

Which is an improvement, but on 28 threads, it is still substantially slower than @turbo on a single thread.

XuJingye2022 · July 5, 2023, 6:22am

Thank you all.

Topic		Replies	Views
Why is this Julia code considerably slower than Matlab New to Julia performance	64	8400	March 5, 2017
Julia is significantly slower (~18 x) than Matlab in vector and matrix algebra New to Julia	32	1908	June 25, 2023
Matlab versus Julia General Usage	33	4947	July 15, 2021
Simple speed comparison with Matlab New to Julia question	33	1257	May 6, 2025
MATLAB outperforms Julia (20 times faster) running this nested loop Performance question , performance , matlab , time , finitediff	20	5403	January 10, 2023

How can this code be faster than MATLAB?

Related topics