I moved

carrier_signal = exp(2i * pi * 1.023e6 * range / 4e6 + 1i * 40 * pi / 180);

outside of the loop because it is static, and ran a comparison between Julia 0.6 and MATLAB 2016a.
MATLAB code:
function Untitled()
range = 1:2000000;
steering_vectors = complex(randn(4,11), randn(4,11));
sum_signal = zeros(4,length(range));
carrier_signal = exp(2i * pi * 1.023e6 * range / 4e6 + 1i * 40 * pi / 180);
for i = 1:11
steered_signal = steering_vectors(:,i) * carrier_signal;
sum_signal = sum_signal + steered_signal;
end
Julia code, after fixing all the complaints from the compiler, which resulted in the fully "fused" code:
@inbounds function test_perf3()
range = 1:2000000;
steering_vectors = complex.(randn(4,11), randn(4,11));
sum_signal = complex.(zeros(4,length(range)));
carrier_signal = exp.(2im * pi * 1.023e6 * range / 4e6 + 1im * 40 * pi / 180)';
for i = 1:11
sum_signal .+= steering_vectors[:,i] .* carrier_signal;
end
end
Note that the code is very much the same between MATLAB and Julia.
OK, now for the results on my machine:
MATLAB: 1.35 seconds
Don't do that, since it has been repeatedly mentioned in the thread that this can't be done in the real code, and it's actually where most of the time is spent.
BTW: sincos, as a function that provides sin and cos for the same argument, should AFAIR be available as an operation in the FPU. I don't know which libm exposes this, but I'm pretty sure MATLAB detects the exp(1i * pi * something) pattern and uses it.
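To make that point concrete, here is a minimal Julia sketch (variable names are illustrative) showing that the complex exponential of a purely imaginary argument reduces to one cos and one sin of the real phase, which is exactly what Julia's built-in cis computes:

```julia
# Euler's formula: for real x, exp(im * x) == cos(x) + im * sin(x),
# so the complex exp of a purely imaginary argument needs only one
# cos and one sin evaluation, not a full complex exponential.
x = 40 * pi / 180

via_exp    = exp(im * x)              # complex exp of an imaginary argument
via_sincos = complex(cos(x), sin(x))  # explicit cos/sin decomposition
via_cis    = cis(x)                   # Julia's built-in shortcut

@assert via_exp ≈ via_sincos ≈ via_cis
```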
So it seems that the biggest gain comes from avoiding the exponential of a complex number and from working efficiently with memory.
I conclude this since @zsoerenm, in his post, managed to beat MATLAB using Julia (even his result without Yepp).
Which means, as I guessed at first, this was all about the efficiency of the code and not only MATLAB's multithreading.
Namely, MATLAB is better than Julia when the code is written in vectorized form not only because of multithreading but because it handles that form more efficiently.
@TsurHerman, you are using a very old MATLAB.
MATLAB R2016b (and R2017a, which is around the corner) is much, much more efficient.
Yet if the Julia code is written in the most efficient manner, it beats MATLAB.
Hence the conclusion is the same as in the previous thread.
Priority number 1 is to minimize the gap between the code Julia generates for the vectorized form and what Julia can do in the devectorized form.
There are multiple ways to improve performance, and observing that you can improve the performance in one way in Julia doesn't mean that it was the difference between the original Julia and MATLAB versions.
If the question is "Why is this Julia code considerably slower than Matlab", which is still the title, then the answer is absolutely that MATLAB is doing the most expensive part, which you correctly identified as the exp part, with multiple threads.
All of the methods mentioned in this thread that beat MATLAB with single-threaded Julia code use SIMD, which is not what MATLAB does. It is very easy to get a speedup from SIMD that beats multiple threads, since all the CPU models mentioned in this thread support no more than 4 hardware cores but do support <4 x double>, which basically means a 4x speedup (AFAICT, with the libmvec implementation that GCC uses, the speedup is pretty linear in the vector size). Basically, even though these methods give you a big speedup that is reasonably easy to get in Julia, they are still not the answer to the question.
That said, vectorization (SIMD) of function calls is a hard problem that is being worked on. It requires changes to LLVM so that it can recognize Julia functions to be vectorized.
@yuyichao, you keep saying that, but it is wrong.
I have really looked carefully at all the data spread here across various posts.
For the original code (vectorized form), MATLAB is faster in single-threaded mode (see @zsoerenm's post: he gets ~3.3 [Sec] for single-threaded MATLAB vs. ~3.95 [Sec] for Julia).
Moreover, @DNF, in his post answering my question, wrote that it seems most of the gain is due to calculating sin and cos in an efficient way (in place) instead of computing exp() of a complex argument.
So the facts are easy (we are talking single-thread mode only, both Julia and MATLAB):
MATLAB is faster in single-threaded mode in the vectorized code form (see the original post).
It seems that neither the MATLAB nor the Julia implementations of exp(), cos() and sin() are SIMD accelerated in their current form.
Julia, using the built-in sin() and cos() and avoiding the calculation of exp() for a complex argument, is faster in devectorized form.
The devectorized form mainly handles memory better and avoids intermediate variables.
Wrap up all those facts, and what you get is that MATLAB, in single-threaded mode for the vectorized form, was faster due to:
Better memory management.
Probably decomposing exp(1i * realArg) into complex(cos(realArg), sin(realArg)).
Regarding the first, it seems those issues are fixed in Julia 0.6 (though with the usage of `.`; I wonder if some optimization could be made even without it).
About the second remark, there is no need to change anything in Julia; as @giordano pointed out, it is already there (see cis()).
The user just needs to be aware of it (better than adding a heuristic, in my opinion, so the choice is always correct).
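As a sketch of the devectorized form being discussed (the function name and the use of cis are illustrative, following the constants from the code earlier in the thread), avoiding both the complex exp and the intermediate 4×N steered_signal arrays looks roughly like this:

```julia
# Devectorized sketch: compute the carrier phase per sample with cis
# (one cos and one sin) instead of exp of a complex argument, and
# accumulate directly into sum_signal without intermediate 4xN arrays.
function sum_signal_devec(steering_vectors, N)
    nrows, ncols = size(steering_vectors)
    sum_signal = zeros(Complex{Float64}, nrows, N)
    for k in 1:N
        c = cis(2 * pi * 1.023e6 * k / 4e6 + 40 * pi / 180)
        for i in 1:ncols, j in 1:nrows
            sum_signal[j, k] += steering_vectors[j, i] * c
        end
    end
    return sum_signal
end

steering_vectors = complex.(randn(4, 11), randn(4, 11))
sum_signal = sum_signal_devec(steering_vectors, 2_000_000)
```

This is only a sketch of the memory-handling idea, not the exact code benchmarked in the thread.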
I'll say it's been a pleasure to see one of these now-classic "MATLAB is faster than Julia" threads come to this resolution. Also very good signs for the development of Julia to see the 0.6 code even faster. And @TsurHerman's code for 0.6 is nicely readable and intuitive, while being fast.
I'm not saying those aren't the case. But I am saying that the multithreading is what causes the biggest difference. I also just don't count this small difference, because the multithreaded MATLAB result is actually also faster than what's in the original post by the same factor.
With MATLAB 2016b on the same machine, default options (i.e., using threading), and the original code, I get 1.57 s, so using the latest Julia with the recommended practices does result in faster execution than MATLAB. I could not find any difference between libm and openlibm on my machine.
Is that with -O3?
Would simplifying the calculation of carrier_signal help? I.e., using cis, possibly? Or using the simplified calculation I'd come up with for the imaginary part: muladd(1.606924642311179, k, 0.6981317007977318) (where k is the range value).
It was with the default -O2, but -O3 didn't change anything.
I'd be happy to try a simplified version, but please feed me a line to copy/paste in that case. Also, for fairness, the equivalent simplification should be tested in MATLAB in that case.
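For reference, the muladd form suggested above is just the original phase expression with the constants folded: 2π · 1.023e6 / 4e6 ≈ 1.606924642311179 and 40π / 180 ≈ 0.6981317007977318. A quick sketch (k chosen arbitrarily) to check the equivalence:

```julia
# Constant folding of the carrier phase:
#   2pi * 1.023e6 / 4e6  ≈ 1.606924642311179   (slope)
#   40 * pi / 180        ≈ 0.6981317007977318  (intercept)
k = 12345  # an arbitrary sample index from `range`

phase_original = 2 * pi * 1.023e6 * k / 4e6 + 40 * pi / 180
phase_folded   = muladd(1.606924642311179, k, 0.6981317007977318)

@assert phase_original ≈ phase_folded
@assert cis(phase_original) ≈ cis(phase_folded)
```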