Why is this Julia code considerably slower than Matlab

Using Yeppp and devectorization, I got it 3x faster than my initial implementation, and almost 5x faster when parallelized:

@benchmark Testing.performance_test()
BenchmarkTools.Trial: 
  memory estimate:  534.06 MiB
  allocs estimate:  69
  --------------
  minimum time:     1.241 s (9.83% GC)
  median time:      1.246 s (9.52% GC)
  mean time:        1.289 s (12.10% GC)
  maximum time:     1.421 s (19.32% GC)
  --------------
  samples:          4
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

@benchmark Testing.performance_test_parallel()
BenchmarkTools.Trial: 
  memory estimate:  198.43 MiB
  allocs estimate:  2075
  --------------
  minimum time:     417.643 ms (26.01% GC)
  median time:      893.489 ms (11.51% GC)
  mean time:        826.470 ms (12.60% GC)
  maximum time:     921.135 ms (11.17% GC)
  --------------
  samples:          7
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

Here is the code:

module Testing

  using Yeppp;

  function performance_test()
    N = 2_000_000;
    range = collect(1:N);

    steering_vectors = complex(ones(4,11), ones(4,11));
    carrier_signal = zeros(Complex{Float64}, N);
    cos_sig, sin_sig = zeros(N), zeros(N);
    sum_signal = zeros(Complex{Float64}, 4, length(range));
    for i = 1:11
        arg = (2π * 1.023e6 / 4e6) .* range .+ (40π/180);
        Yeppp.sin!(sin_sig, arg);
        Yeppp.cos!(cos_sig, arg);
        carrier_signal .= complex.(cos_sig, sin_sig);
        @inbounds for j = 1:4, k = 1:N ; sum_signal[j, k] += steering_vectors[j, i] * carrier_signal[k]; end
    end
    return sum_signal;
  end

  function performance_test_parallel()
    N = 2_000_000;
    range = collect(1:N);

    steering_vectors = complex(ones(4,11), ones(4,11));
    carrier_signal = zeros(Complex{Float64}, N);
    cos_sig, sin_sig = zeros(N), zeros(N);
    sum_signal_temp = zeros(Complex{Float64}, 4, length(range));
    sum_signal = convert(SharedArray, sum_signal_temp);
    @parallel for i = 1:11
        arg = (2π * 1.023e6 / 4e6) .* range .+ (40π/180);
        Yeppp.sin!(sin_sig, arg);
        Yeppp.cos!(cos_sig, arg);
        carrier_signal .= complex.(cos_sig, sin_sig);
        @inbounds for j = 1:4, k = 1:N ; sum_signal[j, k] += steering_vectors[j, i] * carrier_signal[k]; end
    end
    return sum_signal;
  end
end

Yeppp does not seem to be as fast as Apple Accelerate. Without Yeppp, the unparallelized code takes 2.2 s instead of 1.2 s.
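For comparison, a minimal sketch of what the unparallelized version could look like without Yeppp (my own assumption: Base broadcast cos./sin. replacing the Yeppp.cos!/Yeppp.sin! calls, everything else mirroring performance_test above; the function name is hypothetical):

function performance_test_base()   # hypothetical non-Yeppp variant
    N = 2_000_000
    range = collect(1:N)

    steering_vectors = complex.(ones(4, 11), ones(4, 11))
    carrier_signal = zeros(Complex{Float64}, N)
    sum_signal = zeros(Complex{Float64}, 4, N)
    for i = 1:11
        # same phase as above, but sin/cos from Base instead of Yeppp
        arg = (2π * 1.023e6 / 4e6) .* range .+ (40π / 180)
        carrier_signal .= complex.(cos.(arg), sin.(arg))
        @inbounds for j = 1:4, k = 1:N
            sum_signal[j, k] += steering_vectors[j, i] * carrier_signal[k]
        end
    end
    return sum_signal
end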

IIRC you have a Skylake CPU, so Yeppp can be slow due to https://github.com/JuliaMath/Yeppp.jl/issues/32 (unless you somehow managed to compile the latest Yeppp from source).

I took your original MATLAB code and moved

carrier_signal = exp(2i * pi * 1.023e6 * range / 4e6 + 1i * 40 * pi / 180);

outside of the loop, because it is static, and ran a comparison between Julia 0.6 and MATLAB 2016a.

Matlab code:

function Untitled()

range = 1:2000000;

steering_vectors = complex(randn(4,11), randn(4,11));
sum_signal = zeros(4,length(range));
carrier_signal = exp(2i * pi * 1.023e6 * range / 4e6 + 1i * 40 * pi / 180);
for i = 1:11
    steered_signal = steering_vectors(:,i) * carrier_signal;
    sum_signal = sum_signal + steered_signal;
end

Julia code, after fixing all the complaints from the compiler, which resulted in the fully "fused" code:

@inbounds function test_perf3()
    range = 1:2000000;

    steering_vectors = complex.(randn(4,11), randn(4,11));
    sum_signal = complex.(zeros(4,length(range)));
    carrier_signal = exp.(2im * pi * 1.023e6 * range / 4e6 + 1im * 40 * pi / 180)';
    for i = 1:11
        sum_signal .+= steering_vectors[:,i] .* carrier_signal;
    end
end

Note that the code is very much the same between MATLAB and Julia.
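(For reference, a sketch of how such a timing line is typically obtained; the first call compiles the function, the second one is the measurement:)

test_perf3()        # warm-up: the first call includes compilation
@time test_perf3()  # reports runtime, allocations and GC time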

OK, now for the results on my machine:
MATLAB: 1.35 seconds

Julia 0.6 "dot-fused" loops: 0.86 seconds
0.857758 seconds (47 allocations: 213.627 MiB, 12.08% gc time)

Julia 0.5: 5.4 seconds


Don't do that, since it has been repeatedly mentioned in the thread that this can't be done in the real code, and it's actually where most of the time is spent.

Well, regarding the sin/cos performance, it was already discussed in this thread, pointing to better implementations.

I wanted to show the simplicity of converting the code and the improvement from using dot calls.

BTW: sincos, as a function providing sin and cos for the same argument, should AFAIR be available as an operation in the FPU. I don't know which libm exposes it, but I'm pretty sure MATLAB detects exp(i * pi * something) and uses it.

It's a slow x87 instruction, so I don't think anything is using it.

Calling exp(im...) is extremely common in my line of work. An expim(::Real) function or something like that would be quite useful.


You're looking for the cis function.

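For illustration (not from the thread), three equivalent ways of building the carrier signal; cis(x) computes exp(im*x) directly, without going through the general complex exponential:

α = (2π * 1.023e6 / 4e6) .* (1:2_000_000) .+ (40π / 180)

c1 = exp.(im .* α)               # general complex exponential
c2 = complex.(cos.(α), sin.(α))  # explicit real and imaginary parts
c3 = cis.(α)                     # cis(x) == exp(im*x)

c1 ≈ c2 ≈ c3                     # true, up to floating-point rounding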

The difference is marginal. Sometimes I see no difference, and sometimes it saves less than 1 second. That is odd, because the timing of

carrier_signal .= exp.(im .* α)

vs

carrier_signal .= complex.(cos.(α), sin.(α))

indicates that I should be saving 2-3 seconds.
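A head-to-head check of those two lines, as a sketch using BenchmarkTools (α being the real phase vector from the earlier code):

using BenchmarkTools

α = (2π * 1.023e6 / 4e6) .* (1:2_000_000) .+ (40π / 180)
carrier_signal = zeros(Complex{Float64}, length(α))

@btime $carrier_signal .= exp.(im .* $α)                # complex exponential
@btime $carrier_signal .= complex.(cos.($α), sin.($α))  # separate sin/cos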

Indeed I am! Thanks!

I also see that there is an AppleAccelerate.cis! function. Using it drops the single-threaded runtime to 325 ms.


So it seems that the biggest gain comes from avoiding the exponential of a complex number and from working efficiently with memory.

I conclude this since @zsoerenm, in his post, managed to beat MATLAB using Julia (even in his result without Yeppp).

This means that, as I guessed at first, it was all about the efficiency of the code and not only MATLAB's multithreading.

Namely, MATLAB is better than Julia when the code is written in vectorized form not only because of multithreading, but because it executes that form more efficiently.

@TsurHerman, you are using a very old MATLAB.
MATLAB R2016b (and R2017a, which is around the corner) is much, much more efficient.

Yet if the Julia code is written in the most efficient manner, it beats MATLAB.

Hence the conclusion is the same as in the previous thread.
Priority number 1 is to minimize the gap between the code Julia generates for the vectorized form and what Julia can do in devectorized form.

See v0.6 + future PRs for stack-allocated views. This has already been mentioned.

And just to be clear, the logic here is wrong.

There are multiple ways to improve performance, and observing that you can improve the performance in one way in Julia doesn't mean that this was the cause of the difference between the original Julia and MATLAB versions.

If the question is "Why is this Julia code considerably slower than Matlab", which is still the title, then the answer is absolutely that MATLAB is doing the most expensive part, which you correctly identified as the exp part, with multiple threads.

All of the methods mentioned in this thread that beat MATLAB with single-threaded Julia code use SIMD, which is not what MATLAB does. It is very easy to get a speed-up from SIMD that beats multiple threads, since all the CPU models mentioned in this thread support no more than 4 hardware cores but do support <4 x double>, which basically means a 4x speed-up (AFAICT, with the libmvec implementation that GCC uses, the speed-up is pretty linear in vector size). Basically, even though these approaches give you a big speed-up that is reasonably easy to achieve in Julia, they are still not the answer to the question.

That said, vectorization (SIMD) of function calls is a hard problem that is being worked on. It requires changes to LLVM so that it can recognize Julia functions that can be vectorized.
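As a side note, one way to check whether a simple loop actually gets SIMD-vectorized is to look at the generated LLVM IR for vector types such as <4 x double>; a minimal sketch (the helper is hypothetical, not part of the code above):

# Hypothetical helper, only to inspect whether the loop vectorizes.
function axpy!(y, a, x)
    @inbounds @simd for i in eachindex(x, y)
        y[i] += a * x[i]
    end
    return y
end

@code_llvm axpy!(zeros(8), 2.0, ones(8))  # look for <4 x double> (or <2 x double>) in the output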


@yuyichao, you keep saying that, but it is wrong.
I have really looked carefully at all the data spread across the various posts here.

For the original code (vectorized form), MATLAB is faster even with single-threaded code (see @zsoerenm's post: he gets ~3.3 s for single-threaded MATLAB vs. ~3.95 s for Julia).
Moreover, @DNF, in his post answering my question, wrote that most of the gain seems to come from calculating sin and cos efficiently (in place) instead of computing exp() of a complex argument.

So the facts are simple (we are talking single-threaded mode only, both Julia and MATLAB):

  1. MATLAB is faster single-threaded for the vectorized code form (see the original post).
  2. It seems that neither the MATLAB nor the Julia implementations of exp(), cos() and sin() are SIMD-accelerated in their current form.
  3. Julia, using the built-in sin() and cos() and avoiding the calculation of exp() for a complex argument, is faster in devectorized form.
  4. The devectorized form mainly handles memory better and avoids intermediate variables.

Putting all those facts together, what you get is that MATLAB, single-threaded on the vectorized form, was faster due to:

  1. Better memory management.
  2. Probably decomposing exp(1i * realArg) into complex(cos(realArg), sin(realArg)).

Regarding the first point, it seems those issues are fixed in Julia 0.6 (though with the use of the . syntax; I wonder if some of the optimization could be done even without it).

About the second remark, there is no need to change anything in Julia: as @giordano pointed out, it is already there (see cis()).
The user just needs to be aware of it (better, in my opinion, than adding a heuristic, so the choice is made correctly).


I'll say it's been a pleasure to see one of these now-classic "MATLAB is faster than Julia" threads come to this resolution. It's also a very good sign for the development of Julia to see the 0.6 code even faster. And @TsurHerman's code for 0.6 is nicely readable and intuitive while being fast :slight_smile:


I'm not saying those aren't the case. But I am saying that the multithreading is what causes the biggest difference. I also just don't count this small difference, because the multithreaded MATLAB result is actually also faster than what's in the original post by the same factor.


I couldn't resist running this too, on 0.6. Modifying the original code to use broadcasting and threading for the main loop:

function performance_test()
    range = 1:2000000

    steering_vectors = complex(randn(4,11), randn(4,11)*2)

    sum_signal = zeros(Complex{Float64}, 4, length(range))
    steered_signal = zeros(Complex{Float64}, 4, length(range))
    carrier_signal = zeros(Complex{Float64}, length(range))
    Threads.@threads for i = 1:11
        carrier_signal .= exp.(2im * pi * 1.023e6 * range / 4e6 + 1im * 40 * pi / 180)
        steered_signal .= steering_vectors[:,i] .* carrier_signal.'
        sum_signal .+= steered_signal
    end
    return sum_signal
end

Julia timings:

  0.999100 seconds (6.83 k allocations: 275.003 MiB, 1.72% gc time)
  1.080039 seconds (131 allocations: 274.674 MiB, 8.55% gc time)

With MATLAB 2016b on the same machine, default options (i.e. using threading) and the original code, I get 1.57 s, so using the latest Julia with the recommended practices does result in faster execution than MATLAB. I could not find any difference between libm and openlibm on my machine.
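For anyone reproducing the threaded run: Threads.@threads uses however many threads Julia was started with, set via the JULIA_NUM_THREADS environment variable before launching Julia. A quick sketch of the check and the timing, using the performance_test function above:

Threads.nthreads()        # number of threads available to Threads.@threads
performance_test()        # first call compiles the function
@time performance_test()  # second call gives steady-state numbers like those above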


Is that with -O3?
Would simplifying the calculation of carrier_signal help (i.e. using cis, possibly? Or using the simplified calculation I'd come up with for the imaginary part: muladd(1.606924642311179, k, 0.6981317007977318), where k is the range value)?
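For copy/paste, a hedged sketch of what that simplified line might look like (2π * 1.023e6 / 4e6 ≈ 1.606924642311179 and 40π / 180 ≈ 0.6981317007977318, the constants quoted above):

range = 1:2_000_000
carrier_signal = cis.(muladd.(1.606924642311179, range, 0.6981317007977318))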

It was with the default -O2, but -O3 didn't change anything.

I'd be happy to try a simplified version, but please feed me a line to copy/paste in that case :wink: Also, for fairness, the equivalent simplification should be tested in MATLAB in that case.